WO2023009758A1 - Quality score calibration of basecalling systems - Google Patents

Quality score calibration of basecalling systems

Info

Publication number
WO2023009758A1
Authority
WO
WIPO (PCT)
Prior art keywords
base
range
sensor data
value
intensity
Prior art date
Application number
PCT/US2022/038729
Other languages
English (en)
French (fr)
Inventor
Rohan PAUL
Dorna KASHEFHAGHIGHI
John S. Vieceli
Andrew Dodge Heiberg
Original Assignee
Illumina, Inc.
Illumina Software, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/839,387 external-priority patent/US20230029970A1/en
Application filed by Illumina, Inc., Illumina Software, Inc. filed Critical Illumina, Inc.
Priority to AU2022319125A priority Critical patent/AU2022319125A1/en
Priority to CN202280043793.8A priority patent/CN117529780A/zh
Priority to JP2023579782A priority patent/JP2024532049A/ja
Priority to EP22761681.0A priority patent/EP4377960A1/en
Priority to KR1020237043770A priority patent/KR20240037882A/ko
Priority to IL309786A priority patent/IL309786A/en
Priority to CA3223746A priority patent/CA3223746A1/en
Publication of WO2023009758A1 publication Critical patent/WO2023009758A1/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • the technology disclosed relates to using deep neural networks such as deep convolutional neural networks for analyzing data.
  • CNNs Deep Convolution Neural Networks
  • GPU Graphics Processing Unit
  • FPGA Field Programmable Gate Array
  • As convolution contributes to most operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator.
  • Convolution involves multiply and accumulate (MAC) operations with four levels of loops that slide along kernel and feature maps.
  • the first loop level computes the MAC of pixels within a kernel window.
  • the second loop level accumulates the sum of products of the MAC across different input feature maps. After finishing the first and second loop levels, a final output element in the output feature map is obtained by adding the bias.
  • the third loop level slides the kernel window within an input feature map.
  • the fourth loop level generates different output feature maps.
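The four loop levels described above can be sketched in plain Python. This is a minimal illustration, not the disclosed accelerator design: the function name `conv2d_naive`, the nested-list data layout, unit stride, and the absence of padding are all assumptions made for the example. As in most CNN frameworks, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_naive(inputs, kernels, biases):
    """Naive convolution showing the four loop levels.

    Assumed (hypothetical) layout:
      inputs:  [C_in][H][W]
      kernels: [C_out][C_in][K][K]
      biases:  [C_out]
    Assumes stride 1 and no padding.
    """
    c_in = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    c_out = len(kernels)
    k = len(kernels[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    outputs = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c_out)]

    for co in range(c_out):              # loop level 4: each output feature map
        for y in range(out_h):           # loop level 3: slide the kernel window
            for x in range(out_w):
                acc = 0.0
                for ci in range(c_in):   # loop level 2: accumulate across input feature maps
                    for ky in range(k):  # loop level 1: MAC of pixels within the kernel window
                        for kx in range(k):
                            acc += inputs[ci][y + ky][x + kx] * kernels[co][ci][ky][kx]
                # After loop levels 1 and 2, adding the bias yields one output element.
                outputs[co][y][x] = acc + biases[co]
    return outputs
```

The loop ordering here simply mirrors the description. A hardware accelerator would typically reorder, tile, or unroll these loops to match its dataflow.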
  • FPGAs have gained increasing interest and popularity in particular to accelerate inference tasks, due to their (1) high degree of reconfigurability, (2) faster development time compared to
  • ASICs Application Specific Integrated Circuits
  • the high performance and efficiency of an FPGA can be realized by synthesizing a circuit that is customized for a specific computation to directly process billions of operations with the customized memory systems. For instance, hundreds to thousands of digital signal processing (DSP) blocks on modern FPGAs support the core convolution operation, e.g., multiplication and addition, with high parallelism.
  • DSP digital signal processing
  • Dedicated data buffers between external off-chip memory and on-chip processing engines (PEs) can be designed to realize the preferred dataflow by configuring tens of Mbyte on-chip block random access memories (BRAM) on the FPGA chip.
  • BRAM block random access memories
  • Efficient dataflow and hardware architecture of CNN acceleration are desired to minimize data communication while maximizing resource utilization to achieve high performance.
  • Deep neural networks have great promise for bioinformatics research because of their broad applicability and enhanced prediction power.
  • Convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference.
  • Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions.
  • Neural networks can capture long-range dependencies in sequential data of varying lengths, such as protein or DNA sequences. Therefore, an opportunity arises to use a principled deep learning-based framework for base calling.
  • there is a need for increasing the quality and quantity of nucleic acid sequencing data that can be obtained rapidly and cost-effectively for a wide variety of uses, including for genomics (e.g., for genome characterization of any and all animal, plant, microbial or other biological species or populations), pharmacogenomics, transcriptomics, diagnostics, prognostics, biomedical risk assessment, clinical and research genetics, personalized medicine, drug efficacy and drug interactions assessments, veterinary medicine, agriculture, evolutionary and biodiversity studies, aquaculture, forestry, oceanography, ecological and environmental management, and other purposes.
  • quality scores indicate, on a logarithmic scale, the probabilities of a base being called an adenine (A), thymine (T), guanine (G), or cytosine (C).
  • A adenine
  • T thymine
  • G guanine
  • C cytosine
  • a quality score Q(A) for a base provides an indication of a probability of the base being an A,
  • a quality score Q(C) for the base provides an indication of a probability of the base being a C, and so on.
  • the quality scores are used to make critical decisions, such as critical health care decisions.
  • quality scores associated with detecting bases of a human tissue sample may affect an approach to treat a health condition.
  • it is desirable that the quality scores generated for base calling are relatively accurate and dependable.
  • the quality scores generated for base calling are better aligned to empirically determined quality scores (which are representative of true quality scores).
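The logarithmic relationship between quality scores and probabilities can be illustrated with a short sketch. The text here does not state the formula, so this example assumes the widely used Phred convention, Q = -10 · log10(P_error), where P_error is the probability that a call is wrong. The function names are hypothetical, and the empirical score is computed under the assumption that it is derived from the observed error rate among a group of calls.

```python
import math


def phred_from_error_prob(p_error):
    """Quality score for a given error probability (assumed Phred convention)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q):
    """Inverse mapping: error probability implied by a quality score."""
    return 10.0 ** (-q / 10.0)


def empirical_quality(num_errors, num_calls):
    """Empirical quality score from an observed error rate.

    Assumption: num_errors miscalled bases were observed among num_calls
    base calls that share the same predicted quality score.
    """
    if num_calls <= 0 or not 0 < num_errors <= num_calls:
        raise ValueError("need 0 < num_errors <= num_calls")
    return phred_from_error_prob(num_errors / num_calls)
```

Under this convention, a predicted score of 30 corresponds to an error probability of about 1 in 1,000. If calls labeled with that score show an empirical score well below 30, the predicted scores are overconfident, which is the kind of misalignment calibration aims to reduce.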
  • Fig. 1 illustrates a cross-section of a biosensor that can be used in various embodiments.
  • Fig. 2 depicts one implementation of a flow cell that contains clusters in its tiles.
  • Fig. 3 illustrates an example flow cell with eight lanes, and also illustrates a zoom-in on one tile and its clusters and their surrounding background.
  • Fig. 4 is a simplified block diagram of the system for analysis of sensor data from a sequencing system, such as base call sensor outputs.
  • Fig. 5 is a simplified diagram showing aspects of the base calling operation, including functions of a runtime program executed by a host processor.
  • Fig. 6 is a simplified diagram of a configuration of a configurable processor such as that of Fig. 4.
  • Fig. 7 is a diagram of a neural network architecture which can be executed using a configurable or a reconfigurable array configured as described herein.
  • Fig. 8A is a simplified illustration of an organization of tiles of sensor data used by a neural network architecture like that of Fig. 7.
  • Fig. 8B is a simplified illustration of patches of tiles of sensor data used by a neural network architecture like that of Fig. 7.
  • Fig. 9 illustrates part of a configuration for a neural network like that of Fig. 7 on a configurable or a reconfigurable array, such as a Field Programmable Gate Array (FPGA).
  • Fig. 10 is a diagram of another alternative neural network architecture which can be executed using a configurable or a reconfigurable array configured as described herein.
  • Fig. 11 illustrates one implementation of a specialized architecture of the neural network-based base caller that is used to segregate processing of data for different sequencing cycles.
  • Fig. 12 depicts one implementation of segregated layers, each of which can include convolutions.
  • Fig. 13A depicts one implementation of combinatory layers, each of which can include convolutions.
  • Fig. 13B depicts another implementation of the combinatory layers, each of which can include convolutions.
  • Fig. 14A illustrates a base calling system generating quality scores corresponding to A, C, T, and G for various bases to be called.
  • Fig. 14B illustrates a table indicating a relationship between probability scores, quality scores, corresponding error probabilities, and corresponding error rates.
  • Fig. 14C illustrates a comparison operation between predicted quality scores predicted by the base calling system of Fig. 14A and true (e.g., empirically calculated) quality scores.
  • Fig. 14D illustrates determination of true (e.g., empirically determined) quality scores of Fig. 14C.
  • Fig. 15A illustrates a graph depicting a comparison between predicted quality scores and true quality scores.
  • Fig. 15B illustrates another graph depicting another comparison between predicted quality scores and true quality scores.
  • Fig. 16 illustrates another graph depicting a comparison between predicted quality scores and true quality scores.
  • Fig. 17A illustrates a base calling system including a normalization module for normalizing sensor data that are received by a base caller.
  • Fig. 17B illustrates two graphs depicting a normalization operation on sensor data performed by the normalization module of the base calling system of Fig. 17A.
  • Fig. 17C illustrates a graph depicting a comparison between predicted quality scores and true quality scores, wherein the sensor data have been normalized by the normalization module of the base calling system of Fig. 17A while generating data for the graph of Fig. 17C.
  • Fig. 17D illustrates a plot indicating expected calibration error (ECE) for a base calling system having input normalization versus another base calling system lacking such an input normalization.
  • ECE expected calibration error
  • Fig. 17E illustrates a color comparison between sensor data prior to normalization and normalized sensor data.
  • Fig. 17F illustrates a flowchart depicting an example method for normalizing sensor data, and using normalized sensor data for base calling operations.
  • Fig. 18A illustrates a base calling system including a quality score remapping module for selectively remapping quality scores predicted by the base caller of the base calling system.
  • Figs. 18B1, 18B2, 18B3, 18B4, and 18B5, in combination, illustrate examples of quality score remapping and quantization.
  • Figs. 18C1 and 18C2 illustrate two further examples of quality score remapping and quantization.
  • Fig. 19 illustrates a table depicting, for some specific base sequences, deviations between (i) an average of quality scores of the specific base sequences and (ii) an average of remapped quality scores of the specific base sequences, where the remapping is performed in accordance with a general Look Up Table (LUT) of, for example, Fig. 18B2.
  • LUT Look Up Table
  • Fig. 20A illustrates a LUT that is usable to remap predicted quality scores of a specific base sequence to remapped quality scores.
  • Fig. 20B illustrates remapping of predicted quality scores for a specific base sequence using the LUT of Fig. 20A.
  • Fig. 21 illustrates a base calling system that includes a loss penalization module to selectively penalize loss for one or more specific base sequences.
  • Figs. 22A-22E, in combination, illustrate penalization of a loss function (e.g., by the loss penalization module 2106), in response to a detection of a specific base sequence.
  • Fig. 22F illustrates application of a specialized weight to loss associated with a middle base of a specific base sequence.
  • Fig. 22G illustrates two graphs comparing performance of a base calling system that does not penalize loss, versus a base calling system that penalizes loss for a specific base sequence.
  • Fig. 23 illustrates a base calling system that includes (i) the normalization module of the base calling system of Fig. 17A, (ii) the quality score remapping module and the quality score quantization module of the base calling system of Fig. 18A, and (iii) the loss penalization module of the base calling system of Fig. 21.
  • Fig. 24 is a block diagram of a base calling system in accordance with one implementation.
  • Fig. 25 is a block diagram of a system controller that can be used in the system of Fig. 24.
  • Fig. 26 is a simplified block diagram of a computer system that can be used to implement the technology disclosed.
  • “polynucleotide” or “nucleic acids” refer to deoxyribonucleic acid (DNA), but where appropriate the skilled artisan will recognize that the systems and devices herein can also be utilized with ribonucleic acid (RNA).
  • RNA ribonucleic acid
  • the terms should be understood to include, as equivalents, analogs of either DNA or RNA made from nucleotide analogs.
  • the terms as used herein also encompass cDNA, that is complementary, or copy, DNA produced from an RNA template, for example by the action of reverse transcriptase.
  • the single stranded polynucleotide molecules sequenced by the systems and devices herein can have originated in single-stranded form, as DNA or RNA, or have originated in double-stranded DNA (dsDNA) form (e.g., genomic DNA fragments, PCR and amplification products and the like).
  • dsDNA double-stranded DNA
  • a single stranded polynucleotide may be the sense or antisense strand of a polynucleotide duplex.
  • Methods of preparation of single stranded polynucleotide molecules suitable for use in the method of the disclosure using standard techniques are well known in the art.
  • the precise sequence of the primary polynucleotide molecules is generally not material to the disclosure, and may be known or unknown.
  • the single stranded polynucleotide molecules can represent genomic DNA molecules (e.g., human genomic DNA) including both intron and exon sequences (coding sequences), as well as non-coding regulatory sequence.
  • the nucleic acid to be sequenced through use of the current disclosure is immobilized upon a substrate (e.g., a substrate within a flowcell or one or more beads upon a substrate such as a flowcell, etc.).
  • “immobilized” as used herein is intended to encompass direct or indirect, covalent or non-covalent attachment, unless indicated otherwise, either explicitly or by context.
  • covalent attachment may be preferred, but generally all that is required is that the molecules (e.g., nucleic acids) remain immobilized or attached to the support under conditions in which it is intended to use the support, for example in applications requiring nucleic acid sequencing.
  • “solid support” refers to any inert substrate or matrix to which nucleic acids can be attached, such as, for example, glass surfaces, plastic surfaces, latex, dextran, polystyrene surfaces, polypropylene surfaces, polyacrylamide gels, gold surfaces, and silicon wafers.
  • the solid support is a glass surface (e.g., the planar surface of a flowcell channel).
  • the solid support may comprise an inert substrate or matrix which has been “functionalized,” for example by the application of a layer or coating of an intermediate material comprising reactive groups which permit covalent attachment to molecules such as polynucleotides.
  • such supports can include polyacrylamide hydrogels supported on an inert substrate such as glass.
  • the molecules can be directly covalently attached to the intermediate material (e.g., the hydrogel) but the intermediate material can itself be non-covalently attached to the substrate or matrix (e.g., the glass substrate). Covalent attachment to a solid support is to be interpreted accordingly as encompassing this type of arrangement.
  • the present disclosure comprises novel systems and devices for sequencing nucleic acids.
  • references herein to a particular nucleic acid sequence may, depending on the context, also refer to nucleic acid molecules which comprise such nucleic acid sequence.
  • Sequencing of a target fragment means that a read of the chronological order of bases is established.
  • the bases that are read do not need to be contiguous, although this is preferred, nor does every base on the entire fragment have to be sequenced during the sequencing.
  • Sequencing can be carried out using any suitable sequencing technique, wherein nucleotides or oligonucleotides are added successively to a free 3' hydroxyl group, resulting in synthesis of a polynucleotide chain in the 5' to 3' direction.
  • the nature of the nucleotide added is preferably determined after each nucleotide addition.
  • Sequencing techniques using sequencing by ligation wherein not every contiguous base is sequenced, and techniques such as massively parallel signature sequencing (MPSS) where bases are removed from, rather than added to, the strands on the surface are also amenable to use with the systems and devices of the disclosure.
  • MPSS massively parallel signature sequencing
  • the current disclosure discloses sequencing-by-synthesis (SBS).
  • SBS sequencing-by-synthesis
  • four fluorescently labeled modified nucleotides are used to sequence dense clusters of amplified DNA (possibly millions of clusters) present on the surface of a substrate (e.g., a flowcell).
  • SBS procedures and methods which can be utilized with the systems and devices herein, are disclosed in, for example, WO04018497, WO04018493 and U.S. Pat. No.
  • the flowcells containing the nucleic acid samples for sequencing are placed within the appropriate flowcell holder.
  • the samples for sequencing can take the form of single molecules, amplified single molecules in the form of clusters, or beads comprising molecules of nucleic acid.
  • the nucleic acids are prepared such that they comprise an oligonucleotide primer adjacent to an unknown target sequence.
  • To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides, and DNA polymerase, etc., are flowed into/through the flowcell by the fluid flow subsystem (various embodiments of which are described herein).
  • Either a single nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to possess a reversible termination property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of all four labeled nucleotides (A, C, T, G).
  • the polymerase is able to select the correct base to incorporate and each sequence is extended by a single base.
  • the natural competition between all four alternatives leads to higher accuracy than when only one nucleotide is present in the reaction mixture (where most of the sequences are therefore not exposed to the correct nucleotide).
  • Sequences where a particular base is repeated one after another e.g., homopolymers
  • the fluid flow subsystem also flows the appropriate reagents to remove the blocked 3' terminus (if appropriate) and the fluorophore from each incorporated base.
  • the substrate can be exposed either to a second round of the four blocked nucleotides, or optionally to a second round with a different individual nucleotide. Such cycles are then repeated, and the sequence of each cluster is read over the multiple chemistry cycles.
  • the computer aspect of the current disclosure can optionally align the sequence data gathered from each single molecule, cluster or bead to determine the sequence of longer polymers, etc. Alternatively, the image processing and alignment can be performed on a separate computer.
  • the heating/cooling components of the system regulate the reaction conditions within the flowcell channels and reagent storage areas/containers (and optionally the camera, optics, and/or other components), while the fluid flow components allow the substrate surface to be exposed to suitable reagents for incorporation (e.g., the appropriate fluorescently labeled nucleotides to be incorporated) while unincorporated reagents are rinsed away.
  • An optional movable stage upon which the flowcell is placed allows the flowcell to be brought into proper orientation for laser (or other light) excitation of the substrate and optionally moved in relation to a lens objective to allow reading of different areas of the substrate.
  • other components of the system are also optionally movable/adjustable (e.g., the camera, the lens objective, the heater/cooler, etc.).
  • the image/location of emitted fluorescence from the nucleic acids on the substrate is captured by the camera component, thereby recording the identity, in the computer component, of the first base for each single molecule, cluster or bead.
  • Embodiments described herein may be used in various biological or chemical processes and systems for academic or commercial analysis. More specifically, embodiments described herein may be used in various processes and systems where it is desired to detect an event, property, quality, or characteristic that is indicative of a desired reaction.
  • embodiments described herein include cartridges, biosensors, and their components as well as bioassay systems that operate with cartridges and biosensors.
  • the cartridges and biosensors include a flow cell and one or more sensors, pixels, light detectors, or photodiodes that are coupled together in a substantially unitary structure.
  • a “desired reaction” includes a change in at least one of a chemical, electrical, physical, or optical property (or quality) of an analyte-of-interest.
  • the desired reaction is a positive binding event (e.g., incorporation of a fluorescently labeled biomolecule with the analyte-of-interest).
  • the desired reaction may be a chemical transformation, chemical change, or chemical interaction.
  • the desired reaction may also be a change in electrical properties.
  • the desired reaction may be a change in ion concentration within a solution.
  • Exemplary reactions include, but are not limited to, chemical reactions such as reduction, oxidation, addition, elimination, rearrangement, esterification, amidation, etherification, cyclization, or substitution; binding interactions in which a first chemical binds to a second chemical; dissociation reactions in which two or more chemicals detach from each other; fluorescence; luminescence; bioluminescence; chemiluminescence; and biological reactions, such as nucleic acid replication, nucleic acid amplification, nucleic acid hybridization, nucleic acid ligation, phosphorylation, enzymatic catalysis, receptor binding, or ligand binding.
  • the desired reaction can also be an addition or elimination of a proton, for example, detectable as a change in pH of a surrounding solution or environment.
  • An additional desired reaction can be detecting the flow of ions across a membrane (e.g., natural or synthetic bilayer membrane); for example, as ions flow through a membrane, the current is disrupted and the disruption can be detected.
  • the desired reaction includes the incorporation of a fluorescently-labeled molecule to an analyte.
  • the analyte may be an oligonucleotide and the fluorescently-labeled molecule may be a nucleotide.
  • the desired reaction may be detected when an excitation light is directed toward the oligonucleotide having the labeled nucleotide, and the fluorophore emits a detectable fluorescent signal.
  • the detected fluorescence is a result of chemiluminescence or bioluminescence.
  • a desired reaction may also increase fluorescence (or Förster) resonance energy transfer (FRET), for example, by bringing a donor fluorophore in proximity to an acceptor fluorophore, decrease FRET by separating donor and acceptor fluorophores, increase fluorescence by separating a quencher from a fluorophore, or decrease fluorescence by co-locating a quencher and fluorophore.
  • FRET fluorescence resonance energy transfer
  • “reaction component” or “reactant” includes any substance that may be used to obtain a desired reaction.
  • reaction components include reagents, enzymes, samples, other biomolecules, and buffer solutions.
  • the reaction components are typically delivered to a reaction site in a solution and/or immobilized at a reaction site.
  • the reaction components may interact directly or indirectly with another substance, such as the analyte-of-interest.
  • a “reaction site” is a localized region where a desired reaction may occur.
  • a reaction site may include support surfaces of a substrate where a substance may be immobilized thereon.
  • a reaction site may include a substantially planar surface in a channel of a flow cell that has a colony of nucleic acids thereon.
  • the nucleic acids in the colony have the same sequence, being for example, clonal copies of a single stranded or double stranded template.
  • a reaction site may contain only a single nucleic acid molecule, for example, in a single stranded or double stranded form.
  • reaction sites may be unevenly distributed along the support surface or arranged in a predetermined manner (e.g., side-by-side in a matrix, such as in microarrays).
  • a reaction site can also include a reaction chamber (or well) that at least partially defines a spatial region or volume configured to compartmentalize the desired reaction.
  • the terms “reaction chamber” and “well” are used interchangeably.
  • “reaction chamber” or “well” includes a spatial region that is in fluid communication with a flow channel.
  • the reaction chamber may be at least partially separated from the surrounding environment or other spatial regions. For example, a plurality of reaction chambers may be separated from each other by shared walls.
  • the reaction chamber may include a cavity defined by interior surfaces of a well and have an opening or aperture so that the cavity may be in fluid communication with a flow channel.
  • Biosensors including such reaction chambers are described in greater detail in international application no. PCT/US2011/057111, filed on October 20, 2011, which is incorporated herein by reference in its entirety.
  • the reaction chambers are sized and shaped relative to solids (including semi-solids) so that the solids may be inserted, fully or partially, therein.
  • the reaction chamber may be sized and shaped to accommodate only one capture bead.
  • the capture bead may have clonally amplified DNA or other substances thereon.
  • the reaction chamber may be sized and shaped to receive an approximate number of beads or solid substrates.
  • the reaction chambers may also be filled with a porous gel or substance that is configured to control diffusion or filter fluids that may flow into the reaction chamber.
  • sensors (e.g., light detectors, photodiodes)
  • a pixel area is a geometrical construct that represents an area on the biosensor’s sample surface for one sensor (or pixel).
  • a sensor that is associated with a pixel area detects light emissions gathered from the associated pixel area when a desired reaction has occurred at a reaction site or a reaction chamber overlying the associated pixel area.
  • a “biosensor” includes a structure having a plurality of reaction sites and/or reaction chambers (or wells).
  • a biosensor may include a solid-state imaging device (e.g., CCD or CMOS imager) and, optionally, a flow cell mounted thereto.
  • the flow cell may include at least one flow channel that is in fluid communication with the reaction sites and/or the reaction chambers.
  • the biosensor is configured to fluidically and electrically couple to a bioassay system.
  • the bioassay system may deliver reactants to the reaction sites and/or the reaction chambers according to a predetermined protocol (e.g., sequencing-by-synthesis) and perform a plurality of imaging events.
  • the bioassay system may direct solutions to flow along the reaction sites and/or the reaction chambers. At least one of the solutions may include four types of nucleotides having the same or different fluorescent labels.
  • the nucleotides may bind to corresponding oligonucleotides located at the reaction sites and/or the reaction chambers.
  • the bioassay system may then illuminate the reaction sites and/or the reaction chambers using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes or LEDs).
  • the excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths.
  • the excited fluorescent labels provide emission signals that may be captured by the sensors.
  • the biosensor may include electrodes or other types of sensors configured to detect other identifiable properties.
  • the sensors may be configured to detect a change in ion concentration.
  • the sensors may be configured to detect the ion current flow across a membrane.
  • a “cluster” is a colony of similar or identical molecules or nucleotide sequences or DNA strands.
  • a cluster can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence.
  • a cluster can be any element or group of elements that occupy a physical area on a sample surface.
  • clusters are immobilized to a reaction site and/or a reaction chamber during a base calling cycle.
  • the term “immobilized,” when used with respect to a biomolecule or biological or chemical substance, includes substantially attaching the biomolecule or biological or chemical substance at a molecular level to a surface.
  • a biomolecule or biological or chemical substance may be immobilized to a surface of the substrate material using adsorption techniques including non-covalent interactions (e.g., electrostatic forces, van der Waals, and dehydration of hydrophobic interfaces) and covalent binding techniques where functional groups or linkers facilitate attaching the biomolecules to the surface.
  • Immobilizing biomolecules or biological or chemical substances to a surface of a substrate material may be based upon the properties of the substrate surface, the liquid medium carrying the biomolecule or biological or chemical substance, and the properties of the biomolecules or biological or chemical substances themselves.
  • a substrate surface may be functionalized (e.g., chemically or physically modified) to facilitate immobilizing the biomolecules (or biological or chemical substances) to the substrate surface.
  • the substrate surface may be first modified to have functional groups bound to the surface. The functional groups may then bind to biomolecules or biological or chemical substances to immobilize them thereon.
  • a substance can be immobilized to a surface via a gel, for example, as described in US Patent Publ. No. US 2011/0059865 A1, which is incorporated herein by reference.
  • nucleic acids can be attached to a surface and amplified using bridge amplification.
  • Useful bridge amplification methods are described, for example, in U.S. Patent No. 5,641,658; WO 2007/010251; U.S. Pat. No. 6,090,592; U.S. Patent Publ. No. 2002/0055100 A1; U.S. Patent No. 7,115,400; U.S. Patent Publ. No. 2004/0096853 A1; U.S. Patent Publ. No. 2004/0002090 A1; U.S. Patent Publ. No. 2007/0128624 A1; and U.S. Patent Publ. No. 2008/0009420 A1, each of which is incorporated herein in its entirety.
  • the nucleic acids can be attached to a surface and amplified using one or more primer pairs.
  • one of the primers can be in solution and the other primer can be immobilized on the surface (e.g., 5'-attached).
  • a nucleic acid molecule can hybridize to one of the primers on the surface followed by extension of the immobilized primer to produce a first copy of the nucleic acid.
  • the primer in solution then hybridizes to the first copy of the nucleic acid which can be extended using the first copy of the nucleic acid as a template.
  • the original nucleic acid molecule can hybridize to a second immobilized primer on the surface and can be extended at the same time or after the primer in solution is extended.
  • repeated rounds of extension (e.g., amplification) using the immobilized primer and primer in solution provide multiple copies of the nucleic acid.
  • the assay protocols executed by the systems and methods described herein include the use of natural nucleotides and also enzymes that are configured to interact with the natural nucleotides.
  • Natural nucleotides include, for example, ribonucleotides (RNA) or deoxyribonucleotides (DNA).
  • Natural nucleotides can be in the mono-, di-, or tri-phosphate form and can have a base selected from adenine (A), thymine (T), uracil (U), guanine (G) or cytosine (C). It will be understood, however, that non-natural nucleotides, modified nucleotides or analogs of the aforementioned nucleotides can be used.
  • Some examples of useful non-natural nucleotides are set forth below in regard to reversible terminator-based sequencing by synthesis methods.
  • items or solid substances may be disposed within the reaction chambers.
  • the item or solid may be physically held or immobilized within the reaction chamber through an interference fit, adhesion, or entrapment.
  • Exemplary items or solids that may be disposed within the reaction chambers include polymer beads, pellets, agarose gel, powders, quantum dots, or other solids that may be compressed and/or held within the reaction chamber.
  • a nucleic acid superstructure such as a DNA ball, can be disposed in or at a reaction chamber, for example, by attachment to an interior surface of the reaction chamber or by residence in a liquid within the reaction chamber.
  • a DNA ball or other nucleic acid superstructure can be preformed and then disposed in or at the reaction chamber.
  • a DNA ball can be synthesized in situ at the reaction chamber.
  • a DNA ball can be synthesized by rolling circle amplification to produce a concatemer of a particular nucleic acid sequence and the concatemer can be treated with conditions that form a relatively compact ball.
  • DNA balls and methods for their synthesis are described, for example, in U.S. Patent Publication Nos. 2008/0242560 A1 or 2008/0234136 A1, each of which is incorporated herein in its entirety.
  • a substance that is held or disposed in a reaction chamber can be in a solid, liquid, or gaseous state.
  • base calling identifies a nucleotide base in a nucleic acid sequence.
  • Base calling refers to the process of determining a base call (A, C, G, T) for every cluster at a specific cycle.
  • base calling can be performed utilizing four-channel, two-channel or one-channel methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
  • a base calling cycle is referred to as a “sampling event.”
  • a sampling event comprises two illumination stages in time sequence, such that a pixel signal is generated at each stage. The first illumination stage induces illumination from a given cluster indicating nucleotide bases A and T in an AT pixel signal, and the second illumination stage induces illumination from a given cluster indicating nucleotide bases C and T in a CT pixel signal.
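The two-stage sampling event described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stage's pixel signal is compared against a single on/off cutoff, and that the one base not named in either stage (G) emits in neither. The function name `decode_base` and the threshold value are hypothetical and are not taken from this disclosure.

```python
def decode_base(at_signal: float, ct_signal: float, threshold: float = 0.5) -> str:
    """Map one cluster's two pixel signals to a base call.

    at_signal: intensity from the first illumination stage (bases A and T emit).
    ct_signal: intensity from the second illumination stage (bases C and T emit).
    threshold: hypothetical on/off cutoff, chosen only for this illustration.
    """
    at_on = at_signal >= threshold
    ct_on = ct_signal >= threshold
    if at_on and ct_on:
        return "T"  # emits in both stages
    if at_on:
        return "A"  # emits only in the first (AT) stage
    if ct_on:
        return "C"  # emits only in the second (CT) stage
    return "G"      # assumed: the remaining base emits in neither stage


signals = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.0, 0.1)]
calls = "".join(decode_base(at, ct) for at, ct in signals)
print(calls)
```

A practical base caller would work with noisy, continuous intensities rather than a fixed cutoff; the sketch only shows how two pixel signals are enough to separate four bases.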
  • the technology disclosed can be implemented on processors like Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application-Specific Integrated Circuits (ASICs), Application Specific Instruction-set Processor (ASIP), and Digital Signal Processors (DSPs).
  • Fig. 1 illustrates a cross-section of a biosensor 100 that can be used in various embodiments.
  • Biosensor 100 has pixel areas 106', 108', 110', 112', and 114' that can each hold more than one cluster during a base calling cycle (e.g., 2 clusters per pixel area).
  • the biosensor 100 may include a flow cell 102 that is mounted onto a sampling device 104.
  • the flow cell 102 is affixed directly to the sampling device 104.
  • the flow cell 102 may be removably coupled to the sampling device 104.
  • the sampling device 104 has a sample surface 134 that may be functionalized (e.g., chemically or physically modified in a suitable manner for conducting the desired reactions).
  • the sample surface 134 may be functionalized and may include a plurality of pixel areas 106', 108', 110', 112', and 114' that can each hold more than one cluster during a base calling cycle (e.g., each having a corresponding cluster pair 106A, 106B; 108A, 108B;
  • Each pixel area is associated with a corresponding sensor (or pixel or photodiode) 106, 108, 110, 112, and 114, such that light received by the pixel area is captured by the corresponding sensor.
  • a pixel area 106' can also be associated with a corresponding reaction site 106" on the sample surface 134 that holds a cluster pair, such that light emitted from the reaction site 106" is received by the pixel area 106' and captured by the corresponding sensor 106.
  • the pixel signal in that base calling cycle carries information based on all of the two or more clusters.
  • signal processing as described herein is used to distinguish each cluster, where there are more clusters than pixel signals in a given sampling event of a particular base calling cycle.
  • the flow cell 102 includes sidewalls 138, 125, and a flow cover 136 that is supported by the sidewalls 138, 125.
  • the sidewalls 138, 125 are coupled to the sample surface 134, and extend between the flow cover 136 and the sample surface 134.
  • the sidewalls 138, 125 are formed from a curable adhesive layer that bonds the flow cover 136 to the sampling device 104.
  • the sidewalls 138, 125 are sized and shaped so that a flow channel 144 exists between the flow cover 136 and the sampling device 104.
  • the flow cover 136 may include a material that is transparent to excitation light 101 propagating from an exterior of the biosensor 100 into the flow channel 144. In an example, the excitation light 101 approaches the flow cover 136 at a non-orthogonal angle.
  • the flow cover 136 may include inlet and outlet ports 142, 146 that are configured to fluidically engage other ports (not shown). For example, the other ports may be from the cartridge or the workstation.
  • the flow channel 144 is sized and shaped to direct a fluid along the sample surface 134. A height H1 and other dimensions of the flow channel 144 may be configured to maintain a substantially even flow of a fluid along the sample surface 134. The dimensions of the flow channel 144 may also be configured to control bubble formation.
  • the flow cover 136 may comprise a transparent material, such as glass or plastic.
  • the flow cover 136 may constitute a substantially rectangular block having a planar exterior surface and a planar inner surface that defines the flow channel 144.
  • the block may be mounted onto the sidewalls 138, 125.
  • the flow cell 102 may be etched to define the flow cover 136 and the sidewalls 138, 125.
  • a recess may be etched into the transparent material. When the etched material is mounted to the sampling device 104, the recess may become the flow channel 144.
  • the sampling device 104 may be similar to, for example, an integrated circuit comprising a plurality of stacked substrate layers 120-126.
  • the substrate layers 120-126 may include a base substrate 120, a solid-state imager 122 (e.g., CMOS image sensor), a filter or light-management layer 124, and a passivation layer 126. It should be noted that the above is only illustrative and that other embodiments may include fewer or additional layers. Moreover, each of the substrate layers 120-126 may include a plurality of sub-layers.
  • the sampling device 104 may be manufactured using processes that are similar to those used in manufacturing integrated circuits, such as CMOS image sensors and CCDs. For example, the substrate layers 120-126 or portions thereof may be grown, deposited, etched, and the like to form the sampling device 104.
  • the passivation layer 126 is configured to shield the filter layer 124 from the fluidic environment of the flow channel 144.
  • the passivation layer 126 is also configured to provide a solid surface (i.e., the sample surface 134) that permits biomolecules or other analytes-of-interest to be immobilized thereon.
  • each of the reaction sites may include a cluster of biomolecules that are immobilized to the sample surface 134.
  • the passivation layer 126 may be formed from a material that permits the reaction sites to be immobilized thereto.
  • the passivation layer 126 may also comprise a material that is at least transparent to a desired fluorescent light.
  • the passivation layer 126 may include silicon nitride (Si3N4) and/or silica (SiO2). However, other suitable material(s) may be used. In the illustrated embodiment, the passivation layer 126 may be substantially planar. However, in alternative embodiments, the passivation layer 126 may include recesses, such as pits, wells, grooves, and the like. In the illustrated embodiment, the passivation layer 126 has a thickness that is about 150-200 nm and, more particularly, about 170 nm.
  • the filter layer 124 may include various features that affect the transmission of light.
  • the filter layer 124 can perform multiple functions.
  • the filter layer 124 may be configured to (a) filter unwanted light signals, such as light signals from an excitation light source; (b) direct emission signals from the reaction sites toward corresponding sensors 106, 108, 110, 112, and 114 that are configured to detect the emission signals from the reaction sites; or (c) block or prevent detection of unwanted emission signals from adjacent reaction sites.
  • the filter layer 124 may also be referred to as a light-management layer.
  • the filter layer 124 has a thickness that is about 1-5 µm and, more particularly, about 2-4 µm.
  • the filter layer 124 may include an array of microlenses or other optical components. Each of the microlenses may be configured to direct emission signals from an associated reaction site to a sensor.
  • the solid-state imager 122 and the base substrate 120 may be provided together as a previously constructed solid-state imaging device (e.g., CMOS chip).
  • the base substrate 120 may be a wafer of silicon and the solid-state imager 122 may be mounted thereon.
  • the solid-state imager 122 includes a layer of semiconductor material (e.g., silicon) and the sensors 106, 108,
  • the sensors are photodiodes configured to detect light.
  • the sensors comprise light detectors.
  • the solid-state imager 122 may be manufactured as a single chip through a CMOS-based fabrication process.
  • the solid-state imager 122 may include a dense array of sensors 106, 108, 110, 112, and 114 that are configured to detect activity indicative of a desired reaction from within or along the flow channel 144.
  • each sensor has a pixel area (or detection area) that is about 1-2 square micrometers (µm²).
  • the array can include 500,000 sensors, 5 million sensors, 10 million sensors, or even 120 million sensors.
  • the sensors 106, 108, 110, 112, and 114 can be configured to detect a predetermined wavelength of light that is indicative of the desired reactions.
  • the sampling device 104 includes a microcircuit arrangement, such as the microcircuit arrangement described in U.S. Patent No. 7,595,882, which is incorporated herein by reference in its entirety. More specifically, the sampling device 104 may comprise an integrated circuit having a planar array of the sensors 106, 108, 110, 112, and 114. Circuitry formed within the sampling device 104 may be configured for at least one of signal amplification, digitization, storage, and processing. The circuitry may collect and analyze the detected fluorescent light and generate pixel signals (or detection signals) for communicating detection data to a signal processor. The circuitry may also perform additional analog and/or digital signal processing in the sampling device 104. Sampling device 104 may include conductive vias 130 that perform signal routing (e.g., transmit the pixel signals to the signal processor). The pixel signals may also be transmitted through electrical contacts of the sampling device 104.
  • the sampling device 104 is discussed in further detail in U.S. Nonprovisional Patent Application No. 16/874,599, titled “Systems and Devices for Characterization and Performance Analysis of Pixel-Based Sequencing,” filed May 14, 2020 (Attorney Docket No. ILLM 1011-4/IP-1750-US), which is incorporated by reference as if fully set forth herein.
  • the sampling device 104 is not limited to the above constructions or uses as described above.
  • the sampling device 104 may take other forms.
  • the sampling device 104 may comprise a CCD device, such as a CCD camera, that is coupled to a flow cell or is moved to interface with a flow cell having reaction sites therein.
  • Fig. 2 depicts one implementation of a flow cell 200 that contains clusters in its tiles.
  • the flow cell 200 corresponds to the flow cell 102 of Fig. 1, e.g., without the flow cover 136.
  • the depiction of the flow cell 200 is symbolic in nature, and the flow cell 200 symbolically depicts various lanes and tiles therewithin, without illustrating various other components therewithin.
  • Fig. 2 illustrates a top view of the flow cell 200.
  • the flow cell 200 is divided or partitioned in a plurality of lanes, such as lanes 202a, 202b, ..., 202P, i.e., P number of lanes.
  • individual lanes 202 are further partitioned into non-overlapping regions called “tiles” 212.
  • Fig. 2 illustrates a magnified view of a section 208 of an example lane.
  • the section 208 is illustrated to comprise a plurality of tiles 212.
  • each lane 202 comprises one or more columns of tiles.
  • each lane 202 comprises two corresponding columns of tiles 212, as illustrated within the magnified section 208.
  • a number of tiles within each column of tiles within each lane is implementation specific, and in one example, there can be 50 tiles, 60 tiles, 100 tiles, or another appropriate number of tiles in each column of tiles within each lane.
  • Each tile comprises a corresponding plurality of clusters.
  • the clusters and their surrounding background on the tiles are imaged.
  • Fig. 2 illustrates example clusters 216 within an example tile.
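The lane, tile, and cluster hierarchy described for the flow cell 200 can be pictured as a small nested data structure. The following is a minimal sketch under stated assumptions: the class names (`FlowCell`, `Lane`, `Tile`), the helper `build_flow_cell`, and the example count of eight lanes are hypothetical; the two columns per lane and fifty tiles per column are among the example values mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Tile:
    # (x, y) cluster centers on the tile; coordinates are illustrative only.
    clusters: List[Tuple[float, float]] = field(default_factory=list)


@dataclass
class Lane:
    # Each lane holds one or more columns of non-overlapping tiles.
    columns: List[List[Tile]] = field(default_factory=list)

    def tile_count(self) -> int:
        return sum(len(column) for column in self.columns)


@dataclass
class FlowCell:
    lanes: List[Lane] = field(default_factory=list)

    def tile_count(self) -> int:
        return sum(lane.tile_count() for lane in self.lanes)


def build_flow_cell(num_lanes: int, columns_per_lane: int, tiles_per_column: int) -> FlowCell:
    """Build an empty flow cell with the given (hypothetical) layout."""
    return FlowCell(
        lanes=[
            Lane(columns=[[Tile() for _ in range(tiles_per_column)]
                          for _ in range(columns_per_lane)])
            for _ in range(num_lanes)
        ]
    )


# Example layout: 8 lanes, 2 columns per lane, 50 tiles per column.
cell = build_flow_cell(num_lanes=8, columns_per_lane=2, tiles_per_column=50)
print(cell.tile_count())  # 8 * 2 * 50 = 800
```

Keeping tiles as the unit that owns its clusters mirrors the way tile data is later handled one tile at a time.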
  • Fig. 3 illustrates an example Illumina GA-IIx™ flow cell with eight lanes, and also illustrates a zoom-in on one tile and its clusters and their surrounding background. For example, there are a hundred tiles per lane in Illumina Genome Analyzer II and sixty-eight tiles per lane in Illumina HiSeq2000. A tile 212 holds hundreds of thousands to millions of clusters.
  • an image generated from a tile with clusters shown as bright spots is shown at 308 (e.g., 308 is a magnified image view of a tile), with an example cluster 304 labelled.
  • a cluster 304 comprises approximately one thousand identical copies of a template molecule, though clusters vary in size and shape.
  • the clusters are grown from the template molecule, prior to the sequencing run, by bridge amplification of the input library.
  • the purpose of the amplification and cluster growth is to increase the intensity of the emitted signal since the imaging device cannot reliably sense a single fluorophore.
  • the physical distance of the DNA fragments within a cluster 304 is small, so the imaging device perceives the cluster of fragments as a single spot 304.
  • Fig. 4 is a simplified block diagram of the system for analysis of sensor data from a sequencing system, such as base call sensor outputs (e.g., see Fig. 1).
  • the system includes a sequencing machine 400 and a configurable processor 450.
  • the configurable processor 450 can execute a neural network-based base caller in coordination with a runtime program executed by a host processor, such as a central processing unit (CPU) 402.
  • the sequencing machine 400 comprises base call sensors and a flow cell 401 (e.g., discussed with respect to Figs. 1-3).
  • the flow cell can comprise one or more tiles in which clusters of genetic material are exposed to a sequence of analyte flows used to cause reactions in the clusters to identify the bases in the genetic material, as discussed with respect to Figs. 1-3.
  • the sensors sense the reactions for each cycle of the sequence in each tile of the flow cell to provide tile data. Examples of this technology are described in more detail below. Genetic sequencing is a data intensive operation, which translates base call sensor data into sequences of base calls for each cluster of genetic material sensed during a base call operation.
  • the system in this example includes the CPU 402 which executes a runtime program to coordinate the base call operations, memory 403 to store sequences of arrays of tile data, base call reads produced by the base calling operation, and other information used in the base call operations. Also, in this illustration the system includes memory 404 to store a configuration file (or files), such as FPGA bit files, and model parameters for the neural network used to configure and reconfigure the configurable processor 450 and execute the neural network.
  • the sequencing machine 400 can include a program for configuring a configurable processor, and in some embodiments, a reconfigurable processor to execute the neural network.
  • the sequencing machine 400 is coupled by a bus 405 to the configurable processor 450.
  • the bus 405 can be implemented using a high throughput technology, such as in one example bus technology compatible with the PCIe standards (Peripheral Component Interconnect Express) currently maintained and developed by the PCI-SIG (PCI Special Interest Group).
  • a memory 460 is coupled to the configurable processor 450 by a bus 461.
  • the memory 460 can be on-board memory, disposed on a circuit board with the configurable processor 450.
  • the memory 460 is used for high-speed access by the configurable processor 450 of working data used in the base call operation.
  • the bus 461 can also be implemented using a high throughput technology, such as bus technology compatible with the PCIe standards.
  • Configurable processors including Field Programmable Gate Arrays (FPGAs), Coarse Grained Reconfigurable Arrays (CGRAs), and other configurable and reconfigurable devices, can be configured to implement a variety of functions more efficiently or faster than might be achieved using a general-purpose processor executing a computer program.
  • Configuration of configurable processors involves compiling a functional description to produce a configuration file, referred to sometimes as a bitstream or bit file, and distributing the configuration file to the configurable elements on the processor.
  • the configuration file defines the logic functions to be executed by the configurable processor, by configuring the circuit to set data flow patterns, use of distributed memory and other on-chip memory resources, lookup table contents, operations of configurable logic blocks and configurable execution units like multiply-and-accumulate units, configurable interconnects and other elements of the configurable array.
  • a configurable processor is reconfigurable if the configuration file may be changed in the field, by changing the loaded configuration file.
  • the configuration file may be stored in volatile SRAM elements, in non-volatile read-write memory elements, and in combinations of the same, distributed among the array of configurable elements on the configurable or reconfigurable processor.
  • a variety of commercially available configurable processors are suitable for use in a base calling operation as described herein.
  • Examples include commercially available products such as Xilinx Alveo™ U200, Xilinx Alveo™ U250, Xilinx Alveo™ U280, Intel/Altera Stratix™ GX2800, and Intel Stratix™ GX10M.
  • a host CPU can be implemented on the same integrated circuit as the configurable processor.
  • Embodiments described herein implement the multi-cycle neural network using a configurable processor 450.
  • the configuration file for a configurable processor can be implemented by specifying the logic functions to be executed using a high-level description language (HDL) or a register transfer level (RTL) language specification.
  • the specification can be compiled using the resources designed for the selected configurable processor to generate the configuration file.
  • the same or similar specification can be compiled for the purposes of generating a design for an application-specific integrated circuit which may not be a configurable processor.
  • Alternatives for the configurable processor in all embodiments described herein therefore include a configured processor comprising an application-specific integrated circuit (ASIC), a special-purpose integrated circuit or set of integrated circuits, or a system-on-a-chip (SoC) device, configured to execute a neural network-based base call operation as described herein.
  • In general, configurable processors and configured processors described herein, as configured to execute runs of a neural network, are referred to herein as neural network processors.
  • the configurable processor 450 is configured in this example by a configuration file loaded using a program executed by the CPU 402, or by other sources, which configures the array of configurable elements on the configurable processor 450 to execute the base call function.
  • the configuration includes data flow logic 451 which is coupled to the buses 405 and 461 and executes functions for distributing data and control parameters among the elements used in the base call operation.
  • the configurable processor 450 is configured with base call execution logic 452 to execute a multi-cycle neural network.
  • the logic 452 comprises a plurality of multi-cycle execution clusters (e.g., 453) which, in this example, includes multi-cycle cluster 1 through multi-cycle cluster X. The number of multi-cycle clusters can be selected according to a trade-off involving the desired throughput of the operation, and the available resources on the configurable processor.
  • the multi-cycle clusters are coupled to the data flow logic 451 by data flow paths 454 implemented using configurable interconnect and memory resources on the configurable processor. Also, the multi-cycle clusters are coupled to the data flow logic 451 by control paths 455 implemented using configurable interconnect and memory resources, for example, on the configurable processor, which provide control signals indicating available clusters, readiness to provide input units for execution of a run of the neural network to the available clusters, readiness to provide trained parameters for the neural network, readiness to provide output patches of base call classification data, and other control data used for execution of the neural network.
  • the configurable processor is configured to execute runs of a multi-cycle neural network using trained parameters to produce classification data for sensing cycles of the base flow operation.
  • a run of the neural network is executed to produce classification data for a subject sensing cycle of the base call operation.
  • a run of the neural network operates on a sequence including a number N of arrays of tile data from respective sensing cycles of N sensing cycles, where the N sensing cycles provide sensor data for different base call operations for one base position per operation in time sequence in the examples described herein.
  • some of the N sensing cycles can be out of sequence if needed according to a particular neural network model being executed.
  • the number N can be any number greater than one.
  • sensing cycles of the N sensing cycles represent a set of sensing cycles for at least one sensing cycle preceding the subject sensing cycle and at least one sensing cycle following the subject cycle in time sequence. Examples are described herein in which the number N is an integer equal to or greater than five.
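How a run selects its N sensing cycles around the subject cycle can be sketched as a simple index window. This is a minimal illustration, assuming an odd N with a symmetric window (N = 5 gives two preceding and two following cycles). The function name `cycle_window` and the clamping behaviour at the start and end of a sequencing run are hypothetical choices, not details stated in this disclosure.

```python
from typing import List


def cycle_window(subject: int, n: int, total_cycles: int) -> List[int]:
    """Return the n sensing-cycle indices used for one run of the network.

    Assumes n is odd so the window is symmetric around the subject cycle.
    Windows that would run past the first or last cycle are shifted inward
    so that n cycles are still returned (a hypothetical edge policy).
    """
    if n < 2 or n % 2 == 0:
        raise ValueError("this sketch expects an odd n greater than one")
    if total_cycles < n:
        raise ValueError("not enough sensing cycles for the window")
    if not 0 <= subject < total_cycles:
        raise ValueError("subject cycle out of range")
    half = n // 2
    start = min(max(subject - half, 0), total_cycles - n)
    return list(range(start, start + n))


print(cycle_window(subject=10, n=5, total_cycles=100))  # [8, 9, 10, 11, 12]
print(cycle_window(subject=0, n=5, total_cycles=100))   # [0, 1, 2, 3, 4]
```

Because the text allows some of the N cycles to be out of sequence, a different network model could replace the contiguous range with any other selection rule.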
  • the data flow logic 451 is configured to move tile data and at least some trained parameters of the model from the memory 460 to the configurable processor for runs of the neural network, using input units for a given run including tile data for spatially aligned patches of the N arrays.
  • the input units can be moved by direct memory access operations in one DMA operation, or in smaller units moved during available time slots in coordination with the execution of the neural network deployed.
  • Tile data for a sensing cycle as described herein can comprise an array of sensor data having one or more features.
  • the sensor data can comprise two images which are analyzed to identify one of four bases at a base position in a genetic sequence of DNA, RNA, or other genetic material.
  • the tile data can also include metadata about the images and the sensors.
  • the tile data can comprise information about alignment of the images with the clusters such as distance from center information indicating the distance of each pixel in the array of sensor data from the center of a cluster of genetic material on the tile.
  • tile data can also include data produced during execution of the multi-cycle neural network, referred to as intermediate data, which can be reused rather than recomputed during a run of the multi-cycle neural network.
  • the data flow logic can write intermediate data to the memory 460 in place of the sensor data for a given patch of an array of tile data. Embodiments like this are described in more detail below.
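The distance-from-center information mentioned above as part of the tile data can be pictured as an extra per-pixel feature. The sketch below is illustrative only: it assumes Euclidean distance measured in pixel units to the nearest cluster center, and the function name `distance_from_center` and the (row, column) coordinate convention are hypothetical.

```python
import math
from typing import List, Tuple


def distance_from_center(height: int, width: int,
                         centers: List[Tuple[float, float]]) -> List[List[float]]:
    """Per-pixel distance to the nearest cluster center.

    centers holds (row, col) positions in pixel units. Using the nearest
    center and a Euclidean metric are assumptions made for this sketch.
    """
    if not centers:
        raise ValueError("at least one cluster center is required")
    return [
        [min(math.hypot(r - cr, c - cc) for cr, cc in centers) for c in range(width)]
        for r in range(height)
    ]


# A 3x3 patch with a single cluster centered on the middle pixel.
channel = distance_from_center(3, 3, centers=[(1.0, 1.0)])
for row in channel:
    print([round(d, 3) for d in row])
```

Such a feature could be stored alongside the two intensity images so that each pixel carries both its signal and its position relative to a cluster.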
  • a system for analysis of base call sensor output, comprising memory (e.g., 460) accessible by the runtime program storing tile data including sensor data for a tile from sensing cycles of a base calling operation.
  • the system includes a neural network processor, such as configurable processor 450 having access to the memory.
  • the neural network processor is configured to execute runs of a neural network using trained parameters to produce classification data for sensing cycles.
  • a run of the neural network is operating on a sequence of N arrays of tile data from respective sensing cycles of N sensing cycles, including a subject cycle, to produce the classification data for the subject cycle.
  • the data flow logic 451 is provided to move tile data and the trained parameters from the memory to the neural network processor for runs of the neural network using input units including data for spatially aligned patches of the N arrays from respective sensing cycles of N sensing cycles.
  • the neural network processor has access to the memory, and includes a plurality of execution clusters, the execution logic clusters in the plurality of execution clusters configured to execute a neural network.
  • the data flow logic has access to the memory and to execution clusters in the plurality of execution clusters, to provide input units of tile data to available execution clusters in the plurality of execution clusters, the input units including a number N of spatially aligned patches of arrays of tile data from respective sensing cycles, including a subject sensing cycle, and to cause the execution clusters to apply the N spatially aligned patches to the neural network to produce output patches of classification data for the spatially aligned patch of the subject sensing cycle, where N is greater than 1.
  • Fig. 5 is a simplified diagram showing aspects of the base calling operation, including functions of a runtime program executed by a host processor.
  • the outputs of image sensors from a flow cell (such as those illustrated in Figs. 1-2) are provided on lines 500 to image processing threads 501, which can perform processes on images such as resampling, alignment and arrangement in an array of sensor data for the individual tiles, and can be used by processes which calculate a tile cluster mask for each tile in the flow cell, which identifies pixels in the array of sensor data that correspond to clusters of genetic material on the corresponding tile of the flow cell.
  • one example algorithm is based on a process to detect clusters which are unreliable in the early sequencing cycles using a metric derived from the softmax output, and then the data from those wells/clusters are discarded, and no output data is produced for those clusters.
  • a process can identify clusters with high reliability during the first N1 (e.g., 25) base-calls, and reject the others.
  • Rejected clusters might be polyclonal or very weak intensity or obscured by fiducials. This procedure can be performed on the host CPU. In alternative implementations, this information would potentially be used to identify the necessary clusters of interest to be passed back to the CPU, thereby limiting the storage required for intermediate data.
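  • The cluster rejection procedure described above can be sketched as follows. The text only states that the metric is derived from the softmax output, so the particular metric used here (the mean of the highest base probability over the first N1 cycles), the threshold value, and the name `reliable_clusters` are hypothetical.

```python
def reliable_clusters(softmax_by_cycle, n1=25, threshold=0.9):
    """Return the indices of clusters to keep.

    softmax_by_cycle[cycle][cluster] is a list of four base probabilities.
    The metric and threshold below are illustrative assumptions.
    """
    early = softmax_by_cycle[:n1]
    n_clusters = len(early[0])
    keep = []
    for k in range(n_clusters):
        mean_top = sum(max(cycle[k]) for cycle in early) / len(early)
        if mean_top >= threshold:
            keep.append(k)
    return keep

confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.40, 0.35, 0.15, 0.10]  # e.g. a polyclonal or weak cluster
cycles = [[confident, ambiguous] for _ in range(25)]
kept = reliable_clusters(cycles)  # only cluster 0 passes the threshold
```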
  • the outputs of the image processing threads 501 are provided on lines 506 to a dispatch logic 510 in the CPU which routes the arrays of tile data to a data cache 504 on a high-speed bus 507, or on high-speed bus 505 to the multi-cluster neural network processor hardware 520, such as the configurable processor of Fig. 4, according to the state of the base calling operation.
  • the hardware 520 returns classification data output by the neural network to the dispatch logic 510, which passes the information to the data cache 504, or on lines 511 to threads 502 that perform base call and quality score computations using the classification data, and can arrange the data in standard formats for base call reads.
  • the outputs of the threads 502 that perform base calling and quality score computations are provided on lines 512 to threads 503 that aggregate the base call reads, perform other operations such as data compression, and write the resulting base call outputs to specified destinations for utilization by the customers.
  • the host can include threads (not shown) that perform final processing of the output of the hardware 520 in support of the neural network.
  • the hardware 520 can provide outputs of classification data from a final layer of the multi-cluster neural network.
  • the host processor can execute an output activation function, such as a softmax function, over the classification data to configure the data for use by the base call and quality score threads 502.
  • the host processor can execute input operations (not shown), such as resampling, batch normalization or other adjustments of the tile data prior to input to the hardware 520.
  • Fig. 6 is a simplified diagram of a configuration of a configurable processor such as that of Fig. 4.
  • the configurable processor comprises an FPGA with a plurality of high speed PCIe interfaces.
  • the FPGA is configured with a wrapper 600 which comprises the data flow logic described with reference to Fig. 1.
  • the wrapper 600 manages the interface and coordination with a runtime program in the CPU across the CPU communication link 609 and manages communication with the on-board DRAM 602 (e.g., memory 460) via DRAM communication link 610.
  • the data flow logic in the wrapper 600 provides patch data retrieved by traversing the arrays of tile data on the on-board DRAM 602 for the number N cycles to a cluster 601 and retrieves process data 615 from the cluster 601 for delivery back to the on-board DRAM 602.
  • the wrapper 600 also manages transfer of data between the on-board DRAM 602 and host memory, for both the input arrays of tile data, and for the output patches of classification data.
  • the wrapper transfers patch data on line 613 to the allocated cluster 601.
  • the wrapper 600 provides trained parameters, such as weights and biases on line 612 to the cluster 601 retrieved from the on-board DRAM 602.
  • the wrapper 600 provides configuration and control data on line 611 to the cluster 601 provided from, or generated in response to, the runtime program on the host via the CPU communication link 609.
  • the cluster can also provide status signals on line 616 to the wrapper 600, which are used in cooperation with control signals from the host to manage traversal of the arrays of tile data to provide spatially aligned patch data, and to execute the multi-cycle neural network over the patch data using the resources of the cluster 601.
  • Each cluster can be configured to provide classification data for base calls in a subject sensing cycle using the tile data of multiple sensing cycles described herein.
  • model data including kernel data like filter weights and biases can be sent from the host CPU to the configurable processor, so that the model can be updated as a function of cycle number.
  • a base calling operation can comprise, for a representative example, on the order of hundreds of sensing cycles.
  • A base calling operation can include paired end reads in some embodiments.
  • the model trained parameters may be updated once every 20 cycles (or other number of cycles), or according to update patterns implemented for particular systems and neural network models.
  • the trained parameters can be updated on the transition from the first part to the second part.
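  • A fixed update interval such as the 20-cycle example above amounts to selecting a parameter set by integer division of the cycle number. The following sketch assumes 0-based cycle numbering and a hypothetical helper name `parameter_set_index`; other update patterns are equally possible.

```python
def parameter_set_index(cycle, update_every=20):
    """Index of the trained-parameter set to use for a 0-based cycle number."""
    return cycle // update_every

# Cycles 0-19 use set 0, cycles 20-39 use set 1, and so on.
schedule = [parameter_set_index(c) for c in (0, 19, 20, 45)]
```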
  • image data for multiple cycles of sensing data for a tile can be sent from the CPU to the wrapper 600.
  • the wrapper 600 can optionally do some pre-processing and transformation of the sensing data and write the information to the on-board DRAM 602.
  • the input tile data for each sensing cycle can include arrays of sensor data including on the order of 4000 x 3000 pixels per sensing cycle per tile or more, with two features representing colors of two images of the tile, and one or two bytes per feature per pixel.
  • the array of tile data for each run of the multi-cycle neural network can consume on the order of hundreds of megabytes per tile.
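  • The order-of-magnitude figure above follows from the stated array dimensions. The sketch below works through the arithmetic, assuming a five-cycle input (as in the example of Fig. 7) and decimal megabytes; the function name is illustrative.

```python
def tile_input_bytes(width=4000, height=3000, features=2,
                     bytes_per_feature=2, cycles=5):
    """Bytes of sensor data needed for one run over one tile."""
    return width * height * features * bytes_per_feature * cycles

MB = 10**6
low = tile_input_bytes(bytes_per_feature=1) / MB   # one byte per feature
high = tile_input_bytes(bytes_per_feature=2) / MB  # two bytes per feature
```

  • With these assumptions, the input for one tile is between 120 MB and 240 MB, before any metadata such as DFC data is added.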
  • the tile data also includes an array of DFC data, stored once per tile, or other type of metadata about the sensor data and the tiles.
  • the wrapper allocates a patch to the cluster.
  • the wrapper fetches a next patch of tile data in the traversal of the tile and sends it to the allocated cluster along with appropriate control and configuration information.
  • the cluster can be configured with enough memory on the configurable processor to hold a patch of data that is being worked on in place (including, in some systems, patches from multiple cycles) and a patch of data that is to be worked on when processing of the current patch is finished, using a ping-pong buffer technique or raster scanning technique in various embodiments.
  • When an allocated cluster completes its run of the neural network for the current patch and produces an output patch, it will signal the wrapper.
  • the wrapper will read the output patch from the allocated cluster, or alternatively the allocated cluster will push the data out to the wrapper. Then the wrapper will assemble output patches for the processed tile in the DRAM 602.
  • the wrapper sends the processed output array for the tile back to the host/CPU in a specified format.
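  • The traversal, dispatch, and assembly steps performed by the wrapper can be sketched as a simple loop. This is a simplification under stated assumptions: patches are non-overlapping, each output patch has the same size as its input patch, and `run_cluster` is a hypothetical stand-in for a run of the neural network on an allocated cluster.

```python
def traverse_patches(height, width, patch):
    """Yield (row, col) origins of non-overlapping patches covering the tile."""
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            yield r, c

def run_tile(tile, patch, run_cluster):
    """Send each patch to `run_cluster` and assemble the output array."""
    height, width = len(tile), len(tile[0])
    output = [[None] * width for _ in range(height)]
    for r, c in traverse_patches(height, width, patch):
        block = [row[c:c + patch] for row in tile[r:r + patch]]
        result = run_cluster(block)  # stand-in for a neural network run
        for dr, out_row in enumerate(result):
            output[r + dr][c:c + len(out_row)] = out_row
    return output

tile = [[r * 4 + c for c in range(4)] for r in range(4)]
out = run_tile(tile, patch=2,
               run_cluster=lambda b: [[v * 10 for v in row] for row in b])
```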
  • the on-board DRAM 602 is managed by memory management logic in the wrapper 600.
  • the runtime program can control the sequencing operations to complete analysis of all the arrays of tile data for all the cycles in the run in a continuous flow to provide real time analysis.
  • Fig. 7 is a diagram of a multi-cycle neural network model which can be executed using the system described herein.
  • the example shown in Fig. 7 can be referred to as a five-cycle input, one-cycle output neural network.
  • the inputs to the multi-cycle neural network model include five spatially aligned patches (e.g., 700) from the tile data arrays of five sensing cycles of a given tile. Spatially aligned patches have the same aligned row and column dimensions (x,y) as other patches in the set, so that the information relates to the same clusters of genetic material on the tile in sequence cycles.
  • a subject patch is a patch from the array of tile data for cycle K.
  • the set of five spatially aligned patches includes a patch from cycle K-2 preceding the subject patch by two cycles, a patch from cycle K-1 preceding the subject patch by one cycle, a patch from cycle K+1 following the patch from the subject cycle by one cycle, and a patch from cycle K+2 following the patch from the subject cycle by two cycles.
  • the model includes a segregated stack 701 of layers of the neural network for each of the input patches.
  • stack 701 receives, as input, tile data for the patch from cycle K+2, and is segregated from the stacks 702, 703, 704, and 705 so they do not share input data or intermediate data.
  • all of the stacks 701-705 can have identical models, and identical trained parameters.
  • the models and trained parameters may be different in the different stacks.
  • Stack 702 receives, as input, tile data for the patch from cycle K+1.
  • Stack 703 receives, as input, tile data for the patch from cycle K.
  • Stack 704 receives, as input, tile data for the patch from cycle K-1.
  • Stack 705 receives, as input, tile data for the patch from cycle K-2.
  • the layers of the segregated stacks each execute a convolution operation of a kernel including a plurality of filters over the input data for the layer.
  • the patch 700 may include three features.
  • the output of the layer 710 may include many more features, such as 10 to 20 features.
  • the outputs of each of layers 711 to 716 can include any number of features suitable for a particular implementation.
  • the parameters of the filters are trained parameters for the neural network, such as weights and biases.
  • the output feature set (intermediate data) from each of the stacks 701-705 is provided as input to an inverse hierarchy 720 of temporal combinatorial layers, in which the intermediate data from the multiple cycles is combined.
  • the inverse hierarchy 720 includes a first layer including three combinatorial layers 721, 722, 723, each receiving intermediate data from three of the segregated stacks, and a final layer including one combinatorial layer 730 receiving intermediate data from the three temporal layers 721, 722, 723.
  • the output of the final combinatorial layer 730 is an output patch of classification data for clusters located in the corresponding patch of the tile from cycle K.
  • the output patches can be assembled into an output array of classification data for the tile for cycle K.
  • the output patches may have sizes and dimensions different from the input patches.
  • the output patches may include pixel-by-pixel data that can be filtered by the host to select cluster data.
  • the output classification data can then be applied to a softmax function 740 (or other output activation function) optionally executed by the host, or on the configurable processor, depending on the particular implementation.
  • An output function different from softmax could be used (e.g., making a base call output parameter according to largest output, then using a learned nonlinear mapping using context/network outputs to give base quality).
  • the output of the softmax function 740 can be provided as base call probabilities for cycle K (750) and stored in host memory to be used in subsequent processing.
  • Other systems may use another function for output probability calculation, e.g., another nonlinear model.
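  • For reference, a softmax output activation over the four classification values of one cluster can be sketched as below. Subtracting the maximum before exponentiating is a common numerical-stability choice and is an assumption of this example, as are the sample scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of classification scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four classification values for one cluster at cycle K (illustrative numbers).
probs = softmax([2.0, 0.5, -1.0, 0.1])
```

  • The resulting values are positive and sum to one, so they can be stored as base call probabilities for the subject cycle.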
  • the neural network can be implemented using a configurable processor with a plurality of execution clusters so as to complete evaluation of one tile cycle within the duration of the time interval, or close to the duration of the time interval, of one sensing cycle, effectively providing the output data in real time.
  • Data flow logic can be configured to distribute input units of tile data and trained parameters to the execution clusters, and to distribute output patches for aggregation in memory.
  • Input units of data for a five-cycle input, one-cycle output neural network like that of Fig. 7 are described with reference to Figs. 8A and 8B for a base call operation using two-channel sensor data.
  • the base call operation can execute two flows of analyte and two reactions that generate two channels of signals, such as images, which can be processed to identify which one of four bases is located at a current position in the genetic sequence for each cluster of genetic material.
  • a different number of channels of sensing data may be utilized.
  • base calling can be performed utilizing one-channel methods and systems.
  • Incorporated materials of U.S. Patent Application Publication No. 2013/0079232 discuss base calling using various number of channels, such as one-channel, two-channels, or four-channels.
  • Fig. 8A shows arrays of tile data for five cycles for a given tile, tile M, used for the purposes of executing a five-cycle input, one-cycle output neural network.
  • the five-cycle input tile data in this example can be written to the on-board DRAM, or other memory in the system which can be accessed by the data flow logic, and includes: for cycle K-2, an array 801 for channel 1 and an array 811 for channel 2; for cycle K-1, an array 802 for channel 1 and an array 812 for channel 2; for cycle K, an array 803 for channel 1 and an array 813 for channel 2; for cycle K+1, an array 804 for channel 1 and an array 814 for channel 2; and for cycle K+2, an array 805 for channel 1 and an array 815 for channel 2.
  • an array 820 of metadata for the tile can be written once in the memory, in this case a DFC file, included for use as input to the neural network along with each cycle.
  • Although Fig. 8A discusses two-channel base calling operations, using two channels is merely an example, and base calling can be performed using any other appropriate number of channels.
  • incorporated materials of U.S. Patent Application Publication No. 2013/0079232 discuss base calling using various number of channels, such as one-channel, two-channels, or four-channels, or another appropriate number of channels.
  • the data flow logic composes input units, which can be understood with reference to Fig. 8B, of tile data that includes spatially aligned patches of the arrays of tile data for each execution cluster configured to execute a run of the neural network over an input patch.
  • An input unit for an allocated execution cluster is composed by the data flow logic by reading spatially aligned patches (e.g., 851, 852, 861, 862, 870) from each of the arrays 801-805, 811, 815, 820 of tile data for the five input cycles, and delivering them via data paths (schematically 850) to memory on the configurable processor configured for use by the allocated execution cluster.
  • the allocated execution cluster executes a run of the five-cycle input/one-cycle output neural network, and delivers an output patch for the subject cycle K of classification data for the same patch of the tile in the subject cycle K.
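  • The composition of an input unit from spatially aligned patches can be sketched as follows. The dictionary layout and the names `crop` and `compose_input_unit` are assumptions for illustration; the essential point is that every cycle, every channel, and the per-tile metadata are cropped with the same row and column coordinates.

```python
def crop(array, row, col, size):
    """Return the size x size patch of a 2D array starting at (row, col)."""
    return [r[col:col + size] for r in array[row:row + size]]

def compose_input_unit(cycle_arrays, metadata, row, col, size):
    """Gather spatially aligned patches for one run of the network.

    cycle_arrays[k][ch] is the 2D array for cycle k, channel ch;
    metadata is the per-tile array (e.g. DFC data) stored once per tile.
    """
    return {
        "cycles": [[crop(ch, row, col, size) for ch in channels]
                   for channels in cycle_arrays],
        "metadata": crop(metadata, row, col, size),
    }

# Five cycles (K-2 .. K+2), two channels, 6 x 6 pixels. Each pixel records
# where it came from so the alignment is easy to inspect.
tile = [[[[(k, ch, r, c) for c in range(6)] for r in range(6)]
         for ch in range(2)] for k in range(5)]
dfc = [[0.0] * 6 for _ in range(6)]
unit = compose_input_unit(tile, dfc, row=2, col=3, size=2)
```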
  • Fig. 9 is a simplified representation of a stack of a neural network usable in a system like that of Fig. 7 (e.g., 701 and 720).
  • some functions of the neural network (e.g., 900, 902) are executed on the host, and other portions of the neural network (e.g., 901) are executed on the configurable processor.
  • a first function can be batch normalization (layer 910) performed on the CPU.
  • batch normalization as a function may be fused into one or more layers, and no separate batch normalization layer may be present.
  • a number of spatial, segregated convolution layers are executed as a first set of convolution layers of the neural network, as discussed above on the configurable processor.
  • the first set of convolution layers applies 2D convolutions spatially.
  • a first spatial convolution 921 is executed, followed by a second spatial convolution 922, followed by a third spatial convolution 923, and so on for a number L/2 of spatially segregated neural network layers in each stack (L is described with reference to Figure 7).
  • the number of spatial layers can be any practical number, which for context may range from a few to more than 20 in different embodiments.
  • kernel weights are stored, for example, in a (1,6,6,3,L) structure, since there are 3 input channels to this layer.
  • the “6” in this structure is due to storing coefficients in the transformed Winograd domain (the kernel size is 3x3 in the spatial domain but expands in the transform domain).
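  • The expansion from a 3x3 spatial kernel to a 6x6 transformed kernel can be illustrated with the widely published Winograd F(4x4, 3x3) kernel transform U = G g G^T, where G is a 6x3 matrix. The text does not state which Winograd variant is used, so the choice of F(4x4, 3x3) and of this particular G is an assumption that is merely consistent with the "6" in the storage structure.

```python
from fractions import Fraction as F

# Kernel-transform matrix G (6 x 3) of the common F(4x4, 3x3) formulation.
G = [
    [F(1, 4),  F(0),      F(0)],
    [F(-1, 6), F(-1, 6),  F(-1, 6)],
    [F(-1, 6), F(1, 6),   F(-1, 6)],
    [F(1, 24), F(1, 12),  F(1, 6)],
    [F(1, 24), F(-1, 12), F(1, 6)],
    [F(0),     F(0),      F(1)],
]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def winograd_kernel(g):
    """Transform a 3x3 spatial kernel g into the 6x6 Winograd domain."""
    return matmul(matmul(G, g), transpose(G))

g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
u = winograd_kernel(g)  # 6 x 6 coefficients stored in place of the 3 x 3 kernel
```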
  • the outputs of the stack of spatial layers are provided to temporal layers, including convolution layers 924, 925 executed on the FPGA.
  • Layers 924 and 925 can be convolution layers applying 1D convolutions across cycles.
  • the number of temporal layers can be any practical number, which for context may range from a few to more than 20 in different embodiments.
  • the first temporal layer, TEMP_CONV_0 layer 924, reduces the number of cycle channels from 5 to 3, as illustrated in Fig. 7.
  • the second temporal layer, layer 925 reduces the number of cycle channels from 3 to 1 as illustrated in Fig. 7, and reduces the number of feature maps to four outputs for each pixel, representing confidence in each base call.
  • the output of the temporal layers is returned to the CPU to apply, for example, a softmax function 930, or other function, to normalize the base call probabilities.
  • Fig. 10 illustrates an alternative implementation showing a 10-input, six-output neural network which can be executed for a base calling operation.
  • tile data for spatially aligned input patches from cycles 0 to 9 are applied to segregated stacks of spatial layers, such as stack 1001 for cycle 9.
  • the outputs of the segregated stacks are applied to an inverse hierarchical arrangement of temporal stacks 1020, having outputs 1035(2) through 1035(7) providing base call classification data for subject cycles 2 through 7.
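  • The relationship between input cycles and subject cycles in such an inverse hierarchy can be checked with a short sketch. It assumes two temporal levels, each combining three successive inputs on a sliding window, as described for Fig. 7; applying the same structure to Fig. 10 is an assumption, although it reproduces the stated subject cycles 2 through 7.

```python
def sliding_windows(items, size=3):
    """All runs of `size` successive items."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]

def subject_cycles(input_cycles, levels=2, size=3):
    """Return the cycles that have an output after `levels` temporal layers.

    Each element is tracked as the list of input cycles it depends on;
    the subject cycle of an output is the middle cycle of that list.
    """
    groups = [[c] for c in input_cycles]
    for _ in range(levels):
        groups = [sorted(set(sum(w, [])))
                  for w in sliding_windows(groups, size)]
    return [g[len(g) // 2] for g in groups]

five_cycle = subject_cycles([-2, -1, 0, 1, 2])  # offsets relative to cycle K
ten_cycle = subject_cycles(list(range(10)))     # cycles 0 to 9
```

  • Under these assumptions the five-cycle network yields one output (for cycle K), and the ten-cycle network yields six outputs.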
  • Fig. 11 illustrates one implementation of the specialized architecture of the neural network-based base caller (e.g., Fig. 7) that is used to segregate processing of data for different sequencing cycles. The motivation for using the specialized architecture is described first.
  • the neural network-based base caller processes data for a current sequencing cycle, one or more preceding sequencing cycles, and one or more successive sequencing cycles. Data for additional sequencing cycles provides sequence-specific context. The neural network-based base caller learns the sequence-specific context during training and applies it when base calling. Furthermore, data for pre and post sequencing cycles provides second order contribution of pre-phasing and phasing signals to the current sequencing cycle.
  • the specialized architecture comprises spatial convolution layers that do not mix information between sequencing cycles and only mix information within a sequencing cycle.
  • Spatial convolution layers use so-called “segregated convolutions” that operationalize the segregation by independently processing data for each of a plurality of sequencing cycles through a “dedicated, non-shared” sequence of convolutions.
  • the segregated convolutions convolve over data and resulting feature maps of only a given sequencing cycle, i.e., intra-cycle, without convolving over data and resulting feature maps of any other sequencing cycle.
  • the input data comprises (i) current data for a current (time t) sequencing cycle to be base called, (ii) previous data for a previous (time t-1) sequencing cycle, and (iii) next data for a next (time t+1) sequencing cycle.
  • the specialized architecture then initiates three separate data processing pipelines (or convolution pipelines), namely, a current data processing pipeline, a previous data processing pipeline, and a next data processing pipeline.
  • the current data processing pipeline receives as input the current data for the current (time t) sequencing cycle and independently processes it through a plurality of spatial convolution layers to produce a so-called “current spatially convolved representation” as the output of a final spatial convolution layer.
  • the previous data processing pipeline receives as input the previous data for the previous (time t-1) sequencing cycle and independently processes it through the plurality of spatial convolution layers to produce a so-called “previous spatially convolved representation” as the output of the final spatial convolution layer.
  • the next data processing pipeline receives as input the next data for the next (time t+1) sequencing cycle and independently processes it through the plurality of spatial convolution layers to produce a so-called “next spatially convolved representation” as the output of the final spatial convolution layer.
  • the current pipeline, one or more previous pipelines, and one or more next processing pipelines are executed in parallel.
  • the spatial convolution layers are part of a spatial convolutional network (or subnetwork) within the specialized architecture.
  • the neural network-based base caller further comprises temporal convolution layers that mix information between sequencing cycles, i.e., inter-cycles.
  • the temporal convolution layers receive their inputs from the spatial convolutional network and operate on the spatially convolved representations produced by the final spatial convolution layer for the respective data processing pipelines.
  • the inter-cycle operability freedom of the temporal convolution layers emanates from the fact that the misalignment property, which exists in the image data fed as input to the spatial convolutional network, is purged out from the spatially convolved representations by the stack, or cascade, of segregated convolutions performed by the sequence of spatial convolution layers.
  • Temporal convolution layers use so-called “combinatory convolutions” that groupwise convolve over input channels in successive inputs on a sliding window basis.
  • the successive inputs are successive outputs produced by a previous spatial convolution layer or a previous temporal convolution layer.
  • the temporal convolution layers are part of a temporal convolutional network (or subnetwork) within the specialized architecture.
  • the temporal convolutional network receives its inputs from the spatial convolutional network.
  • a first temporal convolution layer of the temporal convolutional network groupwise combines the spatially convolved representations between the sequencing cycles.
  • subsequent temporal convolution layers of the temporal convolutional network combine successive outputs of previous temporal convolution layers.
  • the output of the final temporal convolution layer is fed to an output layer that produces an output.
  • the output is used to base call one or more clusters at one or more sequencing cycles.
  • the specialized architecture processes information from a plurality of inputs in two stages.
  • segregated convolutions are used to prevent mixing of information between the inputs.
  • combinatory convolutions are used to mix information between the inputs.
  • the results from the second stage are used to make a single inference for the plurality of inputs.
  • the specialized architecture maps the plurality of inputs to the single inference.
  • the single inference can comprise more than one prediction, such as a classification score for each of the four bases (A, C, T, and G).
  • the inputs have temporal ordering such that each input is generated at a different time step and has a plurality of input channels.
  • the plurality of inputs can include the following three inputs: a current input generated by a current sequencing cycle at time step (t), a previous input generated by a previous sequencing cycle at time step (t-1), and a next input generated by a next sequencing cycle at time step (t+1).
  • each input is respectively derived from the current, previous, and next inputs by one or more previous convolution layers and includes k feature maps.
  • each input can include the following five input channels: a red image channel (in red), a red distance channel (in yellow), a green image channel (in green), a green distance channel (in purple), and a scaling channel (in blue).
  • each input can be in blue and violet color channels (or one or more other appropriate color channels), instead of or in addition to red and green channels.
  • each input can be in blue and violet color channels, instead of or in addition to red, green, purple, and/or yellow channels.
  • each input can include k feature maps produced by a previous convolution layer and each feature map is treated as an input channel.
  • each input can have merely one channel, two channels, or another different number of channels. Incorporated materials of U.S. Patent Application Publication No. 2013/0079232 discuss base calling using various number of channels, such as one-channel, two-channels, or four-channels.
  • Fig. 12 depicts one implementation of segregated layers, each of which can include convolutions.
  • Segregated convolutions process the plurality of inputs at once by applying a convolution filter to each input in parallel.
  • the convolution filter combines input channels in a same input and does not combine input channels in different inputs.
  • a same convolution filter is applied to each input in parallel.
  • a different convolution filter is applied to each input in parallel.
  • each spatial convolution layer comprises a bank of k convolution filters, each of which applies to each input in parallel.
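  • A segregated convolution can be sketched as the same filter applied to each per-cycle input independently, so that channels within an input are mixed but inputs are never mixed with each other. The sketch below uses a single filter, a "valid" window, and the cross-correlation form common in neural network libraries; these details and the function names are assumptions for illustration.

```python
def conv2d_valid(channels, kernel):
    """Apply one filter to one input (a list of 2D channels).

    kernel[c][i][j] weights channel c, so channels of the same input are
    combined. Computed as a cross-correlation over valid positions only.
    """
    kh, kw = len(kernel[0]), len(kernel[0][0])
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(kernel[c][i][j] * channels[c][y + i][x + j]
                 for c in range(len(channels))
                 for i in range(kh) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]

def segregated_conv(inputs, kernel):
    """Apply the same filter to each per-cycle input independently."""
    return [conv2d_valid(channels, kernel) for channels in inputs]

# Three cycles, two channels, 3 x 3 pixels; channel c of cycle t is constant t + c.
inputs = [[[[float(t + c)] * 3 for _ in range(3)] for c in range(2)]
          for t in range(3)]
kernel = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
out = segregated_conv(inputs, kernel)
```

  • Because each input is processed on its own, changing the data of one cycle leaves the feature maps of every other cycle unchanged.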
  • Fig. 13A depicts one implementation of combinatory layers, each of which can include convolutions.
  • Fig. 13B depicts another implementation of the combinatory layers, each of which can include convolutions.
  • Combinatory convolutions mix information between different inputs by grouping corresponding input channels of the different inputs and applying a convolution filter to each group. The grouping of the corresponding input channels and application of the convolution filter occurs on a sliding window basis.
  • a window spans two or more successive input channels representing, for instance, outputs for two successive sequencing cycles. Since the window is a sliding window, most input channels are used in two or more windows.
  • the different inputs originate from an output sequence produced by a preceding spatial or temporal convolution layer.
  • the different inputs are arranged as successive outputs and therefore viewed by a next temporal convolution layer as successive inputs.
  • the combinatory convolutions apply the convolution filter to groups of corresponding input channels in the successive inputs.
  • each successive input has temporal ordering such that a current input is generated by a current sequencing cycle at time step (t), a previous input is generated by a previous sequencing cycle at time step (t-1), and a next input is generated by a next sequencing cycle at time step (t+1).
  • each successive input is respectively derived from the current, previous, and next inputs by one or more previous convolution layers and includes k feature maps.
  • each input can include the following five input channels: a red image channel (in red), a red distance channel (in yellow), a green image channel (in green), a green distance channel (in purple), and a scaling channel (in blue).
  • each input can include k feature maps produced by a previous convolution layer and each feature map is treated as an input channel.
  • the depth B of the convolution filter is dependent upon the number of successive inputs whose corresponding input channels are groupwise convolved by the convolution filter on a sliding window basis. In other words, the depth B is equal to the number of successive inputs in each sliding window, which is also the group size.
  • each temporal convolution layer comprises a bank of k convolution filters, each of which applies to the successive inputs on a sliding window basis.
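  • The sliding-window behavior of a combinatory convolution can be sketched for a single pixel and a single filter. Here each window spans B successive inputs, corresponding channels of those inputs are grouped, and the weighted groups are summed into one output per window. The reduction to one pixel, the weight layout `weights[b][c]`, and the function name are simplifying assumptions.

```python
def combinatory_conv(successive_inputs, weights):
    """Mix information across successive inputs on a sliding window.

    successive_inputs[t][c] is the value of channel c of input t (one pixel).
    weights[b][c] weights channel c of the b-th input in a window, so the
    filter depth B is len(weights).
    """
    depth = len(weights)
    outputs = []
    for start in range(len(successive_inputs) - depth + 1):
        window = successive_inputs[start:start + depth]
        outputs.append(sum(weights[b][c] * window[b][c]
                           for b in range(depth)
                           for c in range(len(window[b]))))
    return outputs

# Five successive inputs with two channels each, and a filter of depth B = 3
# that sums channel 0 across the window and ignores channel 1.
inputs = [[t, 10 * t] for t in range(5)]
weights = [[1, 0], [1, 0], [1, 0]]
mixed = combinatory_conv(inputs, weights)
```

  • Five successive inputs and a depth of three give three outputs, and the inner inputs each contribute to more than one window.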
  • Fig. 14A illustrates a base calling system 1400 generating quality scores corresponding to A, C, T, and G for various bases to be called.
  • the base calling system 1400 comprises a sequencing machine 1404, such as the sequencing machine 400 of Fig. 4.
  • the sequencing machine 1404 includes a biosensor (not illustrated in Fig. 14A) comprising a flow cell 1405, similar to the flow cell 102 of the biosensor 100 of Fig. 1.
  • the flow cell 1405 of the system 1400 comprises a plurality of tiles 1406, where each tile comprises a plurality of corresponding clusters 1407.
  • the flow cell 1405 comprises a plurality of lanes of tiles, with each tile including a corresponding plurality of clusters, as discussed with respect to Fig. 2.
  • the flow cell 1405 is illustrated to include some such example clusters 1407 of an example tile.
  • a base call (A, C, G, T) for every cluster at a specific sequencing cycle is predicted, accompanied by corresponding probability scores 1424 and/or quality scores 1432, as will be discussed in further detail herein.
  • the sequencing machine 1404 generates sensor data 1412. For example, sensor data for individual clusters and for individual sequencing cycles are generated. Sensor data for a specific cluster and for a specific sequencing cycle is indicative of a base populating the specific cluster for the specific sequencing cycle.
  • the system 1400 comprises a base caller 1416. Based on the sensor data 1412, the base caller 1416 calls bases of the sequence loaded in the clusters. For example, during a base calling cycle, the base caller 1416 identifies a nucleotide base in a nucleic acid sequence in individual clusters.
  • Base calling refers to the process of determining a base call (A, C, G, T) for every cluster at a specific cycle. As an example, base calling can be performed utilizing four-channel, two-channel or one-channel methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
  • a type of sensor data 1412 generated by the sequencing machine 1404 is based on the type of sequencing machine 1404 used.
  • some of the sequencing machines discussed herein generate the sensor data 1412 in the form of images captured by sensors in the flow cell, as previously discussed herein.
  • image data is derived from sequencing images produced by a sequencer of the sequencing machine during a sequencing run.
  • the sensor data 1412 depicts intensity emissions of a set of analytes, where intensity emissions are captured as an image (see Fig. 17E for example images comprising intensity information).
  • the intensity emissions are generated by analytes in the set of analytes during sequencing cycles of a sequencing run.
  • a memory stores the images including the intensity emission of the sensor data 1412.
  • the image data comprises n*n image patches extracted from the sequencing images, where n is any number ranging between 1 and 10,000, or another appropriate range.
  • the sequencing run produces m image(s) per sequencing cycle for corresponding m image channels, and an image patch is extracted from each of the m image(s) to prepare the image data for a particular sequencing cycle.
  • m is 4 or 2.
  • m is 1, 3, or greater than 4.
  • the image data is in the optical, pixel domain in some implementations, and in the upsampled, subpixel domain in other implementations.
  • the image data comprises data for multiple sequencing cycles (e.g., a current sequencing cycle, one or more preceding sequencing cycles, and one or more successive sequencing cycles).
  • the image data comprises data for three sequencing cycles, such that data for a current (time t) sequencing cycle to be base called is accompanied with (i) data for a left flanking/context/previous/preceding/prior (time t-1) sequencing cycle and (ii) data for a right flanking/context/next/successive/subsequent (time t+1) sequencing cycle (e.g., see Figs. 7 and 10).
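  • The three-cycle arrangement described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `cycle_window`, the `flank` parameter, and the choice to raise an error when a flanking cycle is missing are all assumptions made for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cycle_window(per_cycle_data: Sequence[T], t: int, flank: int = 1) -> List[T]:
    """Return the data for cycles t-flank .. t+flank (inclusive).

    With flank=1 this yields the previous (t-1), current (t), and next (t+1)
    sequencing cycles. Handling of the first and last cycles is not specified
    in the text, so this sketch simply rejects windows that run off the ends.
    """
    if t - flank < 0 or t + flank >= len(per_cycle_data):
        raise IndexError("not enough flanking cycles around cycle t")
    return list(per_cycle_data[t - flank:t + flank + 1])


# Usage: four cycles of placeholder data, base calling cycle index 2.
window = cycle_window(["cycle0", "cycle1", "cycle2", "cycle3"], t=2)
```

  • In this sketch, `window` holds the data for the left flanking, current, and right flanking cycles, in that order.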
  • the image data comprises data for a single sequencing cycle.
  • the image data depicts intensity emissions of one or more clusters and their surrounding background.
  • the image patches are extracted from the sequencing images in such a way that each image patch contains the center of the target cluster in its center pixel, a concept referred to herein as the “target cluster-centered patch extraction”.
  • the image data is encoded in input data using intensity channels (also called image channels). For each of the m images obtained from the sequencer for a particular sequencing cycle, a separate image channel is used to encode its intensity data.
  • the input data comprises (i) a first red image channel with nxn pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the red image and (ii) a second green image channel with nxn pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the green image.
  • a biosensor comprises an array of light sensors.
  • a light sensor is configured to sense information from a corresponding pixel area (e.g., a reaction site/well/nanowell) on the detection surface of the biosensor.
  • An analyte disposed in a pixel area is said to be associated with the pixel area, i.e., the associated analyte.
  • the light sensor corresponding to the pixel area is configured to detect/capture/sense emissions/photons from the associated analyte and, in response, generate a pixel signal for each imaged channel.
  • each imaged channel corresponds to one of a plurality of filter wavelength bands.
  • each imaged channel corresponds to one of a plurality of imaging events at a sequencing cycle.
  • each imaged channel corresponds to a combination of illumination with a specific laser and imaging through a specific optical filter.
  • Pixel signals from the light sensors are communicated to a signal processor coupled to the biosensor (e.g., via a communication port).
  • the signal processor produces an image whose pixels respectively depict/contain/denote/represent/characterize pixel signals obtained from the corresponding light sensors.
  • a pixel in the image corresponds to: (i) a light sensor of the biosensor that generated the pixel signal depicted by the pixel, (ii) an associated analyte whose emissions were detected by the corresponding light sensor and converted into the pixel signal, and (iii) a pixel area on the detection surface of the biosensor that holds the associated analyte.
  • a sequencing run uses two different imaged channels: a red channel and a green channel. Then, at each sequencing cycle, the signal processor produces a red image and a green image. This way, for a series of k sequencing cycles of the sequencing run, a sequence with k pairs of red and green images is produced as output.
  • Pixels in the red and green images have one-to-one correspondence within a sequencing cycle. This means that corresponding pixels in a pair of the red and green images depict intensity data for the same associated analyte, albeit in different imaged channels. Similarly, pixels across the pairs of red and green images have one-to-one correspondence between the sequencing cycles. This means that corresponding pixels in different pairs of the red and green images depict intensity data for the same associated analyte, albeit for different acquisition events/timesteps (sequencing cycles) of the sequencing run. Corresponding pixels in the red and green images (i.e., different imaged channels) can be considered a pixel of a “per-cycle image” that expresses intensity data in a first red channel and a second green channel.
  • a per-cycle image whose pixels depict pixel signals for a subset of the pixel areas, i.e., a region (tile) of the detection surface of the biosensor, is called a “per-cycle tile image.”
  • a patch extracted from a per-cycle tile image is called a “per-cycle image patch.”
  • the patch extraction is performed by an input preparer.
  • the image data comprises a sequence of per-cycle image patches generated for a series of k sequencing cycles of a sequencing run.
  • the pixels in the per-cycle image patches contain intensity data for associated analytes and the intensity data is obtained for one or more imaged channels (e.g., a red channel and a green channel) by corresponding light sensors configured to detect emissions from the associated analytes.
  • the per-cycle image patches are centered at a center pixel that contains intensity data for a target associated analyte and non-center pixels in the per-cycle image patches contain intensity data for associated analytes adjacent to the target associated analyte.
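  • The target cluster-centered patch extraction described above can be sketched as follows. This is a hedged illustration only: images are modeled as nested Python lists, the names `extract_patch` and `prepare_input` are hypothetical, n is assumed to be odd, and border padding is deliberately omitted.

```python
from typing import List

Image = List[List[float]]  # one image channel, indexed as image[row][col]


def extract_patch(image: Image, center_row: int, center_col: int, n: int) -> Image:
    """Return an n x n patch whose center pixel is (center_row, center_col).

    Assumes n is odd and that the patch lies fully inside the image.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the patch has a center pixel")
    half = n // 2
    top, left = center_row - half, center_col - half
    if top < 0 or left < 0 or top + n > len(image) or left + n > len(image[0]):
        raise ValueError("patch extends past the image border")
    return [row[left:left + n] for row in image[top:top + n]]


def prepare_input(cycles: List[List[Image]], center_row: int,
                  center_col: int, n: int) -> List[List[Image]]:
    """Extract one patch per image channel for each sequencing cycle.

    cycles[t][c] is the image for cycle t and channel c (e.g., c=0 red,
    c=1 green). The result is indexed as [cycle][channel][row][col].
    """
    return [[extract_patch(img, center_row, center_col, n) for img in channels]
            for channels in cycles]


# Usage: three cycles, two channels, 5x5 images, 3x3 patches around pixel (2, 2).
red = [[r * 10 + c for c in range(5)] for r in range(5)]
green = [[100 + r * 10 + c for c in range(5)] for r in range(5)]
patches = prepare_input([[red, green]] * 3, center_row=2, center_col=2, n=3)
```

  • Because the same pixel coordinates are used for every channel and every cycle, the one-to-one pixel correspondence described above is preserved in the extracted patches.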
  • the image data is prepared by an input preparer.
  • the sensor data 1412 can be indicative of a chemical property (such as pH level) that, in turn, is indicative of a base to be predicted.
  • pH changes may be induced by the release of hydrogen ions during molecule extension.
  • the pH changes are detected and converted to a voltage change that is proportional to the number of bases incorporated (e.g., in the case of Ion Torrent).
  • the sensor data 1412 can be in the form of electrical signals (e.g., current or voltage) generated by the flow cell 1405.
  • the sensor data 1412 is constructed from nanopore sensing that uses biosensors to measure the disruption in current as an analyte passes through a nanopore or near its aperture while determining the identity of the base.
  • the Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: pass a single strand of DNA (or RNA) through a membrane via a nanopore and apply a voltage difference across the membrane.
  • the nucleotides present in the pore will affect the pore's electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore.
  • This electrical current signal (the ‘squiggle’ due to its appearance when plotted) is the raw data gathered by an ONT sequencer.
  • DAC integer data acquisition
  • the base caller 1416 can be any appropriate type of base caller.
  • the base caller 1416 can be a neural network based base caller, which is also referred to herein as “Deep Learning” based base caller, discussed with respect to Figs. 7-13B.
  • the base caller is an “RTA” based base caller, which comprises a non-neural network model that is at least in part linear. Examples of a Deep Learning based base caller and an RTA base caller are discussed in U.S. Non-Provisional Patent Application No. 16/826,126, entitled “Artificial Intelligence-Based Base Calling,” filed 20 March 2020 (Attorney Docket No. ILLM 1008-18/IP-1744-US), which is incorporated by reference for all purposes as if fully set forth herein.
  • the principles of this disclosure are not limited to a type of base caller used to generate base calls.
  • the base caller 1416 may be of some other appropriate type, which can process any appropriate type of sensor data, such as image and/or non-image type of sensor data previously discussed herein.
  • the base caller 1416 is local to the sequencing machine 1404.
  • the base caller 1416 and the sequencing machine 1404 are proximally located (e.g., within a same housing, or within two proximally located housings), and the base caller 1416 receives the sensor data 1412 directly from the sequencing machine 1404.
  • the base caller 1416 is located remotely relative to the sequencing machine 1404, which is an example of the so-called cloud-based base caller.
  • the base caller 1416 receives the sensor data 1412 from the sequencing machine 1404 via a computer network, such as the Internet.
  • the base caller 1416 comprises an output layer 1420 to generate probability scores of the bases to be called.
  • the output layer 1420 produces likelihoods (classification scores) of a base incorporated in the single target cluster at the current sequencing cycle being one of A, C, T, and G, and classifies the base as one of A, C, T, or G based on the likelihoods (e.g., the base with the maximum likelihood is selected).
  • the likelihoods are exponentially normalized scores produced by a softmax classification layer and sum to unity.
  • the output layer 1420 which for example may include a softmax layer, predicts a called base and corresponding probabilities P(A), P(C), P(T), P(G).
  • the sum of the probability scores, P(A)+P(C)+P(T)+P(G), is 1, i.e., the probability scores are normalized (e.g., using a softmax function in, or subsequent to, the output layer 1420).
  • the probability scores 1424 are also referred to herein as likelihood scores, softmax scores, confidence scores, and/or the like.
  • the probability scores 1424 are generated for each cluster and for each sequencing cycle of the sequencing run.
  • the base caller 1416 may also call a base.
  • the base caller 1416 may call the base to be an A, based on the probability score P(A) being higher than a threshold value and/or based on the probability score P(A) being higher than each of P(C), P(T), or P(G).
  • the base caller 1416 may call the base to be a G, based on the probability score P(G) being higher than a threshold value and/or based on the probability score P(G) being higher than each of P(A), P(C), or P(T).
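  • The normalization and base-calling rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the raw per-base scores, the function names `softmax` and `call_base`, and the choice to return `None` when the optional threshold is not exceeded are all hypothetical, and this sketch requires both conditions (maximum and threshold) when a threshold is given.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

BASES = ("A", "C", "T", "G")


def softmax(scores: Sequence[float]) -> list:
    """Exponentially normalize scores so that they sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def call_base(scores: Sequence[float],
              threshold: Optional[float] = None
              ) -> Tuple[Optional[str], Dict[str, float]]:
    """Return (called base, probabilities P(A), P(C), P(T), P(G)).

    The base with the highest probability is called. If a threshold is
    given and the highest probability does not exceed it, no base is
    called (None); that fallback is an assumption of this sketch.
    """
    probs = dict(zip(BASES, softmax(scores)))
    base = max(probs, key=probs.get)
    if threshold is not None and probs[base] <= threshold:
        return None, probs
    return base, probs


# Usage: raw scores in A, C, T, G order.
base, probs = call_base([4.0, 0.5, 0.2, 0.1], threshold=0.9)
```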
  • the base calling system 1400 further comprises a quality score generation module 1428 configured to transform the probability scores 1424 to corresponding quality scores 1432.
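  • a quality score Q is related to a corresponding probability score P as follows: $Q(A) = -10\log_{10}(1 - P(A))$, $Q(C) = -10\log_{10}(1 - P(C))$, $Q(T) = -10\log_{10}(1 - P(T))$, and $Q(G) = -10\log_{10}(1 - P(G))$ (equation 1), where: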
  • P(A), P(C), P(T), P(G) are respectively probabilities of the base being called is an A, a C, a T, or a G.
  • E(A) is an error probability associated with the base being called an A
  • E(C) is an error probability associated with the base being called a C
  • E(T) is an error probability associated with the base being called a T
  • E(G) is an error probability associated with the base being called a G.
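  • the quality scores can also be rewritten as: $Q(A) = -10\log_{10} E(A)$, and likewise for C, T, and G, where $E(A) = 1 - P(A)$ is the corresponding error probability.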
  • the quality scores are defined as a property which is logarithmically related to the base calling probability scores P or base calling error probability scores E.
  • a quality score Q(A) is a logarithmic-scale measure of the likelihood that a base to be called is an A
  • a quality score Q(C) is a logarithmic-scale measure of the likelihood that a base to be called is a C
  • the quality scores Q are also referred to as “Phred” scores, and are a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing machines, such as by the sequencing machine 1404.
  • Fig. 14A illustrates example quality scores 1432 corresponding to the probability scores 1424 for the example clusters 1407a and 1407b.
  • quality scores are calculated for relatively higher probability scores, such as for probability scores higher than a threshold value (such as higher than 0.9), e.g., as illustrated in Fig. 14B.
  • Fig. 14B illustrates a table 1460 indicating the relationship between the probability scores 1424, the quality scores 1432, corresponding error probabilities, and corresponding error rates.
  • the table 1460 of Fig. 14B is derived from equations 1, 2, and 3.
  • the table 1460 is self-explanatory.
  • a quality score is a measure of the probability of a sequencing error in a base call.
  • a relatively high value of the quality score implies that a base call is more reliable and less likely to be incorrect, and vice versa. For example, as seen in the table 1460, if the quality score of a base is 30, the probability that this base is called incorrectly is 0.001. This also indicates that the base call accuracy is 99.9%.
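  • The logarithmic relationship between probability scores, error probabilities, and quality scores can be checked numerically with a short sketch. The function names below are hypothetical; the formulas are the standard Phred-scale relationships that the text describes.

```python
import math


def quality_from_probability(p: float) -> float:
    """Quality score Q = -10 * log10(1 - P) for a base-call probability P."""
    if not 0.0 <= p < 1.0:
        raise ValueError("P must satisfy 0 <= P < 1 (Q is unbounded at P = 1)")
    return -10.0 * math.log10(1.0 - p)


def error_from_quality(q: float) -> float:
    """Error probability E = 10 ** (-Q / 10) for a quality score Q."""
    return 10.0 ** (-q / 10.0)


# Usage: a quality score of 30 corresponds to an error probability of 0.001,
# i.e., a base call accuracy of 99.9%.
error_q30 = error_from_quality(30.0)
accuracy_q30 = 1.0 - error_q30
```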
  • each of these modules is executed by a processor (e.g., the CPU 402 and/or the configurable processor 450, see Fig. 4).
  • computer readable instructions executable by such processor(s) cause implementation of these modules.
  • Fig. 14C illustrates a comparison operation between predicted quality scores 1432 predicted by the base calling system 1400 of Fig. 14A and true (e.g., empirically calculated) quality scores 1440.
  • a true quality score generation module 1448 generates true (e.g., empirically calculated) quality scores 1440.
  • a quality score comparison module 1436 receives the predicted quality scores 1432 that are predicted by the base calling system 1400. Note that the quality scores 1432 of Fig. 14A are referred to as predicted quality scores 1432 in Fig. 14C, to better distinguish these quality scores from the true quality scores 1440. The quality score comparison module 1436 also receives the true quality scores
  • Fig. 14D illustrates determination of true (e.g., empirically determined) quality scores 1440 of Fig. 14C.
  • true (e.g., empirically calculated) quality score generation module 1448 determines the true quality scores, e.g., by empirically calculating quality scores that are likely to be representative of a true likelihood associated with the quality scores.
  • Referring to Fig. 14D, assume that the base caller 1416 of Fig. 14A receives 1,000 inputs x1, x2, ..., x1000, which are sensor data 1412. Note that the number of 1,000 samples is a non-limiting example. Also assume that the base caller 1416 generates 1,000 probability scores 1424, such as probability scores P1, P2, ..., P1000. Each of these probability scores is associated with a corresponding base being called a corresponding one of A, C, T, or G.
  • P2 is a probability P2(T) of a base being called a T, and has a value of 0.992; and assume that P33 is a probability P33(A) of a base being called an A, and has a value of 0.21, as illustrated in Fig. 14D.
  • the associated probabilities are P2(A), P2(C), P2(T), and P2(G).
  • P2(T) is highest among P2(A), P2(C), P2(T), and P2(G). Accordingly, in the example of Fig.
  • P2 is assumed to simply be P2(T) (and not P2(A), P2(C), or P2(G)). That is, P2 is the highest among the associated four probability scores for the base number 2. Similarly, P33 is the highest among the associated four probability scores for the base number 33, and so on.
  • true or ground truth base labels y1, y2, ..., y1000 are received by the true quality score generation module 1448 (i.e., true base label y1 is for input x1, true base label y2 is for input x2, and so on).
  • a true base label is an actual ground truth base label for the base to be called. For example, assume that, for input x1 generated at a specific cluster for a specific sequencing cycle, base calling probabilities P(A), P(C), P(T), and P(G) are predicted.
  • the true base label y1 is the actual base (which can be one of A, C, T, or G) in that cluster and for that sequencing cycle.
  • the true base labels y1, ..., y1000 are known a priori, e.g., by sequencing a known base sequence.
  • each predicted probability score 1424 is assigned to a corresponding one of several pre-specified bins.
  • the predicted probability scores 1424 are assigned to corresponding ones of the following pre-specified bins: [0,0.1), [0.1, 0.2), ..., [0.9, 1.0], as illustrated in Fig. 14D.
  • as P33 is 0.21, the predicted probability score P33 is assigned to the bin [0.2, 0.3); and as P2 is 0.992, the predicted probability score P2 is assigned to the bin [0.9, 1.0].
  • predicted probability scores P33, P500, ..., P904 are assigned to the bin [0.2, 0.3)
  • predicted probability scores P1, P48, ..., P997 are assigned to the bin [0.8, 0.9)
  • predicted probability scores P2, P50, ..., P909 are assigned to the bin [0.9, 1.0]
  • the true quality score generation module 1448 calculates an accuracy or “true empirical likelihood” of individual bins.
  • the true quality score generation module 1448 checks to see if the corresponding true base label y2 is a T. If y2 is indeed a T, then the prediction P2 is correct.
  • This validation (or verification) process is repeated for each prediction and for each bin, e.g., to calculate a true probability of each bin. For example, assume that there are 50 probabilities P1, P48, ..., P907 in the bin [0.8, 0.9) and it is determined that 42 of those probabilities match with their corresponding true base labels y1, y48, ..., y907, respectively. Then a “true” or empirically determined probability for that bin is 42/50 or 0.84.
  • the true quality score 1440 for entries in that bin is then determined using equation 1. Specifically, the true quality score 1440 for entries in that bin is $-10\log_{10}(1-0.84)$ or 7.9588. Thus, probabilities P1, P48, ..., P907 in the bin [0.8, 0.9) are assigned the true quality score of 7.9588.
  • In contrast, assume, merely as an example, that predicted probability P997 assigned to the bin [0.8, 0.9) is 0.81, which corresponds to a quality score of $-10\log_{10}(1-0.81)$ or 7.2124.
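  • The binning and empirical accuracy calculation described above can be sketched as follows. This is a simplified illustration, not the disclosed module: the names `bin_index` and `true_quality_per_bin` are hypothetical, ten equal-width bins are assumed as in the example, and a bin with 100% accuracy is reported with an infinite quality score because the logarithm is unbounded there.

```python
import math
from typing import Dict, Sequence, Tuple


def bin_index(p: float, num_bins: int = 10) -> int:
    """Map a probability to a bin; the last bin is closed, e.g. [0.9, 1.0]."""
    return min(int(p * num_bins), num_bins - 1)


def true_quality_per_bin(predictions: Sequence[Tuple[str, float]],
                         labels: Sequence[str],
                         num_bins: int = 10) -> Dict[int, Tuple[float, float]]:
    """Return {bin: (empirical accuracy, true quality score)}.

    predictions holds (called base, predicted probability) pairs and
    labels holds the ground truth base for each prediction.
    """
    totals = [0] * num_bins
    correct = [0] * num_bins
    for (base, p), truth in zip(predictions, labels):
        b = bin_index(p, num_bins)
        totals[b] += 1
        correct[b] += int(base == truth)
    result = {}
    for b in range(num_bins):
        if totals[b] == 0:
            continue  # empty bins carry no empirical information
        accuracy = correct[b] / totals[b]
        quality = math.inf if accuracy == 1.0 else -10.0 * math.log10(1.0 - accuracy)
        result[b] = (accuracy, quality)
    return result


# Usage: 50 predictions in the bin [0.8, 0.9), of which 42 match the labels.
example_predictions = [("A", 0.85)] * 50
example_labels = ["A"] * 42 + ["C"] * 8
accuracy, quality = true_quality_per_bin(example_predictions, example_labels)[8]
```

  • With these inputs the sketch reproduces the worked example: an empirical accuracy of 42/50 and a true quality score of about 7.9588 for the bin.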
  • the quality score comparison module 1436 outputs quality score comparison results 1444, which compare the true quality scores 1440 with the predicted quality scores 1432, as will be discussed herein later in turn.
  • the binning illustrated in Fig. 14D is merely an oversimplified example.
  • the predicted probabilities are assigned among merely 10 bins.
  • the single bin [0.9, 1.0] may be subdivided in multiple bins, such as [0.9,0.91), [0.91,0.92), ..., [0.99,1.0]
  • the predicted quality scores 1432 may be binned instead. For example, the predicted quality scores 1432 are assigned in corresponding bins. Also, true quality scores for individual bins are calculated in the above discussed manner. Then the quality score comparison module 1436 can directly compare the true quality scores 1440 with the predicted quality scores 1432.
  • Fig. 15A illustrates a graph 1500a depicting a comparison between predicted quality scores 1432 and true quality scores 1440
  • Fig. 15B illustrates another graph 1500b depicting another comparison between predicted quality scores 1432 and true quality scores 1440.
  • Graph 1500a has a dashed line 1505a having a slope of 1. Thus, any point on the line 1505a has equal values of predicted quality score 1432 and true quality score 1440.
  • graph 1500b has a dashed line 1505b having a slope of 1. Thus, any point on the line 1505b has equal values of predicted quality score 1432 and true quality score 1440.
  • Note that many of the subsequent graphs presented herein will have a dashed line that has a slope of 1. For purposes of this disclosure, such lines are also referred to herein as “slope 1 line” or “lines with slope 1”.
  • Graph 1500a of Fig. 15A has a line 1510a depicting the relationship between the predicted quality scores 1432 (X axis) and the true quality scores 1440 (Y axis), for a specific implementation of a base caller.
  • the predicted score 1432 is usually more than the corresponding true score 1440.
  • a predicted quality score of 45 roughly corresponds to a true quality score 1440 of about 32.
  • the base caller is predicting quality scores that are higher than corresponding true or empirically calculated quality scores.
  • the base caller that generates the graph 1500a of Fig. 15A is “overconfident” about the prediction of the quality scores.
  • Graph 1500b of Fig. 15B has a line 1510b depicting the relationship between the predicted quality score 1432 and the true quality score 1440, for another specific implementation of a base caller.
  • the predicted score 1432 is usually less than the corresponding true score 1440.
  • a predicted quality score of 45 roughly corresponds to a true quality score 1440 of about 50.
  • the base caller is predicting quality scores that are lower than corresponding true or empirically calculated quality scores.
  • the base caller that generates the graph 1500b of Fig. 15B is “underconfident” about the prediction of the quality scores.
  • a base caller can be overconfident or underconfident when predicting the quality scores.
  • the quality scores predicted by the base caller should fully, or at least substantially (e.g., within a threshold of 1% or 5% or less), match the true quality scores.
  • the predicted quality scores versus true quality scores graph (e.g., as in Figs. 15A and 15B) should overlap the slope 1 line (e.g., lines 1505a and 1505b of Figs. 15A and 15B, respectively), or should closely follow (or closely align to) the slope 1 line.
  • the quality scores predicted by the base caller may not always match with the true quality scores (i.e., the points on the graph may not lie on the slope 1 line), thereby resulting in not fully accurate quality scores being generated by the base caller.
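  • The comparison between predicted and true quality scores can be sketched as a simple classification of each sampling point relative to the slope 1 line. The function name, the returned labels, and the interpretation of the 1% or 5% threshold as a relative tolerance on the true quality score are assumptions made for this illustration.

```python
def calibration_status(predicted_q: float, true_q: float,
                       tolerance: float = 0.05) -> str:
    """Classify a (predicted, true) quality score pair.

    A point is treated as calibrated when the predicted score is within
    `tolerance` (a relative fraction, e.g. 0.05 for 5%) of the true score.
    """
    if abs(predicted_q - true_q) <= tolerance * abs(true_q):
        return "calibrated"
    return "overconfident" if predicted_q > true_q else "underconfident"


# Usage with the approximate values read from the two example graphs:
status_15a = calibration_status(45.0, 32.0)  # predicted above true
status_15b = calibration_status(45.0, 50.0)  # predicted below true
```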
  • Fig. 16 illustrates another graph 1600 depicting a comparison between predicted quality scores 1432 (X axis) and true quality scores 1440 (Y axis). Similar to Figs. 15A and 15B, the graph 1600 of Fig. 16 also includes a “slope 1” line 1605 having a slope of 1.
  • the graph 1600 has a plurality of sampling points, for example, corresponding to human genome, and various other types of genomes, such as genomes of Acinetobacter baumannii (A. baumannii) bacteria, Bacillus cereus (B. cereus) bacteria, exomes, and bug pool genomes.
  • In the graph 1600 of Fig. 16, region 1625 (also referred to herein as diagonal mismatch region 1625) identifies mismatch between the predicted quality score 1432 and the true quality score 1440 in the diagonal region of the graph (e.g., in a region lying along the slope 1 line).
  • the diagonal mismatch region 1625 is mainly between true quality scores of about 15 to 40.
  • the sampling points are scattered around the slope 1 line, and many sampling points are off or misaligned with respect to the slope 1 line.
  • the substantially widest section of the region 1625 has a width of L1. Again, ideally, this width should be close to zero, with all the sampling points being close to the slope 1 line.
  • the region 1620 is also referred to herein as overconfident region 1620 (or saturation region 1620), as the base caller 1416 is overconfident in this region.
  • a corresponding predicted quality score is higher than a corresponding true quality score.
  • the true quality scores of sampling points within this region 1620 are between about 35 and 40.
  • the predicted quality scores of sampling points within this region 1620 are above 40.
  • an example sampling point within this region 1620 has a predicted quality score that is as high as 70, but has a true quality score of about 38.
  • the base caller 1416 is overconfident in its quality score prediction.
  • the predicted quality score saturates. That is, in the overconfident region 1620, an increase in the predicted quality score does not result in corresponding significant increase in the true quality score.
  • the overconfident region 1620 is also referred to as saturation region.
  • the true quality scores 1440 do not go above a threshold true score, which is about 40 (which translates to a probability score of 0.9999 and an error rate of 0.01%) in an example. This may be because of errors in the sequencing machine 1404 and/or the base calling system, which may occur due to amplification, preparation, bridge PCR, or other reasons. For example, during the previously discussed amplification process, amplification error may occur. As another example, library preparation error may occur during preparation of the input library for the amplification process. Another example of an error is associated with the bridge PCR. Such errors impose a limit on maximum achievable true quality scores. For example, due to these errors, even an adequately trained base caller may not predict quality scores that are truly above a threshold quality score.
  • each bin should have an adequate number of basecalls, to determine if the quality score is relatively well calibrated.
  • bin Q40 i.e., a bin including quality score of 40
  • the threshold quality score in the example of Fig. 16 is about 40 or 45.
  • Fig. 17A illustrates a base calling system 1700 including a normalization module 1704 for normalizing sensor data that are received by a base caller 1416.
  • the base calling system 1700 of Fig. 17A is at least in part similar to the base calling system 1400 of Fig. 14A, and similar components in the two systems are labeled using the same labels.
  • the base calling system 1700 of Fig. 17A includes the sequencing machine 1404 comprising the flow cell 1405, where the flow cell 1405 generates sensor data 1412.
  • the base calling system 1700 of Fig. 17A includes the base caller 1416 and the quality score generation module 1428.
  • the base calling system 1700 of Fig. 17A includes a normalization module 1704 configured to receive the sensor data 1412, normalize the sensor data 1412 to generate normalized sensor data 1712, and provide the normalized sensor data 1712 to the base caller 1416.
  • the base caller 1416 of the system 1700 of Fig. 17A now operates on the normalized sensor data 1712.
  • Fig. 17B illustrates two graphs 1701 and 1711 depicting a normalization operation on sensor data performed by the normalization module 1704 of the base calling system of Fig. 17A.
  • the first graph 1701 of Fig. 17B illustrates a histogram associated with the sensor data 1412
  • the second graph 1711 of Fig. 17B illustrates another histogram associated with the normalized sensor data 1712.
  • Referring now to the first graph 1701 of Fig. 17B, illustrated is a histogram depicting the distribution of intensity of the sensor data 1412.
  • the sensor data 1412 is assumed to be images of clusters having specific intensities. However, such an assumption does not limit the scope of this disclosure.
  • the teachings of this disclosure are also applicable for other types of sensor data, such as when the sensor data are represented by electrical signals (such as voltages or currents), chemical properties (e.g., pH levels), or the like.
  • the image intensity in the X axis of the graph 1701 ranges from about 220 to about 820, which is labeled as a first range 1702 in the graph 1701, where the image intensity has any appropriate unit.
  • the first range 1702 thus, is defined by a corresponding lower intensity of 220 and a corresponding upper intensity of 820.
  • the intensities are captured by image sensors in the flow cell. As previously discussed herein, an image intensity captured from a cluster during a sequencing cycle is indicative of a base to be called for that cluster for that sequencing cycle.
  • intensity value 240 represents a lower 0.5th percentile, where only 0.5% of the intensities are below 240 and the remaining 99.5% of the intensities are above 240.
  • intensity value 760 represents an upper 99.5th percentile, where 99.5% of the intensities are below 760 and only 0.5% of the intensities are above 760. That is, 99% of the intensities are within the intensity range of 240 to 760, which is labelled as the second range 1706 in Fig. 17B.
  • the second range 1706 is defined by a lower intensity of 240 and an upper intensity of 760. As seen, the second range 1706 is fully encompassed within the first range 1702.
  • the intensities outside this second range 1706 are outlier intensities that may not, in some examples, help in generating predicted quality scores matching true quality scores. Put differently, the outlier intensities result in some mismatch between the predicted quality scores and the true quality scores. Accordingly, in an embodiment, these outliers are removed during the normalization process.
  • intensities that are lower than the second range 1706 are assigned a value corresponding to a lower intensity of the second range 1706.
  • the lower outlier intensities (i.e., intensities that are between 220 and 240) are assigned an intensity of 240.
  • the lower outlier intensities are simply removed from consideration during the normalization process.
  • intensities that are higher than the second range 1706 are assigned a value corresponding to an upper intensity of the second range 1706.
  • the higher outlier intensities (i.e., intensities that are between 760 and 820) are simply removed from consideration during the normalization process.
  • the third range 1722 is defined by a lower intensity of 0 and an upper intensity of 255.
  • intensities within the third range 1722 can be represented using 8-bit data. In other examples, other upper and lower intensities for the third range 1722 can be used.
  • the third range is narrower than the second range.
  • the second range is from intensity 240 to 760, i.e., an intensity range of 520.
  • the third range is from intensity 0 to 255, i.e., an intensity range of 256. That is, the intensities in the second range are squeezed and mapped to the third range.
  • a sensor data having a first intensity value in the second range 1706 is mapped to have a second intensity value in the third range 1722.
  • the third range is defined by intensities 0 and 255 - i.e., has an intensity range of 256.
  • intensities between about 240 and 242 in the second range 1706 are mapped to intensity 0 in the third range 1722; intensities between about 242 and 244 in the second range 1706 are mapped to intensity 1 in the third range 1722; and so on, up to intensities between about 758 and 760 in the second range 1706, which are mapped to intensity 255 in the third range 1722. Each intensity level of the third range thus covers roughly two intensity units of the second range (520/256 ≈ 2.03).
  • the two histograms in the graphs 1701 and 1711 have a somewhat similar shape. In an example, a sum of all the bars in the histogram of graph 1701 and a sum of all the bars in the histogram of graph 1711 are substantially the same.
  • an area covered under the first histogram (associated with the graph 1701) within the second range 1706 and an area covered under the second histogram (associated with the graph 1711) within the third range 1722 are substantially equal.
  • the normalization, which includes processing the outlier intensities and the mapping, lowers variability between images from different sequencing runs and different sequencing run preparation processes, so that knowledge is more transferable between images of the sensor data.
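  • The outlier handling and range mapping described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `normalize_intensity`, its parameters, and the integer-floor mapping rule are assumptions; the bounds 240 and 760 and the 256 output levels are the example values from the text.

```python
def normalize_intensity(value, lo=240, hi=760, levels=256, drop_outliers=False):
    """Map one raw intensity from [lo, hi] onto an integer in [0, levels - 1]."""
    if value < lo or value > hi:
        if drop_outliers:
            return None  # outlier is removed from consideration
        value = min(max(value, lo), hi)  # outlier is assigned the nearest bound
    # Linear squeeze of the second range onto the third range (assumed rule).
    scaled = int((value - lo) * levels / (hi - lo))
    return min(scaled, levels - 1)  # value == hi would otherwise yield `levels`


raw = [225, 241, 243, 500, 759, 800]  # made-up intensities
normalized = [normalize_intensity(v) for v in raw]
dropped = [normalize_intensity(v, drop_outliers=True) for v in raw]
```

  • With `drop_outliers=True`, the two outliers (225 and 800) come back as `None` instead of being clamped, which corresponds to the alternative in which outliers are simply removed from consideration.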
  • Fig. 17C illustrates a graph 1710 depicting a comparison between predicted quality scores 1432 and true quality scores 1440, wherein the sensor data 1412 have been normalized by the normalization module 1704 of the base calling system 1700 of Fig. 17A while generating data for the graph of Fig. 17C. Similar to Fig. 16, the graph 1710 of Fig. 17C also includes a “slope 1” line 1785 having a slope of 1.
  • the graph 1710 has a plurality of sampling points for, for example, human genome, and various other types of genomes, such as genomes of Acinetobacter baumannii (A. baumannii) bacteria, Bacillus cereus ( B .
  • the graph 1600 of Fig. 16 is generated by a base calling system that does not normalize the sensor data 1412 (e.g., the base calling system 1400 of Fig. 14A), whereas the graph 1710 of Fig. 17C is generated by a base calling system that normalizes the sensor data 1412 and uses the normalized sensor data for base calling (e.g., the base calling system 1700 of Fig. 17A).
  • Comparing the diagonal mismatch region 1625 of the graph 1600 of Fig. 16 and a similar diagonal mismatch region 1725 of the graph 1710 of Fig. 17C, a significant performance improvement is noticed.
  • the diagonal mismatch region identifies mismatch between the predicted quality score 1432 and the true quality score 1440 in the diagonal region of the graph (e.g., on a region lying on the slope 1 line).
  • the diagonal mismatch region is mainly between true quality scores of 15 to 40. In this region the sampling points are scattered around the slope 1 line, and many sampling points are off the slope 1 line.
  • the substantially widest section of the region 1625 in the graph 1600 of Fig. 16 has a width of L1. Again, ideally, this width should be close to zero, with all the sampling points being close to the slope 1 line.
  • a corresponding substantially widest section of the region 1725 in the graph 1710 of Fig. 17C has a width of L2.
  • L2 in Fig. 17C is substantially lower than L1 in Fig. 16 (i.e., L2 ≪ L1). That is, in the graph 1710 of Fig. 17C, due to the normalization process, the sampling points are less scattered and better aligned to the slope 1 line, e.g., compared to the scattering and alignment of the sampling points in the graph 1600 of Fig. 16.
  • the inventors of this disclosure have found that, for true quality score between about 15 and 40, the normalization process helps the predicted quality score 1432 be better aligned to the true quality score 1440 (e.g., compared to a scenario without normalization).
  • Fig. 17D illustrates a plot indicating expected calibration error (ECE) for a base calling system having input normalization versus another base calling system lacking such an input normalization.
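  • For orientation, one common way to compute an expected calibration error is sketched below. The text does not state how the ECE of Fig. 17D is computed, so the binned definition used here, the Phred relation Q = -10·log10(p_error), and the names `phred_to_error_prob` and `expected_calibration_error` are all assumptions.

```python
def phred_to_error_prob(q):
    """Phred convention: Q = -10 * log10(p_error)."""
    return 10 ** (-q / 10)


def expected_calibration_error(pred_q, is_correct, num_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    n = len(pred_q)
    bins = [[] for _ in range(num_bins)]
    for q, ok in zip(pred_q, is_correct):
        conf = 1.0 - phred_to_error_prob(q)  # predicted probability of a correct call
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Made-up data: ten calls at Q10 (90% confidence), nine of them correct.
well_calibrated = expected_calibration_error([10] * 10, [True] * 9 + [False])
# Same confidence, but every call is correct, so the caller is underconfident.
miscalibrated = expected_calibration_error([10] * 10, [True] * 10)
```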
  • Fig. 17E illustrates a color comparison between the sensor data 1412 prior to normalization and normalized sensor data 1712.
  • a first image 1790a illustrates sensor data 1412 captured from the flow cell, prior to any normalization.
  • Locations of fiducials are illustrated in the image 1790a using oval shapes.
  • a solid support upon which a biological specimen is imaged can include such fiducial markers, to facilitate determination of the orientation of the specimen or the image thereof in relation to probes that are attached to the solid support.
  • Exemplary fiducials include, but are not limited to beads (with or without fluorescent moieties or moieties such as nucleic acids to which labeled probes can be bound), fluorescent molecules attached at known or determinable features, or structures that combine morphological shapes with fluorescent moieties.
  • Exemplary fiducials are set forth in U.S. Patent Publication No. 2002/0150909, which is incorporated herein by reference. Multiple clusters (such as hundreds of thousands, or even millions), although not labelled, are included in the illustration of Fig. 17E. Image data on and around a cluster is to be analyzed, to make a base call for the cluster. Note that the intensity scale in image 1790a is from 0 to 2000, with intensities around 200 to 800 being predominantly present, as discussed with respect to Fig. 17B.
  • a second image 1790b illustrates normalized sensor data 1712, e.g., after normalization has been performed on the sensor data 1412. Locations of the clusters are illustrated in the image 1790b using oval shapes. Image data on and around a cluster is to be analyzed, to make a base call for the cluster. Note that the intensity scale in image 1790b is from 0 to 255, e.g., as a result of normalization.
  • Fig. 17F illustrates a flowchart depicting an example method 1750 for normalizing sensor data, and using normalized sensor data for base calling operations.
  • a plurality of sensor data is received (e.g., by the normalization module 1704 of Fig. 17A) from a flow cell, where the plurality of sensor data are within a first range (e.g., first range 1702).
  • Fig. 17B illustrates an example in which the plurality of sensor data comprises the plurality of intensity values that are within the first range 1702.
  • a second range is identified (e.g., by the normalization module 1704 of Fig. 17A), such that at least a threshold percentage of the plurality of sensor data are within the second range.
  • Fig. 17B illustrates an example of the second range 1706, such that 99.0% of the sensor data are within this range. Note that 99.0% is used merely as an example, and other threshold percentages can also be envisioned by those skilled in the art, based on the teachings of this disclosure.
  • the outlier sensor data e.g., sensor data that are outside the second range
  • the lower outlier sensor data e.g., intensities that are between 220 and 240 in Fig. 17B
  • the upper outlier sensor data e.g., intensities that are between 760 and 820 in Fig. 17B
  • the outlier sensor data are simply ignored or taken out of consideration.
  • At 1770, at least a subset of the plurality of sensor data, e.g., those within the second range, are mapped to a third range (e.g., by the normalization module 1704 of Fig. 17A), to generate a plurality of normalized sensor data 1712.
  • intensities in the second range in the graph 1701 are mapped to corresponding intensities in the third range in the graph 1711.
  • if the outlier sensor data are taken out of consideration, then such outlier sensor data are not mapped at 1770, and only a subset of the plurality of sensor data, which are in the second range, are mapped to the third range.
  • the plurality of normalized sensor data is processed in a base caller, to call, for each of the plurality of normalized sensor data, a corresponding base.
  • the base caller 1416 of Fig. 17A receives the normalized sensor data 1712, and generates corresponding base calls.
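  • The range-identification step of method 1750 can be sketched as below. The text only requires that at least a threshold percentage of the sensor data fall within the second range; trimming an equal share of points from each tail is an assumed convention, and `central_range` and `fraction_within` are hypothetical helper names.

```python
import math


def central_range(values, threshold=0.99):
    """Return (low, high) bounds that contain at least `threshold` of the values."""
    ordered = sorted(values)
    n = len(ordered)
    # Trim the same number of points from each tail (assumed convention).
    trim = math.floor(n * (1.0 - threshold) / 2.0)
    return ordered[trim], ordered[n - 1 - trim]


def fraction_within(values, low, high):
    """Share of values lying inside the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


sensor_data = list(range(200, 1200))  # synthetic stand-in for raw intensities
low, high = central_range(sensor_data, threshold=0.99)
coverage = fraction_within(sensor_data, low, high)
```

  • The returned bounds would then serve as the second range, with the values inside it (or clamped to it) mapped to the third range and passed to the base caller.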
  • Fig. 18A illustrates a base calling system 1800 including a quality score remapping module 1804 for selectively remapping quality scores 1432 predicted by the base caller 1416.
  • the base calling system 1800 of Fig. 18A is at least in part similar to the base calling system 1400 of Fig. 14A, and similar components in the two systems are labelled using the same labels.
  • the base calling system 1800 of Fig. 18A includes the sequencing machine 1404 comprising the flow cell 1405, where the flow cell 1405 generates sensor data 1412.
  • the base calling system 1800 of Fig. 18A includes the base caller 1416 and the quality score generation module 1428.
  • the system 1800 of Fig. 18A may include the normalization module 1704 of Fig. 17A.
  • the base caller 1416 operates on the normalized sensor data 1712.
  • the system 1800 of Fig. 18A lacks such a normalization module 1704.
  • the base calling system 1800 of Fig. 18A includes a quality score remapping module 1804 configured to selectively remap the quality scores 1432 generated by the quality score generation module 1428, as discussed herein below.
  • in addition to remapping the quality scores, the base calling system 1800 may also include a quality score quantization module 1812 that quantizes the remapped quality scores 1832, to generate quantized remapped quality scores 1836.
  • the quality score quantization module 1812 is optional, and hence, is illustrated using dashed lines in Fig. 18A.
  • the system 1800 further comprises one or more Look Up Table(s) (LUTs) 1808 stored in a memory that is accessible to the quality score remapping module 1804.
  • Figs. 18B1, 18B2, 18B3, 18B4, and 18B5 in combination, illustrate examples of quality score remapping and quantization.
  • In Fig. 18B1, illustrated is a graph 1828a depicting predicted quality scores 1432 output by the base caller 1416 on the X axis, and corresponding true quality scores 1440 on the Y axis.
  • the predicted quality score is higher than the corresponding true quality score.
  • a sampling point 1827 in the overconfident region 1820
  • a true quality score of 19 a sampling point 1827 corresponding to a specific base of a specific cluster
  • the remapping module 1804 maps a quality score 1432 having a value of 56 to a remapped quality score having a value of 19.
  • the graph 1828a includes two types of sampling points: calibration points and operational points.
  • the calibration points have known ground truth base calls and known true quality scores 1440.
  • the calibration points are used to generate a LUT for the remapping (see Fig. 18B2), and subsequently the operational points use the LUT for being remapped to new quality scores.
  • the assumption here is that the remapping LUT generated using the calibration points is applicable for the operational points as well.
  • Fig. 18B2 illustrates a remapping LUT 1808a that stores mapping data between predicted quality scores 1432 and true quality scores 1440.
  • a predicted quality score of 56 actually corresponds to a true quality score of 19, as indicated in a first row of the remapping LUT 1808a.
  • Other rows of the remapping LUT 1808a are similarly populated.
  • the LUT 1808a is an oversimplified remapping LUT, to illustrate the teachings of this disclosure.
  • a remapping LUT is likely to have many more rows, for remapping various predicted quality scores 1432 to corresponding true quality scores 1440.
  • Fig. 18B3 illustrates a graph 1828c depicting remapped quality scores for the operational points of the graph 1828a of Fig. 18B1.
  • the sampling points corresponding to the quality scores now better align with the line with slope 1 (e.g., relative to the alignment of Fig. 18B1).
  • the remapped quality scores of Fig. 18B3 are now substantially closer to (equal to) their respective true quality scores (e.g., relative to the alignment of Fig. 18B1). Note that the remapping helps in alignment in the overconfident region 1820.
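  • A minimal sketch of building a remapping LUT from calibration points and applying it to operational points is given below. It assumes that the true quality score for a predicted score is the Phred-scaled empirical error rate of the calibration points sharing that predicted score; the text in this section does not spell out that computation, and `build_remap_lut` and `remap` are hypothetical names.

```python
import math
from collections import defaultdict


def build_remap_lut(calib_pred_q, calib_is_error):
    """Map each predicted score to the Phred-scaled error rate observed for it."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in zip(calib_pred_q, calib_is_error):
        totals[q] += 1
        errors[q] += int(is_error)
    lut = {}
    for q, n in totals.items():
        # Floor the error count at 0.5 so log10 never sees zero (an assumption).
        rate = max(errors[q], 0.5) / n
        lut[q] = round(-10.0 * math.log10(rate))
    return lut


def remap(pred_q, lut):
    """Replace each predicted score using the LUT; unseen scores pass through."""
    return [lut.get(q, q) for q in pred_q]


# Made-up calibration set: 10,000 calls predicted at Q56, 126 of them wrong.
calib_q = [56] * 10000
calib_err = [True] * 126 + [False] * 9874
lut = build_remap_lut(calib_q, calib_err)
operational = remap([56, 30], lut)
```

  • In this made-up calibration set the overconfident score of 56 is remapped to 19, mirroring the first row of the example LUT 1808a.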
  • Fig. 18B4 illustrates a LUT 1808b for quantizing the remapped quality scores.
  • each remapped quality score is assigned to one of 3 quantized quality scores corresponding to the three rows of the LUT 1808b.
  • the number of quantized quality scores is a mere example and does not limit the scope of this disclosure.
  • each remapped quality score can be assigned to one of Q number of quantized quality scores corresponding to Q number of rows of a LUT, where Q can be two, four, or higher.
  • the remapped quality scores are assigned or grouped into three bins [0,18), [18,30), and [30, infinity) (see first column of the LUT 1808b), although the ranges of the bins are mere examples and do not limit the scope of this disclosure.
  • the second column of the LUT 1808b indicates example quantized remapped quality scores corresponding to each bin.
  • remapped quality scores included in the bin [0,18) are assigned a quantized remapped quality score of 9.550; remapped quality scores included in the bin [18,30) are assigned a quantized remapped quality score of 22.840; and remapped quality scores included in the bin [30, inf) are assigned a quantized remapped quality score of 37.382.
  • the quantized quality scores 9.550, 22.840, and 37.382 are pre-specified in the LUT.
  • these numbers are generated by averaging the true quality scores of calibration sampling points (see Fig. 18B1) assigned to corresponding bins. For example, assume that 300 calibration sampling points are assigned to the bin [0,18). An average of the true quality scores of these 300 calibration sampling points, which are assigned to the bin [0,18), is determined to be 9.550. Accordingly, the bin [0,18) is assigned remapped quantized quality score of 9.550, which is an average of true quality scores of calibration sampling points included in this bin.
  • the third column of the LUT 1808b indicates an average of original (i.e., not remapped) quality scores in the respective bin. For example, following on with the above example where the 300 calibration sampling points are assigned to the bin [0,18), an average of their quality scores prior to the remapping is 9.347. Thus, by comparing the second and third columns of the LUT, one can appreciate how much the remapping changes the quality score. Put differently, for a given row (i.e., a given quality score bin), a deviation between the second and third columns of the LUT is an indication of a change in the average quality scores due to the remapping.
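  • The quantization LUT can be sketched as a sorted list of bin edges plus one representative value per bin. The edges and the values 9.550, 22.840, and 37.382 are the example values from the text; the helper names `quantize` and `bin_averages` and the sample calibration data are assumptions made for illustration.

```python
from bisect import bisect_right

BIN_EDGES = [18, 30]                   # bins: [0,18), [18,30), [30, infinity)
BIN_VALUES = [9.550, 22.840, 37.382]   # example second column of the LUT


def quantize(remapped_q):
    """Assign a remapped quality score to the representative value of its bin."""
    return BIN_VALUES[bisect_right(BIN_EDGES, remapped_q)]


def bin_averages(remapped_q, true_q, original_q, edges=(18, 30)):
    """Per bin: (mean true score, mean original score) over calibration points."""
    sums = [[0.0, 0.0, 0] for _ in range(len(edges) + 1)]
    for r, t, o in zip(remapped_q, true_q, original_q):
        b = bisect_right(edges, r)     # bin index chosen by the remapped score
        sums[b][0] += t
        sums[b][1] += o
        sums[b][2] += 1
    return [(s[0] / s[2], s[1] / s[2]) if s[2] else None for s in sums]


# Made-up calibration points: remapped, true, and original scores.
rows = bin_averages([5, 10, 20, 35], [6, 12, 21, 36], [4, 10, 25, 50])
```

  • Each tuple in `rows` corresponds to the second and third columns of one LUT row, so the difference between its two entries is the per-bin deviation discussed above.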
  • Fig. 18B5 is a graph 1828d illustrating the quantized scores.
  • Fig. 18B5 is at least in part similar to the graph 1828c of Fig. 18B3.
  • the system 1800 outputs the quantized remapped quality scores 1836 (e.g., instead of the remapped quality scores).
  • Fig. 18C1 and Fig. 18C2 illustrate two further examples of quality score remapping and quantization.
  • quality score remapping and quantization for sequencing read cycle 1 (referred to as Read 1) and sequencing read cycle 2 (referred to as Read 2) are illustrated.
  • two graphs are illustrated under Read 1: (i) a top graph 1840a illustrating remapping and quantization, and (ii) a bottom graph 1840b which is a histogram.
  • quality scores above about 40 deviate away from the line with slope 1.
  • the quality scores are remapped, which are illustrated using blue dots.
  • the remapped quality scores are better aligned with the slope 1 line (e.g., relative to the quality scores before remapping).
  • the histogram 1840b illustrates the original quality scores in red, and the remapped quality scores in blue. As illustrated, the original scores can be as high as 65 or 70, whereas the remapped quality scores are less than about 52.
  • two graphs are illustrated under Read 2: (i) a top graph 1840c illustrating remapping and quantization, and (ii) a bottom graph 1840d which is a histogram, each of which will be evident based on the above discussion with respect to the graphs of Read 1.
  • the base caller 1416 makes a base call for a current sequencing cycle by processing a window of sequencing images for a plurality of sequencing cycles, including the current sequencing cycle contextualized by right and left sequencing cycles.
  • the base “G” is indicated by a dark or off state in the sequencing images. Accordingly, in an example, repeat patterns of the base “G” can lead to higher likelihood of erroneous base calls. Such erroneous base calls may also occur when the current sequencing cycle is for a non-G base (e.g., base “T”), but right and left flanked by Gs.
  • base sequences of homopolymers e.g., GGGGG
  • flanked-homopolymers e.g., GGTGG
  • the probability of error in base calling is relatively high.
  • GGTCG flanked-homopolymers
  • such specific base calling sequence patterns have multiple G’s, such as G’s at least at a beginning and at an end of the sequence, and possibly a third G between the two end-G’s in the 5-base sequence.
  • Other examples of such specific base calling sequences include GGXGG, GXGGG, GGGXG, GXXGG, and GGXXG, where X can be any of A, C, T, or G.
  • Fig. 19 illustrates a table depicting, for some specific base sequences, deviations between (i) an average of quality scores of the specific base sequences and (ii) an average of remapped quality scores of the specific base sequences, where the remapping is performed in accordance with a general LUT of, for example, Fig. 18B2.
  • the table of Fig. 19 is divided in two sections 1901a and 1901b due to limitations in space for depicting the undivided table.
  • the specific sequences depicted in the table are ACGGC, TCGAG, and so on, and finally GGGGG, GGTGG, and so on. Deviations for a read sequence 1 and read sequence 2 for various specific base sequences are illustrated.
  • Acinetobacter baumannii A. baumannii
  • human genome Bacillus cereus (B. cereus) bacteria
  • Rhodobacter Rhodobacter
  • a corresponding count of base sequences and a corresponding deviation is used.
  • average deviations for the various specific base sequences are listed in the last column of the section 1901b of the table of Fig. 19. The deviations presented in Fig. 19 represent an amount by which average quality scores change due to the remapping process, when a general purpose LUT (such as in Fig. 18B2) is used for the remapping.
  • the second column (i.e., specific base sequences)
  • the last column (i.e., average deviation)
  • the deviations for at least some of the specific base sequences are significant.
  • the average deviation for read 2 of GGGGG is 7.51
  • the average deviation for read 2 of GGTGG is 6, which are significant (e.g., compared to an average deviation of 3.37 for read 1 of ACGGC).
  • a remapping that works for general base sequences may not adequately work for at least some of the specific base sequences.
  • Fig. 20A illustrates a LUT 2000 that is usable to remap predicted quality scores of a specific base sequence (e.g., a homopolymer sequence of GGGGG) to remapped true quality scores.
  • the LUT 2000 is specifically for the homopolymer sequence of GGGGG, which can be derived by repeatedly testing with the homopolymer sequence of GGGGG, and generating true quality scores for the predicted base sequence. More specifically, the LUT 2000 is for remapping predicted quality score of a middle G of the sequence of GGGGG. For example, referring to an encircled entry of the LUT 2000, a predicted quality score of 27 can be remapped to a true quality score of 30 for a middle G of the specific sequence of GGGGG.
  • Fig. 20B illustrates remapping of predicted quality scores for a specific base sequence (e.g., a homopolymer sequence of GGGGG) using the LUT 2000 of Fig. 20A.
  • a base sequence of G, A, C, G, G, G, G, G, T is output by the base caller, along with corresponding predicted respective quality scores of Q25, Q23, Q25, Q27, Q37, Q27, Q27, Q32, and Q27 for individual bases of the predicted sequence, as illustrated in the first two rows of the table of Fig. 20B. That is, the first G in the sequence is associated with a predicted quality score of 25, the second A in the sequence is associated with a predicted quality score of 23, and so on. Note the presence of the specific homopolymer sequence of GGGGG in the base calls.
  • the predicted quality scores for all the bases, except for the middle G of the homopolymer sequence of GGGGG, are remapped using the LUT 1808a of Fig. 18B2 (or another similar “general purpose” LUT).
  • the LUT 1808a of Fig. 18B2 is referred to herein as a “general purpose” remapping LUT, as this LUT is used to remap general base sequences.
  • the LUT 2000 of Fig. 20A is a “base sequence specific” LUT that is dedicated specifically to a middle base of a specific base sequence of GGGGG.
  • the predicted quality score Q27 of the middle G of this sequence in Fig. 20B is replaced in accordance with the dotted encircled entry of the LUT 2000.
  • the 4th base of G, the 6th base of G, and the 9th base of T in the sequence of Fig. 20B each has a quality score of Q27.
  • the quality score of Q27 for the 4th base of G and the 9th base of T may be remapped similarly, e.g., using a general purpose LUT, whereas the 6th base of G (which is a middle one of the specific base sequence) will be remapped differently, e.g., using a base sequence specific LUT.
  • the 4th base of G and the 9th base of T may be remapped to a remapped quality score of Q32 in accordance with the general purpose LUT, whereas the 6th base of G (which is a middle one of the specific base sequence) may be remapped to Q30 in accordance with the base sequence specific LUT 2000 of Fig. 20A.
  • Figs. 20A and 20B are directed to the specific homopolymer sequence GGGGG. Similar specific LUTs can be generated for other specific homopolymer or flanked-homopolymer sequences, such as GGTGG, GGTCG, GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, or the like, where X can be any of A, C, T, or G.
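  • The selective use of a base sequence specific LUT can be sketched as follows, using the example read and scores of Fig. 20B. The single-entry dictionaries stand in for the general purpose LUT and for LUT 2000 and hold only the Q27 entries mentioned in the text; the function name `remap_read` and the pass-through behaviour for scores missing from a LUT are assumptions.

```python
def remap_read(bases, pred_q, general_lut, motif, motif_lut):
    """Remap scores with the general LUT, except at the middle base of each motif hit."""
    k = len(motif)
    middle_positions = {
        i + k // 2
        for i in range(len(bases) - k + 1)
        if bases[i:i + k] == motif
    }
    remapped = []
    for pos, q in enumerate(pred_q):
        lut = motif_lut if pos in middle_positions else general_lut
        remapped.append(lut.get(q, q))  # scores missing from the LUT pass through
    return remapped


bases = "GACGGGGGT"
pred_q = [25, 23, 25, 27, 37, 27, 27, 32, 27]
general_lut = {27: 32}   # illustrative entry only
ggggg_lut = {27: 30}     # encircled entry of the sequence specific LUT
remapped_q = remap_read(bases, pred_q, general_lut, "GGGGG", ggggg_lut)
```

  • Only the 6th base (the middle G of GGGGG) receives Q30; every other Q27 in the read is remapped to Q32 by the general purpose entry.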
  • Fig. 21 illustrates a base calling system 2100 that includes a loss penalization module 2106 to selectively penalize loss for one or more specific base sequences.
  • the base calling system 2100 of Fig. 21 is at least in part similar to the base calling system 1400 of Fig. 14A, and similar components in the two systems are labelled using the same labels.
  • the base calling system 2100 of Fig. 21 includes the sequencing machine 1404 comprising the flow cell 1405, where the flow cell 1405 generates sensor data 1412.
  • the base calling system 2100 of Fig. 21 includes the base caller 1416 and the quality score generation module 1428.
  • the base caller 1416 includes a forward pass section 2108, a backpropagation pass section 2112, a loss generation module 2104, and a loss penalization module 2106 of a neural network model.
  • the loss generation module 2104 receives an output of the forward pass section 2108 (e.g., predicted base calls) and a ground truth 2105 (e.g., ground truth base sequences), and generates a loss function 2109 based on a comparison of the output of the forward pass section 2108 and the ground truth 2105.
  • the loss penalization module 2106 penalizes the loss function 2109, to generate penalized loss function 2111.
  • the penalized loss function 2111 is used by the backpropagation section 2112 for generating input gradients and/or weight gradients, which are in turn used for adapting weights of the neural network model and thereby training the neural network model.
  • the loss penalization module 2106 selectively penalizes the loss function 2109, e.g., if a specific base sequence (e.g., a homopolymer or a flanked-homopolymer, such as GGXGG, where X is any of A, C, T, or G) is detected.
  • a goal of training deep neural networks is optimization of the weight parameters in each layer of the forward pass, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from data.
  • a single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass section sequentially computes the output in each layer and propagates the function signals forward through the network. In the final layer of the forward pass section, an objective loss function (e.g., generated by the loss generation module 2104) measures error between the inferred outputs and the given labels.
  • the loss penalization module 2106 penalizes the loss function 2109, to generate penalized loss function 2111.
  • the backpropagation pass uses the chain rule to backpropagate error signals (e.g., the penalized loss function 2111) and compute gradients with respect to all weights throughout the neural network.
  • the weight parameters are updated using optimization algorithms based on gradient descent. Whereas batch gradient descent performs parameter updates for each complete dataset, stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples.
  • a loss function generated by the loss generation module 2104 can be of any appropriate type, such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss.
  • Base callers including neural network models, which comprise a forward pass section, a backpropagation section, and a loss generation module have been discussed in further detail in U.S. Nonprovisional Patent Application No.
  • Figs. 22A-22E, in combination, illustrate penalization of a loss function (e.g., by the loss penalization module 2106), in response to a detection of a specific base sequence.
  • the specific base sequence discussed with respect to the example of Figs. 22A-22E is GGXGG, where “X” can be any of A, C, T, or G.
  • teachings of this disclosure are not limited to any specific “specific base sequence,” and can be applied to any homopolymers, flanked homopolymers, and/or any other specific base sequence discussed herein with respect to Figs. 19, 20A, and 20B.
  • Fig. 22A illustrates a section of a cross-entropy matrix 2204a, which is a loss matrix of the loss function 2109. Also illustrated is a penalization matrix 2208a.
  • the penalization matrix 2208a is to selectively penalize the loss function of the cross-entropy matrix 2204a.
  • the cross-entropy matrix 2204a and penalization matrix 2208a of Fig. 22A are for a sequencing cycle (t-2). Note that each of the cross-entropy matrix 2204a and penalization matrix 2208a has multiple elements arranged in an array form, and corresponds to pixels (or subpixels) of one or more images generated for the various clusters from the flow cell.
  • element-wise multiplication of the cross-entropy matrix 2204a and penalization matrix 2208a is performed. For example, an element at position (1,1) of the cross-entropy matrix 2204a is multiplied by an element at position (1,1) of the penalization matrix 2208a; an element at position (1,2) of the cross-entropy matrix 2204a is multiplied by an element at position (1,2) of the penalization matrix 2208a; and generally speaking, an element at position (i,j) of the cross-entropy matrix 2204a is multiplied by an element at position (i,j) of the penalization matrix 2208a.
  • such multiplication of the cross-entropy matrix 2204a and penalization matrix 2208a generates the penalized loss function 2111 for sequencing cycle (t-2).
  • each of the elements of the penalization matrix 2208a has a weight or penalty of w1, which may be, for example, 1.
  • when w1 = 1 for an element of the penalization matrix 2208a, that element does not impose a penalty (or imposes a penalty of 1) on the corresponding element of the cross-entropy matrix 2204a.
  • the penalization matrix 2208a does not impose a penalty in Fig. 22A.
  • In Fig. 22C, illustrated are a section of a cross-entropy matrix 2204c, which is a loss matrix of the loss function 2109, and a penalization matrix 2208c for a sequencing cycle (t). Also illustrated is a checkered entry in the cross-entropy matrix 2204c. Also assume, for the sequencing cycle (t) of Fig. 22C, the base corresponding to the checkered box has a ground truth of X, where X can be any of A, C, T, or G. Also assume, for the sequencing cycle (t+1) of Fig. 22D, the base corresponding to the checkered box has a ground truth of G; and assume, for the sequencing cycle (t+2) of Fig. 22E, the base corresponding to the checkered box has a ground truth of G.
  • the (3,4) position of the cross-entropy matrices 2204a, 2204b, 2204c, 2204d, and 2204e of Figs. 22A-22E, respectively, are associated with a specific base sequence of GGXGG. Accordingly, a middle base of this specific base sequence is penalized by the corresponding penalization matrix 2208c.
  • a penalty corresponding to the (3,4) position of the penalization matrix 2208c of Fig. 22C, which is to be multiplied by the loss associated with the middle X of the specific base sequence (i.e., multiplied by the (3,4) element of the cross-entropy matrix 2204c), is W2, where W2 is greater than w1 (i.e., W2 > w1).
  • W2 is at least twice the value of w1.
  • W2 is greater than 2, whereas w1 is 1.
  • W2 is 20 or higher.
  • Remaining elements of the penalization matrix 2208c are still w1.
  • the penalization matrix 2208c does not impose a penalty to any of the elements of the cross-entropy matrix 2204c in Fig. 22C, except for the (3,4) element of the cross-entropy matrix 2204c that is penalized by the weight W2.
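  • The element-wise penalization for a single sequencing cycle can be sketched as below. The 4×4 size, the loss values of 0.5, and the helper names `penalization_matrix` and `penalize` are illustrative assumptions; the (3,4) position, the regular weight of 1, and the amplified weight of 20 follow the example in the text. Positions are 1-indexed to match the (i,j) notation used above.

```python
def penalization_matrix(rows, cols, special=None, w1=1.0, w2=20.0):
    """All entries are w1, except the 1-indexed (row, col) positions in `special`."""
    special = set(special or [])
    return [
        [w2 if (r, c) in special else w1 for c in range(1, cols + 1)]
        for r in range(1, rows + 1)
    ]


def penalize(loss, penalty):
    """Element-wise product of a loss matrix and a penalization matrix."""
    return [
        [l * p for l, p in zip(loss_row, pen_row)]
        for loss_row, pen_row in zip(loss, penalty)
    ]


loss_t = [[0.5] * 4 for _ in range(4)]               # made-up cross-entropy values
pen_t = penalization_matrix(4, 4, special=[(3, 4)])  # amplified weight at (3,4) only
penalized_t = penalize(loss_t, pen_t)
```

  • Calling `penalization_matrix(4, 4)` with no special positions yields the all-w1 matrix used for the other cycles, which leaves the loss unchanged.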
  • In Fig. 22D, illustrated are a section of a cross-entropy matrix 2204d, which is a loss matrix of the loss function 2109, and a penalization matrix 2208d for a sequencing cycle (t+1). Also illustrated is a checkered entry in the cross-entropy matrix 2204d.
  • the base corresponding to the checkered box has a ground truth of G.
  • In Fig. 22E, illustrated are a section of a cross-entropy matrix 2204e, which is a loss matrix of the loss function 2109, and a penalization matrix 2208e for a sequencing cycle (t+2). Also illustrated is a checkered entry in the cross-entropy matrix 2204e.
  • the base corresponding to the checkered box has a ground truth of G.
  • the checkered boxes are associated with the base sequence GGXGG, which is a homopolymer or a flanked homopolymer (e.g., based on the value of X).
  • the loss for the middle X e.g., which is flanked by G’s on both sides
  • the loss for the middle X for this specific base sequence is penalized differently from penalization of loss for other bases of the sequences, as well as for other general base sequences.
  • the loss for the middle X for this specific base sequence is amplified, by a corresponding amplification of the corresponding penalty of W2 that is greater than 1 (i.e., W2 > 1).
  • when the loss penalization module 2106 detects a specific base sequence in the ground truth data, the loss penalization module 2106 applies a specialized amplified weight or penalty to one or more bases of such a specific base sequence.
  • the penalty W2 of the penalization matrix 2208c of Fig. 22C is different (e.g., amplified or higher) from various other penalties of the various penalization matrices 2208.
  • W2 of Fig. 22C is different (e.g., amplified or higher) than w1 of Figs. 22A, 22B, 22D, and/or 22E.
  • the loss penalization is performed during a training phase of a neural network based base caller.
  • the ground truth base sequence is known a priori, e.g., prior to the multiplication discussed with respect to Figs. 22A-22E.
  • the penalty W2 corresponding to the middle base of the specific base sequence can be made high in Fig. 22C (e.g., even before performing the operations at Figs. 22D and 22E and processing the last two bases of the specific base sequence), as discussed herein.
  • a memory stores the loss penalization matrices 2208a, 2208b, ... , 2208e.
  • the penalty W2 corresponding to the middle base of the specific base sequence is altered (e.g., made high), as discussed with respect to Fig. 22C.
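  • Because the ground truth is known before training, the per-cycle penalties for one cluster can be derived directly from its ground truth sequence. The sketch below illustrates this for the GGXGG pattern; the function names `is_flanked_by_g`, `cycle_weights`, and `penalized_losses` are hypothetical, and the default weights of 1 and 20 are the example values from the text.

```python
def is_flanked_by_g(seq, i):
    """True if seq[i:i+5] matches GGXGG, where X can be any base."""
    return seq[i] == seq[i + 1] == seq[i + 3] == seq[i + 4] == "G"


def cycle_weights(ground_truth, w1=1.0, w2=20.0):
    """One penalty per sequencing cycle: w2 at the middle of each GGXGG hit, else w1."""
    weights = [w1] * len(ground_truth)
    for i in range(len(ground_truth) - 4):
        if is_flanked_by_g(ground_truth, i):
            weights[i + 2] = w2
    return weights


def penalized_losses(losses, weights):
    """Multiply each per-cycle loss by its penalty."""
    return [loss * weight for loss, weight in zip(losses, weights)]


weights = cycle_weights("AGGTGGC")                 # made-up ground truth read
scaled = penalized_losses([0.5] * 7, weights)      # made-up per-cycle losses
```

  • For a longer homopolymer such as GGGGGG, overlapping windows match, so more than one cycle receives the amplified weight under this sketch.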
  • Fig. 22F illustrates application of a specialized weight to loss associated with a middle base of a specific base sequence.
  • the specific base sequence here is GGXGG, where “X” can be any of A, C,
  • a penalty of W2 is applied to the corresponding loss, where W2 is different from (e.g., higher than) the regular weights.
  • Fig. 22G illustrates two graphs 2280 and 2284 comparing the performance of a base calling system that does not penalize loss, versus a base calling system that penalizes loss for a specific base sequence.
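The weighting described for Figs. 22A-22F can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `penalty_weights` and `penalized_cross_entropy` are hypothetical, the regular weight of 1.0 is an assumption, and the amplified weight of 20 is borrowed from the Fig. 22G example. The ground truth is scanned for GGXGG windows, and the cross-entropy loss of each middle base X is multiplied by the amplified weight.

```python
import math

BASES = "ACGT"  # column order of the predicted probabilities (an assumption)


def penalty_weights(truth, w1=1.0, w2=20.0):
    """Return one weight per ground-truth base.

    The middle base of every GGXGG window gets the amplified weight w2;
    every other base gets the regular weight w1 (assumed to be 1.0 here).
    """
    weights = [w1] * len(truth)
    for mid in range(2, len(truth) - 2):
        flanks = truth[mid - 2] + truth[mid - 1] + truth[mid + 1] + truth[mid + 2]
        if flanks == "GGGG":  # X itself may be any base, including G
            weights[mid] = w2
    return weights


def penalized_cross_entropy(probs, truth, w1=1.0, w2=20.0):
    """Mean weighted cross-entropy over a read.

    probs[i] holds the predicted probabilities for cycle i in BASES order.
    Probabilities are assumed strictly positive (e.g., softmax outputs).
    """
    weights = penalty_weights(truth, w1, w2)
    total = 0.0
    for p, base, w in zip(probs, truth, weights):
        total += -w * math.log(p[BASES.index(base)])
    return total / len(truth)
```

Because the ground truth is available before training, the weights can be computed once per read; with `w2` equal to `w1` the function reduces to an ordinary mean cross-entropy.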
  • the specific base sequence used in these graphs is GGGGG.
  • the X axis in each of these plots is the predicted quality score 1432, and the Y axis in each of these plots is the true quality score 1440.
  • Graph 2280 is for a base calling system that does not specifically penalize loss for the specific base sequence GGGGG. As seen, the base calling of the specific sequence in the graph 2280 has an error of 6.4979%.
  • Graph 2284 is for a base calling system that assigns a penalty of 20 for the middle base of the specific base sequence GGGGG. As seen, the base calling of the specific sequence in the graph 2284 has an error of 1.9941%.
  • This disclosure discusses various approaches for calibration of quality scores, e.g., such that the calibrated quality scores are better aligned to the true quality scores.
  • the quality scores may or may not change the underlying base calls.
  • the base being called without calibration is A.
  • the base being called with calibration is still A. Thus, the calibration does not change the underlying base call.
  • Although the calibration may or may not change the underlying base call, providing an accurate quality score and an accurate underlying confidence level is important in many practical applications.
  • the quality scores are used to make critical health care decisions.
  • confidence scores associated with detecting bases of a human tissue sample may affect an approach to treat a health condition.
  • high quality scores (i.e., high confidence level) in multiple bases of the sample can indicate a high probability of cancer
  • low quality scores (i.e., low confidence level) in multiple bases of the sample can indicate a questionable probability of cancer; treatment decisions, thus, can change based on the quality score levels.
  • calibrating the quality scores and reporting calibrated quality scores helps in deciding various downstream tasks, which may possibly include healthcare decisions that are associated with levels of quality scores.
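One way to check how well predicted quality scores align with true quality scores is to group base calls by predicted score and compute an empirical score for each group. The sketch below assumes the standard Phred convention, Q = -10 * log10(error probability), and assumes each base call has already been labeled correct or incorrect against a ground truth. The function name `empirical_quality` is hypothetical and is not taken from this disclosure.

```python
import math
from collections import defaultdict


def empirical_quality(predicted_q, is_correct):
    """Map each predicted quality score to the empirical (observed) one.

    predicted_q: iterable of predicted Phred scores, one per base call.
    is_correct:  iterable of booleans, True if the base call matched the truth.
    Returns {predicted Q: empirical Q}, or None for a group with no errors,
    where the empirical score cannot be estimated from the sample.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, ok in zip(predicted_q, is_correct):
        totals[q] += 1
        if not ok:
            errors[q] += 1

    curve = {}
    for q in sorted(totals):
        if errors[q] == 0:
            curve[q] = None
        else:
            curve[q] = -10.0 * math.log10(errors[q] / totals[q])
    return curve
```

A well calibrated system yields empirical scores close to the predicted ones; a group whose empirical score is far below its predicted score is overconfident.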
  • Fig. 23 illustrates a base calling system 2300 that includes (i) the normalization module 1704 of the base calling system 1700 of Fig. 17A, (ii) the quality score remapping module 1804 and the quality score quantization module 1812 of the base calling system 1800 of Fig. 18A, and (iii) the loss penalization module 2106 of the base calling system 2100 of Fig. 21.
  • the base calling system 2300 can perform one or more of input normalization, quality score remapping and quantization, and/or loss penalization, as discussed throughout this disclosure.
  • Base calling architecture
  • Fig. 24 is a block diagram of a base calling system 2400 in accordance with one implementation.
  • the base calling system 2400 may operate to obtain any information or data that relates to at least one of a biological or chemical substance.
  • the base calling system 2400 is a workstation that may be similar to a bench-top device or desktop computer. For example, a majority (or all) of the systems and components for conducting the desired reactions can be within a common housing 2416.
  • the base calling system 2400 is a nucleic acid sequencing system (or sequencer) configured for various applications, including but not limited to de novo sequencing, resequencing of whole genomes or target genomic regions, and metagenomics. The sequencer may also be used for DNA or RNA analysis.
  • the base calling system 2400 may also be configured to generate reaction sites in a biosensor.
  • the base calling system 2400 may be configured to receive a sample and generate surface attached clusters of clonally amplified nucleic acids derived from the sample. Each cluster may constitute or be part of a reaction site in the biosensor.
  • the exemplary base calling system 2400 may include a system receptacle or interface 2412 that is configured to interact with a biosensor 2402 to perform desired reactions within the biosensor 2402.
  • the biosensor 2402 is loaded into the system receptacle 2412.
  • a cartridge that includes the biosensor 2402 may be inserted into the system receptacle 2412 and in some states the cartridge can be removed temporarily or permanently.
  • the cartridge may include, among other things, fluidic control and fluidic storage components.
  • the base calling system 2400 is configured to perform a large number of parallel reactions within the biosensor 2402.
  • the biosensor 2402 includes one or more reaction sites where desired reactions can occur.
  • the reaction sites may be, for example, immobilized to a solid surface of the biosensor or immobilized to beads (or other movable substrates) that are located within corresponding reaction chambers of the biosensor.
  • the reaction sites can include, for example, clusters of clonally amplified nucleic acids.
  • the biosensor 2402 may include a solid-state imaging device (e.g., CCD or CMOS imager) and a flow cell mounted thereto.
  • the flow cell may include one or more flow channels that receive a solution from the base calling system 2400 and direct the solution toward the reaction sites.
  • the biosensor 2402 can be configured to engage a thermal element for transferring thermal energy into or out of the flow channel.
  • the base calling system 2400 may include various components, assemblies, and systems (or sub-systems) that interact with each other to perform a predetermined method or assay protocol for biological or chemical analysis.
  • the base calling system 2400 includes a system controller 2404 that may communicate with the various components, assemblies, and sub-systems of the base calling system 2400 and also the biosensor 2402.
  • the base calling system 2400 may also include a fluidic control system 2406 to control the flow of fluid throughout a fluid network of the base calling system 2400 and the biosensor 2402; a fluidic storage system 2408 that is configured to hold all fluids (e.g., gas or liquids) that may be used by the bioassay system; a temperature control system 2410 that may regulate the temperature of the fluid in the fluid network, the fluidic storage system 2408, and/or the biosensor 2402; and an illumination system 2409 that is configured to illuminate the biosensor 2402.
  • the cartridge may also include fluidic control and fluidic storage components.
  • the base calling system 2400 may include a user interface 2414 that interacts with the user.
  • the user interface 2414 may include a display 2413 to display or request information from a user and a user input device 2415 to receive user inputs.
  • the display 2413 and the user input device 2415 are the same device.
  • the user interface 2414 may include a touch-sensitive display configured to detect the presence of an individual's touch and also identify a location of the touch on the display.
  • other user input devices 2415 may be used, such as a mouse, touchpad, keyboard, keypad, handheld scanner, voice-recognition system, motion-recognition system, and the like.
  • the base calling system 2400 may communicate with various components, including the biosensor 2402 (e.g., in the form of a cartridge), to perform the desired reactions.
  • the base calling system 2400 may also be configured to analyze data obtained from the biosensor to provide a user with desired information.
  • the system controller 2404 may include any processor-based or microprocessor-based system, including systems using microcontrollers, Reduced Instruction Set Computers (RISC), and the like.
  • the system controller 2404 executes a set of instructions that are stored in one or more storage elements, memories, or modules in order to at least one of obtain and analyze detection data.
  • Detection data can include a plurality of sequences of pixel signals, such that a sequence of pixel signals from each of the millions of sensors (or pixels) can be detected over many base calling cycles.
  • Storage elements may be in the form of information sources or physical memory elements within the base calling system 2400.
  • the set of instructions may include various commands that instruct the base calling system 2400 or biosensor 2402 to perform specific operations such as the methods and processes of the various implementations described herein.
  • the set of instructions may be in the form of a software program, which may form part of a tangible, non-transitory computer readable medium or media.
  • the terms “software” and “firmware” are interchangeable and include any computer program stored in memory for execution by a computer, including random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and non-volatile RAM (NVRAM).
  • the software may be in various forms such as system software or application software. Further, the software may be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module. The software also may include modular programming in the form of object-oriented programming. After obtaining the detection data, the detection data may be automatically processed by the base calling system 2400, processed in response to user inputs, or processed in response to a request made by another processing machine (e.g., a remote request through a communication link).
  • the system controller 2404 includes an analysis module 2538 (illustrated in Fig. 25). In other implementations, the system controller 2404 does not include the analysis module 2538 and instead has access to the analysis module 2538 (e.g., the analysis module 2538 may be separately hosted in the cloud).
  • the system controller 2404 may be connected to the biosensor 2402 and the other components of the base calling system 2400 via communication links.
  • the system controller 2404 may also be communicatively connected to off-site systems or servers.
  • the communication links may be hardwired, corded, or wireless.
  • the system controller 2404 may receive user inputs or commands from the user interface 2414 and the user input device 2415.
  • the fluidic control system 2406 includes a fluid network and is configured to direct and regulate the flow of one or more fluids through the fluid network.
  • the fluid network may be in fluid communication with the biosensor 2402 and the fluidic storage system 2408.
  • select fluids may be drawn from the fluidic storage system 2408 and directed to the biosensor 2402 in a controlled manner, or the fluids may be drawn from the biosensor 2402 and directed toward, for example, a waste reservoir in the fluidic storage system 2408.
  • the fluidic control system 2406 may include flow sensors that detect a flow rate or pressure of the fluids within the fluid network. The sensors may communicate with the system controller 2404.
  • the temperature control system 2410 is configured to regulate the temperature of fluids at different regions of the fluid network, the fluidic storage system 2408, and/or the biosensor 2402.
  • the temperature control system 2410 may include a thermocycler that interfaces with the biosensor 2402 and controls the temperature of the fluid that flows along the reaction sites in the biosensor 2402.
  • the temperature control system 2410 may also regulate the temperature of solid elements or components of the base calling system 2400 or the biosensor 2402.
  • the temperature control system 2410 may include sensors to detect the temperature of the fluid or other components. The sensors may communicate with the system controller 2404.
  • the fluidic storage system 2408 is in fluid communication with the biosensor 2402 and may store various reaction components or reactants that are used to conduct the desired reactions therein.
  • the fluidic storage system 2408 may also store fluids for washing or cleaning the fluid network and biosensor 2402 and for diluting the reactants.
  • the fluidic storage system 2408 may include various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous and non-polar solutions, and the like.
  • the fluidic storage system 2408 may also include waste reservoirs for receiving waste products from the biosensor 2402.
  • the cartridge may include one or more of a fluid storage system, fluidic control system or temperature control system.
  • a cartridge can have various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous and non-polar solutions, waste, and the like.
  • a fluid storage system, fluidic control system or temperature control system can be removably engaged with a bioassay system via a cartridge or other biosensor.
  • the illumination system 2409 may include a light source (e.g., one or more LEDs) and a plurality of optical components to illuminate the biosensor.
  • light sources may include lasers, arc lamps, LEDs, or laser diodes.
  • the optical components may be, for example, reflectors, dichroics, beam splitters, collimators, lenses, filters, wedges, prisms, mirrors, detectors, and the like.
  • the illumination system 2409 may be configured to direct an excitation light to reaction sites.
  • fluorophores may be excited by green wavelengths of light; as such, the wavelength of the excitation light may be approximately 532 nm.
  • the illumination system 2409 is configured to produce illumination that is parallel to a surface normal of a surface of the biosensor 2402. In another implementation, the illumination system 2409 is configured to produce illumination that is off-angle relative to the surface normal of the surface of the biosensor 2402. In yet another implementation, the illumination system 2409 is configured to produce illumination that has plural angles, including some parallel illumination and some off-angle illumination.
  • the system receptacle or interface 2412 is configured to engage the biosensor 2402 in at least one of a mechanical, electrical, and fluidic manner. The system receptacle 2412 may hold the biosensor 2402 in a desired orientation to facilitate the flow of fluid through the biosensor 2402.
  • the system receptacle 2412 may also include electrical contacts that are configured to engage the biosensor 2402 so that the base calling system 2400 may communicate with the biosensor 2402 and/or provide power to the biosensor 2402. Furthermore, the system receptacle 2412 may include fluidic ports (e.g., nozzles) that are configured to engage the biosensor 2402. In some implementations, the biosensor 2402 is removably coupled to the system receptacle 2412 in a mechanical manner, in an electrical manner, and also in a fluidic manner.
  • the base calling system 2400 may communicate remotely with other systems or networks or with other bioassay systems 2400. Detection data obtained by the bioassay system(s) 2400 may be stored in a remote database.
  • Fig. 25 is a block diagram of the system controller 2404 that can be used in the system of Fig. 24.
  • the system controller 2404 includes one or more processors or modules that can communicate with one another.
  • Each of the processors or modules may include an algorithm (e.g., instructions stored on a tangible and/or non-transitory computer readable storage medium) or sub algorithms to perform particular processes.
  • the system controller 2404 is illustrated conceptually as a collection of modules, but may be implemented utilizing any combination of dedicated hardware boards, DSPs, processors, etc. Alternatively, the system controller 2404 may be implemented utilizing an off-the-shelf PC with a single processor or multiple processors, with the functional operations distributed between the processors.
  • modules described below may be implemented utilizing a hybrid configuration in which certain modular functions are performed utilizing dedicated hardware, while the remaining modular functions are performed utilizing an off-the-shelf PC and the like.
  • the modules also may be implemented as software modules within a processing unit.
  • a communication port 2520 may transmit information (e.g., commands) to or receive information (e.g., data) from the biosensor 2402 (Fig. 24) and/or the sub-systems 2406, 2408, 2410 (Fig. 24).
  • the communication port 2520 may output a plurality of sequences of pixel signals.
  • a communication port 2520 may receive user input from the user interface 2414 (Fig. 24) and transmit data or information to the user interface 2414.
  • Data from the biosensor 2402 or sub-systems 2406, 2408, 2410 may be processed by the system controller 2404 in real-time during a bioassay session. Additionally, or alternatively, data may be stored temporarily in a system memory during a bioassay session and processed in slower than real-time or off-line operation.
  • the system controller 2404 may include a plurality of modules 2531-2539 that communicate with a main control module 2530.
  • the main control module 2530 may communicate with the user interface 2414 (Fig. 24).
  • Although the modules 2531-2539 are shown as communicating directly with the main control module 2530, the modules 2531-2539 may also communicate directly with each other, the user interface 2414, and the biosensor 2402. Also, the modules 2531-2539 may communicate with the main control module 2530 through the other modules.
  • the plurality of modules 2531-2539 include system modules 2531-2533, 2539 that communicate with the sub-systems 2406, 2408, 2410, and 2409, respectively.
  • the fluidic control module 2531 may communicate with the fluidic control system 2406 to control the valves and flow sensors of the fluid network for controlling the flow of one or more fluids through the fluid network.
  • the fluidic storage module 2532 may notify the user when fluids are low or when the waste reservoir is at or near capacity.
  • the fluidic storage module 2532 may also communicate with the temperature control module 2533 so that the fluids may be stored at a desired temperature.
  • the illumination module 2539 may communicate with the illumination system 2409 to illuminate the reaction sites at designated times during a protocol, such as after the desired reactions (e.g., binding events) have occurred. In some implementations, the illumination module 2539 may communicate with the illumination system 2409 to illuminate the reaction sites at designated angles.
  • the plurality of modules 2531-2539 may also include a device module 2534 that communicates with the biosensor 2402 and an identification module 2535 that determines identification information relating to the biosensor 2402.
  • the device module 2534 may, for example, communicate with the system receptacle 2412 to confirm that the biosensor has established an electrical and fluidic connection with the base calling system 2400.
  • the identification module 2535 may receive signals that identify the biosensor 2402.
  • the identification module 2535 may use the identity of the biosensor 2402 to provide other information to the user. For example, the identification module 2535 may determine and then display a lot number, a date of manufacture, or a protocol that is recommended to be run with the biosensor 2402.
  • the plurality of modules 2531-2539 also includes an analysis module 2538 (also called signal processing module or signal processor) that receives and analyzes the signal data (e.g., image data) from the biosensor 2402.
  • Analysis module 2538 includes memory (e.g., RAM or Flash) to store detection data.
  • Detection data can include a plurality of sequences of pixel signals, such that a sequence of pixel signals from each of the millions of sensors (or pixels) can be detected over many base calling cycles.
  • the signal data may be stored for subsequent analysis or may be transmitted to the user interface 2414 to display desired information to the user.
  • the signal data may be processed by the solid-state imager (e.g., CMOS image sensor) before the analysis module 2538 receives the signal data.
  • the analysis module 2538 is configured to obtain image data from the light detectors at each of a plurality of sequencing cycles.
  • the image data is derived from the emission signals detected by the light detectors. The analysis module 2538 processes the image data for each of the plurality of sequencing cycles through a neural network (e.g., a neural network-based template generator 2548, a neural network-based base caller 2558 (e.g., see Figs. 7, 9, and 10), and/or a neural network-based quality scorer 2568) and produces a base call for at least some of the analytes at each of the plurality of sequencing cycles.
  • Protocol modules 2536 and 2537 communicate with the main control module 2530 to control the operation of the sub-systems 2406, 2408, and 2410 when conducting predetermined assay protocols.
  • the protocol modules 2536 and 2537 may include sets of instructions for instructing the base calling system 2400 to perform specific operations pursuant to predetermined protocols.
  • the protocol module may be a sequencing-by-synthesis (SBS) module 2536 that is configured to issue various commands for performing sequencing-by-synthesis processes.
  • extension of a nucleic acid primer along a nucleic acid template is monitored to determine the sequence of nucleotides in the template.
  • the underlying chemical process can be polymerization (e.g., as catalyzed by a polymerase enzyme) or ligation (e.g., catalyzed by a ligase enzyme).
  • fluorescently labeled nucleotides are added to a primer (thereby extending the primer) in a template dependent fashion such that detection of the order and type of nucleotides added to the primer can be used to determine the sequence of the template.
  • commands can be given to deliver one or more labeled nucleotides, DNA polymerase, etc., into/through a flow cell that houses an array of nucleic acid templates.
  • the nucleic acid templates may be located at corresponding reaction sites. Those reaction sites where primer extension causes a labeled nucleotide to be incorporated can be detected through an imaging event. During an imaging event, the illumination system 2409 may provide an excitation light to the reaction sites.
  • the nucleotides can further include a reversible termination property that terminates further primer extension once a nucleotide has been added to a primer. For example, a nucleotide analog having a reversible terminator moiety can be added to a primer such that subsequent extension cannot occur until a deblocking agent is delivered to remove the moiety.
  • a command can be given to deliver a deblocking reagent to the flow cell (before or after detection occurs).
  • One or more commands can be given to effect wash(es) between the various delivery steps.
  • the cycle can then be repeated n times to extend the primer by n nucleotides, thereby detecting a sequence of length n.
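The repeated cycle described above can be sketched as a simple loop. This is a highly idealized illustration, assuming a single template, perfect single-base incorporation in every cycle, and no attention to strand orientation or signal noise; the function name `run_sbs_cycles` and the step labels are hypothetical. Each cycle delivers labeled, reversibly terminated nucleotides, images the incorporated base, removes the terminator, and washes.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def run_sbs_cycles(template, n_cycles):
    """Simulate n reversible-terminator cycles on one template.

    Returns the detected sequence and a log of (cycle number, step) tuples.
    The primer grows by exactly one nucleotide per cycle because the
    terminator blocks further extension until the deblocking step.
    """
    detected = []
    steps = []
    for index in range(min(n_cycles, len(template))):
        cycle = index + 1
        # Deliver labeled nucleotides and polymerase: one base is incorporated.
        incorporated = COMPLEMENT[template[index]]
        steps.append((cycle, "deliver"))
        # Imaging event: the label identifies the incorporated nucleotide.
        detected.append(incorporated)
        steps.append((cycle, "image"))
        # Deblocking reagent removes the terminator so extension can resume.
        steps.append((cycle, "deblock"))
        # Wash between delivery steps.
        steps.append((cycle, "wash"))
    return "".join(detected), steps
```

Running n cycles yields a detected sequence of length n, provided the template is at least n bases long.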
  • Exemplary sequencing techniques are described, for example, in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; US 7,057,026; WO
  • For the nucleotide delivery step of an SBS cycle, either a single type of nucleotide can be delivered at a time, or multiple different nucleotide types (e.g., A, C, T and G together) can be delivered.
  • For a nucleotide delivery configuration where only a single type of nucleotide is present at a time, the different nucleotides need not have distinct labels since they can be distinguished based on temporal separation inherent in the individualized delivery.
  • a sequencing method or apparatus can use single color detection. For example, an excitation source need only provide excitation at a single wavelength or in a single range of wavelengths.
  • sites that incorporate different nucleotide types can be distinguished based on different fluorescent labels that are attached to respective nucleotide types in the mixture.
  • four different nucleotides can be used, each having one of four different fluorophores.
  • the four different fluorophores can be distinguished using excitation in four different regions of the spectrum.
  • four different excitation radiation sources can be used.
  • fewer than four different excitation sources can be used, but optical filtration of the excitation radiation from a single source can be used to produce different ranges of excitation radiation at the flow cell.
  • fewer than four different colors can be detected in a mixture having four different nucleotides.
  • pairs of nucleotides can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g., via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
  • Exemplary apparatus and methods for distinguishing four different nucleotides using detection of fewer than four colors are described for example in US Pat. App. Ser. Nos. 61/538,294 and 61/619,878, which are incorporated herein by reference in their entireties.
  • U.S. Application No. 13/624,200, which was filed on September 21, 2012, is also incorporated by reference in its entirety.
  • the plurality of protocol modules may also include a sample-preparation (or generation) module 2537 that is configured to issue commands to the fluidic control system 2406 and the temperature control system 2410 for amplifying a product within the biosensor 2402.
  • the biosensor 2402 may be engaged to the base calling system 2400.
  • the amplification module 2537 may issue instructions to the fluidic control system 2406 to deliver necessary amplification components to reaction chambers within the biosensor 2402.
  • the reaction sites may already contain some components for amplification, such as the template DNA and/or primers.
  • the amplification module 2537 may instruct the temperature control system 2410 to cycle through different temperature stages according to known amplification protocols.
  • the amplification and/or nucleotide incorporation is performed isothermally.
  • the SBS module 2536 may issue commands to perform bridge PCR where clusters of clonal amplicons are formed on localized areas within a channel of a flow cell. After generating the amplicons through bridge PCR, the amplicons may be “linearized” to make single stranded template DNA, or sstDNA, and a sequencing primer may be hybridized to a universal sequence that flanks a region of interest. For example, a reversible terminator-based sequencing by synthesis method can be used as set forth above or as follows.
  • Each base calling or sequencing cycle can extend an sstDNA by a single base which can be accomplished for example by using a modified DNA polymerase and a mixture of four types of nucleotides.
  • the different types of nucleotides can have unique fluorescent labels, and each nucleotide can further have a reversible terminator that allows only a single-base incorporation to occur in each cycle. After a single base is added to the sstDNA, excitation light may be incident upon the reaction sites and fluorescent emissions may be detected. After detection, the fluorescent label and the terminator may be chemically cleaved from the sstDNA. Another similar base calling or sequencing cycle may follow.
  • the SBS module 2536 may instruct the fluidic control system 2406 to direct a flow of reagent and enzyme solutions through the biosensor 2402.
  • Exemplary reversible terminator-based SBS methods which can be utilized with the apparatus and methods set forth herein are described in US Patent Application Publication No. 2007/0166705 A1, US Patent Application Publication No. 2006/0188901 A1, US Patent No. 7,057,026, US Patent Application Publication No. 2006/0240439 A1, US Patent Application Publication No. 2006/02814714709 A1, PCT Publication No. WO 05/065814, and PCT Publication No. WO 06/064199, each of which is incorporated herein by reference in its entirety.
  • Exemplary reagents for reversible terminator-based SBS are described in US 7,541,444; US 7,057,026;
  • the amplification and SBS modules may operate in a single assay protocol where, for example, template nucleic acid is amplified and subsequently sequenced within the same cartridge.
  • the base calling system 2400 may also allow the user to reconfigure an assay protocol.
  • the base calling system 2400 may offer options to the user through the user interface 2414 for modifying the determined protocol. For example, if it is determined that the biosensor 2402 is to be used for amplification, the base calling system 2400 may request a temperature for the annealing cycle. Furthermore, the base calling system 2400 may issue warnings to a user if a user has provided user inputs that are generally not acceptable for the selected assay protocol.
  • the biosensor 2402 includes millions of sensors (or pixels), each of which generates a plurality of sequences of pixel signals over successive base calling cycles.
  • the analysis module 2538 detects the plurality of sequences of pixel signals and attributes them to corresponding sensors (or pixels) in accordance with the row-wise and/or column-wise location of the sensors on an array of sensors.
  • Each sensor in the array of sensors can produce sensor data for a tile of the flow cell, where a tile is an area on the flow cell at which clusters of genetic material are disposed during the base calling operation.
  • the sensor data can comprise image data in an array of pixels.
  • the sensor data can include more than one image, producing multiple features per pixel as the tile data.
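The attribution of pixel signals to sensors by row and column can be sketched as follows. The data layout `frames[cycle][row][column]` and the function name `pixel_sequences` are assumptions made for illustration; the disclosure does not prescribe a particular layout. The function collects, for each sensor location, the sequence of signals it produced over successive base calling cycles.

```python
def pixel_sequences(frames):
    """Group per-cycle sensor readings into one signal sequence per pixel.

    frames: list of 2-D grids (lists of rows), one grid per base calling
            cycle, all assumed to share the same dimensions.
    Returns {(row, column): [signal at cycle 0, signal at cycle 1, ...]}.
    """
    if not frames:
        return {}
    n_rows = len(frames[0])
    n_cols = len(frames[0][0])
    return {
        (row, col): [frame[row][col] for frame in frames]
        for row in range(n_rows)
        for col in range(n_cols)
    }
```

For example, two 2x2 frames yield four sequences of length two, each keyed by the row and column of the sensor that produced it.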
  • Fig. 26 is a simplified block diagram of a computer system 2600 that can be used to implement the technology disclosed.
  • Computer system 2600 includes at least one central processing unit (CPU) 2672 that communicates with a number of peripheral devices via bus subsystem 2655.
  • peripheral devices can include a storage subsystem 2610 including, for example, memory devices and a file storage subsystem 2636, user interface input devices 2638, user interface output devices 2676, and a network interface subsystem 2674.
  • the input and output devices allow user interaction with computer system 2600.
  • Network interface subsystem 2674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • User interface input devices 2638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2600.
  • User interface output devices 2676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 2600 to the user or to another machine or computer system.
  • Storage subsystem 2610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 2678.
  • the neural networks are implemented using deep learning processors 2678, which can be configurable and reconfigurable processors, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), coarse-grained reconfigurable architectures (CGRAs), graphics processing units (GPUs), and/or other configured devices.
  • Deep learning processors 2678 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
  • Examples of deep learning processors 2678 include Google’s Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX149 Rackmount Series™, NVIDIA DGX-1™, Microsoft’s Stratix V FPGA™, Graphcore’s Intelligent Processor Unit (IPU)™, Qualcomm’s Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s
  • Memory subsystem 2622 used in the storage subsystem 2610 can include a number of memories including a main random access memory (RAM) 2634 for storage of instructions and data during program execution and a read only memory (ROM) 2632 in which fixed instructions are stored.
  • a file storage subsystem 2636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 2636 in the storage subsystem 2610, or in other machines accessible by the processor.
  • Bus subsystem 2655 provides a mechanism for letting the various components and subsystems of computer system 2600 communicate with each other as intended. Although bus subsystem 2655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 2600 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2600 depicted in Fig. 26 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 2600 are possible having more or fewer components than the computer system depicted in Fig. 26.
  • a computer-implemented method of generating base calls by a base caller including: receiving a plurality of sensor data from a flow cell, wherein the plurality of sensor data is within a first range; identifying a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; mapping at least a subset of the plurality of sensor data, that are within the second range, to a third range, thereby generating a plurality of normalized sensor data; and processing the plurality of normalized sensor data in a base caller, to call, for the plurality of normalized sensor data, one or more corresponding bases.
  • identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
  • each of the lower threshold percentage and the upper threshold percentage is 1% or less.
  • mapping at least a subset of the plurality of sensor data comprises: mapping a first sensor data within the subset from a first value that is within the second range to a second value that is within the third range; and mapping a second sensor data within the subset from a third value that is within the second range to a fourth value that is within the third range.
  • identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of intensity values have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of intensity values have a value that is higher than the high value, wherein the threshold percentage is a sum of the lower threshold percentage and the higher threshold percentage, wherein the second range is defined by the low value and the high value.
  • non-transitory computer readable storage medium of clause 20 further comprising: identifying (i) a first outlier intensity value of the plurality of intensity values that is lower than the low value and (ii) a second outlier intensity value of the plurality of intensity values that is higher than the high value; and prior to the mapping, assigning the low value to the first outlier intensity value, and assigning the high value to the second outlier intensity value, such that the first outlier intensity value and the second outlier intensity value are within the second range subsequent to the assignment.
  • non-transitory computer readable storage medium of clause 20 further comprising: identifying (i) a first outlier intensity value of the plurality of intensity values that is lower than the low value and (ii) a second outlier intensity value of the plurality of intensity values that is higher than the high value; and excluding the first outlier intensity value and the second outlier intensity value from the subset of the plurality of intensity values during the mapping, for being outside the second range, such that the first outlier intensity value and the second outlier intensity value are not mapped to the third range.
  • mapping comprises: mapping a first intensity value from a first value that is within the second range to a second value that is within the third range; and mapping a second intensity value from a third value that is within the second range to a fourth value that is within the third range.
  • a system for base calling comprising: memory storing images that depict original intensity emissions of a set of analytes, the original intensity emissions generated by analytes in the set of analytes during sequencing cycles of a sequencing run; a normalization module configured to receive the original intensity emissions and remap the original intensity emissions to generate remapped intensity emissions, such that a remapped intensity emission has a different intensity value relative to the original intensity emission; and a base caller configured to process the remapped intensity emissions, to generate base calls for the set of analytes.
  • a computer-implemented method of calibrating quality scores generated by a base caller comprising: processing sensor data in a base caller, to generate a plurality of probability scores, wherein each of the plurality of probability scores identifies a corresponding likelihood of a base being a corresponding one of A, C, T, or G; transforming each probability score to a corresponding quality score, thereby generating a plurality of quality scores corresponding to the plurality of probability scores, wherein each of the plurality of quality scores indicates, in a logarithmic scale, a corresponding likelihood of a base being a corresponding one of A, C, T, or G; and remapping one or more of the plurality of quality scores, to generate a corresponding plurality of remapped quality scores.
  • a first quality score of the plurality of quality scores is remapped to a first remapped quality score of the plurality of remapped quality scores; the first quality score indicates a first likelihood of a corresponding first base being an X, where X is one of A, C, T, and G; the first remapped quality score indicates a first remapped likelihood of the corresponding first base being the X; and the first remapped likelihood is more aligned to an empirically determined likelihood of the corresponding first base being the X, compared to an alignment of the first likelihood to the empirically determined likelihood.
  • the first quality score indicates the first likelihood in the logarithmic scale, and the first remapped quality score indicates the first remapped likelihood in the logarithmic scale.
  • the remapping comprises: identifying, from a lookup table (LUT), that a first quality score of the plurality of quality scores is to remap to a first remapped quality score; and assigning the first remapped quality score to the first quality score, thereby remapping the first quality score to the first remapped quality score of the plurality of remapped quality scores.
  • the remapping comprises: using a lookup table (LUT) to remap one or more of the plurality of quality scores, to generate the corresponding plurality of remapped quality scores.
  • including each of the plurality of remapped quality scores in a corresponding one of the plurality of groups comprises: assigning, to each group of the plurality of groups, a corresponding range of remapped quality scores; including a first remapped quality score in the first group, in response to the first remapped quality score being within a first range assigned to the first group; and including a second remapped quality score in the second group, in response to the second remapped quality score being within a second range assigned to the second group.
  • processing the sensor data comprises: processing the sensor data in the base caller, to generate a sequence of base calls; and identifying (i) a first base call sequence in the sequence of base calls and (ii) a second base call sequence in the sequence of base calls, and further identifying that the second base call sequence has a specific base sequence pattern, wherein remapping the one or more of the plurality of quality scores comprises, in response to identifying that the second base call sequence has the specific base sequence pattern, using a first Look Up Table (LUT) to remap quality scores associated with (i) each base of the first base call sequence and (ii) a first subset of the bases of the second base call sequence, and using a second LUT to remap quality scores associated with a second subset of the bases of the second base call sequence.
  • each of a first base of the first base call sequence, a second base of the first subset of the bases of the second base call sequence, and a third base of the second subset of the bases of the second base call sequence has a quality score of Q1; each of the first base of the first base call sequence and the second base of the first subset of the bases of the second base call sequence is remapped, using the first LUT, to a remapped quality score of Q2; the third base of the second subset of the bases of the second base call sequence is remapped, using the second LUT, to a remapped quality score of Q3; and the remapped quality score of Q2, the remapped quality score of Q3, and the quality score of Q1 are different from each other.
  • the second subset of the bases of the second base call sequence includes a middle one of the bases of the second base call sequence; and the first subset of the bases of the second base call sequence includes all the bases of the second base call sequence, except for the middle one of the bases of the second base call sequence.
  • the first LUT is a general purpose LUT that is applicable to quality scores of all bases, except for a middle base of the second base call sequence; and the second LUT is a base sequence specific LUT specifically applicable to quality scores of the middle base of the second base call sequence.
  • the specific base sequence pattern comprises a homopolymer pattern or a flanked-homopolymer pattern.
  • the specific base sequence pattern comprises five bases, with at least a first and a last base being a G.
  • the specific base sequence pattern comprises at least five bases, with at least three bases of the specific base sequence pattern being a G.
  • the specific base sequence pattern comprises any of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, where X is any of A, C, T, or G.
  • the specific base sequence pattern comprises at least five bases, with at least three bases of the specific base sequence pattern associated with dark cycles within the sensor data.
  • a first output of the plurality of outputs provides a first likelihood of a corresponding first analyte being one of A, C, T, or G; the first output is remapped to generate a first remapped output that provides a second likelihood of the corresponding first analyte being one of A, C, T, or G; and the first likelihood is different from the second likelihood.
  • each of the first output and the first remapped output respectively expresses the first likelihood and the second likelihood in a logarithmic scale.
  • each of a first base of the first base call sequence, a second base of the first subset of the bases of the second base call sequence, and a third base of the second subset of the bases of the second base call sequence has a quality score of Q1; each of the first base of the first base call sequence and the second base of the first subset of the bases of the second base call sequence is remapped, using the first LUT, to a remapped quality score of Q2; the third base of the second subset of the bases of the second base call sequence is remapped, using the second LUT, to a remapped quality score of Q3; and the remapped quality score of Q2, the remapped quality score of Q3, and the quality score of Q1 are different from each other.
  • the second subset of the bases of the second base call sequence includes a middle one of the bases of the second base call sequence; and the first subset of the bases of the second base call sequence includes all the bases of the second base call sequence, except for the middle one of the bases of the second base call sequence.
  • the first LUT is a general purpose LUT that is applicable to quality scores of all bases, except for a middle base of the second base call sequence; and the second LUT is a base sequence specific LUT specifically applicable to quality scores of the middle base of the second base call sequence.
  • the specific base sequence pattern comprises five bases, with at least a first and a last base being a G.
  • the specific base sequence pattern comprises at least five bases, with at least three bases of the specific base sequence pattern being a G.
  • the specific base sequence pattern comprises any of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, where X is any of A, C, T, or G.
  • the specific base sequence pattern comprises at least five bases, with at least three bases of the specific base sequence pattern associated with dark cycles within the sensor data.
  • a computer-implemented method of training a neural network model used for base calling comprising: during a training phase of the neural network model of a base caller, processing sensor data in a forward pass section of the neural network model to predict base calls; based on the predicted base calls and ground truth base calls, generating a loss function; penalizing the loss function, based at least in part on the ground truth base calls indicating a specific base sequence, to generate a penalized loss function; and processing, in a backpropagation section of the neural network model, the penalized loss function, to adapt weights of the neural network model, thereby training the neural network model for base calling.
  • penalizing the loss function comprises: multiplying individual elements of the loss function with a corresponding penalty.
  • penalizing the loss function comprises: multiplying individual elements of a loss function matrix with corresponding individual elements of a penalty matrix.
  • processing the penalized loss function comprises: processing the penalized loss function, to generate input gradients, wherein the input gradients are used to adapt weights of the neural network model, thereby training the neural network model for base calling.
  • non-transitory computer readable storage medium of clause 17, further comprising: identifying, from the ground truth base calls, the specific base sequence having (i) a first base and (ii) one or more second bases flanking the first base, wherein penalizing the loss function comprises penalizing (i) a first element of the loss function, which is associated with the first base, with a first penalty, and (ii) each of one or more second elements of the loss function, which are respectively associated with the one or more second bases flanking the first base, with a second penalty that is different from the first penalty.
  • non-transitory computer readable storage medium of clause 18, further comprising: identifying, from the ground truth base calls, one or more third bases that are not included in the specific base sequence, wherein penalizing the loss function comprises penalizing each of one or more third elements of the loss function, which are respectively associated with the one or more third bases, with the second penalty.
  • penalizing the loss function comprises: multiplying individual elements of a loss function matrix with corresponding individual elements of a penalty matrix.
  • processing the penalized loss function comprises: processing the penalized loss function, to generate input gradients, wherein the input gradients are used to adapt weights of the neural network model, thereby training the neural network model for base calling.
  • a system for base calling comprising: memory storing sensor data; and a base caller comprising a neural network model configured to call bases, based on the sensor data, the neural network model comprising: a forward pass section configured to process the sensor data, to predict base calls, a loss generation module configured to compare the predicted base calls and ground truth base calls, to generate a loss function, a loss penalization module configured to selectively penalize the loss function, to generate a penalized loss function; and a backpropagation section to process the penalized loss function, to facilitate adaptation of weights of the neural network model, thereby training the neural network model for base calling.
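
The normalization method listed above (find a second range that contains at least a threshold percentage of the sensor data, then map the values inside it to a third range) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names, the default 1% thresholds, the target range of 0.0 to 1.0, and the linear mapping are all choices made for the example. The `clip_outliers` flag switches between the two outlier treatments described above (assigning the low/high value to outliers, or excluding them from the mapping).

```python
from typing import List, Sequence, Tuple


def percentile_bounds(values: Sequence[float],
                      lower_pct: float,
                      upper_pct: float) -> Tuple[float, float]:
    """Return (low, high) so that about lower_pct percent of the values lie
    below low and about upper_pct percent lie above high."""
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = int(n * lower_pct / 100.0)
    hi_idx = n - 1 - int(n * upper_pct / 100.0)
    return ordered[lo_idx], ordered[hi_idx]


def normalize(values: Sequence[float],
              lower_pct: float = 1.0,          # assumed threshold
              upper_pct: float = 1.0,          # assumed threshold
              third_range: Tuple[float, float] = (0.0, 1.0),  # assumed target
              clip_outliers: bool = True) -> List[float]:
    """Linearly map values in the second range [low, high] to third_range."""
    low, high = percentile_bounds(values, lower_pct, upper_pct)
    out_lo, out_hi = third_range
    span = (high - low) or 1.0  # guard against a degenerate range
    result = []
    for v in values:
        if v < low or v > high:
            if not clip_outliers:
                continue                  # exclude the outlier from the mapping
            v = min(max(v, low), high)    # assign the low/high value instead
        result.append(out_lo + (v - low) * (out_hi - out_lo) / span)
    return result


if __name__ == "__main__":
    intensities = [float(i) for i in range(200)]
    print(percentile_bounds(intensities, 1.0, 1.0))
    print(len(normalize(intensities)), len(normalize(intensities, clip_outliers=False)))
```

With clipping enabled the output has the same length as the input; with exclusion it is shorter by the number of outliers, so any downstream code has to tolerate missing values.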
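
The quality score calibration method listed above (probability score, then logarithmic quality score, then lookup-table remapping with a separate table for the middle base of a specific sequence pattern) can be illustrated with a short sketch. The Phred-style conversion Q = -10 log10(1 - p) is the standard logarithmic scale; everything else here is an assumption made for illustration: the function names, the cap of Q60, the choice of GGXGG as the pattern, and in particular the table contents, which are invented placeholder values rather than calibrated ones.

```python
import math
import re
from typing import Dict, List

# Hypothetical lookup tables; real tables would come from empirical calibration.
GENERAL_LUT: Dict[int, int] = {20: 18, 30: 27, 40: 38}
PATTERN_LUT: Dict[int, int] = {20: 12, 30: 20, 40: 30}

# Lookahead so that overlapping GGXGG occurrences are all found.
PATTERN = re.compile(r"(?=(GG[ACGT]GG))")


def probability_to_quality(p_correct: float, max_q: int = 60) -> int:
    """Convert the probability that a call is correct to a Phred-style score."""
    error = max(1.0 - p_correct, 10 ** (-max_q / 10))  # cap the score at max_q
    return int(round(-10.0 * math.log10(error)))


def remap_quality_scores(bases: str, scores: List[int]) -> List[int]:
    """Remap each score with the general LUT, except the middle base of each
    GGXGG occurrence, which uses the pattern-specific LUT."""
    middle_positions = {m.start() + 2 for m in PATTERN.finditer(bases)}
    remapped = []
    for i, q in enumerate(scores):
        lut = PATTERN_LUT if i in middle_positions else GENERAL_LUT
        remapped.append(lut.get(q, q))  # scores absent from the LUT pass through
    return remapped


if __name__ == "__main__":
    calls = "ACGGTGGAC"
    qualities = [probability_to_quality(0.999)] * len(calls)
    print(qualities)
    print(remap_quality_scores(calls, qualities))
```

In this sketch every base starts with the same score, the general table moves most of them to one remapped value, and the middle base of the pattern receives a different remapped value, mirroring the Q1, Q2, Q3 relationship described above.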
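
The loss penalization described above (multiplying individual elements of the loss with corresponding penalties, with a first penalty for a particular base of a specific ground-truth sequence and a second penalty for the flanking bases and all other bases) can be sketched in plain Python. The names, the per-base cross-entropy loss, the pattern `GGTGG`, and the penalty values 2.0 and 1.0 are assumptions chosen for the example; the source does not specify them. In a real training loop the penalized values would be reduced to a scalar and backpropagated by the framework in use.

```python
import math
from typing import List, Sequence

BASES = "ACGT"


def cross_entropy_per_base(probs: Sequence[Sequence[float]], truth: str) -> List[float]:
    """Per-position loss: negative log probability assigned to the true base."""
    return [-math.log(max(p[BASES.index(b)], 1e-12)) for p, b in zip(probs, truth)]


def penalty_vector(truth: str,
                   pattern: str = "GGTGG",       # assumed specific sequence
                   first_penalty: float = 2.0,   # assumed, for the middle base
                   second_penalty: float = 1.0   # assumed, for all other bases
                   ) -> List[float]:
    """Build one penalty per position from the ground-truth base calls."""
    weights = [second_penalty] * len(truth)
    start = truth.find(pattern)
    while start != -1:
        weights[start + len(pattern) // 2] = first_penalty
        start = truth.find(pattern, start + 1)
    return weights


def penalized_loss(probs: Sequence[Sequence[float]], truth: str) -> List[float]:
    """Element-wise product of the loss vector and the penalty vector."""
    losses = cross_entropy_per_base(probs, truth)
    weights = penalty_vector(truth)
    return [loss * w for loss, w in zip(losses, weights)]


if __name__ == "__main__":
    truth = "AGGTGGC"
    uniform = [[0.25, 0.25, 0.25, 0.25] for _ in truth]
    print(penalty_vector(truth))
    print(penalized_loss(uniform, truth))
```

Because the penalty scales the loss element for the middle base, the gradient contribution of that position is scaled by the same factor, which is how the weighting steers training toward the difficult sequence context.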

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Signal Processing (AREA)
  • Biochemistry (AREA)
  • Genetics & Genomics (AREA)
PCT/US2022/038729 2021-07-28 2022-07-28 Quality score calibration of basecalling systems WO2023009758A1 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
AU2022319125A AU2022319125A1 (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems
CN202280043793.8A CN117529780A (zh) 2021-07-28 2022-07-28 碱基检出系统的质量分数校准
JP2023579782A JP2024532049A (ja) 2021-07-28 2022-07-28 ベースコールシステムの品質スコア較正
EP22761681.0A EP4377960A1 (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems
KR1020237043770A KR20240037882A (ko) 2021-07-28 2022-07-28 염기 호출 시스템의 품질 점수 보정
IL309786A IL309786A (en) 2021-07-28 2022-07-28 Quality score calibration of BASECALLING systems
CA3223746A CA3223746A1 (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163226707P 2021-07-28 2021-07-28
US63/226,707 2021-07-28
US17/839,387 2022-06-13
US17/839,387 US20230029970A1 (en) 2021-07-28 2022-06-13 Quality score calibration of basecalling systems

Publications (1)

Publication Number Publication Date
WO2023009758A1 true WO2023009758A1 (en) 2023-02-02

Family

ID=83149575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/038729 WO2023009758A1 (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Country Status (7)

Country Link
EP (1) EP4377960A1 (ko)
JP (1) JP2024532049A (ko)
KR (1) KR20240037882A (ko)
AU (1) AU2022319125A1 (ko)
CA (1) CA3223746A1 (ko)
IL (1) IL309786A (ko)
WO (1) WO2023009758A1 (ko)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118053503A (zh) * 2024-01-11 2024-05-17 中国农业科学院农业基因组研究所 一种入侵生物多组学数据库构建方法及系统

Citations (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991006678A1 (en) 1989-10-26 1991-05-16 Sri International Dna sequencing
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
WO1998044151A1 (en) 1997-04-01 1998-10-08 Glaxo Group Limited Method of nucleic acid amplification
US6090592A (en) 1994-08-03 2000-07-18 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid on supports
US20020055100A1 (en) 1997-04-01 2002-05-09 Kawashima Eric H. Method of nucleic acid sequencing
US20020150909A1 (en) 1999-02-09 2002-10-17 Stuelpnagel John R. Automated information processing in randomly ordered arrays
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
WO2004018493A1 (en) 2002-08-23 2004-03-04 Solexa Limited Labelled nucleotides
WO2004018497A2 (en) 2002-08-23 2004-03-04 Solexa Limited Modified nucleotides for polynucleotide sequencing
US20040096853A1 (en) 2000-12-08 2004-05-20 Pascal Mayer Isothermal amplification of nucleic acids on a solid support
WO2005024010A1 (en) 2003-09-11 2005-03-17 Solexa Limited Modified polymerases for improved incorporation of nucleotide analogues
WO2005065814A1 (en) 2004-01-07 2005-07-21 Solexa Limited Modified molecular arrays
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
WO2006064199A1 (en) 2004-12-13 2006-06-22 Solexa Limited Improved method of nucleotide detection
US7115400B1 (en) 1998-09-30 2006-10-03 Solexa Ltd. Methods of nucleic acid amplification and sequencing
WO2006120433A1 (en) 2005-05-10 2006-11-16 Solexa Limited Improved polymerases
US20060281471A1 (en) 2005-06-08 2006-12-14 Cisco Technology,Inc. Method and system for communicating using position information
WO2007010251A2 (en) 2005-07-20 2007-01-25 Solexa Limited Preparation of templates for nucleic acid sequencing
US7211414B2 (en) 2000-12-01 2007-05-01 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
US20070128624A1 (en) 2005-11-01 2007-06-07 Gormley Niall A Method of preparing libraries of template polynucleotides
WO2007123744A2 (en) 2006-03-31 2007-11-01 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US7315019B2 (en) 2004-09-17 2008-01-01 Pacific Biosciences Of California, Inc. Arrays of optical confinements and uses thereof
US20080009420A1 (en) 2006-03-17 2008-01-10 Schroth Gary P Isothermal methods for creating clonal single molecule arrays
US7329492B2 (en) 2000-07-07 2008-02-12 Visigen Biotechnologies, Inc. Methods for real-time single molecule sequence determination
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
US20080242560A1 (en) 2006-11-21 2008-10-02 Gunderson Kevin L Methods for generating amplified nucleic acid arrays
US7592435B2 (en) 2005-08-19 2009-09-22 Illumina Cambridge Limited Modified nucleosides and nucleotides and uses thereof
US7595882B1 (en) 2008-04-14 2009-09-29 General Electric Company Hollow-core waveguide-based raman systems and methods
US20130079232A1 (en) 2011-09-23 2013-03-28 Illumina, Inc. Methods and compositions for nucleic acid sequencing
US20160110499A1 (en) * 2014-10-21 2016-04-21 Life Technologies Corporation Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data
US20190237160A1 (en) * 2018-01-26 2019-08-01 Quantum-Si Incorporated Machine learning enabled pulse and base calling for sequencing devices
US20200327377A1 (en) * 2019-03-21 2020-10-15 Illumina, Inc. Artificial Intelligence-Based Quality Scoring

Patent Citations (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991006678A1 (en) 1989-10-26 1991-05-16 Sri International Dna sequencing
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
US6090592A (en) 1994-08-03 2000-07-18 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid on supports
WO1998044151A1 (en) 1997-04-01 1998-10-08 Glaxo Group Limited Method of nucleic acid amplification
US20020055100A1 (en) 1997-04-01 2002-05-09 Kawashima Eric H. Method of nucleic acid sequencing
US7115400B1 (en) 1998-09-30 2006-10-03 Solexa Ltd. Methods of nucleic acid amplification and sequencing
US20020150909A1 (en) 1999-02-09 2002-10-17 Stuelpnagel John R. Automated information processing in randomly ordered arrays
US7329492B2 (en) 2000-07-07 2008-02-12 Visigen Biotechnologies, Inc. Methods for real-time single molecule sequence determination
US7211414B2 (en) 2000-12-01 2007-05-01 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
US20040096853A1 (en) 2000-12-08 2004-05-20 Pascal Mayer Isothermal amplification of nucleic acids on a solid support
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US20060188901A1 (en) 2001-12-04 2006-08-24 Solexa Limited Labelled nucleotides
US7427673B2 (en) 2001-12-04 2008-09-23 Illumina Cambridge Limited Labelled nucleotides
US7566537B2 (en) 2001-12-04 2009-07-28 Illumina Cambridge Limited Labelled nucleotides
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
WO2004018497A2 (en) 2002-08-23 2004-03-04 Solexa Limited Modified nucleotides for polynucleotide sequencing
WO2004018493A1 (en) 2002-08-23 2004-03-04 Solexa Limited Labelled nucleotides
US20070166705A1 (en) 2002-08-23 2007-07-19 John Milton Modified nucleotides
US7541444B2 (en) 2002-08-23 2009-06-02 Illumina Cambridge Limited Modified nucleotides
US20060240439A1 (en) 2003-09-11 2006-10-26 Smith Geoffrey P Modified polymerases for improved incorporation of nucleotide analogues
WO2005024010A1 (en) 2003-09-11 2005-03-17 Solexa Limited Modified polymerases for improved incorporation of nucleotide analogues
US20110059865A1 (en) 2004-01-07 2011-03-10 Mark Edward Brennan Smith Modified Molecular Arrays
WO2005065814A1 (en) 2004-01-07 2005-07-21 Solexa Limited Modified molecular arrays
US7315019B2 (en) 2004-09-17 2008-01-01 Pacific Biosciences Of California, Inc. Arrays of optical confinements and uses thereof
WO2006064199A1 (en) 2004-12-13 2006-06-22 Solexa Limited Improved method of nucleotide detection
WO2006120433A1 (en) 2005-05-10 2006-11-16 Solexa Limited Improved polymerases
US20060281471A1 (en) 2005-06-08 2006-12-14 Cisco Technology,Inc. Method and system for communicating using position information
WO2007010251A2 (en) 2005-07-20 2007-01-25 Solexa Limited Preparation of templates for nucleic acid sequencing
US7592435B2 (en) 2005-08-19 2009-09-22 Illumina Cambridge Limited Modified nucleosides and nucleotides and uses thereof
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
US20070128624A1 (en) 2005-11-01 2007-06-07 Gormley Niall A Method of preparing libraries of template polynucleotides
US20080009420A1 (en) 2006-03-17 2008-01-10 Schroth Gary P Isothermal methods for creating clonal single molecule arrays
WO2007123744A2 (en) 2006-03-31 2007-11-01 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US20080242560A1 (en) 2006-11-21 2008-10-02 Gunderson Kevin L Methods for generating amplified nucleic acid arrays
US7595882B1 (en) 2008-04-14 2009-09-29 General Electric Company Hollow-core waveguide-based raman systems and methods
US20130079232A1 (en) 2011-09-23 2013-03-28 Illumina, Inc. Methods and compositions for nucleic acid sequencing
US20160110499A1 (en) * 2014-10-21 2016-04-21 Life Technologies Corporation Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data
US20190237160A1 (en) * 2018-01-26 2019-08-01 Quantum-Si Incorporated Machine learning enabled pulse and base calling for sequencing devices
US20200327377A1 (en) * 2019-03-21 2020-10-15 Illumina, Inc. Artificial Intelligence-Based Quality Scoring

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BENTLEY ET AL., NATURE, vol. 456, 2008, pages 53 - 59

Also Published As

Publication number Publication date
KR20240037882A (ko) 2024-03-22
IL309786A (en) 2024-02-01
AU2022319125A1 (en) 2024-01-18
CA3223746A1 (en) 2023-02-02
EP4377960A1 (en) 2024-06-05
JP2024532049A (ja) 2024-09-05

Similar Documents

Publication Publication Date Title
EP4107737B1 (en) Knowledge distillation and gradient pruning-based compression of artificial intelligence-based base caller
US20220300811A1 (en) Neural network parameter quantization for base calling
CN115136244A (zh) Artificial intelligence-based many-to-many base calling
WO2023009758A1 (en) Quality score calibration of basecalling systems
US20230041989A1 (en) Base calling using multiple base caller models
US20230029970A1 (en) Quality score calibration of basecalling systems
US20230026084A1 (en) Self-learned base caller, trained using organism sequences
US20220415445A1 (en) Self-learned base caller, trained using oligo sequences
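CN117529780A (zh) Quality score calibration of basecalling systems
KR20230157230A (ko) Tile location and/or cycle based weight set selection for base calling
KR20240027608A (ko) Self-learned base caller, trained using organism sequences
JP2024529843A (ja) Base calling using multiple base caller models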
WO2022197752A1 (en) Tile location and/or cycle based weight set selection for base calling
CN117546248A (zh) Base calling using multiple base caller models
CN117546249A (zh) Self-learned base caller, trained using oligo sequences

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22761681; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 202280043793.8; Country of ref document: CN. Ref document number: 2022319125; Country of ref document: AU. Ref document number: 3223746; Country of ref document: CA. Ref document number: AU2022319125; Country of ref document: AU
ENP Entry into the national phase. Ref document number: 2023579782; Country of ref document: JP; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 309786; Country of ref document: IL
ENP Entry into the national phase. Ref document number: 2022319125; Country of ref document: AU; Date of ref document: 20220728; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 2022761681; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2022761681; Country of ref document: EP; Effective date: 20240228