EP3970151A1 - Base calling using convolutions - Google Patents

Base calling using convolutions

Info

Publication number
EP3970151A1
Authority
EP
European Patent Office
Prior art keywords
per
convolutions
cycle
features
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20730877.6A
Other languages
German (de)
English (en)
Inventor
Emrah Kostem
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Illumina Inc
Original Assignee
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/874,599 (US11423306B2)
Priority claimed from US16/874,633 (US11593649B2)
Application filed by Illumina Inc filed Critical Illumina Inc
Publication of EP3970151A1

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/62Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light
    • G01N21/63Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light optically excited
    • G01N21/64Fluorescence; Phosphorescence
    • G01N21/6428Measuring fluorescence of fluorescent products of reactions or of fluorochrome labelled reactive substances, e.g. measuring quenching effects, using measuring "optrodes"
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/62Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light
    • G01N21/63Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light optically excited
    • G01N21/64Fluorescence; Phosphorescence
    • G01N21/645Specially adapted constructive features of fluorimeters
    • G01N21/6452Individual samples arranged in a regular 2D-array, e.g. multiwell plates
    • G01N21/6454Individual samples arranged in a regular 2D-array, e.g. multiwell plates using an integrated detector array
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/048Fuzzy inferencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • PCT Patent Application No. PCT/US2016/047253 titled “IN-LINE PRESSURE ACCUMULATOR AND FLOW-CONTROL SYSTEM FOR BIOLOGICAL OR CHEMICAL ASSAYS,” filed on August 17, 2016, subsequently published as PCT Publication No. WO 2017/034868 A1, published on March 2, 2017;
  • FIG. 1 illustrates a cross-section of a biosensor in accordance with one implementation and also illustrates a top view of a detection device of the biosensor.
  • FIG. 2 illustrates, in one example, a cross-section of a portion of the detection device of FIG. 1 illustrating a portion of a reaction structure and a light guide thereof and also illustrates, in one example, an enlarged portion of the cross-section.
  • FIG. 3 depicts one implementation of base calling using convolutions.
  • FIG. 4 depicts three-dimensional (3D) convolutions used in the convolution-based base calling in accordance with one implementation that mixes information between the imaged channels.
  • FIG. 5 shows output features produced by the 3D convolutions in accordance with one implementation.
  • FIG. 6 shows intensity data features generated for a center pixel and used as supplemental input in the convolution-based base calling in accordance with one implementation.
  • FIG. 7 illustrates the output features of FIG. 5 supplemented with the intensity data features of FIG. 6 in accordance with one implementation.
  • FIG. 8 illustrates one-dimensional (1D) convolutions used in the convolution-based base calling in accordance with one implementation.
  • FIG. 9 depicts further output features produced by the 1D convolutions in accordance with one implementation.
  • FIG. 10 depicts pointwise convolutions used in the convolution-based base calling in accordance with one implementation.
  • FIG. 11 shows an output layer that processes the final output features produced by the pointwise convolutions and emits base calls for a center pixel in accordance with one implementation.
  • FIG. 12 shows intensity data features generated for a pixel patch and used as supplemental input in the convolution-based base calling in accordance with one implementation.
  • FIG. 13 illustrates the output features of FIG. 5 supplemented with the intensity data features of FIG. 12 in accordance with one implementation.
  • FIG. 14 illustrates the output layer processing the final output features produced by the pointwise convolutions and emitting base calls for pixels in the pixel patch in accordance with one implementation.
  • FIG. 15 depicts one implementation of the convolution-based base calling using segregated convolutions that do not mix information between the imaged channels.
  • FIG. 16 depicts one implementation of the convolution-based base calling using segregated 3D convolutions that do not mix information between the imaged channels and 1D convolutions that mix information between the imaged channels.
  • FIG. 17 shows probability distribution of polymerase population movement in accordance with one implementation.
  • FIG. 18 shows phasing and prephasing data that specifies the probability distribution of polymerase population movement of FIG. 17 and is used as input for the compact convolution- based base calling in accordance with one implementation.
  • FIG. 19 illustrates base context data for three cycles that is used as input for the compact convolution-based base calling in accordance with one implementation.
  • FIG. 20 illustrates base context data for five cycles that is used as input for the compact convolution-based base calling in accordance with one implementation.
  • FIG. 21 depicts one example of the compact convolution-based base calling using image data for three cycles.
  • FIG. 22 depicts another example of the compact convolution-based base calling using image data for five cycles.
  • FIG. 23 shows one implementation of the convolutions used to mix the image data, the phasing and prephasing data, and the base context data for the compact convolution-based base calling in a timestep/convolution window/sequencing cycle.
  • FIG. 24 shows one implementation of pull-push and push-pull convolutions in which a combination of the 1D convolutions and transposed convolutions is used for the compact convolution-based base calling.
  • FIG. 25 depicts one implementation of performing the compact convolution-based base calling during inference on a central processing unit (CPU) by using image data from only a subset of the sequencing cycles.
  • FIG. 26 is a block diagram that shows various system modules and data stores used for the convolution-based base calling and the compact convolution-based base calling in accordance with one implementation.
  • FIG. 27 illustrates one implementation of a 3D convolution used in the convolution-based base calling.
  • FIG. 28 illustrates one implementation of a 1D convolution used in the convolution-based base calling.
  • FIG. 29 illustrates one implementation of a pointwise convolution used in the convolution-based base calling.
  • FIG. 30 illustrates one example of the phasing and prephasing effect.
  • FIG. 31 illustrates one example of spatial crosstalk.
  • FIG. 32 illustrates one example of emission overlap.
  • FIG. 33 illustrates one example of fading.
  • FIG. 34 shows one example of quality score mapping produced by a quality score mapper.
  • FIG. 35 depicts one example of transposed convolution.
  • FIG. 36 is a computer system that can be used to implement the convolution-based base calling and the compact convolution-based base calling disclosed herein.
  • the neural network-based base caller detects and accounts for stationary, kinetic, and mechanistic properties of the sequencing process, mapping what is observed at each sequence cycle in the assay data to the underlying sequence of nucleotides.
  • the neural network-based base caller combines the tasks of feature engineering, dimension reduction, discretization, and kinetic modelling into a single end-to-end learning framework.
  • the neural network-based base caller uses a combination of 3D convolutions, 1D convolutions, and pointwise convolutions to detect and account for assay biases such as phasing and prephasing effect, spatial crosstalk, emission overlap, and fading.
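A rough sense of how these convolution types differ can be given in a few lines of NumPy. This is an illustrative sketch only, not the patented implementation; the helper names `conv1d` and `pointwise` and all shapes are hypothetical.

```python
import numpy as np

def conv1d(x, w):
    """1D convolution over cycles. x: (cycles, in_ch); w: (k, in_ch, out_ch)."""
    k = w.shape[0]
    steps = x.shape[0] - k + 1
    # Each output step mixes k consecutive cycles and all input channels.
    return np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(steps)])

def pointwise(x, w):
    """Pointwise (1x1) convolution: mixes channels at each cycle independently."""
    return x @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 4))                  # 9 sequencing cycles, 4 image channels
h = conv1d(x, rng.normal(size=(3, 4, 8)))    # kernel width 3 -> shape (7, 8)
y = pointwise(h, rng.normal(size=(8, 4)))    # mix 8 features down to 4 -> (7, 4)
```

A 3D convolution extends the same idea with additional sliding axes, e.g., over image height, width, and cycles at once.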
  • Deep neural networks are a type of artificial neural networks that use multiple nonlinear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation which carries the difference between observed and predicted output to adjust parameters. Deep neural networks have evolved with the availability of large training datasets, the power of parallel and distributed computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains such as computer vision, speech recognition, and natural language processing.
  • Convolutional neural networks and recurrent neural networks (RNNs) are components of deep neural networks.
  • Convolutional neural networks have succeeded particularly in image recognition with an architecture that comprises convolution layers, nonlinear layers, and pooling layers.
  • Recurrent neural networks are designed to utilize sequential information of input data with cyclic connections among building blocks like perceptrons, long short-term memory units, and gated recurrent units.
  • many other emergent deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multi dimensional recurrent neural networks, and convolutional auto-encoders.
  • the goal of training deep neural networks is optimization of the weight parameters in each layer, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from data.
  • a single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass sequentially computes the output in each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures error between the inferenced outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network. Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent.
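That single optimization cycle can be sketched for the simplest possible case, a linear layer trained with a squared-error loss; the data and sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))                    # training dataset
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                                  # the given labels
W = np.zeros(5)

lr = 0.1
for _ in range(500):
    pred = X @ W                 # forward pass: propagate signals forward
    err = pred - y               # objective loss function measures the error
    grad = X.T @ err / len(X)    # backward pass: chain rule gives the gradient
    W -= lr * grad               # update weights by gradient descent
```

With a deeper network the backward pass repeats the chain-rule step layer by layer, but the forward/loss/backward/update structure is the same.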
  • stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples.
  • optimization algorithms stem from stochastic gradient descent.
  • the Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.
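The standard textbook forms of these two update rules (not taken from the patent) make the difference concrete: Adagrad scales each parameter's step by accumulated squared gradients, while Adam tracks exponential moving averages of the gradient's first and second moments.

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    cache = cache + g * g                    # accumulated squared gradients
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                # first moment of the gradients
    v = b2 * v + (1 - b2) * g * g            # second moment of the gradients
    m_hat = m / (1 - b1 ** t)                # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 (gradient 2w) with each optimizer.
w, cache = np.array([2.0]), np.zeros(1)
for _ in range(100):
    w, cache = adagrad_step(w, 2 * w, cache)

w2, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    w2, m, v = adam_step(w2, 2 * w2, m, v, t)
```

Both runs move the parameter toward the minimum at 0, with the effective per-parameter learning rate adapting over time.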
  • Another core element in the training of deep neural networks is regularization, which refers to strategies intended to avoid overfitting and thus achieve good generalization performance.
  • weight decay adds a penalty term to the objective loss function so that weight parameters converge to smaller absolute values.
  • Dropout randomly removes hidden units from neural networks during training and can be considered an ensemble of possible subnetworks.
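In its common "inverted" form (a generic sketch, not the patent's method), dropout zeroes each hidden unit with some probability during training and rescales the survivors so the expected activation is unchanged:

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    if not training:
        return a                            # inference: use the full network
    mask = rng.random(a.shape) >= rate      # randomly remove hidden units
    return a * mask / (1.0 - rate)          # rescale so E[output] == input

rng = np.random.default_rng(0)
a = np.ones((4, 1000))
out = dropout(a, rate=0.5, rng=rng)
```

Each random mask defines one subnetwork, which is why training with dropout can be viewed as training an ensemble of possible subnetworks.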
  • Maxout, a new activation function, and rnnDrop, a variant of dropout for recurrent neural networks, are further regularization techniques.
  • batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters.
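The normalization step has a compact standard form (again a generic sketch): each scalar feature is standardized with the mini-batch statistics, then scaled and shifted by the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With gamma = 1 and beta = 0 the output is simply the standardized batch; training would adjust those parameters per feature.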
  • Convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions. A hallmark of convolutional neural networks is the use of convolution filters.
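A toy example shows why weight sharing suits motif detection: a single small filter slides along one-hot-encoded DNA and scores every window, so the same motif is detected wherever it occurs. The "TATA" filter below is hand-set for illustration, not a learned one.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [BASES[b] for b in seq]] = 1.0
    return x

motif = one_hot("TATA")                       # 4x4 convolution filter
seq = one_hot("GGTATACG")
# Slide the filter along the sequence; each score counts matching positions.
scores = [float((seq[i:i + 4] * motif).sum()) for i in range(len(seq) - 3)]
best = int(np.argmax(scores))                 # window 2, where "TATA" sits
```

A learned filter would have real-valued weights rather than 0/1 entries, but the sliding-window scoring is the same.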
  • Examples described herein may be used in various biological or chemical processes and systems for academic or commercial analysis. More specifically, examples described herein may be used in various processes and systems where it is desired to detect an event, property, quality, or characteristic that is indicative of a designated reaction.
  • examples described herein include light detection devices, biosensors, and their components, as well as bioassay systems that operate with biosensors.
  • the devices, biosensors and systems may include a flow cell and one or more light sensors that are coupled together (removably or fixedly) in a substantially unitary structure.
  • the devices, biosensors and bioassay systems may be configured to perform a plurality of designated reactions that may be detected individually or collectively.
  • the devices, biosensors and bioassay systems may be configured to perform numerous cycles in which the plurality of designated reactions occurs in parallel.
  • the devices, biosensors and bioassay systems may be used to sequence a dense array of DNA features through iterative cycles of enzymatic manipulation and light or image detection/acquisition.
  • the devices, biosensors and bioassay systems (e.g., via one or more cartridges) may include one or more microfluidic channel that delivers reagents or other reaction components in a reaction solution to a reaction site of the devices, biosensors and bioassay systems.
  • the reaction solution may be substantially acidic, such as comprising a pH of less than or equal to about 5, or less than or equal to about 4, or less than or equal to about 3.
  • the reaction solution may be substantially alkaline/basic, such as comprising a pH of greater than or equal to about 8, or greater than or equal to about 9, or greater than or equal to about 10.
  • the term “acidity” and grammatical variants thereof refer to a pH value of less than about 7
  • the terms “basicity,” “alkalinity” and grammatical variants thereof refer to a pH value of greater than about 7.
  • the reaction sites are provided or spaced apart in a predetermined manner, such as in a uniform or repeating pattern. In some other examples, the reaction sites are randomly distributed. Each of the reaction sites may be associated with one or more light guides and one or more light sensors that detect light from the associated reaction site. In some examples, the reaction sites are located in reaction recesses or chambers, which may at least partially compartmentalize the designated reactions therein.
  • a “designated reaction” includes a change in at least one of a chemical, electrical, physical, or optical property (or quality) of a chemical or biological substance of interest, such as an analyte-of-interest.
  • a designated reaction is a positive binding event, such as incorporation of a fluorescently labeled biomolecule with an analyte-of-interest, for example.
  • a designated reaction may be a chemical transformation, chemical change, or chemical interaction.
  • a designated reaction may also be a change in electrical properties.
  • a designated reaction includes the incorporation of a fluorescently-labeled molecule with an analyte.
  • the analyte may be an oligonucleotide and the fluorescently-labeled molecule may be a nucleotide.
  • a designated reaction may be detected when an excitation light is directed toward the oligonucleotide having the labeled nucleotide, and the fluorophore emits a detectable fluorescent signal.
  • the detected fluorescence is a result of chemiluminescence or bioluminescence.
  • a designated reaction may also increase fluorescence (or Förster) resonance energy transfer (FRET), for example, by bringing a donor fluorophore in proximity to an acceptor fluorophore, decrease FRET by separating donor and acceptor fluorophores, increase fluorescence by separating a quencher from a fluorophore, or decrease fluorescence by co-locating a quencher and fluorophore.
  • a “reaction solution,” “reaction component” or “reactant” includes any substance that may be used to obtain at least one designated reaction.
  • potential reaction components include reagents, enzymes, samples, other biomolecules, and buffer solutions, for example.
  • the reaction components may be delivered to a reaction site in a solution and/or immobilized at a reaction site.
  • the reaction components may interact directly or indirectly with another substance, such as an analyte-of-interest immobilized at a reaction site.
  • the reaction solution may be substantially acidic (i.e., include a relatively high acidity) (e.g., comprising a pH of less than or equal to about 5, a pH less than or equal to about 4, or a pH less than or equal to about 3) or substantially alkaline/basic (i.e., include a relatively high alkalinity/basicity) (e.g., comprising a pH of greater than or equal to about 8, a pH of greater than or equal to about 9, or a pH of greater than or equal to about 10).
  • a “reaction site” is a localized region where at least one designated reaction may occur.
  • a reaction site may include support surfaces of a reaction structure or substrate where a substance may be immobilized thereon.
  • a reaction site may include a surface of a reaction structure (which may be positioned in a channel of a flow cell) that has a reaction component thereon, such as a colony of nucleic acids thereon.
  • the nucleic acids in the colony have the same sequence, being for example, clonal copies of a single stranded or double stranded template.
  • a reaction site may contain only a single nucleic acid molecule, for example, in a single stranded or double stranded form.
  • a plurality of reaction sites may be randomly distributed along the reaction structure or arranged in a predetermined manner (e.g., side-by-side in a matrix, such as in microarrays).
  • a reaction site can also include a reaction chamber or recess that at least partially defines a spatial region or volume configured to compartmentalize the designated reaction.
  • the term “reaction chamber” or “reaction recess” includes a defined spatial region of the support structure (which is often in fluid communication with a flow channel).
  • a reaction recess may be at least partially separated from the surrounding environment or other spatial regions. For example, a plurality of reaction recesses may be separated from each other by shared walls, such as a detection surface.
  • reaction recesses may be nanowells comprising an indent, pit, well, groove, cavity or depression defined by interior surfaces of a detection surface and have an opening or aperture (i.e., be open-sided) so that the nanowells can be in fluid communication with a flow channel.
  • reaction recesses of the reaction structure are sized and shaped relative to solids (including semi-solids) so that the solids may be inserted, fully or partially, therein.
  • the reaction recesses may be sized and shaped to accommodate a capture bead.
  • the capture bead may have clonally amplified DNA or other substances thereon.
  • reaction recesses may be sized and shaped to receive an approximate number of beads or solid substrates.
  • reaction recesses may be filled with a porous gel or substance that is configured to control diffusion or filter fluids or solutions that may flow into the reaction recesses.
  • a light sensor (e.g., a photodiode) that is associated with a reaction site is configured to detect light emissions from the associated reaction site via at least one light guide when a designated reaction has occurred at the associated reaction site. In some examples, the light emissions are detected by a plurality of light sensors or by a single light sensor (e.g., a single pixel).
  • the light sensor, the reaction site, and other features of the biosensor may be configured so that at least some of the light is directly detected by the light sensor without being reflected.
  • a “biological or chemical substance” includes biomolecules, samples-of-interest, analytes-of-interest, and other chemical compound(s).
  • a biological or chemical substance may be used to detect, identify, or analyze other chemical compound(s), or function as intermediaries to study or analyze other chemical compound(s).
  • the biological or chemical substances include a biomolecule.
  • a “biomolecule” includes at least one of a biopolymer, nucleoside, nucleic acid, polynucleotide, oligonucleotide, protein, enzyme, polypeptide, antibody, antigen, ligand, receptor, polysaccharide, carbohydrate, polyphosphate, cell, tissue, organism, or fragment thereof or any other biologically active chemical compound(s) such as analogs or mimetics of the aforementioned species.
  • a biological or chemical substance or a biomolecule includes an enzyme or reagent used in a coupled reaction to detect the product of another reaction, such as an enzyme or reagent used to detect pyrophosphate in a pyrosequencing reaction.
  • Enzymes and reagents useful for pyrophosphate detection are described, for example, in U.S. Patent Publication No. 2005/0244870 A1, which is incorporated by reference in its entirety.
  • Biomolecules, samples, and biological or chemical substances may be naturally occurring or synthetic and may be suspended in a solution or mixture within a reaction recess or region. Biomolecules, samples, and biological or chemical substances may also be bound to a solid phase or gel material. Biomolecules, samples, and biological or chemical substances may also include a pharmaceutical composition. In some cases, biomolecules, samples, and biological or chemical substances of interest may be referred to as targets, probes, or analytes.
  • a “biosensor” includes a device that includes a reaction structure with a plurality of reaction sites and that is configured to detect designated reactions that occur at or proximate to the reaction sites.
  • a biosensor may include a solid-state light detection or “imaging” device (e.g., CCD or CMOS light detection device) and, optionally, a flow cell mounted thereto.
  • the flow cell may include at least one flow channel that is in fluid communication with the reaction sites.
  • the biosensor is configured to fluidically and electrically couple to a bioassay system.
  • the bioassay system may deliver a reaction solution to the reaction sites according to a predetermined protocol (e.g., sequencing-by synthesis) and perform a plurality of imaging events.
  • the bioassay system may direct reaction solutions to flow along the reaction sites.
  • At least one of the reaction solutions may include four types of nucleotides having the same or different fluorescent labels.
  • the nucleotides may bind to the reaction sites, such as to corresponding oligonucleotides at the reaction sites.
  • the bioassay system may then illuminate the reaction sites using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes (LEDs)).
  • the excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths.
  • the fluorescent labels excited by the incident excitation light may provide emission signals (e.g., light of a wavelength or wavelengths that differ from the excitation light and, potentially, each other) that may be detected by the light sensors.
  • the term “immobilized,” when used with respect to a biomolecule or biological or chemical substance, includes substantially attaching the biomolecule or biological or chemical substance at a molecular level to a surface, such as to a detection surface of a light detection device or reaction structure.
  • a biomolecule or biological or chemical substance may be immobilized to a surface of the reaction structure using adsorption techniques including non-covalent interactions (e.g., electrostatic forces, van der Waals, and dehydration of hydrophobic interfaces) and covalent binding techniques where functional groups or linkers facilitate attaching the biomolecules to the surface.
  • Immobilizing biomolecules or biological or chemical substances to the surface may be based upon the properties of the surface, the liquid medium carrying the biomolecule or biological or chemical substance, and the properties of the biomolecules or biological or chemical substances themselves.
  • the surface may be functionalized (e.g., chemically or physically modified) to facilitate immobilizing the biomolecules (or biological or chemical substances) to the surface.
  • nucleic acids can be immobilized to the reaction structure, such as to surfaces of reaction recesses thereof.
  • the devices, biosensors, bioassay systems and methods described herein may include the use of natural nucleotides and also enzymes that are configured to interact with the natural nucleotides.
  • Natural nucleotides include, for example, ribonucleotides or deoxyribonucleotides. Natural nucleotides can be in the mono-, di-, or tri-phosphate form and can have a base selected from adenine (A), thymine (T), uracil (U), guanine (G), or cytosine (C). It will be understood, however, that non-natural nucleotides, modified nucleotides or analogs of the aforementioned nucleotides can be used.
  • a biomolecule or biological or chemical substance may be immobilized at a reaction site in a reaction recess of a reaction structure. Such a biomolecule or biological substance may be physically held or immobilized within the reaction recesses through an interference fit, adhesion, covalent bond, or entrapment.
  • Examples of items or solids that may be disposed within the reaction recesses include polymer beads, pellets, agarose gel, powders, quantum dots, or other solids that may be compressed and/or held within the reaction chamber.
  • the reaction recesses may be coated or filled with a hydrogel layer capable of covalently binding DNA oligonucleotides.
  • a nucleic acid superstructure such as a DNA ball
  • a DNA ball can be disposed in or at a reaction recess, for example, by attachment to an interior surface of the reaction recess or by residence in a liquid within the reaction recess.
  • a DNA ball or other nucleic acid superstructure can be preformed and then disposed in or at a reaction recess.
  • a DNA ball can be synthesized in situ at a reaction recess.
  • a substance that is immobilized in a reaction recess can be in a solid, liquid, or gaseous state.
  • FIG. 1 illustrates a cross-section of a biosensor 100 in accordance with one implementation.
  • the biosensor 100 may include a flow cell 102 that is coupled directly or indirectly to a light detection device 104.
  • the flow cell 102 may be mounted to the light detection device 104.
  • the flow cell 102 is affixed directly to the light detection device 104 through one or more securing mechanisms (e.g., adhesive, bond, fasteners, and the like).
  • the flow cell 102 may be removably coupled to the light detection device 104.
  • the biosensor 100 and/or detection device 104 may be configured for biological or chemical analysis to obtain any information or data that relates thereto.
  • the biosensor 100 and/or detection device 104 may comprise a nucleic acid sequencing system (or sequencer) configured for various applications, including but not limited to de novo sequencing, resequencing of whole genomes or target genomic regions, and metagenomics.
  • the sequencing system may be configured to perform DNA or RNA analysis.
  • the biosensor 100 and/or detection device 104 is configured to perform a large number of parallel reactions within the biosensor 100 and/or detection device 104 to obtain information relating thereto.
  • the flow cell 102 may include one or more flow channels that direct a solution to or toward reaction sites 114 on the detection device 104, as explained further below.
  • the flow cell 102 and/or biosensor 100 may thereby include, or be in fluid communication with, a fluid/solution storage system (not shown) that may store various reaction components or reactants that are used to conduct the designated reactions therein, for example.
  • the fluid storage system may also store fluids or solutions for washing or cleaning a fluid network and the biosensor 100 and/or detection device 104, and potentially for diluting the reactants.
  • the fluid storage system may include various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous, oil and other non-polar solutions, and the like.
  • the fluid or solution provided on the reaction structure 126 may be relatively acidic (e.g., pH less than or equal to about 5) or basic/alkaline (e.g., pH greater than or equal to about 8).
  • the fluid storage system may also include waste reservoirs for receiving waste products from the biosensor 100 and/or detection device 104.
  • the light detection device 104 includes a device base 125 and a reaction structure 126 overlying the device base 125.
  • the device base 125 includes a plurality of stacked layers (e.g., silicon layer or wafer, dielectric layer, metal- dielectric layers, etc.).
  • the device base 125 may include a sensor array 124 of light sensors 140, and a guide array of light guides 118.
  • the reaction structure 126 may include an array of reaction recesses 108 that have at least one corresponding reaction site 114 provided therein (e.g., immobilized on a surface thereof).
  • the light detection device 104 is configured such that each light sensor 140 corresponds (and potentially aligns) with a single light guide 118 and/or a single reaction recess 108 such that it receives photons only therefrom.
  • a single light sensor 140 may receive photons through more than one light guide 118 and/or from more than one reaction recess 108.
  • a single light sensor 140 may thereby form one pixel or more than one pixel.
  • the array of reaction recesses 108 and/or light guides 118 may be provided in a defined repeating pattern such that at least some of the recesses 108 and/or light guides 118 (and potentially light sensors 140) are equally spaced from one another in a defined positional pattern.
  • the reaction recesses 108 and/or light guides 118 (and potentially light sensors 140) may be provided in a random pattern, and/or at least some of the reaction recesses 108 and/or light guides 118 (and potentially light sensors 140) may be variably spaced from each other.
  • the reaction structure 126 of the detection device 104 may define a detector surface 112 over which a reaction solution may flow and reside, as explained further below.
  • the detector surface 112 of the reaction structure 126 may be the top exposed surface of the detection device 104.
  • the detector surface 112 may comprise the surfaces of the recesses 108 and interstitial areas 113 extending between and about the recesses 108.
  • the detector surface 112 of the light detection device 104 may be functionalized (e.g., chemically or physically modified in a suitable manner for conducting designated reactions).
  • the detector surface 112 may be functionalized and may include a plurality of reaction sites 114 having one or more biomolecules immobilized thereto.
  • the detector surface 112 may include an array of reaction recesses 108 (e.g., open-sided reaction chambers). Each of the reaction recesses 108 may include one or more of the reaction site 114.
  • the reaction recesses 108 may be defined by, for example, a change in depth (or thickness) along the detector surface 112. In other examples, the detector surface 112 may be substantially planar.
  • the reaction sites 114 may be distributed in a pattern along the detector surface 112, such as within the reaction recesses 108.
  • the reaction sites 114 may be located in rows and columns along the reaction recesses 108 in a manner that is similar to a microarray.
  • various patterns of reaction sites 114 may be used.
  • the reaction sites 114 may include biological or chemical substances that emit light signals, as explained further below.
  • the biological or chemical substances of the reaction sites 114 may generate light emissions in response to the excitation light 101.
  • the reaction sites 114 include clusters or colonies of biomolecules (e.g., oligonucleotides) that are immobilized on the detector surface 112 within the reaction recesses 108.
  • the reaction sites 114 may generate light emissions in response to incident excitation light after treatment with the reaction solution.
  • the reaction solution may initiate a reaction and/or form a reaction product at the reaction sites 114 (but potentially not at other reaction sites of the reaction structure 126 of the device 104) that generates light emissions in response to the excitation light.
  • the excitation light 101 may be emitted from any illumination source (not shown), which may or may not be part of the bioassay system, biosensor 100 or light detection device 104.
  • the illumination system may include a light source (e.g., one or more LED) and, potentially, a plurality of optical components to illuminate at least the reaction structure 126 of the detection device 104.
  • light sources may include lasers, arc lamps, LEDs, or laser diodes.
  • the optical components may be, for example, reflectors, dichroics, beam splitters, collimators, lenses, filters, wedges, prisms, mirrors, detectors, and the like.
  • the illumination system is configured to direct the excitation light 101 to reaction sites 114 within the recesses 108 of the reaction structure 126 of the detection device 104.
  • the illumination system may emit the excitation light 101 within a range of wavelengths, such as within the range of about 300 nm to about 700 nm for example, or more particularly within the range of about 400 nm to about 600 nm for example.
  • the illumination system may emit the excitation light 101 at a certain wavelength or wavelengths that excites the biological or chemical substance(s) of the reaction sites 108 (e.g., a reaction initiated by the reaction solution and/or reaction product formed by the reaction solution at the reaction sites 114) to emit light emissions of a differing wavelength or wavelengths.
  • the excitation light may be about 532 nm and the light emissions may be about 570 nm or more.
  • FIG. 2 shows the detection device 104 in greater detail than FIG. 1. More specifically, FIG. 2 shows a single light sensor 140, a single light guide 118 for directing and passing light emissions from at least one reaction site 114 associated therewith toward the light sensor 140, and associated circuitry 146 for transmitting signals based on the light emissions (e.g., photons) detected by the light sensor 140.
  • the other light sensors 140 of the sensor array 124 and associated components may be configured in an identical or similar manner.
  • the light detection device 104 is not required to be manufactured uniformly throughout. Instead, one or more light sensors 140 and/or associated components may be manufactured differently or have different relationships with respect to one another.
  • the circuitry 146 may include interconnected conductive elements (e.g., conductors, traces, vias, interconnects, etc.) that are capable of conducting electrical current, such as the transmission of data signals that are based on detected photons.
  • the circuitry 146 may comprise a microcircuit arrangement.
  • the light detection device 104 and/or the device base 125 may comprise at least one integrated circuit having an array of the light sensors 140.
  • the circuitry 146 positioned within the detection device 104 may be configured for at least one of signal amplification, digitization, storage, and processing.
  • the circuitry 146 may collect (and potentially analyze) the detected light emissions and generate data signals for communicating detection data to a bioassay system.
  • the circuitry 146 may also perform additional analog and/or digital signal processing in the light detection device 104.
  • the device base 125 and the circuitry 146 may be manufactured using integrated circuit manufacturing processes, such as processes used to manufacture charge-coupled devices or circuits (CCD) or complementary metal-oxide-semiconductor (CMOS) devices or circuits.
  • the device base 125 may be a CMOS device comprising a plurality of stacked layers including a sensor base 141, which may be a silicon layer (e.g., a wafer) in some examples.
  • the sensor base 141 may include the light sensor 140, and gates 143 formed thereon.
  • the gates 143 may be electrically coupled to the light sensor 140.
  • where the light detection device 104 is configured as shown in FIG. 2, the light sensor 140 may be electrically coupled to the circuitry 146 through the gates 143, for example.
  • FIG. 3 depicts one implementation of base calling 300 using convolutions.
  • the base calling 300 is operationalized by the neural network-based base caller 2614. That is, the three-dimensional (3D) convolution filters 304, the skip connection 326, the one-dimensional (1D) convolution filters 308, the pointwise convolution filters 310, and the output layer 314 are components of the neural network-based base caller 2614, which processes the input data 2632 through its components and produces the base calls 332 as output.
  • the convolution operations of the neural network-based base caller 2614 are operationalized by a convolution operator 2615, which is also a component of the neural network-based base caller 2614.
  • the convolution operator 2615 in turn comprises a 3D convolution operator 2616, a 1D convolution operator 2617, a pointwise convolution operator 2618, and a transposed convolution operator 2619.
  • the input data 2632 is image data 302 based on intensity signals depicting analyte emissions (e.g., in the case of Illumina).
  • the image data 302 is derived from sequencing images produced by a sequencer during a sequencing run.
  • the image data 302 comprises w x h image patches extracted from the sequencing images, where w (width) and h (height) are any numbers ranging from 1 to 10,000 (e.g., 3 x 3, 5 x 5, 7 x 7, 10 x 10, 15 x 15, 25 x 25).
  • w and h are the same.
  • w and h are different.
  • the sequencing run produces c image(s) per sequencing cycle for corresponding c imaged channels, and an image patch is extracted by an input preparer 2625 from each of the c image(s) to prepare the image data for a particular sequencing cycle.
  • c is 4 or 2. In other implementations, c is 1, 3, or greater than 4.
  • the image data 302 is in the optical, pixel domain in some implementations, and in the upsampled, subpixel domain in other implementations.
  • the image data 302 comprises data for multiple sequencing cycles (e.g., a current sequencing cycle, one or more preceding sequencing cycles, and one or more successive sequencing cycles).
  • the image data 302 comprises data for three sequencing cycles, such that data for a current (time t) sequencing cycle to be base called is accompanied with (i) data for a left flanking/context/previous/preceding/prior (time t-1) sequencing cycle and (ii) data for a right flanking/context/next/successive/subsequent (time t+1) sequencing cycle.
  • the image data 302 comprises data for a single sequencing cycle.
  • the image data 302 comprises data for 58, 75,
  • the image data 302 depicts intensity emissions of one or more clusters and their surrounding background.
  • the image patches are extracted from the sequencing images by the input preparer 2625 in such a way that each image patch contains intensity signal data from the target cluster in its center pixel.
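The patch extraction described above can be sketched as follows. This is an illustrative example only, not the patent's implementation; `extract_patch`, `tile`, and the coordinate names are hypothetical:

```python
# Sketch (assumed names): extract a w x h patch from a per-cycle tile image so
# that the target cluster's pixel lands in the patch's center pixel.

def extract_patch(tile, center_row, center_col, w, h):
    """Return an h x w patch of `tile` centered at (center_row, center_col).

    Assumes w and h are odd so a true center pixel exists, and that the
    patch lies fully inside the tile (no border handling in this sketch).
    """
    half_h, half_w = h // 2, w // 2
    return [row[center_col - half_w:center_col + half_w + 1]
            for row in tile[center_row - half_h:center_row + half_h + 1]]

# A 7 x 7 tile whose pixel values encode their (row, col) position.
tile = [[(r, c) for c in range(7)] for r in range(7)]
patch = extract_patch(tile, center_row=3, center_col=3, w=3, h=3)
# The target cluster's pixel (3, 3) sits at the patch's center pixel.
```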
  • the image data 302 is encoded in the input data 2632 using intensity channels (also called imaged channels). For each of the c images obtained from the sequencer for a particular sequencing cycle, a separate imaged channel is used to encode its intensity signal data.
  • the input data 2632 comprises (i) a first red imaged channel with w x h pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the red image and (ii) a second green imaged channel with w x h pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the green image.
  • the input data 2632 is based on pH changes induced by the release of hydrogen ions during molecule extension.
  • the pH changes are detected and converted to a voltage change that is proportional to the number of bases incorporated (e.g., in the case of Ion Torrent).
  • the input data 2632 is constructed from nanopore sensing that uses biosensors to measure the disruption in current as an analyte passes through a nanopore or near its aperture while determining the identity of the base.
  • the Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: pass a single strand of DNA (or RNA) through a membrane via a nanopore and apply a voltage difference across the membrane.
  • the nucleotides present in the pore will affect the pore’s electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore.
  • This electrical current signal (the‘squiggle’ due to its appearance when plotted) is the raw data gathered by an ONT sequencer.
  • the input data 2632 comprises normalized or scaled integer data acquisition (DAC) values.
  • the dimensionality of the image data 302 can be expressed as w x h x k x c, where “w” represents the width of the image data 302, “h” represents the height of the image data 302, “k” represents the number of sequencing cycles for which the image data 302 is obtained, and “c” represents the number of imaged channels in the image data 302.
  • w can be 3, 5, 6, 10, 15, or 25 and h can be the same as w.
  • k can be 1, 3, 5, 7, 9,
  • c can be 1, 2, 3, 4, 6, or 10.
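As a minimal sketch of this w x h x k x c layout (placeholder values and hypothetical variable names, not the patent's data structures):

```python
# Illustrative input-tensor layout: w x h image patches over k sequencing
# cycles and c imaged channels, built as plain nested lists.
w, h, k, c = 3, 3, 5, 2  # e.g., 3 x 3 patches, 5 cycles, 2 imaged channels
image_data = [[[[0.0 for _ in range(c)]  # per-channel intensity placeholder
                for _ in range(k)]       # one entry per sequencing cycle
               for _ in range(h)]
              for _ in range(w)]
```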
  • the 3D convolution filters 304 apply 3D convolutions (3D CONV) on the image data 302 and produce output features 306.
  • the dimensionality of the 3D convolutions can be expressed as w x h x r x n, where “w” represents the width of a 3D convolution kernel, “h” represents the height of the 3D convolution kernel, “r” represents the receptive field of the 3D convolution kernel, and “n” represents a total number of the 3D convolution filters 304.
  • w can be 3, 5, 6, 10, 15, or 25 and h can be the same as w.
  • r can be 3, 5, 7, 10, 15, or 25.
  • n can be 3, 5, 10, 50, 100, 150, 198, 200, 250, or 300.
  • the 3D convolutions are operationalized by the 3D convolution operator 2616.
  • FIG. 27 illustrates one implementation of a 3D convolution 2700 used in the convolution-based base calling 300.
  • a 3D convolution is a mathematical operation where each voxel present in the input volume is multiplied by a voxel in the equivalent position of the convolution kernel. At the end, the sum of the results is added to the output volume.
  • In FIG. 27 it is possible to observe the representation of the 3D convolution, where the voxels 2716a highlighted in the input 2716 are multiplied with their respective voxels in the kernel 2718. After these calculations, their sum 2720a is added to the output 2720.
  • the 3D convolution operation can be mathematically defined as:
  • K is the convolution kernel
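The equation itself is not reproduced in this excerpt. One standard discrete formulation consistent with the voxel-wise multiply-and-sum description above is the following, where the symbols I (input volume), K (convolution kernel), and O (output volume) are assumed names, not necessarily the patent's exact notation:

```latex
O(x, y, z) = \sum_{i} \sum_{j} \sum_{k} I(x + i,\; y + j,\; z + k)\, K(i, j, k)
```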
  • 3D convolutions, in addition to extracting spatial information from matrices like 2D convolutions, extract information present between consecutive matrices. This allows them to map both the spatial information of 3D data and the temporal information of a set of sequential images.
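The voxel-wise multiply-and-sum operation can be made concrete with a naive sketch. This is an illustrative implementation under assumed names (`conv3d`, valid-mode, stride 1), not the patent's code:

```python
def conv3d(volume, kernel):
    """Naive valid-mode 3D convolution (cross-correlation, as is conventional
    in deep learning): each output voxel is the sum of voxel-wise products
    between the kernel and the corresponding input sub-volume."""
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(D - d + 1):
        plane = []
        for y in range(H - h + 1):
            row = []
            for x in range(W - w + 1):
                s = 0.0
                for i in range(d):
                    for j in range(h):
                        for k in range(w):
                            s += volume[z + i][y + j][x + k] * kernel[i][j][k]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out

volume = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]  # 4x4x4 of ones
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]  # 2x2x2 of ones
out = conv3d(volume, kernel)  # 3x3x3 output; each voxel sums 8 ones -> 8.0
```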
  • the output features 306 are subjected to nonlinear activation functions such as rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), parametric ReLU (PReLU), sigmoid, and hyperbolic tangent (tanh) to produce activated output features.
  • the nonlinear activation functions are operationalized by a nonlinear activation function applier 504, which is also a component of the neural network-based base caller 2614.
  • batch normalization is applied either before or after the 3D convolutions.
  • the batch normalization is operationalized by a batch normalizer 2622, which is also a component of the neural network-based base caller 2614.
  • a skip connection 326 combines parts 324 of the image data 302 (or the input data 2632) with the output features 306 (or the activated output features). In other implementations, the skip connection 326 combines all of the image data 302 (or the input data 2632) with the output features 306 (or the activated output features). The combining can be accomplished by concatenation or summation. The resulting combined data is referred to as supplemented features 334. In one implementation, when a single target cluster is to be base called, information about the single target cluster is selected from the image data 302 (or the input data 2632) and combined with the output features 306 (or the activated output features).
  • intensity signal data depicted by a pixel (1 x 1) associated with the single target cluster is selected for each of the imaged channels (c) and for each of the sequencing cycles (k) and combined with the output features 306 (or the activated output features).
  • the skip connection 326 is operationalized by a skip connector 2620, which is also a component of the neural network-based base caller 2614.
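The concatenation variant of the skip connection can be sketched as below; this is an illustrative example with hypothetical names and values, not the patent's implementation:

```python
# Sketch: skip connection by concatenation along the feature dimension, so
# downstream layers see both selected raw input data and derived features.

def skip_concat(selected_input_features, output_features):
    """Combine selected input data with convolution output by concatenation."""
    return selected_input_features + output_features  # list concatenation

center_pixel_intensities = [0.2, 0.7]  # e.g., red and green channel values
conv_output = [0.1, 0.5, 0.9]          # placeholder output features
supplemented = skip_concat(center_pixel_intensities, conv_output)
```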
  • the 1D convolution filters 308 apply 1D convolutions (1D CONV) on the supplemented features 334 and produce further output features 328.
  • a cascade of the 1D convolutions 330 is applied. That is, a first 1D convolution in the cascade 330 processes the supplemented features 334 as starting input and produces a first set of the further output features 328. A second 1D convolution in the cascade 330 then processes the first set of the further output features 328 and produces a second set of the further output features 328. A third 1D convolution in the cascade 330 then processes the second set of the further output features 328 and produces a third set of the further output features 328.
  • An ultimate 1D convolution in the cascade 330 processes the penultimate set of the further output features 328 and produces an ultimate set of the further output features 328, which is then fed as starting input to the pointwise convolutions (pointwise CONV).
  • Each 1D convolution in the cascade 330 uses a bank (n) of the 1D convolution filters 308.
  • each 1D convolution in the cascade 330 has a different kernel width or receptive field (l).
  • l can be 3, 5, 7, 9, 11, and 13.
  • some 1D convolutions in the cascade 330 have the same l, while other 1D convolutions in the cascade 330 have a different l.
  • l can be progressively increased, progressively decreased, randomly varied, or randomly maintained.
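The cascade can be sketched as repeated application of a zero-padded 1D convolution, so that one feature element per cycle survives every stage. This is an illustrative example under assumed names (`conv1d_same`, averaging kernels), not the patent's implementation:

```python
def conv1d_same(seq, kernel):
    """1D convolution with zero (SAME) padding: the output length equals the
    input length, so one feature element is produced per sequencing cycle."""
    l = len(kernel)          # kernel width / receptive field
    pad = l // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(l))
            for i in range(len(seq))]

features = [1.0] * 9         # e.g., one feature per cycle for k = 9 cycles
for width in (3, 5, 7):      # cascade with progressively increasing l
    features = conv1d_same(features, [1.0 / width] * width)
# the per-cycle length is preserved through the whole cascade
```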
  • the 1D convolutions are operationalized by the 1D convolution operator 2617.
  • FIG. 28 illustrates one implementation of a 1D convolution 2800 used in the convolution-based base calling 300.
  • a 1D convolution extracts local 1D patches 2812 or subsequences from an input sequence 2802 and obtains an output 2826 from each such 1D patch 2812.
  • the 1D convolution recognizes local patterns in the input sequence 2802. Because the same input transformation is performed on every patch 2812, a pattern learned at a certain position in the input sequence 2802 can be later recognized at a different position, making the 1D convolution invariant to temporal translations. For instance, when the 1D convolution processes the input sequence 2802 using convolution windows of size five 2804, it learns sequence patterns of length five or less, and thus recognizes base motifs in the input sequence 2802. This way the 1D convolution is able to learn the underlying base morphology.
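The local-patch view of a 1D convolution can be sketched as follows; `sliding_windows` is a hypothetical helper for illustration, not part of the patent:

```python
def sliding_windows(seq, size):
    """Extract the local 1D patches (subsequences) that a 1D convolution
    transforms. Applying the same transformation at every position is what
    makes the operation invariant to temporal translations."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

# Windows of size five over a base sequence correspond to base motifs of
# length five or less that the convolution can learn to recognize.
patches = sliding_windows("ACGTACGT", 5)
```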
  • the further output features 328 are subjected to nonlinear activation functions such as rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), parametric ReLU (PReLU), sigmoid, and hyperbolic tangent (tanh) to produce activated further output features.
  • batch normalization is applied either before or after each 1D convolution in the cascade.
  • the pointwise convolution filters 310 apply pointwise convolutions (pointwise CONV) on the ultimate set of the further output features 328 (or activated further output features) and produce final output features 312.
  • the pointwise convolutions are operationalized by the pointwise convolution operator 2618.
  • FIG. 29 illustrates one implementation of a pointwise convolution 2900 used in the convolution-based base calling 300.
  • a pointwise convolution is a convolution with a 1 x 1 receptive field/kernel width/window/spatial dimensions.
  • a pointwise convolution produces an output that has the same spatial dimensionality as the input, i.e., the pointwise convolution carries the spatial dimensionality of the input onto the output.
  • the resulting output 2906 has only one channel.
  • another input 2912 is convolved over by a bank of 256 pointwise convolution filters 2914
  • the resulting output 2916 has 256 channels. Note that, in both the examples, the output spatial dimensionality matches the input spatial dimensionality, i.e., 8 x 8.
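The channel-mixing behavior of a pointwise convolution can be sketched as below. This is an illustrative example with assumed names (`pointwise_conv`, placeholder values), not the patent's implementation:

```python
def pointwise_conv(feature_map, filters):
    """1 x 1 convolution: each spatial position is independently mapped from
    c_in input channels to len(filters) output channels, so the spatial
    dimensions of the output match those of the input."""
    return [[[sum(px[c] * f[c] for c in range(len(px))) for f in filters]
             for px in row] for row in feature_map]

# An 8 x 8 feature map with 4 channels, reduced to 1 channel by one filter.
fmap = [[[1.0, 2.0, 3.0, 4.0] for _ in range(8)] for _ in range(8)]
out = pointwise_conv(fmap, [[0.25, 0.25, 0.25, 0.25]])
# spatial dimensionality is still 8 x 8; the channel count is now 1
```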
  • the final output features 312 are subjected to nonlinear activation functions such as rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), parametric ReLU (PReLU), sigmoid, and hyperbolic tangent (tanh) to produce activated final output features.
  • batch normalization is applied either before or after the pointwise convolutions.
  • the output layer 314 processes the final output features 312 and produces base calls 332.
  • the output layer 314 can comprise a fully-connected network 2348, a sigmoid layer, a softmax layer, and/or a regression layer.
  • the neural network-based base caller 2614 uses 3D convolutions that mix information between the input channels and 1D convolutions that also mix information between the input channels. In another implementation, the neural network-based base caller 2614 uses 3D convolutions that mix information between the input channels, but 1D convolutions that do not mix information between the input channels.
  • the neural network-based base caller 2614 uses 3D convolutions that do not mix information between the input channels, but 1D convolutions that mix information between the input channels. In yet a further implementation, the neural network-based base caller 2614 uses 3D convolutions that do not mix information between the input channels and 1D convolutions that also do not mix information between the input channels.
  • the 3D convolutions, the 1D convolutions, the pointwise convolutions, and the transposed convolutions can use padding.
  • the padding is SAME or zero padding and produces at least one feature element corresponding to each sequencing cycle.
  • the padding is VALID padding.
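The effect of the two padding schemes on output length (stride 1) can be sketched as follows; `out_length` is a hypothetical helper for illustration:

```python
def out_length(n, kernel_width, padding):
    """Output length of a stride-1 convolution under the two padding schemes
    mentioned above: SAME (zero padding) keeps one output element per input
    position (e.g., per sequencing cycle); VALID shrinks the sequence."""
    if padding == "SAME":
        return n
    if padding == "VALID":
        return n - kernel_width + 1
    raise ValueError(f"unknown padding: {padding}")
```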
  • the intermediate calculations of the neural network-based base caller 2614 are stored as intermediate features 2605.
  • FIG. 4 depicts 3D convolutions 402 used in the convolution-based base calling 400 in accordance with one implementation that mixes information between the imaged channels.
  • the 3D convolutions 402 convolve over the image data 302.
  • the image data 302 includes pixels that contain intensity data for associated analytes, and the intensity data is obtained for one or more imaged channels by corresponding light sensors configured to detect emissions from the associated analytes.
  • the biosensor 100 comprises an array of light sensors.
  • a light sensor is configured to sense information from a corresponding pixel area (e.g., a reaction site/well/nanowell) on the detection surface of the biosensor 100.
  • An analyte disposed in a pixel area is said to be associated with the pixel area, i.e., the associated analyte.
  • the light sensor corresponding to the pixel area is configured to detect/capture/sense emissions/photons from the associated analyte and, in response, generate a pixel signal for each imaged channel.
  • each imaged channel corresponds to one of a plurality of filter wavelength bands.
  • each imaged channel corresponds to one of a plurality of imaging events at a sequencing cycle.
  • each imaged channel corresponds to a combination of illumination with a specific laser and imaging through a specific optical filter.
  • Pixel signals from the light sensors are communicated to a signal processor coupled to the biosensor 100 (e.g., via a communication port). For each sequencing cycle and each imaged channel, the signal processor produces an image whose pixels respectively depict the pixel signals.
  • a pixel in the image corresponds to: (i) a light sensor of the biosensor 100 that generated the pixel signal depicted by the pixel, (ii) an associated analyte whose emissions were detected by the corresponding light sensor and converted into the pixel signal, and (iii) a pixel area on the detection surface of the biosensor 100 that holds the associated analyte.
  • Pixels in the red and green images have one-to-one correspondence within a sequencing cycle. This means that corresponding pixels in a pair of the red and green images depict intensity data for the same associated analyte, albeit in different imaged channels. Similarly, pixels across the pairs of red and green images have one-to-one correspondence between the sequencing cycles. This means that corresponding pixels in different pairs of the red and green images depict intensity data for the same associated analyte, albeit for different acquisition events/timesteps (sequencing cycles) of the sequencing run.
  • Corresponding pixels in the red and green images can be considered a pixel of a “per-cycle image” that expresses intensity data in a first red channel and a second green channel.
  • a per-cycle image whose pixels depict pixel signals for a subset of the pixel areas, i.e., a region (tile) of the detection surface of the biosensor 100, is called a “per-cycle tile image.”
  • a patch extracted from a per-cycle tile image is called a “per-cycle image patch.”
  • the patch extraction is performed by the input preparer 2625.
  • the image data 302 comprises a sequence of per-cycle image patches generated for a series of k sequencing cycles of a sequencing run.
  • the pixels in the per-cycle image patches contain intensity data for associated analytes and the intensity data is obtained for one or more imaged channels (e.g., a red channel 422r and a green channel 422g) by corresponding light sensors configured to detect emissions from the associated analytes.
  • the per-cycle image patches are centered at a center pixel 412 that contains intensity data for a target associated analyte and non-center pixels in the per-cycle image patches contain intensity data for associated analytes adjacent to the target associated analyte.
  • the image data 302 is prepared by the input preparer 2625.
  • a per-cycle image patch for cycle 4 is referenced in FIG. 4 by numeral 490. Also note that, in FIG. 4, the repeated reference to the center pixel 412 across the k per-cycle image patches illustrates the pixel-to-pixel correspondence discussed above.
  • the image data 302 is padded with padding 404.
  • the padding 404 is SAME or zero padding and produces at least one feature element corresponding to each of the k sequencing cycles. In another implementation, the padding 404 is VALID padding.
  • the 3D convolutions 402 are applied on the image data 302 on a sliding convolution window basis.
  • FIG. 4 shows four convolution windows 415, 425, 435, and 485.
  • a convolution window covers a plurality of the per-cycle image patches (e.g., anywhere from 2 to 200 per-cycle image patches forming the plurality) and produces a feature element as output.
  • feature elements 466 corresponding to the convolution windows 415, 425, 435, and 485 of a first 3D convolution filter 418 are i1, i2, i3, and ik.
  • the feature elements 466 are arranged in an output feature 502a.
  • the 3D convolutions 402 use imaged channel-specific convolution kernels such that a convolution kernel convolves over data for its own imaged channel and does not convolve over data for another imaged channel.
  • the red convolution kernel 418r convolves over the data in the red channel 422r
  • the green convolution kernel 418g convolves over the data in the green channel 422g (along with bias 418b).
  • the output of a convolution kernel convolving over the plurality of the per-cycle image patches is an intermediate feature element (not shown).
  • a feature element like i1, i2, i3, or ik is a result of accumulating the intermediate feature elements produced by the imaged channel-specific convolution kernels.
  • the feature element i1 produced by the first 3D convolution filter 418 for the convolution window 415 is made up of a red intermediate feature element i1r (not shown) produced by the red convolution kernel 418r and a green intermediate feature element i1g (not shown) produced by the green convolution kernel 418g.
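The per-channel accumulation described above can be sketched as follows; all array shapes, variable names, and random values are illustrative assumptions, with each imaged channel convolved only by its own kernel before the intermediate feature elements are summed with the bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# One convolution window: a stack of per-cycle image patches in 2 imaged channels.
depth, height, width = 5, 15, 15            # cycles in window, patch size (illustrative)
red_patches   = rng.normal(size=(depth, height, width))
green_patches = rng.normal(size=(depth, height, width))

# Channel-specific kernels: each convolves only its own imaged channel.
red_kernel   = rng.normal(size=(depth, height, width))
green_kernel = rng.normal(size=(depth, height, width))
bias = 0.1

# At one sliding-window position, each kernel yields an intermediate feature element...
i1r = np.sum(red_patches * red_kernel)      # red intermediate feature element
i1g = np.sum(green_patches * green_kernel)  # green intermediate feature element

# ...and the 3D convolution filter's feature element accumulates them plus the bias.
i1 = i1r + i1g + bias
```

Sliding this window along the cycle dimension produces the feature elements i1, i2, i3, ..., ik of an output feature.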
  • the neural network-based base caller 2614 can use (i) mixed 3D and 1D convolutions, (ii) mixed 3D convolutions but segregated 1D convolutions, (iii) segregated 3D convolutions but mixed 1D convolutions, and/or (iv) segregated 3D and 1D convolutions.
  • the image data 302 is subject to biases such as phasing and prephasing effect, spatial crosstalk, emission overlap, and fading.
  • Phasing is caused by incomplete removal of the 3' terminators and fluorophores as well as sequences in the analyte missing an incorporation cycle.
  • Prephasing is caused by the incorporation of nucleotides without effective 3 '-blocking.
  • Phasing and prephasing effect is a nonstationary distortion and thus the proportion of sequences in each analyte that are affected by phasing and prephasing increases with cycle number; hampering correct base identification and limiting the length of useful sequence reads.
  • Incomplete extension due to phasing results in lagging strands (e.g., t-1 from the current cycle).
  • Addition of multiple nucleotides or probes in a population of identical strands due to prephasing results in leading strands (e.g., t+1 from the current cycle).
  • Other terms used to refer to phasing and prephasing include falling behind, moved ahead, lagging, leading, dephasing, post phasing, out-of-phase, out-of-sync, out-of-step nucleotide synthesis, asynchronicity, carry forward (CF), incomplete or premature extension (IE), and droop (DR).
  • FIG. 30 illustrates one example of the phasing and prephasing effect 3000.
  • FIG. 30a shows that some strands of an analyte lead (red) while others lag behind (blue), leading to a mixed signal readout of the analyte.
  • FIG. 30b depicts the intensity output of analyte fragments with“C” impulses every 15 cycles in a heterogeneous background. Notice the anticipatory signals (gray arrow) and memory signals (black arrows) due to the phasing and prephasing effect 3000.
  • Spatial crosstalk refers to a signal or light emission from one or more non-associated analytes (or pixel areas) that is detected by a corresponding light detector of an associated analyte (or pixel area). Spatial crosstalk is caused by unwanted emissions from adjacent analytes.
  • the intensities of each analyte should correspond to just one analyte sequence. However, the observed intensities often contain signals from neighboring analyte sequences, other than the interrogated/target one, and, hence, are not pure.
  • FIG. 31 illustrates one example of spatial crosstalk.
  • FIG. 31 illustrates a detection device 3100 having a plurality of pixel areas 3156A-3156D on a detector surface 602.
  • the detection device 3100 includes light sensors 3136A-3136D.
  • the light sensors 3136A-3136D are associated with and correspond to the pixel areas 3156A-3156D, respectively.
  • Corresponding detection paths 3140A-3140D extend between the light sensors 3136A-3136D and corresponding pixel areas 3156A-3156D.
  • the arrows that indicate the detection paths 3140A-3140D are merely to illustrate a general direction that the light propagates through the respective detection path.
  • the detection device 3100 is configured to detect light using the light sensors 3136A-3136D.
  • light emissions or emission signals
  • the light emissions may be indicative of, for example, a positive binding event between the analytes located at the corresponding pixel area and another biomolecule.
  • the pixel areas 3156A-3156D are illuminated by an excitation light (e.g., 532 nm).
  • the pixel areas 3156A and 3156B are bound to respective biomolecules having light labels (e.g., fluorescent moieties).
  • the pixel areas 3156A and 3156B provide light emissions as demonstrated in FIG. 31.
  • the pixel areas 3156 and the light sensors 3136 may be located relatively close to one another such that light emissions from a non-associated pixel area may be detected by a light sensor. Such light emissions may be referred to as crosstalk emissions or spatial crosstalk.
  • the light emissions propagating from the pixel area 3156A include a crosstalk signal and a pixel signal.
  • the pixel signal of the light emissions from the pixel area 3156A is that signal of the light emissions that is configured to be detected by the light sensor 3136A.
  • the pixel signal includes the light emissions that propagate at an angle that is generally toward the light sensor 3136A such that filter walls 3130 defining the detection path 3140A are capable of directing the light emissions toward the light sensor 3136A.
  • the crosstalk signal is that signal of the light emissions that clears the filter walls 3130 defining the detection path 3140A and propagates into, for example, the detection path 3140B.
  • the crosstalk signal may be directed to the light sensor 3136B, which is not associated with the pixel area 3156A.
  • the light sensor 3136B may be referred to as a non-associated light sensor with respect to the pixel area 3156A.
  • the light sensor 3136A may detect the pixel emissions from the pixel area 3156A and the crosstalk emissions from the pixel area 3156B.
  • the light sensor 3136B may detect the pixel emissions from the pixel area 3156B and the crosstalk emissions from the pixel area 3156A.
  • the light sensor 3136C may detect the crosstalk emissions from the pixel area 3156B.
  • the pixel area 3156C is not providing light emissions in FIG. 31.
  • an amount of light detected by the light sensor 3136C is less than the corresponding amounts of light detected by the light sensors 3136A and 3136B.
  • the light sensor 3136C only detects crosstalk emissions from the pixel area 3156B, and the light sensor 3136D does not detect crosstalk emissions or pixel emissions.
  • Emission overlap refers to the recording of light from a single fluorophore in multiple channels.
  • CRT: cyclic reversible termination
  • Ideally, the different fluorophores would have distinct emission spectra and similar yields.
  • the emission spectra of the fluorophores used for sequencing are broad and overlap with one another. Thus, when one fluorophore is excited, its signal also passes through the optical filters of the other channels.
  • FIG. 32 illustrates one example of emission overlap 3200.
  • FIG. 32a shows that the spectrum of the G fluorophore (red) bleeds into the optical spectrum of the T filter (pink hatched region). Thus, when a G fluorophore is excited, a T signal will also be detected.
  • FIG. 32b is a two-dimensional histogram of intensity data of the T channel versus G channel.
  • the G fluorophores (right arrow) transmit to the T channel, hence the positive linearity.
  • the T fluorophores (left arrow) do not transmit to the G channel. Note that there is strong overlap between the "A" and the "C" channels, and the "G" and "T" channels; each pair of fluorescence channels is excited by the same laser.
  • Fading is an exponential decay in fluorescent signal intensity as a function of cycle number. As the sequencing run progresses, the analyte strands are washed excessively, exposed to laser emissions that create reactive species, and subject to harsh environmental conditions. All of these lead to a gradual loss of fragments in each analyte, decreasing its fluorescent signal intensity. Fading is also called dimming or signal decay.
  • FIG. 33 illustrates one example of fading 3300. In FIG. 33, the intensity values of analyte fragments with AC microsatellites show exponential decay.
  • the 3D convolutions 402 detect and account for these biases during the convolution-based base calling 400.
  • the 3D convolution filters 304 of the 3D convolutions 402 such as the first 3D convolution filter 418, convolve over - (i) a plurality of the per-cycle image patches along a temporal dimension 428k to detect and account for phasing and prephasing effect between successive ones of the sequencing cycles caused by asynchronous readout of sequence copies of an associated analyte, (ii) a plurality of pixels in each of the per-cycle image patches along spatial dimensions 428w, 428h to detect and account for spatial crosstalk between adjacent analytes caused by detection of emissions from a non- associated analyte by a corresponding light sensor of an associated analyte, and (iii) each of the imaged channels along a depth dimension 428c to detect and account for emission overlap between the imaged channels caused by overlap of dye emission spectra.
  • the 3D convolution filters 304 learn to associate observed inter-cycle emissions that cumulatively create intensity patterns representative of: (i) the signal of the underlying base morphology at the current sequencing cycle and (ii) the noise contributed by the flanking sequencing cycles as the phasing and prephasing effect 3000, with the correct base call prediction for the current sequencing cycle (which, during training, is communicated via the ground truth 2608).
  • the 3D convolution filters 304 learn to associate observed inter-analyte emissions that cumulatively create intensity patterns representative of: (i) the signal of the interrogated/target analyte and (ii) the noise contributed by the adjacent analytes as the spatial crosstalk 3100, with the correct base call prediction for the interrogated/target analyte (which, during training, is communicated via the ground truth 2608).
  • the 3D convolution filters 304 learn to associate observed inter-channel emissions that cumulatively create intensity patterns representative of: (i) the signal of the excited fluorophore in the corresponding imaged channel and (ii) the noise contributed by the non-excited fluorophore(s) in the non-corresponding imaged channel(s) as the emission overlap 3200, with the correct base call prediction component for the corresponding imaged channel (which, during training, is communicated via the ground truth 2608).
  • the 3D convolution filters 304 learn to associate observed progressive decrease of the intensity values in the elapsed cycles caused by the fading 3300 with the correct base call prediction for the sequencing cycles (which, during training, is communicated via the ground truth 2608).
  • the 3D convolution filters 304 are trained on image data obtained for a variety of flow cells, sequencing instruments, sequencing runs, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities, and therefore learn many different types of such associations found in the raw data and are optimized over many instances or examples of each type of association. In some implementations, hundreds, thousands, or millions of training examples are used. The optimization includes adjusting/evolving/updating the coefficients/weights/parameters of the convolution kernels (and biases) of the 3D convolution filters 304 to minimize the loss between the predicted base calls and the correct base calls identified by the ground truth.
  • the loss is minimized using stochastic gradient descent with backpropagation.
  • a 3D convolution filter produces at least one output feature as a result of convolving over the sequence of per-cycle image patches on the sliding convolution window basis.
  • the first 3D convolution filter 418 produces the output feature 502a.
  • FIG. 5 shows output features 502a-n produced by n 3D convolution filters 304, respectively.
  • An output feature comprises k feature elements corresponding to k sequencing cycles.
  • the neural network-based base caller 2614 uses this configuration to produce a base call for each sequencing cycle in a prediction.
  • the output features 502a-n are subjected to ReLU by the nonlinear activation function applier 504 to produce activated output features 502a-n.
  • FIG. 6 shows intensity data features generated for the center pixel 412 and used as supplemental input 324 in the convolution-based base calling 400 in accordance with one implementation.
  • the skip connection 326 selects intensity values of the center pixel 412 across the per-cycle pixel patches of the k sequencing cycles and creates intensity data features for the center pixel 412. The selection is done separately for each of the imaged channels. For example, the skip connection 326 accesses the pixel patches for the red channel 422r and selects intensity values of the center pixel 412 in the red channel 422r to create a red channel intensity data feature 602r.
  • the skip connection 326 accesses the pixel patches for the green channel 422g and selects intensity values of the center pixel 412 in the green channel 422g to create a green channel intensity data feature 602g.
  • the skip connection 326 concatenates the per-cycle intensity values to create the intensity data features.
  • the skip connection 326 sums the per-cycle intensity values to create the intensity data features.
  • the skip connection 326 supplements the output features 502a-n (or the activated output features 502a-n) with the red and green channel intensity data features 602r, 602g. This causes the neural network-based base caller 2614 to further attend to the intensity data of the center pixel 412.
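The skip connection's two options, and the supplementing step, can be sketched as follows; the cycle count, feature count, and random values are illustrative assumptions:

```python
import numpy as np

k = 8                                  # sequencing cycles (illustrative)
rng = np.random.default_rng(1)

# Center-pixel intensity selected from each per-cycle pixel patch, per channel.
red_center   = rng.uniform(size=k)
green_center = rng.uniform(size=k)

# Concatenating the per-cycle values keeps one element per cycle, yielding a
# k-element intensity data feature per imaged channel...
red_feature   = red_center             # shape (k,)
green_feature = green_center           # shape (k,)

# ...whereas summing instead collapses the cycles to a single value per channel.
red_feature_summed = red_center.sum()

# Supplementing stacks the intensity data features with the output features,
# so the downstream 1D convolutions also attend to the center pixel's data.
output_features = rng.normal(size=(3, k))   # stand-in for the output features 502a-n
supplemented = np.vstack([output_features, red_feature, green_feature])
```

Each supplemented row keeps its cycle-wise alignment, which is what the later cross-cycle multiplication relies on.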
  • the cascade 330 of 1D convolutions 308 is applied to produce the further output features 312.
  • the 1D convolutions 308 use different receptive fields to detect varying degrees of the asynchronous readout caused by the phasing and prephasing effect 3000.
  • FIG. 8 shows one implementation of a first 1D convolution filter 808 convolving over the supplemented output features 800, which comprise the output features 502a-n and the intensity data features 602r, 602g.
  • w1: weights/coefficients
  • each 1D convolution filter bank uses a different receptive field size l. In other implementations, some of the banks have the same l. Within the cascade 330, from one bank to the next, l can progressively increase, progressively decrease, randomly increase, randomly decrease, or randomly stay the same.
  • the weights in the 1D convolution filters 308 are element-wise multiplied with the feature elements of the supplemented output features 800. Since each feature element corresponds to one of the k sequencing cycles, element-wise multiplication between the weights and the corresponding feature elements is referred to herein as "cross-cycle multiplication." In one implementation, the cross-cycle multiplication results in mixing of information between the sequencing cycles. As l changes, the window of sequencing cycles between which the information is mixed also changes to account for different numbers of flanking sequencing cycles that contribute to the signal of a current sequencing cycle (t), i.e., different levels/orders/degrees of phasing (t-1, t-2, t-3, etc.) and prephasing (t+1, t+2, t+3, etc.).
  • t: current sequencing cycle
  • One instance of the cross-cycle multiplication and subsequent summation yields an intermediate output feature.
  • the intermediate output features 804 are identified using the notation fi,j, where i denotes the output feature or the intensity data feature and j denotes the cycle number.
  • By use of SAME padding, the cross-cycle multiplication and summation across the supplemented output features 800 results in k intermediate output features corresponding to the k sequencing cycles.
  • the output of the first 1D convolution filter 808 convolving over the supplemented output features 800 is a further output feature 902a.
  • the further output feature 902a is produced by cross-feature accumulation 826 of the intermediate output features 804 such that intermediate output features at the same cycle position (same j) are summed to produce a feature element for that cycle position in the further output feature 902a.
  • the cross-feature accumulation 826 results in the further output feature 902a having k feature elements that correspond to the k sequencing cycles.
  • the neural network-based base caller 2614 uses this configuration to produce a base call for each sequencing cycle in a prediction.
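Putting the two steps together, the cross-cycle multiplication with SAME padding followed by cross-feature accumulation can be sketched as below; the receptive field l = 3, the feature counts, and the random values are illustrative assumptions:

```python
import numpy as np

k, n_features, l = 10, 4, 3                        # cycles, supplemented features, receptive field
rng = np.random.default_rng(2)
supplemented = rng.normal(size=(n_features, k))    # output + intensity data features
weights = rng.normal(size=(n_features, l))         # one l-tap kernel per feature

# Cross-cycle multiplication with SAME (zero) padding: for each input feature,
# k intermediate output features, one per cycle position j.
intermediate = np.empty((n_features, k))
for f in range(n_features):
    padded = np.pad(supplemented[f], (l // 2, l // 2))
    for j in range(k):
        intermediate[f, j] = np.dot(weights[f], padded[j:j + l])

# Cross-feature accumulation: sum the intermediates at the same cycle position j,
# leaving k feature elements - one per sequencing cycle.
further_output_feature = intermediate.sum(axis=0)  # shape (k,)
```

Because the receptive field spans l cycles, each element of the further output feature mixes information from a cycle and its flanking cycles, which is how varying degrees of phasing/prephasing are detected.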
  • each bank in the cascade 330 uses a set of 1D convolution filters.
  • Each 1D convolution filter, as a result of convolving over the supplemented output features 800, produces a further output feature.
  • further output features 902a-n are produced by n 1D convolution filters 308, respectively.
  • the further output features 902a-n are subjected to ReLU by the nonlinear activation function applier 504 to produce activated further output features 902a-n.
  • the further output features produced by the last bank of 1D convolution filters in the cascade 330 are fed as input to the pointwise convolution filters 310.
  • the activated further output features are fed as input.
  • the number of pointwise convolution filters applied on the ultimate further output features is a function of the number of analytes (pixels) that are to be base called (p). In another implementation, it is a function of: (i) the number of analytes (pixels) that are to be base called (p) as well as (ii) the number of imaged channels for which a base call prediction component (c) is generated by the neural network-based base caller 2614.
  • the number of pointwise convolution filters is p × c, i.e., 2.
  • two pointwise convolution filters 1008, 1048 produce the final output features 1112, 1132 by cross-feature accumulations 1026, 1066, respectively.
  • the pointwise convolution filters 1008, 1048 have their own respective kernel weight/coefficient, which is separately applied on the further output features 328.
  • the resulting final output features 312 have k feature elements corresponding to the k sequencing cycles. Each final output feature corresponds to one of the imaged channels for which a base call prediction component is generated by the neural network-based base caller 2614.
  • the first final output feature 1112 corresponds to the base call prediction component generated for the red channel 422r and the second final output feature 1132 corresponds to the base call prediction component generated for the green channel 422g.
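A pointwise (1×1) convolution of this kind can be sketched as follows; the feature counts, weight values, and the single-pixel (p = 1), two-channel (c = 2) setup are illustrative assumptions:

```python
import numpy as np

k, n_further = 10, 6                   # cycles, further output features (illustrative)
rng = np.random.default_rng(3)
further = rng.normal(size=(n_further, k))

# One target pixel (p = 1) and two imaged channels (c = 2) -> p x c = 2 filters.
w_red, w_green = rng.normal(size=(2, n_further))

# A pointwise filter weights each further output feature and accumulates across
# features, leaving k feature elements - one per sequencing cycle.
final_red   = w_red @ further          # shape (k,), red-channel prediction component
final_green = w_green @ further        # shape (k,), green-channel prediction component
```

Each final output feature thus remains cycle-aligned, one unnormalized value per sequencing cycle per imaged channel.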
  • the output layer 314 operates on the final output features 312 and produces the base calls 1138.
  • the final output features 312 comprise unnormalized per-cycle values 1122.
  • the nonlinear activation function applier 504 converts the unnormalized per-cycle values 1122 into normalized per-cycle values 1134.
  • the nonlinear activation function applier 504 applies a sigmoid function that squashes the unnormalized per-cycle values 1122 between zero and one, as shown in FIG. 11 with respect to the normalized per-cycle values 1134.
  • a binary assigner 1126 then converts the normalized per-cycle values 1134 into per- cycle binary values 1136 based on a threshold (e.g., 0.5).
  • the binary assigner 1126 can be part of the output layer 314. In one implementation, those squashed per-cycle values that are below the threshold are assigned a zero value and those squashed per-cycle values that are above the threshold are assigned a one value.
  • a base assigner 1128 then base calls the associated analyte of the center pixel 412 at each of the k sequencing cycles based on the per-cycle binary values 1136 at corresponding positions (e.g., i1, i2, i3, ..., ik) in the final output features 312.
  • the base assigner 1128 can be part of the output layer 314.
  • the base calls 1138 are assigned using a 2-channel sequencing base calling scheme 1102 that uses on (1) and off (0) bits to assign a base letter.
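The sigmoid-threshold-assign pipeline can be sketched as below. The thresholding follows the description above, but the exact bit-to-base mapping is instrument-specific; the mapping used here (A = both bits on, C = red only, T = green only, G = both off) is an assumption for illustration:

```python
import numpy as np

def sigmoid(x):
    """Squash unnormalized values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Assumed 2-channel scheme: (red bit, green bit) -> base letter.
BASE_FROM_BITS = {(1, 1): "A", (1, 0): "C", (0, 1): "T", (0, 0): "G"}

red_raw   = np.array([2.3, -1.7,  3.1, -0.4])   # unnormalized per-cycle values
green_raw = np.array([1.9,  2.2, -2.5, -1.1])

# Normalize, then binarize against the 0.5 threshold.
red_bits   = (sigmoid(red_raw)   > 0.5).astype(int)   # [1, 0, 1, 0]
green_bits = (sigmoid(green_raw) > 0.5).astype(int)   # [1, 1, 0, 0]

# Position-wise bit pairs yield one base call per sequencing cycle.
base_calls = [BASE_FROM_BITS[(r, g)] for r, g in zip(red_bits, green_bits)]
print(base_calls)  # ['A', 'T', 'C', 'G']
```

Note that sigmoid(x) > 0.5 exactly when x > 0, so the threshold on the normalized values is equivalent to a sign test on the unnormalized ones.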
  • the output layer 314 comprises a softmax function that produces an exponentially normalized probability distribution of a base incorporated at a sequencing cycle in an associated analyte to be base called being A, C, T, and G, and classifies the base as A, C, T, or G based on the distribution.
  • the softmax function is applied by a softmax operator 2623, which can be part of the output layer 314.
  • softmax is an output activation function for multiclass classification.
  • training a so-called softmax classifier is regression to a class probability; it is not a true classifier, as it does not return the class but rather a confidence prediction of each class's likelihood.
  • the softmax function takes a class of values and converts them to probabilities that sum to one.
  • the softmax function squashes a k-dimensional vector of arbitrary real values to a k-dimensional vector of real values within the range zero to one.
  • y is a vector of length n, where n is the number of classes in the classification. These elements have values between zero and one, and sum to one so that they represent a valid probability distribution.
  • Softmax activation function 13406 is shown in Figure 134.
  • the name“softmax” can be somewhat confusing.
  • the function is more closely related to the argmax function than the max function.
  • the term“soft” derives from the fact that the softmax function is continuous and differentiable.
  • the argmax function, with its result represented as a one-hot vector, is not continuous or differentiable.
  • the softmax function thus provides a“softened” version of the argmax. It would perhaps be better to call the softmax function“softargmax,” but the current name is an entrenched convention.
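A minimal sketch of the softmax classification described above, with illustrative scores for the four classes A, C, T, G (the max-subtraction is a standard numerical-stability detail, not something stated in the source):

```python
import numpy as np

def softmax(z):
    """Exponentially normalize arbitrary real values into a probability distribution."""
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

# Unnormalized scores for the bases A, C, T, G (illustrative values).
scores = np.array([2.0, 0.5, 0.1, -1.0])
probs = softmax(scores)

# The values lie in (0, 1) and sum to one - a valid probability distribution -
# and the base is classified by the highest-probability class.
print("ACTG"[probs.argmax()])   # 'A'
```

Because the normalization is monotone, argmax of the probabilities always agrees with argmax of the raw scores: a "softened" argmax, as noted above.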
  • the neural network-based base caller 2614 can simultaneously base call a plurality of associated analytes depicted by corresponding pixels in a pixel patch 1202.
  • FIG. 12 shows intensity data features 1204r, 1204g generated for the pixel patch 1202 and used as supplemental input 1200 in the convolution-based base calling 1400 in accordance with one implementation.
  • FIG. 13 illustrates the output features 502a-n supplemented 1300 with the intensity data features 1204r, 1204g in accordance with one implementation.
  • FIG. 14 illustrates the output layer 314 processing the final output features 1402 produced by the pointwise convolutions and emitting base calls 1408 for pixels in the pixel patch 1202 in accordance with one implementation.
  • FIG. 14 also shows the normalized per-cycle values 1404 for the pixel patch 1202 and the per-cycle binary values 1406 for the pixel patch 1202.

Base Calling - Segregated Convolutions
  • FIG. 15 depicts one implementation of the convolution-based base calling 1500 using segregated convolutions that do not mix information between the imaged channels.
  • the image data 1502 has pixel intensity data in two channels, a red channel and a green channel.
  • a first 3D convolution filter 1516a has two convolution kernels: a red kernel 1514 and a green kernel 1524.
  • the red kernel 1514 convolves over the pixel intensity data in the red channel
  • the green kernel 1524 convolves over the pixel intensity data in the green channel.
  • Red kernels of n 3D convolution filters produce n red output channels 1504.
  • Green kernels of the n 3D convolution filters produce n green output channels 1534.
  • the outputs of the red and green kernels are not mixed and are kept segregated. Then, separate processing pipelines are initiated for the red and green output channels 1504, 1534 such that downstream convolutions that operate on the red and green output channels 1504, 1534 do not mix information between the red and green output channels 1504, 1534.
  • the downstream convolutions (e.g., 1D convolutions and pointwise convolutions) produce separate red and green output channels such as 1506 (red), 1546 (green) and 1508 (red), 1558 (green).
  • a sigmoid function 1528 produces a binary sequence for the red channel 1530r and a binary sequence for the green channel 1530g, which are in turn used to infer base calls 1532 based on the position-wise pairs.
  • FIG. 16 depicts one implementation of the convolution-based base calling 1600 using segregated 3D convolutions that do not mix information between the imaged channels and 1D convolutions that mix information between the imaged channels.
  • the image data 1602 has pixel intensity data in two channels, a red channel and a green channel.
  • a first 3D convolution filter 1616a has two convolution kernels: a red kernel 1614 and a green kernel 1624. The red kernel 1614 convolves over the pixel intensity data in the red channel and the green kernel 1624 convolves over the pixel intensity data in the green channel.
  • Red kernels of n 3D convolution filters produce n red output channels 1604.
  • Green kernels of the n 3D convolution filters produce n green output channels 1634.
  • the outputs of the red and green kernels 1604, 1634 are not mixed and are kept segregated.
  • downstream convolutions (e.g., 1D convolutions)
  • downstream convolutions that operate on the red and green output channels 1604, 1634 mix information between the red and green output channels 1604, 1634 and produce mixed output channels 1606.
  • the mixed output channels 1606 are subjected to pointwise convolutions to produce separate red and green final output channels 1608 (red), 1658 (green). Then, a sigmoid function 1628 produces a binary sequence for the red channel 1630r and a binary sequence for the green channel 1630g, which are in turn used to infer base calls 1632 based on the position-wise pairs.
  • the neural network-based base caller 2614 uses the normalized per-cycle values 1134 in the final output features 312 of the imaged channels to assign quality scores 2610.
  • the quality score mapping is determined by: (i) calculating predicted error rates for
  • the sigmoid outputs as the normalized per-cycle values 1134 can be used to interpret the quality scores 2610 as follows:
  • the quality scores 2610 are generated by a quality score mapper
  • FIG. 34a shows one
  • FIG. 34b shows the observed correspondence between the channel-wise sigmoid scores and the predicted quality scores.
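The surrounding passage is truncated, so as a hedged illustration only: a common convention (not necessarily the exact mapping used here) converts a predicted error rate into a Phred-scaled quality score, and `error_from_sigmoid` is a hypothetical helper that treats a channel-wise sigmoid score's distance from its nearest binary value as the predicted error rate:

```python
import math

def phred_quality(p_error):
    """Phred-scaled quality score for a predicted base-call error probability."""
    return -10.0 * math.log10(p_error)

def error_from_sigmoid(s):
    """Assumed predicted error rate: distance from the nearest binary value."""
    return min(s, 1.0 - s)

# A sigmoid score very close to 1 (or 0) implies a confident call.
s = 0.999
print(round(phred_quality(error_from_sigmoid(s)), 1))  # 30.0
```

Under this convention, an error probability of 0.001 corresponds to Q30 and 0.0001 to Q40.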
  • the compact convolution-based base calling uses image data from a subset of the k sequencing cycles to predict a base call on a cycle-by-cycle basis. It also uses fewer convolution filters per convolution window compared to the convolution-based base calling 300 discussed above. For these reasons, the compact convolution-based base calling is more suited for real-time base calling and implementation on central processing unit (CPU) computing.
  • CPU: central processing unit
  • the compact convolution-based base calling uses signals from a previous timestep/convolution window/sequencing cycle. These signals include: (i) the base call predicted in the previous timestep/convolution window/sequencing cycle and (ii) the probability distribution of the polymerase population movement in the previous sequencing cycle.
  • the compact convolution-based base calling uses 3D convolutions, 1D convolutions, and pointwise convolutions to predict the base call.
  • the compact convolution-based base calling involves processing the sequence of per-cycle image patches on a sliding convolution window basis such that, in a timestep/convolution window/sequencing cycle, it uses as input: (i) image data comprising a per-cycle image patch for a current sequencing cycle (t), per-cycle image patches for one or more successive sequencing cycles (t+1, t+2, ...), and per-cycle image patches for one or more preceding sequencing cycles (t-1, t-2, ...), (ii) phasing and prephasing data, and (iii) base context data, and produces, as output, a base call for the current timestep/convolution window/sequencing cycle.
  • the compact convolution-based base calling further involves sequentially outputting the base call at each successive timestep/convolution window/sequencing cycle and base calling the associated analytes at each of the sequencing cycles.
  • the phasing and prephasing data 1800 represents probability distribution of the polymerase population movement 1700.
  • the probability distribution 1700 is across sequence copies of an associated analyte 1702 for: (i) a current sequence position 1724 corresponding to the current sequencing cycle (t), (ii) leading sequence positions 1728 corresponding to the successive sequencing cycles (t+1, t+2, ...), and (iii) lagging sequence positions 1722 corresponding to the preceding sequencing cycles (t-1, t-2, ...).
  • a majority of the polymerase population 1744 observes a normal incorporation 1714 of base C in a complementary strand 1766 of DNA template 1756.
  • a first minority of the polymerase population 1744 observes prephasing 1718 at a first successive sequencing cycle (t+1, base A) and at a second successive sequencing cycle (t+2, base G) in the complementary strand 1766 of the DNA template 1756.
  • a second minority of the polymerase population 1744 observes the phasing 1712 at a first preceding sequencing cycle (t-1, base G) and at a second preceding sequencing cycle (t-2, base T) in the complementary strand 1766 of the DNA template 1756.
  • FIG. 17 also shows an example 1734 of the probability distribution of the polymerase population movement 1700.
  • the probability distribution sums to one.
  • Other examples of probability distribution are 0.0017, 0.9970, 0.0012 (three cycles); 0.0017, 0.9972, 0.0011 (three cycles); and 3.70e-4, 1.28e-4, 8.04e-5, 9.77e-8, 1.05e-7, 1.22e-4, 1.57e-6, 1.67e-3, 9.96e-1, 1.04e-3 (ten cycles).
  • the phasing and prephasing data 1800 is generated by transposed convolution 3500 using one or more convolution kernels.
  • FIG. 18 shows one example of generating the phasing and prephasing data 1800 using a convolution kernel 1802.
  • the convolution kernel 1802 has three weights/coefficients a, b, c, which are learned during the training.
  • the values represented by the letters a, b, c are for illustration purposes and, in operation, are numbers resulting from the transposed convolution 3500.
  • an initial probability distribution 1804 of the polymerase population movement assumes that all of the polymerase population 1744 is at a first sequence position, i.e., [1, 0, 0, 0, ...]. This way, the initial probability distribution 1804 is preset to specify that, at the first sequencing cycle, the polymerase population movement is limited to the first sequence position.
  • the initial probability distribution 1804 of the polymerase population movement includes position-specific parameters (a) 1806.
  • the position-specific parameters (a) 1806 start from the first sequence position and span one or more successive sequence positions. They are learned during the training to account for the polymerase population movement extending beyond the first sequence position at the first sequencing cycle.
  • the phasing and prephasing data 1800 is determined by transposed convolution 3500 of the convolution kernel 1802 with a probability distribution of the polymerase population movement at a preceding sequencing cycle (t-1).
  • the transposed convolution 3500 is applied recurrently or repeatedly 1816 until a probability distribution for each of the k sequencing cycles is generated.
  • the probability distribution 1814 at cycle 2 is produced as a result of transposed convolution 3500 between the convolution kernel 1802 and the initial probability distribution 1804 at cycle 1; the probability distribution 1824 at cycle 3 is produced as a result of transposed convolution 3500 between the convolution kernel 1802 and the probability distribution 1814 at cycle 2; the probability distribution 1834 at cycle 4 is produced as a result of transposed convolution 3500 between the convolution kernel 1802 and the probability distribution 1824 at cycle 3; and the probability distribution 1844 at cycle 5 is produced as a result of transposed convolution 3500 between the convolution kernel 1802 and the probability distribution 1834 at cycle 4.
  • SAME or zero padding is used when the convolution kernel 1802 transposedly convolves over the initial probability distribution 1804.
  • the transposed convolution 3500 produces a k x k phasing and prephasing matrix 1800 in which: (i) the rows respectively denote the k sequencing cycles and (ii) the columns also respectively denote the k sequencing cycles. Each row represents the probability distribution of the polymerase population at the corresponding sequencing cycle. Each column specifies the probability of the polymerase population being at a corresponding current sequencing cycle or at a flanking sequencing cycle.
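The recurrent construction of the k x k phasing and prephasing matrix can be sketched in a few lines of numpy. This is an illustrative reading of the text, not the patented implementation: the cropping convention (keeping the first k positions so the in-sync mass advances one position per cycle) is an assumption about how the SAME/zero padding is applied.

```python
import numpy as np

def phasing_matrix(kernel, k):
    """Build a k x k phasing/prephasing matrix by recurrently applying a
    transposed convolution of a learned kernel to the distribution from the
    previous cycle; row t is the polymerase-position distribution at cycle t.
    Cropping to the first k positions is an assumed padding convention."""
    kernel = np.asarray(kernel, float)
    kk = len(kernel)
    M = np.zeros((k, k))
    M[0, 0] = 1.0                          # all of the population starts in sync
    for t in range(1, k):
        full = np.zeros(k + kk - 1)
        for i, p in enumerate(M[t - 1]):
            full[i:i + kk] += p * kernel   # transposed convolution: spread and sum
        M[t] = full[:k]                    # crop back to k sequence positions
    return M

# Example kernel from the text: (phasing, in-sync, prephasing) probabilities.
M = phasing_matrix([0.0017, 0.9970, 0.0012], k=5)
print(M.shape)   # (5, 5)
```

Each row stays (nearly) a probability distribution, and the dominant probability tracks the diagonal, i.e., the in-sync position at each cycle.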
  • FIG. 35 shows one example of how the transposed convolution 3500 is used to calculate the probability distribution as output 3552.
  • the example uses one stride and sums 3542 the intermediate outputs 3512, 3522, 3532 at overlapping positions.
  • the intermediate outputs 3512, 3522, 3532 are calculated by multiplying each element of the convolution kernel 1802 with each element of input 3502.
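The multiply-and-sum mechanics of FIG. 35 reduce to a short function: each input element produces one intermediate output (a scaled copy of the kernel), and intermediate outputs are summed at overlapping positions. The numbers below are illustrative only; in practice the kernel weights are learned.

```python
import numpy as np

def transposed_conv1d(x, kernel):
    """One stride-1 transposed convolution as described in the text: every
    input element scales a full copy of the kernel (an intermediate output),
    and the intermediate outputs are summed where they overlap."""
    x, kernel = np.asarray(x, float), np.asarray(kernel, float)
    out = np.zeros(len(x) + len(kernel) - 1)
    for i, xi in enumerate(x):
        out[i:i + len(kernel)] += xi * kernel   # intermediate output for x[i]
    return out

# Overlapping sums: [1, 12, 23, 30]
print(transposed_conv1d([1.0, 2.0, 3.0], [1.0, 10.0]))
```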
  • the transposed convolution 3500 is performed by a transposed convolution operator 2619, which can be part of the neural network-based base caller 2614.
  • m convolution kernels are used to generate the phasing and prephasing data 1800 and the weights/coefficients of the m convolution kernels are learned during the training. That is, each of the m convolution kernels is used to generate a respective k x k phasing and prephasing matrix by use of recurrent transposed convolution. Accordingly, the phasing and prephasing data 1800 comprises m phasing and prephasing channels 2606 determined for the current sequencing cycle (t) from corresponding convolution kernels in the m convolution kernels.
  • a phasing and prephasing channel for a corresponding current sequencing cycle includes a subset of elements (also called a “window-of-interest”) from a row of a k x k phasing and prephasing matrix generated by a convolution kernel.
  • the row represents the probability distribution of the polymerase population at the corresponding current sequencing cycle.
  • the window-of-interest comprises as many elements as the number of sequencing cycles for which the image data is used as input.
  • the window-of-interest is centered at a probability value for the corresponding current sequencing cycle and includes left and right flanking probability values for the left and right flanking sequencing cycles. For example, if the image data is for three sequencing cycles: a current sequencing cycle (t), a successive/right flanking sequencing cycle (t+1), and a preceding/left flanking sequencing cycle (t-1), then the window-of-interest includes three elements.
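Extracting a window-of-interest from one row of the phasing matrix is a simple centered slice. The zero-padding at the sequence edges is an assumption for how out-of-range flanks are handled.

```python
import numpy as np

def window_of_interest(row, t, flank):
    """Take the window-of-interest from one row of the phasing matrix: the
    probability at current cycle t plus `flank` flanking values on each
    side, zero-padded at the edges (padding behavior is an assumption)."""
    padded = np.pad(np.asarray(row, float), flank)
    return padded[t:t + 2 * flank + 1]

row = [0.001, 0.997, 0.002, 0.0, 0.0]          # distribution at cycle t = 1
print(window_of_interest(row, t=1, flank=1))   # three elements, centered at t
```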
  • the phasing and prephasing data 1800 is generated by a phasing, prephasing data generator 2630, which can be part of the neural network-based base caller 2614.
  • the base context data 1900, 2000 identifies: (i) bases called in one or more preceding sequencing cycles and (ii) base call possibilities in the current sequencing cycle and the successive sequencing cycles.
  • the base context data 1900, 2000 identifies the bases called and the base call possibilities using a base encoding that represents each base by assigning a value for each of the imaged channels.
  • the base context data 1900, 2000 identifies the base call possibilities using an r-input truth table, with r representing a count of the current sequencing cycle and the successive sequencing cycles.
  • FIG. 19 shows the base context data 1900 for three sequencing cycles: a current sequencing cycle (t), a previous sequencing cycle (t-1), and a future sequencing cycle (t+1).
  • the base context data 1900 is generated for a red channel 1912 and a green channel 1922.
  • the known base call prediction components for the previous sequencing cycle (t-1) are kept fixed.
  • the base call 1902 in the previous sequencing cycle (t-1) was C, with a 0 base call prediction component in the red channel 1912 and a 1 base call prediction component in the green channel 1922.
  • the base context data 1900 for the red and green channels 1912, 1922 is row- wise concatenated to produce the respective base context channels 2607.
  • FIG. 20 shows the base context data 2000 for five sequencing cycles: a current sequencing cycle (t), a first previous sequencing cycle (t-1), a second previous sequencing cycle (t-2), a first future sequencing cycle (t+1), and a second future sequencing cycle (t+2).
  • the base context data 2000 is generated for a red channel 2012 and a green channel 2022.
  • the known base call prediction components for the first previous sequencing cycle (t-1) and the second previous sequencing cycle (t-2) are kept fixed.
  • the base call 2002 in the first previous sequencing cycle (t-1) was C, with a 0 base call prediction component in the red channel 2012 and a 1 base call prediction component in the green channel 2022.
  • the base call 2004 in the second previous sequencing cycle (t-2) was A, with a 1 base call prediction component in the red channel 2012 and a 0 base call prediction component in the green channel 2022.
  • the truth table-style encoding is used to list the base call possibilities for the current sequencing cycle (t), the first future sequencing cycle (t+1), and the second future sequencing cycle (t+2).
  • the base context data 2000 for the red and green channels 2012, 2022 is row-wise concatenated to produce the respective base context channels 2607.
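A toy version of the two-channel base context construction follows. Only the C and A encodings come from the text; the T and G codes, and the use of -1 as a placeholder for the truth-table of open possibilities, are illustrative assumptions.

```python
# Hypothetical two-channel base encoding, consistent with the examples in
# the text (C -> red 0 / green 1, A -> red 1 / green 0); the T and G codes
# below are illustrative assumptions, not taken from the source.
ENCODING = {"A": (1, 0), "C": (0, 1), "T": (1, 1), "G": (0, 0)}

def base_context(prev_calls, n_open):
    """Build the red and green base-context channels: fixed prediction
    components for bases already called, followed by placeholder slots
    (-1 here) standing in for the truth-table of open possibilities at the
    current and future cycles; channels are row-wise concatenated."""
    red = [ENCODING[b][0] for b in prev_calls] + [-1] * n_open
    green = [ENCODING[b][1] for b in prev_calls] + [-1] * n_open
    return red + green

# Five-cycle window (FIG. 20 style): two previous calls A then C, with the
# current and two future cycles still open.
print(base_context(["A", "C"], n_open=3))
# [1, 0, -1, -1, -1, 0, 1, -1, -1, -1]
```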
  • the base context data 1900, 2000 is generated by a base context data generator 2631, which can be part of the neural network-based base caller 2614.
  • the base context channels also include as many elements as the number of sequencing cycles for which the image data is used as input, as discussed above.
  • the compact convolution-based base calling 2100 uses image data for three sequencing cycles per timestep/convolution window/sequencing cycle to predict a base call on a cycle-by- cycle basis.
  • the base call prediction from one previous timestep/convolution window/sequencing cycle is used to create the base context data 1900 for a current timestep/convolution window/sequencing cycle.
  • the probability distribution of the polymerase population movement in the previous sequencing cycle is used to create the phasing and prephasing data (window-of-interest with three elements) for the current timestep/convolution window/sequencing cycle.
  • data from a previous timestep/convolution window/sequencing cycle is provided to a next timestep/convolution window/sequencing cycle by a data propagator 2624.
  • the image data 2142t comprises per-cycle image patches for sequencing cycle 1 and sequencing cycle 2, along with SAME or zero padding.
  • the phasing and prephasing data (h0) 2122 for sequencing cycle 1 comprises the initial probability distribution of the polymerase population movement for m convolution kernels.
  • the previous base call (b0) 2102, i.e., the base context data, is set to be a starting value or token (<s>) that is learned during training.
  • a base call prediction 2104 is made for sequencing cycle 1.
  • the base call prediction 2104 made for sequencing cycle 1 is used to prepare the base context data 2106 for sequencing cycle 2, as discussed above.
  • the phasing and prephasing data (h0) 2122 for sequencing cycle 1 is used to prepare the phasing and prephasing data (h1) 2124 for sequencing cycle 2 by use of transposed convolution 2132 with m convolution kernels, as discussed above.
  • the phasing and prephasing data for each of the sequencing cycles can be prepared in advance by generating the k x k phasing and prephasing matrix using the transposed convolution 2132 with m convolution kernels, as discussed above.
  • each of the m convolution kernels is kept fixed across the timesteps/convolution windows/sequencing cycles.
  • the image data 2142t+1 comprises per-cycle image patches for sequencing cycle 1, sequencing cycle 2, and sequencing cycle 3.
  • the image data 2142t+1, the base context data 2106, and the phasing and prephasing data (h1) 2124 are used to produce a base call prediction 2108 for sequencing cycle 2.
  • the base call prediction 2108 made for sequencing cycle 2 is used to prepare the base context data 2110 for sequencing cycle 3, as discussed above.
  • the phasing and prephasing data (h1) 2124 for sequencing cycle 2 is used to prepare the phasing and prephasing data (h2) 2126 for sequencing cycle 3 by use of the transposed convolution 2132 with m convolution kernels, as discussed above.
  • the image data 2142t+2 comprises per-cycle image patches for sequencing cycle 2, sequencing cycle 3, and sequencing cycle 4.
  • the image data 2142t+2, the base context data 2110, and the phasing and prephasing data (h2) 2126 are used to produce a base call prediction 2112 for sequencing cycle 3.
  • the base call prediction 2112 made for sequencing cycle 3 is used to prepare the base context data 2114 for sequencing cycle 4, as discussed above.
  • the phasing and prephasing data (h2) 2126 for sequencing cycle 3 is used to prepare the phasing and prephasing data (h3) 2128 for sequencing cycle 4 by use of the transposed convolution 2132 with m convolution kernels, as discussed above.
  • the image data 2142t+3 comprises per-cycle image patches for sequencing cycle 3, sequencing cycle 4, and sequencing cycle 5.
  • the image data 2142t+3, the base context data 2114, and the phasing and prephasing data (h3) 2128 are used to produce a base call prediction for sequencing cycle 4.
  • the compact convolution-based base calling 2100 sequentially outputs the base call at each successive convolution window and base calls the associated analytes at each of the sequencing cycles.
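The cycle-by-cycle recurrence described above can be summarized as a small loop. Everything here is a schematic stand-in: `get_images`, `predict`, and `advance_phasing` are hypothetical callables representing the image window, the base-calling network, and the transposed-convolution update, not named components of the source.

```python
def compact_base_calling(n_cycles, get_images, predict, advance_phasing):
    """Schematic of the compact cycle-by-cycle loop: each timestep consumes
    its image window, the base context carried over from the previous call,
    and the phasing state, then propagates both forward to the next step."""
    context = "<s>"            # learned start token for the first cycle
    phasing = [1.0]            # initial distribution: whole population in sync
    calls = []
    for t in range(n_cycles):
        base = predict(get_images(t), context, phasing)
        calls.append(base)
        context = base                      # base context for the next cycle
        phasing = advance_phasing(phasing)  # e.g., one transposed convolution
    return calls

# Toy wiring that only demonstrates the data flow, not a real model.
calls = compact_base_calling(
    4,
    get_images=lambda t: None,
    predict=lambda img, ctx, ph: "ACGT"[len(ph) % 4],
    advance_phasing=lambda ph: ph + [0.0],
)
print(calls)   # ['C', 'G', 'T', 'A']
```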
  • tile-wide global channels 2152t, 2152t+1, 2152t+2, and 2152t+3 are fed as supplemental input to the respective timesteps/convolution windows.
  • the per-cycle, tile-wide global channels 2601 are determined by a global channel calculator 2626.
  • the per-cycle, tile-wide global channels 2601 are determined using singular value decomposition (SVD) or principal component analysis (PCA).
  • SVD singular value decomposition
  • PCA principal component analysis
  • a per-cycle, tile-wide global channel includes a set of principal components of the image data features in image data obtained at a corresponding sequencing cycle from the associated analytes disposed across the tile.
  • the image data features include at least one of background, spatial crosstalk, phasing and prephasing effect, emission overlap, signal intensity, and intensity decay.
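A minimal numpy sketch of computing such a global channel via SVD follows. The text names SVD/PCA but not this exact formulation, so the per-feature centering, the component count, and the toy tile shape are all assumptions.

```python
import numpy as np

def global_channel(tile_pixels, n_components=4):
    """Sketch of a per-cycle, tile-wide global channel: the top principal
    components of per-pixel image-data features for one cycle's tile,
    computed via SVD of the centered feature matrix."""
    X = np.asarray(tile_pixels, float)
    X = X - X.mean(axis=0)                    # center each feature
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_components]                  # principal directions

rng = np.random.default_rng(0)
tile = rng.random((100, 8))    # toy tile: 100 pixels x 8 intensity features
pcs = global_channel(tile)
print(pcs.shape)   # (4, 8)
```

The rows of the result are orthonormal directions capturing the dominant tile-wide structure (background, intensity decay, and so on, per the feature list above).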
  • the per-cycle, tile-wide global channels 2601 are fed as supplemental input to convolution windows of corresponding sequencing cycles.
  • the image data used to generate the per-cycle, tile-wide global channels is obtained from a variety of flow cells, sequencing instruments, sequencing runs, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
  • the image data is obtained from tile and flow cell data 2609 produced by a sequencer 2628.
  • the compact convolution-based base calling 2200 uses image data for five sequencing cycles per timestep/convolution window/sequencing cycle to predict a base call on a cycle-by- cycle basis.
  • the base call predictions from the two previous timesteps/convolution windows/sequencing cycles are used to create the base context data 2000 for a current timestep/convolution window/sequencing cycle.
  • the probability distribution of the polymerase population movement in the previous sequencing cycle is used to create the phasing and prephasing data (window-of-interest with five elements) for the current timestep/convolution window/sequencing cycle.
  • the image data 2234 comprises per-cycle image patches for sequencing cycles 1, 2, 3, 4, and 5.
  • the phasing and prephasing data for sequencing cycle 2 (not shown) is used to prepare the phasing and prephasing data 2212 for sequencing cycle 3 by use of transposed convolution 2224 with m convolution kernels, as discussed above.
  • each of the m convolution kernels is kept fixed across the timesteps/convolution windows/sequencing cycles.
  • the base context data 2000 for sequencing cycle 3 is constructed using the base call made at sequencing cycle 1, the base call 2202 made at sequencing cycle 2, the base call possibility at sequencing cycle 3, the base call possibility at sequencing cycle 4, and the base call possibility at sequencing cycle 5.
  • a base call prediction 2204 is made for sequencing cycle 3.
  • the image data 2238 comprises per-cycle image patches for sequencing cycles 2, 3, 4, 5, and 6.
  • the phasing and prephasing data 2212 for sequencing cycle 3 is used to prepare the phasing and prephasing data 2216 for sequencing cycle 4 by use of transposed convolution 2224 with m convolution kernels, as discussed above.
  • the base context data 2206 (with red and green base context channels 2206r, 2206g) for sequencing cycle 4 is constructed using the base call 2202 made at sequencing cycle 2, the base call 2204 made at sequencing cycle 3, the base call possibility at sequencing cycle 4, the base call possibility at sequencing cycle 5, and the base call possibility at sequencing cycle 6.
  • a base call prediction 2208 is made for sequencing cycle 4. Also, per-cycle supplementary global channels 2232, 2236 are fed as input to the respective timesteps/convolution windows.
  • the compact convolution-based base calling 2200 sequentially outputs the base call at each successive convolution window and base calls the associated analytes at each of the sequencing cycles.
  • FIG. 23 shows one implementation of the convolutions used to mix the image data 2302, the phasing and prephasing data 2316, and the base context data 2326 for the compact convolution-based base calling 2100, 2200 in a timestep/convolution window/sequencing cycle.
  • 3D convolutions 2304 are applied on the image data 2302 to produce the image channels 2306, as discussed above.
  • Transposed convolutions 2314 are used to generate the phasing and prephasing data 2316 with the phasing and prephasing channels, as discussed above.
  • Previous base calls 2324 are used to generate the base context data 2326 with base context channels.
  • the image channels 2306, the phasing and prephasing data 2316, and the base context data 2326 are then mixed using the cascade of 1D convolutions 330 and the pointwise convolutions 310 to produce the final output features 2328, 2330.
  • the final output features 2328, 2330 are fed to a fully-connected network 2348.
  • the fully-connected network 2348 produces unnormalized per-imaged channel values, which are converted to normalized per-imaged channel values 2358 by the nonlinear activation function applier 504.
  • the normalized per-imaged channel values 2358 are then converted to per-imaged channel binary values 2368 by the binary assigner 1126.
  • the per-imaged channel binary values 2368 are used by the base assigner 1128 to produce the base call 2378 for the current sequencing cycle.
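The output stage (normalize, binarize, assign) can be illustrated in a few lines. The (red, green) bit-to-base mapping below is an assumption consistent only with the C and A encodings given earlier in the text; the T and G rows are illustrative guesses.

```python
import numpy as np

# Assumed bit-pair-to-base mapping; only the A and C rows follow the text.
DECODE = {(1, 0): "A", (0, 1): "C", (1, 1): "T", (0, 0): "G"}

def call_base(unnormalized, threshold=0.5):
    """Sketch of the output stage: a sigmoid squashes the unnormalized
    per-imaged-channel values, a threshold binarizes them, and the binary
    pair is mapped to a base."""
    squashed = 1.0 / (1.0 + np.exp(-np.asarray(unnormalized, float)))
    bits = tuple(int(v > threshold) for v in squashed)
    return DECODE[bits]

print(call_base([3.2, -1.7]))   # red channel on, green off -> A
```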
  • FIG. 24 shows one implementation of pull-push and push-pull convolutions in which a combination 2400 of the 1D convolutions (pull) 2404, 2408, 2412, 2416 and transposed convolutions (push) 2406, 2410, 2414, 2418 is used for the compact convolution-based base calling 2100, 2200.
  • the combination 2400 alternates between application of the 1D convolutions and the transposed convolutions on the image data 2402.
  • a different bank of 3D convolution filters is used in each timestep/convolution window/sequencing cycle.
  • Each bank includes one to ten 3D convolution filters.
  • FIG. 25 depicts one implementation of performing the compact convolution-based base calling during inference 2506 on a central processing unit (CPU) by using image data from only a subset of the sequencing cycles.
  • the inference 2506 is performed using the per-cycle image patch for the current sequencing cycle, the per-cycle image patches for the one or more successive sequencing cycles, and the per-cycle image patches for the one or more preceding sequencing cycles.
  • the neural network-based base caller 2614 is trained on training data 2505, which in turn comprises sequencing data 2515.
  • the untrained model 2614 can be trained on CPU, GPU, FPGA, ASIC, and/or CGRA to produce the trained model 2614.
  • the trained model 2614 runs on the CPU and performs real-time base calling 2528 on incoming data 2508 that comprises sequencing data 2518, and produces base calls 2548.
  • Inference 2506 is operationalized by a tester 2629.
  • FIG. 26 is a block diagram 2600 that shows various system modules and data stores used for the convolution-based base calling and the compact convolution-based base calling in accordance with one implementation.
  • modules in this application can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some can also be implemented on different processors or computers, or spread among a number of different processors or computers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. Also as used herein, the term “module” can include “sub-modules,” which themselves can be considered herein to constitute modules. The blocks in the figures designated as modules can also be thought of as flowchart steps in a method.
  • Sequencing data 2515, 2518 is produced by a sequencing instrument or sequencer 2628 (e.g., Illumina’s Firefly, iSeq, HiSeqX, HiSeq3000, HiSeq4000, HiSeq2500, NovaSeq 6000, NextSeq, NextSeqDx, MiSeq and MiSeqDx).
  • Base calling is the process in which the raw signal of the sequencer 2628, i.e., intensity data extracted from images, is decoded into DNA sequences and quality scores.
  • the Illumina platforms employ cyclic reversible termination (CRT) chemistry for base calling.
  • CRT cyclic reversible termination
  • the process relies on growing nascent DNA strands complementary to template DNA strands with modified nucleotides, while tracking the emitted signal of each newly added nucleotide.
  • the modified nucleotides have a 3' removable block that anchors a fluorophore signal of the nucleotide type.
  • Sequencing occurs in repetitive cycles, each comprising three steps: (a) extension of a nascent strand by adding a modified nucleotide; (b) excitation of the fluorophores using one or more lasers of the optical system and imaging through different filters of the optical system, yielding sequencing images; and (c) cleavage of the fluorophores and removal of the 3’ block in preparation for the next sequencing cycle. Incorporation and imaging cycles are repeated up to a designated number of sequencing cycles, defining the read length of all clusters. Using this approach, each cycle interrogates a new position along the template strands.
  • the sequencing process occurs in a flow cell - a small glass slide that holds the input DNA fragments during the sequencing process.
  • the flow cell is connected to the high-throughput optical system, which comprises microscopic imaging, excitation lasers, and fluorescence filters.
  • the flow cell comprises multiple chambers called lanes.
  • the lanes are physically separated from each other and may contain different tagged sequencing libraries, distinguishable without sample cross contamination.
  • the imaging device e.g., a solid-state imager such as a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) sensor
  • CCD charge-coupled device
  • CMOS complementary metal-oxide-semiconductor
  • a tile holds hundreds of thousands to millions of clusters.
  • a cluster comprises approximately one thousand identical copies of a template molecule, though clusters vary in size and shape.
  • the clusters are grown from the template molecule, prior to the sequencing run, by bridge amplification of the input library.
  • the purpose of the amplification and cluster growth is to increase the intensity of the emitted signal since the imaging device cannot reliably sense a single fluorophore.
  • the physical distance of the DNA fragments within a cluster is small, so the imaging device perceives the cluster of fragments as a single spot.
  • the output of a sequencing run is the sequencing images, each depicting intensity emissions of clusters on the tile in the pixel domain for a specific combination of lane, tile, sequencing cycle, and fluorophore.
  • FIG. 36 is a computer system 3600 that can be used to implement the convolution-based base calling and the compact convolution-based base calling disclosed herein.
  • Computer system 3600 includes at least one central processing unit (CPU) 3672 that communicates with a number of peripheral devices via bus subsystem 3655.
  • peripheral devices can include a storage subsystem 3610 including, for example, memory devices and a file storage subsystem 3636, user interface input devices 3638, user interface output devices 3676, and a network interface subsystem 3674.
  • the input and output devices allow user interaction with computer system 3600.
  • Network interface subsystem 3674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • the neural network-based base caller 2614 is communicably linked to the storage subsystem 3610 and the user interface input devices 3638.
  • User interface input devices 3638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • pointing devices such as a mouse, trackball, touchpad, or graphics tablet
  • audio input devices such as voice recognition systems and microphones
  • User interface output devices 3676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • output device is intended to include all possible types of devices and ways to output information from computer system 3600 to the user or to another machine or computer system.
  • Storage subsystem 3610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 3678.
  • Deep learning processors 3678 can be graphics processing units (GPUs), field- programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 3678 can be hosted by a deep learning cloud platform such as Google Cloud PlatformTM, XilinxTM, and CirrascaleTM.
  • GPUs graphics processing units
  • FPGAs field- programmable gate arrays
  • ASICs application-specific integrated circuits
  • CGRAs coarse-grained reconfigurable architectures
  • Examples of deep learning processors 3678 include Google’s Tensor Processing Unit (TPU)TM, rackmount solutions like GX4 Rackmount SeriesTM, GX36 Rackmount SeriesTM, NVIDIA DGX-1TM, Microsoft’s Stratix V FPGATM, Graphcore’s Intelligent Processor Unit (IPU)TM, Qualcomm’s Zeroth PlatformTM with Snapdragon processorsTM, NVIDIA’s VoltaTM, NVIDIA’s DRIVE PXTM, NVIDIA’s JETSON TX1/TX2 MODULETM, Intel’s NirvanaTM, Movidius VPUTM, Fujitsu DPITM, ARM’s DynamicIQTM, IBM TrueNorthTM, and others.
  • TPU Tensor Processing Unit
  • Memory subsystem 3622 used in the storage subsystem 3610 can include a number of memories including a main random access memory (RAM) 3632 for storage of instructions and data during program execution and a read only memory (ROM) 3636 in which fixed instructions are stored.
  • a file storage subsystem 3636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 3636 in the storage subsystem 3610, or in other machines accessible by the processor.
  • Bus subsystem 3655 provides a mechanism for letting the various components and subsystems of computer system 3600 communicate with each other as intended. Although bus subsystem 3655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 3600 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 3600 depicted in FIG. 36 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 3600 are possible having more or less components than the computer system depicted in FIG. 36.
  • a neural network-implemented method of base calling analytes includes accessing a sequence of per-cycle image patches generated for a series of sequencing cycles of a sequencing run.
  • the pixels in the per-cycle image patches contain intensity data for associated analytes.
  • the intensity data is obtained for one or more imaged channels by corresponding light sensors configured to detect emissions from the associated analytes.
  • the method includes applying three-dimensional (3D) convolutions on the sequence of per-cycle image patches on a sliding convolution window basis.
  • a 3D convolution filter convolves over: (i) a plurality of the per-cycle image patches along a temporal dimension and detects and accounts for phasing and prephasing effect between successive ones of the sequencing cycles caused by asynchronous readout of sequence copies of an associated analyte, (ii) a plurality of pixels in each of the per-cycle image patches along spatial dimensions and detects and accounts for spatial crosstalk between adjacent analytes caused by detection of emissions from a non-associated analyte by a corresponding light sensor of an associated analyte, and (iii) each of the imaged channels along a depth dimension and detects and accounts for emission overlap between the imaged channels caused by overlap of dye emission spectra and produces at least one output feature as a result of convolving over the sequence of per-cycle image patches on the sliding convolution window basis.
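A minimal, loop-based sketch of one such 3D convolution filter clarifies what "temporal, spatial, and depth dimensions" means for a sequence of per-cycle image patches. The shapes and averaging kernel are toy assumptions; they are not taken from the source.

```python
import numpy as np

def conv3d_single_filter(patches, kernel):
    """Minimal sketch of one 3D convolution filter sliding over a sequence
    of per-cycle image patches shaped (cycles, height, width, channels):
    the kernel spans the temporal dimension (cycles), the spatial
    dimensions (pixels), and the depth dimension (imaged channels)."""
    T, H, W, C = patches.shape
    t, h, w, _ = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(patches[i:i + t, j:j + h, k:k + w, :] * kernel)
    return out

patches = np.ones((5, 6, 6, 2))            # 5 cycles, 6x6 pixels, 2 imaged channels
kernel = np.full((3, 3, 3, 2), 1.0 / 54)   # averages a 3-cycle, 3x3-pixel, 2-channel window
out = conv3d_single_filter(patches, kernel)
print(out.shape)   # (3, 4, 4)
```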
  • the method includes supplementing output features produced as a result of a plurality of 3D convolution filters convolving over the sequence of per-cycle image patches with imaged channel-specific and cross-cycle intensity data features of one or more of the pixels that contain the intensity data for one or more of the associated analytes to be base called.
  • the method includes beginning with the output features supplemented with the intensity data features as starting input, applying a cascade of one-dimensional (1D) convolutions and producing further output features, the cascade using 1D convolutions with different receptive fields and detecting varying degrees of the asynchronous readout caused by the phasing and prephasing effect.
  • 1D one-dimensional
  • the method includes applying pointwise convolutions on the further output features and producing final output features.
  • the method includes processing the final output features through an output layer and producing base calls for the associated analytes at each of the sequencing cycles.
  • the method includes producing a final output feature for each of the imaged channels, normalizing unnormalized per-cycle values in final output features of the imaged channels, converting the normalized per-cycle values into per-cycle binary values based on a threshold, and base calling the associated analyte at each of the sequencing cycles based on the per-cycle binary values at corresponding positions in the final output features.
  • the output layer comprises a sigmoid function that squashes the unnormalized per-cycle values in the final output features between zero and one.
  • the method includes assigning those squashed per-cycle values that are below the threshold a zero value and assigning those squashed per-cycle values that are above the threshold a one value.
  • the output layer comprises a softmax function that produces an exponentially normalized probability distribution of a base incorporated at a sequencing cycle in an associated analyte to be base called being A, C, T, and G.
  • the method includes classifying the base as A, C, T, or G based on the distribution.
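The softmax variant of the output layer is standard and can be shown directly; the scores below are arbitrary illustrative numbers.

```python
import numpy as np

def softmax_base_call(scores):
    """Softmax output layer sketch: exponentially normalize four
    unnormalized scores into a probability distribution over A, C, T, G
    and classify by the argmax."""
    s = np.asarray(scores, float)
    p = np.exp(s - s.max())    # subtract max for numerical stability
    p /= p.sum()
    return "ACTG"[int(np.argmax(p))], p

base, probs = softmax_base_call([0.2, 2.1, -0.5, 0.3])
print(base)   # C
```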
  • the method includes the 3D convolutions separately applying a respective convolution kernel on each of the imaged channels and producing at least one intermediate output feature for each of the imaged channels, the 3D convolutions further combining intermediate output features of the imaged channels and producing output features, wherein the output features represent information mixed between the imaged channels, and beginning with the output features supplemented with the intensity data features as starting input, applying the cascade of 1D convolutions.
  • the method includes the 3D convolutions separately applying a respective convolution kernel on each of the imaged channels and producing at least one intermediate output feature for each of the imaged channels, the 3D convolutions further combining intermediate output features of the imaged channels and producing output features, wherein the output features represent information mixed between the imaged channels, and beginning with the output features supplemented with the intensity data features as starting input, applying a plurality of cascades of 1D convolutions such that each cascade in the plurality corresponds to one of the imaged channels and operates on the input independently of another cascade.
  • the method includes the 3D convolutions separately applying a respective convolution kernel on each of the imaged channels and producing at least one intermediate output feature for each of the imaged channels, the 3D convolutions not combining intermediate output features of the imaged channels and instead making them available as imaged channel-specific output features, supplementing the imaged channel-specific output features with cross-cycle intensity data features from the corresponding imaged channel of one or more of the pixels that contain the intensity data for one or more of the associated analytes to be base called, and beginning with the imaged channel-specific output features supplemented with the intensity data features as starting input, applying the cascade of 1D convolutions.
  • the method includes the 3D convolutions separately applying a respective convolution kernel on each of the imaged channels and producing at least one intermediate output feature for each of the imaged channels, the 3D convolutions not combining intermediate output features of the imaged channels and instead making them available as imaged channel-specific output features, supplementing the imaged channel-specific output features with cross-cycle intensity data features from the corresponding imaged channel of one or more of the pixels that contain the intensity data for one or more of the associated analytes to be base called, and beginning with the imaged channel-specific output features supplemented with the intensity data features as starting input, applying a plurality of cascades of 1D convolutions such that each cascade in the plurality corresponds to one of the imaged channels and operates on the input independently of another cascade.
  • the method includes the 1D convolutions mixing information between respective per-cycle elements of each of the output features and the intensity data features on a sliding window basis and producing at least one intermediate output feature for each of the output features and the intensity data features, and the 1D convolutions accumulating information across intermediate output features of the output features on a per-cycle element basis and producing further output features.
  • size of the sliding window is based on a receptive field of the 1D convolutions and varies in the cascade.
  • the method includes applying a combination of the 1D convolutions and transposed convolutions instead of the cascade of 1D convolutions, wherein the combination alternates between application of the 1D convolutions and the transposed convolutions.
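A cascade of 1D convolutions whose receptive field varies stage by stage, as described above, can be sketched in NumPy. This is a toy illustration, not the claimed implementation: the averaging kernels stand in for learned weights, and SAME padding keeps the per-cycle length fixed across stages.

```python
import numpy as np

def conv1d_same(x, kernel):
    # 1D convolution with SAME (zero) padding: output keeps the per-cycle length.
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad))
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])

# One output feature across 10 sequencing cycles; each cascade stage widens the
# receptive field (3, then 5, then 7 cycles), mimicking detection of varying
# degrees of phasing/prephasing spread.
x = np.arange(10, dtype=float)
for width in (3, 5, 7):
    kernel = np.full(width, 1.0 / width)   # stand-in for a learned 1D filter
    x = conv1d_same(x, kernel)
```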
  • the method includes the pointwise convolutions respectively convolving over further output features on a per-cycle element basis and producing at least one intermediate output feature for each of the further output features, and the pointwise convolutions accumulating information across intermediate output features of the further output features on a per-cycle element basis and producing at least one final output feature.
  • the method includes using the normalized per-cycle values in the final output features of the imaged channels to assign quality scores to base call predictions emitted by the output layer based on a quality score mapping.
  • the quality score mapping is determined by calculating predicted error rates for base call predictions made on training data and determining corresponding predicted quality scores, determining a fit between the predicted quality scores and empirical quality scores determined from empirical base calling error rates derived from test data, and correlating the predicted quality scores to the empirical quality scores based on the fit.
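The quality score mapping above rests on the standard Phred transform from error probability to quality score, Q = −10·log₁₀(p), with a fit correlating predicted to empirical scores. The sketch below uses hypothetical error rates and a least-squares line as one possible form of the fit; the specific numbers and the linear fit are assumptions for illustration.

```python
import numpy as np

def phred(p_err):
    # Standard Phred transform: error probability -> quality score.
    return -10.0 * np.log10(p_err)

# Hypothetical predicted error rates (training data) vs empirical error rates (test data).
predicted_q = phred(np.array([0.10, 0.01, 0.001]))      # Q10, Q20, Q30
empirical_q = phred(np.array([0.12, 0.012, 0.0013]))

# Fit (here: a least-squares line) correlating predicted to empirical quality scores.
slope, intercept = np.polyfit(predicted_q, empirical_q, 1)
calibrated = slope * predicted_q + intercept
```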
  • the method includes learning kernel weights of convolution filters applied by the 3D convolutions, the 1D convolutions, and the pointwise convolutions using a backpropagation-based gradient update technique during training that progressively matches the base call predictions emitted by the output layer with ground truth 2608.
  • the training is operationalized by the trainer 2611.
  • the ground truth includes per-cycle binary values for each of the imaged channels.
  • the method includes the backpropagation-based gradient update technique computing an error between the per-cycle binary values in the ground truth 2608 and the corresponding per-cycle binary values in the final output features of the imaged channels.
  • the ground truth includes a one-hot encoding identifying a correct base.
  • the method includes the backpropagation-based gradient update technique computing an error between the one-hot encoding in the ground truth 2608 and the exponentially normalized probability distribution produced by the softmax function.
  • the method includes varying a learning rate of the learning, which is operationalized by a training rate varier 2612. In one implementation, the method includes extracting the per-cycle image patches from respective per-cycle images of a tile of a flow cell on which the analytes are disposed. In one implementation, the training data 2505 (which comprises sequencing data 2515, 2518) is normalized using z-scores by a data normalizer 2602.
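The z-score normalization applied by the data normalizer 2602 can be sketched as follows; the toy intensity values are hypothetical and the per-array normalization axis is an assumption.

```python
import numpy as np

def z_score(data, axis=None):
    # Normalize intensity data to zero mean and unit variance (z-scores).
    mu = data.mean(axis=axis, keepdims=True)
    sigma = data.std(axis=axis, keepdims=True)
    return (data - mu) / sigma

patches = np.array([[100., 200.], [300., 400.]])   # toy intensity values
normalized = z_score(patches)
```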
  • the method includes base calling analytes disposed throughout the tile by extracting per-cycle image patches from overlapping regions of the tile such that the extracted per-cycle image patches have overlapping pixels.
  • the 1D convolutions use a bilinear form product to mix information.
  • the method includes applying non-linear activation functions on the output features and producing activated output features for processing by the 1D convolutions.
  • the method includes applying non-linear activation functions on the further output features and producing activated further output features for processing by the pointwise convolutions.
  • the method includes using batch normalization along with the 1D convolutions.
  • the method includes using batch normalization along with the pointwise convolutions.
  • the method includes using a plurality of 1D convolution filters in each 1D convolution in the cascade.
  • the method includes using a plurality of pointwise convolution filters in the pointwise convolutions such that each pointwise convolution filter in the plurality corresponds to one of the imaged channels and operates on the further output features independently of another pointwise convolution filter.
  • the 3D convolutions, the 1D convolutions, and the pointwise convolutions use SAME padding.
  • the method includes the 3D convolution filter convolving over the sequence of per-cycle image patches to detect and account for signal decay due to fading.
  • implementations of the method described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • a neural network-implemented method of base calling analytes includes accessing a sequence of per-cycle image patches generated for a series of sequencing cycles of a sequencing run. Each pixel in the per-cycle image patches is associated with an analyte. The per-cycle image patches are centered at a center pixel that contains intensity data for a target associated analyte. Non-center pixels in the per-cycle image patches contain intensity data for associated analytes adjacent to the target associated analyte.
  • the intensity data is obtained for one or more imaged channels.
  • the method includes applying three-dimensional (3D) convolutions on the sequence of per-cycle image patches on a sliding convolution window basis.
  • a 3D convolution filter convolves over: (i) a plurality of the per-cycle image patches along a temporal dimension and detects and accounts for phasing and prephasing effect in a current sequencing cycle from one or more successive sequencing cycles and one or more preceding sequencing cycles due to asynchronous readout of sequence copies of an associated analyte, (ii) the center pixel and the non-center pixels along spatial dimensions and detects and accounts for spatial crosstalk from the non-center pixels in the center pixel due to detection of emissions from the adjacent associated analytes by a corresponding light sensor of the target associated analyte, and (iii) each of the imaged channels along a depth dimension and detects and accounts for emission overlap between the imaged channels due to overlap of dye emission spectra, and produces at least one output feature as a result of convolving over the sequence of per-cycle image patches.
  • the method includes supplementing output features produced as a result of a plurality of 3D convolution filters convolving over the sequence of per-cycle image patches with imaged channel-specific and cross-cycle intensity data features of the center pixel.
  • the method includes beginning with the output features supplemented with the intensity data features as starting input, applying a cascade of one-dimensional (1D) convolutions and producing further output features, the cascade using 1D convolutions with different receptive fields and detecting varying degrees of the asynchronous readout caused by the phasing and prephasing effect.
  • the method includes applying pointwise convolutions on the further output features and producing final output features.
  • the method includes processing the final output features through an output layer and producing an output.
  • the method includes base calling the target associated analyte at each of the sequencing cycles based on the output.
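A naive 3D convolution over a sequence of per-cycle image patches, spanning the temporal (cycle), spatial (pixel), and depth (imaged channel) dimensions described above, can be sketched in NumPy. The patch and kernel sizes are assumed toy values (5 cycles, 3×3-pixel patches, 2 imaged channels, a kernel spanning 3 cycles), and the all-ones kernel stands in for learned weights.

```python
import numpy as np

def conv3d_valid(patches, kernel):
    # Naive 3D convolution over (cycle, height, width) with full depth
    # (imaged channels). patches: (cycles, H, W, channels);
    # kernel: (kc, kh, kw, channels). One scalar per valid position.
    C, H, W, D = patches.shape
    kc, kh, kw, kd = kernel.shape
    assert kd == D
    out = np.zeros((C - kc + 1, H - kh + 1, W - kw + 1))
    for c in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                window = patches[c:c + kc, i:i + kh, j:j + kw, :]
                out[c, i, j] = np.sum(window * kernel)
    return out

# 5 sequencing cycles, 3x3 pixel patches, 2 imaged channels; the kernel spans
# 3 cycles (one preceding, current, one successive) and the full neighborhood.
patches = np.random.default_rng(0).random((5, 3, 3, 2))
feature = conv3d_valid(patches, np.ones((3, 3, 3, 2)))
```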
  • a neural network-implemented method of base calling analytes includes accessing a sequence of per-cycle image patches generated for a series of sequencing cycles of a sequencing run. Pixels in the per-cycle image patches contain intensity data for associated analytes in one or more imaged channels.
  • the method includes applying three-dimensional (3D) convolutions on the sequence of per-cycle image patches on a sliding convolution window basis such that, in a convolution window, a 3D convolution filter convolves over a plurality of the per-cycle image patches and produces at least one output feature as a result of convolving over the sequence of per-cycle image patches on the sliding convolution window basis.
  • the method includes beginning with output features produced by the 3D convolutions as starting input, applying further convolutions and producing final output features.
  • the method includes processing the final output features through an output layer and producing base calls for one or more of the associated analytes to be base called at each of the sequencing cycles.
  • implementations of the method described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • a neural network-implemented method of base calling analytes includes accessing a sequence of per-cycle image patches generated for a series of sequencing cycles of a sequencing run.
  • the pixels in the per-cycle image patches contain intensity data for associated analytes.
  • the intensity data is obtained for one or more imaged channels by corresponding light sensors configured to detect emissions from the associated analytes.
  • the method includes processing the sequence of per-cycle image patches on a sliding convolution window basis such that, in a convolution window, using as input image data comprising a per-cycle image patch for a current sequencing cycle, per-cycle image patches for one or more successive sequencing cycles, and per-cycle image patches for one or more preceding sequencing cycles, phasing and prephasing data representing probability distribution of polymerase population movement across sequence copies of an associated analyte for a current sequence position corresponding to the current sequencing cycle, leading sequence positions corresponding to the successive sequencing cycles, and lagging sequence positions corresponding to the preceding sequencing cycles, and base context data identifying bases called in one or more preceding sequencing cycles and base call possibilities in the current sequencing cycle and the successive sequencing cycles, and producing, as output, a base call for the current sequencing cycle and for one or more of the associated analytes to be base called.
  • the method includes sequentially outputting the base call at each successive convolution window and base calling the associated analytes at each of the sequencing cycles.
  • the phasing and prephasing data comprises phasing and prephasing channels determined for the current sequencing cycle from corresponding convolution filters in a plurality of convolution kernels.
  • a phasing and prephasing channel is determined for the current sequencing cycle from a corresponding convolution filter by beginning with an initial probability distribution of the polymerase population movement at a first sequencing cycle as starting input and determining successive probability distributions of the polymerase population movement at successive sequencing cycles as a result of transposed convolution of the corresponding convolution kernel with a probability distribution of the polymerase population movement at a preceding sequencing cycle, selecting from a probability distribution of the polymerase population movement at the current sequencing cycle those values that occur at the current sequence position, the leading sequence positions, and the lagging sequence positions, and including the selected values in the phasing and prephasing channel.
  • the initial probability distribution is preset to specify that, at the first sequencing cycle, the polymerase population movement is limited to a first sequence position.
  • the initial probability distribution includes position-specific parameters which, starting from the first sequence position, span one or more successive sequence positions and are learned during training to account for the polymerase population movement extending beyond the first sequence position at the first sequencing cycle.
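The recurrence above — starting from an initial distribution at the first sequencing cycle and producing each successive distribution from the preceding one — can be sketched in NumPy. This sketch uses a plain convolution as a stand-in for the transposed convolution described above, and the kernel probabilities (stay, advance one, advance two positions) are hypothetical values, not learned parameters.

```python
import numpy as np

def step(dist, kernel):
    # One cycle of polymerase population movement: kernel[k] is the
    # (hypothetical) probability of advancing k positions in one cycle;
    # the new distribution is the convolution of the old one with the
    # kernel, truncated to the sequence length.
    return np.convolve(dist, kernel)[: len(dist)]

n_positions = 6
kernel = np.array([0.1, 0.8, 0.1])   # stay (phasing), advance 1, advance 2 (prephasing)

# Initial distribution: all polymerases at the first sequence position.
dist = np.zeros(n_positions)
dist[0] = 1.0

dists = [dist]
for _ in range(4):                   # distributions at cycles 2..5
    dists.append(step(dists[-1], kernel))
```

Selecting from `dists[c]` the values at the current, leading, and lagging sequence positions would then yield the phasing and prephasing channel for cycle c.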
  • the base context data identifies the bases called and the base call possibilities using a base encoding that represents each base by assigning a value for each of the imaged channels.
  • the base context data identifies the base call possibilities using an r-input truth table, with r representing a count of the current sequencing cycle and the successive sequencing cycles in the convolution window.
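An r-input truth table enumerating the base call possibilities for the current and successive sequencing cycles can be sketched as follows; the base ordering is an assumption.

```python
from itertools import product

def truth_table(r, bases="ACGT"):
    # All 4**r base-call possibilities across the current sequencing cycle
    # and the successive sequencing cycles in the convolution window.
    return list(product(bases, repeat=r))

table = truth_table(2)   # current cycle plus one successive cycle
```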
  • the method includes, in the convolution window, processing the image data through a plurality of three-dimensional (3D) convolution filters and producing, as output, a plurality of image channels, beginning with the image channels, the phasing and prephasing data, and the base context data as starting input, applying a cascade of one-dimensional (1D) convolutions and producing further output features, and applying pointwise convolutions on the further output features and producing final output features, and processing the final output features through an output layer and producing the base call for the current sequencing cycle and for the associated analytes.
  • the method includes using a different plurality of 3D convolution filters.
  • the method includes using bilinear form product to mix the image channels, the phasing and prephasing data, and the base context data.
  • a 3D convolution filter convolves over a plurality of the per- cycle image patches along a temporal dimension and detects and accounts for phasing and prephasing effect between successive ones of the sequencing cycles caused by asynchronous readout of sequence copies of an associated analyte, a plurality of pixels in each of the per-cycle image patches along spatial dimensions and detects and accounts for spatial crosstalk between adjacent analytes caused by detection of emissions from a non-associated analyte by a corresponding light sensor of an associated analyte, and each of the imaged channels along a depth dimension and detects and accounts for emission overlap between the imaged channels caused by overlap of dye emission spectra, and produces at least one image channel as a result of convolving over the sequence of per-cycle image patches.
  • the 1D convolutions use different receptive fields and detect varying degrees of the asynchronous readout.
  • the method includes supplementing the image channels with imaged channel-specific and current cycle-specific intensity data features of one or more of the pixels that contain the intensity data for the associated analytes.
  • the method includes applying a combination of the 1D convolutions and transposed convolutions instead of the cascade of 1D convolutions. The combination alternates between application of the 1D convolutions and the transposed convolutions.
  • the method includes, for an associated analyte to be base called, producing a final output feature for each of the imaged channels, and in the output layer, processing the final output features through a fully-connected network and producing unnormalized per-imaged channel values, normalizing the unnormalized per-imaged channel values, converting the normalized per-imaged channel values into per-imaged channel binary values based on a threshold, and producing the base call for the current sequencing cycle and for the associated analyte based on the per-imaged channel binary values.
  • the output layer comprises a sigmoid function that squashes the unnormalized per-imaged channel values in the final output features between zero and one.
  • the method includes assigning those squashed per-imaged channel values that are below the threshold a zero value and assigning those squashed per-imaged channel values that are above the threshold a one value.
  • the output layer comprises a softmax function that produces an exponentially normalized probability distribution of the base call being A, C, T, and G.
  • the method includes classifying the base call as A, C, T, or G based on the distribution.
  • the method includes determining per-cycle, tile-wide global channels using singular value decomposition (SVD) of image data features in image data of a plurality of associated analytes disposed on a tile of a flow cell.
  • a per-cycle, tile-wide global channel includes a set of principal components of the image data features in image data obtained at a corresponding sequencing cycle from the associated analytes disposed across the tile.
  • the image data features include at least one of background, spatial crosstalk, phasing and prephasing effect, emission overlap, signal intensity, and intensity decay.
  • the method includes feeding the per-cycle, tile-wide global channels as supplemental input to convolution windows of corresponding sequencing cycles.
  • the image data used to generate the per-cycle, tile-wide global channels is obtained from a variety of flow cells, sequencing instruments, sequencing runs, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
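Determining a per-cycle, tile-wide global channel as a set of principal components via SVD can be sketched in NumPy. The feature matrix here is a random stand-in for the real per-analyte image data features (background, spatial crosstalk, signal intensity, and so on), and keeping the top two components is an arbitrary choice for illustration.

```python
import numpy as np

def global_channels(features, k=2):
    # Top-k principal components of tile-wide image data features for one cycle.
    # features: (n_analytes, n_features). Rows are centered, then SVD yields
    # the principal directions forming the per-cycle, tile-wide global channel.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                     # (k, n_features) principal components

rng = np.random.default_rng(1)
feats = rng.random((100, 5))          # 100 analytes on the tile, 5 features each
pcs = global_channels(feats)
```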
  • the method includes performing the base calling during inference on a central processing unit (CPU) by only using the per-cycle image patch for the current sequencing cycle, the per-cycle image patches for the one or more successive sequencing cycles, and the per-cycle image patches for the one or more preceding sequencing cycles and generating a base call for the current sequencing cycle.
  • implementations of the method described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • analyte is intended to mean a point or area in a pattern that can be distinguished from other points or areas according to relative location.
  • An individual analyte can include one or more molecules of a particular type.
  • an analyte can include a single target nucleic acid molecule having a particular sequence or an analyte can include several nucleic acid molecules having the same sequence (and/or complementary sequence, thereof).
  • Different molecules that are at different analytes of a pattern can be differentiated from each other according to the locations of the analytes in the pattern.
  • Example analytes include without limitation, wells in a substrate, beads (or other particles) in or on a substrate, projections from a substrate, ridges on a substrate, pads of gel material on a substrate, or channels in a substrate.
  • target analytes include, but are not limited to, nucleic acids (e.g., DNA, RNA or analogs thereof), proteins, polysaccharides, cells, antibodies, epitopes, receptors, ligands, enzymes (e.g. kinases, phosphatases or polymerases), small molecule drug candidates, cells, viruses, organisms, or the like.
  • nucleic acids may be used as templates as provided herein (e.g., a nucleic acid template, or a nucleic acid complement that is complementary to a nucleic acid template) for particular types of nucleic acid analysis, including but not limited to nucleic acid amplification, nucleic acid expression analysis, and/or nucleic acid sequence determination or suitable combinations thereof.
  • Nucleic acids in certain implementations include, for instance, linear polymers of deoxyribonucleotides in 3'-5' phosphodiester or other linkages, such as deoxyribonucleic acids (DNA), for example, single- and double-stranded DNA, genomic DNA, copy DNA or complementary DNA (cDNA), recombinant DNA, or any form of synthetic or modified DNA.
  • nucleic acids include for instance, linear polymers of ribonucleotides in 3'-5' phosphodiester or other linkages such as ribonucleic acids (RNA), for example, single- and double-stranded RNA, messenger RNA (mRNA), copy RNA or complementary RNA (cRNA), alternatively spliced mRNA, ribosomal RNA, small nucleolar RNA (snoRNA), microRNAs (miRNA), small interfering RNAs (siRNA), piwi RNAs (piRNA), or any form of synthetic or modified RNA.
  • Nucleic acids used in the compositions and methods of the present invention may vary in length and may be intact or full-length molecules or fragments or smaller parts of larger nucleic acid molecules.
  • a nucleic acid may have one or more detectable labels, as described elsewhere herein.
  • nucleic acid cluster comprises a plurality of copies of template nucleic acid and/or complements thereof, attached via their 5' termini to the solid support.
  • the copies of nucleic acid strands making up the nucleic acid clusters may be in a single or double stranded form.
  • Copies of a nucleic acid template that are present in a cluster can have nucleotides at corresponding positions that differ from each other, for example, due to presence of a label moiety.
  • the corresponding positions can also contain analog structures having different chemical structure but similar Watson-Crick base-pairing properties, such as is the case for uracil and thymine.
  • nucleic acid colonies can also be referred to as “nucleic acid clusters”. Nucleic acid colonies can optionally be created by cluster amplification or bridge amplification techniques as set forth in further detail elsewhere herein. Multiple repeats of a target sequence can be present in a single nucleic acid molecule, such as a concatamer created using a rolling circle amplification procedure.
  • the nucleic acid clusters of the invention can have different shapes, sizes and densities depending on the conditions used.
  • clusters can have a shape that is substantially round, multi-sided, donut-shaped or ring-shaped.
  • the diameter of a nucleic acid cluster can be designed to be from about 0.2 μm to about 6 μm, about 0.3 μm to about 4 μm, about 0.4 μm to about 3 μm, about 0.5 μm to about 2 μm, about 0.75 μm to about 1.5 μm, or any intervening diameter.
  • the diameter of a nucleic acid cluster is about 0.5 μm, about 1 μm, about 1.5 μm, about 2 μm, about 2.5 μm, about 3 μm, about 4 μm, about 5 μm, or about 6 μm.
  • the diameter of a nucleic acid cluster may be influenced by a number of parameters, including, but not limited to the number of amplification cycles performed in producing the cluster, the length of the nucleic acid template or the density of primers attached to the surface upon which clusters are formed.
  • the density of nucleic acid clusters can be designed to typically be in the range of 0.1/mm², 1/mm², 10/mm², 100/mm², 1,000/mm², 10,000/mm² to 100,000/mm².
  • the present invention further contemplates, in part, higher density nucleic acid clusters, for example, 100,000/mm² to 1,000,000/mm² and 1,000,000/mm² to 10,000,000/mm².
  • an “analyte” is an area of interest within a specimen or field of view.
  • an analyte refers to the area occupied by similar or identical molecules.
  • an analyte can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence.
  • an analyte can be any element or group of elements that occupy a physical area on a specimen.
  • an analyte could be a parcel of land, a body of water or the like. When an analyte is imaged, each analyte will have some area. Thus, in many implementations, an analyte is not merely one pixel.
  • the distances between analytes can be described in any number of ways. In some implementations, the distances between analytes can be described from the center of one analyte to the center of another analyte. In other implementations, the distances can be described from the edge of one analyte to the edge of another analyte, or between the outer-most identifiable points of each analyte. The edge of an analyte can be described as the theoretical or actual physical boundary on a chip, or some point inside the boundary of the analyte. In other implementations, the distances can be described in relation to a fixed point on the specimen or in the image of the specimen.
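The center-to-center and edge-to-edge distance conventions above can be made concrete with a small NumPy example; the analyte centers and radii are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical analyte centers (in pixels) and radii, to contrast
# center-to-center distance with edge-to-edge distance.
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
radii = np.array([1.0, 1.0])

center_to_center = np.linalg.norm(centers[1] - centers[0])
edge_to_edge = center_to_center - radii.sum()
```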
  • this disclosure provides neural network-based template generation and base calling systems, wherein the systems can include a processor; a storage device; and a program for image analysis, the program including instructions for carrying out one or more of the methods set forth herein. Accordingly, the methods set forth herein can be carried out on a computer, for example, having components set forth herein or otherwise known in the art.
  • the methods and systems set forth herein are useful for analyzing any of a variety of objects.
  • Particularly useful objects are solid supports or solid-phase surfaces with attached analytes.
  • the methods and systems set forth herein provide advantages when used with objects having a repeating pattern of analytes in an xy plane.
  • An example is a microarray having an attached collection of cells, viruses, nucleic acids, proteins, antibodies, carbohydrates, small molecules (such as drug candidates), biologically active molecules or other analytes of interest.
  • An increasing number of applications have been developed for arrays with analytes having biological molecules such as nucleic acids and polypeptides.
  • Such microarrays typically include deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) probes.
  • individual DNA or RNA probes can be attached at individual analytes of an array.
  • a test sample such as from a known person or organism, can be exposed to the array, such that target nucleic acids (e.g., gene fragments, mRNA, or amplicons thereof) hybridize to complementary probes at respective analytes in the array.
  • the probes can be labeled in a target specific process (e.g., due to labels present on the target nucleic acids or due to enzymatic labeling of the probes or targets that are present in hybridized form at the analytes).
  • the array can then be examined by scanning specific frequencies of light over the analytes to identify which target nucleic acids are present in the sample.
  • Biological microarrays may be used for genetic sequencing and similar applications.
  • genetic sequencing comprises determining the order of nucleotides in a length of target nucleic acid, such as a fragment of DNA or RNA. Relatively short sequences are typically sequenced at each analyte, and the resulting sequence information may be used in various bioinformatics methods to logically fit the sequence fragments together so as to reliably determine the sequence of much more extensive lengths of genetic material from which the fragments were derived. Automated, computer-based algorithms for characterizing fragments have been developed, and have been used more recently in genome mapping, identification of genes and their function, and so forth. Microarrays are particularly useful for characterizing genomic content because a large number of variants are present and this supplants the alternative of performing many experiments on individual probes and targets. The microarray is an ideal format for performing such investigations in a practical manner.
  • analyte arrays (also referred to as “microarrays”)
  • a typical array contains analytes, each having an individual probe or a population of probes.
  • the population of probes at each analyte is typically homogenous having a single species of probe.
  • each analyte can have multiple nucleic acid molecules each having a common sequence.
  • the populations at each analyte of an array can be heterogeneous.
  • protein arrays can have analytes with a single protein or a population of proteins typically, but not always, having the same amino acid sequence.
  • the probes can be attached to the surface of an array for example, via covalent linkage of the probes to the surface or via non-covalent interaction(s) of the probes with the surface.
  • probes such as nucleic acid molecules, can be attached to a surface via a gel layer as described, for example, in U.S. patent application Ser. No. 13/784,368 and US Pat. App. Pub. No.
  • Example arrays include, without limitation, a BeadChip Array available from Illumina, Inc. (San Diego, Calif.) or others such as those where probes are attached to beads that are present on a surface (e.g. beads in wells on a surface) such as those described in U.S. Pat. No. 6,266,459; 6,355,431; 6,770,441; 6,859,570; or 7,622,294; or PCT Publication No. WO
  • microarrays that can be used include, for example, an Affymetrix® GeneChip® microarray or other microarray synthesized in accordance with techniques sometimes referred to as VLSIPS™ (Very Large Scale Immobilized Polymer Synthesis) technologies.
  • a spotted microarray can also be used in a method or system according to some implementations of the present disclosure.
  • An example spotted microarray is a CodeLinkTM Array available from Amersham Biosciences.
  • Another microarray that is useful is one that is manufactured using inkjet printing methods such as SurePrintTM Technology available from Agilent Technologies.
  • arrays having amplicons of genomic fragments are particularly useful such as those described in Bentley et al., Nature 456:53–59 (2008), WO 04/018497; WO 91/06678; WO 07/123744; U.S. Pat. No. 7,329,492; 7,211,414; 7,315,019; 7,405,281, or 7,057,026; or US Pat. App. Pub. No. 2008/0108082 Al, each of which is incorporated herein by reference.
  • Another type of array that is useful for nucleic acid sequencing is an array of particles produced from an emulsion PCR technique.
  • Arrays used for nucleic acid sequencing often have random spatial patterns of nucleic acid analytes.
  • HiSeq or MiSeq sequencing platforms available from Illumina Inc. utilize flow cells upon which nucleic acid arrays are formed by random seeding followed by bridge amplification.
  • patterned arrays can also be used for nucleic acid sequencing or other analytical applications.
  • Example patterned arrays, methods for their manufacture and methods for their use are set forth in U.S. Ser. No. 13/787,396; U.S. Ser. No. 13/783,043; U.S. Ser. No. 13/784,368; US Pat. App. Pub. No. 2013/0116153 Al; and US Pat. App. Pub. No.
  • analyte on an array can be selected to suit a particular application.
  • an analyte of an array can have a size that accommodates only a single nucleic acid molecule.
  • a surface having a plurality of analytes in this size range is useful for constructing an array of molecules for detection at single molecule resolution.
  • Analytes in this size range are also useful for use in arrays having analytes that each contain a colony of nucleic acid molecules.
  • the analytes of an array can each have an area that is no larger than about 1 mm², no larger than about 500 µm², no larger than about 100 µm², no larger than about 10 µm², no larger than about 1 µm², no larger than about 500 nm², no larger than about 100 nm², no larger than about 10 nm², no larger than about 5 nm², or no larger than about 1 nm².
  • the analytes of an array will be no smaller than about 1 mm², no smaller than about 500 µm², no smaller than about 100 µm², no smaller than about 10 µm², no smaller than about 1 µm², no smaller than about 500 nm², no smaller than about 100 nm², no smaller than about 10 nm², no smaller than about 5 nm², or no smaller than about 1 nm².
  • an analyte can have a size that is in a range between an upper and lower limit selected from those exemplified above.
  • analytes in these size ranges can be used for applications that do not include nucleic acids. It will be further understood that the size of the analytes need not necessarily be confined to a scale used for nucleic acid applications.
  • analytes can be discrete, being separated with spaces between each other.
  • An array useful in the invention can have analytes that are separated by an edge-to-edge distance of at most 100 µm, 50 µm, 10 µm, 5 µm, 1 µm, 0.5 µm, or less.
  • an array can have analytes that are separated by an edge-to-edge distance of at least 0.5 µm, 1 µm, 5 µm, 10 µm, 50 µm, 100 µm, or more. These ranges can apply to the average edge-to-edge spacing for analytes as well as to the minimum or maximum spacing.
  • the analytes of an array need not be discrete and instead neighboring analytes can abut each other. Whether or not the analytes are discrete, the size of the analytes and/or pitch of the analytes can vary such that arrays can have a desired density.
  • the average analyte pitch in a regular pattern can be at most 100 µm, 50 µm, 10 µm, 5 µm, 1 µm, 0.5 µm, or less.
  • the average analyte pitch in a regular pattern can be at least 0.5 µm, 1 µm, 5 µm, 10 µm, 50 µm, 100 µm, or more. These ranges can apply to the maximum or minimum pitch for a regular pattern as well.
  • the maximum analyte pitch for a regular pattern can be at most 100 µm, 50 µm, 10 µm, 5 µm, 1 µm, 0.5 µm, or less; and/or the minimum analyte pitch in a regular pattern can be at least 0.5 µm, 1 µm, 5 µm, 10 µm, 50 µm, 100 µm, or more.
  • the density of analytes in an array can also be understood in terms of the number of analytes present per unit area.
  • the average density of analytes for an array can be at least about 1×10³ analytes/mm², 1×10⁴ analytes/mm², 1×10⁵ analytes/mm², 1×10⁶ analytes/mm², 1×10⁷ analytes/mm², 1×10⁸ analytes/mm², or 1×10⁹ analytes/mm², or higher.
  • the average density of analytes for an array can be at most about 1×10⁹ analytes/mm², 1×10⁸ analytes/mm², 1×10⁷ analytes/mm², 1×10⁶ analytes/mm², 1×10⁵ analytes/mm², 1×10⁴ analytes/mm², or 1×10³ analytes/mm², or less.
  • the analytes in a pattern can have any of a variety of shapes. For example, when observed in a two dimensional plane, such as on the surface of an array, the analytes can appear rounded, circular, oval, rectangular, square, symmetric, asymmetric, triangular, polygonal, or the like.
  • the analytes can be arranged in a regular repeating pattern including, for example, a hexagonal or rectilinear pattern.
  • a pattern can be selected to achieve a desired level of packing. For example, round analytes are optimally packed in a hexagonal arrangement. Of course other packing arrangements can also be used for round analytes and vice versa.
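The pitch, density, and packing figures above follow from simple geometry. The sketch below is an illustration only (the function names are invented, not from the patent): it converts an analyte pitch into the implied array density for the two regular patterns mentioned, hexagonal and rectilinear, showing why round analytes pack more densely in a hexagonal arrangement.

```python
import math

# For a hexagonal (triangular-lattice) pattern, each analyte occupies a
# rhombic unit cell of area (sqrt(3)/2) * pitch^2, giving a density of
# 2 / (sqrt(3) * pitch^2).  A rectilinear (square) pattern occupies
# pitch^2 per analyte.  Pitch is in micrometres; density in analytes/mm^2.

UM2_PER_MM2 = 1e6  # 1 mm^2 = 1e6 um^2

def hex_density_per_mm2(pitch_um: float) -> float:
    """Analytes per mm^2 for a hexagonal pattern at the given pitch."""
    return 2.0 / (math.sqrt(3.0) * pitch_um ** 2) * UM2_PER_MM2

def square_density_per_mm2(pitch_um: float) -> float:
    """Analytes per mm^2 for a rectilinear (square) pattern at the given pitch."""
    return 1.0 / pitch_um ** 2 * UM2_PER_MM2

# At a 1 um pitch, the hexagonal pattern gives about 1.15e6 analytes/mm^2,
# roughly 15% denser than the 1.0e6 analytes/mm^2 of a square grid --
# which is why round analytes are said to pack optimally hexagonally.
```

Halving the pitch quadruples the density in either pattern, which is consistent with the order-of-magnitude density ranges listed above.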
  • a pattern can be characterized in terms of the number of analytes that are present in a subset that forms the smallest geometric unit of the pattern.
  • the subset can include, for example, at least about 2, 3, 4, 5, 6, 10 or more analytes.
  • the geometric unit can occupy an area of less than 1 mm², 500 µm², 100 µm², 50 µm²,
  • the geometric unit can occupy an area of greater than 10 nm², 50 nm², 100 nm², 500 nm², 1 µm², 10 µm², 50 µm², 100 µm², 500 µm², 1 mm², or more.
  • Characteristics of the analytes in a geometric unit such as shape, size, pitch and the like, can be selected from those set forth herein more generally with regard to analytes in an array or pattern.
  • An array having a regular pattern of analytes can be ordered with respect to the relative locations of the analytes but random with respect to one or more other characteristic of each analyte.
  • the nucleic acid analytes can be ordered with respect to their relative locations but random with respect to one’s knowledge of the sequence for the nucleic acid species present at any particular analyte.
  • nucleic acid arrays formed by seeding a repeating pattern of analytes with template nucleic acids and amplifying the template at each analyte to form copies of the template at the analyte will have a regular pattern of nucleic acid analytes but will be random with regard to the distribution of sequences of the nucleic acids across the array.
  • detection of the presence of nucleic acid material generally on the array can yield a repeating pattern of analytes, whereas sequence specific detection can yield non-repeating distribution of signals across the array.
  • patterns, order, randomness and the like pertain not only to analytes on objects, such as analytes on arrays, but also to analytes in images.
  • patterns, order, randomness and the like can be present in any of a variety of formats that are used to store, manipulate or communicate image data including, but not limited to, a computer readable medium or computer component such as a graphical user interface or other output device.
  • the term “image” is intended to mean a representation of all or part of an object.
  • the representation can be an optically detected reproduction.
  • an image can be obtained from fluorescent, luminescent, scatter, or absorption signals.
  • the part of the object that is present in an image can be the surface or other xy plane of the object.
  • an image is a 2 dimensional representation, but in some cases information in the image can be derived from 3 or more dimensions.
  • An image need not include optically detected signals. Non-optical signals can be present instead.
  • An image can be provided in a computer readable format or medium such as one or more of those set forth elsewhere herein.
  • image refers to a reproduction or representation of at least a portion of a specimen or other object.
  • the reproduction is an optical reproduction, for example, produced by a camera or other optical detector.
  • the reproduction can be a non-optical reproduction, for example, a representation of electrical signals obtained from an array of nanopore analytes or a representation of electrical signals obtained from an ion-sensitive CMOS detector.
  • non-optical reproductions can be excluded from a method or apparatus set forth herein.
  • An image can have a resolution capable of distinguishing analytes of a specimen that are present at any of a variety of spacings including, for example, those that are separated by less than 100 µm, 50 µm, 10 µm, 5 µm, 1 µm or 0.5 µm.
  • data acquisition can include generating an image of a specimen, looking for a signal in a specimen, instructing a detection device to look for or generate an image of a signal, giving instructions for further analysis or transformation of an image file, and any number of transformations or manipulations of an image file.
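As a rough illustration of the spacing figures above, a Nyquist-style rule of thumb — at least two detector pixels per analyte-to-analyte spacing — can be used to sanity-check whether a given pixel size can distinguish neighbouring analytes. This helper is hypothetical and the two-samples threshold is an assumption for illustration; real resolvability also depends on the optical point-spread function, not pixel pitch alone.

```python
def can_resolve(analyte_spacing_um: float, pixel_size_um: float,
                samples_per_spacing: float = 2.0) -> bool:
    """True if the pixel grid samples the analyte spacing finely enough.

    Applies the Nyquist-style rule of thumb that at least
    `samples_per_spacing` pixels must fall within one analyte-to-analyte
    spacing for neighbouring analytes to be distinguished.  The threshold
    is a parameter because in practice the optics (PSF width) also limit
    resolution.
    """
    return pixel_size_um * samples_per_spacing <= analyte_spacing_um

# A 0.3 um pixel comfortably samples analytes spaced 1 um apart,
# but not analytes spaced 0.5 um apart.
```

Such a check maps the spacing list above (100 µm down to 0.5 µm) onto concrete detector requirements: the tighter the analyte spacing, the smaller the pixel (or the better the optics) needed.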

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Pathology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Biochemistry (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Optics & Photonics (AREA)
  • Multimedia (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)

Abstract

According to the invention, a neural network-based base caller detects and accounts for stationary, kinetic, and mechanistic properties of the sequencing process, mapping what is observed at each sequencing cycle of the assay data to the underlying sequence of nucleotides. The neural network-based base caller combines the tasks of feature engineering, dimensionality reduction, discretization, and kinetic modeling into a single end-to-end learning structure. In particular, the neural network-based base caller uses a combination of 3D convolutions, 1D convolutions, and pointwise convolutions to detect and account for assay biases such as the phasing and prephasing effect, spatial crosstalk, emission overlap, and fading.
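The abstract's pipeline of 3D, 1D, and pointwise convolutions can be sketched on toy data. The shapes, layer sizes, and the exact order of operations below are illustrative assumptions, not the claimed architecture: a 3D convolution mixes a local patch of pixels across a window of sequencing cycles, a 1D convolution along the cycle axis models inter-cycle leakage of the kind produced by phasing/prephasing, and a pointwise (1×1) convolution maps the resulting features to per-base scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: (cycles, height, width, channels) -- e.g. 5 sequencing cycles
# of a 3x3 pixel patch with 4 intensity channels.  Purely synthetic data.
x = rng.normal(size=(5, 3, 3, 4))

def conv3d_valid(x, k):
    """Valid 3D convolution over (cycle, y, x) with a (kc, ky, kx, cin, cout) kernel."""
    c, h, w, _ = x.shape
    kc, ky, kx, _, cout = k.shape
    out = np.zeros((c - kc + 1, h - ky + 1, w - kx + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                patch = x[i:i + kc, j:j + ky, l:l + kx, :]
                # Contract cycle, y, x and channel axes against the kernel.
                out[i, j, l] = np.tensordot(patch, k, axes=4)
    return out

def conv1d_valid(x, k):
    """Valid 1D convolution along the cycle axis with a (kc, cin, cout) kernel."""
    c, _ = x.shape
    kc, _, cout = k.shape
    out = np.zeros((c - kc + 1, cout))
    for i in range(out.shape[0]):
        out[i] = np.tensordot(x[i:i + kc], k, axes=2)  # mixes neighbouring cycles
    return out

# 3D conv: a 3-cycle x 3x3-pixel kernel, 4 -> 8 features; collapses the patch.
k3 = rng.normal(size=(3, 3, 3, 4, 8)) * 0.1
h = conv3d_valid(x, k3).reshape(-1, 8)      # (3, 8): per-cycle feature vectors

# 1D conv over cycles (window 3) captures phasing/prephasing-like leakage.
k1 = rng.normal(size=(3, 8, 8)) * 0.1
h2 = conv1d_valid(h, k1)                    # (1, 8)

# Pointwise (1x1) conv mixes channels only: a plain matmul per position,
# mapping 8 features to 4 base scores, followed by a softmax.
kp = rng.normal(size=(8, 4))
logits = h2 @ kp                            # (1, 4)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

The key structural point the sketch illustrates is the division of labour: the 3D stage handles spatial effects (crosstalk, emission overlap), the 1D stage handles temporal effects (phasing, fading), and the pointwise stage performs the final per-cycle feature-to-base mapping.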
EP20730877.6A 2019-05-16 2020-05-15 Appel de base au moyen de convolutions Pending EP3970151A1 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201962849132P 2019-05-16 2019-05-16
US201962849091P 2019-05-16 2019-05-16
US201962849133P 2019-05-16 2019-05-16
US16/874,599 US11423306B2 (en) 2019-05-16 2020-05-14 Systems and devices for characterization and performance analysis of pixel-based sequencing
US16/874,633 US11593649B2 (en) 2019-05-16 2020-05-14 Base calling using convolutions
PCT/US2020/033281 WO2020232410A1 (fr) 2019-05-16 2020-05-15 Appel de base au moyen de convolutions

Publications (1)

Publication Number Publication Date
EP3970151A1 true EP3970151A1 (fr) 2022-03-23

Family

ID=74041703

Family Applications (3)

Application Number Title Priority Date Filing Date
EP20730877.6A Pending EP3970151A1 (fr) 2019-05-16 2020-05-15 Appel de base au moyen de convolutions
EP20733084.6A Active EP3969884B1 (fr) 2019-05-16 2020-05-15 Systèmes et procédés pour la caractérisation et l'analyse des performances de séquençage basé sur des pixels
EP24170634.0A Pending EP4394778A3 (fr) 2019-05-16 2020-05-15 Systèmes et procédés pour la caractérisation et l'analyse des performances de séquençage basé sur des pixels

Family Applications After (2)

Application Number Title Priority Date Filing Date
EP20733084.6A Active EP3969884B1 (fr) 2019-05-16 2020-05-15 Systèmes et procédés pour la caractérisation et l'analyse des performances de séquençage basé sur des pixels
EP24170634.0A Pending EP4394778A3 (fr) 2019-05-16 2020-05-15 Systèmes et procédés pour la caractérisation et l'analyse des performances de séquençage basé sur des pixels

Country Status (4)

Country Link
EP (3) EP3970151A1 (fr)
CN (3) CN112313750B (fr)
AU (2) AU2020276115A1 (fr)
CA (2) CA3104851A1 (fr)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210554B2 (en) 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
US11347965B2 (en) 2019-03-21 2022-05-31 Illumina, Inc. Training data generation for artificial intelligence-based sequencing
US11423306B2 (en) 2019-05-16 2022-08-23 Illumina, Inc. Systems and devices for characterization and performance analysis of pixel-based sequencing
US11593649B2 (en) 2019-05-16 2023-02-28 Illumina, Inc. Base calling using convolutions
CN115136244A (zh) 2020-02-20 2022-09-30 因美纳有限公司 基于人工智能的多对多碱基判读
WO2022197754A1 (fr) * 2021-03-16 2022-09-22 Illumina Software, Inc. Quantification de paramètres de réseau neuronal pour appel de base
CN117063240A (zh) 2021-12-24 2023-11-14 上海芯像生物科技有限公司 基于深度学习的核酸测序方法和系统
CN115376613A (zh) * 2022-09-13 2022-11-22 郑州思昆生物工程有限公司 一种碱基类别检测方法、装置、电子设备及存储介质
CN117726621B (zh) * 2024-02-05 2024-06-25 深圳赛陆医疗科技有限公司 基于深度学习的基因测序碱基质量评估方法、产品、设备及介质

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6090592A (en) 1994-08-03 2000-07-18 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid on supports
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
JP2001517948A (ja) 1997-04-01 2001-10-09 グラクソ、グループ、リミテッド 核酸配列決定法
AR021833A1 (es) 1998-09-30 2002-08-07 Applied Research Systems Metodos de amplificacion y secuenciacion de acido nucleico
AR031640A1 (es) 2000-12-08 2003-09-24 Applied Research Systems Amplificacion isotermica de acidos nucleicos en un soporte solido
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
EP3175914A1 (fr) 2004-01-07 2017-06-07 Illumina Cambridge Limited Perfectionnements apportés ou se rapportant à des réseaux moléculaires
US7709197B2 (en) 2005-06-15 2010-05-04 Callida Genomics, Inc. Nucleic acid analysis by random mixtures of non-overlapping fragments
GB0514936D0 (en) 2005-07-20 2005-08-24 Solexa Ltd Preparation of templates for nucleic acid sequencing
GB0522310D0 (en) 2005-11-01 2005-12-07 Solexa Ltd Methods of preparing libraries of template polynucleotides
US7576371B1 (en) * 2006-03-03 2009-08-18 Array Optronix, Inc. Structures and methods to improve the crosstalk between adjacent pixels of back-illuminated photodiode arrays
WO2007107710A1 (fr) 2006-03-17 2007-09-27 Solexa Limited Procédés isothermiques pour créer des réseaux moléculaires clonales simples
US20080242560A1 (en) 2006-11-21 2008-10-02 Gunderson Kevin L Methods for generating amplified nucleic acid arrays
EP2639578B1 (fr) 2006-12-14 2016-09-14 Life Technologies Corporation Appareil de mesure d'analytes à l'aide de matrices de FET à grande échelle
EP2173467B1 (fr) 2007-07-13 2016-05-04 The Board Of Trustees Of The Leland Stanford Junior University Procédé et appareil utilisant un champ électrique pour des dosages biologiques améliorés
US7595882B1 (en) 2008-04-14 2009-09-29 Geneal Electric Company Hollow-core waveguide-based raman systems and methods
US8965076B2 (en) * 2010-01-13 2015-02-24 Illumina, Inc. Data processing system and methods
EP4378584A2 (fr) * 2010-02-19 2024-06-05 Pacific Biosciences Of California, Inc. Système analytique intégré et procédé de mesure de fluorescence
US9096899B2 (en) 2010-10-27 2015-08-04 Illumina, Inc. Microdevices and biosensor cartridges for biological or chemical analysis and systems and methods for the same
US9387476B2 (en) 2010-10-27 2016-07-12 Illumina, Inc. Flow cells for biological or chemical analysis
PT3623481T (pt) 2011-09-23 2021-10-15 Illumina Inc Composições para sequenciação de ácidos nucleicos
US8637242B2 (en) 2011-11-07 2014-01-28 Illumina, Inc. Integrated sequencing apparatuses and methods of use
US9193998B2 (en) 2013-03-15 2015-11-24 Illumina, Inc. Super resolution imaging
US9736388B2 (en) * 2013-12-13 2017-08-15 Bio-Rad Laboratories, Inc. Non-destructive read operations with dynamically growing images
CN105980578B (zh) * 2013-12-16 2020-02-14 深圳华大智造科技有限公司 用于使用机器学习进行dna测序的碱基判定器
EP3116651B1 (fr) 2014-03-11 2020-04-22 Illumina, Inc. Cartouche microfluidique intégrée jetable, et procédés pour la fabriquer
EP3148697A1 (fr) 2014-05-27 2017-04-05 Illumina, Inc. Systèmes et procédés d'analyse biochimique comprenant un instrument de base et une cartouche amovible
CA3225867A1 (fr) 2015-03-24 2016-09-29 Illumina, Inc. Procedes, ensembles de support, et systemes pour l'imagerie d'echantillons pour une analyse biologique ou chimique
EP3130681B1 (fr) * 2015-08-13 2019-11-13 Centrillion Technology Holdings Corporation Procédés de synchronisation de molécules d'acide nucléique
US10976334B2 (en) 2015-08-24 2021-04-13 Illumina, Inc. In-line pressure accumulator and flow-control system for biological or chemical assays
US11579336B2 (en) 2016-04-22 2023-02-14 Illumina, Inc. Photonic structure-based devices and compositions for use in luminescent imaging of multiple sites within a pixel, and methods of using the same
CN109313328A (zh) 2016-06-21 2019-02-05 伊鲁米那股份有限公司 超分辨率显微术
KR102385560B1 (ko) * 2017-01-06 2022-04-11 일루미나, 인코포레이티드 페이징 보정
CN109614981B (zh) * 2018-10-17 2023-06-30 东北大学 基于斯皮尔曼等级相关的卷积神经网络的电力系统智能故障检测方法及系统

Also Published As

Publication number Publication date
CN112368567A (zh) 2021-02-12
EP3969884A1 (fr) 2022-03-23
EP3969884C0 (fr) 2024-04-17
AU2020276115A1 (en) 2021-01-07
CN112313750B (zh) 2023-11-17
CN112313750A (zh) 2021-02-02
AU2020273459A1 (en) 2021-01-07
CN117935916A (zh) 2024-04-26
CA3104854A1 (fr) 2020-11-19
EP4394778A2 (fr) 2024-07-03
EP4394778A3 (fr) 2024-08-28
CA3104851A1 (fr) 2020-11-19
EP3969884B1 (fr) 2024-04-17
CN112368567B (zh) 2024-04-16

Similar Documents

Publication Publication Date Title
US11817182B2 (en) Base calling using three-dimentional (3D) convolution
EP3942072B1 (fr) Génération de données d'apprentissage pour séquençage à base d'intelligence artificielle
US11347965B2 (en) Training data generation for artificial intelligence-based sequencing
US20210265018A1 (en) Knowledge Distillation and Gradient Pruning-Based Compression of Artificial Intelligence-Based Base Caller
CN112313750B (zh) 使用卷积的碱基识别
WO2020191387A1 (fr) Appel de base à base d'intelligence artificielle
NL2023310B1 (en) Training data generation for artificial intelligence-based sequencing
NL2023311B9 (en) Artificial intelligence-based generation of sequencing metadata
US20230343414A1 (en) Sequence-to-sequence base calling
WO2023049215A1 (fr) Appel de base basé sur l'état compressé
EP4405955A1 (fr) Appel de base basé sur l'état compressé

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20201222

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40061265

Country of ref document: HK

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20240215

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN