WO2020205296A1 - Artificial intelligence-based generation of sequencing metadata - Google Patents

Artificial intelligence-based generation of sequencing metadata

Info

Publication number
WO2020205296A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
units
neural network
clusters
image
Prior art date
Application number
PCT/US2020/024087
Other languages
English (en)
Inventor
Anindita Dutta
Dorna KASHEFHAGHIGHI
Amirali Kia
Original Assignee
Illumina, Inc.
Priority date
Filing date
Publication date
Priority claimed from NL2023316A external-priority patent/NL2023316B1/en
Priority claimed from NL2023312A external-priority patent/NL2023312B1/en
Priority claimed from NL2023310A external-priority patent/NL2023310B1/en
Priority claimed from NL2023314A external-priority patent/NL2023314B1/en
Priority claimed from NL2023311A external-priority patent/NL2023311B9/en
Priority claimed from US16/825,991 external-priority patent/US11210554B2/en
Priority claimed from US16/825,987 external-priority patent/US11347965B2/en
Priority to AU2020256047A priority Critical patent/AU2020256047A1/en
Priority to MX2020014293A priority patent/MX2020014293A/es
Priority to JP2020572715A priority patent/JP2022525267A/ja
Priority to CN202080003614.9A priority patent/CN112334984A/zh
Priority to KR1020207037713A priority patent/KR20210142529A/ko
Application filed by Illumina, Inc. filed Critical Illumina, Inc.
Priority to SG11202012453PA priority patent/SG11202012453PA/en
Priority to EP20719052.1A priority patent/EP3942071A1/fr
Priority to BR112020026426-1A priority patent/BR112020026426A2/pt
Publication of WO2020205296A1 publication Critical patent/WO2020205296A1/fr
Priority to IL279525A priority patent/IL279525A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/69Microscopic objects, e.g. biological cells or cellular parts
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images

Definitions

  • The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks, such as deep convolutional neural networks for analyzing data.
  • Deep neural networks are a type of artificial neural network that uses multiple nonlinear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation, which carries the difference between observed and predicted output to adjust parameters. Deep neural networks have evolved with the availability of large training datasets, the power of parallel and distributed computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains such as computer vision, speech recognition, and natural language processing.
  • Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are components of deep neural networks. Convolutional neural networks have succeeded particularly in image recognition with an architecture that comprises convolution layers, nonlinear layers, and pooling layers.
  • Recurrent neural networks are designed to utilize sequential information of input data with cyclic connections among building blocks like perceptrons, long short-term memory units, and gated recurrent units.
  • Many other emergent deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multi-dimensional recurrent neural networks, and convolutional auto-encoders.
  • The goal of training deep neural networks is optimization of the weight parameters in each layer, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from data.
  • A single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass sequentially computes the output in each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures error between the inferenced outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network. Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent.
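  • As a purely illustrative sketch (not part of the patent), the following NumPy code implements one such optimization cycle for a small two-layer network: a forward pass, a cross-entropy objective loss, a chain-rule backward pass, and a gradient-descent update of the weight parameters. The network size and data are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 64 examples, 10 features, 3 classes (one-hot labels).
X = rng.normal(size=(64, 10))
Y = np.eye(3)[rng.integers(0, 3, size=64)]

# Two-layer network parameters.
W1, b1 = rng.normal(scale=0.1, size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 3)), np.zeros(3)
lr = 0.1

for step in range(100):
    # Forward pass: propagate function signals through the network.
    h = np.maximum(0, X @ W1 + b1)                  # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # softmax output

    # Objective loss: cross-entropy between outputs and given labels.
    loss = -np.mean(np.sum(Y * np.log(p + 1e-12), axis=1))

    # Backward pass: chain rule gives gradients w.r.t. all weights.
    d_logits = (p - Y) / len(X)
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_h = (d_logits @ W2.T) * (h > 0)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient-descent update of the weight parameters.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```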
  • Stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples.
  • Optimization algorithms stem from stochastic gradient descent.
  • The Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.
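  • A minimal sketch of these two update rules, using their standard published formulas; the hyperparameter defaults below are conventional values, not drawn from the patent.

```python
import numpy as np

def adagrad_update(w, g, cache, lr=0.01, eps=1e-8):
    # Adagrad: per-parameter learning rates shrink with the
    # accumulated squared gradients (update frequency).
    cache += g ** 2
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adam_update(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: adapts learning rates using the first and second moments
    # of the gradients, with bias correction for the running averages.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```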
  • Regularization refers to strategies intended to avoid overfitting and thus achieve good generalization performance.
  • Weight decay adds a penalty term to the objective loss function so that weight parameters converge to smaller absolute values.
  • Dropout randomly removes hidden units from neural networks during training and can be considered an ensemble of possible subnetworks.
  • Maxout, a new activation function, and rnnDrop, a variant of dropout for recurrent neural networks, are further regularization approaches.
  • Batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters.
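  • A minimal sketch of the batch normalization forward pass in training mode, assuming a 2D mini-batch of shape (batch, features); gamma and beta are the learned scale and shift parameters.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each scalar feature over the mini-batch, then scale
    # and shift with the learned parameters gamma and beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```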
  • Convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference.
  • Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions.
  • A hallmark of convolutional neural networks is the use of convolution filters.
  • Convolution filters perform adaptive learning of features, analogous to a process of mapping raw input data to the informative representation of knowledge.
  • The convolution filters serve as a series of motif scanners, since a set of such filters is capable of recognizing relevant patterns in the input and updating itself during the training procedure.
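  • The following illustrative sketch (not from the patent) shows how a single convolution filter can act as a motif scanner over a one-hot encoded DNA sequence; the "TATA" filter and the example sequence are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    # Encode a DNA sequence as a 4 x L one-hot matrix (rows = bases).
    return np.array([[float(b == base) for b in seq] for base in BASES])

def scan(seq, motif_filter):
    # Slide a 4 x k convolution filter along the sequence; high
    # activations mark positions that resemble the learned motif.
    x = one_hot(seq)
    k = motif_filter.shape[1]
    return np.array([np.sum(x[:, i:i + k] * motif_filter)
                     for i in range(x.shape[1] - k + 1)])

# A hypothetical filter tuned to the motif "TATA".
tata = one_hot("TATA")
print(scan("GGTATACC", tata))   # peaks at the motif position
```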
  • Recurrent neural networks can capture long-range dependencies in sequential data of varying lengths, such as protein or DNA sequences.
  • Nucleic acid cluster-based genomics methods extend to other areas of genome analysis as well.
  • Nucleic acid cluster-based genomics can be used in sequencing applications, diagnostics and screening, gene expression analysis, epigenetic analysis, genetic analysis of polymorphisms, and the like.
  • Each of these nucleic acid cluster-based genomics technologies is limited when there is an inability to resolve data generated from closely proximate or spatially overlapping nucleic acid clusters.
  • There remains a need for nucleic acid sequencing data that can be obtained rapidly and cost-effectively for a wide variety of uses, including for genomics (e.g., for genome characterization of any and all animal, plant, microbial or other biological species or populations), pharmacogenomics, transcriptomics, diagnostics, prognostics, biomedical risk assessment, clinical and research genetics, personalized medicine, drug efficacy and drug interactions assessments, veterinary medicine, agriculture, evolutionary and biodiversity studies, aquaculture, forestry, oceanography, ecological and environmental management, and other purposes.
  • The technology disclosed provides neural network-based methods and systems that address these and similar needs, including increasing the level of throughput in high-throughput nucleic acid sequencing technologies, and offers other related advantages.
  • Figure 1 shows one implementation of a processing pipeline that determines cluster metadata using subpixel base calling.
  • Figure 2 depicts one implementation of a flow cell that contains clusters in its tiles.
  • Figure 3 illustrates one example of the Illumina GA-IIx flow cell with eight lanes.
  • Figure 4 depicts an image set of sequencing images for four-channel chemistry, i.e., the image set has four sequencing images, captured using four different wavelength bands (image/imaging channels) in the pixel domain.
  • Figure 5 is one implementation of dividing a sequencing image into subpixels (or subpixel regions).
  • Figure 6 shows preliminary center coordinates of the clusters identified by the base caller during the subpixel base calling.
  • Figure 7 depicts one example of merging subpixel base calls produced over the plurality of sequencing cycles to generate the so-called “cluster maps” that contain the cluster metadata.
  • Figure 8a illustrates one example of a cluster map generated by the merging of the subpixel base calls.
  • Figure 8b depicts one implementation of subpixel base calling.
  • Figure 9 shows another example of a cluster map that identifies cluster metadata.
  • Figure 10 shows how a center of mass (COM) of a disjointed region in a cluster map is calculated.
  • Figure 11 depicts one implementation of calculation of a weighted decay factor based on the Euclidean distance from a subpixel in a disjointed region to the COM of the disjointed region.
  • Figure 12 illustrates one implementation of an example ground truth decay map derived from an example cluster map produced by the subpixel base calling.
  • Figure 13 illustrates one implementation of deriving a ternary map from a cluster map.
  • Figure 14 illustrates one implementation of deriving a binary map from a cluster map.
  • Figure 15 is a block diagram that shows one implementation of generating training data that is used to train the neural network-based template generator and the neural network-based base caller.
  • Figure 16 shows characteristics of the disclosed training examples used to train the neural network-based template generator and the neural network-based base caller.
  • Figure 17 illustrates one implementation of processing input image data through the disclosed neural network-based template generator and generating an output value for each unit in an array. Depending on the implementation, the array is a decay map, a ternary map, or a binary map.
  • Figure 18 shows one implementation of post-processing techniques that are applied to the decay map, the ternary map, or the binary map produced by the neural network-based template generator to derive cluster metadata, including cluster centers, cluster shapes, cluster sizes, cluster background, and/or cluster boundaries.
  • Figure 19 depicts one implementation of extracting cluster intensity in the pixel domain.
  • Figure 20 illustrates one implementation of extracting cluster intensity in the subpixel domain.
  • Figure 21a shows three different implementations of the neural network-based template generator.
  • Figure 21b depicts one implementation of the input image data that is fed as input to the neural network-based template generator.
  • The input image data comprises a series of image sets with sequencing images that are generated during a certain number of initial sequencing cycles of a sequencing run.
  • Figure 22 shows one implementation of extracting patches from the series of image sets in Figure 21b to produce a series of “downsized” image sets that form the input image data.
  • Figure 23 depicts one implementation of upsampling the series of image sets in Figure 21b to produce a series of “upsampled” image sets that form the input image data.
  • Figure 24 shows one implementation of extracting patches from the series of upsampled image sets in Figure 23 to produce a series of “upsampled and downsized” image sets that form the input image data.
  • Figure 25 illustrates one implementation of an overall example process of generating ground truth data for training the neural network-based template generator.
  • Figure 26 illustrates one implementation of the regression model.
  • Figure 27 depicts one implementation of generating a ground truth decay map from a cluster map.
  • The ground truth decay map is used as ground truth data for training the regression model.
  • Figure 28 is one implementation of training the regression model using a backpropagation-based gradient update technique.
  • Figure 29 is one implementation of template generation by the regression model during inference.
  • Figure 30 illustrates one implementation of subjecting the decay map to post-processing to identify cluster metadata.
  • Figure 31 depicts one implementation of a watershed segmentation technique identifying non-overlapping groups of contiguous cluster/cluster interior subpixels that characterize the clusters.
  • Figure 32 is a table that shows an example U-Net architecture of the regression model.
  • Figure 33 illustrates different approaches of extracting cluster intensity using cluster shape information identified in a template image.
  • Figure 34 shows different approaches of base calling using the outputs of the regression model.
  • Figure 35 illustrates the difference in base calling performance when the RTA base caller uses ground truth center of mass (COM) location as the cluster center, as opposed to using a non-COM location as the cluster center.
  • Figure 36 shows, on the left, an example decay map produced by the regression model. On the right, Figure 36 also shows an example ground truth decay map that the regression model approximates during the training.
  • Figure 37 portrays one implementation of the peak locator identifying cluster centers in the decay map by detecting peaks.
  • Figure 38 compares peaks detected by the peak locator in a decay map produced by the regression model with peaks in a corresponding ground truth decay map.
  • Figure 39 illustrates performance of the regression model using precision and recall statistics.
  • Figure 40 compares performance of the regression model with the RTA base caller for 20 pM library concentration (normal run).
  • Figure 41 compares performance of the regression model with the RTA base caller for 30 pM library concentration (dense run).
  • Figure 42 compares the number of non-duplicate proper read pairs, i.e., the number of unique paired reads that have both reads aligned inwards within a reasonable distance, detected by the regression model versus the same detected by the RTA base caller.
  • Figure 43 shows, on the right, a first decay map produced by the regression model. On the left, Figure 43 shows a second decay map produced by the regression model.
  • Figure 44 compares performance of the regression model with the RTA base caller for 40 pM library concentration (highly dense run).
  • Figure 45 shows, on the left, a first decay map produced by the regression model. On the right, Figure 45 shows the results of the thresholding, the peak locating, and the watershed segmentation technique applied to the first decay map.
  • Figure 46 illustrates one implementation of the binary classification model.
  • Figure 47 is one implementation of training the binary classification model using a backpropagation-based gradient update technique that involves softmax scores.
  • Figure 48 is another implementation of training the binary classification model using a backpropagation-based gradient update technique that involves sigmoid scores.
  • Figure 49 illustrates another implementation of the input image data fed to the binary classification model and the corresponding class labels used to train the binary classification model.
  • Figure 50 is one implementation of template generation by the binary classification model during inference.
  • Figure 51 illustrates one implementation of subjecting the binary map to peak detection to identify cluster centers.
  • Figure 52a shows, on the left, an example binary map produced by the binary classification model. On the right, Figure 52a also shows an example ground truth binary map that the binary classification model approximates during the training.
  • Figure 52b illustrates performance of the binary classification model using a precision statistic.
  • Figure 53 is a table that shows an example architecture of the binary classification model.
  • Figure 54 illustrates one implementation of the ternary classification model.
  • Figure 55 is one implementation of training the ternary classification model using a backpropagation-based gradient update technique.
  • Figure 56 illustrates another implementation of the input image data fed to the ternary classification model and the corresponding class labels used to train the ternary classification model.
  • Figure 57 is a table that shows an example architecture of the ternary classification model.
  • Figure 58 is one implementation of template generation by the ternary classification model during inference.
  • Figure 59 shows a ternary map produced by the ternary classification model.
  • Figure 60 depicts an array of units produced by the ternary classification model 5400, along with the unit-wise output values.
  • Figure 61 shows one implementation of subjecting the ternary map to post-processing to identify cluster centers, cluster background, and cluster interior.
  • Figure 62a shows example predictions of the ternary classification model.
  • Figure 62b illustrates other example predictions of the ternary classification model.
  • Figure 62c shows yet other example predictions of the ternary classification model.
  • Figure 63 depicts one implementation of deriving the cluster centers and cluster shapes from the output of the ternary classification model in Figure 62a.
  • Figure 64 compares base calling performance of the binary classification model, the regression model, and the RTA base caller.
  • Figure 65 compares the performance of the ternary classification model with that of the RTA base caller under three contexts, five sequencing metrics, and two run densities.
  • Figure 66 compares the performance of the regression model with that of the RTA base caller under the three contexts, the five sequencing metrics, and the two run densities discussed in Figure 65.
  • Figure 67 focuses on the penultimate layer of the neural network-based template generator.
  • Figure 68 visualizes what the penultimate layer of the neural network-based template generator has learned as a result of the backpropagation-based gradient update training.
  • The illustrated implementation visualizes twenty-four of the thirty-two trained convolution filters of the penultimate layer depicted in Figure 67.
  • Figure 69 overlays cluster center predictions of the binary classification model (in blue) onto those of the RTA base caller (in pink).
  • Figure 70 overlays cluster center predictions made by the RTA base caller (in pink) onto visualization of the trained convolution filters of the penultimate layer of the binaiy classification model.
  • Figure 71 illustrates one implementation of training data used to train the neural network-based template generator
  • Figure 72 is one implementation of using beads for image registration based on cluster center predictions of the neural network-based template generator.
  • Figure 73 illustrates one implementation of cluster statistics of clusters identified by the neural network-based template generator.
  • Figure 74 shows how the neural network-based template generator’s ability to distinguish between adjacent clusters improves when the number of initial sequencing cycles for which the input image data is used increases from five to seven
  • Figure 75 illustrates the difference in base calling performance when a RTA base caller uses ground truth center of mass (COM) location as the cluster center, as opposed to when a non-COM location is used as the cluster center.
  • Figure 76 portrays the performance of the neural network-based template generator on extra detected clusters.
  • Figure 77 shows different datasets used for training the neural network-based template generator.
  • Figures 78A and 78B depict one implementation of a sequencing system.
  • The sequencing system comprises a configurable processor.
  • Figure 79 is a simplified block diagram of a system for analysis of sensor data from the sequencing system, such as base call sensor outputs.
  • Figure 80 is a simplified diagram showing aspects of the base calling operation, including functions of a runtime program executed by a host processor.
  • Figure 81 is a simplified diagram of a configuration of a configurable processor such as the one depicted in Figure 79.
  • Figure 82 is a computer system that can be used by the sequencing system of Figure 78A to implement the technology disclosed herein.
  • The signal from an image set being evaluated is increasingly faint as classification of bases proceeds in cycles, especially over increasingly long strands of bases.
  • The signal-to-noise ratio decreases as base classification extends over the length of a strand, so reliability decreases. Updated estimates of reliability are expected as the estimated reliability of base classification changes.
  • Digital images are captured from amplified clusters of sample strands. Samples are amplified by duplicating strands using a variety of physical structures and chemistries. During sequencing by synthesis, tags are chemically attached in cycles and stimulated to glow. Digital sensors collect photons from the tags that are read out of pixels to produce images.
  • Cluster positions are not mechanically regulated, so cluster centers are not aligned with pixel centers.
  • A pixel center can be the integer coordinate assigned to a pixel. In other implementations, it can be the top-left corner of the pixel. In yet other implementations, it can be the centroid or center-of-mass of the pixel. Amplification does not produce uniform cluster shapes. Distribution of cluster signals in the digital image is, therefore, a statistical distribution rather than a regular pattern. We call this positional uncertainty.
  • One of the signal classes may produce no detectable signal and be classified at a particular position based on a “dark” signal. Thus, templates are necessary for classification during dark cycles. Production of templates resolves initial positional uncertainty using multiple imaging cycles to avoid missing dark signals.
  • Values can be assigned to subpixels by interpolation, such as bilinear interpolation or area weighting. Interpolation, such as bilinear interpolation, is also applied when pixels are re-framed by applying an affine transformation to data from physical pixels.
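  • A minimal sketch of assigning values to subpixels by bilinear interpolation, here using SciPy's order-1 zoom as the interpolator; the 20 x 20 tile and the 4x upsampling factor are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

# A 20 x 20 pixel tile image (toy data).
pixels = np.random.rand(20, 20)

# Upsample 4x in each dimension with bilinear (order-1) interpolation,
# assigning interpolated intensity values to 4 x 4 subpixels per pixel.
subpixels = ndimage.zoom(pixels, 4, order=1)
print(subpixels.shape)  # (80, 80) -> 6400 subpixels
```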
  • High resolution sensors capture only part of an imaged media at a time. The sensor is stepped over the imaged media to cover the whole field. Thousands of digital images can be collected during one processing cycle.
  • Sensor and illumination design are combined to distinguish among at least four illumination response values that are used to classify bases. If a traditional RGB camera with a Bayer color filter array were used, four sensor pixels would be combined into a single RGB value. This would reduce the effective sensor resolution by four-fold.
  • Multiple images can be collected at a single position using different illumination wavelengths and/or different filters rotated into position between the imaged media and the sensor. The number of images required to distinguish among four base classifications varies between systems. Some systems use one image with four intensity levels for different classes of bases. Other systems use two images with different illumination wavelengths (red and green, for instance) and/or filters with a sort of truth table to classify bases. Systems also can use four images with different illumination wavelengths and/or filters tuned to specific base classes.
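  • As a purely illustrative example of a two-image truth table, the sketch below maps on/off intensity patterns across a red image and a green image to the four bases; the specific assignment of bases to patterns is an assumption for illustration only, not a statement of any particular system's chemistry.

```python
# Hypothetical two-image truth table: each base maps to a pattern of
# (red on/off, green on/off) intensities across the two images.
TRUTH_TABLE = {
    (True, True): "A",    # signal in both channels
    (True, False): "C",   # red channel only
    (False, True): "T",   # green channel only
    (False, False): "G",  # dark in both channels
}

def classify(red_intensity, green_intensity, threshold=0.5):
    return TRUTH_TABLE[(red_intensity > threshold,
                        green_intensity > threshold)]

print(classify(0.9, 0.8))  # 'A'
```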
  • Massively parallel processing of digital images is practically necessary to align and combine relatively short strands, on the order of 30 to 2000 base pairs, into longer sequences, potentially millions or even billions of bases in length. Redundant samples are desirable over an imaged media, so a part of a sequence may be covered by dozens of sample reads. Millions or at least hundreds of thousands of sample clusters are imaged from a single imaged media. Massively parallel processing of so many clusters has increased sequencing capacity while decreasing cost.
  • The technology disclosed improves processing both during template generation to resolve positional uncertainty and during base classification of clusters at resolved positions. Applying the technology disclosed, less expensive hardware can be used to reduce the cost of machines. Near real time analysis can become cost effective, reducing the lag between image collection and base classification.
  • The technology disclosed can use upsampled images produced by interpolating sensor pixels into subpixels and then producing templates that resolve positional uncertainty.
  • A resulting subpixel is submitted to a base caller for classification that treats the subpixel as if it were at the center of a cluster.
  • Clusters are determined from groups of adjoining subpixels that repeatedly receive the same base classification. This aspect of the technology leverages existing base calling technology to determine shapes of clusters and to hyper-locate cluster centers with a subpixel resolution.
  • Another aspect of the technology disclosed is to create ground truth training data sets that pair images with confidently determined cluster centers and/or cluster shapes. Deep learning systems and other machine learning approaches require substantial training sets. Human curated data is expensive to compile.
  • The technology disclosed can be used to leverage existing classifiers, in a non-standard mode of operation, to generate large sets of confidently classified training data without intervention or the expense of a human curator.
  • The training data correlates raw images with cluster centers and/or cluster shapes obtained from existing classifiers operating in a non-standard mode; it can then be used to train systems, such as CNN-based deep learning systems, that directly process image sequences.
  • One training image can be rotated and reflected to produce additional, equally valid examples. Training examples can focus on regions of a predetermined size within an overall image. The context evaluated during base calling determines the size of example training regions, rather than the size of an image frame or of the overall imaged media.
  • The technology disclosed can produce different types of maps, usable as training data or as templates for base classification, which correlate cluster centers and/or cluster shapes with digital images.
  • A subpixel can be classified as a cluster center, thereby localizing a cluster center within a physical sensor pixel.
  • A cluster center can be calculated as the centroid of a cluster shape. This location can be reported with a selected numeric precision.
  • A cluster center can be reported with surrounding subpixels in a decay map, either at subpixel or pixel resolution. A decay map reduces the weight given to photons detected in regions as separation of the regions from the cluster center increases, attenuating signals from more distant positions.
  • Binary or ternary classifications can be applied to subpixels or pixels in clusters of adjoining regions.
  • In a binary classification, a region is classified as belonging to a cluster center or as background.
  • In a ternary classification, the third class type is assigned to the region that contains the cluster interior, but not the cluster center.
  • Subpixel classification of cluster center locations could be substituted for real-valued cluster center coordinates within a larger optical pixel.
  • The alternative styles of maps can initially be produced as ground truth data sets, or, with training, they can be produced using a neural network. For instance, clusters can be depicted as disjoint regions of adjoining subpixels with appropriate classifications. Intensity mapped clusters from a neural network can be post-processed by a peak detector filter to calculate cluster centers, if the centers have not already been determined. Applying a so-called watershed analysis, abutting regions can be assigned to separate clusters. When produced by a neural network inference engine, the maps can be used as templates for evaluating a sequence of digital images and classifying bases over cycles of base calling.
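  • A sketch of this post-processing, assuming scikit-image is available: a peak detector locates candidate cluster centers in an intensity map, and a watershed assigns abutting regions to separate clusters. The function name, threshold, and minimum peak distance are illustrative, not the patent's parameters.

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def clusters_from_map(intensity_map, threshold=0.5):
    # Keep only regions above background, locate peaks as candidate
    # cluster centers, then split abutting regions into separate
    # clusters with a watershed on the inverted intensities.
    mask = intensity_map > threshold
    peaks = peak_local_max(intensity_map, min_distance=2,
                           threshold_abs=threshold)
    markers = np.zeros(intensity_map.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = watershed(-intensity_map, markers, mask=mask)
    return peaks, labels
```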
  • The first step of template generation is determining cluster metadata.
  • Cluster metadata identifies spatial distribution of clusters, including their centers, shapes, sizes, background, and/or boundaries.
  • Figure 1 shows one implementation of a processing pipeline that determines cluster metadata using subpixel base calling.
  • Figure 2 depicts one implementation of a flow cell that contains clusters in its tiles.
  • The flow cell is partitioned into lanes.
  • The lanes are further partitioned into non-overlapping regions called “tiles”. During the sequencing procedure, the clusters and their surrounding background on the tiles are imaged.
  • Figure 3 illustrates an example Illumina GA-IIx™ flow cell with eight lanes.
  • Figure 3 also shows a zoom-in on one tile and its clusters and their surrounding background.
  • Figure 4 depicts an image set of sequencing images for four-channel chemistry, i.e., the image set has four sequencing images, captured using four different wavelength bands (image/imaging channels) in the pixel domain.
  • Each image in the image set covers a tile of a flow cell and depicts intensity emissions of clusters on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • In one implementation, each imaged channel corresponds to one of a plurality of filter wavelength bands.
  • In another implementation, each imaged channel corresponds to one of a plurality of imaging events at a sequencing cycle.
  • In yet another implementation, each imaged channel corresponds to a combination of illumination with a specific laser and imaging through a specific optical filter.
  • The intensity emissions of a cluster comprise signals detected from an analyte that can be used to classify a base associated with the analyte.
  • The intensity emissions may be signals indicative of photons emitted by tags that are chemically attached to an analyte during a cycle when the tags are stimulated and that may be detected by one or more digital sensors, as described above.
  • Figure 5 is one implementation of dividing a sequencing image into subpixels (or subpixel regions).
  • In the illustrated implementation, quarter (0.25) subpixels are used, which results in each pixel in the sequencing image being divided into sixteen subpixels.
  • Since the illustrated sequencing image has a resolution of 20 x 20 pixels, i.e., 400 pixels, the division produces 6400 subpixels.
  • Each of the subpixels is treated by a base caller as a region center for subpixel base calling. In some implementations, this base caller does not use neural network-based processing. In other implementations, this base caller is a neural network-based base caller.
  • The base caller is configured with logic to produce a base call for a particular subpixel at a given sequencing cycle by performing image processing steps and extracting intensity data for the subpixel from the corresponding image set of the sequencing cycle. This is done for each of the subpixels and for each of a plurality of sequencing cycles. Experiments have also been carried out with quarter subpixel division of 1800 x 1800 pixel resolution tile images of the Illumina MiSeq sequencer. Subpixel base calling was performed for fifty sequencing cycles and for ten tiles of a lane.
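  • A minimal sketch (not the patent's implementation) of enumerating the subpixel region centers that would be handed to the base caller under quarter-subpixel division; the tile size matches the 20 x 20 pixel example above.

```python
import numpy as np

def subpixel_centers(height, width, factor=4):
    # Divide each pixel into factor x factor subpixels (quarter
    # subpixels for factor=4) and return the subpixel center
    # coordinates in pixel units; each center can be treated as a
    # candidate region center for subpixel base calling.
    step = 1.0 / factor
    ys = np.arange(0, height, step) + step / 2
    xs = np.arange(0, width, step) + step / 2
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)

centers = subpixel_centers(20, 20)   # shape (80, 80, 2) -> 6400 centers
print(centers.shape)
```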
  • Figure 6 shows preliminary center coordinates of the clusters identified by the base caller during the subpixel base calling.
  • Figure 6 also shows “origin subpixels” or “center subpixels” that contain the preliminary center coordinates.
  • Figure 7 depicts one example of merging subpixel base calls produced over the plurality of sequencing cycles to generate the so-called “cluster maps” that contain the cluster metadata.
  • The subpixel base calls are merged using a breadth-first search approach.
  • Figure 8a illustrates one example of a cluster map generated by the merging of the subpixel base calls.
  • Figure 8b depicts one example of subpixel base calling.
  • Figure 8b also shows one implementation of analyzing subpixel-wise base call sequences produced from the subpixel base calling to generate a cluster map.
  • Cluster metadata determination involves analyzing image data produced by a sequencing instrument 102 (e.g., Illumina’s iSeq, HiSeqX, HiSeq3000, HiSeq4000, HiSeq2500, NovaSeq 6000, NextSeq, NextSeqDx, MiSeq and MiSeqDx).
  • Base calling is the process in which the raw signal of the sequencing instrument 102, i.e., intensity data extracted from images, is decoded into DNA sequences and quality scores.
  • The Illumina platforms employ cyclic reversible termination (CRT) chemistry for base calling.
  • The process relies on growing nascent DNA strands complementary to template DNA strands with modified nucleotides, while tracking an emitted signal of each newly added nucleotide.
  • The modified nucleotides have a 3’ removable block that anchors a fluorophore signal of the nucleotide type.
  • Sequencing occurs in repetitive cycles, each comprising three steps: (a) extension of a nascent strand by adding a modified nucleotide; (b) excitation of the fluorophores using one or more lasers of the optical system 104 and imaging through different filters of the optical system 104, yielding sequencing images 108; and (c) cleavage of the fluorophores and removal of the 3’ block in preparation for the next sequencing cycle. Incorporation and imaging cycles are repeated up to a designated number of sequencing cycles, defining the read length of all clusters. Using this approach, each cycle interrogates a new position along the template strands.
  • The tremendous power of the Illumina platforms stems from their ability to simultaneously execute and sense millions or even billions of clusters undergoing CRT reactions.
  • The sequencing process occurs in a flow cell 202, a small glass slide that holds the input DNA fragments during the sequencing process.
  • The flow cell 202 is connected to the high-throughput optical system 104, which comprises microscopic imaging, excitation lasers, and fluorescence filters.
  • The flow cell 202 comprises multiple chambers called lanes 204.
  • The lanes 204 are physically separated from each other and may contain different tagged sequencing libraries, distinguishable without sample cross-contamination.
  • The imaging device 106 (e.g., a solid-state imager such as a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) sensor) takes snapshots at multiple locations along the lanes 204 in a series of non-overlapping regions called tiles 206.
  • A cluster 302 comprises approximately one thousand identical copies of a template molecule, though clusters vary in size and shape.
  • The clusters are grown from the template molecule, prior to the sequencing run, by bridge amplification of the input library.
  • The purpose of the amplification and cluster growth is to increase the intensity of the emitted signal, since the imaging device 106 cannot reliably sense a single fluorophore.
  • The physical distance between the DNA fragments within a cluster 302 is small, so the imaging device 106 perceives the cluster of fragments as a single spot 302.
  • The output of a sequencing run is the sequencing images 108, each depicting intensity emissions of clusters on the tile in the pixel domain for a specific combination of lane, tile, sequencing cycle, and fluorophore (208A, 208C, 208T, 208G).
  • A biosensor comprises an array of light sensors.
  • A light sensor is configured to sense information from a corresponding pixel area (e.g., a reaction site/well/nanowell) on the detection surface of the biosensor.
  • An analyte disposed in a pixel area is said to be associated with the pixel area, i.e., the associated analyte.
  • The light sensor corresponding to the pixel area is configured to detect/capture/sense emissions/photons from the associated analyte and, in response, generate a pixel signal for each imaged channel.
  • In one implementation, each imaged channel corresponds to one of a plurality of filter wavelength bands. In another implementation, each imaged channel corresponds to one of a plurality of imaging events at a sequencing cycle. In yet another implementation, each imaged channel corresponds to a combination of illumination with a specific laser and imaging through a specific optical filter.
  • Pixel signals from the light sensors are communicated to a signal processor coupled to the biosensor (e.g., via a communication port). For each sequencing cycle and each imaged channel, the signal processor produces an image whose pixels respectively depict the pixel signals generated by the corresponding light sensors.
  • A pixel in the image corresponds to: (i) a light sensor of the biosensor that generated the pixel signal depicted by the pixel, (ii) an associated analyte whose emissions were detected by the corresponding light sensor and converted into the pixel signal, and (iii) a pixel area on the detection surface of the biosensor that holds the associated analyte.
  • Suppose a sequencing run uses two different imaged channels: a red channel and a green channel. Then, at each sequencing cycle, the signal processor produces a red image and a green image. This way, for a series of k sequencing cycles of the sequencing run, a sequence with k pairs of red and green images is produced as output.
  • Pixels in the red and green images (i.e., different imaged channels) have a one-to-one correspondence within a sequencing cycle, and pixels across the pairs of red and green images have a one-to-one correspondence between the sequencing cycles. This means that corresponding pixels in different pairs of the red and green images depict intensity data for the same associated analyte, albeit for different acquisition events/timesteps (sequencing cycles) of the sequencing run.
  • Corresponding pixels in the red and green images can be considered a pixel of a “per-cycle image” that expresses intensity data in a first red channel and a second green channel.
  • A per-cycle image whose pixels depict pixel signals for a subset of the pixel areas, i.e., a region (tile) of the detection surface of the biosensor, is called a “per-cycle tile image.”
  • A patch extracted from a per-cycle tile image is called a “per-cycle image patch.”
  • The patch extraction is performed by an input preparer.
  • The image data comprises a sequence of per-cycle image patches generated for a series of k sequencing cycles of a sequencing run.
  • The pixels in the per-cycle image patches contain intensity data for associated analytes, and the intensity data is obtained for one or more imaged channels (e.g., a red channel and a green channel) by corresponding light sensors configured to detect emissions from the associated analytes.
  • The per-cycle image patches are centered at a center pixel that contains intensity data for a target associated analyte, and non-center pixels in the per-cycle image patches contain intensity data for associated analytes adjacent to the target associated analyte.
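  • A minimal sketch of per-cycle image patch extraction, assuming the per-cycle images are stacked into an array of shape (cycles, channels, height, width); the 15 x 15 patch size and the array layout are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def extract_patch(per_cycle_images, center_row, center_col, size=15):
    # per_cycle_images: array of shape (cycles, channels, H, W).
    # Returns patches of shape (cycles, channels, size, size) centered
    # at the pixel holding the target associated analyte.
    half = size // 2
    return per_cycle_images[
        :, :,
        center_row - half: center_row + half + 1,
        center_col - half: center_col + half + 1,
    ]

# Toy input: 3 cycles, 2 imaged channels (red and green), 100 x 100 tile.
images = np.random.rand(3, 2, 100, 100)
patch = extract_patch(images, 50, 50)
print(patch.shape)  # (3, 2, 15, 15)
```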
  • The image data is prepared by an input preparer.
  • The technology disclosed accesses a series of image sets generated during a sequencing run.
  • The image sets comprise the sequencing images 108.
  • Each image set in the series is captured during a respective sequencing cycle of the sequencing run.
  • Each image (or sequencing image) in the series captures clusters on a tile of a flow cell and their surrounding background.
  • In one implementation, the sequencing run utilizes four-channel chemistry and each image set has four images. In another implementation, the sequencing run utilizes two-channel chemistry and each image set has two images. In yet another implementation, the sequencing run utilizes one-channel chemistry and each image set has two images. In yet other implementations, each image set has only one image.
  • The sequencing images 108 in the pixel domain are first converted into the subpixel domain by a subpixel addresser 110 to produce sequencing images 112 in the subpixel domain.
  • In one implementation, each pixel in the sequencing images 108 is divided into sixteen subpixels 502.
  • In some implementations, the subpixels 502 are quarter subpixels; in other implementations, they are half subpixels.
  • Each of the sequencing images 112 in the subpixel domain has a plurality of subpixels 502.
  • The subpixels are then separately fed as input to a base caller 114 to obtain, from the base caller 114, a base call classifying each of the subpixels as one of four bases (A, C, T, and G).
  • The subpixels 502 are identified to the base caller 114 based on their integer or non-integer coordinates.
  • The base caller 114 recovers the underlying DNA sequence for each subpixel. An example of this is illustrated in Figure 8b.
  • In some implementations, the technology disclosed obtains, from the base caller 114, the base call classifying each of the subpixels as one of five bases (A, C, T, G, and N).
  • An N base call denotes an undecided base call, usually due to low levels of extracted intensity.
  • Some examples of the base caller 114 include non-neural network-based Illumina offerings such as the RTA (Real Time Analysis), the Firecrest program of the Genome Analyzer Analysis Pipeline, the IPAR (Integrated Primary Analysis and Reporting) machine, and the OLB (Off-Line Basecaller).
  • The base caller 114 produces the base call sequences by interpolating intensity of the subpixels, using at least one of nearest neighbor intensity extraction, Gaussian based intensity extraction, intensity extraction based on the average of a 2 x 2 subpixel area, intensity extraction based on the brightest of a 2 x 2 subpixel area, intensity extraction based on the average of a 3 x 3 subpixel area, bilinear intensity extraction, bicubic intensity extraction, and/or intensity extraction based on weighted area coverage.
  • The base caller 114 can also be a neural network-based base caller, such as the neural network-based base caller 1514 disclosed herein.
  • The subpixel-wise base call sequences 116 are then fed as input to a searcher 118.
  • The searcher 118 searches for substantially matching base call sequences of contiguous subpixels.
  • The searcher 118 then generates a cluster map 802 that identifies clusters as disjointed regions, e.g., 804a-d, of contiguous subpixels that share a substantially matching base call sequence.
  • This application uses “disjointed”, “disjoint”, and “non-overlapping” interchangeably.
  • The search involves base calling the subpixels that contain parts of clusters to allow linking the called subpixels to contiguous subpixels with which they share a substantially matching base call sequence.
  • The searcher 118 requires that at least some of the disjointed regions have a predetermined minimum number of subpixels (e.g., more than 4, 6, or 10 subpixels) to be processed as a cluster.
  • The base caller 114 also identifies preliminary center coordinates of the clusters. Subpixels that contain the preliminary center coordinates are referred to as origin subpixels. Some example preliminary center coordinates (604a-c) identified by the base caller 114 and corresponding origin subpixels (606a-c) are shown in Figure 6. However, identification of the origin subpixels (preliminary center coordinates of the clusters) is not needed, as explained below.
  • The searcher 118 uses a breadth-first search for identifying substantially matching base call sequences of the subpixels by beginning with the origin subpixels 606a-c and continuing with successively contiguous non-origin subpixels 702a-c. This again is optional, as explained below.
  • Figure 8a illustrates one example of a cluster map 802 generated by the merging of the subpixel base calls.
  • The cluster map identifies a plurality of disjointed regions (depicted in various colors in Figure 8a).
  • Each disjointed region comprises a non-overlapping group of contiguous subpixels that represents a respective cluster on a tile (from whose sequencing images the cluster map is generated via the subpixel base calling).
  • The region between the disjointed regions represents the background on the tile.
  • The subpixels in the background region are called “background subpixels”.
  • The subpixels in the disjointed regions are called “cluster subpixels” or “cluster interior subpixels”.
  • Origin subpixels are those subpixels in which the preliminary cluster center coordinates, determined by the RTA or another base caller, are located.
  • The origin subpixels contain the preliminary cluster center coordinates. This means that the area covered by an origin subpixel includes a coordinate location that coincides with a preliminary cluster center coordinate location. Since the cluster map 802 is an image of logical subpixels, the origin subpixels are some of the subpixels in the cluster map.
  • The search to identify clusters with substantially matching base call sequences of the subpixels does not need to begin with identification of the origin subpixels (preliminary center coordinates of the clusters) because the search can be done for all the subpixels and can start from any subpixel (e.g., the 0,0 subpixel or any random subpixel).
  • Since each subpixel is evaluated to determine whether it shares a substantially matching base call sequence with another contiguous subpixel, the search does not depend on the origin subpixels and can start with any subpixel.
  • In some implementations, the origin subpixels (preliminary center coordinates of the clusters) identified by the base caller 114 are used to identify a first set of clusters (by identification of substantially matching base call sequences of contiguous subpixels). Then, subpixels that are not part of the first set of clusters are used to identify a second set of clusters (by identification of substantially matching base call sequences of contiguous subpixels). This allows the technology disclosed to identify additional or extra clusters for which the centers are not identified by the base caller 114. Finally, subpixels that are not part of the first and second sets of clusters are identified as background subpixels.
  • Figure 8b depicts one example of subpixel base calling.
  • In the illustrated example, each sequencing cycle has an image set with four distinct images (i.e., A, C, T, G images) captured using four different wavelength bands (image/imaging channels) and four different fluorescent dyes (one for each base).
  • Pixels in the images are divided into sixteen subpixels. Subpixels are then separately base called at each sequencing cycle by the base caller 114. To base call a given subpixel at a particular sequencing cycle, the base caller 114 uses intensities of the given subpixel in each of the four A, C, T, G images. For example, intensities in image regions covered by subpixel 1 in each of the four A, C, T, G images of cycle 1 are used to base call subpixel 1 at cycle 1. For subpixel 1, these image regions include the top-left one-sixteenth area of the respective top-left pixels in each of the four A, C, T, G images of cycle 1.
  • Similarly, intensities in image regions covered by subpixel m in each of the four A, C, T, G images of cycle n are used to base call subpixel m at cycle n.
  • For subpixel m, these image regions include the bottom-right one-sixteenth area of the respective bottom-right pixels in each of the four A, C, T, G images of cycle n.
  • This process produces subpixel-wise base call sequences 116 across the plurality of sequencing cycles.
  • The searcher 118 evaluates pairs of contiguous subpixels to determine whether they have a substantially matching base call sequence. If yes, then the pair of subpixels is stored in the cluster map 802 as belonging to a same cluster in a disjointed region. If no, then the pair of subpixels is stored in the cluster map 802 as not belonging to a same disjointed region.
  • The cluster map 802 therefore identifies contiguous sets of subpixels for which the base calls substantially match across a plurality of cycles.
  • The cluster map 802 therefore uses information from multiple cycles to provide a plurality of clusters with high confidence that each cluster provides sequence data for a single DNA strand.
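  • The following sketch (not the patent's implementation) merges per-subpixel base call sequences into a cluster map by breadth-first search; exact sequence equality stands in for the patent's "substantially matching" test, and regions below a minimum size are treated as background.

```python
from collections import deque

def build_cluster_map(calls, min_size=4):
    # calls: 2D grid (list of lists) of per-subpixel base call strings,
    # e.g. calls[y][x] == "ACTG..." over the sequencing cycles.
    # Contiguous subpixels sharing a matching sequence are merged into
    # one disjoint region by breadth-first search; all remaining
    # subpixels are background (label 0).
    h, w = len(calls), len(calls[0])
    labels = [[-1] * w for _ in range(h)]   # -1 = not yet visited
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != -1:
                continue
            queue, region = deque([(sy, sx)]), [(sy, sx)]
            labels[sy][sx] = next_label
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and calls[ny][nx] == calls[y][x]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
                        region.append((ny, nx))
            if len(region) < min_size:
                for y, x in region:          # too small: mark background
                    labels[y][x] = 0
            else:
                next_label += 1
    return labels
```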
  • a cluster metadata generator 122 then processes the cluster map 802 to determine cluster metadata, including determining spatial distribution of clusters, including their centers (810a), shapes, sizes, background, and/or boundaries based on the disjointed regions ( Figure 9).
  • the cluster metadata generator 122 identifies as background those subpixels in the cluster map 802 that do not belong to any of the disjointed regions and therefore do not contribute to any clusters. Such subpixels are referred to as background subpixels 806a-c.
  • the cluster map 802 identifies cluster boundary portions 808a-c between two contiguous subpixels whose base call sequences do not substantially match.
  • the cluster map is stored in memory (e.g., cluster maps data store 120) for use as ground truth for training a classifier such as the neural network-based template generator 1512 and the neural network-based base caller 1514.
  • the cluster metadata can also be stored in memory (e.g., cluster metadata data store 124).
  • Figure 9 shows another example of a cluster map that identifies cluster metadata, including spatial distribution of the clusters, along with cluster centers, cluster shapes, cluster sizes, cluster background, and/or cluster boundaries.
  • Figure 10 shows how a center of mass (COM) of a disjointed region in a cluster map is calculated.
  • the COM can be used as the “revised” or “improved” center of the corresponding cluster in downstream processing.
  • a center of mass generator 1004, on a cluster-by-cluster basis, determines hyperlocated center coordinates 1006 of the clusters by calculating centers of mass of the disjointed regions of the cluster map as an average of coordinates of respective contiguous subpixels forming the disjointed regions. It then stores the hyperlocated center coordinates of the clusters in the memory on the cluster-by-cluster basis for use as ground truth for training the classifier.
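  • As a worked sketch, the center of mass of a disjointed region is simply the average of its subpixel coordinates; the region representation below (a list of (row, col) tuples) is an illustrative assumption:

```python
def center_of_mass(region):
    # region: list of (row, col) coordinates of the contiguous subpixels
    # forming one disjointed region of the cluster map.
    n = len(region)
    com_row = sum(r for r, _ in region) / n
    com_col = sum(c for _, c in region) / n
    return com_row, com_col

# Example: a small L-shaped region.
print(center_of_mass([(0, 0), (0, 1), (1, 0)]))  # (0.333..., 0.333...)
```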
  • a subpixel categorizer, on the cluster-by-cluster basis, identifies centers of mass subpixels 1008 in the disjointed regions 804a-d of the cluster map 802 at the hyperlocated center coordinates 1006 of the clusters.
  • the cluster map is upsampled using interpolation.
  • the upsampled cluster map is stored in the memory for use as ground truth for training the classifier.
  • Figure 11 depicts one implementation of calculation of a weighted decay factor for a subpixel based on the Euclidean distance from the subpixel to the center of mass (COM) of the disjointed region to which the subpixel belongs.
  • the weighted decay factor gives the highest value to the subpixel containing the COM and decreases for subpixels further away from the COM.
  • the weighted decay factor is used to derive a ground truth decay map 1204 from a cluster map generated from the subpixel base calling discussed above.
  • the ground truth decay map 1204 contains an array of units and assigns at least one output value to each unit in the array. In some implementations, the units are subpixels and each subpixel is assigned an output value based on the weighted decay factor.
  • the ground truth decay map 1204 is then used as ground truth for training the disclosed neural network-based template generator 1512. In some implementations, information from the ground truth decay map 1204 is also used to prepare input for the disclosed neural network-based base caller 1514.
  • Figure 12 illustrates one implementation of an example ground truth decay map 1204 derived from an example cluster map produced by the subpixel base calling as discussed above.
  • a value is assigned to each contiguous subpixel in the disjointed regions based on a decay factor 1102 that is proportional to distance 1106 of a contiguous subpixel from a center of mass subpixel 1104 in a disjointed region to which the contiguous subpixel belongs.
  • Figure 12 depicts a ground truth decay map 1204.
  • the subpixel value is an intensity value normalized between zero and one.
  • a same predetermined value is assigned to all the subpixels identified as the background.
  • the predetermined value is a zero intensity value.
  • the ground truth decay map 1204 is generated by a ground truth decay map generator 1202 from the upsampled cluster map that expresses the contiguous subpixels in the disjointed regions and the subpixels identified as the background based on their assigned values.
  • the ground truth decay map 1204 is stored in the memory for use as ground truth for training the classifier.
  • each subpixel in the ground truth decay map 1204 has a value normalized between zero and one.
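  • The following sketch derives a decay-map value for each subpixel of a region from its Euclidean distance to the region's center of mass. The specific linear decay (1 at the COM, falling toward 0 at the farthest subpixel) is an illustrative assumption consistent with values normalized between zero and one, not a formula taken from the patent:

```python
import math

def decay_map_values(region):
    # region: list of (row, col) subpixel coordinates of one disjointed region.
    # Returns a dict mapping each subpixel to a value in [0, 1]: highest at
    # the subpixel containing the COM, decreasing with Euclidean distance.
    n = len(region)
    com = (sum(r for r, _ in region) / n, sum(c for _, c in region) / n)
    dists = {p: math.hypot(p[0] - com[0], p[1] - com[1]) for p in region}
    d_max = max(dists.values()) or 1.0  # guard single-subpixel regions
    return {p: 1.0 - d / d_max for p, d in dists.items()}

# Background subpixels (not in any region) would receive a fixed value,
# e.g., zero intensity, as described above.
```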
  • Figure 13 illustrates one implementation of deriving a ground truth ternary map 1304 from a cluster map.
  • the ground truth ternary map 1304 contains an array of units and assigns at least one output value to each unit in the array.
  • ternary map implementations of the ground truth ternary map 1304 assign three output values to each unit in the array, such that, for each unit, a first output value corresponds to a classification label or score for a background class, a second output value corresponds to a classification label or score for a cluster center class, and a third output value corresponds to a classification label or score for a cluster/cluster interior class.
  • the ground truth ternary map 1304 is used as ground truth data for training the neural network-based template generator 1512. In some implementations, information from the ground truth ternary map 1304 is also used to prepare input for the neural network-based base caller 1514.
  • Figure 13 depicts an example ground truth ternary map 1304.
  • the contiguous subpixels in the disjointed regions are categorized on the cluster-by-cluster basis by a ground truth ternary map generator 1302: the subpixels belonging to a same cluster are categorized as cluster interior subpixels, the centers of mass subpixels as cluster center subpixels, and the subpixels not belonging to any cluster as background subpixels.
  • the categorizations are stored in the ground truth ternary map 1304. These categorizations and the ground truth ternary map 1304 are stored in the memory for use as ground truth for training the classifier.
  • coordinates of the cluster interior subpixels, the cluster center subpixels, and the background subpixels are stored in the memory for use as ground truth for training the classifier. Then, the coordinates are downscaled by a factor used to upsample the cluster map. Then, on the cluster-by-cluster basis, the downscaled coordinates are stored in the memory for use as ground truth for training the classifier.
  • the ground truth ternary map generator 1302 uses the cluster maps to generate the ternary ground truth data 1304 from the upsampled cluster map.
  • the ternary ground truth data 1304 labels the background subpixels as belonging to a background class, the cluster center subpixels as belonging to a cluster center class, and the cluster interior subpixels as belonging to a cluster interior class.
  • color coding can be used to depict and distinguish the different class labels.
  • the ternary ground truth data 1304 is stored in the memory for use as ground truth for training the classifier.
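  • A minimal sketch of producing such ternary labels from a cluster map follows; the array encoding (0 = background, 1 = cluster center, 2 = cluster interior) and the use of NumPy are illustrative assumptions:

```python
import numpy as np

BACKGROUND, CENTER, INTERIOR = 0, 1, 2

def ternary_map(shape, regions, center_subpixels):
    # shape: (rows, cols) of the upsampled cluster map.
    # regions: one list of (row, col) subpixels per cluster.
    # center_subpixels: one (row, col) center of mass subpixel per cluster.
    labels = np.full(shape, BACKGROUND, dtype=np.uint8)
    for region, center in zip(regions, center_subpixels):
        for r, c in region:
            labels[r, c] = INTERIOR
        labels[center] = CENTER
    # One-hot encode into the three per-unit output values described above.
    return np.eye(3, dtype=np.float32)[labels]

print(ternary_map((4, 4), [[(1, 1), (1, 2)]], [(1, 1)]).shape)  # (4, 4, 3)
```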
  • Figure 14 illustrates one implementation of deriving a ground truth binary map 1404 from a cluster map.
  • the binary map 1404 contains an array of units and assigns at least one output value to each unit in the array. As its name implies, the binary map assigns two output values to each unit in the array, such that, for each unit, a first output value corresponds to a classification label or score for a cluster center class and a second output value corresponds to a classification label or score for a non-center class.
  • the binary map is used as ground truth data for training the neural network-based template generator 1512. In some implementations, information from the binary map is also used to prepare input for the neural network-based base caller 1514.
  • Figure 14 depicts a ground truth binary map 1404.
  • the ground truth binary map generator 1402 uses the cluster maps 120 to generate the binary ground truth data 1404 from the upsampled cluster maps.
  • the binary ground truth data 1404 labels the cluster center subpixels as belonging to a cluster center class and labels all other subpixels as belonging to a non-center class.
  • the binary ground truth data 1404 is stored in the memory for use as ground truth for training the classifier.
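  • By analogy with the ternary sketch above, a hedged sketch of the binary labels (the 1 = cluster center, 0 = non-center encoding is assumed for illustration):

```python
import numpy as np

def binary_map(shape, center_subpixels):
    # Label cluster center subpixels 1 and every other subpixel 0, then
    # one-hot encode into the two per-unit output values described above.
    labels = np.zeros(shape, dtype=np.uint8)
    for r, c in center_subpixels:
        labels[r, c] = 1
    return np.eye(2, dtype=np.float32)[labels]
```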
  • the technology disclosed generates cluster maps 120 for a plurality of tiles of the flow cell, stores the cluster maps in memory, and determines spatial distribution of clusters in the tiles based on the cluster maps 120, including their shapes and sizes. Then, the technology disclosed, in the upsampled cluster maps 120 of the clusters in the tiles, categorizes, on a cluster-by-cluster basis, subpixels as cluster interior subpixels belonging to a same cluster, cluster center subpixels, and background subpixels.
  • the technology disclosed then stores the categorizations in the memory for use as ground truth for training the classifier, and stores, on the cluster-by-cluster basis across the tiles, coordinates of the cluster interior subpixels, the cluster center subpixels, and the background subpixels in the memory for use as ground truth for training the classifier.
  • the technology disclosed then downscales the coordinates by the factor used to upsample the cluster map and stores, on the cluster-by-cluster basis across the tiles, the downscaled coordinates in the memory for use as ground truth for training the classifier.
  • the flow cell has at least one patterned surface with an array of wells that are occupied by the clusters.
  • the technology disclosed determines: (1) which ones of the wells are substantially occupied by at least one cluster, (2) which ones of the wells are minimally occupied, and (3) which ones of the wells are co-occupied by multiple clusters. This allows for determining respective metadata of multiple clusters that co-occupy a same well, i.e., centers, shapes, and sizes of two or more clusters that share a same well.
  • the solid support on which samples are amplified into clusters comprises a patterned surface.
  • A “patterned surface” refers to an arrangement of different regions in or on an exposed layer of a solid support.
  • one or more of the regions can be features where one or more amplification primers are present.
  • the features can be separated by interstitial regions where amplification primers are not present.
  • the pattern can be an x-y format of features that are in rows and columns.
  • the pattern can be a repeating arrangement of features and/or interstitial regions.
  • the pattern can be a random arrangement of features and/or interstitial regions.
  • Exemplary patterned surfaces that can be used in the methods and compositions set forth herein are described in US Pat. No. 8,778,849, US Pat. No. 9,079,148, US Pat. No. 8,778,848, and US Pub. No. 2014/0243224, each of which is incorporated herein by reference.
  • the solid support comprises an array of wells or depressions in a surface.
  • This may be fabricated as is generally known in the art using a variety of techniques, including, but not limited to, photolithography, stamping techniques, molding techniques and microetching techniques. As will be appreciated by those in the art, the technique used will depend on the composition and shape of the array substrate.
  • the features in a patterned surface can be wells in an array of wells (e.g., microwells or nanowells) on glass, silicon, plastic or other suitable solid supports with patterned, covalently-linked gel such as poly(N-(5-azidoacetamidylpentyl)acrylamide-co-acrylamide) (PAZAM, see, for example, US Pub. No. 2013/184796, WO 2016/066586, and WO 2015/002813, each of which is incorporated herein by reference in its entirety).
  • the process creates gel pads used for sequencing that can be stable over sequencing runs with a large number of cycles.
  • the covalent linking of the polymer to the wells is helpful for maintaining the gel in the structured features throughout the lifetime of the structured substrate during a variety of uses.
  • the gel need not be covalently linked to the wells.
  • for example, the gel can be silane free acrylamide (SFA, see, for example, US Pat. No. 8,563,477, which is incorporated herein by reference in its entirety).
  • a structured substrate can be made by patterning a solid support material with wells (e.g., microwells or nanowells), coating the patterned support with a gel material (e.g., PAZAM, SFA or chemically modified variants thereof, such as the azidolyzed version of SFA (azido-SFA)) and polishing the gel coated support, for example via chemical or mechanical polishing, thereby retaining gel in the wells but removing or inactivating substantially all of the gel from the interstitial regions on the surface of the structured substrate between the wells.
  • a solution of target nucleic acids (e.g., a fragmented human genome) can then be contacted with the polished substrate such that individual target nucleic acids will seed individual wells via interactions with primers attached to the gel material; however, the target nucleic acids will not occupy the interstitial regions due to absence or inactivity of the gel material. Amplification of the target nucleic acids will be confined to the wells since absence or inactivity of gel in the interstitial regions prevents outward migration of the growing nucleic acid colony.
  • the process is conveniently manufacturable, being scalable and utilizing micro- or nanofabrication methods.
  • a “flow cell” refers to a chamber comprising a solid surface across which one or more fluid reagents can be flowed.
  • flow cells and related fluidic systems and detection platforms that can be readily used in the methods of the present disclosure are described, for example, in Bentley et al., Nature 456:53-59 (2008), WO 04/018497; US 7,057,026; WO 91/06678; WO 07/123744; US 7,329,492; US 7,211,414; US 7,315,019; US 7,405,281, and US 2008/0108082, each of which is incorporated herein by reference.
  • P5 and P7 are used when referring to amplification primers. It will be understood that any suitable amplification primers can be used in the methods presented herein, and that the use of P5 and P7 are exemplary implementations only. Uses of amplification primers such as P5 and P7 on flow cells is known in the art, as exemplified by the disclosures of WO 2007/010251, WO 2006/064199, WO 2005/065814, WO 2015/106941, WO 1998/044151, and WO 2000/018957, each of which is incorporated by reference in its entirety.
  • any suitable forward amplification primer can be useful in the methods presented herein for hybridization to a complementary sequence and amplification of a sequence.
  • any suitable reverse amplification primer can be useful in the methods presented herein for hybridization to a complementary sequence and amplification of a sequence.
  • the flow cell has at least one non-patterned surface and the clusters are unevenly scattered over the non-patterned surface.
  • density of the clusters ranges from about 100,000 clusters/mm² to about 1,000,000 clusters/mm². In other implementations, density of the clusters ranges from about 1,000,000 clusters/mm² to about 10,000,000 clusters/mm².
  • the preliminary center coordinates of the clusters determined by the base caller are defined in a template image of the tile.
  • a pixel resolution, an image coordinate system, and measurement scales of the image coordinate system are the same for the template image and the images.
  • the technology disclosed relates to determining metadata about clusters on a tile of a flow cell.
  • the technology disclosed accesses (1) a set of images of the tile captured during a sequencing run and (2) preliminary center coordinates of the clusters determined by a base caller.
  • the technology disclosed obtains a base call classifying, as one of four bases, (1) origin subpixels that contain the preliminary center coordinates and (2) a predetermined neighborhood of contiguous subpixels that are successively contiguous to respective ones of the origin subpixels.
  • the predetermined neighborhood of contiguous subpixels can be an m x n subpixel patch centered at subpixels containing the origin subpixels.
  • the subpixel patch is 3 x 3 subpixels.
  • the image patch can be of any size, such as 5 x 5, 15 x 15, 20 x 20, and so on.
  • the predetermined neighborhood of contiguous subpixels can be an n-connected subpixel neighborhood centered at subpixels containing the origin subpixels.
  • the technology disclosed identifies as background those subpixels in the cluster map that do not belong to any of the disjointed regions.
  • the technology disclosed generates a cluster map that identifies the clusters as disjointed regions of contiguous subpixels that share substantially matching base call sequences.
  • the technology disclosed then stores the cluster map in memory and determines the shapes and the sizes of the clusters based on the disjointed regions in the cluster map. In other implementations, centers of the clusters are also determined.
  • Figure 15 is a block diagram that shows one implementation of generating training data that is used to train the neural network-based template generator 1512 and the neural network-based base caller 1514.
  • Figure 16 shows characteristics of the disclosed training examples used to train the neural network-based template generator 1512 and the neural network-based base caller 1514.
  • Each training example corresponds to a tile and is labelled with a corresponding ground truth data representation.
  • the ground truth data representation is a ground truth mask or a ground truth map that identifies the ground truth cluster metadata in the form of the ground truth decay map 1204, the ground truth ternary map 1304, or the ground truth binary map 1404.
  • multiple training examples correspond to a same tile.
  • the technology disclosed relates to generating training data 1504 for neural network-based template generation and base calling.
  • the technology disclosed accesses a multitude of images 108 of a flow cell 202 captured over a plurality of cycles of a sequencing run.
  • the flow cell 202 has a plurality of tiles.
  • each of the tiles has a sequence of image sets generated over the plurality of cycles.
  • Each image in the sequence of image sets 108 depicts intensity emissions of clusters 302 and their surrounding background 304 on a particular one of the tiles at a particular one of the cycles.
  • a training set constructor 1502 constructs a training set 1504 that has a plurality of training examples.
  • each training example corresponds to a particular one of the tiles and includes image data from at least some image sets in the sequence of image sets 1602 of the particular one of the tiles.
  • the image data includes images in at least some image sets in the sequence of image sets 1602 of the particular one of the tiles.
  • the images can have a resolution of 1800 x 1800. In other implementations, it can be any resolution such as 100 x 100, 3000 x 3000, 10000 x 10000, and so on.
  • the image data includes at least one image patch from each of the images.
  • the image patch covers a portion of the particular one of the tiles.
  • the image patch can have a resolution of 20 x 20.
  • the image patch can have any resolution, such as 50 x 50, 70 x 70, 90 x 90, 100 x 100, 3000 x 3000, 10000 x 10000, and so on.
  • the image data includes an upsampled representation of the image patch.
  • the upsampled representation can have a resolution of 80 x 80, for example.
  • the upsampled representation can have any resolution, such as 50 x 50, 70 x 70, 90 x 90, 100 x 100, 3000 x 3000, 10000 x 10000, and so on.
  • multiple training examples correspond to a same particular one of the tiles and respectively include as image data different image patches from each image in each of at least some image sets in a sequence of image sets 1602 of the same particular one of the tiles. In such implementations, at least some of the different image patches overlap with each other.
  • a ground truth generator 1506 generates at least one ground truth data representation for each of the training examples.
  • the ground truth data representation identifies at least one of spatial distribution of clusters and their surrounding background on the particular one of the tiles whose intensity emissions are depicted by the image data, including at least one of cluster shapes, cluster sizes, and/or cluster boundaries, and/or centers of the clusters.
  • the ground truth data representation identifies the clusters as disjointed regions of contiguous subpixels, the centers of the clusters as centers of mass subpixels within respective ones of the disjointed regions, and their surrounding background as subpixels that do not belong to any of the disjointed regions.
  • the ground truth data representation has an upsampled resolution of 80 x 80.
  • the ground truth data representation can have any resolution, such as 50 x 50, 70 x 70, 90 x 90, 100 x 100, 3000 x 3000, 10000 x 10000, and so on.
  • the ground truth data representation identifies each subpixel as either being a cluster center or a non-center. In another implementation, the ground truth data representation identifies each subpixel as either being cluster interior, cluster center, or surrounding background.
  • the technology disclosed stores, in memory, the training examples in the training set 1504 and associated ground truth data 1508 as the training data 1504 for training the neural network-based template generator 1512 and the neural network-based base caller 1514.
  • the training is operationalized by trainer 1510.
  • the technology disclosed generates the training data for a variety of flow cells, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, and cluster densities.
  • the technology disclosed uses peak detection and segmentation to determine cluster metadata.
  • the technology disclosed processes input image data 1702 derived from a series of image sets 1602 through a neural network 1706 to generate an alternative representation 1708 of the input image data 1702.
  • an image set can be for a particular sequencing cycle and include four images, one for each image channel A, C, T, and G. Then, for a sequencing run with fifty sequencing cycles, there will be fifty such image sets, i.e., a total of 200 images. When arranged temporally, fifty image sets with four images per image set would form the series of image sets 1602.
  • image patches of a certain size are extracted from each image in the fifty image sets, forming fifty image patch sets with four image patches per image patch set; in one implementation, this is the input image data 1702.
  • the input image data 1702 comprises image patch sets with four image patches per image patch set for fewer than the fifty sequencing cycles, i.e., just one, two, three, fifteen, or twenty sequencing cycles.
  • Figure 17 illustrates one implementation of processing input image data 1702 through the neural network-based template generator 1512 and generating an output value for each unit in an array.
  • the array is a decay map 1716.
  • the array is a ternary map 1718.
  • the array is a binary map 1720. The array may therefore represent one or more properties of each of a plurality of locations represented in the input image data 1702.
  • the decay map 1716, the ternary map 1718, and/or the binary map 1720 are generated by forward propagation of the trained neural network-based template generator 1512.
  • the forward propagation can be during training or during inference.
  • the size of the image array analyzed during inference depends on the size of the input image data 1702 (e.g., it can be the same, or an upscaled or downscaled version), according to one implementation.
  • Each unit can represent a pixel, a subpixel, or a superpixel.
  • the unit-wise output values of an array can characterize/represent/denote the decay map 1716, the ternary map 1718, or the binary map 1720.
  • the input image data 1702 is also an array of units in the pixel, subpixel, or superpixel resolution.
  • the neural network-based template generator 1512 uses semantic segmentation techniques to produce an output value for each unit in the input array. Additional details about the input image data 1702 can be found in Figures 21b, 22, 23, and 24 and their discussion.
  • the neural network-based template generator 1512 is a fully convolutional network, such as the one described in J. Long, E. Shelhamer, and T. Darrell,“Fully convolutional networks for semantic segmentation,” in CVPR, (2015), which is incorporated herein by reference.
  • the neural network-based template generator 1512 is a U-Net network with skip connections between the decoder and the encoder, such as the one described in Ronneberger O, Fischer P, Brox T., “U-net: Convolutional networks for biomedical image segmentation,” Med. Image Comput. Comput. Assist. Interv.
  • the U-Net architecture resembles an autoencoder with two main sub-structures: (1) an encoder, which takes an input image and reduces its spatial resolution through multiple convolutional layers to create a representation encoding; and (2) a decoder, which takes the representation encoding and increases spatial resolution back to produce a reconstructed image as output.
  • the U-Net introduces two innovations to this architecture: First, the objective function is set to reconstruct a segmentation mask using a loss function; and second, the convolutional layers of the encoder are connected to the corresponding layers of the same resolution in the decoder using skip connections.
  • the neural network-based template generator 1512 is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder network.
  • the encoder subnetwork includes a hierarchy of encoders and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps. Additional details about segmentation networks can be found in the Appendix entitled “Segmentation Networks”.
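  • For orientation, the following is a minimal PyTorch sketch of an encoder-decoder with skip connections in the U-Net style described above; the channel counts, depth, and three-class output head are illustrative assumptions, not the architecture disclosed for the template generator 1512:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_channels=4, num_classes=3):
        super().__init__()
        self.enc1 = conv_block(in_channels, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)   # 64 = 32 (upsampled) + 32 (skip)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)   # 32 = 16 (upsampled) + 16 (skip)
        self.head = nn.Conv2d(16, num_classes, 1)  # unit-wise class scores

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)  # e.g., ternary scores per unit

# Example: one 4-channel 80 x 80 input yields 3 output values per unit.
print(TinyUNet()(torch.randn(1, 4, 80, 80)).shape)  # torch.Size([1, 3, 80, 80])
```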
  • the neural network-based template generator 1512 is a convolutional neural network. In another implementation, the neural network-based template generator 1512 is a recurrent neural network. In yet another implementation, the neural network-based template generator 1512 is a residual neural network with residual blocks and residual connections. In a further implementation, the neural network-based template generator 1512 is a combination of a convolutional neural network and a recurrent neural network.
  • the neural network-based template generator 1512 can use various padding and striding configurations. It can use different output functions (e.g., classification or regression) and may or may not include one or more fully-connected layers. It can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1 x 1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions.
  • It can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous SGD.
  • each image in the sequence of image sets 1602 covers a tile and depicts intensity emissions of clusters on a tile and their surrounding background captured for a particular imaging channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on a flow cell.
  • the input image data 1702 includes at least one image patch from each of the images in the sequence of image sets 1602. In such an implementation, the image patch covers a portion of the tile. In one example, the image patch has a resolution of 20 x 20. In other cases, the resolution of the image patch can range from 20 x 20 to 10000 x 10000. In another implementation, the input image data 1702 includes an upsampled, subpixel resolution representation of the image patch from each of the images in the sequence of image sets 1602. In one example, the upsampled, subpixel representation has a resolution of 80 x 80. In other cases, the resolution of the upsampled, subpixel representation can range from 80 x 80 to 10000 x 10000.
  • the input image data 1702 has an array of units 1704 that depicts clusters and their surrounding background.
  • the alternative representation is a feature map.
  • the feature map can be a convolved feature or convolved representation when the neural network is a convolutional neural network.
  • the feature map can be a hidden state feature or hidden state representation when the neural network is a recurrent neural network.
  • the technology disclosed processes the alternative representation 1708 through an output layer 1710 to generate an output 1714 that has an output value 1712 for each unit in the array 1704.
  • the output layer can be a classification layer such as softmax or sigmoid that produces unit-wise output values.
  • the output layer is a ReLU layer or any other activation function layer that produces unit-wise output values.
  • the units in the input image data 1702 are pixels and therefore pixel-wise output values 1712 are produced in the output 1714.
  • the units in the input image data 1702 are subpixels and therefore subpixel-wise output values 1712 are produced in the output 1714.
  • the units in the input image data 1702 are superpixels and therefore superpixel-wise output values 1712 are produced in the output 1714.
  • Figure 18 shows one implementation of post-processing techniques that are applied to the decay map 1716, the ternary map 1718, or the binary map 1720 produced by the neural network-based template generator 1512 to derive cluster metadata, including cluster centers, cluster shapes, cluster sizes, cluster background, and/or cluster boundaries.
  • the post-processing techniques are applied by a post-processor 1814 that further comprises a thresholder 1802, a peak locator 1806, and a segmenter 1810.
  • the input to the thresholder 1802 is the decay map 1716, the ternary map 1718, or the binary map 1720 produced by template generator 1512, such as the disclosed neural network-based template generator.
  • the thresholder 1802 applies thresholding on the values in the decay map, the ternary map, or the binary map to identify background units 1804 (i.e., subpixels characterizing non-cluster background) and non-background units.
  • the thresholder 1802 thresholds output values of the units 1712 and classifies (or can reclassify) a first subset of the units 1712 as “background units” 1804 depicting the surrounding background of the clusters and “non-background units” depicting units that potentially belong to clusters.
  • the threshold value applied by the thresholder 1802 can be preset.
  • the input to the peak locator 1806 is also the decay map 1716, the ternary map 1718, or the binary map 1720 produced by the neural network-based template generator 1512.
  • the peak locator 1806 applies peak detection on the values in the decay map 1716, the ternary map 1718, or the binary map 1720 to identify center units 1808 (i.e., center subpixels characterizing cluster centers).
  • the peak locator 1806 processes the output values of the units 1712 in the output 1714 and classifies a second subset of the units 1712 as“center units” 1808 containing centers of the clusters.
  • the centers of the clusters detected by the peak locator 1806 are also the centers of mass of the clusters.
  • the center units 1808 are then provided to the segmenter 1810. Additional details about the peak locator 1806 can be found in the Appendix entitled “Peak Detection”.
  • the thresholding and the peak detection can be done in parallel or one after the other. That is, they are not dependent on each other.
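  • A hedged sketch of these two independent steps on a decay-map-like output follows; the fixed threshold, the 3 x 3 local-maximum test, and the use of SciPy are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def threshold_units(output_values, threshold=0.2):
    # Units above the threshold are non-background units that potentially
    # belong to clusters; the rest are background units.
    return output_values > threshold

def locate_peaks(output_values, threshold=0.2):
    # A unit is a center unit if it is the local maximum within its
    # 3 x 3 neighborhood and is not background.
    local_max = output_values == maximum_filter(output_values, size=3)
    return np.argwhere(local_max & (output_values > threshold))

decay_like = np.random.rand(80, 80).astype(np.float32)
non_background = threshold_units(decay_like)  # thresholding
centers = locate_peaks(decay_like)            # peak detection, independent of it
```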
  • the input to the segmenter 1810 is also the decay map 1716, the ternary map 1718, or the binary map 1720 produced by the neural network-based template generator 1512.
  • Additional supplemental input to the segmenter 1810 comprises the thresholded units (background, non-background) 1804 identified by the thresholder 1802 and the center units 1808 identified by the peak locator 1806.
  • the segmenter 1810 uses the background, non-background 1804 and the center units 1808 to identify disjointed regions 1812 (i.e., non-overlapping groups of contiguous cluster/cluster interior subpixels characterizing clusters).
  • the segmenter 1810 processes the output values of the units 1712 in the output 1714 and uses the background, non-background units 1804 and the center units 1808 to determine shapes 1812 of the clusters as non-overlapping regions of contiguous units separated by the background units 1804 and centered at the center units 1808.
  • the output of the segmenter 1810 is cluster metadata 1812.
  • the cluster metadata 1812 identifies cluster centers, cluster shapes, cluster sizes, cluster background, and/or cluster boundaries.
  • the segmenter 1810 begins with the center units 1808 and determines, for each center unit, a group of successively contiguous units that depict a same cluster whose center of mass is contained in the center unit. In one implementation, the segmenter 1810 uses a so-called “watershed” segmentation technique to subdivide contiguous clusters into multiple adjoining clusters at a valley in intensity. Additional details about the watershed segmentation technique and other segmentation techniques can be found in the Appendix entitled “Watershed Segmentation”.
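  • As a sketch of this step (marker seeding and the scikit-image watershed call are assumptions for illustration; the patent's own appendix describes the technique in detail), the detected center units can seed a watershed that splits touching clusters at intensity valleys:

```python
import numpy as np
from scipy.ndimage import maximum_filter
from skimage.segmentation import watershed

def segment_clusters(decay_like, threshold=0.2):
    non_background = decay_like > threshold
    # Seed one marker per center unit (local maxima, as detected above).
    local_max = (decay_like == maximum_filter(decay_like, size=3)) & non_background
    markers = np.zeros(decay_like.shape, dtype=np.int32)
    markers[local_max] = np.arange(1, local_max.sum() + 1)
    # Flood from the seeds over the negated map so high output values are
    # basins; restrict growth to non-background units.
    return watershed(-decay_like, markers=markers, mask=non_background)

labels = segment_clusters(np.random.rand(80, 80).astype(np.float32))
```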
  • the output values of the units 1712 in the output 1714 are continuous values, such as the ones encoded in the ground truth decay map 1204.
  • the output values are softmax scores, such as the ones encoded in the ground truth ternary map 1304 and the ground truth binary map 1404.
  • the contiguous units in the respective ones of the non-overlapping regions have output values weighted according to distance of a contiguous unit from a center unit in a non-overlapping region to which the contiguous unit belongs.
  • the center units have highest output values within the respective ones of the non-overlapping regions.
  • during training, the decay map 1716, the ternary map 1718, and the binary map 1720 (i.e., cumulatively the output 1714) progressively match or approach the ground truth decay map 1204, the ground truth ternary map 1304, and the ground truth binary map 1404, respectively.
  • cluster shapes determined by the technology disclosed can be used to extract intensity of the clusters. Since clusters typically have irregular shapes and contours, the technology disclosed can be used to identify which subpixels contribute to the irregularly shaped disjointed/non-overlapping regions that represent the cluster shapes.
  • Figure 19 depicts one implementation of extracting cluster intensity in the pixel domain.
  • “Template image” or “template” can refer to a data structure that contains or identifies the cluster metadata 1812 derived from the decay map 1716, the ternary map 1718, and/or the binary map 1720.
  • the cluster metadata 1812 identifies cluster centers, cluster shapes, cluster sizes, cluster background, and/or cluster boundaries.
  • the template image is in the upsampled, subpixel domain to distinguish the cluster boundaries at a fine-grained level.
  • the sequencing images 108, which contain the cluster and background intensity data, are typically in the pixel domain.
  • the technology disclosed proposes two approaches to use the cluster shape information encoded in the template image in the upsampled, subpixel resolution to extract intensities of the irregularly shaped clusters from the optical, pixel-resolution sequencing images.
  • in the first approach, depicted in Figure 19, the non-overlapping groups of contiguous subpixels identified in the template image are located in the pixel-resolution sequencing images and their intensities are extracted via interpolation. Additional details about this intensity extraction technique can be found in Figure 33 and its discussion.
  • the cluster intensity 1912 of a given cluster is determined by an intensity extractor 1902 as follows.
  • a subpixel locator 1904 identifies subpixels that contribute to the cluster intensity of the given cluster based on a corresponding non-overlapping region of contiguous subpixels that identifies a shape of the given cluster.
  • the subpixel locator 1904 locates the identified subpixels in one or more optical, pixel-resolution images 1918 generated for one or more imaging channels at a current sequencing cycle.
  • the identified subpixels' integer or non-integer coordinates (e.g., floating point coordinates) are located in the optical, pixel-resolution images after a downscaling based on a downscaling factor that matches the upsampling factor used to create the subpixel domain.
  • an interpolator and subpixel intensity combiner 1906 interpolates intensities of the identified subpixels in the images being processed, combines the interpolated intensities, and normalizes the combined interpolated intensities to produce a per-image cluster intensity for the given cluster in each of the images.
  • the normalization is performed by a normalizer 1908 and is based on a normalization factor.
  • the normalization factor is a number of the identified subpixels. This is done to normalize/account for different cluster sizes and uneven illuminations that clusters receive depending on their location on the flow cell.
  • a cross-channel subpixel intensity accumulator 1910 combines the per-image cluster intensity for each of the images to determine the cluster intensity 1912 of the given cluster at the current sequencing cycle.
  • the given cluster is base called based on the cluster intensity 1912 at the current sequencing cycle by any one of the base callers discussed in this application, yielding base calls 1916.
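  • A hedged sketch of this pixel-domain extraction follows (bilinear interpolation via scipy.ndimage.map_coordinates; the 4x downscaling factor, array shapes, and summing across channels are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def cluster_intensity(images, region_subpixels, upsampling_factor=4):
    # images: (channels, H, W) optical, pixel-resolution images for one cycle.
    # region_subpixels: (row, col) subpixels of one cluster's shape in the
    # upsampled template image.
    coords = np.asarray(region_subpixels, dtype=np.float64) / upsampling_factor
    rows, cols = coords[:, 0], coords[:, 1]
    per_image = []
    for channel in images:
        # Bilinear interpolation (order=1) at the downscaled coordinates.
        vals = map_coordinates(channel, [rows, cols], order=1)
        per_image.append(vals.sum() / len(region_subpixels))  # normalize by count
    return sum(per_image)  # combine per-image intensities across channels

images = np.random.rand(4, 2000, 2000)
print(cluster_intensity(images, [(100, 100), (100, 101), (101, 100)]))
```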
  • the output of the neural network-based template generator 1512, i.e., the decay map 1716, the ternary map 1718, and the binary map 1720, is in the optical, pixel domain. Accordingly, in such implementations, the template image is also in the optical, pixel domain.
  • Figure 20 depicts the second approach of extracting cluster intensity in the subpixel domain.
  • the sequencing images in the optical, pixel-resolution are upsampled into the subpixel resolution. This results in correspondence between the “cluster shape depicting subpixels” in the template image and the “cluster intensity depicting subpixels” in the upsampled sequencing images.
  • the cluster intensity is then extracted based on the correspondence. Additional details about this intensity extraction technique can be found in Figure 33 and its discussion.
  • the cluster intensity 2012 of a given cluster is determined by an intensity extractor 2002 as follows.
  • a subpixel locator 2004 identifies subpixels that contribute to the cluster intensity of the given cluster based on a corresponding non-overlapping region of contiguous subpixels that identifies a shape of the given cluster.
  • the subpixel locator 2004 locates the identified subpixels in one or more subpixel resolution images 2018 upsampled from corresponding optical, pixel-resolution images 1918 generated for one or more imaging channels at a current sequencing cycle.
  • the upsampling can be performed by nearest neighbor intensity extraction, Gaussian based intensity extraction, intensity extraction based on average of 2 x 2 subpixel area, intensity extraction based on brightest of 2 x 2 subpixel area, intensity extraction based on average of 3 x 3 subpixel area, bilinear intensity extraction, bicubic intensity extraction, and/or intensity extraction based on weighted area coverage. These techniques are described in detail in the Appendix entitled “Intensity Extraction Methods”.
  • the template image can, in some implementations, serve as a mask for intensity extraction.
  • a subpixel intensity combiner 2006, in each of the upsampled images, combines intensities of the identified subpixels and normalizes the combined intensities to produce a per-image cluster intensity for the given cluster in each of the upsampled images.
  • the normalization is performed by a normalizer 2008 and is based on a normalization factor.
  • the normalization factor is a number of the identified subpixels. This is done to normalize/account for different cluster sizes and uneven illuminations that clusters receive depending on their location on the flow cell.
  • a cross-channel, subpixel-intensity accumulator 2010 combines the per-image cluster intensity for each of the upsampled images to determine the cluster intensity 2012 of the given cluster at the current sequencing cycle.
  • the given cluster is base called based on the cluster intensity 2012 at the current sequencing cycle by any one of the base callers discussed in this application, yielding base calls 2016.
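  • For contrast with the pixel-domain sketch above, a hedged sketch of the subpixel-domain approach (nearest neighbor upsampling via np.kron is one of the listed options; the factor and shapes are illustrative assumptions):

```python
import numpy as np

def cluster_intensity_subpixel(images, region_subpixels, upsampling_factor=4):
    # Upsample each pixel-resolution image into the subpixel domain by
    # nearest neighbor replication, then combine and normalize the
    # template-identified subpixels in each upsampled image.
    per_image = []
    idx = tuple(np.asarray(region_subpixels).T)  # (rows, cols) index arrays
    for channel in images:
        upsampled = np.kron(channel, np.ones((upsampling_factor,) * 2))
        per_image.append(upsampled[idx].sum() / len(region_subpixels))
    return sum(per_image)  # accumulate across imaging channels

images = np.random.rand(4, 500, 500)
print(cluster_intensity_subpixel(images, [(100, 100), (100, 101), (101, 100)]))
```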
  • The discussion now turns to details of three different implementations of the neural network-based template generator 1512. These are shown in Figure 21a and include: (1) the decay map-based template generator 2600 (also called the regression model), (2) the binary map-based template generator 4600 (also called the binary classification model), and (3) the ternary map-based template generator 5400 (also called the ternary classification model).
  • the regression model 2600 is a fully convolutional network. In another implementation, the regression model 2600 is a U-Net network with skip connections between the decoder and the encoder. In one implementation, the binary classification model 4600 is a fully convolutional network. In another implementation, the binary classification model 4600 is a U-Net network with skip connections between the decoder and the encoder. In one implementation, the ternary classification model 5400 is a fully convolutional network. In another implementation, the ternary classification model 5400 is a U-Net network with skip connections between the decoder and the encoder.
  • Figure 21b depicts one implementation of the input image data 1702 that is fed as input to the neural network-based template generator 1512.
  • the input image data 1702 comprises a series of image sets 2100 with the sequencing images 108 that are generated during a certain number of initial sequencing cycles of a sequencing run (e.g., the first 2 to 7 sequencing cycles).
  • intensities of the sequencing images 108 are corrected for background and/or aligned with each other using affine transformation.
  • the sequencing run utilizes four-channel chemistry and each image set has four images.
  • the sequencing run utilizes two-channel chemistry and each image set has two images.
  • the sequencing run utilizes one-channel chemistry and each image set has two images.
  • each image set has only one image.
  • Each image 2116 in the series of image sets 2100 covers a tile 2104 of a flow cell 2102 and depicts intensity emissions of clusters 2106 on the tile 2104 and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of the sequencing run.
  • the image set includes four images 2112A, 2112C, 2112T, and 2112G: one image for each base A, C, T, and G labeled with a corresponding fluorescent dye and imaged in a corresponding wavelength band (image/imaging channel).
  • Figure 21b depicts cluster intensity emissions as 2108 and background intensity emissions as 2110.
  • the image set also includes four images 2114A, 2114C, 2114T, and 2114G: one image for each base A, C, T, and G labeled with a corresponding fluorescent dye and imaged in a corresponding wavelength band (image/imaging channel).
  • Figure 21b depicts cluster intensity emissions as 2118 and, in image 2114T, depicts background intensity emissions as 2120.
  • the input image data 1702 is encoded using intensity channels (also called imaged channels). For each of the c images obtained from the sequencer for a particular sequencing cycle, a separate imaged channel is used to encode its intensity signal data.
  • the input data 2632 comprises (i) a first red imaged channel with w x h pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the red image and (ii) a second green imaged channel with w x h pixels that depict intensity emissions of the one or more clusters and their surrounding background captured in the green image.
  • the input data to the neural network-based template generator 1512 and the neural network-based base caller 1514 is based on pH changes induced by the release of hydrogen ions during molecule extension.
  • the pH changes are detected and converted to a voltage change that is proportional to the number of bases incorporated (e.g., in the case of Ion Torrent).
  • the input data is constructed from nanopore sensing that uses biosensors to measure the disruption in current as an analyte passes through a nanopore or near its aperture while determining the identity of the base.
  • the Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: pass a single strand of DNA (or RNA) through a membrane via a nanopore and apply a voltage difference across the membrane. The nucleotides present in the pore will affect the pore’s electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore.
  • This electrical current signal (the‘squiggle’ due to its appearance when plotted) is the raw data gathered by an ONT sequencer.
  • the input data comprises normalized or scaled DAC values.
  • image data is not used as input to the neural network-based template generator 1512 or the neural network-based base caller 1514.
  • the input to the neural network-based template generator 1512 and the neural network-based base caller 1514 is based on pH changes induced by the release of hydrogen ions during molecule extension. The pH changes are detected and converted to a voltage change that is proportional to the number of bases incorporated (e.g., in the case of Ion Torrent).
  • the input to the neural network-based template generator 1512 and the neural network-based base caller 1514 is constructed from nanopore sensing that uses biosensors to measure the disruption in current as an analyte passes through a nanopore or near its aperture while determining the identity of the base.
  • the Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: pass a single strand of DNA (or RNA) through a membrane via a nanopore and apply a voltage difference across the membrane. The nucleotides present in the pore will affect the pore’s electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore.
  • This electrical current signal (the ‘squiggle’ due to its appearance when plotted) is the raw data gathered by an ONT sequencer. These measurements are stored as 16-bit integer data acquisition (DAC) values, taken at 4 kHz frequency (for example). With a DNA strand velocity of ~450 base pairs per second, this gives approximately nine raw observations per base on average. This signal is then processed to identify breaks in the open pore signal corresponding to individual reads. These stretches of raw signal are base called - the process of converting DAC values into a sequence of DNA bases.
  • the input data 2632 comprises normalized or scaled DAC values.
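  • As a small hedged sketch (median/MAD scaling is one common normalization for nanopore signals, assumed here for illustration; the patent does not prescribe a specific scheme):

```python
import numpy as np

def normalize_dac(dac_values):
    # Scale raw 16-bit DAC values to a robust zero-centered signal using
    # the median and median absolute deviation (MAD).
    dac = np.asarray(dac_values, dtype=np.float64)
    med = np.median(dac)
    mad = np.median(np.abs(dac - med)) or 1.0  # guard a constant signal
    return (dac - med) / mad

print(normalize_dac([510, 512, 515, 700, 509]))
```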
  • Figure 22 shows one implementation of extracting patches from the series of image sets 2100 in Figure 21b to produce a series of “down-sized” image sets that form the input image data 1702.
  • the sequencing images 108 in the series of image sets 2100 are of size L x L (e.g., 2000 x 2000).
  • L is any number ranging from 1 to 10,000.
  • a patch extractor 2202 extracts patches from the sequencing images 108 in the series of image sets 2100 and produces a series of down-sized image sets 2206, 2208, 2210, and 2212.
  • Each image in the series of down-sized image sets is a patch of size M x M (e.g., 20 x 20) that is extracted from a corresponding sequencing image in the series of image sets 2100.
  • the size of the patches can be preset. In other implementations, M is any number ranging from 1 to 1000.
  • the first example series of down-sized image sets 2206 is extracted from coordinates 0,0 to 20,20 in the sequencing images 108 in the series of image sets 2100.
  • the second example series of down-sized image sets 2208 is extracted from coordinates 20,20 to 40,40 in the sequencing images 108 in the series of image sets 2100.
  • the third example series of down-sized image sets 2210 is extracted from coordinates 40,40 to 60,60 in the sequencing images 108 in the series of image sets 2100.
  • the fourth example series of down-sized image sets 2212 is extracted from coordinates 60,60 to 80,80 in the sequencing images 108 in the series of image sets 2100.
  • the series of down-sized image sets form the input image data 1702 that is fed as input to the neural network-based template generator 1512. Multiple series of down-sized image sets can be simultaneously fed as an input batch and a separate output can be produced for each series in the input batch.
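  • A minimal sketch of such a patch extractor follows (NumPy slicing; the non-overlapping 20 x 20 grid matches the example coordinates above, everything else is an illustrative assumption):

```python
import numpy as np

def extract_patch_series(image_sets, top_left, size=20):
    # image_sets: (cycles, channels, L, L) sequencing images.
    # Returns the series of down-sized image sets for one tile region,
    # e.g., top_left=(0, 0) yields the patch spanning 0,0 to 20,20.
    r, c = top_left
    return image_sets[:, :, r:r + size, c:c + size]

image_sets = np.random.rand(50, 4, 2000, 2000).astype(np.float32)
series_2206 = extract_patch_series(image_sets, (0, 0))    # 0,0 to 20,20
series_2208 = extract_patch_series(image_sets, (20, 20))  # 20,20 to 40,40
print(series_2206.shape)  # (50, 4, 20, 20)
```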
  • Figure 23 depicts one implementation of upsampling the series of image sets 2100 in Figure 21b to produce a series of“upsampled” image sets 2300 that forms the input image data 1702.
  • an upsampler 2302 uses interpolation (e.g., bicubic interpolation) to upsample the sequencing images 108 in the series of image sets 2100 by an upsampling factor (e.g., 4x) and produces the series of upsampled image sets 2300.
  • the sequencing images 108 in the series of image sets 2100 are of size L x L (e.g., 2000 x 2000) and are upsampled by an upsampling factor of four to produce upsampled images of size U x U (e.g., 8000 x 8000) in the series of upsampled image sets 2300.
  • the sequencing images 108 in the series of image sets 2100 are fed directly to the neural network-based template generator 1512 and the upsampling is performed by an initial layer of the neural network-based template generator 1512. That is, the upsampler 2302 is part of the neural network-based template generator 1512 and operates as its first layer that upsamples the sequencing images 108 in the series of image sets 2100 and produces the series of upsampled image sets 2300.
  • the series of upsampled image sets 2300 forms the input image data 1702 that is fed as input to the neural network-based template generator 1512.
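  • A hedged sketch of the 4x bicubic upsampling step (scipy.ndimage.zoom with order=3 performs cubic spline interpolation; the factor matches the example above, the rest is illustrative):

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_image(image, factor=4):
    # Bicubic (order-3 spline) interpolation from L x L to U x U,
    # e.g., 2000 x 2000 -> 8000 x 8000 for factor 4.
    return zoom(image, factor, order=3)

image = np.random.rand(2000, 2000).astype(np.float32)
print(upsample_image(image).shape)  # (8000, 8000)
```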
  • Figure 24 shows one implementation of extracting patches from the series of upsampled image sets 2300 in Figure 23 to produce a series of“upsampled and down-sized” image sets 2406, 2408, 2410, and 2412 that form the input image data 1702.
  • the patch extractor 2202 extracts patches from the upsampled images in the series of upsampled image sets 2300 and produces series of upsampled and down-sized image sets 2406, 2408, 2410, and 2412.
  • Each upsampled image in the series of upsampled and down-sized image sets is a patch of size M x M (e.g., 80 x 80) that is extracted from a corresponding upsampled image in the series of upsampled image sets 2300.
  • M is any number ranging from 1 to 1000.
  • the first example series of upsampled and down-sized image sets 2406 is extracted from coordinates 0,0 to 80,80 in the upsampled images in the series of upsampled image sets 2300.
  • the second example series of upsampled and down-sized image sets 2408 is extracted from coordinates 80,80 to 160,160 in the upsampled images in the series of upsampled image sets 2300.
  • the third example series of upsampled and down-sized image sets 2410 is extracted from coordinates 160,160 to 240,240 in the upsampled images in the series of upsampled image sets 2300.
  • the fourth example series of upsampled and down-sized image sets 2412 is extracted from coordinates 240,240 to 320,320 in the upsampled images in the series of upsampled image sets 2300.
  • the series of upsampled and down-sized image sets form the input image data 1702 that is fed as input to the neural network-based template generator 1512.
  • Multiple series of upsampled and down-sized image sets can be simultaneously fed as an input batch and a separate output can be produced for each series in the input batch.
  • the three models are trained to produce different outputs. This is achieved by using different types of ground truth data representations as training labels.
  • the regression model 2600 is trained to produce output that characterizes/represents/denotes a so-called “decay map” 1716.
  • the binary classification model 4600 is trained to produce output that characterizes/represents/denotes a so-called “binary map” 1720.
  • the ternary classification model 5400 is trained to produce output that characterizes/represents/denotes a so-called “ternary map” 1718.
  • the output 1714 of each type of model comprises an array of units 1712.
  • the units 1712 can be pixels, subpixels, or superpixels.
  • the output of each type of model includes unit-wise output values, such that the output values of an array of units together characterize/represent/denote the decay map, the ternary map, or the binary map.
  • Figure 25 illustrates one implementation of an overall example process of generating ground truth data for training the neural network-based template generator 1512.
  • the ground truth data can be the decay map 1204.
  • the ground truth data can be the binary map 1404.
  • the ground truth data can be the ternary map 1304.
  • the ground truth data is generated from the cluster metadata.
  • the cluster metadata is generated by the cluster metadata generator 122.
  • the ground truth data is generated by the ground truth data generator 1506. In the illustrated implementation, the ground truth data is generated for tile A that is on lane A of flow cell A. The ground truth data is generated from the sequencing images 108 of tile A captured during sequencing run A. The sequencing images 108 of tile A are in the pixel domain. In one example involving 4-channel chemistry that generates four sequencing images per sequencing cycle, two hundred sequencing images 108 for fifty sequencing cycles are accessed. Each of the two hundred sequencing images 108 depicts intensity emissions of clusters on tile A and their surrounding background captured in a particular image channel at a particular sequencing cycle.
  • the subpixel addresser 110 converts the sequencing images 108 into the subpixel domain (e.g., by dividing each pixel into a plurality of subpixels) and produces sequencing images 112 in the subpixel domain.
  • the base caller 114 (e.g., RTA) then processes the sequencing images 112 in the subpixel domain and produces a base call for each subpixel and for each of the fifty sequencing cycles. This is referred to herein as “subpixel base calling”.
  • the subpixel base calls 116 are then merged to produce, for each subpixel, a base call sequence across the fifty sequencing cycles.
  • Each subpixel's base call sequence has fifty base calls, i.e., one base call for each of the fifty sequencing cycles.
  • the searcher 118 evaluates base call sequences of contiguous subpixels on a pair-wise basis.
  • the search involves evaluating each subpixel to determine with which of its contiguous subpixels it shares a substantially matching base call sequence.
  • the base caller 114 also identifies preliminary center coordinates of the clusters. Subpixels that contain the preliminary center coordinates are referred to as center or origin subpixels. Some example preliminary center coordinates (604a-c) identified by the base caller 114 and corresponding origin subpixels (606a-c) are shown in Figure 6. However, identification of the origin subpixels (preliminary center coordinates of the clusters) is not needed, as explained below.
  • the searcher 118 uses a breadth-first search for identifying substantially matching base call sequences of the subpixels by beginning with the origin subpixels 606a-c and continuing with successively contiguous non-origin subpixels 702a-c. This again is optional, as explained below.
  • the search for substantially matching base call sequences of the subpixels does not need identification of the origin subpixels (preliminary center coordinates of the clusters) because the search can be done for all the subpixels and the search does not have to start from the origin subpixels and instead can start from any subpixel (e.g., the 0,0 subpixel or any random subpixel).
  • since each subpixel is evaluated to determine whether it shares a substantially matching base call sequence with another contiguous subpixel, the search does not have to utilize the origin subpixels and can start with any subpixel.
  • whether or not origin subpixels are used, certain clusters are identified that do not contain the origin subpixels (preliminary center coordinates of the clusters) predicted by the base caller 114.
  • Some examples of clusters identified by the merging of the subpixel base calls and not containing an origin subpixel are clusters 812a, 812b, 812c, 812d, and 812e in Figure 8a. Therefore, use of the base caller 114 for identification of origin subpixels (preliminary center coordinates of the clusters) is optional and not essential for the search of substantially matching base call sequences of the subpixels.
  • the searcher 118 (1) identifies contiguous subpixels with substantially matching base call sequences as so-called“disjointed regions”, (2) further evaluates base call sequences of those subpixels that do not belong to any of the disjointed regions already identified at (1) to yield additional disjointed regions, and (3) then identifies background subpixels as those subpixels that do not belong to any of the disjointed regions already identified at (1) and (2).
  • Action (2) allows the technology disclosed to identify additional or extra clusters for which the centers are not identified by the base caller 114.
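One way to picture the search performed by the searcher 118 is the breadth-first sketch below. It is illustrative only, under stated assumptions: base calls are integer-coded per subpixel per cycle, “substantially matching” is approximated by an assumed agreement threshold (45 of 50 cycles), tiny regions are treated as background, and all names are hypothetical.

```python
import numpy as np
from collections import deque

def find_disjointed_regions(base_calls, min_match=45, min_size=4):
    """Breadth-first search that groups contiguous subpixels sharing
    substantially matching base call sequences into disjointed regions.
    base_calls: H x W x num_cycles integer-coded subpixel base calls.
    Returns an H x W label array: region index >= 0, or -1 for background."""
    h, w, _ = base_calls.shape
    labels = np.full((h, w), -1, dtype=int)
    visited = np.zeros((h, w), dtype=bool)
    next_label = 0
    for seed in np.ndindex(h, w):  # any subpixel can seed the search
        if visited[seed]:
            continue
        visited[seed] = True
        queue, members = deque([seed]), [seed]
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < h and 0 <= nc < w and not visited[nr, nc]:
                    # "Substantially matching": most cycles agree.
                    if np.sum(base_calls[r, c] == base_calls[nr, nc]) >= min_match:
                        visited[nr, nc] = True
                        queue.append((nr, nc))
                        members.append((nr, nc))
        if len(members) >= min_size:  # keep regions of >= min_size subpixels
            for m in members:
                labels[m] = next_label
            next_label += 1
    return labels

# Shape-only demo with random base calls (real data would form regions).
labels = find_disjointed_regions(np.random.randint(0, 4, size=(20, 20, 50)))
```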
  • the results of the searcher 118 are encoded in a so-called“cluster map” of tile A and stored in the cluster map data store 120.
  • in the cluster map, each of the clusters on tile A is identified by a respective disjointed region of contiguous subpixels, with background subpixels separating the disjointed regions to identify the surrounding background on tile A.
  • the center of mass (COM) calculator 1004 determines a center for each of the clusters on tile A by calculating a COM of each of the disjointed regions as an average of coordinates of respective contiguous subpixels forming the disjointed regions.
  • the centers of mass of the clusters are stored as COM data 2502.
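A minimal sketch of that center of mass computation, assuming the label array produced by the previous sketch; the helper name is hypothetical.

```python
import numpy as np

def centers_of_mass(labels):
    """COM of each disjointed region: the average of the (row, col)
    coordinates of the contiguous subpixels forming the region."""
    coms = {}
    for region in range(labels.max() + 1):
        rows, cols = np.nonzero(labels == region)
        coms[region] = (rows.mean(), cols.mean())
    return coms
```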
  • a subpixel categorizer 2504 uses the cluster map and the COM data 2502 to produce subpixel categorizations 2506.
  • the subpixel categorizations 2506 classify subpixels in the cluster map as (1) background subpixels, (2) COM subpixels (one COM subpixel for each disjointed region containing the COM of the respective disjointed region), and (3) cluster/cluster interior subpixels forming the respective disjointed regions. That is, each subpixel in the cluster map is assigned one of the three categories.
  • the ground truth decay map 1204 is produced by the ground truth decay map generator 1202.
  • the ground truth binary map 1404 is produced by the ground truth binary map generator 1402.
  • the ground truth ternary map 1304 is produced by the ground truth ternary map generator 1302.
  • Figure 26 illustrates one implementation of the regression model 2600.
  • the regression model 2600 is a fully convolutional network 2602 that processes the input image data 1702 through an encoder subnetwork and a corresponding decoder subnetwork.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to a full input resolution decay map 1716.
  • the regression model 2600 is a U-Net network 2604 with skip connections between the decoder and the encoder. Additional details about the segmentation networks can be found in the Appendix entitled “Segmentation Networks”.
  • Figure 27 depicts one implementation of generating a ground truth decay map 1204 from a cluster map 2702.
  • the ground truth decay map 1204 is used as ground truth data for training the regression model 2600.
  • the ground truth decay map generator 1202 assigns a weighted decay value to each contiguous subpixel in the disjointed regions based on a weighted decay factor.
  • the weighted decay value is a function of the Euclidean distance of a contiguous subpixel from a center of mass (COM) subpixel in a disjointed region to which the contiguous subpixel belongs, such that the weighted decay value is highest (e.g., 1 or 100) for the COM subpixel and decreases for subpixels farther away from the COM subpixel (see the sketch following this group of bullets).
  • the weighted decay value is multiplied by a preset factor, such as 100.
  • the ground truth decay map generator 1202 assigns all background subpixels a same predetermined value (e.g., a minimalist background value).
  • the ground truth decay map 1204 expresses the contiguous subpixels in the disjointed regions and the background subpixels based on the assigned values.
  • the ground truth decay map 1204 also stores the assigned values in an array of units, with each unit in the array representing a corresponding subpixel in the input.
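The sketch below gives one plausible concrete form of that weighted decay assignment. The exact weighting and normalization used by the generator 1202 are not reproduced here, so the peak value, the per-region linear decay, and all names are assumptions.

```python
import numpy as np

def make_decay_map(labels, coms, background_value=0.0, peak=100.0):
    """Assign each contiguous subpixel a value that is highest at the COM
    subpixel of its disjointed region and decays with Euclidean distance;
    all background subpixels get one predetermined minimalist value."""
    decay_map = np.full(labels.shape, background_value, dtype=float)
    for region, (com_r, com_c) in coms.items():
        rows, cols = np.nonzero(labels == region)
        dist = np.hypot(rows - com_r, cols - com_c)
        # Assumed normalization: peak at the COM, linear decay toward the
        # region's farthest subpixel.
        decay_map[rows, cols] = peak * (1.0 - dist / (dist.max() + 1.0))
    return decay_map
```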
  • Figure 28 is one implementation of training 2800 the regression model 2600 using a backpropagation-based gradient update technique that modifies parameters of the regression model 2600 until the decay map 1716 produced by the regression model 2600 as training output during the training 2800 progressively approaches or matches the ground truth decay map 1204.
  • the training 2800 includes iteratively optimizing a loss function that minimizes error 2806 between the decay map 1716 and the ground truth decay map 1204, and updating parameters of the regression model 2600 based on the error 2806.
  • the loss function is mean squared error, and the error is minimized on a subpixel-by-subpixel basis between the weighted decay values of corresponding subpixels in the decay map 1716 and the ground truth decay map 1204, as in the training sketch below.
  • the training 2800 includes hundreds, thousands, and/or millions of iterations of forward propagation 2808 and backward propagation 2810, including parallelization techniques such as batching.
  • the training data 1504 includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the training data 1504 is annotated with ground truth labels by an annotator 2806.
  • the training 2800 is operationalized by the trainer 1510 using a stochastic gradient update algorithm such as ADAM.
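A compact, hedged sketch of this training loop, using PyTorch purely for illustration: the toy convolutional stand-in for the U-Net, the tensor shapes, the learning rate, and the iteration count are assumptions. Only the structure mirrors the description above: forward propagation, subpixel-wise mean squared error, backward propagation, and ADAM gradient updates.

```python
import torch
import torch.nn as nn

# Stand-ins for illustration: a toy fully convolutional "regression model"
# and a synthetic batch; the real model is the U-Net described above.
model = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 3, padding=1))
images = torch.rand(4, 2, 80, 80)        # input image data (80 x 80 patches)
gt_decay_map = torch.rand(4, 1, 80, 80)  # ground truth decay maps

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # subpixel-wise mean squared error

for step in range(100):                           # many iterations in practice
    optimizer.zero_grad()
    loss = loss_fn(model(images), gt_decay_map)   # forward propagation
    loss.backward()                               # backward propagation
    optimizer.step()                              # ADAM gradient update
```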
  • Figure 29 is one implementation of template generation by the regression model 2600 during inference 2900 in which the decay map 1716 is produced by the regression model 2600 as the inference output during the inference 2900.
  • One example of the decay map 1716 is disclosed in the Appendix titled “Regression_Model_Sample_Ouput”.
  • the Appendix includes unit-wise weighted decay output values 2910 that together represent the decay map 1716.
  • the inference 2900 includes hundreds, thousands, and/or millions of iterations of forward propagation 2904, including parallelization techniques such as batching.
  • the inference 2900 is performed on inference data 2908 that includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the inference 2900 is operationalized by a tester 2906.
  • Figure 30 illustrates one implementation of subjecting the decay map 1716 to (i) thresholding to identify background subpixels characterizing cluster background and to (ii) peak detection to identify center subpixels characterizing cluster centers.
  • the thresholding is performed by the thresholder 1802, which applies local thresholding to produce binarized output.
  • the peak detection is performed by the peak locator 1806 to identify the cluster centers. Additional details about the peak locator can be found in the Appendix entitled“Peak Detection”.
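The sketch below shows one plausible way to combine the two steps just described. A global threshold stands in for the thresholder's local thresholding, and a 3 x 3 local-maximum test stands in for the peak locator; the threshold value and names are assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def threshold_and_locate_peaks(decay_map, background_threshold=0.3):
    """Binarize the decay map (background subpixels -> False), then detect
    local maxima among non-background subpixels as cluster centers."""
    binarized = decay_map > background_threshold
    # A subpixel is a peak if it equals the maximum in its 3 x 3 window.
    is_local_max = maximum_filter(decay_map, size=3) == decay_map
    peaks = np.argwhere(is_local_max & binarized)  # (row, col) centers
    return binarized, peaks
```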
  • Figure 31 depicts one implementation of a watershed segmentation technique that takes as input the background subpixels and the center subpixels respectively identified by the thresholder 1802 and the peak locator 1806, finds valleys in intensity between adjoining clusters, and outputs non-overlapping groups of contiguous cluster/cluster interior subpixels characterizing the clusters. Additional details about the watershed segmentation technique can be found in the Appendix entitled “Watershed Segmentation”.
  • a watershed segmenter 3102 takes as input (1) negativized output values 2910 in the decay map 1716, (2) binarized output of the thresholder 1802, and (3) cluster centers identified by the peak locator 1806. Then, based on the input, the watershed segmenter 3102 produces output 3104. In the output 3104, each cluster center is identified as a unique set/group of subpixels that belong to the cluster center (as long as the subpixels are “1” in the binary output, i.e., not background subpixels). Further, the clusters are filtered based on containing at least four subpixels.
  • the watershed segmenter 3102 can be part of the segmenter 1810, which in turn is part of the post-processor.
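The following sketch mirrors those three inputs with a marker-based watershed: negativized decay values as topography, detected cluster centers as markers, and the binarized output as a mask, with sub-four-subpixel clusters filtered out as described above. The use of skimage's watershed is an assumed API choice, not the disclosed implementation.

```python
import numpy as np
from skimage.segmentation import watershed

def segment_clusters(decay_map, binarized, peaks, min_subpixels=4):
    """Marker-based watershed over the negativized decay values, masked by
    the binarized output; each detected cluster center seeds one region."""
    markers = np.zeros(decay_map.shape, dtype=int)
    for i, (r, c) in enumerate(peaks, start=1):
        markers[r, c] = i  # one unique marker per cluster center
    labels = watershed(-decay_map, markers=markers, mask=binarized)
    # Filter out clusters containing fewer than min_subpixels subpixels.
    ids, counts = np.unique(labels, return_counts=True)
    for lab, count in zip(ids, counts):
        if lab != 0 and count < min_subpixels:
            labels[labels == lab] = 0
    return labels
```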
  • Figure 32 is a table that shows an example U-Net architecture of the regression model 2600, along with details of the layers of the regression model 2600, dimensionality of the output of the layers, magnitude of the model parameters, and interconnections between the layers. Similar details are disclosed in the file titled “Regression_Model_Example_Architecture”, which is submitted as an appendix to this application.
  • Figure 33 illustrates different approaches of extracting cluster intensity using cluster shape information identified in a template image.
  • the template image identifies the cluster shape information in the upsampled, subpixel resolution.
  • the cluster intensity information is in the sequencing images 108, which are typically at the optical, pixel resolution.
  • in the first approach, coordinates of the subpixels are located in the sequencing images 108 and their respective intensities are extracted using bilinear interpolation and normalized based on a count of the subpixels that contribute to a cluster (a sketch of this approach follows this list of approaches).
  • the second approach uses a weighted area coverage technique to modulate the intensity of a pixel according to a number of subpixels that contribute to the pixel.
  • the modulated pixel intensity is normalized by a subpixel count parameter.
  • the third approach upsamples the sequencing images into the subpixel domain using bicubic interpolation, sums the intensity of the upsampled pixels belonging to a cluster, and normalizes the summed intensity based on a count of the upsampled pixels that belong to the cluster.
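As a hedged sketch of the first approach, assuming a 4x subpixel grid and scipy's bilinear sampling; the helper name, the coordinate layout, and the normalization detail are illustrative rather than the disclosed implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def extract_cluster_intensity(sequencing_image, subpixel_coords, factor=4):
    """Map each subpixel of a cluster back into pixel-resolution
    coordinates, sample the intensity by bilinear interpolation, and
    normalize by the count of contributing subpixels."""
    # subpixel_coords: list of (row, col) subpixel coordinates of a cluster.
    coords = np.asarray(subpixel_coords, dtype=float).T / factor  # shape (2, N)
    intensities = map_coordinates(sequencing_image, coords, order=1)  # bilinear
    return intensities.sum() / len(subpixel_coords)
```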
  • Figure 34 shows different approaches of base calling using the outputs of the regression model 2600.
  • the cluster centers identified from the output of the neural network-based template generator 1512 in the template image are fed to a base caller (e.g., Illumina’s Real-Time Analysis software, referred to herein as“RTA base caller”) for base calling.
  • the cluster intensities extracted from the sequencing images based on the cluster shape information in the template image are fed to the RTA base caller for base calling.
  • Figure 35 illustrates the difference in base calling performance when the RTA base caller uses ground truth center of mass (COM) location as the cluster center, as opposed to using a non-COM location as the cluster center.
  • Figure 36 shows, on the left, an example decay map 1716 produced by the regression model 2600. On the right, Figure 36 also shows an example ground truth decay map 1204 that the regression model 2600 approximates during the training.
  • Both the decay map 1716 and the ground truth decay map 1204 depict clusters as disjointed regions of contiguous subpixels, the centers of the clusters as center subpixels at centers of mass of the respective ones of the disjointed regions, and their surrounding background as background subpixels not belonging to any of the disjointed regions.
  • the contiguous subpixels in the respective ones of the disjointed regions have values weighted according to distance of a contiguous subpixel from a center subpixel in a disjointed region to which the contiguous subpixel belongs.
  • the center subpixels have the highest values within the respective ones of the disjointed regions.
  • the background subpixels all have a same minimalist background value within a decay map.
  • Figure 37 portrays one implementation of the peak locator 1806 identifying cluster centers in a decay map by detecting peaks 3702. Additional details about the peak locator can be found in the Appendix entitled “Peak Detection”.
  • Figure 38 compares peaks detected by the peak locator 1806 in the decay map 1716 produced by the regression model 2600 with peaks in a corresponding ground truth decay map 1204.
  • the red markers are peaks predicted by the regression model 2600 as cluster centers and the green markers are the ground truth centers of mass of the clusters.
  • Figure 39 illustrates performance of the regression model 2600 using precision and recall statistics.
  • the precision and recall statistics demonstrate that the regression model 2600 is good at recovering all identified cluster centers.
  • Figure 40 compares performance of the regression model 2600 with the RTA base caller for 20pM library concentration (normal run). Outperforming the RTA base caller, the regression model 2600 identifies 34,323 (4.46%) more clusters in a higher cluster density environment (i.e., 988,884 clusters).
  • Figure 40 also shows results for other sequencing metrics such as the number of clusters that pass the chastity filter (“% PF” (pass-filter)), the number of aligned reads (“% Aligned”), the number of duplicate reads (“% Duplicate”), the number of reads mismatching the reference sequence for all reads aligned to the reference sequence (“% Mismatch”), bases called with quality score 30 and above (“% Q30 bases”), and so on.
  • Figure 41 compares performance of the regression model 2600 with the RTA base caller for 30pM library concentration (dense run). Outperforming the RTA base caller, the regression model 2600 identifies 34,323 (6.27%) more clusters in a much higher cluster density environment (i.e., 1,351,588 clusters).
  • Figure 41 also shows results for other sequencing metrics such as the number of clusters that pass the chastity filter (“% PF” (pass-filter)), the number of aligned reads (“% Aligned”), the number of duplicate reads (“% Duplicate”), the number of reads mismatching the reference sequence for all reads aligned to the reference sequence (“% Mismatch”), bases called with quality score 30 and above (“% Q30 bases”), and so on.
  • Figure 42 compares the number of non-duplicate (unique or deduplicated) proper read pairs, i.e., the number of paired reads that have both reads aligned inwards within a reasonable distance, detected by the regression model 2600 versus the same detected by the RTA base caller. The comparison is made both for the 20pM normal run and the 30pM dense run.
  • Figure 42 shows that the disclosed neural network-based template generators are able to detect more clusters in fewer sequencing cycles of input to template generation than the RTA base caller.
  • the regression model 2600 identifies 11% more non-duplicate proper read pairs than the RTA base caller during the 20pM normal run and 33% more non-duplicate proper read pairs than the RTA base caller during the 30pM dense run.
  • the regression model 2600 identifies 4.5% more non-duplicate proper read pairs than the RTA base caller during the 20pM normal run and 6.3% more non-duplicate proper read pairs than the RTA base caller during the 30pM dense run.
  • Figure 43 shows, on the right, a first decay map produced by the regression model 2600.
  • the first decay map identifies clusters and their surrounding background imaged during the 20pM normal run, along with their spatial distribution depicting cluster shapes, cluster sizes, and cluster centers.
  • Figure 43 shows a second decay map produced by the regression model 2600.
  • the second decay map identifies clusters and their surrounding background imaged during the 30pM dense run, along with their spatial distribution depicting cluster shapes, cluster sizes, and cluster centers.
  • Figure 44 compares performance of the regression model 2600 with the RTA base caller for 40pM library concentration (highly dense run).
  • the regression model 2600 produced 89,441,688 more aligned bases than the RTA base caller in a much higher cluster density environment (i.e., 1,509,395 clusters).
  • Figure 44 also shows results for other sequencing metrics such as the number of clusters that pass the chastity filter (“% PF” (pass-filter)), the number of aligned reads (“% Aligned”), the number of duplicate reads (“% Duplicate”), the number of reads mismatching the reference sequence for all reads aligned to the reference sequence (“% Mismatch”), bases called with a quality score 30 and above (“% Q30 bases”), and so on.
  • Figure 45 shows, on the left, a first decay map produced by the regression model 2600.
  • the first decay map identifies clusters and their surrounding background imaged during the 40pM highly dense run, along with their spatial distribution depicting cluster shapes, cluster sizes, and cluster centers.
  • Figure 45 shows the results of the thresholding and the peak locating applied to the first decay map to distinguish the respective clusters from each other and from the background and to identify their respective cluster centers.
  • intensities of the respective clusters are identified and a chastity filter (or passing filter) is applied to reduce the mismatch rate.
  • Figure 46 illustrates one implementation of the binary classification model 4600.
  • the binary classification model 4600 is a deep fully convolutional segmentation neural network that processes the input image data 1702 through an encoder subnetwork and a corresponding decoder subnetwork.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to a full input resolution binary map 1720.
  • the binary classification model 4600 is a U-Net network with skip connections between the decoder and the encoder. Additional details about the segmentation networks can be found in the Appendix entitled“Segmentation Networks”.
  • the final output layer of the binary classification model 4600 is a unit-wise classification layer that produces a classification label for each unit in an output array.
  • the unit-wise classification layer is a subpixel-wise classification layer that produces a softmax classification score distribution for each subpixel in the binary map 1720 across two classes, namely, a cluster center class and a noncluster class, and the classification label for a given subpixel is determined from the corresponding softmax classification score distribution.
  • the unit-wise classification layer is a subpixel-wise classification layer that produces a sigmoid classification score for each subpixel in the binary map 1720, such that the activation of a unit is interpreted as the probability that the unit belongs to the first class and, conversely, one minus the activation gives the probability that it belongs to the second class.
  • the binary map 1720 expresses each subpixel based on the predicted classification scores.
  • the binary map 1720 also stores the predicted classification scores in an array of units, with each unit in the array representing a corresponding subpixel in the input.
  • Figure 47 is one implementation of training 4700 the binary classification model 4600 using a backpropagation-based gradient update technique that modifies parameters of the binary classification model 4600 until the binary map 1720 of the binary classification model 4600 progressively approaches or matches the ground truth binary map 1404.
  • the final output layer of the binary classification model 4600 is a softmax-based subpixel-wise classification layer.
  • the ground truth binary map generator 1402 assigns each ground truth subpixel either (i) a cluster center value pair (e.g., [1, 0]) or (ii) a non-center value pair (e.g., [0, 1]).
  • a first value [1] represents the cluster center class label and a second value [0] represents the non-center class label.
  • a first value [0] represents the cluster center class label and a second value [1] represents the non-center class label.
  • the ground truth binary map 1404 expresses each subpixel based on the assigned value pair/value.
  • the ground truth binary map 1404 also stores the assigned value pairs/values in an array of units, with each unit in the array representing a corresponding subpixel in the input.
  • the training includes iteratively optimizing a loss function that minimizes error 4706 (e.g., softmax error) between the binary map 1720 and the ground truth binary map 1404, and updating parameters of the binary classification model 4600 based on the error 4706.
  • the loss function is a custom-weighted binary cross-entropy loss and the error 4706 is minimized on a subpixel-by-subpixel basis between predicted classification scores (e.g., softmax scores) and labelled class scores (e.g., softmax scores) of corresponding subpixels in the binary map 1720 and the ground truth binary map 1404, as shown in Figure 47.
  • the custom-weighted loss function gives more weight to the COM subpixels, such that the cross-entropy loss is multiplied by a corresponding reward (or penalty) weight specified in a reward (or penalty) matrix whenever a COM subpixel is misclassified. Additional details about the custom-weighted loss function can be found in the Appendix entitled“Custom-Weighted Loss Function”.
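A minimal PyTorch sketch of such a custom-weighted binary cross-entropy, under the simplifying assumption that the reward/penalty matrix reduces to a single extra weight applied to COM units; the weight value, shapes, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def com_weighted_bce(predicted, target, com_mask, com_weight=10.0):
    """Binary cross-entropy where errors on COM subpixels are multiplied
    by an extra penalty weight (assumed here to be one scalar)."""
    bce = F.binary_cross_entropy(predicted, target, reduction='none')
    weights = torch.where(com_mask, com_weight * torch.ones_like(bce),
                          torch.ones_like(bce))
    return (weights * bce).mean()

# Hypothetical 80 x 80 sigmoid outputs, labels, and COM locations.
predicted = torch.rand(1, 1, 80, 80)
target = (torch.rand(1, 1, 80, 80) > 0.99).float()
com_mask = target.bool()  # up-weight exactly the labelled COM subpixels
loss = com_weighted_bce(predicted, target, com_mask)
```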
  • the training 4700 includes hundreds, thousands, and/or millions of iterations of forward propagation 4708 and backward propagation 4710, including parallelization techniques such as batching.
  • the training data 1504 includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the training data 1504 is annotated with ground truth labels by the annotator 2806.
  • the training 4700 is operationalized by the trainer 1510 using a stochastic gradient update algorithm such as ADAM.
  • Figure 48 is another implementation of training 4800 the binary classification model 4600, in which the final output layer of the binary classification model 4600 is a sigmoid-based subpixel-wise classification layer.
  • the ground truth binary map generator 1402 assigns each ground truth subpixel either (i) a cluster center value (e.g., [1]) or (ii) a non-center value (e.g., [0]).
  • the COM subpixels are assigned the cluster center value pair/value and all other subpixels are assigned the non-center value pair/value.
  • values above a threshold intermediate value between 0 and 1 represent the center class label.
  • values below a threshold intermediate value between 0 and 1 represent the non-center class label.
  • the ground truth binary map 1404 expresses each subpixel based on the assigned value pair/value.
  • the ground truth binary map 1404 also stores the assigned value pairs/values in an array of units, with each unit in the array representing a corresponding subpixel in the input.
  • the training includes iteratively optimizing a loss function that minimizes error 4806 (e.g., sigmoid error) between the binary map 1720 and the ground truth binary map 1404, and updating parameters of the binary classification model 4600 based on the error 4806.
  • the loss function is a custom-weighted binary cross-entropy loss and the error 4806 is minimized on a subpixel-by-subpixel basis between predicted scores (e.g., sigmoid scores) and labelled scores (e.g., sigmoid scores) of corresponding subpixels in the binary map 1720 and the ground truth binary map 1404, as shown in Figure 48.
  • the custom-weighted loss function gives more weight to the COM subpixels, such that the cross-entropy loss is multiplied by a corresponding reward (or penalty) weight specified in a reward (or penalty) matrix whenever a COM subpixel is misclassified. Additional details about the custom-weighted loss function can be found in the Appendix entitled “Custom-Weighted Loss Function”.
  • the training 4800 includes hundreds, thousands, and/or millions of iterations of forward propagation 4808 and backward propagation 4810, including parallelization techniques such as batching.
  • the training data 1504 includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the training data 1504 is annotated with ground truth labels by the annotator 2806.
  • the training 4800 is operationalized by the trainer 1510 using a stochastic gradient update algorithm such as ADAM.
  • Figure 49 illustrates another implementation of the input image data 1702 fed to the binary classification model 4600 and the corresponding class labels 4904 used to train the binary classification model 4600.
  • the input image data 1702 comprises a series of upsampled and down-sized image sets 4902.
  • the class labels 4904 comprise two classes: (1) “no cluster center” and (2) “cluster center”, which are distinguished using different output values. That is, (1) the light green units/subpixels 4906 represent subpixels that are predicted by the binary classification model 4600 to not contain the cluster centers and (2) the dark green units/subpixels 4908 represent subpixels that are predicted by the binary classification model 4600 to contain the cluster centers.
  • Figure 50 is one implementation of template generation by the binary classification model 4600 during inference 5000 in which the binary map 1720 is produced by the binary classification model 4600 as the inference output during the inference 5000.
  • the binary map 1720 is represented by unit-wise binary classification scores 5010.
  • the binary map 1720 has a first array 5002a of unit-wise classification scores for the non-center class and a second array 5002b of unit-wise classification scores for the cluster center class.
  • the inference 5000 includes hundreds, thousands, and/or millions of iterations of forward propagation 5004, including parallelization techniques such as batching.
  • the inference 5000 is performed on inference data 2908 that includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the inference 5000 is operationalized by the tester 2906.
  • the binary map 1720 is subjected to post-processing techniques discussed above, such as thresholding, peak detection, and/or watershed segmentation to generate cluster metadata.
  • Figure 51 depicts one implementation of subjecting the binary map 1720 to peak detection to identify cluster centers.
  • the binary map 1720 is an array of units that classifies each subpixel based on the predicted classification scores, with each unit in the array representing a corresponding subpixel in the input.
  • the classification scores can be softmax scores or sigmoid scores.
  • the binary map 1720 includes two arrays: (1) a first array 5002a of unit-wise classification scores for the non-center class and (2) a second array 5002b of unit-wise classification scores for the cluster center class. In both arrays, each unit represents a corresponding subpixel in the input.
  • the peak locator 1806 applies peak detection on the units in the binary map 1720.
  • the peak detection identifies those units that have classification scores (e.g., softmax/sigmoid scores) above a preset threshold.
  • the identified units are inferred as the cluster centers and their corresponding subpixels in the input are determined to contain the cluster centers and stored as cluster center subpixels in a subpixel classifications data store 5102. Additional details about the peak locator 1806 can be found in the Appendix entitled “Peak Detection”.
  • the remaining units and their corresponding subpixels in the input are determined to not contain the cluster centers and stored as non-center subpixels in the subpixel classifications data store 5102.
  • those units that have classification scores below a certain background threshold are set to zero.
  • such units and their corresponding subpixels in the input are inferred to denote the background surrounding the clusters and stored as background subpixels in the subpixel classifications data store 5102. In other implementations, such units can be considered noise and ignored.
  • Figure 52a shows, on the left, an example binary map produced by the binary classification model 4600. On the right, Figure 52a also shows an example ground truth binary map that the binary classification model 4600 approximates during the training.
  • the binary map has a plurality of subpixels and classifies each subpixel as either a cluster center or a non-center.
  • the ground truth binary map has a plurality of subpixels and classifies each subpixel as either a cluster center or a non-center.
  • Figure 52b illustrates performance of the binary classification model 4600 using recall and precision statistics. Applying these statistics, the binary classification model 4600 outperforms the RTA base caller.
  • Figure 53 is a table that shows an example architecture of the binary classification model 4600, along with details of the layers of the binary classification model 4600, dimensionality of the output of the layers, magnitude of the model parameters, and interconnections between the layers. Similar details are disclosed in the Appendix titled “Binary_Classification_Model_Example_Architecture”.

3. Ternary (Three Class) Classification Model
  • Figure 54 illustrates one implementation of the ternary classification model 5400.
  • the ternary classification model 5400 is a deep fully convolutional segmentation neural network that processes the input image data 1702 through an encoder subnetwork and a corresponding decoder subnetwork.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to a full input resolution ternary map 1718.
  • the ternary classification model 5400 is a U-Net network with skip connections between the decoder and the encoder. Additional details about the segmentation networks can be found in the Appendix entitled“Segmentation Networks”.
  • the final output layer of the ternary classification model 5400 is a unit-wise classification layer that produces a classification label for each unit in an output array.
  • the unit-wise classification layer is a subpixel-wise classification layer that produces a softmax classification score distribution for each subpixel in the ternary map 1718 across three classes, namely, a background class, a cluster center class, and a cluster/cluster interior class, and the classification label for a given subpixel is determined from the corresponding softmax classification score distribution.
  • the ternary map 1718 expresses each subpixel based on the predicted classification scores.
  • the ternary map 1718 also stores the predicted classification scores in an array of units, with each unit in the array representing a corresponding subpixel in the input.
  • Figure 55 is one implementation of training 5500 the ternary classification model 5400 using a backpropagation-based gradient update technique that modifies parameters of the ternary classification model 5400 until the ternary map 1718 of the ternary classification model 5400 progressively approaches or matches the training ground truth ternary maps 1304.
  • the final output layer of the ternary classification model 5400 is a softmax-based subpixel-wise classification layer.
  • the ground truth ternary map generator 1302 assigns each ground truth subpixel either (i) a background value triplet (e.g., [1, 0, 0]), (ii) a cluster center value triplet (e.g., [0, 1, 0]), or (iii) a cluster/cluster interior value triplet (e.g., [0, 0, 1]).
  • the background subpixels are assigned the background value triplet.
  • the center of mass (COM) subpixels are assigned the cluster center value triplet.
  • the cluster/cluster interior subpixels are assigned the cluster/cluster interior value triplet.
  • in the background value triplet [1, 0, 0], the first value [1] represents the background class label, the second value [0] represents the cluster center class label, and the third value [0] represents the cluster/cluster interior class label.
  • in the cluster center value triplet [0, 1, 0], the first value [0] represents the background class label, the second value [1] represents the cluster center class label, and the third value [0] represents the cluster/cluster interior class label.
  • in the cluster/cluster interior value triplet [0, 0, 1], the first value [0] represents the background class label, the second value [0] represents the cluster center class label, and the third value [1] represents the cluster/cluster interior class label.
  • the ground truth ternary map 1304 expresses each subpixel based on the assigned value triplet.
  • the ground truth ternary map 1304 also stores the assigned triplets in an array of units, with each unit in the array representing a corresponding subpixel in the input.
  • the training includes iteratively optimizing a loss function that minimizes error 5506 (e.g., softmax error) between the ternary map 1718 and the ground truth ternary map 1304, and updating parameters of the ternary classification model 5400 based on the error 5506.
  • the loss function is a custom-weighted categorical cross-entropy loss and the error 5506 is minimized on a subpixel-by-subpixel basis between predicted classification scores (e.g., softmax scores) and labelled class scores (e.g., softmax scores) of corresponding subpixels in the ternary map 1718 and the ground truth ternary map 1304, as shown in Figure 55.
  • the custom-weighted loss function gives more weight to the COM subpixels, such that the cross-entropy loss is multiplied by a corresponding reward (or penalty) weight specified in a reward (or penalty) matrix whenever a COM subpixel is misclassified. Additional details about the custom-weighted loss function can be found in the Appendix entitled“Custom-Weighted Loss Function”.
  • the training 5500 includes hundreds, thousands, and/or millions of iterations of forward propagation 5508 and backward propagation 5510, including parallelization techniques such as batching.
  • the training data 1504 includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the training data 1504 is annotated with ground truth labels by the annotator 2806.
  • the training 5500 is operationalized by the trainer 1510 using a stochastic gradient update algorithm such as ADAM.
  • Figure 56 illustrates one implementation of input image data 1702 fed to the ternary classification model 5400 and the corresponding class labels used to train the ternary classification model 5400.
  • the input image data 1702 comprises a series of upsampled and down-sized image sets 5602.
  • the class labels 5604 comprise three classes: (1) “background class”, (2) “cluster center class”, and (3) “cluster interior class”, which are distinguished using different output values.
  • some of these different output values can be visually represented as follows: (1) the grey units/subpixels 5606 represent subpixels that are predicted by the ternary classification model 5400 to be the background, (2) the dark green units/subpixels 5608 represent subpixels that are predicted by the ternary classification model 5400 to contain the cluster centers, and (3) the light green subpixels 5610 represent subpixels that are predicted by the ternary classification model 5400 to contain the interior of the clusters.
  • Figure 57 is a table that shows an example architecture of the ternary classification model 5400, along with details of the layers of the ternary classification model 5400, dimensionality of the output of the layers, magnitude of the model parameters, and interconnections between the layers. Similar details are disclosed in the Appendix titled “Ternary_Classification_Model_Example_Architecture”.
  • Figure 58 is one implementation of template generation by the ternary classification model 5400 during inference 5800 in which the ternary map 1718 is produced by the ternary classification model 5400 as the inference output during the inference 5800.
  • One example of the ternary map 1718 is disclosed in the Appendix titled “Ternary_Classification_Model_Sample_Ouput”.
  • the Appendix includes unit-wise classification scores 5810 that together represent the ternary map 1718. In the softmax applications, the Appendix has a first array 5802a of unit-wise classification scores for the background class, a second array 5802b of unit-wise classification scores for the cluster center class, and a third array 5802c of unit-wise classification scores for the cluster/cluster interior class.
  • the inference 5800 includes hundreds, thousands, and/or millions of iterations of forward propagation 5804, including parallelization techniques such as batching.
  • the inference 5800 is performed on inference data 2908 that includes, as the input image data 1702, a series of upsampled and down-sized image sets.
  • the inference 5800 is operationalized by the tester 2906.
  • the ternary map 1718 produced by the ternary classification model 5400 is subjected to post-processing techniques discussed above, such as thresholding, peak detection, and/or watershed segmentation.
  • Figure 59 graphically portrays the ternary map 1718 produced by the ternary classification model 5400 in which each subpixel has a three-way softmax classification score distribution for the three corresponding classes, namely, the background class 5906, the cluster center class 5902, and the cluster/cluster interior class 5904.
  • Figure 60 depicts an array of units produced by the ternary classification model 5400, along with the unit-wise output values.
  • each unit has three output values for the three corresponding classes, namely, the background class 5906, the cluster center class 5902, and the cluster/cluster interior class 5904.
  • each unit is assigned the class that has the highest output value, as indicated by the class in parentheses under each unit (see the sketch following this group of bullets).
  • the output values 6002, 6004, and 6006 are analyzed for each of the respective classes 5906, 5902, and 5904 (row-wise).
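A small sketch of that per-unit class assignment, with randomly generated score distributions standing in for the model's softmax outputs; the array shape and names are assumptions.

```python
import numpy as np

# Hypothetical 80 x 80 array of units; scores[k] holds the unit-wise
# softmax scores for class k (0 = background, 1 = cluster center,
# 2 = cluster/cluster interior).
scores = np.random.dirichlet(np.ones(3), size=(80, 80)).transpose(2, 0, 1)

# Each unit is assigned the class with the highest output value.
class_map = np.argmax(scores, axis=0)  # entries are 0, 1, or 2
```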
  • Figure 61 shows one implementation of subjecting the ternary map 1718 to post-processing to identify cluster centers, cluster background, and cluster interior.
  • the ternary map 1718 is an array of units that classifies each subpixel based on the predicted classification scores, with each unit in the array representing a corresponding subpixel in the input.
  • the classification scores can be softmax scores.
  • the ternary map 1718 includes three arrays: (1) a first array 5802a of unit-wise classification scores for the background class, (2) a second array 5802b of unit-wise classification scores for the cluster center class, and (3) a third array 5802c of unit-wise classification scores for the cluster interior class. In all three arrays, each unit represents a corresponding subpixel in the input.
  • the peak locator 1806 applies peak detection on softmax values in the ternary map 1718 for the cluster center class 5802b.
  • the peak detection identifies those units that have classification scores (e.g., softmax scores) above a preset threshold.
  • the identified units are inferred as the cluster centers and their corresponding subpixels in the input are determined to contain the cluster centers and stored as cluster center subpixels in a subpixel classifications and segmentations data store 6102. Additional details about the peak locator 1806 can be found in the Appendix entitled“Peak Detection”.
  • those units that have classification scores below a certain noise threshold are set to zero. Such units can be considered noise and ignored.
  • units that have classification scores for the background class 5802a above a certain background threshold (e.g., equal to or greater than 0.5) are inferred as background units/subpixels.
  • the watershed segmentation algorithm, operationalized by the watershed segmenter 3102, is used to determine the shapes of the clusters.
  • the background units/subpixels are used as a mask by the watershed segmentation algorithm.
  • Classification scores of the units/subpixels inferred as the cluster centers and the cluster interior are summed to produce so-called “cluster labels”.
  • the cluster centers are used as watershed markers for separation by intensity valleys by the watershed segmentation algorithm.
  • negativized cluster labels are provided as an input image to the watershed segmenter 3102 that performs segmentation and produces the cluster shapes as disjointed regions of contiguous cluster interior subpixels separated by the background subpixels.
  • each disjointed region includes a corresponding cluster center subpixel.
  • the corresponding cluster center subpixel is the center of the disjointed region to which it belongs.
  • centers of mass (COM) of the disjointed regions are calculated based on the underlying location coordinates and stored as new centers of the clusters.
  • the outputs of the watershed segmenter 3102 are stored in the subpixel classifications and segmentations data store 6102. Additional details about the watershed segmentation algorithm and other segmentation algorithms can be found in the Appendix entitled “Watershed Segmentation”.
  • Example outputs of the peak locator 1806 and the watershed segmenter 3102 are shown in Figures 62a, 62b, 63, and 64.
  • Figure 62a shows example predictions of the ternary classification model 5400.
  • Figure 62a shows four maps, and each map has an array of units.
  • the first map 6202 (left-most) shows each unit’s output values for the cluster center class 5802b.
  • the second map 6204 shows each unit’s output values for the cluster/cluster interior class 5802c.
  • the third map 6206 (right-most) shows each unit’s output values for the background class 5802a.
  • the fourth map 6208 (bottom) is a binary mask of the ground truth ternary map 6008 that assigns each unit the class label that has the highest output value.
  • Figure 62b illustrates other example predictions of the ternary classification model 5400.
  • Figure 62b shows four maps and each map has an array of units.
  • the first map 6212 (bottom left-most) shows each unit’s output values for the cluster/cluster interior class.
  • the second map 6214 shows each unit’s output values for the cluster center class.
  • the third map 6216 (bottom right-most) shows each unit’s output values for the background class.
  • the fourth map (top) 6210 is the ground truth ternary map that assigns each unit the class label that has the highest output value.
  • Figure 62c shows yet other example predictions of the ternary classification model 5400.
  • Figure 62c shows four maps, and each map has an array of units.
  • the first map 6220 (bottom left-most) shows each unit’s output values for the cluster/cluster interior class.
  • the second map 6222 shows each unit’s output values for the cluster center class.
  • the third map 6224 (bottom right-most) shows each unit’s output values for the background class.
  • the fourth map 6218 (top) is the ground truth ternary map that assigns each unit the class label that has the highest output value.
  • Figure 63 depicts one implementation of deriving the cluster centers and cluster shapes from the output of the ternary classification model 5400 in Figure 62a by subjecting the output to post-processing.
  • the post-processing includes, e.g., peak locating and watershed segmentation.
  • Figure 64 compares performance of the binary classification model 4600, the regression model 2600, and the RTA base caller. The performance is evaluated using a variety of sequencing metrics. One metric is the total number of clusters detected (“# clusters”), which can be measured by the number of unique cluster centers that are detected. Another metric is the number of detected clusters that pass the chastity filter (“% PF” (pass-filter)). During cycles 1-25 of a sequencing run, the chastity filter removes the least reliable clusters from the image extraction results. Clusters “pass filter” if no more than one base call has a chastity value below 0.6 in the first 25 cycles.
  • Chastity is defined as the ratio of the brightest base intensity divided by the sum of the brightest and the second brightest base intensities (see the sketch below). This metric goes beyond the quantity of the detected clusters and also conveys their quality, i.e., how many of the detected clusters can be used for accurate base calling and downstream secondary and tertiary analysis such as variant calling and variant pathogenicity annotation.
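A hedged sketch of that chastity computation and pass-filter rule, assuming per-cycle intensities for the four base channels; the array layout and the function name are illustrative.

```python
import numpy as np

def passes_chastity_filter(intensities, min_chastity=0.6, max_failures=1):
    """intensities: (num_cycles, 4) base channel intensities for one
    cluster. Chastity per cycle = brightest / (brightest + second
    brightest); a cluster passes if no more than one of its first 25
    cycles falls below the 0.6 threshold."""
    first = np.sort(intensities[:25], axis=1)  # ascending within each cycle
    chastity = first[:, -1] / (first[:, -1] + first[:, -2])
    return np.sum(chastity < min_chastity) <= max_failures

# Hypothetical example: 50 cycles, 4 base channels.
print(passes_chastity_filter(np.random.rand(50, 4)))
```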
  • Other metrics that measure how good the detected clusters are for downstream analysis include the number of aligned reads produced from the detected clusters (“% Aligned”), the number of duplicate reads produced from the detected clusters (“% Duplicate”), the number of reads produced from the detected clusters mismatching the reference sequence for all reads aligned to the reference sequence (“% Mismatch”), the number of reads produced from the detected clusters whose portions do not match well to the reference sequence on either side and thus are ignored for the alignment (“% soft clipped”), the number of bases called for the detected clusters with quality score 30 and above (“% Q30 bases”), the number of paired reads produced from the detected clusters that have both reads aligned inwards within a reasonable distance (“total proper read pairs”), and the number of unique or deduplicated proper read pairs produced from the detected clusters (“non-duplicate proper read pairs”).
  • both the binary classification model 4600 and the regression model 2600 outperform the RTA base caller at template generation on most of the metrics.
  • Figure 65 compares the performance of the ternary classification model 5400 with that of the RTA base caller under three contexts, five sequencing metrics, and two run densities.
  • the cluster centers are detected by the RTA base caller, the intensity extraction from the clusters is done by the RTA base caller, and the clusters are also base called using the RTA base caller.
  • the cluster centers are detected by the ternary classification model 5400; however, the intensity extraction from the clusters is done by the RTA base caller and the clusters are also base called using the RTA base caller.
  • the cluster centers are detected by the ternary classification model 5400 and the intensity extraction from the clusters is done using the cluster shape-based intensity extraction techniques disclosed herein (note that the cluster shape information is generated by the ternary classification model 5400); but the clusters are base called using the RTA base caller.
  • the performance is compared between the ternary classification model 5400 and the RTA base caller along five metrics: (1) the total number of clusters detected (“# clusters”), (2) the number of detected clusters that pass the chastity filter (“# PF”), (3) the number of unique or deduplicated proper read pairs produced from the detected clusters (“# nondup proper read pairs”), (4) the rate of mismatches between a sequence read produced from the detected clusters and a reference sequence after alignment (“% Mismatch rate”), and (5) the number of bases called for the detected clusters with quality score 30 and above (“% Q30”).
  • the ternary classification model 5400 outperforms the RTA base caller on all the metrics.
  • Figure 66 shows that the regression model 2600 outperforms the RTA base caller.
  • Figure 67 focuses on the penultimate layer 6702 of the neural network-based template generator 1512.
  • Figure 68 visualizes what the penultimate layer 6702 of the neural network-based template generator 1512 has learned as a result of the backpropagation-based gradient update training.
  • the illustrated implementation visualizes twenty-four out of the thirty-two convolution filters of the penultimate layer 6702 overlaid on the ground truth cluster shapes.
  • the penultimate layer 6702 has learned the cluster metadata, including spatial distribution of the clusters such as cluster centers, cluster shapes, cluster sizes, cluster background, and cluster boundaries.
  • Figure 69 overlays cluster center predictions of the binary classification model 4600 (in blue) onto those of the RTA base caller (in pink). The predictions are made on sequencing image data from the Illumina NextSeq sequencer.
  • Figure 70 overlays cluster center predictions made by the RTA base caller (in pink) onto visualization of the trained convolution filters of the penultimate layer of the binary classification model 4600. These convolution filters are learned as a result of training on sequencing image data from the Illumina NextSeq sequencer.
  • Figure 71 illustrates one implementation of training data used to train the neural network-based template generator 1512.
  • the training data is obtained from dense flow cells that produce data with storm probe images.
  • the training data is obtained from dense flow cells that produce data with fewer bridge amplification cycles.
  • Figure 72 is one implementation of using beads for image registration based on cluster center predictions of the neural network-based template generator 1512.
  • Figure 73 illustrates one implementation of cluster statistics of clusters identified by the neural network-based template generator 1512.
  • the cluster statistics include cluster size based on number of contributive subpixels and GC-content.
  • Figure 74 shows how the neural network-based template generator 1512's ability to distinguish between adjacent clusters improves when the number of initial sequencing cycles for which the input image data 1702 is used increases from five to seven.
  • a single cluster is identified by a single disjointed region of contiguous subpixels.
  • the single cluster is segmented into two adjacent clusters, each having their own disjointed regions of contiguous subpixels.
  • Figure 75 illustrates the difference in base calling performance when the RTA base caller uses ground truth center of mass (COM) location as the cluster center, as opposed to when a non-COM location is used as the cluster center.
  • Figure 76 portrays the performance of the neural network-based template generator 1512 on extra detected clusters.
  • Figure 77 shows different datasets used for training the neural network-based template generator 1512.
  • Figures 78A and 78B depict one implementation of a sequencing system 7800A.
  • the sequencing system 7800A comprises a configurable processor 7846.
  • the configurable processor 7846 implements the base calling techniques disclosed herein.
  • the sequencing system is also referred to as a“sequencer.”
  • the sequencing system 7800A can operate to obtain any information or data that relates to at least one of a biological or chemical substance.
  • the sequencing system 7800A is a workstation that may be similar to a bench-top device or desktop computer.
  • a majority (or all) of the systems and components for conducting the desired reactions can be within a common housing 7802.
  • the sequencing system 7800A is a nucleic acid sequencing system configured for various applications, including but not limited to de novo sequencing, resequencing of whole genomes or target genomic regions, and metagenomics.
  • the sequencer may also be used for DNA or RNA analysis.
  • the sequencing system 7800A may also be configured to generate reaction sites in a biosensor.
  • the sequencing system 7800A may be configured to receive a sample and generate surface attached clusters of clonally amplified nucleic acids derived from the sample. Each cluster may constitute or be part of a reaction site in the biosensor
  • the exemplary sequencing system 7800A may include a system receptacle or interface 7810 that is configured to interact with a biosensor 7812 to perform desired reactions within the biosensor 7812.
  • the biosensor 7812 is loaded into the system receptacle 7810.
  • a cartridge that includes the biosensor 7812 may be inserted into the system receptacle 7810 and in some states the cartridge can be removed temporarily or permanently.
  • the cartridge may include, among other things, fluidic control and fluidic storage components.
  • the sequencing system 7800A is configured to perform a large number of parallel reactions within the biosensor 7812.
  • the biosensor 7812 includes one or more reaction sites where desired reactions can occur.
  • the reaction sites may be, for example, immobilized to a solid surface of the biosensor or immobilized to beads (or other movable substrates) that are located within corresponding reaction chambers of the biosensor
  • the reaction sites can include, for example, clusters of clonally amplified nucleic acids.
  • the biosensor 7812 may include a solid-state imaging device (e.g., CCD or CMOS imager) and a flow cell mounted thereto.
  • the flow cell may include one or more flow channels that receive a solution from the sequencing system 7800A and direct the solution toward the reaction sites.
  • the biosensor 7812 can be configured to engage a thermal element for transferring thermal energy into or out of the flow channel
  • the sequencing system 7800A may include various components, assemblies, and systems (or sub-systems) that interact with each other to perform a predetermined method or assay protocol for biological or chemical analysis.
  • the sequencing system 7800A includes a system controller 7806 that may communicate with the various components, assemblies, and sub-systems of the sequencing system 7800A and also the biosensor 7812.
  • the sequencing system 7800A may also include a fluidic control system 7808 to control the flow of fluid throughout a fluid network of the sequencing system 7800A and the biosensor 7812; a fluid storage system 7814 that is configured to hold all fluids (e.g., gas or liquids) that may be used by the bioassay system; a temperature control system 7804 that may regulate the temperature of the fluid in the fluid network, the fluid storage system 7814, and/or the biosensor 7812; and an illumination system 7816 that is configured to illuminate the biosensor 7812.
  • the cartridge may also include fluidic control and fluidic storage components.
  • the sequencing system 7800A may include a user interface 7818 that interacts with the user
  • the user interface 7818 may include a display 7820 to display or request information from a user and a user input device 7822 to receive user inputs
  • the display 7820 and the user input device 7822 are the same device.
  • the user interface 7818 may include a touch-sensitive display configured to detect the presence of an individual’s touch and also identify a location of the touch on the display
  • the sequencing system 7800A may communicate with various components, including the biosensor 7812 (e.g., in the form of a cartridge), to perform the desired reactions.
  • the sequencing system 7800A may also be configured to analyze data obtained from the biosensor to provide a user with desired information.
  • the system controller 7806 may include any processor-based or microprocessor-based system, including systems using microcontrollers, reduced instruction set computers (RISC), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), coarse-grained reconfigurable architectures (CGRAs), logic circuits, and any other circuit or processor capable of executing functions described herein.
  • RISC reduced instruction set computers
  • ASICs application specific integrated circuits
  • FPGAs field programmable gate array
  • CGRAs coarse-grained reconfigurable architectures
  • logic circuits and any other circuit or processor capable of executing functions described herein.
  • the system controller 7806 executes a set of instructions that are stored in one or more storage elements, memories, or modules in order to at least one of obtain and analyze detection data.
  • Detection data can include a plurality of sequences of pixel signals, such that a sequence of pixel signals from each of the millions of sensors (or pixels) can be detected over many base calling cycles.
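As an illustration of this layout, the detection data can be viewed as a cycles-by-sensors array, where one sensor's sequence of pixel signals is a single fiber of the array. The dimensions below are hypothetical, chosen only to show the indexing:

```python
import numpy as np

# Hypothetical dimensions, for illustration of the data layout only.
num_cycles, height, width = 100, 3000, 4000

# pixel_signals[c, r, k] is the signal from the sensor (pixel) at (r, k)
# for base calling cycle c; pixel_signals[:, r, k] is the sequence of
# pixel signals for that one sensor over all cycles.
pixel_signals = np.zeros((num_cycles, height, width), dtype=np.uint16)
```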
  • Storage elements may be in the form of information sources or physical memory elements within the sequencing system 7800A.
  • the set of instructions may include various commands that instruct the sequencing system 7800A or biosensor 7812 to perform specific operations such as the methods and processes of the various implementations described herein.
  • the set of instructions may be in the form of a software program, which may form part of a tangible, non-transitory computer readable medium or media
  • the terms "software" and "firmware" are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory.
  • RAM memory random access memory
  • ROM memory read-only memory
  • EPROM memory erasable programmable read-only memory
  • EEPROM memory electrically erasable programmable read-only memory
  • NVRAM non-volatile RAM
  • the software may be in various forms such as system software or application software. Further, the software may be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module.
  • the software also may include modular programming in the form of object-oriented programming.
  • the detection data may be automatically processed by the sequencing system 7800A, processed in response to user inputs, or processed in response to a request made by another processing machine (e.g., a remote request through a communication link).
  • the system controller 7806 includes an analysis module 7844.
  • the system controller 7806 does not include the analysis module 7844 and instead has access to the analysis module 7844 (e.g., the analysis module 7844 may be separately hosted on the cloud).
  • the system controller 7806 may be connected to the biosensor 7812 and the other components of the sequencing system 7800A via communication links
  • the system controller 7806 may also be communicatively connected to off-site systems or servers.
  • the communication links may be hardwired, corded, or wireless.
  • the system controller 7806 may receive user inputs or commands, from the user interface 7818 and the user input device 7822
  • the fluidic control system 7808 includes a fluid network and is configured to direct and regulate the flow of one or more fluids through the fluid network.
  • the fluid network may be in fluid communication with the biosensor 7812 and the fluid storage system 7814. For example, select fluids may be drawn from the fluid storage system 7814 and directed to the biosensor 7812 in a controlled manner, or the fluids may be drawn from the biosensor 7812 and directed toward, for example, a waste reservoir in the fluid storage system 7814.
  • the fluidic control system 7808 may include flow sensors that detect a flow rate or pressure of the fluids within the fluid network. The sensors may communicate with the system controller 7806.
  • the temperature control system 7804 is configured to regulate the temperature of fluids at different regions of the fluid network, the fluid storage system 7814, and/or the biosensor 7812
  • the temperature control system 7804 may include a thermocycler that interfaces with the biosensor 7812 and controls the temperature of the fluid that flows along the reaction sites in the biosensor 7812.
  • the temperature control system 7804 may also regulate the temperature of solid elements or components of the sequencing system 7800A or the biosensor 7812.
  • the temperature control system 7804 may include sensors to detect the temperature of the fluid or other components. The sensors may communicate with the system controller 7806.
  • the fluid storage system 7814 is in fluid communication with the biosensor 7812 and may store various reaction components or reactants that are used to conduct the desired reactions therein
  • the fluid storage system 7814 may also store fluids for washing or cleaning the fluid network and biosensor 7812 and for diluting the reactants.
  • the fluid storage system 7814 may include various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous and non-polar solutions, and the like.
  • the fluid storage system 7814 may also include waste reservoirs for receiving waste products from the biosensor 7812
  • the cartridge may include one or more of a fluid storage system, fluidic control system or temperature control system.
  • a cartridge can have various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous and non-polar solutions, waste, and the like
  • a fluid storage system, fluidic control system or temperature control system can be removably engaged with a bioassay system via a cartridge or other biosensor
  • the illumination system 7816 may include a light source (e.g., one or more LEDs) and a plurality of optical components to illuminate the biosensor.
  • a light source e.g., one or more LEDs
  • Examples of light sources may include lasers, arc lamps, LEDs, or laser diodes.
  • the optical components may be, for example, reflectors, dichroics, beam splitters, collimators, lenses, filters, wedges, prisms, mirrors, detectors, and the like
  • the illumination system 7816 may be configured to direct an excitation light to reaction sites.
  • fluorophores may be excited by green wavelengths of light; as such, the wavelength of the excitation light may be approximately 532 nm.
  • the illumination system 7816 is configured to produce illumination that is parallel to a surface normal of a surface of the biosensor 7812.
  • the illumination system 7816 is configured to produce illumination that is off-angle relative to the surface normal of the surface of the biosensor 7812
  • the illumination system 7816 is configured to produce illumination that has plural angles, including some parallel illumination and some off-angle illumination.
  • the system receptacle or interface 7810 is configured to engage the biosensor 7812 in at least one of a mechanical, electrical, and fluidic manner
  • the system receptacle 7810 may hold the biosensor 7812 in a desired orientation to facilitate the flow of fluid through the biosensor 7812.
  • the system receptacle 7810 may also include electrical contacts that are configured to engage the biosensor 7812 so that the sequencing system 7800A may communicate with the biosensor 7812 and/or provide power to the biosensor 7812.
  • the system receptacle 7810 may include fluidic ports (e.g., nozzles) that are configured to engage the biosensor 7812.
  • the biosensor 7812 is removably coupled to the system receptacle 7810 in a mechanical manner, in an electrical manner, and also in a fluidic manner.
  • the sequencing system 7800A may communicate remotely with other systems or networks or with other bioassay systems 7800A. Detection data obtained by the bioassay system(s) 7800A may be stored in a remote database.
  • Figure 78B is a block diagram of a system controller 7806 that can be used in the system of Figure 78A.
  • the system controller 7806 includes one or more processors or modules that can communicate with one another.
  • Each of the processors or modules may include an algorithm (e.g., instructions stored on a tangible and/or non-transitory computer readable storage medium) or sub-algorithms to perform particular processes.
  • the system controller 7806 is illustrated conceptually as a collection of modules, but may be implemented utilizing any combination of dedicated hardware boards, DSPs, processors, etc.
  • the system controller 7806 may be implemented utilizing an off-the-shelf PC with a single processor or multiple processors, with the functional operations distributed between the processors.
  • the modules described below may be implemented utilizing a hybrid configuration in which certain modular functions are performed utilizing dedicated hardware, while the remaining modular functions are performed utilizing an off-the-shelf PC and the like.
  • the modules also may be implemented as software modules within a processing unit.
  • a communication port 7850 may transmit information (e.g., commands) to or receive information (e.g., data) from the biosensor 7812 (Figure 78A) and/or the sub-systems 7808, 7814, 7804 (Figure 78A).
  • the communication port 7850 may output a plurality of sequences of pixel signals
  • a communication link 7834 may receive user input from the user interface 7818 (Figure 78A) and transmit data or information to the user interface 7818.
  • Data from the biosensor 7812 or sub-systems 7808, 7814, 7804 may be processed by the system controller 7806 in real-time during a bioassay session. Additionally or alternatively, data may be stored temporarily in a system memory during a bioassay session and processed in slower than real-time or off-line operation
  • the system controller 7806 may include a plurality of modules 7826-7848 that communicate with a main control module 7824, along with a central processing unit (CPU) 7852
  • the main control module 7824 may communicate with the user interface 7818 ( Figure 78A).
  • Although the modules 7826-7848 are shown as communicating directly with the main control module 7824, the modules 7826-7848 may also communicate directly with each other, the user interface 7818, and the biosensor 7812. Also, the modules 7826-7848 may communicate with the main control module 7824 through the other modules.
  • the plurality of modules 7826-7848 include system modules 7828-7832, 7826 that communicate with the sub-systems 7808, 7814, 7804, and 7816, respectively.
  • the fluidic control module 7828 may communicate with the fluidic control system 7808 to control the valves and flow sensors of the fluid network for controlling the flow of one or more fluids through the fluid network.
  • the fluid storage module 7830 may notify the user when fluids are low or when the waste reservoir is at or near capacity.
  • the fluid storage module 7830 may also communicate with the temperature control module 7832 so that the fluids may be stored at a desired temperature.
  • the illumination module 7826 may communicate with the illumination system 7816 to illuminate the reaction sites at designated times during a protocol, such as after the desired reactions (e.g., binding events) have occurred. In some implementations, the illumination module 7826 may communicate with the illumination system 7816 to illuminate the reaction sites at designated angles.
  • the plurality of modules 7826-7848 may also include a device module 7836 that communicates with the biosensor 7812 and an identification module 7838 that determines identification information relating to the biosensor 7812.
  • the device module 7836 may, for example, communicate with the system receptacle 7810 to confirm that the biosensor has established an electrical and fluidic connection with the sequencing system 7800A.
  • the identification module 7838 may receive signals that identify the biosensor 7812.
  • the identification module 7838 may use the identity of the biosensor 7812 to provide other information to the user. For example, the identification module 7838 may determine and then display a lot number, a date of manufacture, or a protocol that is recommended to be run with the biosensor 7812
  • the plurality of modules 7826-7848 also includes an analysis module 7844 (also called signal processing module or signal processor) that receives and analyzes the signal data (e.g., image data) from the biosensor 7812.
  • Analysis module 7844 includes memory (e.g., RAM or Flash) to store detection/image data.
  • Detection data can include a plurality of sequences of pixel signals, such that a sequence of pixel signals from each of the millions of sensors (or pixels) can be detected over many base calling cycles.
  • the signal data may be stored for subsequent analysis or may be transmitted to the user interface 7818 to display desired information to the user.
  • the signal data may be processed by the solid-state imager (e.g., CMOS image sensor) before the analysis module 7844 receives the signal data
  • the analysis module 7844 is configured to obtain image data from the light detectors at each of a plurality of sequencing cycles.
  • the image data is derived from the emission signals detected by the light detectors; the analysis module 7844 processes the image data for each of the plurality of sequencing cycles through the neural network-based template generator 1512 and/or the neural network-based base caller 1514 and produces a base call for at least some of the analytes at each of the plurality of sequencing cycles.
  • the light detectors can be part of one or more over-head cameras (e.g., Illumina's GAIIx's CCD camera taking images of the clusters on the biosensor 7812 from the top), or can be part of the biosensor 7812 itself (e.g., Illumina's iSeq's CMOS image sensors underlying the clusters on the biosensor 7812 and taking images of the clusters from the bottom).
  • over-head cameras e.g., Illumina’s GAIIx’s CCD camera taking images of the clusters on the biosensor 7812 from the top
  • the output of the light detectors is the sequencing images, each depicting intensity emissions of the clusters and their surrounding background.
  • the sequencing images depict intensity emissions generated as a result of nucleotide incorporation in the sequences during the sequencing.
  • the intensity emissions are from associated analytes and their surrounding background.
  • the sequencing images are stored in memory 7848.
  • Protocol modules 7840 and 7842 communicate with the main control module 7824 to control the operation of the sub-systems 7808, 7814, and 7804 when conducting predetermined assay protocols.
  • the protocol modules 7840 and 7842 may include sets of instructions for instructing the sequencing system 7800A to perform specific operations pursuant to predetermined protocols. As shown, the protocol module may be a sequencing-by-synthesis (SBS) module 7840 that is configured to issue various commands for performing sequencing-by-synthesis processes.
  • SBS sequencing-by-synthesis
  • extension of a nucleic acid primer along a nucleic acid template is monitored to determine the sequence of nucleotides in the template.
  • the underlying chemical process can be polymerization (e.g., as catalyzed by a polymerase enzyme) or ligation (e.g., catalyzed by a ligase enzyme).
  • fluorescently labeled nucleotides are added to a primer (thereby extending the primer) in a template dependent fashion such that detection of the order and type of nucleotides added to the primer can be used to determine the sequence of the template.
  • commands can be given to deliver one or more labeled nucleotides, DNA polymerase, etc., into/through a flow cell that houses an array of nucleic acid templates.
  • the nucleic acid templates may be located at corresponding reaction sites. Those reaction sites where primer extension causes a labeled nucleotide to be incorporated can be detected through an imaging event. During an imaging event, the illumination system 7816 may provide an excitation light to the reaction sites.
  • the nucleotides can further include a reversible termination property that terminates further primer extension once a nucleotide has been added to a primer
  • a nucleotide analog having a reversible terminator moiety can be added to a primer such that subsequent extension cannot occur until a deblocking agent is delivered to remove the moiety.
  • a command can be given to deliver a deblocking reagent to the flow cell (before or after detection occurs)
  • One or more commands can be given to effect wash(es) between the various delivery steps.
  • the cycle can then be repeated n times to extend the primer by n nucleotides, thereby detecting a sequence of length n.
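The cycle just described can be summarized as a control loop. The sketch below is illustrative only; the `flow_cell` and `camera` objects and their method names are hypothetical placeholders, not an actual instrument API:

```python
# Hypothetical placeholder objects and method names, not an actual
# instrument API; the loop mirrors the command sequence described above.
def run_sbs_cycles(flow_cell, camera, n):
    """Extend primers by n nucleotides, detecting one base per cycle."""
    for _ in range(n):
        flow_cell.deliver("labeled_nucleotides_and_polymerase")
        flow_cell.wash()                         # wash between deliveries
        camera.image_reaction_sites()            # imaging event under excitation
        flow_cell.deliver("deblocking_reagent")  # remove reversible terminator
        flow_cell.wash()
```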
  • Exemplary sequencing techniques are described, for example, in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; US 7,057,026; WO 91/06678; WO 07/123744; US 7,329,492; US 7,211,414; US 7,315,019; US 7,405,281; and US 2008/0108082, each of which is incorporated herein by reference.
  • In the nucleotide delivery step of an SBS cycle, either a single type of nucleotide can be delivered at a time, or multiple different nucleotide types (e.g., A, C, T and G together) can be delivered.
  • In a nucleotide delivery configuration where only a single type of nucleotide is present at a time, the different nucleotides need not have distinct labels since they can be distinguished based on temporal separation inherent in the individualized delivery. Accordingly, a sequencing method or apparatus can use single color detection. For example, an excitation source need only provide excitation at a single wavelength or in a single range of wavelengths.
  • sites that incorporate different nucleotide types can be distinguished based on different fluorescent labels that are attached to respective nucleotide types in the mixture
  • four different nucleotides can be used, each having one of four different fluorophores.
  • the four different fluorophores can be distinguished using excitation in four different regions of the spectrum
  • four different excitation radiation sources can be used.
  • fewer than four different excitation sources can be used, but optical filtration of the excitation radiation from a single source can be used to produce different ranges of excitation radiation at the flow cell
  • fewer than four different colors can be detected in a mixture having four different nucleotides.
  • pairs of nucleotides can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g., via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
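One illustrative arrangement consistent with the passage above (though not mandated by it) encodes four bases in two detection channels: one base signals in both channels, two bases signal in exactly one channel each, and one base is dark. The channel-to-base assignment in this sketch is arbitrary:

```python
def decode_two_channel(ch1_on, ch2_on):
    """Decode one base from two binary channel detections.

    Illustrative two-channel scheme: one base lights both channels, two
    bases light exactly one channel each, and one base stays dark. The
    channel-to-base assignment here is arbitrary.
    """
    if ch1_on and ch2_on:
        return "A"
    if ch1_on:
        return "C"
    if ch2_on:
        return "T"
    return "G"
```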
  • Exemplary apparatus and methods for distinguishing four different nucleotides using detection of fewer than four colors are described, for example, in US Pat. App. Ser. Nos. 61/538,294 and 61/619,878, which are incorporated herein by reference in their entireties. U.S. Application No. 13/624,200, which was filed on September 21, 2012, is also incorporated by reference in its entirety.
  • the plurality of protocol modules may also include a sample-preparation (or generation) module 7842 that is configured to issue commands to the fluidic control system 7808 and the temperature control system 7804 for amplifying a product within the biosensor 7812.
  • the biosensor 7812 may be engaged to the sequencing system 7800 A
  • the amplification module 7842 may issue instructions to the fluidic control system 7808 to deliver necessary amplification components to reaction chambers within the biosensor 7812.
  • the reaction sites may already contain some components for amplification, such as the template DNA and/or primers.
  • the amplification module 7842 may instruct the temperature control system 7804 to cycle through different temperature stages according to known amplification protocols. In some implementations, the amplification and/or nucleotide incorporation is performed isothermally.
  • the SBS module 7840 may issue commands to perform bridge PCR where clusters of clonal amplicons are formed on localized areas within a channel of a flow cell. After generating the amplicons through bridge PCR, the amplicons may be "linearized" to make single stranded template DNA, or sstDNA, and a sequencing primer may be hybridized to a universal sequence that flanks a region of interest. For example, a reversible terminator-based sequencing by synthesis method can be used as set forth above or as follows.
  • Each base calling or sequencing cycle can extend an sstDNA by a single base which can be accomplished for example by using a modified DNA polymerase and a mixture of four types of nucleotides.
  • the different types of nucleotides can have unique fluorescent labels, and each nucleotide can further have a reversible terminator that allows only a single-base incorporation to occur in each cycle. After a single base is added to the sstDNA, excitation light may be incident upon the reaction sites and fluorescent emissions may be detected. After detection, the fluorescent label and the terminator may be chemically cleaved from the sstDNA. Another similar base calling or sequencing cycle may follow.
  • the SBS module 7840 may instruct the fluidic control system 7808 to direct a flow of reagent and enzyme solutions through the biosensor 7812
  • Exemplary reversible terminator-based SBS methods which can be utilized with the apparatus and methods set forth herein are described in US Patent Application Publication No. 2007/0166705 A1, US Patent Application Publication No.
  • the amplification and SBS modules may operate in a single assay protocol where, for example, template nucleic acid is amplified and subsequently sequenced within the same cartridge.
  • the sequencing system 7800A may also allow the user to reconfigure an assay protocol.
  • the sequencing system 7800A may offer options to the user through the user interface 7818 for modifying the determined protocol. For example, if it is determined that the biosensor 7812 is to be used for amplification, the sequencing system 7800A may request a temperature for the annealing cycle.
  • the sequencing system 7800A may issue warnings to a user if a user has provided user inputs that are generally not acceptable for the selected assay protocol
  • the biosensor 7812 includes millions of sensors (or pixels), each of which generates a plurality of sequences of pixel signals over successive base calling cycles
  • the analysis module 7844 detects the plurality of sequences of pixel signals and attributes them to corresponding sensors (or pixels) in accordance with the row-wise and/or column-wise location of the sensors on an array of sensors.
  • Figure 79 is a simplified block diagram of a system for analysis of sensor data from the sequencing system 7800A, such as base call sensor outputs.
  • the system includes the configurable processor 7846
  • the configurable processor 7846 can execute a base caller (e g., the neural network-based template generator 1512 and/or the neural network-based base caller 1514) in coordination with a runtime program executed by the central processing unit (CPU) 7852 (i.e., a host processor).
  • the sequencing system 7800A comprises the biosensor 7812 and flow cells.
  • the flow cells can comprise one or more tiles in which clusters of genetic material are exposed to a sequence of analyte flows used to cause reactions in the clusters to identify the bases in the genetic material.
  • the sensors sense the reactions for each cycle of the sequence in each tile of the flow cell to provide tile data.
  • Genetic sequencing is a data-intensive operation, which translates base call sensor data into sequences of base calls for each cluster of genetic material sensed during a base call operation.
  • the system in this example includes the CPU 7852, which executes a runtime program to coordinate the base call operations, memory 7848B to store sequences of arrays of tile data, base call reads produced by the base calling operation, and other information used in the base call operations.
  • the system includes memory 7848A to store a configuration file (or files), such as FPGA bit files, and model parameters for the neural networks used to configure and reconfigure the configurable processor 7846, and execute the neural networks.
  • the sequencing system 7800A can include a program for configuring a configurable processor and in some embodiments a reconfigurable processor to execute the neural networks
  • the sequencing system 7800A is coupled by a bus 7902 to the configurable processor 7846.
  • the bus 7902 can be implemented using a high throughput technology, such as in one example bus technology compatible with the PCIe standards (Peripheral Component Interconnect Express) currently maintained and developed by the PCI-SIG (PCI Special Interest Group)
  • a memory 7848A is coupled to the configurable processor 7846 by bus 7906.
  • the memory 7848A can be on-board memory, disposed on a circuit board with the configurable processor 7846.
  • the memory 7848A is used for high speed access by the configurable processor 7846 of working data used in the base call operation.
  • the bus 7906 can also be implemented using a high throughput technology, such as bus technology compatible with the PCIe standards.
  • Configurable processors, including field programmable gate arrays (FPGAs), coarse-grained reconfigurable arrays (CGRAs), and other configurable and reconfigurable devices, can be configured to implement a variety of functions more efficiently or faster than might be achieved using a general purpose processor executing a computer program.
  • Configuration of configurable processors involves compiling a functional description to produce a configuration file, referred to sometimes as a bitstream or bit file, and distributing the configuration file to the configurable elements on the processor.
  • the configuration file defines the logic functions to be executed by the configurable processor, by configuring the circuit to set data flow patterns, use of distributed memory and other on-chip memory resources, lookup table contents, operations of configurable logic blocks and configurable execution units like multiply-and-accumulate units, configurable interconnects and other elements of the configurable array.
  • a configurable processor is reconfigurable if the configuration file may be changed in the field, by changing the loaded configuration file.
  • the configuration file may be stored in volatile SRAM elements, in non-volatile read-write memory elements, and in combinations of the same, distributed among the array of configurable elements on the configurable or reconfigurable processor.
  • a variety of commercially available configurable processors are suitable for use in a base calling operation as described herein. Examples include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX9 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, Xilinx Alveo™ U200, Xilinx Alveo™ U250, Xilinx Alveo™
  • a host CPU can be implemented on the same integrated circuit as the configurable processor.
  • Embodiments described herein implement the neural network-based template generator 1512 and/or the neural network-based base caller 1514 using the configurable processor 7846.
  • the configuration file for the configurable processor 7846 can be implemented by specifying the logic functions to be executed using a high level description language HDL or a register transfer level RTL language specification.
  • the specification can be compiled using the resources designed for the selected configurable processor to generate the configuration file.
  • the same or similar specification can be compiled for the purposes of generating a design for an application-specific integrated circuit which may not be a configurable processor.
  • references to the configurable processor 7846 in all embodiments described herein therefore include a configured processor comprising an application specific ASIC or special purpose integrated circuit or set of integrated circuits, or a system-on-a-chip SOC device, or a graphics processing unit (GPU) processor or a coarse-grained reconfigurable architecture (CGRA) processor, configured to execute a neural network based base call operation as described herein.
  • GPU graphics processing unit
  • CGRA coarse-grained reconfigurable architecture
  • In general, configurable processors and configured processors described herein, as configured to execute runs of a neural network, are referred to herein as neural network processors.
  • the configurable processor 7846 is configured in this example by a configuration file loaded using a program executed by the CPU 7852, or by other sources, which configures the array of configurable elements 7916 (e.g., configuration logic blocks (CLB) such as look up tables (LUTs), flip-flops, compute processing units (PMUs), and compute memory units (CMUs), configurable I/O blocks, programmable interconnects), on the configurable processor to execute the base call function.
  • the configuration includes data flow logic 7908 which is coupled to the buses 7902 and 7906 and executes functions for distributing data and control parameters among the elements used in the base call operation.
  • the configurable processor 7846 is configured with base call execution logic 7908 to execute the neural network-based template generator 1512 and/or the neural network-based base caller 1514.
  • the logic 7908 comprises multi-cycle execution clusters (e.g., 7914) which, in this example, includes execution cluster 1 through execution cluster X.
  • the number of multi-cycle execution clusters can be selected according to a trade-off involving the desired throughput of the operation, and the available resources on the configurable processor 7846.
  • the multi-cycle execution clusters are coupled to the data flow logic 7908 by data flow paths 7910 implemented using configurable interconnect and memory resources on the configurable processor 7846. Also, the multi-cycle execution clusters are coupled to the data flow logic 7908 by control paths 7912 implemented using configurable interconnect and memory resources for example on the configurable processor 7846, which provide control signals indicating available execution clusters, readiness to provide input units for execution of a run of the neural network-based template generator 1512 and/or the neural network-based base caller 1514 to the available execution clusters, readiness to provide trained parameters for the neural network-based template generator 1512 and/or the neural network-based base caller 1514, readiness to provide output patches of base call classification data, and other control data used for execution of the neural network-based template generator 1512 and/or the neural network-based base caller 1514.
  • the configurable processor 7846 is configured to execute runs of the neural network-based template generator 1512 and/or the neural network-based base caller 1514 using trained parameters to produce classification data for the sensing cycles of the base calling operation.
  • a run of the neural network-based template generator 1512 and/or the neural network-based base caller 1514 is executed to produce classification data for a subject sensing cycle of the base calling operation.
  • a run of the neural network-based template generator 1512 and/or the neural network- based base caller 1514 operates on a sequence including a number N of arrays of tile data from respective sensing cycles of N sensing cycles, where the N sensing cycles provide sensor data for different base call operations for one base position per operation in time sequence in the examples described herein.
  • some of the N sensing cycles can be out of sequence if needed according to a particular neural network model being executed.
  • the number N can be any number greater than one.
  • sensing cycles of the N sensing cycles represent a set of sensing cycles for at least one sensing cycle preceding the subject sensing cycle and at least one sensing cycle following the subject cycle in time sequence. Examples are described herein in which the number N is an integer equal to or greater than five.
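A minimal sketch of selecting such a window of N sensing cycles around the subject cycle follows; the centering and end-clamping policy shown here is an assumption, since the actual window layout is model-dependent:

```python
def cycle_window(subject, total_cycles, n=5):
    """Indices of the N sensing cycles used for one subject cycle.

    Centers the window on the subject cycle (e.g., two cycles before
    and two after for N = 5) and clamps at the ends of the run; the
    actual window policy is model-dependent.
    """
    half = n // 2
    start = max(0, min(subject - half, total_cycles - n))
    return list(range(start, start + n))
```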
  • the data flow logic 7908 is configured to move tile data and at least some trained parameters of the model parameters from the memory 7848A to the configurable processor 7846 for runs of the neural network-based template generator 1512 and/or the neural network-based base caller 1514, using input units for a given run including tile data for spatially aligned patches of the N arrays.
  • the input units can be moved by direct memory access operations in one DMA operation, or in smaller units moved during available time slots in coordination with the execution of the neural network deployed.
  • Tile data for a sensing cycle as described herein can comprise an array of sensor data having one or more features.
  • the sensor data can comprise two images which are analyzed to identify one of four bases at a base position in a genetic sequence of DNA, RNA, or other genetic material.
  • the tile data can also include metadata about the images and the sensors.
  • the tile data can comprise information about alignment of the images with the clusters such as distance from center information indicating the distance of each pixel in the array of sensor data from the center of a cluster of genetic material on the tile.
  • tile data can also include data produced during execution of the neural network-based template generator 1512 and/or the neural network-based base caller 1514, referred to as intermediate data, which can be reused rather than recomputed during a run of the neural network-based template generator 1512 and/or the neural network-based base caller 1514.
  • the data flow logic 7908 can write intermediate data to the memory 7848A in place of the sensor data for a given patch of an array of tile data. Embodiments like this are described in more detail below.
  • a system for analysis of base call sensor output, comprising memory (e.g., 7848A) accessible by the runtime program storing tile data including sensor data for a tile from sensing cycles of a base calling operation.
  • the system includes a neural network processor, such as configurable processor 7846 having access to the memory.
  • the neural network processor is configured to execute runs of a neural network using trained parameters to produce classification data for sensing cycles.
  • a run of the neural network is operating on a sequence of N arrays of tile data from respective sensing cycles of N sensing cycles, including a subject cycle, to produce the classification data for the subject cycle.
  • the data flow logic 7908 is provided to move tile data and the trained parameters from the memory to the neural network processor for runs of the neural network using input units including data for spatially aligned patches of the N arrays from respective sensing cycles of N sensing cycles.
  • the neural network processor has access to the memory, and includes a plurality of execution clusters, the execution clusters in the plurality of execution clusters configured to execute a neural network.
  • the data flow logic 7908 has access to the memory and to execution clusters in the plurality of execution clusters, to provide input units of tile data to available execution clusters in the plurality of execution clusters, the input units including a number N of spatially aligned patches of arrays of tile data from respective sensing cycles, including a subject sensing cycle, and to cause the execution clusters to apply the N spatially aligned patches to the neural network to produce output patches of classification data for the spatially aligned patch of the subject sensing cycle, where N is greater than 1.
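Putting the pieces above together, an input unit can be sketched as N spatially aligned patches stacked across cycles. The array shapes, the patch size, and the boundary handling below are illustrative assumptions, not taken from the specification:

```python
import numpy as np

def make_input_unit(tile_data, cycles, row, col, patch=76):
    """Stack spatially aligned patches of tile data across N cycles.

    tile_data: array of shape (num_cycles, H, W, features).
    Returns an array of shape (N, patch, patch, features); the patch
    size is illustrative, and boundary clipping is ignored here.
    """
    patches = [tile_data[c, row:row + patch, col:col + patch, :]
               for c in cycles]
    return np.stack(patches)  # one input unit for an execution cluster
```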
  • Figure 80 is a simplified diagram showing aspects of the base calling operation, including functions of a runtime program executed by a host processor.
  • the outputs of image sensors from a flow cell are provided on lines 8000 to image processing threads 8001, which can perform processes on images such as alignment and arrangement in an array of sensor data for the individual tiles and resampling of images, and can be used by processes which calculate a tile cluster mask for each tile in the flow cell, which identifies pixels in the array of sensor data that correspond to clusters of genetic material on the corresponding tile of the flow cell.
  • the outputs of the image processing threads 8001 are provided on lines 8002 to a dispatch logic 8010 in the CPU which routes the arrays of tile data to a data cache 8004 (e.g., SSD storage) on a high-speed bus 8003, or on high-speed bus 8005 to the neural network processor hardware 8020, such as the configurable processor 7846 of Figure 79, according to the state of the base calling operation.
  • the processed and transformed images can be stored on the data cache 8004 for sensing cycles that were previously used.
  • the hardware 8020 returns classification data output by the neural network to the dispatch logic 8010, which passes the information to the data cache 8004, or on lines 8011 to threads 8002 that perform base call and quality score computations using the classification data, and can arrange the data in standard formats for base call reads.
  • the outputs of the threads 8002 that perform base calling and quality score computations are provided on lines 8012 to threads 8003 that aggregate the base call reads, perform other operations such as data compression, and write the resulting base call outputs to specified destinations
  • the host can include threads (not shown) that perform final processing of the output of the hardware 8020 in support of the neural network.
  • the hardware 8020 can provide outputs of classification data from a final layer of the multi-cycle neural network.
  • the host processor can execute an output activation function, such as a softmax function, over the classification data to configure the data for use by the base call and quality score threads 8002
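A minimal sketch of that host-side post-processing follows: a softmax over the per-base classification scores, then a Phred-scaled quality score derived from the winning probability. The exact quality model used by the base call and quality score threads may differ:

```python
import numpy as np

def call_base(logits):
    """Softmax over per-base scores, then a Phred-scaled quality score.

    logits: length-4 array of classification scores for A, C, G, T.
    """
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    base = "ACGT"[int(probs.argmax())]
    p_err = max(1.0 - float(probs.max()), 1e-6)    # floor avoids log(0)
    quality = int(round(-10.0 * np.log10(p_err)))  # Phred scale
    return base, quality
```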
  • the host processor can execute input operations (not shown), such as batch normalization of the tile data prior to input to the hardware 8020.
  • Figure 81 is a simplified diagram of a configuration of a configurable processor 7846 such as that of Figure 79.
  • the configurable processor 7846 comprises an FPGA with a plurality of high speed PCIe interfaces.
  • the FPGA is configured with a wrapper 8100 which comprises the data flow logic 7908 described with reference to Figure 79.
  • the wrapper 8100 manages the interface and coordination with a runtime program in the CPU across the CPU communication link 8109 and manages communication with the on-board DRAM 8102 (e.g., memory 7848A) via DRAM communication link 8110.
  • DRAM 8102 (e.g., memory 7848A)
  • the data flow logic 7908 in the wrapper 8100 provides patch data retrieved by traversing the arrays of tile data on the on-board DRAM 8102 for the number N cycles to a cluster 8101, and retrieves process data 8115 from the cluster 8101 for delivery back to the on-board DRAM 8102.
  • the wrapper 8100 also manages transfer of data between the on-board DRAM 8102 and host memory, for both the input arrays of tile data, and for the output patches of classification data.
  • the wrapper transfers patch data on line 8113 to the allocated cluster 8101.
  • the wrapper provides trained parameters, such as weights and biases on line 8112 to the cluster 8101 retrieved from the on-board DRAM 8102.
  • the wrapper provides configuration and control data on line 8111 to the cluster 8101 provided from, or generated in response to, the runtime program on the host via the CPU communication link 8109.
  • the cluster can also provide status signals on line 8116 to the wrapper 8100, which are used in cooperation with control signals from the host to manage traversal of the arrays of tile data to provide spatially aligned patch data, and to execute the multi -cycle neural network over the patch data using the resources of the cluster 8101.
  • each cluster can be configured to provide classification data for base calls in a subject sensing cycle using the tile data of multiple sensing cycles described herein
  • model data including kernel data like filter weights and biases can be sent from the host CPU to the configurable processor, so that the model can be updated as a function of cycle number.
  • a base calling operation can comprise, for a representative example, on the order of hundreds of sensing cycles.
  • The base calling operation can include paired-end reads in some embodiments.
  • the model trained parameters may be updated once every 20 cycles (or other number of cycles), or according to update patterns implemented for particular systems and neural network models.
  • the trained parameters can be updated on the transition from the first part to the second part.
  • image data for multiple cycles of sensing data for a tile can be sent from the CPU to the wrapper 8100
  • the wrapper 8100 can optionally do some pre-processing and transformation of the sensing data and write the information to the on-board DRAM 8102.
  • the input tile data for each sensing cycle can include arrays of sensor data including on the order of 4000 x 3000 pixels per sensing cycle per tile or more, with two features representing colors of two images of the tile, and one or two bytes per feature per pixel
  • the array of tile data for each run of the multi-cycle neural network can consume on the order of hundreds of megabytes per tile.
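The figures above can be checked with back-of-the-envelope arithmetic; the values below are illustrative, matching the order-of-magnitude estimates in the preceding bullets:

```python
# Illustrative arithmetic for the figures above.
pixels_per_cycle = 4000 * 3000          # sensor array per tile per cycle
features, bytes_per_feature = 2, 2      # two images, up to two bytes each
per_cycle = pixels_per_cycle * features * bytes_per_feature   # 48 MB
n_cycles = 5                            # one multi-cycle input window
print(per_cycle * n_cycles / 1e6, "MB") # 240.0 MB -> hundreds of MB/tile
```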
  • the tile data also includes an array of DFC data, stored once per tile, or other type of metadata about the sensor data and the tiles.
  • the wrapper allocates a patch to the cluster.
  • the wrapper fetches a next patch of tile data in the traversal of the tile and sends it to the allocated cluster along with appropriate control and configuration information
  • the cluster can be configured with enough memory on the configurable processor to hold a patch of data (including, in some systems, patches from multiple cycles) that is being worked on in place, and a patch of data that is to be worked on when processing of the current patch is finished, using a ping-pong buffer technique or raster scanning technique in various embodiments.
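The buffer discipline mentioned above can be sketched as follows. In hardware, the staging of the next patch and the neural network run over the active patch overlap in time, which this sequential sketch does not show; the names are illustrative:

```python
def process_tile(patches, run_neural_network):
    """Yield output patches, alternating between two patch buffers.

    patches: iterator of input patches; run_neural_network: callable
    executing the multi-cycle neural network over one patch. Names are
    illustrative.
    """
    buffers = [next(patches, None), None]  # two on-chip patch buffers
    active = 0
    while buffers[active] is not None:
        buffers[1 - active] = next(patches, None)  # stage the next patch
        yield run_neural_network(buffers[active])  # work the active one
        active = 1 - active                        # swap buffer roles
```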
  • When an allocated cluster completes its run of the neural network for the current patch and produces an output patch, it will signal the wrapper.
  • the wrapper will read the output patch from the allocated cluster, or alternatively the allocated cluster will push the data out to the wrapper. Then the wrapper will assemble output patches for the processed tile in the DRAM 8102.
  • When the processing of the entire tile has been completed, and the output patches of data have been transferred to the DRAM, the wrapper sends the processed output array for the tile back to the host/CPU in a specified format.
  • the on-board DRAM 8102 is managed by memory management logic in the wrapper 8100
  • the runtime program can control the sequencing operations to complete analysis of all the arrays of tile data for all the cycles in the run in a continuous flow to provide real time analysis.
  • Base calling includes incorporation or attachment of a fluorescently-labeled tag with an analyte.
  • the analyte can be a nucleotide or an oligonucleotide, and the tag can be for a particular nucleotide type (A, C, T, or G).
  • Excitation light is directed toward the analyte having the tag, and the tag emits a detectable fluorescent signal or intensity emission. The intensity emission is indicative of photons emitted by the excited tag that is chemically attached to the analyte.
  • the technology disclosed uses neural networks to improve the quality and quantity of nucleic acid sequence information that can be obtained from a nucleic acid sample such as a nucleic acid template or its complement, for instance, a DNA or RNA polynucleotide or other nucleic acid sample. Accordingly, certain implementations of the technology disclosed provide higher throughput polynucleotide sequencing, for instance, higher rates of collection of DNA or RNA sequence data, greater efficiency in sequence data collection, and/or lower costs of obtaining such sequence data, relative to previously available methodologies
  • the technology disclosed uses neural networks to identify the center of a solid-phase nucleic acid cluster and to analyze optical signals that are generated during sequencing of such clusters, to discriminate unambiguously between adjacent, abutting or overlapping clusters in order to assign a sequencing signal to a single, discrete source cluster.
  • These and related implementations thus permit retrieval of meaningful information, such as sequence data, from regions of high-density cluster arrays where useful information could not previously be obtained from such regions due to confounding effects of overlapping or very closely spaced adjacent clusters, including the effects of overlapping signals (e.g., as used in nucleic acid sequencing) emanating therefrom.
  • composition that comprises a solid support having immobilized thereto one or a plurality of nucleic acid clusters as provided herein.
  • Each cluster comprises a plurality of immobilized nucleic acids of the same sequence and has an identifiable center having a detectable center label as provided herein, by which the identifiable center is distinguishable from immobilized nucleic acids in a surrounding region in the cluster.
  • Also described herein are methods for making and using such clusters that have identifiable centers.
  • the present invention contemplates methods that relate to high-throughput nucleic acid analysis such as nucleic acid sequence determination (e.g.,“sequencing”).
  • high-throughput nucleic acid analyses include without limitation de novo sequencing, re-sequencing, whole genome sequencing, gene expression analysis, gene expression monitoring, epigenetic analysis, genome methylation analysis, allele specific primer extension (APSE), genetic diversity profiling, whole genome polymorphism discovery and analysis, single nucleotide polymorphism analysis, hybridization based sequence determination methods, and the like.
  • APSE allele specific primer extension
  • Although the implementations of the present invention are described in relation to nucleic acid sequencing, they are applicable in any field where image data acquired at different time points, spatial locations or other temporal or physical perspectives is analyzed.
  • the methods and systems described herein are useful in the fields of molecular and cell biology where image data from microarrays, biological specimens, cells, organisms and the like is acquired at different time points or perspectives and analyzed. Images can be obtained using any number of techniques known in the art including, but not limited to, fluorescence microscopy, light microscopy, confocal microscopy, optical imaging, magnetic resonance imaging, tomography scanning or the like.
  • the methods and systems described herein can be applied where image data obtained by surveillance, aerial or satellite imaging technologies and the like is acquired at different time points or perspectives and analyzed
  • the methods and systems are particularly useful for analyzing images obtained for a field of view in which the analytes being viewed remain in the same locations relative to each other in the field of view
  • the analytes may however have characteristics that differ in separate images, for example, the analytes may appear different in separate images of the field of view.
  • the analytes may appear different with regard to the color of a given analyte detected in different images, a change in the intensity of signal detected for a given analyte in different images, or even the appearance of a signal for a given analyte in one image and disappearance of the signal for the analyte in another image.
  • Examples described herein may be used in various biological or chemical processes and systems for academic or commercial analysis. More specifically, examples described herein may be used in various processes and systems where it is desired to detect an event, property, quality, or characteristic that is indicative of a designated reaction
  • examples described herein include light detection devices, biosensors, and their components, as well as bioassay systems that operate with biosensors.
  • the devices, biosensors and systems may include a flow cell and one or more light sensors that are coupled together (removably or fixedly) in a substantially unitary structure.
  • the devices, biosensors and bioassay systems may be configured to perform a plurality of designated reactions that may be detected individually or collectively.
  • the devices, biosensors and bioassay systems may be configured to perform numerous cycles in which the plurality of designated reactions occurs in parallel.
  • the devices, biosensors and bioassay systems may be used to sequence a dense array of DNA features through iterative cycles of enzymatic manipulation and light or image detection/acquisition.
  • the devices, biosensors and bioassay systems may include one or more microfluidic channels that deliver reagents or other reaction components in a reaction solution to a reaction site of the devices, biosensors and bioassay systems.
  • the reaction solution may be substantially acidic, such as comprising a pH of less than or equal to about 5, or less than or equal to about 4, or less than or equal to about 3.
  • the reaction solution may be substantially alkaline/basic, such as comprising a pH of greater than or equal to about 8, or greater than or equal to about 9, or greater than or equal to about 10.
  • the term "acidity" and grammatical variants thereof refer to a pH value of less than about 7, and the terms "basicity," "alkalinity" and grammatical variants thereof refer to a pH value of greater than about 7.
  • reaction sites are provided or spaced apart in a predetermined manner, such as in a uniform or repeating pattern. In some other examples, the reaction sites are randomly distributed.
  • Each of the reaction sites may be associated with one or more light guides and one or more light sensors that detect light from the associated reaction site. In some examples, the reaction sites are located in reaction recesses or chambers, which may at least partially compartmentalize the designated reactions therein.
  • a "designated reaction" includes a change in at least one of a chemical, electrical, physical, or optical property (or quality) of a chemical or biological substance of interest, such as an analyte-of-interest.
  • a designated reaction is a positive binding event, such as incorporation of a fluorescently labeled biomolecule with an analyte-of-interest, for example.
  • a designated reaction may be a chemical transformation, chemical change, or chemical interaction.
  • a designated reaction may also be a change in electrical properties
  • a designated reaction includes the incorporation of a fluorescently-labeled molecule with an analyte.
  • the analyte may be an oligonucleotide and the fluorescently-labeled molecule may be a nucleotide
  • a designated reaction may be detected when an excitation light is directed toward the oligonucleotide having the labeled nucleotide, and the fluorophore emits a detectable fluorescent signal.
  • the detected fluorescence is a result of chemiluminescence or bioluminescence.
  • a designated reaction may also increase fluorescence (or Förster) resonance energy transfer (FRET), for example, by bringing a donor fluorophore in proximity to an acceptor fluorophore, decrease FRET by separating donor and acceptor fluorophores, increase fluorescence by separating a quencher from a fluorophore, or decrease fluorescence by co-locating a quencher and fluorophore.
  • FRET fluorescence resonance energy transfer
  • a "reaction solution," "reaction component" or "reactant" includes any substance that may be used to obtain at least one designated reaction
  • potential reaction components include reagents, enzymes, samples, other biomolecules, and buffer solutions, for example.
  • the reaction components may be delivered to a reaction site in a solution and/or immobilized at a reaction site.
  • the reaction components may interact directly or indirectly with another substance, such as an analyte-of-interest immobilized at a reaction site
  • the reaction solution may be substantially acidic (i.e., include a relatively high acidity) (e.g., comprising a pH of less than or equal to about 5, a pH less than or equal to about 4, or a pH less than or equal to about 3) or substantially alkaline/basic (i.e., include a relatively high alkalinity/basicity) (e.g., comprising a pH of greater than or equal to about 8, a pH of greater than or equal to about 9, or a pH of greater than or equal to about 10).
  • reaction site is a localized region where at least one designated reaction may occur.
  • a reaction site may include support surfaces of a reaction structure or substrate where a substance may be immobilized thereon.
  • a reaction site may include a surface of a reaction structure (which may be positioned in a channel of a flow cell) that has a reaction component thereon, such as a colony of nucleic acids thereon.
  • the nucleic acids in the colony have the same sequence, being for example, clonal copies of a single stranded or double stranded template.
  • a reaction site may contain only a single nucleic acid molecule, for example, in a single stranded or double stranded form.
  • a plurality of reaction sites may be randomly distributed along the reaction structure or arranged in a predetermined manner (e.g., side-by-side in a matrix, such as in microarrays).
  • a reaction site can also include a reaction chamber or recess that at least partially defines a spatial region or volume configured to compartmentalize the designated reaction.
  • the term“reaction chamber” or“reaction recess” includes a defined spatial region of the support structure (which is often in fluid communication with a flow channel).
  • a reaction recess may be at least partially separated from the surrounding environment or other spatial regions. For example, a plurality of reaction recesses may be separated from each other by shared walls, such as a detection surface.
  • reaction recesses may be nanowells comprising an indent, pit, well, groove, cavity or depression defined by interior surfaces of a detection surface and have an opening or aperture (i.e., be open-sided) so that the nanowells can be in fluid communication with a flow channel.
  • the reaction recesses of the reaction structure are sized and shaped relative to solids (including semi-solids) so that the solids may be inserted, fully or partially, therein.
  • the reaction recesses may be sized and shaped to accommodate a capture bead.
  • the capture bead may have clonally amplified DNA or other substances thereon.
  • the reaction recesses may be sized and shaped to receive an approximate number of beads or solid substrates.
  • the reaction recesses may be filled with a porous gel or substance that is configured to control diffusion or filter fluids or solutions that may flow into the reaction recesses.
  • light sensors are associated with corresponding reaction sites.
  • a light sensor that is associated with a reaction site is configured to detect light emissions from the associated reaction site via at least one light guide when a designated reaction has occurred at the associated reaction site.
  • light emissions from an associated reaction site may be detected by a plurality of light sensors (e.g., several pixels of a light detection or camera device) or by a single light sensor (e.g., a single pixel).
  • the light sensor, the reaction site, and other features of the biosensor may be configured so that at least some of the light is directly detected by the light sensor without being reflected.
  • a “biological or chemical substance” includes biomolecules, samples-of-interest, analytes-of-interest, and other chemical compound(s).
  • a biological or chemical substance may be used to detect, identify, or analyze other chemical compound(s), or function as intermediaries to study or analyze other chemical compound(s).
  • the biological or chemical substances include a biomolecule.
  • a “biomolecule” includes at least one of a biopolymer, nucleoside, nucleic acid, polynucleotide, oligonucleotide, protein, enzyme, polypeptide, antibody, antigen, ligand, receptor, polysaccharide, carbohydrate, polyphosphate, cell, tissue, organism, or fragment thereof or any other biologically active chemical compound(s) such as analogs or mimetics of the aforementioned species.
  • a biological or chemical substance or a biomolecule includes an enzyme or reagent used in a coupled reaction to detect the product of another reaction, such as an enzyme or reagent used to detect pyrophosphate in a pyrosequencing reaction.
  • Enzymes and reagents useful for pyrophosphate detection are described, for example, in U.S. Patent Publication No. 2005/0244870 A1, which is incorporated by reference in its entirety.
  • Biomolecules, samples, and biological or chemical substances may be naturally occurring or synthetic and may be suspended in a solution or mixture within a reaction recess or region. Biomolecules, samples, and biological or chemical substances may also be bound to a solid phase or gel material. Biomolecules, samples, and biological or chemical substances may also include a pharmaceutical composition. In some cases, biomolecules, samples, and biological or chemical substances of interest may be referred to as targets, probes, or analytes.
  • a “biosensor” includes a device that includes a reaction structure with a plurality of reaction sites that is configured to detect designated reactions that occur at or proximate to the reaction sites.
  • a biosensor may include a solid-state light detection or“imaging” device (e.g., CCD or CMOS light detection device) and, optionally, a flow cell mounted thereto.
  • the flow cell may include at least one flow channel that is in fluid communication with the reaction sites.
  • the biosensor is configured to fluidically and electrically couple to a bioassay system.
  • the bioassay system may deliver a reaction solution to the reaction sites according to a predetermined protocol (e.g., sequencing-by-synthesis) and perform a plurality of imaging events.
  • the bioassay system may direct reaction solutions to flow along the reaction sites.
  • At least one of the reaction solutions may include four types of nucleotides having the same or different fluorescent labels.
  • the nucleotides may bind to the reaction sites, such as to corresponding oligonucleotides at the reaction sites.
  • the bioassay system may then illuminate the reaction sites using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes (LEDs)).
  • the excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths.
  • the fluorescent labels excited by the incident excitation light may provide emission signals (e.g., light of a wavelength or wavelengths that differ from the excitation light and, potentially, each other) that may be detected by the light sensors.
  • the term “immobilized,” when used with respect to a biomolecule or biological or chemical substance, includes substantially attaching the biomolecule or biological or chemical substance at a molecular level to a surface, such as to a detection surface of a light detection device or reaction structure.
  • a biomolecule or biological or chemical substance may be immobilized to a surface of the reaction structure using adsorption techniques including non-covalent interactions (e.g., electrostatic forces, van der Waals forces, and dehydration of hydrophobic interfaces) and covalent binding techniques where functional groups or linkers facilitate attaching the biomolecules to the surface.
  • Immobilizing biomolecules or biological or chemical substances to the surface may be based upon the properties of the surface, the liquid medium carrying the biomolecule or biological or chemical substance, and the properties of the biomolecules or biological or chemical substances themselves.
  • the surface may be functionalized (e.g., chemically or physically modified) to facilitate immobilizing the biomolecules (or biological or chemical substances) to the surface.
  • nucleic acids can be immobilized to the reaction structure, such as to surfaces of reaction recesses thereof.
  • the devices, biosensors, bioassay systems and methods described herein may include the use of natural nucleotides and also enzymes that are configured to interact with the natural nucleotides.
  • Natural nucleotides include, for example, ribonucleotides or deoxyribonucleotides. Natural nucleotides can be in the mono-, di-, or tri-phosphate form and can have a base selected from adenine (A), thymine (T), uracil (U), guanine (G), or cytosine (C).
  • a biomolecule or biological or chemical substance may be immobilized at a reaction site in a reaction recess of a reaction structure.
  • a biomolecule or biological substance may be physically held or immobilized within the reaction recesses through an interference fit, adhesion, covalent bond, or entrapment.
  • items or solids that may be disposed within the reaction recesses include polymer beads, pellets, agarose gel, powders, quantum dots, or other solids that may be compressed and/or held within the reaction chamber.
  • the reaction recesses may be coated or filled with a hydrogel layer capable of covalently binding DNA oligonucleotides.
  • a nucleic acid superstructure, such as a DNA ball, can be disposed in or at a reaction recess, for example, by attachment to an interior surface of the reaction recess or by residence in a liquid within the reaction recess.
  • a DNA ball or other nucleic acid superstructure can be pre-formed and then disposed in or at a reaction recess.
  • a DNA ball can be synthesized in situ at a reaction recess.
  • a substance that is immobilized in a reaction recess can be in a solid, liquid, or gaseous state.
  • analyte is intended to mean a point or area in a pattern that can be distinguished from other points or areas according to relative location.
  • An individual analyte can include one or more molecules of a particular type.
  • an analyte can include a single target nucleic acid molecule having a particular sequence or an analyte can include several nucleic acid molecules having the same sequence (and/or complementary sequence thereof).
  • Different molecules that are at different analytes of a pattern can be differentiated from each other according to the locations of the analytes in the pattern.
  • Example analytes include without limitation, wells in a substrate, beads (or other particles) in or on a substrate, projections from a substrate, ridges on a substrate, pads of gel material on a substrate, or channels in a substrate.
  • any of a variety of target analytes that are to be detected, characterized, or identified can be used in an apparatus, system or method set forth herein.
  • exemplary analytes include, but are not limited to, nucleic acids (e.g., DNA, RNA or analogs thereof), proteins, polysaccharides, cells, antibodies, epitopes, receptors, ligands, enzymes (e.g. kinases, phosphatases or polymerases), small molecule drug candidates, cells, viruses, organisms, or the like.
  • nucleic acids may be used as templates as provided herein (e.g., a nucleic acid template, or a nucleic acid complement that is complementary to a nucleic acid template) for particular types of nucleic acid analysis, including but not limited to nucleic acid amplification, nucleic acid expression analysis, and/or nucleic acid sequence determination or suitable combinations thereof.
  • Nucleic acids in certain implementations include, for instance, linear polymers of deoxyribonucleotides in 3'-5' phosphodiester or other linkages, such as deoxyribonucleic acids (DNA), for example, single- and double-stranded DNA, genomic DNA, copy DNA or complementary DNA (cDNA), recombinant DNA, or any form of synthetic or modified DNA.
  • nucleic acids include, for instance, linear polymers of ribonucleotides in 3'-5' phosphodiester or other linkages such as ribonucleic acids (RNA), for example, single- and double-stranded RNA, messenger RNA (mRNA), copy RNA or complementary RNA (cRNA), alternatively spliced mRNA, ribosomal RNA, small nucleolar RNA (snoRNA), microRNAs (miRNA), small interfering RNAs (siRNA), piwi RNAs (piRNA), or any form of synthetic or modified RNA.
  • Nucleic acids used in the compositions and methods of the present invention may vary in length and may be intact or full-length molecules or fragments or smaller parts of larger nucleic acid molecules.
  • a nucleic acid may have one or more detectable labels, as described elsewhere herein.
  • nucleic acid cluster comprises a plurality of copies of template nucleic acid and/or complements thereof, attached via their 5' termini to the solid support.
  • the copies of nucleic acid strands making up the nucleic acid clusters may be in a single or double stranded form
  • Copies of a nucleic acid template that are present in a cluster can have nucleotides at corresponding positions that differ from each other, for example, due to presence of a label moiety.
  • the corresponding positions can also contain analog structures having different chemical structure but similar Watson-Crick base-pairing properties, such as is the case for uracil and thymine.
  • Colonies of nucleic acids can also be referred to as “nucleic acid clusters”. Nucleic acid colonies can optionally be created by cluster amplification or bridge amplification techniques as set forth in further detail elsewhere herein. Multiple repeats of a target sequence can be present in a single nucleic acid molecule, such as a concatemer created using a rolling circle amplification procedure.
  • the nucleic acid clusters of the invention can have different shapes, sizes and densities depending on the conditions used.
  • clusters can have a shape that is substantially round, multi-sided, donut-shaped or ring-shaped.
  • the diameter of a nucleic acid cluster can be designed to be from about 0.2 μm to about 6 μm, about 0.3 μm to about 4 μm, about 0.4 μm to about 3 μm, about 0.5 μm to about 2 μm, about 0.75 μm to about 1.5 μm, or any intervening diameter.
  • the diameter of a nucleic acid cluster is about 0.5 μm, about 1 μm, about 1.5 μm, about 2 μm, about 2.5 μm, about 3 μm, about 4 μm, about 5 μm, or about 6 μm.
  • the diameter of a nucleic acid cluster may be influenced by a number of parameters, including, but not limited to, the number of amplification cycles performed in producing the cluster, the length of the nucleic acid template or the density of primers attached to the surface upon which clusters are formed.
  • the density of nucleic acid clusters can be designed to typically be in the range of 0.1/mm², 1/mm², 10/mm², 100/mm², 1,000/mm², 10,000/mm² to 100,000/mm².
  • the present invention further contemplates, in part, higher density nucleic acid clusters, for example, 100,000/mm² to 1,000,000/mm² and 1,000,000/mm² to 10,000,000/mm².
  • an “analyte” is an area of interest within a specimen or field of view.
  • an analyte refers to the area occupied by similar or identical molecules.
  • an analyte can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence.
  • an analyte can be any element or group of elements that occupy a physical area on a specimen.
  • an analyte could be a parcel of land, a body of water or the like. When an analyte is imaged, each analyte will have some area. Thus, in many implementations, an analyte is not merely one pixel.
  • the distances between analytes can be described in any number of ways. In some implementations, the distances between analytes can be described from the center of one analyte to the center of another analyte. In other implementations, the distances can be described from the edge of one analyte to the edge of another analyte, or between the outer-most identifiable points of each analyte. The edge of an analyte can be described as the theoretical or actual physical boundary on a chip, or some point inside the boundary of the analyte. In other implementations, the distances can be described in relation to a fixed point on the specimen or in the image of the specimen.
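To make the distance conventions above concrete, the following Python sketch computes a center-to-center and an edge-to-edge distance for two analytes approximated as circles. This is an illustration only, not part of the disclosed method; the function names and the circular approximation are assumptions.

    import math

    def center_to_center(c1, c2):
        # Euclidean distance between two analyte centers (x, y).
        return math.hypot(c2[0] - c1[0], c2[1] - c1[1])

    def edge_to_edge(c1, r1, c2, r2):
        # Distance between analyte boundaries; zero if they abut or overlap.
        return max(0.0, center_to_center(c1, c2) - r1 - r2)

    # Two analytes of 0.5 μm radius whose centers are 2 μm apart:
    print(center_to_center((0.0, 0.0), (2.0, 0.0)))        # 2.0 (center-to-center)
    print(edge_to_edge((0.0, 0.0), 0.5, (2.0, 0.0), 0.5))  # 1.0 (edge-to-edge)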
  • this disclosure provides neural network-based template generation and base calling systems, wherein the systems can include a processor; a storage device; and a program for image analysis, the program including instructions for carrying out one or more of the methods set forth herein. Accordingly, the methods set forth herein can be carried out on a computer, for example, having components set forth herein or otherwise known in the art.
  • the methods and systems set forth herein are useful for analyzing any of a variety of objects.
  • Particularly useful objects are solid supports or solid-phase surfaces with attached analytes.
  • the methods and systems set forth herein provide advantages when used with objects having a repeating pattern of analytes in an xy plane.
  • An example is a microarray having an attached collection of cells, viruses, nucleic acids, proteins, antibodies, carbohydrates, small molecules (such as drug candidates), biologically active molecules or other analytes of interest.
  • microarrays typically include deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) probes.
  • individual DNA or RNA probes can be attached at individual analytes of an array.
  • a test sample, such as from a known person or organism, can be exposed to the array, such that target nucleic acids (e.g., gene fragments, mRNA, or amplicons thereof) hybridize to complementary probes at respective analytes in the array.
  • the probes can be labeled in a target specific process (e.g., due to labels present on the target nucleic acids or due to enzymatic labeling of the probes or targets that are present in hybridized form at the analytes).
  • the array can then be examined by scanning specific frequencies of light over the analytes to identify which target nucleic acids are present in the sample.
  • Biological microarrays may be used for genetic sequencing and similar applications.
  • genetic sequencing comprises determining the order of nucleotides in a length of target nucleic acid, such as a fragment of DNA or RNA. Relatively short sequences are typically sequenced at each analyte, and the resulting sequence information may be used in various bioinformatics methods to logically fit the sequence fragments together so as to reliably determine the sequence of much more extensive lengths of genetic material from which the fragments were derived. Automated, computer-based algorithms for characterizing fragments have been developed, and have been used more recently in genome mapping, identification of genes and their function, and so forth. Microarrays are particularly useful for characterizing genomic content because a large number of variants are present and this supplants the alternative of performing many experiments on individual probes and targets. The microarray is an ideal format for performing such investigations in a practical manner.
  • analyte arrays are also referred to as “microarrays”.
  • a typical array contains analytes, each having an individual probe or a population of probes.
  • the population of probes at each analyte is typically homogeneous, having a single species of probe.
  • each analyte can have multiple nucleic acid molecules each having a common sequence.
  • the populations at each analyte of an array can be heterogeneous.
  • protein arrays can have analytes with a single protein or a population of proteins typically, but not always, having the same amino acid sequence.
  • the probes can be attached to the surface of an array, for example, via covalent linkage of the probes to the surface or via non-covalent interaction(s) of the probes with the surface.
  • probes, such as nucleic acid molecules, can be attached to a surface via a gel layer as described, for example, in U.S. patent application Ser. No. 13/784,368 and U.S. Pat. App. Pub. No. 2011/0059865 A1, each of which is incorporated herein by reference.
  • Example arrays include, without limitation, a BeadChip Array available from Illumina, Inc. (San Diego, Calif.) or others such as those where probes are attached to beads that are present on a surface (e.g. beads in wells on a surface) such as those described in U.S. Pat. Nos. 6,266,459; 6,355,431; 6,770,441; 6,859,570; or 7,622,294; or PCT Publication No.
  • microarrays that can be used include, for example, an Affymetrix® GeneChip® microarray or other microarray synthesized in accordance with techniques sometimes referred to as VLSIPSTM (Very Large Scale Immobilized Polymer Synthesis) technologies.
  • a spotted microarray can also be used in a method or system according to some implementations of the present disclosure.
  • An example spotted microarray is a CodeLinkTM Array available from Amersham Biosciences.
  • Another microarray that is useful is one that is manufactured using inkjet printing methods such as SurePrintTM Technology available from Agilent Technologies.
  • arrays having amplicons of genomic fragments are particularly useful, such as those described in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; WO 91/06678; WO 07/123744; U.S. Pat. Nos. 7,329,492; 7,211,414; 7,315,019; 7,405,281, or 7,057,026; or U.S. Pat. App. Pub. No. 2008/0108082 A1, each of which is incorporated herein by reference.
  • Another type of array that is useful for nucleic acid sequencing is an array of particles produced from an emulsion PCR technique.
  • Arrays used for nucleic acid sequencing often have random spatial patterns of nucleic acid analytes.
  • HiSeq or MiSeq sequencing platforms available from Illumina Inc. utilize flow cells upon which nucleic acid arrays are formed by random seeding followed by bridge amplification.
  • patterned arrays can also be used for nucleic acid sequencing or other analytical applications.
  • Example patterned arrays, methods for their manufacture and methods for their use are set forth in U.S. Ser. No. 13/787,396; U.S. Ser. No. 13/783,043; U.S. Ser. No. 13/784,368; U.S. Pat. App. Pub. No. 2013/0116153 A1; and U.S. Pat. App. Pub. No.
  • patterned arrays can be used to capture a single nucleic acid template molecule to seed subsequent formation of a homogenous colony, for example, via bridge amplification.
  • Such patterned arrays are particularly useful for nucleic acid sequencing applications.
  • an analyte on an array can be selected to suit a particular application.
  • an analyte of an array can have a size that accommodates only a single nucleic acid molecule.
  • a surface having a plurality of analytes in this size range is useful for constructing an array of molecules for detection at single molecule resolution.
  • Analytes in this size range are also useful for use in arrays having analytes that each contain a colony of nucleic acid molecules.
  • the analytes of an array can each have an area that is no larger than about 1 mm², no larger than about 500 μm², no larger than about 100 μm², no larger than about 10 μm², no larger than about 1 μm², no larger than about 500 nm², or no larger than about 100 nm², no larger than about 10 nm², no larger than about 5 nm², or no larger than about 1 nm².
  • the analytes of an array will be no smaller than about 1 mm², no smaller than about 500 μm², no smaller than about 100 μm², no smaller than about 10 μm², no smaller than about 1 μm², no smaller than about 500 nm², no smaller than about 100 nm², no smaller than about 10 nm², no smaller than about 5 nm², or no smaller than about 1 nm².
  • an analyte can have a size that is in a range between an upper and lower limit selected from those exemplified above.
  • analytes in these size ranges can be used for applications that do not include nucleic acids. It will be further understood that the size of the analytes need not necessarily be confined to a scale used for nucleic acid applications.
  • analytes can be discrete, being separated with spaces between each other.
  • An array useful in the invention can have analytes that are separated by an edge to edge distance of at most 100 μm, 50 μm, 10 μm, 5 μm, 1 μm, 0.5 μm, or less.
  • an array can have analytes that are separated by an edge to edge distance of at least 0.5 μm, 1 μm, 5 μm, 10 μm, 50 μm, 100 μm, or more. These ranges can apply to the average edge to edge spacing for analytes as well as to the minimum or maximum spacing.
  • the analytes of an array need not be discrete and instead neighboring analytes can abut each other. Whether or not the analytes are discrete, the size of the analytes and/or pitch of the analytes can vary such that arrays can have a desired density.
  • the average analyte pitch in a regular pattern can be at most 100 μm, 50 μm, 10 μm, 5 μm, 1 μm, 0.5 μm, or less.
  • the average analyte pitch in a regular pattern can be at least 0.5 μm, 1 μm, 5 μm, 10 μm, 50 μm, 100 μm, or more. These ranges can apply to the maximum or minimum pitch for a regular pattern as well.
  • the maximum analyte pitch for a regular pattern can be at most 100 μm, 50 μm, 10 μm, 5 μm, 1 μm, 0.5 μm, or less; and/or the minimum analyte pitch in a regular pattern can be at least 0.5 μm, 1 μm, 5 μm, 10 μm, 50 μm, 100 μm, or more.
  • the density of analytes in an array can also be understood in terms of the number of analytes present per unit area.
  • the average density of analytes for an array can be at least about 1×10³ analytes/mm², 1×10⁴ analytes/mm², 1×10⁵ analytes/mm², 1×10⁶ analytes/mm², 1×10⁷ analytes/mm², 1×10⁸ analytes/mm², or 1×10⁹ analytes/mm², or higher.
  • the average density of analytes for an array can be at most about 1×10⁹ analytes/mm², 1×10⁸ analytes/mm², 1×10⁷ analytes/mm², 1×10⁶ analytes/mm², 1×10⁵ analytes/mm²,
  • the analytes in a pattern can have any of a variety of shapes. For example, when observed in a two dimensional plane, such as on the surface of an array, the analytes can appear rounded, circular, oval, rectangular, square, symmetric, asymmetric, triangular, polygonal, or the like.
  • the analytes can be arranged in a regular repeating pattern including, for example, a hexagonal or rectilinear pattern
  • a pattern can be selected to achieve a desired level of packing. For example, round analytes are optimally packed in a hexagonal arrangement. Of course, other packing arrangements can also be used for round analytes and vice versa.
  • a pattern can be characterized in terms of the number of analytes that are present in a subset that forms the smallest geometric unit of the pattern.
  • the subset can include, for example, at least about 2, 3, 4, 5, 6, 10 or more analytes.
  • the geometric unit can occupy an area of less than 1 mm², 500 μm², 100 μm², 50 μm², 10 μm², 1 μm², 500 nm², 100 nm², 50 nm², 10 nm², or less.
  • the geometric unit can occupy an area of greater than 10 nm², 50 nm², 100 nm², 500 nm², 1 μm², 10 μm², 50 μm², 100 μm², 500 μm², 1 mm², or more. Characteristics of the analytes in a geometric unit, such as shape, size, pitch and the like, can be selected from those set forth herein more generally with regard to analytes in an array or pattern.
  • An array having a regular pattern of analytes can be ordered with respect to the relative locations of the analytes but random with respect to one or more other characteristics of each analyte.
  • the nucleic acid analytes can be ordered with respect to their relative locations but random with respect to one’s knowledge of the sequence for the nucleic acid species present at any particular analyte.
  • nucleic acid arrays formed by seeding a repeating pattern of analytes with template nucleic acids and amplifying the template at each analyte to form copies of the template at the analyte will have a regular pattern of nucleic acid analytes but will be random with regard to the distribution of sequences of the nucleic acids across the array.
  • detection of the presence of nucleic acid material generally on the array can yield a repeating pattern of analytes, whereas sequence specific detection can yield a non-repeating distribution of signals across the array.
  • patterns, order, randomness and the like pertain not only to analytes on objects, such as analytes on arrays, but also to analytes in images.
  • patterns, order, randomness and the like can be present in any of a variety of formats that are used to store, manipulate or communicate image data including, but not limited to, a computer readable medium or computer component such as a graphical user interface or other output device.
  • the term “image” is intended to mean a representation of all or part of an object.
  • the representation can be an optically detected reproduction.
  • an image can be obtained from fluorescent, luminescent, scatter, or absorption signals.
  • the part of the object that is present in an image can be the surface or other xy plane of the object.
  • an image is a 2-dimensional representation, but in some cases information in the image can be derived from 3 or more dimensions.
  • An image need not include optically detected signals.
  • Non-optical signals can be present instead.
  • An image can be provided in a computer readable format or medium such as one or more of those set forth elsewhere herein.
  • image refers to a reproduction or representation of at least a portion of a specimen or other object.
  • the reproduction is an optical reproduction, for example, produced by a camera or other optical detector.
  • the reproduction can be a non-optical reproduction, for example, a representation of electrical signals obtained from an array of nanopore analytes or a representation of electrical signals obtained from an ion-sensitive CMOS detector.
  • non-optical reproductions can be excluded from a method or apparatus set forth herein
  • An image can have a resolution capable of distinguishing analytes of a specimen that are present at any of a variety of spacings including, for example, those that are separated by less than 100 μm, 50 μm, 10 μm, 5 μm, 1 μm or 0.5 μm.
  • data acquisition can include generating an image of a specimen, looking for a signal in a specimen, instructing a detection device to look for or generate an image of a signal, giving instructions for further analysis or transformation of an image file, and any number of transformations or manipulations of an image file.
  • a template refers to a representation of the location or relation between signals or analytes.
  • a template is a physical grid with a representation of signals corresponding to analytes in a specimen.
  • a template can be a chart, table, text file or other computer file indicative of locations corresponding to analytes.
  • a template is generated in order to track the location of analytes of a specimen across a set of images of the specimen captured at different reference points.
  • a template could be a set of x,y coordinates or a set of values that describe the direction and/or distance of one analyte with respect to another analyte.
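A template of the kind just described, expressed as a set of x,y coordinates from which offsets between analytes can be derived, can be sketched in Python as follows. The identifiers, coordinates, and function name are hypothetical and for illustration only.

    # Template: analyte identifier -> (x, y) location in template coordinates.
    template = {
        "analyte_001": (103.4, 57.2),
        "analyte_002": (118.9, 57.5),
    }

    def offset(template, a, b):
        # Direction and distance of analyte b relative to analyte a.
        (x1, y1), (x2, y2) = template[a], template[b]
        return (x2 - x1, y2 - y1)

    print(offset(template, "analyte_001", "analyte_002"))  # approx. (15.5, 0.3)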
  • the term “specimen” can refer to an object or area of an object of which an image is captured.
  • a parcel of land can be a specimen.
  • the flow cell may be divided into any number of subdivisions, each of which may be a specimen.
  • a flow cell may be divided into various flow channels or lanes, and each lane can be further divided into 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200, 400, 600, 800, 1000 or more separate regions that are imaged.
  • a flow cell has 8 lanes, with each lane divided into 120 specimens or tiles.
  • a specimen may be made up of a plurality of tiles or even an entire flow cell.
  • the image of each specimen can represent a region of a larger surface that is imaged.
  • references to ranges and sequential number lists described herein include not only the enumerated number but all real numbers between the enumerated numbers.
  • a “reference point” refers to any temporal or physical distinction between images.
  • a reference point is a time point.
  • a reference point is a time point or cycle during a sequencing reaction.
  • the term “reference point” can include other aspects that distinguish or separate images, such as angle, rotational, temporal, or other aspects that can distinguish or separate images.
  • a “subset of images” refers to a group of images within a set.
  • a subset may contain 1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20, 30, 40, 50, 60 or any number of images selected from a set of images.
  • a subset may contain no more than 1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20, 30, 40, 50, 60 or any number of images selected from a set of images.
  • images are obtained from one or more sequencing cycles with four images correlated to each cycle.
  • a subset could be a group of 16 images obtained through four cycles.
  • a base refers to a nucleotide base or nucleotide, A (adenine), C (cytosine), T (thymine), or G (guanine). This application uses “base(s)” and “nucleotide(s)” interchangeably.
  • chromosome refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones).
  • the conventional internationally recognized individual human genome chromosome numbering system is employed herein.
  • the term “site” refers to a unique position (e.g., chromosome ID, chromosome position and orientation) on a reference genome.
  • a site may be a residue, a sequence tag, or a segment’s position on a sequence.
  • locus may be used to refer to the specific location of a nucleic acid sequence or polymorphism on a reference chromosome.
  • sample refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism containing a nucleic acid or a mixture of nucleic acids containing at least one nucleic acid sequence that is to be sequenced and/or phased.
  • samples include, but are not limited to sputum/oral fluid, amniotic fluid, blood, a blood fraction, fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, tissue explant, organ culture and any other tissue or cell preparation, or fraction or derivative thereof or isolated therefrom.
  • samples can be taken from any organism having chromosomes, including, but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, etc.
  • the sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample.
  • pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth.
  • Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc.
  • sequence includes or represents a strand of nucleotides coupled to each other.
  • the nucleotides may be based on DNA or RNA. It should be understood that one sequence may include multiple sub-sequences.
  • a single sequence (e.g., of a PCR amplicon) may have 350 nucleotides; the sample read may include multiple sub-sequences within these 350 nucleotides.
  • the sample read may include first and second flanking subsequences having, for example, 20-50 nucleotides.
  • the first and second flanking sub-sequences may be located on either side of a repetitive segment having a corresponding sub-sequence (e.g., 40-100 nucleotides).
  • Each of the flanking sub-sequences may include (or include portions of) a primer sub-sequence (e.g., 10-30 nucleotides).
  • the term “sub-sequence” will be referred to as “sequence,” but it is understood that two sequences are not necessarily separate from each other on a common strand.
  • the sequences may be given different labels (e.g., target sequence, primer sequence, flanking sequence, reference sequence, and the like). Other terms, such as “allele,” may be given different labels to differentiate between like objects.
  • the application uses “read(s)” and “sequence read(s)” interchangeably.
  • paired-end sequencing refers to sequencing methods that sequence both ends of a target fragment. Paired-end sequencing may facilitate detection of genomic rearrangements and repetitive segments, as well as gene fusions and novel transcripts. Methodologies for paired-end sequencing are described in PCT Publication No. WO07010252, PCT Application Serial No. PCT/GB2007/003798 and U.S. Patent Application Publication US 2009/0088327, each of which is incorporated by reference herein.
  • a series of operations may be performed as follows: (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first sequencing primer and carry out repeated cycles of extension, scanning and deblocking, as set forth above; (d) “invert” the target nucleic acids on the flow cell surface by synthesizing a complementary copy; (e) linearize the resynthesized strand; and (f) hybridize a second sequencing primer and carry out repeated cycles of extension, scanning and deblocking, as set forth above.
  • the inversion operation can be carried out by delivering reagents as set forth above for a single cycle of bridge amplification.
  • reference genome refers to any particular known genome sequence, whether partial or complete, of any organism which may be used to reference identified sequences from a subject.
  • A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
  • a genome includes both the genes and the noncoding sequences of the DNA.
  • the reference sequence may be larger than the reads that are aligned to it.
  • the reference genome sequence is that of a full length human genome.
  • the reference genome sequence is limited to a specific human chromosome such as chromosome 13.
  • a reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences, although the term reference genome is intended to cover such sequences.
  • reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species.
  • the reference genome is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.
  • the “genome” also covers so-called “graph genomes”, which use a particular storage format and representation of the genome sequence.
  • in contrast to graph genomes, conventional reference genomes store data in a linear file.
  • the graph genomes refer to a representation where alternative sequences (e.g , different copies of a chromosome with small differences) are stored as different paths in a graph. Additional information regarding graph genome implementations can be found in https://www.biorxiv.org/content/biorxiv/early/2018/03/20/194530.full.pdf, the content of which is hereby incorporated herein by reference in its entirety.
  • the term “read” refers to a collection of sequence data that describes a fragment of a nucleotide sample or reference.
  • the term“read” may refer to a sample read and/or a reference read.
  • a read represents a short sequence of contiguous base pairs in the sample or reference.
  • the read may be represented symbolically by the base pair sequence (in ATCG) of the sample or reference fragment.
  • a read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample.
  • a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome or genomic region or gene.
  • Next-generation sequencing methods include, for example, sequencing by synthesis technology (Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences) and sequencing by ligation (SOLiD sequencing).
  • the length of each read may vary from about 30 bp to more than 10,000 bp.
  • the DNA sequencing method using a SOLiD sequencer generates nucleic acid reads of about 50 bp.
  • Ion Torrent Sequencing generates nucleic acid reads of up to 400 bp and 454 pyrosequencing generates nucleic acid reads of about 700 bp.
  • single-molecule real-time sequencing methods may generate reads of 10,000 bp to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence reads have a length of 30-100 bp, 50-200 bp, or 50-400 bp.
  • sample read refers to sequence data for a genomic sequence of interest from a sample.
  • sample read comprises sequence data from a PCR amplicon having a forward and reverse primer sequence.
  • sequence data can be obtained from any select sequence methodology.
  • the sample read can be, for example, from a sequencing-by-synthesis (SBS) reaction, a sequencing-by-ligation reaction, or any other suitable sequencing methodology for which it is desired to determine the length and/or identity of a repetitive element.
  • the sample read can be a consensus (e.g., averaged or weighted) sequence derived from multiple sample reads.
  • providing a reference sequence comprises identifying a locus-of-interest based upon the primer sequence of the PCR amplicon.
  • raw fragment refers to sequence data for a portion of a genomic sequence of interest that at least partially overlaps a designated position or secondary position of interest within a sample read or sample fragment.
  • raw fragments include a duplex stitched fragment, a simplex stitched fragment, a duplex un-stitched fragment and a simplex un-stitched fragment.
  • the term “raw” is used to indicate that the raw fragment includes sequence data having some relation to the sequence data in a sample read, regardless of whether the raw fragment exhibits a supporting variant that corresponds to and authenticates or confirms a potential variant in a sample read.
  • the term “raw fragment” does not indicate that the fragment necessarily includes a supporting variant that validates a variant call in a sample read. For example, when a sample read is determined by a variant call application to exhibit a first variant, the variant call application may determine that one or more raw fragments lack a corresponding type of “supporting” variant that may otherwise be expected to occur given the variant in the sample read.
  • the terms “mapping”, “aligned,” “alignment,” or “aligning” refer to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain implementations, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester.
  • an alignment additionally indicates a location in the reference sequence where the read or tag maps to. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13.
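The set-membership view of alignment described above can be illustrated with a minimal Python sketch that reports whether an exact read occurs in a reference string and, if so, where. Real aligners tolerate mismatches and use indexed data structures; the sequences below are hypothetical.

    def read_in_reference(read: str, reference: str):
        # Set-membership test plus location: is the read present, and where?
        pos = reference.find(read)
        return pos != -1, pos  # (mapped?, 0-based position or -1)

    reference = "AGGACATTGCA"
    print(read_in_reference("ACATT", reference))  # (True, 3)
    print(read_in_reference("TTTTT", reference))  # (False, -1)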
  • micro-indel refers to the insertion and/or the deletion of bases in the DNA of an organism.
  • a micro-indel represents an indel that results in a net change of 1 to 50 nucleotides. In coding regions of the genome, unless the length of an indel is a multiple of 3, it will produce a frameshift mutation (see the sketch below). Indels can be contrasted with point mutations.
  • An indel inserts and deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA.
  • Indels can also be contrasted with a Tandem Base Mutation (TBM), which may be defined as substitution at adjacent nucleotides (primarily substitutions at two adjacent nucleotides, but substitutions at three adjacent nucleotides have been observed).
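The multiple-of-3 rule for frameshifts noted above can be sketched directly; the function name is illustrative, not part of the disclosure.

    def is_frameshift(net_indel_length: int) -> bool:
        # In a coding region, an indel whose net length is not a multiple
        # of 3 shifts the downstream reading frame.
        return net_indel_length % 3 != 0

    print(is_frameshift(4))  # True: a 4-base insertion shifts the frame
    print(is_frameshift(3))  # False: an in-frame insertion or deletion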
  • nucleic acid sequence variant refers to a nucleic acid sequence that is different from a nucleic acid reference.
  • Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphism (SNP), short deletion and insertion polymorphisms (Indel), copy number variation (CNV), microsatellite markers or short tandem repeats and structural variation.
  • Somatic variant calling is the effort to identify variants present at low frequency in the DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an accumulation of mutations in DNA.
  • a DNA sample from a tumor is generally heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations).
  • somatic mutations will often appear at a low frequency. For example, a SNV might be seen in only 10% of the reads covering a given base.
  • a variant that is to be classified as somatic or germline by the variant classifier is also referred to herein as the “variant under test”.
  • noise refers to a mistaken variant call resulting from one or more errors in the sequencing process and/or in the variant call application.
  • variant frequency or variant allele frequency (VAF) represents the relative frequency of an allele (variant of a gene) at a particular locus in a population, expressed as a fraction or percentage.
  • the fraction or percentage may be the fraction of all chromosomes in the population that carry that allele.
  • sample variant frequency represents the relative frequency of an allele/variant at a particular locus/position along a genomic sequence of interest over a “population” corresponding to the number of reads and/or samples obtained for the genomic sequence of interest from an individual.
  • a baseline variant frequency represents the relative frequency of an allele/variant at a particular locus/position along one or more baseline genomic sequences, where the “population” corresponds to the number of reads and/or samples obtained for the one or more baseline genomic sequences from a population of normal individuals.
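A minimal sketch of computing a sample variant frequency from read counts, consistent with the earlier example of an SNV seen in 10% of the reads covering a base; the function and variable names are hypothetical.

    def sample_variant_frequency(variant_reads: int, total_reads: int) -> float:
        # Fraction of reads at a locus/position that support the variant allele.
        if total_reads == 0:
            raise ValueError("no reads cover this position")
        return variant_reads / total_reads

    print(sample_variant_frequency(12, 120))  # 0.1, i.e., a 10% VAF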
  • the terms “position”, “designated position”, and “locus” refer to a location or coordinate of one or more nucleotides within a sequence of nucleotides.
  • the terms “position”, “designated position”, and “locus” also refer to a location or coordinate of one or more base pairs in a sequence of nucleotides.
  • haplotype refers to a combination of alleles at adjacent sites on a chromosome that are inherited together.
  • a haplotype may be one locus, several loci, or an entire chromosome depending on the number of recombination events that have occurred between a given set of loci, if any occurred.
  • threshold refers to a numeric or non-numeric value that is used as a cutoff to characterize a sample, a nucleic acid, or portion thereof (e.g., a read).
  • a threshold may be varied based upon empirical analysis. The threshold may be compared to a measured or calculated value to determine whether the source giving rise to such value should be classified in a particular manner. Threshold values can be identified empirically or analytically. The choice of a threshold is dependent on the level of confidence that the user wishes to have to make the classification. The threshold may be chosen for a particular purpose (e.g., to balance sensitivity and selectivity).
  • the term “threshold” indicates a point at which a course of analysis may be changed and/or a point at which an action may be triggered.
  • a threshold is not required to be a predetermined number. Instead, the threshold may be, for instance, a function that is based on a plurality of factors.
  • the threshold may be adaptive to the circumstances. Moreover, a threshold may indicate an upper limit, a lower limit, or a range between limits.
  • a metric or score that is based on sequencing data may be compared to the threshold.
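To illustrate a threshold that is a function of several factors rather than a fixed number, the following sketch compares a normalized metric against an adaptive cutoff. The choice of factors and the two-sigma margin are assumptions for illustration, not the disclosed method.

    def adaptive_threshold(background: float, noise_sigma: float) -> float:
        # A threshold computed from context (background level plus a noise
        # margin) instead of a single predetermined number.
        return background + 2.0 * noise_sigma

    def exceeds(metric: float, background: float, noise_sigma: float) -> bool:
        return metric >= adaptive_threshold(background, noise_sigma)

    print(exceeds(0.9, 0.5, 0.1))  # True: 0.9 >= 0.7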
  • the terms “metric” or “score” may include values or results that were determined from the sequencing data or may include functions that are based on the values or results that were determined from the sequencing data.
  • the metric or score may be adaptive to the circumstances. For instance, the metric or score may be a normalized value.
  • one or more implementations may use count scores when analyzing the data. A count score may be based on the number of sample reads. The sample reads may have undergone one or more filtering stages such that the sample reads have at least one common characteristic or quality.
  • each of the sample reads that are used to determine a count score may have been aligned with a reference sequence or may be assigned as a potential allele.
  • the number of sample reads having a common characteristic may be counted to determine a read count.
  • Count scores may be based on the read count. In some implementations, the count score may be a value that is equal to the read count. In other implementations, the count score may be based on the read count and other information. For example, a count score may be based on the read count for a particular allele of a genetic locus and a total number of reads for the genetic locus. In some implementations, the count score may be based on the read count and previously -obtained data for the genetic locus.
  • the count scores may be normalized scores between predetermined values.
  • the count score may also be a function of read counts from other loci of a sample or a function of read counts from other samples that were concurrently run with the sample-of-interest.
  • the count score may be a function of the read count of a particular allele and the read counts of other loci in the sample and/or the read counts from other samples.
  • the read counts from other loci and/or the read counts from other samples may be used to normalize the count score for the particular allele.
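A minimal sketch of normalizing a count score against read counts from other loci of the same sample, as described above. Normalizing by the median is one plausible choice among many, not the disclosed method; all names are hypothetical.

    from statistics import median

    def normalized_count_score(allele_read_count, other_locus_read_counts):
        # Count score for an allele, normalized by the typical (median)
        # read count observed at other loci of the sample.
        baseline = median(other_locus_read_counts)
        return allele_read_count / baseline if baseline else 0.0

    print(normalized_count_score(250, [100, 120, 90, 110]))  # approx. 2.38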
  • the terms “coverage” or “fragment coverage” refer to a count or other measure of a number of sample reads for the same fragment of a sequence.
  • a read count may represent a count of the number of reads that cover a corresponding fragment.
  • the coverage may be determined by multiplying the read count by a designated factor that is based on historical knowledge, knowledge of the sample, knowledge of the locus, etc.
  • read depth refers to the number of sequenced reads with overlapping alignment at the target position. This is often expressed as an average or as a percentage exceeding a cutoff over a set of intervals (such as exons, genes, or panels). For example, a clinical report might say that a panel average coverage is 1,105× with 98% of targeted bases covered >100×.
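The clinical-report style summary above (an average coverage plus the percentage of targeted bases exceeding a cutoff) can be derived from per-base read depths with a short sketch; the depth values here are hypothetical.

    def depth_summary(per_base_depths, cutoff=100):
        # Average read depth over targeted bases, and the fraction of bases
        # covered at or above the cutoff (e.g., >=100x).
        n = len(per_base_depths)
        mean_depth = sum(per_base_depths) / n
        frac_over = sum(d >= cutoff for d in per_base_depths) / n
        return mean_depth, frac_over

    depths = [1200, 1100, 950, 1300, 80]
    print(depth_summary(depths))  # (926.0, 0.8): 926x average, 80% of bases >= 100x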
  • base call quality score or “Q score” refers to a PHRED-scaled value, typically ranging from 0-50, that is inversely related to the probability that a single sequenced base call is incorrect (Q = −10·log₁₀(P_error)). For example, a T base call with a Q of 20 is considered likely correct with a probability of 99%. Any base call with Q < 20 should be considered low quality, and any variant identified where a substantial proportion of sequenced reads supporting the variant are of low quality should be considered potentially false positive.
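The PHRED relation underlying the Q score, Q = −10·log₁₀(P_error), also governs the mapping quality discussed below; a short sketch of the conversion in both directions:

    import math

    def phred_to_error_prob(q: float) -> float:
        # PHRED scale: Q = -10 * log10(P_error).
        return 10 ** (-q / 10)

    def error_prob_to_phred(p_error: float) -> float:
        return -10 * math.log10(p_error)

    print(phred_to_error_prob(20))     # 0.01   -> base call 99% likely correct
    print(phred_to_error_prob(40))     # 0.0001 -> how a mapping quality of 40 is read below
    print(error_prob_to_phred(0.001))  # 30.0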
  • variant reads or“variant read number” refer to the number of sequenced reads supporting the presence of the variant.
  • the genetic message in DNA can be represented as a string of the letters A, G, C, and T. For example, 5’ - AGGACA - 3’. Often, the sequence is written in the direction shown here, i.e., with the 5’ end to the left and the 3’ end to the right. DNA may sometimes occur as a single-stranded molecule (as in certain viruses), but normally we find DNA as a double-stranded unit. It has a double helical structure with two antiparallel strands. In this case, the word “antiparallel” means that the two strands run in parallel, but have opposite polarity.
  • the double-stranded DNA is held together by pairing between bases and the pairing is always such that adenine (A) pairs with thymine (T) and cytosine (C) pairs with guanine (G).
  • This pairing is referred to as complementarity, and one strand of DNA is said to be the complement of the other.
  • the double-stranded DNA may thus be represented as two strings, like this: 5’ - AGGACA - 3’ and 3’ - TCCTGT - 5’. Note that the two strands have opposite polarity. Accordingly, the strandedness of the two DNA strands can be referred to as the reference strand and its complement, forward and reverse strands, top and bottom strands, sense and antisense strands, or Watson and Crick strands.
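The pairing rules above (A with T, C with G) determine the complementary strand mechanically; a minimal Python sketch using the example sequence 5’-AGGACA-3’:

    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(seq: str) -> str:
        # Complement each base (A<->T, C<->G), then reverse so the
        # antiparallel strand is reported in its own 5'->3' orientation.
        return "".join(COMPLEMENT[base] for base in reversed(seq))

    print(reverse_complement("AGGACA"))  # "TGTCCT", i.e., 3'-TCCTGT-5'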
  • the reads alignment (also called reads mapping) is the process of figuring out where in the genome a sequence is from. Once the alignment is performed, the “mapping quality” or the “mapping quality score (MAPQ)” of a given read quantifies the probability that its position on the genome is correct. The mapping quality is encoded in the phred scale, MAPQ = −10·log₁₀(P), where P is the probability that the alignment is not correct.
  • for example, a mapping quality of 40 corresponds to an error probability of 10⁻⁴, meaning that there is a 0.01% chance that the read was aligned incorrectly.
  • the mapping quality is therefore associated with several alignment factors, such as the base quality of the read, the complexity of the reference genome, and the paired-end information.
  • if the base quality of the read is low, it means that the observed sequence might be wrong, and thus its alignment might be wrong.
  • the mappability refers to the complexity of the genome.
  • the MAPQ reflects the fact that the reads are not uniquely aligned and that their real origin cannot be determined.
  • in the case of paired-end sequencing data, concordant pairs are more likely to be well aligned. The higher the mapping quality, the better the alignment.
  • a read aligned with a good mapping quality usually means that the read sequence was good and was aligned with few mismatches in a high mappability region.
  • the MAPQ value can be used as a quality control of the alignment results.
  • a signal refers to a detectable event such as an emission, preferably a light emission, for example, in an image.
  • a signal can represent any detectable light emission that is captured in an image (i.e., a “spot”).
  • a signal can refer both to an actual emission from an analyte of the specimen and to a spurious emission that does not correlate to an actual analyte.
  • a signal could arise from noise and could be later discarded as not representative of an actual analyte of a specimen.
  • a clump refers to a group of signals.
  • the signals are derived from different analytes.
  • a signal clump is a group of signals that cluster together.
  • a signal clump represents a physical region covered by one amplified oligonucleotide. Each signal clump should ideally be observed as several signals (one per template cycle, and possibly more due to cross-talk). Accordingly, duplicate signals are detected where two (or more) signals are included in a template from the same clump of signals.
  • terms such as “minimum,” “maximum,” “minimize,” “maximize” and grammatical variants thereof can include values that are not the absolute maxima or minima.
  • the values include near maximum and near minimum values.
  • the values can include local maximum and/or local minimum values.
  • the values include only absolute maximum or minimum values.
  • cross-talk refers to the detection of signals in one image that are also detected in a separate image.
  • cross-talk can occur when an emitted signal is detected in two separate detection channels. For example, where an emitted signal occurs in one color, the emission spectrum of that signal may overlap with another emitted signal in another color.
  • fluorescent molecules used to indicate the presence of nucleotide bases A, C, G and T are detected in separate channels. However, because the emission spectra of A and C overlap, some of the C color signal may be detected during detection using the A color channel. Accordingly, cross-talk between the A and C signals allows signals from one color image to appear in the other color image. In some implementations, G and T cross-talk.
  • the amount of cross-talk between channels is asymmetric. It will be appreciated that the amount of cross talk between channels can be controlled by, among other things, the selection of signal molecules having an appropriate emission spectrum as well as selection of the size and wavelength range of the detection channel.
  • register refers to any process to correlate signals in an image or data set from a first time point or perspective with signals in an image or data set from another time point or perspective.
  • registration can be used to align signals from a set of images to form a template.
  • registration can be used to align signals from other images to a template.
  • One signal may be directly or indirectly registered to another signal.
  • a signal from image “S” may be registered to image “G” directly.
  • a signal from image “N” may be directly registered to image “G”, or alternatively, the signal from image “N” may be registered to image “S”, which has previously been registered to image “G”.
  • the signal from image “N” is indirectly registered to image “G”.
  • the term "fiducial” is intended to mean a distinguishable point of reference in or on an object.
  • the point of reference can be, for example, a mark, second object, shape, edge, area, irregularity, channel, pit, post or the like.
  • the point of reference can be present in an image of the object or in another data set derived from detecting the object.
  • the point of reference can be specified by an x and/or y coordinate in a plane of the object. Alternatively or additionally, the point of reference can be specified by a z coordinate that is orthogonal to the xy plane, for example, being defined by the relative locations of the object and a detector.
  • One or more coordinates for a point of reference can be specified relative to one or more other analytes of an object or of an image or other data set derived from the object.
  • optical signal is intended to include, for example, fluorescent, luminescent, scatter, or absorption signals.
  • Optical signals can be detected in the ultraviolet (UV) range (about 200 to 390 nm), visible (VIS) range (about 391 to 770 nm), infrared (IR) range (about 0.771 to 25 microns), or other range of the electromagnetic spectrum.
  • Optical signals can be detected in a way that excludes all or part of one or more of these ranges.
  • signal level is intended to mean an amount or quantity of detected energy or coded information that has a desired or predefined characteristic.
  • an optical signal can be quantified by one or more of intensity, wavelength, energy, frequency, power, luminance or the like.
  • Other signals can be quantified according to characteristics such as voltage, current, electric field strength, magnetic field strength, frequency, power, temperature, etc. Absence of signal is understood to be a signal level of zero or a signal level that is not meaningfully distinguished from noise.
  • the term "simulate” is intended to mean creating a representation or model of a physical thing or action that predicts characteristics of the thing or action.
  • the representation or model can in many cases be distinguishable from the thing or action.
  • the representation or model can be distinguishable from a thing with respect to one or more characteristics such as color, intensity of signals detected from all or part of the thing, size, or shape.
  • the representation or model can be idealized, exaggerated, muted, or incomplete when compared to the thing or action.
  • a representation or model can be distinguishable from the thing or action that it represents, for example, with respect to at least one of the characteristics set forth above.
  • the representation or model can be provided in a computer readable format or medium such as one or more of those set forth elsewhere herein.
  • a specific signal is intended to mean detected energy or coded information that is selectively observed over other energy or information such as background energy or information.
  • a specific signal can be an optical signal detected at a particular intensity, wavelength or color; an electrical signal detected at a particular frequency, power or field strength; or other signals known in the art pertaining to spectroscopy and analytical detection.
  • the term “swath” is intended to mean a rectangular portion of an object.
  • the swath can be an elongated strip that is scanned by relative movement between the object and a detector in a direction that is parallel to the longest dimension of the strip. Generally, the width of the rectangular portion or strip will be constant along its full length.
  • Multiple swaths of an object can be parallel to each other.
  • Multiple swaths of an object can be adjacent to each other, overlapping with each other, abutting each other, or separated from each other by an interstitial area.
  • variance is intended to mean a difference between that which is expected and that which is observed or a difference between two or more observations.
  • variance can be the discrepancy between an expected value and a measured value.
  • Variance can be represented using statistical functions such as standard deviation, the square of standard deviation, coefficient of variation or the like.
  • xy coordinates is intended to mean information that specifies location, size, shape, and/or orientation in an xy plane.
  • the information can be, for example, numerical coordinates in a Cartesian system.
  • the coordinates can be provided relative to one or both of the x and y axes or can be provided relative to another location in the xy plane.
  • coordinates of an analyte of an object can specify the location of the analyte relative to the location of a fiducial or other analyte of the object.
  • xy plane is intended to mean a 2 dimensional area defined by straight line axes x and y.
  • the area can be further specified as being orthogonal to the direction of observation between the detector and object being detected.
  • the term “z coordinate” is intended to mean information that specifies the location of a point, line or area along an axis that is orthogonal to an xy plane.
  • the z axis is orthogonal to an area of an object that is observed by a detector.
  • the direction of focus for an optical system may be specified along the z axis.
  • acquired signal data is transformed using an affine transformation.
  • template generation makes use of the fact that the affine transforms between color channels are consistent between runs. Because of this consistency, a set of default offsets can be used when determining the coordinates of the analytes in a specimen.
  • a default offsets file can contain the relative transformation (shift, scale, skew) for the different channels relative to one channel, such as the A channel.
  • the offsets between color channels drift during a run and/or between runs, making offset-driven template generation difficult.
  • the methods and systems provided herein can utilize offset-less template generation, which is described further below.
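  • As a hedged sketch of how such a default offsets file might be applied (the function name and parameter values below are illustrative assumptions, not the disclosed format):

        import numpy as np

        def apply_channel_offset(coords, scale=(1.0, 1.0), skew=0.0, shift=(0.0, 0.0)):
            # Map (x, y) analyte coordinates from a reference channel (e.g., the
            # A channel) into another color channel using a relative affine
            # transform whose shift/scale/skew play the role of default offsets.
            sx, sy = scale
            linear = np.array([[sx, skew],
                               [0.0, sy]])  # 2x2 part encodes scale and skew
            return coords @ linear.T + np.asarray(shift)

        # Example: template coordinates in the A channel mapped to the C channel.
        template_xy = np.array([[10.0, 20.0], [15.5, 7.25]])
        c_channel_xy = apply_channel_offset(template_xy, scale=(1.001, 0.999),
                                            skew=0.002, shift=(0.8, -0.3))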
  • the system can comprise a flow cell.
  • the flow cell comprises lanes, or other configurations, of tiles, wherein at least some of the tiles comprise one or more arrays of analytes.
  • the analytes comprise a plurality of molecules such as nucleic acids.
  • the flow cell is configured to deliver a labeled nucleotide base to an array of nucleic acids, thereby extending a primer hybridized to a nucleic acid within an analyte so as to produce a signal corresponding to an analyte comprising the nucleic acid.
  • the nucleic acids within a analyte are identical or substantially identical to each other.
  • each image in the set of images includes color signals, wherein a different color corresponds to a different nucleotide base.
  • each image of the set of images comprises signals having a single color selected from at least four different colors.
  • each image in the set of images comprises signals having a single color selected from four different colors.
  • nucleic acids can be sequenced by providing four different labeled nucleotide bases to the array of molecules so as to produce four different images, each image comprising signals having a single color, wherein the signal color is different for each of the four different images, thereby producing a cycle of four color images that corresponds to the four possible nucleotides present at a particular position in the nucleic acid.
  • the system comprises a flow cell that is configured to deliver additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.
  • the methods provided herein can include determining whether a processor is actively acquiring data or whether the processor is in a low activity state. Acquiring and storing large numbers of high-quality images typically requires massive amounts of storage capacity. Additionally, once acquired and stored, the analysis of image data can become resource intensive and can interfere with processing capacity of other functions, such as ongoing acquisition and storage of additional image data.
  • the term low activity state refers to the processing capacity of a processor at a given time.
  • a low activity state occurs when a processor is not acquiring and/or storing data.
  • a low activity state occurs when some data acquisition and/or storage is taking place, but additional processing capacity remains such that image analysis can occur at the same time without interfering with other functions.
  • identifying a conflict refers to identifying a situation where multiple processes compete for resources. In some such implementations, one process is given priority over another process. In some implementations, a conflict may relate to the need to give priority for allocation of time, processing capacity, storage capacity or any other resource for which priority is given. Thus, in some implementations, where processing time or capacity is to be distributed between two processes such as either analyzing a data set and acquiring and/or storing the data set, a conflict between the two processes exists and can be resolved by giving priority to one of the processes.
  • the systems can include a processor; a storage capacity; and a program for image analysis, the program comprising instructions for processing a first data set for storage and a second data set for analysis, wherein the processing comprises acquiring and/or storing the first data set on the storage device and analyzing the second data set when the processor is not acquiring the first data set.
  • the program includes instructions for identifying at least one instance of a conflict between acquiring and/or storing the first data set and analyzing the second data set; and resolving the conflict in favor of acquiring and/or storing image data such that acquiring and/or storing the first data set is given priority.
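  • A minimal sketch of this priority rule, assuming in-memory queues stand in for the instrument's acquisition and analysis workloads (the helper names are hypothetical):

        from queue import Empty, Queue

        def run_pipeline(acquire_queue: Queue, analyze_queue: Queue, store, analyze):
            # Resolve conflicts in favor of acquiring/storing image data: the
            # second data set is analyzed only when no acquisition work is
            # pending, i.e., when the processor is in a low activity state.
            while True:
                try:
                    chunk = acquire_queue.get_nowait()
                except Empty:
                    chunk = None
                if chunk is not None:
                    store(chunk)          # priority: acquisition and storage
                    continue
                try:
                    item = analyze_queue.get_nowait()
                except Empty:
                    break                 # nothing left to acquire or analyze
                analyze(item)             # background: image analysis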
  • the first data set comprises image files obtained from an optical imaging device.
  • the system further comprises an optical imaging device.
  • the optical imaging device comprises a light source and a detection device.
  • program refers to instructions or commands to perform a task or process.
  • program can be used interchangeably with the term module.
  • a program can be a compilation of various instructions executed under the same set of commands. In other implementations, a program can refer to a discrete batch or file.
  • an important measure of a sequencing system's utility is its overall efficiency. For example, the amount of mappable data produced per day and the total cost of installing and running the instrument are important aspects of an economical sequencing solution.
  • real-time base calling can be enabled on an instrument computer and can run in parallel with sequencing chemistry and imaging. This allows much of the data processing and analysis to be completed before the sequencing chemistry finishes. Additionally, it can reduce the storage required for intermediate data and limit the amount of data that needs to travel across the network.
  • as sequence output has increased, the data per run transferred from the systems provided herein to the network and to secondary analysis processing hardware has substantially decreased.
  • network loads are dramatically reduced. Without these on-instrument, off-network data reduction techniques, the image output of a fleet of DNA sequencing instruments would cripple most networks.
  • the methods and/or systems presented herein act as a state machine, keeping track of the individual state of each specimen, and when it detects that a specimen is ready to advance to the next state, it does the appropriate processing and advances the specimen to that state.
  • how a state machine monitors a file system to determine when a specimen is ready to advance to the next state according to a preferred implementation is set forth in Example 1 below.
  • the methods and systems provided herein are multi-threaded and can work with a configurable number of threads.
  • the methods and systems provided herein are capable of working in the background during a live sequencing run for real-time analysis, or they can be run using a pre-existing set of image data for off-line analysis.
  • the methods and systems handle multi-threading by giving each thread its own subset of specimens for which it is responsible. This minimizes the possibility of thread contention.
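  • The following is a simplified sketch of such a multi-threaded state machine; the state names, readiness test, and thread count are illustrative assumptions:

        from concurrent.futures import ThreadPoolExecutor

        STATES = ["imaged", "registered", "extracted", "base_called"]  # hypothetical

        def advance(specimen):
            # Hypothetical readiness test and one state transition per call.
            if specimen["ready"]:
                current = STATES.index(specimen["state"])
                if current + 1 < len(STATES):
                    specimen["state"] = STATES[current + 1]

        def run_state_machine(specimens, num_threads=4):
            # Each thread owns its own subset of specimens, which minimizes
            # the possibility of thread contention.
            subsets = [specimens[i::num_threads] for i in range(num_threads)]
            def worker(subset):
                for specimen in subset:
                    advance(specimen)
            with ThreadPoolExecutor(max_workers=num_threads) as pool:
                list(pool.map(worker, subsets))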
  • a method of the present disclosure can include a step of obtaining a target image of an object using a detection apparatus, wherein the image includes a repeating pattern of analytes on the object.
  • Detection apparatus that are capable of high resolution imaging of surfaces are particularly useful.
  • the detection apparatus will have sufficient resolution to distinguish analytes at the densities, pitches, and/or analyte sizes set forth herein.
  • Particularly useful are detection apparatus capable of obtaining images or image data from surfaces.
  • Example detectors are those that are configured to maintain an object and detector in a static relationship while obtaining an area image. Scanning apparatus can also be used. For example, an apparatus that obtains sequential area images (e.g., so-called ‘step and shoot’ detectors) can be used.
  • Point scanning detectors can be configured to scan a point (i.e., a small detection area) over the surface of an object via a raster motion in the x-y plane of the surface.
  • Line scanning detectors can be configured to scan a line along the y dimension of the surface of an object, the longest dimension of the line occurring along the x dimension. It will be understood that the detection device, object or both can be moved to achieve scanning detection. Detection apparatus that are particularly useful, for example in nucleic acid sequencing applications, are described in US Pat App. Pub.
  • article of manufacture refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or nonvolatile memory devices.
  • Such hardware may include, but is not limited to, field programmable gate arrays (FPGAs), coarse grained reconfigurable architectures (CGRAs), application-specific integrated circuits (ASICs), complex programmable logic devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar processing devices.
  • information or algorithms set forth herein are present in non-transient storage media.
  • a computer implemented method set forth herein can occur in real time while multiple images of an object are being obtained.
  • Such real time analysis is particularly useful for nucleic acid sequencing applications wherein an array of nucleic acids is subjected to repeated cycles of fluidic and detection steps.
  • Analysis of the sequencing data can often be computationally intensive such that it can be beneficial to perform the methods set forth herein in real time or in the background while other data acquisition or analysis algorithms are in process.
  • Example real time analysis methods that can be used with the present methods are those used for the MiSeq and HiSeq sequencing devices commercially available from Illumina, Inc. (San Diego, Calif.) and/or described in US Pat. App. Pub. No. 2012/0020537 Al, which is incorporated herein by reference.
  • An example data analysis system can be formed by one or more programmed computers, with programming being stored on one or more machine readable media and with code executed to carry out one or more steps of methods described herein.
  • the system includes an interface designed to permit networking of the system to one or more detection systems (e.g., optical imaging systems) that are configured to acquire data from target objects.
  • the interface may receive and condition data, where appropriate.
  • the detection system will output digital image data, for example, image data that is representative of individual picture elements or pixels that, together, form an image of an array or other object.
  • a processor processes the received detection data in accordance with a one or more routines defined by processing code.
  • the processing code may be stored in various types of memory circuitry.
  • the processing code executed on the detection data includes a data analysis routine designed to analyze the detection data to determine the locations and metadata of individual analytes visible or encoded in the data, as well as locations at which no analyte is detected (i.e., where there is no analyte, or where no meaningful signal was detected from an existing analyte).
  • analyte locations in an array will typically appear brighter than non-analyte locations due to the presence of fluorescing dyes attached to the imaged analytes.
  • analytes need not appear brighter than their surrounding area, for example, when a target for the probe at the analyte is not present in an array being detected.
  • the color at which individual analytes appear may be a function of the dye employed as well as of the wavelength of the light used by the imaging system for imaging purposes.
  • Analytes to which targets are not bound or that are otherwise devoid of a particular label can be identified according to other characteristics, such as their expected location in the microarray.
  • a value assignment may be carried out.
  • the value assignment will assign a digital value to each analyte based upon characteristics of the data represented by detector components (e.g., pixels) at the corresponding location. That is, for example when imaging data is processed, the value assignment routine may be designed to recognize that a specific color or wavelength of light was detected at a specific location, as indicated by a group or cluster of pixels at the location.
  • the four common nucleotides will be represented by four separate and distinguishable colors. Each color, then, may be assigned a value corresponding to that nucleotide.
  • module may include a hardware and/or software system and circuitry that operates to perform one or more functions.
  • a module, system, or system controller may include a computer processor, controller, or other logic-based device that performs operations based on instructions stored on a tangible and non-transitory computer readable storage medium, such as a computer memory.
  • a module, system, or system controller may include a hard-wired device that performs operations based on hard-wired logic and circuitry.
  • the module, system, or system controller shown in the attached figures may represent the hardware and circuitry that operates based on software or hardwired instructions, the software that directs hardware to perform the operations, or a combination thereof.
  • the module, system, or system controller can include or represent hardware circuits or circuitry that include and/or are connected with one or more processors, such as one or more computer microprocessors.
  • the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory.
  • one of the processes for nucleic acid sequencing in use is sequencing-by-synthesis.
  • the technique can be applied to massively parallel sequencing projects. For example, by using an automated platform, it is possible to carry out hundreds of thousands of sequencing reactions simultaneously.
  • one of the implementations of the present invention relates to instruments and methods for acquiring, storing, and analyzing image data generated during nucleic acid sequencing.
  • Enormous gains in the amount of data that can be acquired and stored make streamlined image analysis methods even more beneficial.
  • the image analysis methods described herein permit both designers and end users to make efficient use of existing computer hardware.
  • the present disclosure describes various methods and systems for carrying out the methods. Examples of some of the methods are described as a series of steps. However, it should be understood that implementations are not limited to the particular steps and/or order of steps described herein. Steps may be omitted, steps may be modified, and/or other steps may be added. Moreover, steps described herein may be combined, steps may be performed simultaneously, steps may be performed concurrently, steps may be split into multiple sub-steps, steps may be performed in a different order, or steps (or a series of steps) may be re-performed in an iterative fashion. In addition, although different methods are set forth herein, it should be understood that the different methods (or steps of the different methods) may be combined in other implementations.
  • a processing unit, processor, module, or computing system that is “configured to” perform a task or operation may be understood as being particularly structured to perform the task or operation (e.g., having one or more programs or instructions stored thereon or used in conjunction therewith tailored or intended to perform the task or operation, and/or having an arrangement of processing circuitry tailored or intended to perform the task or operation).
  • a general purpose computer (which may become “configured to” perform the task or operation if appropriately programmed) is not “configured to” perform a task or operation unless or until specifically programmed or structurally modified to perform the task or operation.
  • the operations of the methods described herein can be sufficiently complex such that the operations cannot be mentally performed by an average human being or a person of ordinary skill in the art within a commercially reasonable time period.
  • the methods may rely on relatively complex computations such that such a person cannot complete the methods within a commercially reasonable time.
  • the term “each,” when used in reference to a collection of items, is intended to identify an individual item in the collection but does not necessarily refer to every item in the collection. Exceptions can occur if explicit disclosure or context clearly dictates otherwise.
  • modules in this application can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some can also be implemented on different processors or computers, or spread among a number of different processors or computers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. Also as used herein, the term “module” can include “sub-modules”, which themselves can be considered herein to constitute modules. The blocks in the figures designated as modules can also be thought of as flowchart steps in a method.
  • the “identification” of an item of information does not necessarily require the direct specification of that item of information.
  • information can be “identified” in a field by simply referring to the actual information through one or more layers of indirection, or by identifying one or more items of different information which are together sufficient to determine the actual item of information.
  • the term “specify” is used herein to mean the same as “identify”.
  • a given signal, event or value is “in dependence upon” a predecessor signal, event or value if the predecessor signal, event or value influenced the given signal, event or value. If there is an intervening processing element, step or time period, the given signal, event or value can still be “in dependence upon” the predecessor signal, event or value. If the intervening processing element or step combines more than one signal, event or value, the signal output of the processing element or step is considered “in dependence upon” each of the signal, event or value inputs.
  • Figure 82 shows a computer system 8200 that can be used by the sequencing system 800A to implement the technology disclosed herein.
  • Computer system 8200 includes at least one central processing unit (CPU) 8272 that communicates with a number of peripheral devices via bus subsystem 8255.
  • peripheral devices can include a storage subsystem 8210 including, for example, memory devices and a file storage subsystem 8236, user interface input devices 8238, user interface output devices 8276, and a network interface subsystem 8274.
  • the input and output devices allow user interaction with computer system 8200.
  • Network interface subsystem 8274 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • system controller 7806 is communicably linked to the storage subsystem 8210 and the user interface input devices 8238.
  • User interface input devices 8238 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 8200.
  • User interface output devices 8276 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 8200 to the user or to another machine or computer system.
  • Storage subsystem 8210 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 8278.
  • Deep learning processors 8278 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 8278 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
  • Examples of deep learning processors 8278 include Google’s Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX82 Rackmount Series™, NVIDIA DGX-1™, Microsoft’s Stratix V FPGA™, Graphcore’s Intelligent Processor Unit (IPU)™, Qualcomm’s Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
  • Memory subsystem 8222 used in the storage subsystem 8210 can include a number of memories including a main random access memory (RAM) 8232 for storage of instructions and data during program execution and a read only memory (ROM) 8234 in which fixed instructions are stored.
  • a file storage subsystem 8236 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 8236 in the storage subsystem 8210, or in other machines accessible by the processor.
  • Bus subsystem 8255 provides a mechanism for letting the various components and subsystems of computer system 8200 communicate with each other as intended. Although bus subsystem 8255 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 8200 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 8200 depicted in Figure 82 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 8200 are possible having more or fewer components than the computer system depicted in Figure 82.
  • the method includes accessing a series of image sets generated during a sequencing run, each image set in the series generated during a respective sequencing cycle of the sequencing run, each image in the series depicting the analytes and their surrounding background, and each image in the series having a plurality of subpixels.
  • the method includes obtaining, from a base caller, a base call classifying each of the subpixels as one of four bases (A, C, T, and G), thereby producing a base call sequence for each of the subpixels across a plurality of sequencing cycles of the sequencing run.
  • the method includes generating an analyte map that identifies the analytes as disjointed regions of contiguous subpixels which share a substantially matching base call sequence.
  • the method includes determining spatial distribution of analytes, including their shapes and sizes based on the disjointed regions and storing the analyte map in memory for use as ground truth for training a classifier.
  • the method includes identifying as background those subpixels in the analyte map that do not belong to any of the disjointed regions.
  • the method includes obtaining, from the base caller, the base call classifying each of the subpixels as one of five bases (A, C, T, G, and N).
  • the analyte map identifies analyte boundary portions between two contiguous subpixels whose base call sequences do not substantially match.
  • the method includes identifying origin subpixels at preliminary center coordinates of the analytes determined by the base caller, and breadth-first searching for substantially matching base call sequences by beginning with the origin subpixels and continuing with successively contiguous non-origin subpixels.
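  • A minimal sketch of this breadth-first search, assuming 4-connected subpixels and an illustrative 80% match threshold (neither is mandated by the disclosure):

        from collections import deque

        def substantially_match(seq_a, seq_b, min_fraction=0.8):
            # Sequences substantially match when a predetermined portion of
            # base calls agree on an ordinal position-wise basis.
            matches = sum(a == b for a, b in zip(seq_a, seq_b))
            return matches >= min_fraction * len(seq_a)

        def grow_analyte(origin, calls, analyte_map, analyte_id):
            # Breadth-first search from an origin subpixel at a preliminary
            # center, adding successively contiguous subpixels whose base call
            # sequences substantially match the origin's sequence.
            height, width = len(calls), len(calls[0])
            queue = deque([origin])
            analyte_map[origin] = analyte_id
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and (ny, nx) not in analyte_map
                            and substantially_match(calls[origin[0]][origin[1]],
                                                    calls[ny][nx])):
                        analyte_map[(ny, nx)] = analyte_id
                        queue.append((ny, nx))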
  • the method includes, on an analyte-by-analyte basis, determining hyperlocated center coordinates of the analytes by calculating centers of mass of the disjointed regions of the analyte map as an average of coordinates of respective contiguous subpixels forming the disjointed regions, and storing the hyperlocated center coordinates of the analytes in the memory on the analyte-by-analyte basis for use as ground truth for training the classifier.
  • the method includes, on the analyte-by-analyte basis, identifying centers of mass subpixels in the disjointed regions of the analyte map at the hyperlocated center coordinates of the analytes, upsampling the analyte map using interpolation and storing the upsampled analyte map in the memory for use as ground truth for training the classifier, and, in the upsampled analyte map, on the analyte-by-analyte basis, assigning a value to each contiguous subpixel in the disjointed regions based on a decay factor that is proportional to distance of a contiguous subpixel from a center of mass subpixel in a disjointed region to which the contiguous subpixel belongs.
  • the value is an intensity value normalized between zero and one.
  • the method includes, in the upsampled analyte map, assigning a same predetermined value to all the subpixels identified as the background.
  • the predetermined value is a zero intensity value.
  • the method includes generating a decay map from the upsampled analyte map that expresses the contiguous subpixels in the disjointed regions and the subpixels identified as the background based on their assigned values, and storing the decay map in the memory for use as ground truth for training the classifier.
  • each subpixel in the decay map has a value normalized between zero and one.
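  • A sketch of the center-of-mass and decay-value computation for one disjointed region; the linear decay form is an assumption (the disclosure requires only a decay factor proportional to distance from the center of mass subpixel):

        import numpy as np

        def center_and_decay_values(region_subpixels):
            # region_subpixels: list of (y, x) coordinates of one disjointed region.
            coords = np.asarray(region_subpixels, dtype=float)
            center = coords.mean(axis=0)              # hyperlocated center of mass
            distances = np.linalg.norm(coords - center, axis=1)
            max_distance = float(distances.max())
            if max_distance == 0.0:                   # single-subpixel region
                return center, np.ones(len(coords))
            # Values normalized between zero and one, highest at the center.
            return center, 1.0 - distances / max_distance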
  • the method includes, in the upsampled analyte map, categorizing, on the analyte-by-analyte basis, the contiguous subpixels in the disjointed regions as analyte interior subpixels belonging to a same analyte, the centers of mass subpixels as analyte center subpixels, subpixels containing the analyte boundary portions as boundary subpixels, and the subpixels identified as the background as background subpixels, and storing the categorizations in the memory for use as ground truth for training the classifier.
  • the method includes storing, on the analyte-by-analyte basis, coordinates of the analyte interior subpixels, the analyte center subpixels, the boundary subpixels, and the background subpixels in the memory for use as ground truth for training the classifier, downscaling the coordinates by a factor used to upsample the analyte map, and storing, on the analyte-by-analyte basis, the downscaled coordinates in the memory for use as ground truth for training the classifier.
  • the method includes, in binary ground truth data generated from the upsampled analyte map, using color coding to label the analyte center subpixels as belonging to an analyte center class and all other subpixels as belonging to a non-center class, and storing the binary ground truth data in the memory for use as ground truth for training the classifier.
  • the method includes, in ternary ground truth data generated from the upsampled analyte map, using color coding to label the background subpixels as belonging to a background class, the analyte center subpixels as belonging to an analyte center class, and the analyte interior subpixels as belonging to an analyte interior class, and storing the ternary ground truth data in the memory for use as ground truth for training the classifier.
  • the method includes generating analyte maps for a plurality of tiles of the flow cell, storing the analyte maps in memory and determining spatial distribution of analytes in the tiles based on the analyte maps, including their shapes and sizes, in the upsampled analyte maps of the analytes in the tiles, categorizing, on an analyte-by-analyte basis, subpixels as analyte interior subpixels belonging to a same analyte, analyte center subpixels, boundary subpixels, and background subpixels, storing the categorizations in the memory for use as ground truth for training the classifier, and storing, on the analyte-by-analyte basis across the tiles, coordinates of the analyte interior subpixels, the analyte center subpixels, the boundary subpixels, and the background subpixels in the memory for use as ground truth for training the classifier.
  • the base call sequences are substantially matching when a predetermined portion of base calls match on an ordinal position-wise basis.
  • the base caller produces the base call sequences by interpolating intensity of the subpixels, including at least one of nearest neighbor intensity extraction, Gaussian-based intensity extraction, intensity extraction based on average of 2 x 2 subpixel area, intensity extraction based on brightest of 2 x 2 subpixel area, intensity extraction based on average of 3 x 3 subpixel area, bilinear intensity extraction, bicubic intensity extraction, and/or intensity extraction based on weighted area coverage.
  • the subpixels are identified to the base caller based on their integer or non-integer coordinates.
  • the method includes requiring that at least some of the disjointed regions have a predetermined minimum number of subpixels.
  • the flow cell has at least one patterned surface with an array of wells that occupy the analytes.
  • the method includes, based on the determined shapes and sizes of the analytes, determining which ones of the wells are substantially occupied by at least one analyte, which ones of the wells are minimally occupied, and which ones of the wells are co-occupied by multiple analytes.
  • the flow cell has at least one nonpatterned surface and the analytes are unevenly scattered over the nonpatterned surface.
  • the density of the analytes ranges from about 100,000 analytes/mm² to about 1,000,000 analytes/mm². In one implementation, the density of the analytes ranges from about 1,000,000 analytes/mm² to about 10,000,000 analytes/mm².
  • the subpixels are quarter subpixels. In another implementation, the subpixels are half subpixels.
  • the preliminary center coordinates of the analytes determined by the base caller are defined in a template image of the tile, and a pixel resolution, an image coordinate system, and measurement scales of the image coordinate system are the same for the template image and the images.
  • each image set has four images.
  • each image set has two images.
  • each image set has one image.
  • the sequencing run utilizes four-channel chemistry.
  • the sequencing run utilizes two-channel chemistry.
  • the sequencing run utilizes one-channel chemistry.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes accessing a set of images of the tile captured during a sequencing run and preliminary center coordinates of the analytes determined by a base caller.
  • the method includes, for each image set, obtaining, from a base caller, a base call classifying, as one of four bases, origin subpixels that contain the preliminary center coordinates and a predetermined neighborhood of contiguous subpixels that are successively contiguous to respective ones of the origin subpixels, thereby producing a base call sequence for each of the origin subpixels and for each of the predetermined neighborhood of contiguous subpixels.
  • the method includes generating an analyte map that identifies the analytes as disjointed regions of contiguous subpixels that are successively contiguous to at least some of the respective ones of the origin subpixels and share a substantially matching base call sequence of the one of four bases with the at least some of the respective ones of the origin subpixels.
  • the method includes storing the analyte map in memory and determining the shapes and the sizes of the analytes based on the disjointed regions in the analyte map.
  • the predetermined neighborhood of contiguous subpixels is an m x n subpixel patch centered at pixels containing the origin subpixels, and the subpixel patch is 3 x 3 pixels. In one implementation, the predetermined neighborhood of contiguous subpixels is an n-connected subpixel neighborhood centered at pixels containing the origin subpixels. In one implementation, the method includes identifying as background those subpixels in the analyte map that do not belong to any of the disjointed regions.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes accessing a multitude of images of a flow cell captured over a plurality of cycles of a sequencing run, the flow cell having a plurality of tiles and, in the multitude of images, each of the tiles having a sequence of image sets generated over the plurality of cycles, and each image in the sequence of image sets depicting intensity emissions of analytes and their surrounding background on a particular one of the tiles at a particular one of the cycles.
  • the method includes constructing a training set having a plurality of training examples, each training example corresponding to a particular one of the tiles and including image data from at least some image sets in the sequence of image sets of the particular one of the tiles.
  • the method includes generating at least one ground truth data representation for each of the training examples, the ground truth data representation identifying at least one of spatial distribution of analytes and their surrounding background on the particular one of the tiles whose intensity emissions are depicted by the image data, including at least one of analyte shapes, analyte sizes, and/or analyte boundaries, and/or centers of the analytes.
  • the image data includes images in each of the at least some image sets in the sequence of image sets of the particular one of the tiles, and the images have a resolution of 1800 x 1800.
  • the image data includes at least one image patch from each of the images, and the image patch covers a portion of the particular one of the tiles and has a resolution of 20 x 20.
  • the image data includes an upsampled representation of the image patch, and the upsampled representation has a resolution of 80 x 80.
  • the ground truth data representation has an upsampled resolution of 80 x 80.
  • multiple training examples correspond to a same particular one of the tiles and respectively include as image data different image patches from each image in each of at least some image sets in a sequence of image sets of the same particular one of the tiles, and at least some of the different image patches overlap with each other.
  • the ground truth data representation identifies the analytes as disjoint regions of adjoining subpixels, the centers of the analytes as centers of mass subpixels within respective ones of the disjoint regions, and their surrounding background as subpixels that do not belong to any of the disjoint regions.
  • the ground truth data representation uses color coding to identify each subpixel as either being an analyte center or a non-center.
  • the ground truth data representation uses color coding to identify each subpixel as either being analyte interior, analyte center, or surrounding background.
  • the method includes storing, in memory, the training examples in the training set and associated ground truth data representations as the training data for the neural network-based template generation and base calling.
  • the method includes generating the training data for a variety of flow cells, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • a method includes accessing sequencing images of analytes produced by a sequencer, generating training data from the sequencing images, and using the training data for training a neural network to generate metadata about the analytes.
  • Each of the features discussed in the particular implementation section for other implementations applies equally to this implementation. As indicated above, all the other features are not repeated here and should be considered repeated by reference. The reader will understand how features identified in these implementations can readily be combined with sets of base features identified in other implementations.
  • Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • a method includes accessing sequencing images of analytes produced by a sequencer, generating training data from the sequencing images, and using the training data for training a neural network to base call the analytes.
  • Each of the features discussed in the particular implementation section for other implementations applies equally to this implementation. As indicated above, all the other features are not repeated here and should be considered repeated by reference. The reader will understand how features identified in these implementations can readily be combined with sets of base features identified in other implementations.
  • Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the input image data.
  • Each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through an output layer and generating an output that identifies analytes, whose intensity emissions are depicted by the input image data, as disjoint regions of adjoining subpixels, centers of the analytes as center subpixels at centers of mass of the respective ones of the disjoint regions, and their surrounding background as background subpixels not belonging to any of the disjoint regions.
  • the adjoining subpixels in the respective ones of the disjoint regions have intensity values weighted according to distance of an adjoining subpixel from a center subpixel in a disjoint region to which the adjoining subpixel belongs.
  • the center subpixels have highest intensity values within the respective ones of the disjoint regions.
  • the background subpixels all have a same lowest intensity value in the output.
  • the output layer normalizes the intensity values between zero and one.
  • the method includes applying a peak locator to the output to find peak intensities in the output, determining location coordinates of the centers of the analytes based on the peak intensities, downscaling the location coordinates by an upsampling factor used to prepare the input image data, and storing the downscaled location coordinates in memory for use in base calling the analytes.
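  • A minimal peak-locator sketch under the assumptions of a 3 x 3 local-maximum test and an illustrative noise threshold (the disclosure does not fix either choice):

        import numpy as np
        from scipy.ndimage import maximum_filter

        def locate_analyte_centers(output, upsampling_factor=4, threshold=0.5):
            # A subpixel is a peak if it equals the local maximum of its
            # neighborhood and exceeds the threshold.
            local_max = maximum_filter(output, size=3)
            peaks = np.argwhere((output == local_max) & (output > threshold))
            # Downscale by the upsampling factor used to prepare the input,
            # yielding (y, x) center coordinates for use in base calling.
            return peaks / upsampling_factor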
  • the method includes categorizing the adjoining subpixels in the respective ones of the disjoint regions as analyte interior subpixels belonging to a same analyte, and storing the categorization and downscaled location coordinates of the analyte interior subpixels in the memory on an analyte-by-analyte basis for use in base calling the analytes.
  • the method includes, on the analyte-by-analyte basis, determining distances of the analyte interior subpixels from respective ones of the centers of the analytes, and storing the distances in the memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the method includes extracting intensities from the analyte interior subpixels in the respective ones of the disjoint regions, including using at least one of nearest neighbor intensity extraction, Gaussian-based intensity extraction, intensity extraction based on average of 2 x 2 subpixel area, intensity extraction based on brightest of 2 x 2 subpixel area, intensity extraction based on average of 3 x 3 subpixel area, bilinear intensity extraction, bicubic intensity extraction, and/or intensity extraction based on weighted area coverage, and storing the intensities in the memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the method includes, based on the disjoint regions, determining, as part of the related analyte metadata, spatial distribution of the analytes, including at least one of analyte shapes, analyte sizes, and/or analyte boundaries, and storing the related analyte metadata in the memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the input image data includes images in the sequence of image sets, and the images have a resolution of 3000 x 3000.
  • the input image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the input image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation has a resolution of 80 x 80.
  • the output has an upsampled resolution of 80 x 80.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps.
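  • For orientation only, a deliberately tiny encoder-decoder sketch in PyTorch; the depths, channel counts, and activations are illustrative assumptions, not the disclosed architecture:

        import torch
        from torch import nn

        class TinySegmentationNet(nn.Module):
            def __init__(self, in_channels=1, out_channels=1):
                super().__init__()
                self.encoder = nn.Sequential(  # downsample to low-resolution feature maps
                    nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),
                )
                self.decoder = nn.Sequential(  # map back to full input resolution
                    nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
                    nn.ConvTranspose2d(16, out_channels, 2, stride=2),
                    nn.Sigmoid(),              # outputs normalized between zero and one
                )

            def forward(self, x):
                return self.decoder(self.encoder(x))

        # Example: one 80 x 80 upsampled input patch with a single image channel.
        net = TinySegmentationNet()
        output_map = net(torch.rand(1, 1, 80, 80))  # shape: (1, 1, 80, 80)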
  • the density of the analytes ranges from about 100,000 analytes/mm² to about 1,000,000 analytes/mm².
  • the density of the analytes ranges from about 1,000,000 analytes/mm² to about 10,000,000 analytes/mm².
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes obtaining training data for training the neural network.
  • the training data includes a plurality of training examples and corresponding ground truth data that should be generated by the neural network by processing the training examples.
  • Each training example includes image data from a sequence of image sets.
  • Each image in the sequence of image sets covers a tile of a flow cell and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • Each ground truth data identifies analytes, whose intensity emissions are depicted by the image data of a corresponding training example, as disjoint regions of adjoining subpixels, centers of the analytes as center subpixels at centers of mass of the respective ones of the disjoint regions, and their surrounding background as background subpixels not belonging to any of the disjoint regions.
  • the method includes using a gradient descent training technique to train the neural network and generating outputs for the training examples that progressively match the ground truth data, including iteratively optimizing a loss function that minimizes error between the outputs and the ground truth data, and updating parameters of the neural network based on the error.
  • the method includes, upon error convergence after a final iteration, storing the updated parameters of the neural network in memory to be applied to further neural network-based template generation and base calling.
  • the adjoining subpixels in the respective ones of the disjoint regions have intensity values weighted according to distance of an adjoining subpixel from a center subpixel in a disjoint region to which the adjoining subpixel belongs.
  • the center subpixels have highest intensity values within the respective ones of the disjoint regions.
  • the background subpixels all have a same lowest intensity value in the output.
  • the intensity values are normalized between zero and one.
  • the loss function is mean squared error and the error is minimized on a subpixel basis between the normalized intensity values of corresponding subpixels in the outputs and the ground truth data.
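  • A hedged sketch of this training loop (plain SGD and illustrative hyperparameters; the disclosure requires only gradient descent with an error-minimizing loss):

        import torch
        from torch import nn

        def train(net, training_examples, ground_truth_maps, epochs=10, lr=1e-3):
            optimizer = torch.optim.SGD(net.parameters(), lr=lr)
            loss_fn = nn.MSELoss()   # subpixel-wise mean squared error
            for _ in range(epochs):
                for x, y in zip(training_examples, ground_truth_maps):
                    optimizer.zero_grad()
                    loss = loss_fn(net(x), y)  # error between output and ground truth
                    loss.backward()            # backpropagate the error
                    optimizer.step()           # update the network parameters
            return net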
  • the ground truth data identify, as part of the related analyte metadata, spatial distribution of the analytes, including at least one of analyte shapes, analyte sizes, and/or analyte boundaries.
  • the image data includes images in the sequence of image sets, and the images have a resolution of 1800 x 1800.
  • the image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation of the image patch has a resolution of 80 x 80.
  • multiple training examples respectively include as image data different image patches from each image in a sequence of image sets of a same tile, and at least some of the different image patches overlap with each other.
  • the ground truth data has an upsampled resolution of 80 x 80.
  • the training data includes training examples for a plurality of tiles of the flow cell.
  • the training data includes training examples for a variety of flow cells, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps for subpixel-wise classification by a final classification layer.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
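
The gradient-descent training described in this group can be summarized with a short sketch. This is a minimal illustration, assuming a PyTorch-style model and data loader; the function name, optimizer choice, learning rate, epoch count, and checkpoint path are illustrative assumptions, not details from this disclosure.

    import torch

    def train_template_generator(model, loader, epochs=10, lr=1e-3):
        # Hypothetical sketch: regress decay maps whose subpixel intensities
        # are normalized between zero and one.
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.MSELoss()  # mean squared error on a subpixel basis
        for _ in range(epochs):
            for images, ground_truth in loader:
                optimizer.zero_grad()
                outputs = model(images)
                loss = loss_fn(outputs, ground_truth)  # error vs. ground truth
                loss.backward()
                optimizer.step()  # update parameters based on the error
        # upon convergence, store the updated parameters for later use
        torch.save(model.state_dict(), "template_generator.pt")
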
  • the method includes accessing image data that depicts intensity emissions of the analytes, processing the image data through one or more layers of a neural network and generating an alternative representation of the image data, and processing the alternative representation through an output layer and generating an output that identifies at least one of shapes and sizes of the analytes and/or centers of the analytes.
  • the image data further depicts intensity emissions of surrounding background of the analytes.
  • the method includes the output identifying spatial distribution of the analytes on the flow cell, including the surrounding background and boundaries between the analytes.
  • the method includes determining center location coordinates of the analytes on the flow cell based on the output.
  • the neural network is a convolutional neural network.
  • the neural network is a recurrent neural network.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, followed by the output layer, the encoder subnetwork includes a hierarchy of encoders, and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps (an encoder-decoder sketch follows this group of implementations).
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
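
The encoder-decoder topology recited above can be sketched as follows. This is a minimal, hypothetical PyTorch layout with two downsampling and two upsampling stages; the class name and layer widths are illustrative, not taken from the disclosure.

    import torch.nn as nn

    class TemplateSegNet(nn.Module):
        # Sketch: the encoder halves spatial resolution twice; the decoder maps
        # the low-resolution feature maps back to full input resolution, and a
        # final 1 x 1 convolution serves as the per-unit output layer.
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.decoder = nn.Sequential(
                nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            )
            self.output_layer = nn.Conv2d(16, out_channels, 1)

        def forward(self, x):
            return self.output_layer(self.decoder(self.encoder(x)))
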
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the image data.
  • each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through a classification layer and generating an output that identifies centers of analytes whose intensity emissions are depicted by the input image data.
  • the output has a plurality of subpixels, and each subpixel in the plurality of subpixels is classified as either an analyte center or a non-center.
  • the classification layer assigns each subpixel in the output a first likelihood score of being the analyte center, and a second likelihood score of being the non-center.
  • the first and second likelihood scores are determined based on a softmax function and exponentially normalized between zero and one.
  • the first and second likelihood scores are determined based on a sigmoid function and normalized between zero and one.
  • each subpixel in the output is classified as either the analyte center or the non-center based on which one of the first and second likelihood scores is higher than the other.
  • each subpixel in the output is classified as either the analyte center or the non-center based on whether the first and second likelihood scores are above a predetermined threshold likelihood score.
  • the output identifies the centers at centers of mass of respective ones of the analytes.
  • subpixels classified as analyte centers are assigned a same first predetermined value, and subpixels classified as non-centers are all assigned a same second predetermined value.
  • the first and second predetermined values are intensity values. In one implementation, the first and second predetermined values are continuous values.
  • the method includes determining location coordinates of subpixels classified as analyte centers, downscaling the location coordinates by an upsampling factor used to prepare the input image data, and storing the downscaled location coordinates in memory for use in base calling the analytes (the softmax classification and coordinate downscaling are sketched after this group of implementations).
  • the input image data includes images in the sequence of image sets, and the images have a resolution of 3000 x 3000.
  • the input image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the input image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation has a resolution of 80 x 80.
  • the output has an upsampled resolution of 80 x 80.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, followed by the classification layer, the encoder subnetwork includes a hierarchy of encoders, and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps for subpixel-wise classification by the classification layer.
  • the density of the analytes ranges from about 100,000 analytes/mm² to about 1,000,000 analytes/mm². In another implementation, the density of the analytes ranges from about 1,000,000 analytes/mm² to about 10,000,000 analytes/mm².
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
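
The softmax-based center classification and coordinate downscaling described above admit a minimal NumPy sketch. The two-channel logit layout (channel 0 = center, channel 1 = non-center) and the upsampling factor of 4 are assumptions for the example.

    import numpy as np

    def classify_centers(logits, upsampling_factor=4):
        # logits: (H, W, 2) array of per-subpixel scores
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        scores = e / e.sum(axis=-1, keepdims=True)   # exponentially normalized to [0, 1]
        is_center = scores[..., 0] > scores[..., 1]  # the higher likelihood score wins
        rows, cols = np.nonzero(is_center)
        # downscale by the upsampling factor used to prepare the input image data
        return np.stack([rows, cols], axis=1) / upsampling_factor
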
  • the method includes obtaining training data for training the neural network.
  • the training data includes a plurality of training examples and corresponding ground truth data that should be generated by the neural network by processing the training examples.
  • Each training example includes image data from a sequence of image sets.
  • Each image in the sequence of image sets covers a tile of a flow cell and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • Each ground truth data identifies centers of analytes, whose intensity emissions are depicted by the image data of a corresponding training example.
  • the ground truth data has a plurality of subpixels, and each subpixel in the plurality of subpixels is classified as either an analyte center or a non-center.
  • the method includes using a gradient descent training technique to train the neural network and generating outputs for the training examples that progressively match the ground truth data, including iteratively optimizing a loss function that minimizes error between the outputs and the ground truth data, and updating parameters of the neural network based on the error.
  • the method includes, upon error convergence after a final iteration, storing the updated parameters of the neural network in memory to be applied to further neural network-based template generation and base calling.
  • subpixels classified as analyte centers are all assigned a same first predetermined class score.
  • subpixels classified as non-centers are all assigned a same second predetermined class score.
  • each subpixel in each output has a first prediction score of being the analyte center, and a second prediction score of being the non-center.
  • the loss function is custom weighted binary cross entropy loss and the error is minimized on a subpixel basis between the prediction scores and the class scores of corresponding subpixels in the outputs and the ground truth data (sketched after this group of implementations).
  • the ground truth data identifies the centers at centers of mass of respective ones of the analytes.
  • subpixels classified as analyte centers are all assigned a same first predetermined value.
  • subpixels classified as non-centers are all assigned a same second predetermined value.
  • the first and second predetermined values are intensity values.
  • the first and second predetermined values are continuous values.
  • the ground truth data identify, as part of the related analyte metadata, spatial distribution of the analytes, including at least one of analyte shapes, analyte sizes, and/or analyte boundaries.
  • the image data includes images in the sequence of image sets, and the images have a resolution of 1800 x 1800.
  • the image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation of the image patch has a resolution of 80 x 80.
  • multiple training examples respectively include as image data different image patches from each image in a sequence of image sets of a same tile, and at least some of the different image patches overlap with each other.
  • the ground truth data has an upsampled resolution of 80 x 80.
  • the training data includes training examples for a plurality of tiles of the flow cell.
  • the training data includes training examples for a variety of flow cells, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, followed by a classification layer.
  • the encoder subnetwork includes a hierarchy of encoders.
  • the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps for subpixel-wise classification by the classification layer.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
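
The custom weighted binary cross entropy named above can be illustrated with a short sketch. The class weights here are illustrative assumptions (center subpixels are far rarer than non-centers, so their errors are up-weighted); the actual weighting scheme is not taken from the disclosure.

    import numpy as np

    def weighted_binary_cross_entropy(pred, target, w_center=10.0, w_other=1.0, eps=1e-7):
        # pred: per-subpixel prediction scores in (0, 1); target: class scores
        # with 1.0 for analyte-center subpixels and 0.0 for non-centers.
        pred = np.clip(pred, eps, 1.0 - eps)
        weights = np.where(target == 1.0, w_center, w_other)
        loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
        return (weights * loss).mean()  # error minimized on a subpixel basis
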
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the image data.
  • Each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through a classification layer and generating an output that identifies spatial distribution of analytes and their surrounding background whose intensity emissions are depicted by the input image data, including at least one of analyte centers, analyte shapes, analyte sizes, and/or analyte boundaries.
  • the output has a plurality of subpixels, and each subpixel in the plurality of subpixels is classified as either background, analyte center, or analyte interior.
  • the classification layer assigns each subpixel in the output a first likelihood score of being the background, a second likelihood score of being the analyte center, and a third likelihood score of being the analyte interior.
  • the first, second, and third likelihood scores are determined based on a softmax function and exponentially normalized between zero and one.
  • each subpixel in the output is classified as either the background, the analyte center, or the analyte interior based on which one among the first, second, and third likelihood scores is highest.
  • each subpixel in the output is classified as either the background, the analyte center, or the analyte interior based on whether the first, second, and third likelihood scores are above a predetermined threshold likelihood score.
  • the output identifies the analyte centers at centers of mass of respective ones of the analytes.
  • subpixels classified as background are all assigned a same first predetermined value.
  • subpixels classified as analyte centers are all assigned a same second predetermined value.
  • subpixels classified as analyte interior are all assigned a same third predetermined value.
  • the first, second, and third predetermined values are intensity values.
  • the first, second, and third predetermined values are continuous values.
  • the method includes determining location coordinates of subpixels classified as analyte centers on an analyte-by-analyte basis, downscaling the location coordinates by an upsampling factor used to prepare the input image data, and storing the downscaled location coordinates in memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the method includes determining location coordinates of subpixels classified as analyte interior on the analyte-by-analyte basis, downscaling the location coordinates by an upsampling factor used to prepare the input image data, and storing the downscaled location coordinates in memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the method includes, on the analyte-by-analyte basis, determining distances of the subpixels classified as analyte interior from respective ones of the subpixels classified as analyte centers, and storing the distances in the memory on the analyte-by-analyte basis for use in base calling the analytes.
  • the method includes, on the analyte-by-analyte basis, extracting intensities from the subpixels classified as analyte interior, including using at least one of nearest neighbor intensity extraction, Gaussian based intensity extraction, intensity extraction based on average of 2 x 2 subpixel area, intensity extraction based on brightest of 2 x 2 subpixel area, intensity extraction based on average of 3 x 3 subpixel area, bilinear intensity extraction, bicubic intensity extraction, and/or intensity extraction based on weighted area coverage, and storing the intensities in the memory on the analyte-by-analyte basis for use in base calling the analytes (two of these extraction variants are sketched after this group of implementations).
  • the input image data includes images in the sequence of image sets, and the images have a resolution of 3000 x 3000.
  • the input image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the input image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation has a resolution of 80 x 80.
  • the output has an upsampled resolution of 80 x 80.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, followed by the classification layer, the encoder subnetwork includes a hierarchy of encoders, and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps for subpixel-wise classification by the classification layer.
  • the density of the analytes ranges from about 100,000 analytes/mm² to about 1,000,000 analytes/mm². In another implementation, the density of the analytes ranges from about 1,000,000 analytes/mm² to about 10,000,000 analytes/mm².
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
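
Two of the intensity-extraction variants listed above can be sketched as follows. A minimal NumPy illustration assuming subpixel coordinates already mapped into the image frame; the function name, the combination by summing, and the normalization by subpixel count are assumptions for the example.

    import numpy as np

    def extract_analyte_intensity(image, subpixel_coords, mode="mean_2x2"):
        # subpixel_coords: (row, col) pairs for one analyte's interior subpixels
        values = []
        for r, c in subpixel_coords:
            patch = image[r:r + 2, c:c + 2]  # 2 x 2 subpixel area
            if mode == "mean_2x2":
                values.append(patch.mean())  # average of the 2 x 2 area
            elif mode == "brightest_2x2":
                values.append(patch.max())   # brightest of the 2 x 2 area
        # combine, then normalize by the number of identified subpixels
        return float(np.sum(values)) / len(subpixel_coords)
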
  • the method includes obtaining training data for training the neural network.
  • the training data includes a plurality of training examples and corresponding ground truth data that should be generated by the neural network by processing the training examples.
  • Each training example includes image data from a sequence of image sets.
  • Each image in the sequence of image sets covers a tile of a flow cell and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • Each ground truth data identifies spatial distribution of analytes and their surrounding background whose intensity emissions are depicted by the input image data, including analyte centers, analyte shapes, analyte sizes, and analyte boundaries.
  • the ground truth data has a plurality of subpixels, and each subpixel in the plurality of subpixels is classified as either background, analyte center, or analyte interior.
  • the method includes using a gradient descent training technique to train the neural network and generating outputs for the training examples that progressively match the ground truth data, including iteratively optimizing a loss function that minimizes error between the outputs and the ground truth data, and updating parameters of the neural network based on the error.
  • the method includes, upon error convergence after a final iteration, storing the updated parameters of the neural network in memory to be applied to further neural network-based template generation and base calling.
  • subpixels classified as background are all assigned a same first predetermined class score.
  • subpixels classified as analyte centers are all assigned a same second predetermined class score.
  • subpixels classified as analyte interior are all assigned a same third predetermined class score.
  • each subpixel in each output has a first prediction score of being the background, a second prediction score of being the analyte center, and a third prediction score of being the analyte interior.
  • the loss function is custom weighted ternary cross entropy loss and the error is minimized on a subpixel basis between the prediction scores and the class scores of corresponding subpixels in the outputs and the ground truth data (sketched after this group of implementations).
  • the ground truth data identifies the analyte centers at centers of mass of respective ones of the analytes.
  • subpixels classified as background are all assigned a same first predetermined value.
  • subpixels classified as analyte centers are all assigned a same second predetermined value.
  • subpixels classified as analyte interior are all assigned a same third predetermined value.
  • the first, second, and third predetermined values are intensity values.
  • the first, second, and third predetermined values are continuous values.
  • the image data includes images in the sequence of image sets, and the images have a resolution of 1800 x 1800.
  • the image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the image data includes an upsampled representation of the image patch from each of the images in the sequence of image sets, and the upsampled representation of the image patch has a resolution of 80 x 80.
  • multiple training examples respectively include as image data different image patches from each image in a sequence of image sets of a same tile, and at least some of the different image patches overlap with each other.
  • the ground truth data has an upsampled resolution of 80 x 80.
  • the training data includes training examples for a plurality of tiles of the flow cell. In one implementation, the training data includes training examples for a variety of flow cells, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, and analyte densities.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, followed by a classification layer, the encoder subnetwork includes a hierarchy of encoders, and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps for subpixel-wise classification by the classification layer.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
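
The custom weighted ternary cross entropy named above can be sketched in the same spirit as the binary variant. The per-class weights are illustrative assumptions; the disclosure does not specify them here.

    import numpy as np

    def weighted_ternary_cross_entropy(pred, target_onehot,
                                       class_weights=(1.0, 10.0, 2.0), eps=1e-7):
        # pred: (..., 3) softmax scores over (background, center, interior);
        # target_onehot: one-hot class scores of the same shape.
        pred = np.clip(pred, eps, 1.0)
        w = np.asarray(class_weights)
        loss = -(w * target_onehot * np.log(pred)).sum(axis=-1)
        return loss.mean()  # error minimized on a subpixel basis
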
  • the method includes processing input image data derived from a sequence of image sets through a neural network and generating an alternative representation of the input image data.
  • the input image data has an array of units that depicts analytes and their surrounding background.
  • the method includes processing the alternative representation through an output layer and generating an output value for each unit in the array.
  • the method includes thresholding output values of the units and classifying a first subset of the units as background units depicting the surrounding background.
  • the method includes locating peaks in the output values of the units and classifying a second subset of the units as center units containing centers of the analytes (thresholding and peak location are sketched after this group of implementations).
  • the method includes applying a segmenter to the output values of the units and determining shapes of the analytes as non-overlapping regions of contiguous units separated by the background units and centered at the center units.
  • the segmenter begins with the center units and determines, for each center unit, a group of successively contiguous units that depict a same analyte whose center is contained in the center unit.
  • the units are pixels. In another implementation, the units are subpixels. In yet another implementation, the units are superpixels. In one implementation, the output values are continuous values. In another implementation, the output values are softmax scores.
  • the contiguous units in the respective ones of the non-overlapping regions have output values weighted according to distance of a contiguous unit from a center unit in a non-overlapping region to which the contiguous unit belongs.
  • the center units have highest output values within the respective ones of the non-overlapping regions.
  • the non-overlapping regions have irregular contours and the units are subpixels.
  • the method includes determining analyte intensity of a given analyte by identifying subpixels that contribute to the analyte intensity of the given analyte based on a corresponding non-overlapping region of contiguous subpixels that identifies a shape of the given analyte, locating the identified subpixels in one or more optical, pixel-resolution images generated for one or more image channels at a current sequencing cycle, in each of the images, interpolating intensities of the identified subpixels, combining the interpolated intensities, and normalizing the combined interpolated intensities to produce a per-image analyte intensity for the given analyte in each of the images, and combining the per-image analyte intensity for each of the images to determine the analyte intensity of the given analyte at the current sequencing cycle.
  • the normalizing is based on a normalization factor, and the normalization factor is a number of the identified subpixels.
  • the method includes base calling the given analyte based on the analyte intensity at the current sequencing cycle.
  • the non-overlapping regions have irregular contours and the units are subpixels.
  • the method includes determining analyte intensity of a given analyte by identifying subpixels that contribute to the analyte intensity of the given analyte based on a corresponding non-overlapping region of contiguous subpixels that identifies a shape of the given analyte, locating the identified subpixels in one or more subpixel resolution images upsampled from corresponding optical, pixel-resolution images generated for one or more image channels at a current sequencing cycle, in each of the upsampled images, combining intensities of the identified subpixels and normalizing the combined intensities to produce a per-image analyte intensity for the given analyte in each of the upsampled images, and combining the per-image analyte intensity for each of the upsampled images to determine the analyte intensity of the given analyte at the current sequencing cycle.
  • the normalizing is based on a normalization factor, and the normalization factor is a number of the identified subpixels.
  • the method includes base calling the given analyte based on the analyte intensity at the current sequencing cycle.
  • each image in the sequence of image sets covers a tile, and depicts intensity emissions of analytes on a tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on a flow cell.
  • the input image data includes at least one image patch from each of the images in the sequence of image sets, and the image patch covers a portion of the tile and has a resolution of 20 x 20.
  • the input image data includes an upsampled, subpixel resolution representation of the image patch from each of the images in the sequence of image sets, and the upsampled, subpixel representation has a resolution of 80 x 80.
  • the neural network is a convolutional neural network.
  • the neural network is a recurrent neural network.
  • the neural network is a residual neural network with residual blocks and residual connections.
  • the neural network is a deep fully convolutional segmentation neural network with an encoder subnetwork and a corresponding decoder subnetwork, the encoder subnetwork includes a hierarchy of encoders, and the decoder subnetwork includes a hierarchy of decoders that map low resolution encoder feature maps to full input resolution feature maps.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
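
The thresholding and peak-location steps recited in this group can be sketched as follows. A minimal illustration using NumPy and SciPy; the threshold value and the 3 x 3 peak neighborhood are assumptions for the example, not values from the disclosure.

    import numpy as np
    from scipy.ndimage import maximum_filter

    def locate_background_and_peaks(output_values, background_threshold=0.2):
        # first subset: units below the threshold depict the surrounding background
        background = output_values < background_threshold
        # second subset: local maxima of the output values mark center units
        local_max = output_values == maximum_filter(output_values, size=3)
        centers = local_max & ~background
        return background, centers
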
  • the method includes processing input image data derived from a sequence of image sets through a neural network and generating an alternative representation of the input image data.
  • the input image data has an array of units that depicts analytes and their surrounding background.
  • the method includes processing the alternative representation through an output layer and generating an output value for each unit in the array.
  • the method includes thresholding output values of the units and classifying a first subset of the units as background units depicting the surrounding background.
  • the method includes locating peaks in the output values of the units and classifying a second subset of the units as center units containing centers of the analytes.
  • the method includes applying a segmenter to the output values of the units and determining shapes of the analytes as non-overlapping regions of contiguous units separated by the background units and centered at the center units.
  • the segmenter begins with the center units and determines, for each center unit, a group of successively contiguous units that depict a same analyte whose center is contained in the center unit (a region-growing sketch follows this group of implementations).
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
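
The segmenter described above admits a simple region-growing sketch: starting from each center unit, claim successively contiguous, non-background units until the region is bounded by background or by units already claimed for another analyte. The 4-connectivity and the stack-based fill are implementation assumptions.

    import numpy as np

    def grow_analyte_regions(output_values, centers, background):
        # labels: 0 for background/unclaimed, k > 0 for the k-th analyte's region
        labels = np.zeros(output_values.shape, dtype=int)
        for analyte_id, (r0, c0) in enumerate(zip(*np.nonzero(centers)), start=1):
            stack = [(r0, c0)]
            while stack:
                r, c = stack.pop()
                if labels[r, c] or background[r, c]:
                    continue  # already claimed by a region, or background
                labels[r, c] = analyte_id  # unit depicts the same analyte
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < labels.shape[0] and 0 <= nc < labels.shape[1]:
                        stack.append((nr, nc))
        return labels  # non-overlapping regions separated by background units
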
  • a method includes processing image data through a neural network and generating an alternative representation of the image data.
  • the image data depicts intensity emissions of analytes.
  • the method includes processing the alternative representation through an output layer and generating an output that identifies metadata about the analytes, including at least one of spatial distribution of the analytes, shapes of the analytes, centers of the analytes, and/or boundaries between the analytes.
  • implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • implementations of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the input image data.
  • Each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through an output layer and generating an output that identifies analytes, whose intensity emissions are depicted by the input image data, as disjoint regions of adjoining units, centers of the analytes as center units at centers of mass of the respective ones of the disjoint regions, and their surrounding background as background units not belonging to any of the disjoint regions (a center-of-mass sketch follows this group of implementations).
  • the units are pixels. In another implementation, the units are subpixels. In yet another implementation, the units are superpixels.
  • Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
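
Placing analyte centers at the centers of mass of disjoint regions, as recited above, can be sketched with standard connected-component tooling. A minimal illustration using SciPy; the binary analyte mask as input is an assumption for the example.

    import numpy as np
    from scipy.ndimage import label, center_of_mass

    def analyte_centers_of_mass(analyte_mask):
        # disjoint regions of adjoining units found by connected-component labeling
        labels, num_analytes = label(analyte_mask)
        # each center is the average of its region's unit coordinates
        return center_of_mass(analyte_mask, labels, range(1, num_analytes + 1))
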
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the image data. Each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through a classification layer and generating an output that identifies centers of analytes whose intensity emissions are depicted by the input image data.
  • the output has a plurality of units, and each unit in the plurality of units is classified as either an analyte center or a non-center.
  • the units are pixels. In another implementation, the units are subpixels. In yet another implementation, the units are superpixels.
  • Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • the method includes processing input image data from a sequence of image sets through a neural network and generating an alternative representation of the image data.
  • Each image in the sequence of image sets covers the tile, and depicts intensity emissions of analytes on the tile and their surrounding background captured for a particular image channel at a particular one of a plurality of sequencing cycles of a sequencing run performed on the flow cell.
  • the method includes processing the alternative representation through a classification layer and generating an output that identifies spatial distribution of analytes and their surrounding background whose intensity emissions are depicted by the input image data, including at least one of analyte centers, analyte shapes, analyte sizes, and/or analyte boundaries.
  • the output has a plurality of units, and each unit in the plurality of units is classified as either background, analyte center, or analyte interior.
  • the units are pixels. In another implementation, the units are subpixels. In yet another implementation, the units are superpixels.
  • Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above.
  • Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.
  • a computer-implemented method of determining image regions indicative of analytes on a tile of a flow cell comprising: accessing a series of image sets generated during a sequencing run, each image set in the series generated during a respective sequencing cycle of the sequencing run, each image in the series depicting the analytes and their surrounding background, and each image in the series having a plurality of subpixels;
  • the classifier being a neural network-based template generator for processing input image data to generate a decay map, a ternary map, or a binary map, representing one or more properties of each of a plurality of analytes represented in the input image data for base calling by a neural network-based base caller, preferably in order to increase the level of throughput in high-throughput nucleic acid sequencing technologies.
  • determining hyperlocated center coordinates of the analytes by calculating centers of mass of the disjointed regions of the analyte map as an average of coordinates of respective contiguous subpixels forming the disjointed regions;
  • analyte map categorizing, on the analyte-by-analyte basis, the contiguous subpixels in the disjointed regions as analyte interior subpixels belonging to a same analyte, the centers of mass subpixels as analyte center subpixels, subpixels containing the analyte boundary portions as boundary subpixels, and the subpixels identified as the background as background subpixels (a categorization sketch follows);
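
The analyte-map categorization above can be sketched as follows. A minimal NumPy illustration over a labeled analyte map; the string categories, the 4-neighbor boundary test, and the rounding of centers of mass are assumptions for the example (edge wraparound from np.roll is ignored in this sketch).

    import numpy as np

    def categorize_analyte_map(labels, centers_of_mass):
        # labels: 0 for background, k > 0 for the k-th analyte's subpixels
        categories = np.full(labels.shape, "background", dtype=object)
        interior = labels > 0
        categories[interior] = "interior"  # subpixels belonging to a same analyte
        # a subpixel is a boundary subpixel if any 4-neighbor carries a different label
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            shifted = np.roll(labels, (dr, dc), axis=(0, 1))
            categories[interior & (labels != shifted)] = "boundary"
        # mark the center-of-mass subpixel of each disjointed region
        for r, c in np.round(np.asarray(centers_of_mass)).astype(int):
            categories[r, c] = "center"
        return categories
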

Abstract

According to the invention, a technology uses neural networks to determine analyte metadata by: (i) processing input image data derived from a sequence of image sets through a neural network and generating an alternative representation of the input image data, the input image data including an array of units that depict analytes and their surrounding background; (ii) processing the alternative representation through an output layer and generating an output value for each unit in the array; (iii) thresholding the output values of the units and classifying a first subset of the units as background units depicting the surrounding background; and (iv) locating peaks in the output values of the units and classifying a second subset of the units as center units containing the centers of the analytes.
PCT/US2020/024087 2019-03-21 2020-03-21 Génération à base d'intelligence artificielle de métadonnées de séquençage WO2020205296A1 (fr)

Priority Applications (9)

Application Number Priority Date Filing Date Title
BR112020026426-1A BR112020026426A2 (pt) 2019-03-21 2020-03-21 geração de metadados de sequenciamento baseada em inteligência artificial
EP20719052.1A EP3942071A1 (fr) 2019-03-21 2020-03-21 Génération à base d'intelligence artificielle de métadonnées de séquençage
SG11202012453PA SG11202012453PA (en) 2019-03-21 2020-03-21 Artificial intelligence-based generation of sequencing metadata
AU2020256047A AU2020256047A1 (en) 2019-03-21 2020-03-21 Artificial intelligence-based generation of sequencing metadata
KR1020207037713A KR20210142529A (ko) 2019-03-21 2020-03-21 서열분석 메타데이터의 인공 지능 기반 생성
CN202080003614.9A CN112334984A (zh) 2019-03-21 2020-03-21 基于人工智能的测序元数据生成
JP2020572715A JP2022525267A (ja) 2019-03-21 2020-03-21 人工知能ベースのシーケンスメタデータ生成
MX2020014293A MX2020014293A (es) 2019-03-21 2020-03-21 Generación de metadatos de secuenciación basada en inteligencia artificial.
IL279525A IL279525A (en) 2019-03-21 2020-12-17 Generation of metadata sequences by artificial intelligence

Applications Claiming Priority (30)

Application Number Priority Date Filing Date Title
US201962821681P 2019-03-21 2019-03-21
US201962821602P 2019-03-21 2019-03-21
US201962821724P 2019-03-21 2019-03-21
US201962821766P 2019-03-21 2019-03-21
US201962821618P 2019-03-21 2019-03-21
US62/821,766 2019-03-21
US62/821,724 2019-03-21
US62/821,602 2019-03-21
US62/821,681 2019-03-21
US62/821,618 2019-03-21
NL2023316A NL2023316B1 (en) 2019-03-21 2019-06-14 Artificial intelligence-based sequencing
NL2023311 2019-06-14
NL2023311A NL2023311B9 (en) 2019-03-21 2019-06-14 Artificial intelligence-based generation of sequencing metadata
NL2023312 2019-06-14
NL2023314A NL2023314B1 (en) 2019-03-21 2019-06-14 Artificial intelligence-based quality scoring
NL2023310 2019-06-14
NL2023316 2019-06-14
NL2023310A NL2023310B1 (en) 2019-03-21 2019-06-14 Training data generation for artificial intelligence-based sequencing
NL2023312A NL2023312B1 (en) 2019-03-21 2019-06-14 Artificial intelligence-based base calling
NL2023314 2019-06-14
US16/826,126 2020-03-20
US16/826,134 2020-03-20
US16/825,987 US11347965B2 (en) 2019-03-21 2020-03-20 Training data generation for artificial intelligence-based sequencing
US16/825,991 2020-03-20
US16/826,134 US11676685B2 (en) 2019-03-21 2020-03-20 Artificial intelligence-based quality scoring
US16/826,126 US11783917B2 (en) 2019-03-21 2020-03-20 Artificial intelligence-based base calling
US16/825,991 US11210554B2 (en) 2019-03-21 2020-03-20 Artificial intelligence-based generation of sequencing metadata
US16/825,987 2020-03-20
US16/826,168 US11436429B2 (en) 2019-03-21 2020-03-21 Artificial intelligence-based sequencing
US16/826,168 2020-03-21

Publications (1)

Publication Number Publication Date
WO2020205296A1 true WO2020205296A1 (fr) 2020-10-08

Family

ID=72519388

Family Applications (5)

Application Number Title Priority Date Filing Date
PCT/US2020/024088 WO2020191387A1 (fr) 2019-03-21 2020-03-21 Appel de base à base d'intelligence artificielle
PCT/US2020/024087 WO2020205296A1 (fr) 2019-03-21 2020-03-21 Génération à base d'intelligence artificielle de métadonnées de séquençage
PCT/US2020/024091 WO2020191390A2 (fr) 2019-03-21 2020-03-21 Notation de qualité faisant appel à l'intelligence artificielle
PCT/US2020/024090 WO2020191389A1 (fr) 2019-03-21 2020-03-21 Génération de données d'apprentissage pour séquençage à base d'intelligence artificielle
PCT/US2020/024092 WO2020191391A2 (fr) 2019-03-21 2020-03-22 Séquençage à base d'intelligence artificielle

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/US2020/024088 WO2020191387A1 (fr) 2019-03-21 2020-03-21 Appel de base à base d'intelligence artificielle

Family Applications After (3)

Application Number Title Priority Date Filing Date
PCT/US2020/024091 WO2020191390A2 (fr) 2019-03-21 2020-03-21 Notation de qualité faisant appel à l'intelligence artificielle
PCT/US2020/024090 WO2020191389A1 (fr) 2019-03-21 2020-03-21 Génération de données d'apprentissage pour séquençage à base d'intelligence artificielle
PCT/US2020/024092 WO2020191391A2 (fr) 2019-03-21 2020-03-22 Séquençage à base d'intelligence artificielle

Country Status (1)

Country Link
WO (5) WO2020191387A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277116A (zh) * 2022-07-06 2022-11-01 中能电力科技开发有限公司 网络隔离的方法、装置、存储介质及电子设备
CN116796196A (zh) * 2023-08-18 2023-09-22 武汉纺织大学 基于多模态联合嵌入的共语姿势生成方法

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110398370B (zh) * 2019-08-20 2021-02-05 贵州大学 一种基于hts-cnn模型的轴承故障诊断方法
CN111883203B (zh) * 2020-07-03 2023-12-29 上海厦维医学检验实验室有限公司 用于预测pd-1疗效的模型的构建方法
US11200446B1 (en) 2020-08-31 2021-12-14 Element Biosciences, Inc. Single-pass primary analysis
CN112598620B (zh) * 2020-11-25 2022-11-15 哈尔滨工程大学 尿沉渣中透明管型、病理管型以及粘液丝的识别方法
CN112629851B (zh) * 2020-12-11 2022-10-25 南方海上风电联合开发有限公司 基于数据增强方法与图像识别的海上风电机组齿轮箱故障诊断方法
CN112541576B (zh) * 2020-12-14 2024-02-20 四川翼飞视科技有限公司 Rgb单目图像的生物活体识别神经网络构建方法
CN112652356B (zh) * 2021-01-19 2024-01-26 深圳市儒瀚科技有限公司 一种dna甲基化表观修饰的识别方法、识别设备及存储介质
CN112418360B (zh) * 2021-01-21 2021-04-13 深圳市安软科技股份有限公司 卷积神经网络的训练方法、行人属性识别方法及相关设备
CN113034355B (zh) * 2021-04-20 2022-06-21 浙江大学 一种基于深度学习的肖像图像双下巴去除方法
CN113506243A (zh) * 2021-06-04 2021-10-15 联合汽车电子有限公司 Pcb焊接缺陷检测方法、装置及存储介质
CN113658643B (zh) * 2021-07-22 2024-02-13 西安理工大学 一种基于注意力机制对lncRNA和mRNA的预测方法
WO2023010069A1 (fr) * 2021-07-29 2023-02-02 Ultima Genomics, Inc. Systèmes et procédés d'appel de base adaptatifs
WO2023049212A2 (fr) * 2021-09-22 2023-03-30 Illumina, Inc. Appel de base basé sur l'état
CN114399628B (zh) * 2021-12-21 2024-03-08 四川大学 复杂空间环境下的绝缘子高效检测系统
EP4222749A4 (fr) 2021-12-24 2023-08-30 GeneSense Technology Inc. Procédés et systèmes basés sur l'apprentissage profond pour le séquençage d'acide nucléique
CN114092920B (zh) * 2022-01-18 2022-04-15 腾讯科技(深圳)有限公司 一种模型训练的方法、图像分类的方法、装置及存储介质
WO2023183937A1 (fr) * 2022-03-25 2023-09-28 Illumina, Inc. Appel de bases séquence par séquence
CN115272136B (zh) * 2022-09-27 2023-05-05 广州卓腾科技有限公司 基于大数据的证件照眼镜反光消除方法、装置、介质及设备

Citations (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991006678A1 (fr) 1989-10-26 1991-05-16 Sri International Sequençage d'adn
WO1991006778A1 (fr) 1989-11-02 1991-05-16 Sundstrand Corporation Pompe regenerative et procede de refoulement du fluide sous pression
US5528050A (en) 1995-07-24 1996-06-18 Molecular Dynamics, Inc. Compact scan head with multiple scanning modalities
US5719391A (en) 1994-12-08 1998-02-17 Molecular Dynamics, Inc. Fluorescence imaging system employing a macro scanning objective
WO1998044151A1 (fr) 1997-04-01 1998-10-08 Glaxo Group Limited Methode d'amplification d'acide nucleique
WO2000018957A1 (fr) 1998-09-30 2000-04-06 Applied Research Systems Ars Holding N.V. Procedes d'amplification et de sequençage d'acide nucleique
WO2000063437A2 (fr) 1999-04-20 2000-10-26 Illumina, Inc. Detection de reactions d'acide nucleique sur microsupports de billes en reseau
US6266459B1 (en) 1997-03-14 2001-07-24 Trustees Of Tufts College Fiber optic sensor with encoded microspheres
US6355431B1 (en) 1999-04-20 2002-03-12 Illumina, Inc. Detection of nucleic acid amplification reactions using bead arrays
WO2004018497A2 (fr) 2002-08-23 2004-03-04 Solexa Limited Nucleotides modifies
US6770441B2 (en) 2000-02-10 2004-08-03 Illumina, Inc. Array compositions and methods of making same
WO2005010145A2 (fr) 2003-07-05 2005-02-03 The Johns Hopkins University Procede et compositions de detection et d'enumeration de variations genetiques
US6859570B2 (en) 1997-03-14 2005-02-22 Trustees Of Tufts College, Tufts University Target analyte sensors utilizing microspheres
US20050064460A1 (en) 2001-11-16 2005-03-24 Medical Research Council Emulsion compositions
US20050130173A1 (en) 2003-01-29 2005-06-16 Leamon John H. Methods of amplifying and sequencing nucleic acids
WO2005065814A1 (fr) 2004-01-07 2005-07-21 Solexa Limited Arrangements moleculaires modifies
US20050244870A1 (en) 1999-04-20 2005-11-03 Illumina, Inc. Nucleic acid sequencing using microsphere arrays
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
WO2006064199A1 (fr) 2004-12-13 2006-06-22 Solexa Limited Procede ameliore de detection de nucleotides
US20060178901A1 (en) 2005-01-05 2006-08-10 Cooper Kelana L Home movies television (HMTV)
US20060240439A1 (en) 2003-09-11 2006-10-26 Smith Geoffrey P Modified polymerases for improved incorporation of nucleotide analogues
US20060278147A1 (en) 2005-06-10 2006-12-14 Janome Sewing Machine Co., Ltd. Embroidery sewing machine
WO2007010251A2 (fr) 2005-07-20 2007-01-25 Solexa Limited Preparation de matrices pour sequencage d'acides nucleiques
WO2007010252A1 (fr) 2005-07-20 2007-01-25 Solexa Limited Procede de sequencage d'une matrice de polynucleotide
US7211414B2 (en) 2000-12-01 2007-05-01 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
WO2007123744A2 (fr) 2006-03-31 2007-11-01 Solexa, Inc. Systèmes et procédés pour analyse de séquençage par synthèse
US7315019B2 (en) 2004-09-17 2008-01-01 Pacific Biosciences Of California, Inc. Arrays of optical confinements and uses thereof
US7329492B2 (en) 2000-07-07 2008-02-12 Visigen Biotechnologies, Inc. Methods for real-time single molecule sequence determination
US20080108082A1 (en) 2006-10-23 2008-05-08 Pacific Biosciences Of California, Inc. Polymerase enzymes and reagents for enhanced nucleic acid sequencing
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
US7414716B2 (en) 2006-10-23 2008-08-19 Emhart Glass S.A. Machine for inspecting glass containers
US20090088327A1 (en) 2006-10-06 2009-04-02 Roberto Rigatti Method for sequencing a polynucleotide template
US7592435B2 (en) 2005-08-19 2009-09-22 Illumina Cambridge Limited Modified nucleosides and nucleotides and uses thereof
US7622294B2 (en) 1997-03-14 2009-11-24 Trustees Of Tufts College Methods for detecting target analytes and enzymatic reactions
US20120020537A1 (en) 2010-01-13 2012-01-26 Francisco Garcia Data processing system and methods
US8158926B2 (en) 2005-11-23 2012-04-17 Illumina, Inc. Confocal imaging methods and apparatus
US20120270305A1 (en) 2011-01-10 2012-10-25 Illumina Inc. Systems, methods, and apparatuses to image a sample for biological or chemical analysis
US20120316086A1 (en) 2011-06-09 2012-12-13 Illumina, Inc. Patterned flow-cells useful for nucleic acid analysis
US20130023422A1 (en) 2008-05-05 2013-01-24 Illumina, Inc. Compensator for multiple surface imaging
US20130116153A1 (en) 2011-10-28 2013-05-09 Illumina, Inc. Microarray fabrication system and method
US20130184796A1 (en) 2012-01-16 2013-07-18 Greatbatch Ltd. Elevated Hermetic Feedthrough Insulator Adapted for Side Attachment of Electrical Conductors on the Body Fluid Side of an Active Implantable Medical Device
US20130260372A1 (en) 2012-04-03 2013-10-03 Illumina, Inc. Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing
US20140243224A1 (en) 2013-02-26 2014-08-28 Illumina, Inc. Gel patterned surfaces
WO2015002813A1 (fr) 2013-07-01 2015-01-08 Illumina, Inc. Greffage de polymère et fonctionnalisation de surface sans catalyseur
US9079148B2 (en) 2008-07-02 2015-07-14 Illumina Cambridge Limited Using populations of beads for the fabrication of arrays on surfaces
WO2015106941A1 (fr) 2014-01-16 2015-07-23 Illumina Cambridge Limited Modification de polynucléotides sur support solide
WO2016066586A1 (fr) 2014-10-31 2016-05-06 Illumina Cambridge Limited Nouveaux polymères et revêtements de copolymères d'adn
US20180189613A1 (en) * 2016-04-21 2018-07-05 Ramot At Tel Aviv University Ltd. Cascaded convolutional neural network
WO2018129314A1 (fr) * 2017-01-06 2018-07-12 Illumina, Inc. Correction de phase
WO2019079182A1 (fr) * 2017-10-16 2019-04-25 Illumina, Inc. Apprentissage semi-supervisé pour l'apprentissage d'un ensemble de réseaux neuronaux à convolution profonde
WO2019079202A1 (fr) * 2017-10-16 2019-04-25 Illumina, Inc. Détection de raccordement aberrant à l'aide de réseaux neuronaux à convolution (cnn)
WO2019136284A1 (fr) * 2018-01-05 2019-07-11 Illumina, Inc. Prédiction de la qualité de résultats de séquençage utilisant des réseaux neuronaux profonds
WO2019136388A1 (fr) * 2018-01-08 2019-07-11 Illumina, Inc. Systèmes et dispositifs de séquençage à haut débit avec détection basée sur un semi-conducteur
WO2019140402A1 (fr) * 2018-01-15 2019-07-18 Illumina, Inc. Classificateur de variants basé sur un apprentissage profond

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2782858B2 (ja) 1989-10-31 1998-08-06 松下電器産業株式会社 スクロール気体圧縮機
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
JP2001517948A (ja) 1997-04-01 2001-10-09 グラクソ、グループ、リミテッド 核酸配列決定法
AR031640A1 (es) 2000-12-08 2003-09-24 Applied Research Systems Amplificacion isotermica de acidos nucleicos en un soporte solido
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
DE10320388A1 (de) 2003-05-06 2004-11-25 Basf Ag Polymere für die Wasserbehandlung
JP2006156608A (ja) 2004-11-29 2006-06-15 Hitachi Ltd 磁気メモリおよびその製造方法
EP1819304B1 (fr) 2004-12-09 2023-01-25 Twelve, Inc. Reparation de valvule sigmoide aortique
GB0427236D0 (en) 2004-12-13 2005-01-12 Solexa Ltd Improved method of nucleotide detection
JP2006199187A (ja) 2005-01-21 2006-08-03 Kyowa Sangyo Kk 車両用サンバイザ
SE529136C2 (sv) 2005-01-24 2007-05-08 Volvo Lastvagnar Ab Styrväxelkylare
US7144195B1 (en) 2005-05-20 2006-12-05 Mccoskey William D Asphalt compaction device
AU2006259565B2 (en) 2005-06-15 2011-01-06 Complete Genomics, Inc. Single molecule arrays for genetic and chemical analysis
GB0522310D0 (en) 2005-11-01 2005-12-07 Solexa Ltd Methods of preparing libraries of template polynucleotides
WO2007107710A1 (fr) 2006-03-17 2007-09-27 Solexa Limited Procédés isothermiques pour créer des réseaux moléculaires clonales simples
EP2663656B1 (fr) 2011-01-13 2016-08-24 Decode Genetics EHF Variants génétiques comme marqueurs à utiliser dans l'évaluation du risque du cancer de la vessie
US8666119B1 (en) 2011-11-29 2014-03-04 Lucasfilm Entertainment Company Ltd. Geometry tracking
US20160110498A1 (en) 2013-03-13 2016-04-21 Illumina, Inc. Methods and systems for aligning repetitive dna elements
PT3077943T (pt) 2013-12-03 2020-08-21 Illumina Inc Métodos e sistemas para analisar dados de imagem
KR102538753B1 (ko) 2014-09-18 2023-05-31 일루미나, 인코포레이티드 핵산 서열결정 데이터를 분석하기 위한 방법 및 시스템
KR102246285B1 (ko) * 2017-03-07 2021-04-29 일루미나, 인코포레이티드 단일 광원, 2-광학 채널 서열분석
NL2018852B1 (en) * 2017-05-05 2018-11-14 Illumina Inc Optical distortion correction for imaged samples
CN111094540A (zh) * 2017-09-15 2020-05-01 伊鲁米纳公司 序列检测系统的调整与校准特征
US20200251183A1 (en) * 2018-07-11 2020-08-06 Illumina, Inc. Deep Learning-Based Framework for Identifying Sequence Patterns that Cause Sequence-Specific Errors (SSEs)

Patent Citations (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991006678A1 (fr) 1989-10-26 1991-05-16 Sri International Sequençage d'adn
WO1991006778A1 (fr) 1989-11-02 1991-05-16 Sundstrand Corporation Pompe regenerative et procede de refoulement du fluide sous pression
US5719391A (en) 1994-12-08 1998-02-17 Molecular Dynamics, Inc. Fluorescence imaging system employing a macro scanning objective
US5528050A (en) 1995-07-24 1996-06-18 Molecular Dynamics, Inc. Compact scan head with multiple scanning modalities
US6859570B2 (en) 1997-03-14 2005-02-22 Trustees Of Tufts College, Tufts University Target analyte sensors utilizing microspheres
US6266459B1 (en) 1997-03-14 2001-07-24 Trustees Of Tufts College Fiber optic sensor with encoded microspheres
US7622294B2 (en) 1997-03-14 2009-11-24 Trustees Of Tufts College Methods for detecting target analytes and enzymatic reactions
WO1998044151A1 (fr) 1997-04-01 1998-10-08 Glaxo Group Limited Methode d'amplification d'acide nucleique
WO2000018957A1 (fr) 1998-09-30 2000-04-06 Applied Research Systems Ars Holding N.V. Procedes d'amplification et de sequençage d'acide nucleique
WO2000063437A2 (fr) 1999-04-20 2000-10-26 Illumina, Inc. Detection de reactions d'acide nucleique sur microsupports de billes en reseau
US6355431B1 (en) 1999-04-20 2002-03-12 Illumina, Inc. Detection of nucleic acid amplification reactions using bead arrays
US20050244870A1 (en) 1999-04-20 2005-11-03 Illumina, Inc. Nucleic acid sequencing using microsphere arrays
US6770441B2 (en) 2000-02-10 2004-08-03 Illumina, Inc. Array compositions and methods of making same
US7329492B2 (en) 2000-07-07 2008-02-12 Visigen Biotechnologies, Inc. Methods for real-time single molecule sequence determination
US7211414B2 (en) 2000-12-01 2007-05-01 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
US20050064460A1 (en) 2001-11-16 2005-03-24 Medical Research Council Emulsion compositions
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US7566537B2 (en) 2001-12-04 2009-07-28 Illumina Cambridge Limited Labelled nucleotides
US7427673B2 (en) 2001-12-04 2008-09-23 Illumina Cambridge Limited Labelled nucleotides
US7541444B2 (en) 2002-08-23 2009-06-02 Illumina Cambridge Limited Modified nucleotides
WO2004018497A2 (fr) 2002-08-23 2004-03-04 Solexa Limited Nucleotides modifies
US20070166705A1 (en) 2002-08-23 2007-07-19 John Milton Modified nucleotides
US20050130173A1 (en) 2003-01-29 2005-06-16 Leamon John H. Methods of amplifying and sequencing nucleic acids
WO2005010145A2 (fr) 2003-07-05 2005-02-03 The Johns Hopkins University Procede et compositions de detection et d'enumeration de variations genetiques
US20060240439A1 (en) 2003-09-11 2006-10-26 Smith Geoffrey P Modified polymerases for improved incorporation of nucleotide analogues
US8563477B2 (en) 2004-01-07 2013-10-22 Illumina Cambridge Limited Modified molecular arrays
US20110059865A1 (en) 2004-01-07 2011-03-10 Mark Edward Brennan Smith Modified Molecular Arrays
WO2005065814A1 (fr) 2004-01-07 2005-07-21 Solexa Limited Arrangements moleculaires modifies
US7315019B2 (en) 2004-09-17 2008-01-01 Pacific Biosciences Of California, Inc. Arrays of optical confinements and uses thereof
WO2006064199A1 (fr) 2004-12-13 2006-06-22 Solexa Limited Procede ameliore de detection de nucleotides
US20060178901A1 (en) 2005-01-05 2006-08-10 Cooper Kelana L Home movies television (HMTV)
US20060278147A1 (en) 2005-06-10 2006-12-14 Janome Sewing Machine Co., Ltd. Embroidery sewing machine
WO2007010252A1 (fr) 2005-07-20 2007-01-25 Solexa Limited Method of sequencing a polynucleotide template
WO2007010251A2 (fr) 2005-07-20 2007-01-25 Solexa Limited Preparation of templates for nucleic acid sequencing
US7592435B2 (en) 2005-08-19 2009-09-22 Illumina Cambridge Limited Modified nucleosides and nucleotides and uses thereof
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
US8158926B2 (en) 2005-11-23 2012-04-17 Illumina, Inc. Confocal imaging methods and apparatus
WO2007123744A2 (fr) 2006-03-31 2007-11-01 Solexa, Inc. Systems and methods for sequencing-by-synthesis analysis
US8241573B2 (en) 2006-03-31 2012-08-14 Illumina, Inc. Systems and devices for sequence by synthesis analysis
US20090088327A1 (en) 2006-10-06 2009-04-02 Roberto Rigatti Method for sequencing a polynucleotide template
US20080108082A1 (en) 2006-10-23 2008-05-08 Pacific Biosciences Of California, Inc. Polymerase enzymes and reagents for enhanced nucleic acid sequencing
US7414716B2 (en) 2006-10-23 2008-08-19 Emhart Glass S.A. Machine for inspecting glass containers
US20130023422A1 (en) 2008-05-05 2013-01-24 Illumina, Inc. Compensator for multiple surface imaging
US9079148B2 (en) 2008-07-02 2015-07-14 Illumina Cambridge Limited Using populations of beads for the fabrication of arrays on surfaces
US20120020537A1 (en) 2010-01-13 2012-01-26 Francisco Garcia Data processing system and methods
US20120270305A1 (en) 2011-01-10 2012-10-25 Illumina Inc. Systems, methods, and apparatuses to image a sample for biological or chemical analysis
US20120316086A1 (en) 2011-06-09 2012-12-13 Illumina, Inc. Patterned flow-cells useful for nucleic acid analysis
US8778848B2 (en) 2011-06-09 2014-07-15 Illumina, Inc. Patterned flow-cells useful for nucleic acid analysis
US20130116153A1 (en) 2011-10-28 2013-05-09 Illumina, Inc. Microarray fabrication system and method
US8778849B2 (en) 2011-10-28 2014-07-15 Illumina, Inc. Microarray fabrication system and method
US20130184796A1 (en) 2012-01-16 2013-07-18 Greatbatch Ltd. Elevated Hermetic Feedthrough Insulator Adapted for Side Attachment of Electrical Conductors on the Body Fluid Side of an Active Implantable Medical Device
US20130260372A1 (en) 2012-04-03 2013-10-03 Illumina, Inc. Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing
US20140243224A1 (en) 2013-02-26 2014-08-28 Illumina, Inc. Gel patterned surfaces
WO2015002813A1 (fr) 2013-07-01 2015-01-08 Illumina, Inc. Catalyst-free polymer grafting and surface functionalization
WO2015106941A1 (fr) 2014-01-16 2015-07-23 Illumina Cambridge Limited Modification of polynucleotides on a solid support
WO2016066586A1 (fr) 2014-10-31 2016-05-06 Illumina Cambridge Limited Novel polymers and DNA copolymer coatings
US20180189613A1 (en) * 2016-04-21 2018-07-05 Ramot At Tel Aviv University Ltd. Cascaded convolutional neural network
WO2018129314A1 (fr) * 2017-01-06 2018-07-12 Illumina, Inc. Phasing correction
WO2019079182A1 (fr) * 2017-10-16 2019-04-25 Illumina, Inc. Semi-supervised learning for training an ensemble of deep convolutional neural networks
WO2019079202A1 (fr) * 2017-10-16 2019-04-25 Illumina, Inc. Aberrant splicing detection using convolutional neural networks (CNNs)
WO2019136284A1 (fr) * 2018-01-05 2019-07-11 Illumina, Inc. Predicting quality of sequencing results using deep neural networks
WO2019136388A1 (fr) * 2018-01-08 2019-07-11 Illumina, Inc. Systems and devices for high-throughput sequencing with semiconductor-based detection
WO2019140402A1 (fr) * 2018-01-15 2019-07-18 Illumina, Inc. Deep learning-based variant classifier

Non-Patent Citations (48)

* Cited by examiner, † Cited by third party
Title
"3.3.9.11. Watershed and random walker for segmentation", SCIPY LECTURE NOTES, 13 November 2018 (2018-11-13), pages 2, Retrieved from the Internet <URL:http://scipy-lectures.org/packages/scikit-image/auto_examples/plot_segmentations.html>
"skikit-image/peak.py at master", 16 November 2018, GITHUB, pages: 5
A. G. HOWARDM. ZHUB. CHEND. KALENICHENKOW. WANGT. WEYANDM. ANDREETTOH. ADAM: "Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications", ARXIV: 1704.04861, 2017
ANONYMOUS: "MiSeq: Imaging and Base Calling", 1 January 2013 (2013-01-01), XP055669460, Retrieved from the Internet <URL:https://support.illumina.com/content/dam/illumina-support/courses/MiSeq_Imaging_and_Base_Calling/story_content/external_files/MiSeq%20Imaging%20and%20Base%20Calling%20Script.pdf> [retrieved on 20200218] *
ANONYMOUS: "MiSEQ: Imaging and Base Calling", 1 January 2013 (2013-01-01), XP055669545, Retrieved from the Internet <URL:https://support.illumina.com/training.html> [retrieved on 20200218] *
BENTLEY ET AL., NATURE, vol. 456, 2008, pages 53 - 59
BENTLEY ET AL., NATURE, vol. 456, pages 53 - 59
C. SZEGEDYW. LIUY. JIAP. SERMANETS. REEDD. ANGUELOVD. ERHANV. VANHOUCKEA. RABINOVICH: "GOING DEEPER WITH CONVOLUTIONS", ARXIV: 1409.4842, 2014
DRESSMAN ET AL., PROC. NATL. ACAD. SCI. USA, vol. 100, 2003, pages 8817 - 8822
F. CHOLLET: "Xception: Deep Learning with Depthwise Separable Convolutions", PROC. OF CVPR, 2017
F. YUV. KOLTUN: "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", ARXIV:1511.07122, 2016
G. HUANGZ. LIUL. VAN DER MAATENK. Q. WEINBERGER: "DENSELY CONNECTED CONVOLUTIONAL NETWORKS", ARXIV: 1608.06993, 2017
HYUNGTAE LEE ET AL: "Fast Object Localization Using a CNN Feature Map Based Multi-Scale Search", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 April 2016 (2016-04-12), XP080695042 *
I. J. GOODFELLOWD. WARDE-FARLEYM. MIRZAA. COURVILLEY. BENGIO: "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION", 2016, TAMPERE UNIVERSITY OF TECHNOLOGY, article "CONVOLUTIONAL NETWORKS"
J. GUZ. WANGJ. KUENL. MAA. SHAHROUDYB. SHUAIT. LIUX. WANGG. WANG: "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS", ARXIV: 1512.07108, 2017
J. HUANGV. RATHODC. SUNM. ZHUA. KORATTIKARAA. FATHII. FISCHERZ. WOJNAY. SONGS. GUADARRAMA ET AL.: "Speed/accuracy trade-offs for modern convolutional object detectors", ARXIV PREPRINT ARXIV: 1611.10012, 2016
J. LONGE. SHELHAMERT. DARRELL: "Fully convolutional networks for semantic segmentation", CVPR, 2015
J. M. WOLTERINKT. LEINERM. A. VIERGEVERI. ISGUM: "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE", ARXIV: 1704.03669, 2017
K. HEX. ZHANGS. RENJ. SUN: "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION", ARXIV:1512.03385, 2015
K. HEX. ZHANGS. RENJ. SUN: "Deep Residual Learning for Image Recognition", PROC. OF CVPR, 2016
L. SIFRE: "Rigid-motion Scattering for Image Classification", PH.D. THESIS, 2014
L. SIFRES. MALLAT: "Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination", PROC. OF CVPR, 2013
LIANG-CHIEH CHENGEORGE PAPANDREOUFLORIAN SCHROFFHARTWIG ADAM: "Rethinking atrous convolution for semantic image segmentation", CORR, ABS/1706.05587, 2017
LIU PHEMANI APAUL KWEIS CJUNG MWEHN N: "3D-Stacked Many-Core Architecture for Biological Sequence Analysis Problems", INT J PARALLEL PROG., vol. 45, no. 6, 2017, pages 1420 - 60, XP036325442, DOI: 10.1007/s10766-017-0495-0
LONG, JONATHAN: "Fully Convolutional Networks for Semantic Segmentation", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 4, 1 April 2017 (2017-04-01), pages 10
M. LINQ. CHENS. YAN: "Network in Network", PROC. OF ICLR, 2014
M. SANDLERA. HOWARDM. ZHUA. ZHMOGINOVL. CHEN: "MobileNetV2: Inverted Residuals and Linear Bottlenecks", ARXIV:1801.04381V3, 2018
MORDVINTSEV, ALEXANDERREVISION, ABID K.: "Image Segmentation with Watershed Algorithm", REVISION 43532856, 2013, pages 6, Retrieved from the Internet <URL:https://opency-pythontutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_watershed/py_watershed.html>
MZUR, WATERSHED.PY, 25 October 2017 (2017-10-25), pages 3, Retrieved from the Internet <URL:https://github.com/mzur/watershed/blob/master/Watershed.py>
PRABHAKAR ET AL.: "Plasticine: A Reconfigurable Architecture for Parallel Patterns", ISCA ' 17, 24 June 2017 (2017-06-24)
R.K. SRIVASTAVAK. GREFFJ. SCHMIDHUBER: "HIGHWAY NETWORKS", ARXIV: 1505.00387, 2015
RONNEBERGER OFISCHER PBROX T.: "U-net: Convolutional networks for biomedical image segmentation", MED. IMAGE COMPUT. COMPUT. ASSIST. INTERV., 2015, Retrieved from the Internet <URL:http://link.springer.com/chapter/10.1007/978-3-319-24574-4_28>
RONNEBERGER, OLAF: "U-net: Convolutional networks for biomedical image segmentation", INTERNATIONAL CONFERENCE ON MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION, 18 May 2015 (2015-05-18), pages 8
S. DIELEMANH. ZENK. SIMONYANO. VINYALSA. GRAVESN. KALCHBRENNERA. SENIORK. KAVUKCUOGLU: "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO", ARXIV: 1609.03499, 2016
S. IOFFEC. SZEGEDY: "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT", ARXIV: 1502.03167, 2015
S. O. ARIKM. CHRZANOWSKIA. COATESG. DIAMOSA. GIBIANSKYY. KANGX. LIJ. MILLERA. NGJ. RAIMAN: "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH", ARXIV:1702.07825, 2017
S. XIER. GIRSHICKP. DOLLARZ. TUK. HE: "Aggregated Residual Transformations for Deep Neural Networks", PROC. OF CVPR, 2017
SHEVCHENKO, A., KERAS WEIGHTED CATEGORICAL_CROSSENTROPY, 15 January 2019 (2019-01-15), pages I, Retrieved from the Internet <URL:https://gist.github.com/skeeet/cad06d584548fb45eece1d4e28cfa98b>
THAKUR, PRATIBHA: "A Survey of Image Segmentation Techniques", INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS, vol. 2, no. 4, April 2014 (2014-04-01), pages 158 - 165
TIM ALBRECHT ET AL: "Deep learning for single-molecule science", NANOTECHNOLOGY, INSTITUTE OF PHYSICS PUBLISHING, GB, vol. 28, no. 42, 18 September 2017 (2017-09-18), pages 423001, XP020320531, ISSN: 0957-4484, [retrieved on 20170918], DOI: 10.1088/1361-6528/AA8334 *
VAN DEN ASSEM, D.C.F.: "Master of Science Thesis", 18 August 2017, DELFT UNIVERSITY OF TECHNOLOGY, article "Deep Learning for Pixelwise Classification of Hyperspectral Images", pages: 19 - 38
X. ZHANGX. ZHOUM. LINJ. SUN: "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices", ARXIV:1707.01083, 2017
XIE, W.: "Microscopy cell counting and detection with fully convolutional regression networks", COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING: IMAGING & VISUALIZATION, vol. 6, no. 3, 2018, pages 283 - 292, XP055551866, DOI: 10.1080/21681163.2016.1149104
XIE, YUANPU ET AL.: "Beyond classification: structured regression for robust cell detection using convolutional neural network", INTERNATIONAL CONFERENCE ON MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION, October 2015 (2015-10-01), pages 12
Z. QINZ. ZHANGX. CHENY. PENG: "FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy", ARXIV: 1802.03750, 2018
Z. WUK. HAMMADE. GHAFAR-ZADEHS. MAGIEROWSKI: "FPGA-Accelerated 3rd Generation DNA Sequencing", IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, vol. 14, no. 1, February 2020 (2020-02-01), pages 65 - 74, XP011771041, DOI: 10.1109/TBCAS.2019.2958049
Z. WUK. HAMMADR. MITTMANNS. MAGIEROWSKIE. GHAFAR-ZADEHX. ZHONG: "FPGA-Based DNA Basecalling Hardware Acceleration", PROC. IEEE 61ST INT. MIDWEST SYMP. CIRCUITS SYST., August 2018 (2018-08-01), pages 1098 - 1101, XP033508770, DOI: 10.1109/MWSCAS.2018.8623988
ZHONG-QIU ZHAO ET AL: "Object Detection with Deep Learning: A Review", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 July 2018 (2018-07-15), XP081117166 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277116A (zh) * 2022-07-06 2022-11-01 中能电力科技开发有限公司 Network isolation method and apparatus, storage medium, and electronic device
CN115277116B (zh) * 2022-07-06 2024-02-02 中能电力科技开发有限公司 Network isolation method and apparatus, storage medium, and electronic device
CN116796196A (zh) * 2023-08-18 2023-09-22 武汉纺织大学 Co-speech gesture generation method based on multimodal joint embedding
CN116796196B (zh) * 2023-08-18 2023-11-21 武汉纺织大学 Co-speech gesture generation method based on multimodal joint embedding

Also Published As

Publication number Publication date
WO2020191391A2 (fr) 2020-09-24
WO2020191390A3 (fr) 2020-11-12
WO2020191391A3 (fr) 2020-12-03
WO2020191387A1 (fr) 2020-09-24
WO2020191390A2 (fr) 2020-09-24
WO2020191389A1 (fr) 2020-09-24

Similar Documents

Publication Publication Date Title
US11961593B2 (en) Artificial intelligence-based determination of analyte data for base calling
US11347965B2 (en) Training data generation for artificial intelligence-based sequencing
WO2020205296A1 (fr) Artificial intelligence-based generation of sequencing metadata
NL2023311B9 (en) Artificial intelligence-based generation of sequencing metadata
US20210265018A1 (en) Knowledge Distillation and Gradient Pruning-Based Compression of Artificial Intelligence-Based Base Caller
NL2023310B1 (en) Training data generation for artificial intelligence-based sequencing
US20230087698A1 (en) Compressed state-based base calling
US20230298339A1 (en) State-based base calling
WO2023049215A1 (fr) Compressed state-based base calling

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20719052; Country of ref document: EP; Kind code of ref document: A1

ENP Entry into the national phase. Ref document number: 2020572715; Country of ref document: JP; Kind code of ref document: A

REG Reference to national code. Ref country code: BR; Ref legal event code: B01A; Ref document number: 112020026426; Country of ref document: BR

ENP Entry into the national phase. Ref document number: 2020256047; Country of ref document: AU; Date of ref document: 20200321; Kind code of ref document: A

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (PCT application filed from 20040101)

ENP Entry into the national phase. Ref document number: 112020026426; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20201222

NENP Non-entry into the national phase. Ref country code: DE

WWE WIPO information: entry into national phase. Ref document number: 2020719052; Country of ref document: EP

ENP Entry into the national phase. Ref document number: 2020719052; Country of ref document: EP; Effective date: 20211021