WO2020242603A1 - Methods and usage for quantitative evaluation of clonal amplified products and sequencing qualities - Google Patents

Publication number
WO2020242603A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
rolony
size
data
quality
Prior art date
Application number
PCT/US2020/026892
Other languages
French (fr)
Inventor
Yanhong Tong
Dominic A. MANGIARDI
Yvonne CHAN
Original Assignee
Qiagen Sciences Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qiagen Sciences Llc
Publication of WO2020242603A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration

Definitions

  • This invention relates in general to novel methods, algorithms and usages of extracting quantitative comprehensive data matrices based on cluster size distribution on solid surfaces, like flow cells.
  • Background: Clonal amplification is a critical step for next generation sequencing (NGS) technology.
  • The traditional methods are either bead-based emulsion PCR (454 and ABI) or bridge amplification on the surface of the flow cell (Illumina, patent US7972820B2).
  • the clonal amplification occurs on the solid surfaces (flow cell or beads) for both technologies.
  • An example of clonal amplification is the bridge amplification for colony generation. Bridge amplification is performed after the hybridization of oligonucleotides complementary to primers immobilized on a glass surface.
  • the primers form a basis for the extension and the free end of each single-stranded molecule can anneal to a second immobilized primer in close spatial proximity forming a "bridge" that acts as a template for a second round of amplification, leading to four linear molecules. Repeated cycles lead to the formations of clusters of clonal population.
  • the density of the clonal clusters from the solid surface amplification has a large impact on sequencing performance in terms of run quality and total data output. While under-clustering maintains high data quality, it results in reduced data output. Alternatively, over-clustering can lead to poor run performance, reduced Q30 scores, the possible introduction of sequencing artifacts, and counterintuitively reduced total data output.
  • Rolling-circle amplification (RCA) driven by DNA polymerase can replicate circular oligonucleotide templates.
  • A rolony, clonally amplified through the rolling circle amplification (RCA) process, is a single-stranded DNA concatemer containing multiple copies of the template sequence.
  • Rolony can be generated either on solid surfaces (Clonal rolling circle amplification for on-chip DNA cluster generation, Biology Methods and Protocols, 2017) or in solution. If rolony is generated in solution, it is seeded/loaded on a flow cell surface for the following sequencing process. The cluster formation by rolony is impacted by multiple factors which include clonal amplification and post-amplification which correspond to the seeding and the sequencing.
  • Rolony based sequencing is a process in which clonal amplification occurs in solution first, and clusters are then formed on solid surfaces by seeding. This differs from bridge amplification, in which the DNA fragments are seeded first and clusters are then generated by clonal amplification on the solid surface (Figure 1). Therefore, bridge amplification is performed on a solid surface, while rolony amplification is not.
  • the methods and algorithms developed for the cluster evaluation based on bridge amplification cannot fit the needs for the rolony based sequencing platform.
  • RCA an isothermal amplification method
  • rolonies in terms of quality and quantity and the relationship to cluster density on flow cell surface.
  • Intercalating dye based real-time detection is not specific.
  • Probe-based real-time detection can provide some information on amplification kinetics. However, it cannot distinguish whether an increased fluorescent signal comes from probes binding to the same rolony or from probes binding to different rolonies. It is also not quantitative.
  • Cluster density and mapped reads density are critical quality criteria for rolony based sequencing platform.
  • rolony quantification or rolony concentration
  • rolony quality amplification specificity
  • seeding efficiencies and reproducibility sequencing chemistries and instrument variations.
  • the methods, and usages, disclosed in this invention have addressed the novel solutions for above issues.
  • the quantitative evaluation matrices generated by the algorithms can establish the relationships among RCA amplification conditions, rolony size distribution, cluster density and mapped reads density.
  • the evaluation matrices can be part of data output of sequencing runs.
  • The methods can be used to identify the root causes of sequencing quality variations. They can also be applied for clonal amplification optimization, sequencing platform optimization, troubleshooting and run quality control.
  • Summary of the Invention
  • This disclosure describes, in some aspects, methods for generating quantitative evaluation matrices on the sequencing platform based on rolony (DNA nanoball) which is a ssDNA template created by rolling circle amplification (RCA) from circularized library inputs.
  • the rolony is a DNA concatemer including amplified DNA sequences for sequencing primer (single or multiple) hybridization sites, sample Index, barcodes (umi) and targeted insert sequences (regions of interests).
  • The method includes the following data process: 1) primary data analysis to generate the output files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files; 2) extraction of cluster/rolony size based statistical information from the above files by the sizeStats data process; 3) establishment of comprehensive quantitative evaluation matrices from the sizeStats outputs, and application of the matrices to determine rolony quality, cluster quantity and sequencing run quality (Figure 2).
  • the quantitative evaluation matrices comprise one, or some or all of the following matrices for each defined size bin based on certain rolony or cluster size range (or size bin):
  • the rolony or cluster size can be defined as either camera px or ROI px.
  • the DNA fragment for sequencing is clonal amplified by rolling circle amplification prior to being seeded on flow cell and sequenced.
  • The methods disclosed here address the special needs of rolony based sequencing, in which clusters form on the flow cell after clonal amplification in solution. They address the challenge of identifying the critical parameters (from clonal amplification to sequencing) that impact sequencing quality in relation to size distribution.
  • Figure 2. Data process workflow to generate quantitative evaluation matrices based on size distribution.
  • Figure 3. Rolony size distribution dependent mapped reads density. The percentage of total objects of large rolonies with ROI px more than 60 px vs. overall mapped reads density (K/mm²).
  • Figure 4. Evaluation matrices for rolony quality. For each defined size bin, poor quality rolonies (R1, R2 and R3) show a reduced percentage of mapped objects and an increased raw error rate compared to the control rolony. 4A: Percentage of mapped objects for rolony quality. 4B: Mean of raw error rates for rolony quality.
  • the present invention provides a novel method useful for quantitative evaluation on rolony (clonal amplification in solution) and sequencing quality based on cluster size distribution on a flow cell.
  • “Rolony” and “nanoball” may be used interchangeably, and generally refer to DNA sequences amplified or created by rolling circle amplification (RCA) of a circularized DNA fragment.
  • Rolonies may be sequenced using sequencing-by-synthesis (SBS) and/or sequencing-by-ligation (SBL, International Patent Application Publication No. WO2011/044437).
  • SBS sequencing-by-synthesis
  • SBL sequencing-by-ligation
  • RCA based clonal amplification provides a simple solution that often can eliminate the need for emulsion PCR (ePCR) and thereby provide the option of eliminating an often expensive and labor-intensive step in many next generation sequencing methods.
  • index may be used interchangeably, and generally refer to a region of an adapter nucleic acid sequence that is useful as an identifier for the population to which the ligated nucleic acid sequence belongs.
  • an index comprises a fixed nucleic acid sequence that may be used to identify a collection of sequences belonging to a common library.
  • index sequences enable sequencing of multiple different samples in a single reaction (e.g., performed in a single flow cell).
  • an index sequence can be used to orientate a sequence imager for purposes of detecting individual sequencing reactions.
  • an index sequence may be 2 to 25 nucleotides in length.
  • seeding generally refers to the process of loading rolonies onto the flow cell surface and hybridizing them there.
  • object generally refers to the detected rolony or cluster on the flow cell surface.
  • object size generally refers to the detected rolony or cluster size on the flow cell surface.
  • object “size range” and “size bin” may be used interchangeably, and generally refer to a certain range of object size. For example, 0-10 ROI px, 11-60 ROI px, > 60 ROI px.
  • % of total may be used interchangeably, and generally refer to the total number of objects in the defined object size range divided by the number of all objects in the overall object size range. It is a normalization process that minimizes the impact of seeding concentration and efficiency, which are associated with the total count of objects.
  • % of mapped may be used interchangeably, and generally refer to, for each defined object size range, mean of percentage of mapped objects per object size. It can reflect rolony quality and sequencing run quality, including sequencing chemistry and instrument.
  • rolony size and “cluster size” may be used interchangeably, and generally refer to, the rolony size on the surfaces of flow cells. Each rolony forms a cluster by the seeding process.
  • mean of total objects generally refers to, for each defined object size range, the mean of total objects per size. It is an indicator to reflect seeding amounts on flow cells.
  • cluster density generally refers to the number of objects per mm² on the flow cell surface.
  • mapped reads density generally refers to the number of mapped objects per mm² on the flow cell surface.
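The per-bin definitions above can be illustrated with a minimal Python sketch. It assumes each detected object has already been reduced to a (size in ROI px, mapped flag) pair and that the imaged surface area in mm² is known. The names `BINS` and `bin_metrics`, and the reading of "per object size" as "per distinct size value within the bin", are illustrative assumptions and not the SizeStats implementation.

```python
from collections import defaultdict

# Hypothetical bin edges in ROI px, taken from the example ranges in the text.
BINS = {"0-10": (0, 10), "11-60": (11, 60), ">60": (61, float("inf"))}


def bin_metrics(objects, area_mm2):
    """Compute per-bin and overall metrics.

    objects:  iterable of (size_px, is_mapped) pairs, one per detected object.
    area_mm2: imaged flow cell surface area in mm^2.
    Returns (per_bin, overall) dictionaries.
    """
    objects = list(objects)
    total_all = len(objects)

    # Count total and mapped objects for every distinct size value.
    per_size = defaultdict(lambda: [0, 0])  # size -> [total, mapped]
    for size, mapped in objects:
        per_size[size][0] += 1
        per_size[size][1] += int(bool(mapped))

    per_bin = {}
    for name, (low, high) in BINS.items():
        sizes = [s for s in per_size if low <= s <= high]
        n_total = sum(per_size[s][0] for s in sizes)
        if sizes:
            # Assumption: average the per-size percentage over the sizes in the bin.
            pct_mapped = sum(
                100.0 * per_size[s][1] / per_size[s][0] for s in sizes
            ) / len(sizes)
            mean_total = n_total / len(sizes)
        else:
            pct_mapped = 0.0
            mean_total = 0.0
        per_bin[name] = {
            "pct_of_total": 100.0 * n_total / total_all if total_all else 0.0,
            "pct_mapped": pct_mapped,
            "mean_total_objects": mean_total,
        }

    n_mapped = sum(1 for _, mapped in objects if mapped)
    overall = {
        "cluster_density_per_mm2": total_all / area_mm2,
        "mapped_reads_density_per_mm2": n_mapped / area_mm2,
    }
    return per_bin, overall
```

Because the percentage of total is normalized by the overall object count, it can be compared between runs with different seeding amounts, whereas the two densities scale directly with the number of objects on the surface.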
  • filtered size generally refers to calculations using a filtered or limited size range (e.g. 2 to 20 px) instead of all identified size ranges.
  • trimmed generally refers to calculations using a trimmed or limited sequencing cycle range (e.g. initial cycles 7 to 15) instead of the full cycle range.
  • sequence map output generally refers to the output which reports the measured area on the flow cell surface uniquely occupied by an individual rolony, also referred to as “rolony size", which produced a read sequence.
  • intensity file generally refers to the file for storing cluster intensity information, which is extracted from the image file.
  • Q-score generally refers to a quality score which is a prediction of the probability of an error in base calling. It serves as a compact way to communicate very small error probabilities. A high quality score indicates that a base call is more reliable and less likely to be incorrect. For example, for base calls with a quality score of Q30, one base call in 1,000 is predicted to be incorrect.
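The Q30 example follows from the standard Phred relationship Q = -10·log10(P), where P is the predicted probability of an incorrect base call. A minimal Python sketch of the conversion in both directions is given here; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the predicted error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)
```

With this relationship, Q20 corresponds to one predicted error in 100 base calls and Q30 to one in 1,000.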
  • raw error rate generally refers to the percent of base calls in a single read or collection of reads that do not match the reference sequence, without the removal of any base calls within the read(s) based on predicted quality score.
  • SAM file generally refers to Sequence Alignment Map, a text-based format for storing biological sequences aligned to a reference sequence.
  • FQ file generally refers to a text file in FASTQ file format where no base calls have been removed due to predicted quality score.
  • the FASTQ file format is a text file format (human readable) that provides 4 lines of data per sequence: sequence identifier, the sequence, comments, quality score. It is commonly used to store sequencing reads.
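As an illustration of the four-line record layout, the following Python sketch reads FASTQ records from any text handle. It assumes unwrapped records (exactly four lines each) and, for the mean quality helper, the common Phred+33 character encoding; the encoding offset is an assumption, since the text does not state which one is used.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, comment, quality) for each 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        comment = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not comment.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\n"), sequence, comment[1:], quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (offset 33 is an assumption)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for name, seq, _, qual in read_fastq(example):
        print(name, seq, mean_quality(qual))
```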
  • FASTQ file generally refers to a text file in FASTQ file format, created from an FQ file, where entire sequences or individual base calls within sequences have been removed from the original FQ file based on predicted quality score.
  • multi dimension data matrices is a term for data structure.
  • a multi-dimensional array is an array of arrays. Two-dimensional arrays are the most commonly used. They are used to store data in a tabular manner.
  • As used herein, “SizeStat”, “sizeStat”, “SizeStats” and “sizeStats” may be used interchangeably and refer to a software package developed by QIAGEN for the extraction and combination of cluster/rolony size based statistical information from the output files of primary analysis (comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files).
  • The outputs of the process are multi-dimension data matrices related to the size distribution. The output is comprehensive and independent of the sequencing platform and targets (for example, different input library panels or different target organisms).
  • the invention refers to method for determining rolony quality, cluster quantity and/or sequencing run quality carried out by SizeStat data processing based on the sequencing results of rolonies on a flow cell surface after rolony preparation and seeding, the method comprising the steps of:
  • the at least two ranges/bins can consist of three ranges/bins, namely the small noises bin range, the functional bin range, and the competition bin range, for example as shown in example 1.
  • the quantitative evaluation matrices can be part of sequencing instrument output for the following applications:
  • the quality control encompasses control as exemplified in example 2.
  • The troubleshooting is as exemplified in example 3, and the performance optimization is as exemplified in example 1 for optimization of cluster/rolony density and in example 4 for optimization of sequencing chemistry performance.
  • the rolony preparation uses rolony technology, which is well known in the art, wherein the DNA is circularized and amplified for further sequencing.
  • Different RCA amplification conditions are applied to generate rolonies from a library input.
  • the rolonies obtained by RCA are seeded on a flow cell surface for the following sequencing process, wherein each rolony forms a cluster by the seeding process.
  • The method includes the following data process workflow (Figure 2): a. Primary data analysis: the input is the raw sequencing output (image data); the outputs of the process are files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files.
  • The sequencing quality information is embedded in the FQ and FASTQ files, including Q-score, raw error rate, read length, etc.
  • the SAM file contains the sequencing mapped reads information.
  • the surface map output has the raw information related to rolony/cluster size distribution.
  • the intensity files have the information for signal intensity (total intensity, signal margin, chastity, etc.).
  • b. SizeStats analysis: extraction and combination of cluster/rolony size based statistical information from the above files by the sizeStats data process.
  • The outputs of the process are multi-dimension data matrices related to the size distribution. The output is comprehensive and independent of the sequencing platform and targets (for example, different input library panels or different target organisms).
  • c. Evaluation matrices for rolony and sequencing quality and quantity: establish comprehensive quantitative evaluation matrices from the sizeStats outputs, and apply the matrices to determine rolony quality, cluster quantity and sequencing run quality.
  • While the fundamental evaluation concept is universal across different sequencing platforms and panels (E. coli vs. human genome, BRCA vs. lung cancer, etc.), the matrices might be platform and panel specific in order to address the needs for optimization and troubleshooting.
  • The above data process can be implemented in different programming languages, for example MATLAB, Python, Ruby, Java, Perl, R, etc.
  • the invention further includes the use of the sizeStats data process for generating quantitative evaluation matrices on the sequencing platform based on clonal amplification from rolling circle amplification, comprising the following data process workflow:
  • Sizestats as a critical central unit to transfer the primary data analysis output to quantitative evaluation matrices
  • the sizeStats data process algorithms comprise, the following components: a. imports image-based rolony surface area information (from surface map outputs) and sequencing data (FQ, FASTQ, SAM, intensity file, etc.);
  • b. aggregates surface area and sequencing information (sequence, alignment, quality, error, signal, GC content) to relate size information to core key performance indicators (KPIs) for each rolony/cluster;
  • the sizeStats data process comprises
  • the various data sources comprise FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files.
  • the various filters comprise size, alignment (mapped reads), sequencing KPI (such as mapped reads density, error rate, read length, etc.), Q-score based filter, read-length based trim.
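The import, aggregation and filtering steps described above can be sketched in Python as a join on a shared read identifier followed by optional filters. This is a minimal sketch under stated assumptions: the inputs are plain dictionaries keyed by read identifier (real FQ, SAM and surface map parsing is omitted), and all function and field names are hypothetical rather than taken from SizeStats.

```python
def combine_sources(surface_map, alignments, error_rates):
    """Join per-read information into one record per rolony/cluster.

    surface_map: {read_id: size_px}     (from the surface map output)
    alignments:  {read_id: bool}        (mapped flag, e.g. derived from SAM)
    error_rates: {read_id: float}       (error rate per read, if available)
    """
    records = []
    for read_id, size_px in surface_map.items():
        records.append({
            "read_id": read_id,
            "size_px": size_px,
            "mapped": alignments.get(read_id, False),
            "error_rate": error_rates.get(read_id),
        })
    return records


def apply_filters(records, size_range=None, mapped_only=False,
                  max_error_rate=None):
    """Keep only records that pass every filter that is switched on."""
    kept = []
    for rec in records:
        if size_range is not None:
            low, high = size_range
            if not (low <= rec["size_px"] <= high):
                continue
        if mapped_only and not rec["mapped"]:
            continue
        if max_error_rate is not None:
            # Records without an error rate cannot pass an error-rate filter.
            if rec["error_rate"] is None or rec["error_rate"] > max_error_rate:
                continue
        kept.append(rec)
    return kept
```

Keeping one record per object makes it straightforward to aggregate the same data at the rolony/cluster, tile or flow cell level afterwards.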
  • the sizeStats matrices includes the following categories or levels of information: rolony/cluster, tile and/or flow cell. Accordingly, the sizeStats data extracts and aggregates statistical information from a rolony or cluster; a tile within a flow cell; and/or a flow cell.
  • the quantitative evaluation matrices comprise the data processed from single, or multiple or full set of tiles on a flow cell.
  • the sizeStats matrices at rolony/cluster level represent the properties of each cluster, including, but not limited to, the following matrices: size (surface area), alignment, Q-score (initial cycles and total cycles), probability of error, error rate, read length, GC content, signal (total intensity, signal margin, chastity) decay rate.
  • the sizeStats matrices at tile level represent the consistency within a flow cell, including, but not limited to, the following matrices:
  • Quantity and quality metrics for total, aligned, and unaligned reads aggregated per tile include distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) for rolony count, linearized fits of size distribution (slope and Y-intercepts), read length, error rate, GC content.
  • Quantity and quality metrics aggregated per size and alignment per tile include average and standard deviation of rolony density, Q-score, error rate, read length, GC content, signal decay (total intensity, signal margin, chastity).
  • Quantity and quality metrics binned by Q-score for all reads include calculation of the total count, aligned count, unaligned count and percent aligned for the entire read sequence and for user-specified initial cycle ranges, and average read length.
  • Quantity and quality metrics binned by Q-score for mapped reads include calculation of the mean, median, and standard deviation for the rolony size, error rate, and read length.
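The tile-level distribution statistics and linear fits listed above can be sketched with the Python standard library. The text does not specify the quantile convention or how the size distribution is linearized, so the inclusive quantile method and the plain least-squares fit on (size, count) pairs used here are assumptions, and the function names are illustrative.

```python
import statistics


def distribution_stats(values):
    """Quartiles, IQR, median and sample standard deviation of a metric."""
    values = sorted(values)
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "p25": q1,
        "p50": q2,
        "p75": q3,
        "iqr": q3 - q1,
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, y_intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Computing the same statistics for every tile allows tile-to-tile consistency within a flow cell to be compared with a small, fixed set of numbers.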
  • the sizeStats matrices at flow cell level represent the overall key performance indicators (KPIs), including, but not limited to, the following matrices: i. Quantity and quality metrics aggregated per size per flow cell (ROI) for total, aligned, and unaligned reads include rolony count, density, probability, percent of surface area covered, percent alignment, distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) calculations for Q-score, read length, GC content, error rate, and average and standard deviation of fits to signal (total intensity, signal margin, chastity) decay.
  • KPIs key performance indicators
  • ii. Summary statistics per flow cell for total, aligned, and unaligned reads include rolony count, density, linearized fits of size distribution (slope and Y-intercepts), distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) for rolony size, Q-score, fits of signal (total intensity, signal margin, chastity) decay, read length, GC content.
  • the sizeStats outputs are comprehensive data matrices, which can be universal, and shared across different sequencing platforms and library panels.
  • Quantitative evaluation matrices based on rolony/cluster size distribution
  • Clusters on the surface of flow cell are formed by rolonies through the seeding process.
  • the rolony size in solution might be different from the cluster size on flow cells.
  • the surface based rolony size distribution has direct impacts on cluster density and mapped reads density. In order to identify the relationship of cluster size distribution to the desired cluster density and mapped reads density, it is necessary to divide the size distribution into different size ranges or size bins.
  • The size distribution can be divided or grouped into three representative ranges: small noises bin/range, functional bin/range, competition bin/range.
  • In the small noises range, for example 0-10 ROI px, the clusters are either rolonies that are too small or non-specific amplification products.
  • The sequencing quality (mapped reads density, raw error rate, Q-score, read length, etc.) is consistently poor in this range.
  • the number of clusters in this size range is to be minimized in terms of performance optimization.
  • In the functional range, for example 11-60 ROI px, the clusters have the desired sequencing quality.
  • the number of clusters in this size range is to be maximized in terms of performance optimization.
  • In the competition range, for example ROI px more than 60 px, the cluster size is considered too big.
  • Although the clusters/rolonies in this range have the desired sequencing quality in some cases, each cluster takes up too much surface space, so the overall cluster density and mapped reads density are reduced (Example 1).
  • the number of clusters in this size range is to be minimized in terms of performance optimization.
  • the size range of above described bins can be different.
  • The size range for each bin can be adjusted depending on the needs of the application (quality control, troubleshooting, optimization, etc.) and the case (sequencing platform/instrument, input library panel, sequencing chemistry, etc.). For example, it depends on the sequencing chemistry performance and the measurement capability of the instrument (sequencer).
  • A high quality sequencing chemistry or an improved sequencing instrument can produce high quality sequencing results from smaller clusters. The functional range can then be reduced to, for example, 10-40 ROI px, which increases the overall cluster density and mapped reads density and therefore the final high quality data output.
  • the number of above described size bins can be a different number.
  • the number of defined size bins can be adjusted depending on the needs of application (quality control, trouble-shooting, optimization, etc.) and cases (sequence platform/instrument, input library panel, sequencing chemistry, etc). For example, but not limited to, 2 bins, 4 bins, 5 bins, etc.
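Since both the bin boundaries and the number of bins are adjustable, bin assignment can be driven by a list of edges instead of hard-coded ranges. The following Python sketch uses the example ranges from the text as defaults; `make_binner` and the bin labels are illustrative names and not part of SizeStats.

```python
import bisect


def make_binner(upper_edges, labels):
    """Build a function that maps an object size (ROI px) to a bin label.

    upper_edges: ascending inclusive upper bounds of every bin except the
                 last, which is open-ended (e.g. [10, 60] for three bins).
    labels:      one label per bin, so len(labels) == len(upper_edges) + 1.
    """
    if len(labels) != len(upper_edges) + 1:
        raise ValueError("need exactly one more label than edges")
    if list(upper_edges) != sorted(upper_edges):
        raise ValueError("edges must be ascending")
    edges = list(upper_edges)

    def assign(size_px):
        # bisect_left places a size equal to an edge in the lower bin,
        # which makes the upper bounds inclusive.
        return labels[bisect.bisect_left(edges, size_px)]

    return assign


# Example ranges from the text: 0-10, 11-60 and more than 60 ROI px.
default_binner = make_binner([10, 60],
                             ["small noise", "functional", "competition"])
```

Changing the scheme, for example to a functional range of 10-40 ROI px or to five bins, only requires a different edge list and label list.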
  • the quantitative evaluation matrices are established based on above described size bins.
  • the quantitative evaluation matrices include percentage of total objects for each defined size bin. It is a normalization process to minimize the impacts of seeding concentration and efficiency. For example, the percentage of total objects needs to be minimized or controlled for sizes in the "competition range" in order to increase the cluster density or mapped reads density.
  • the quantitative evaluation matrices include percentage of mapped reads for each defined size bin. It can reflect rolony quality and sequencing run quality, including sequencing chemistry and instrument setup and measurement capabilities. For example, when there are non-specific amplifications during RCA amplification, the percentage of mapped reads in the "functional range" is reduced. For this case, even though the cluster density is high, the overall mapped reads density is reduced which is not the desired sequencing quality (Example 2).
  • the quantitative evaluation matrices include mean of total objects for each defined size bin. It is an indicator that reflects the actual seeding amount on flow cells without being affected by the number of tiles analyzed. For example, if a lower amount of rolonies is seeded on the flow cell, the counts of total objects are reduced for each defined size bin. It can be applied in troubleshooting, especially when a sequencing result has an overall mapped reads density lower than expected while the other performance KPIs are in the acceptable range.
  • the quantitative evaluation matrices include mean of Q-scores of reads for each defined size bin. It reflects the sequencing quality for each flow cell. It can be applied for rolony size distribution optimization to increase the overall Q-scores.
  • the quantitative evaluation matrices include mean of raw error rate (from FQ file) or filtered error rate (from FASTQ file) for each defined size bin. It can be applied for rolony/cluster size distribution optimization to minimize overall either raw error rate or filtered error rate.
  • the quantitative evaluation matrices include mean of read lengths for each defined size bin. It can be applied for rolony/cluster size distribution optimization to maximize overall read length.
  • the quantitative evaluation matrices include signal intensity information for each defined size bin.
  • the signal intensity information can include some or all of the following matrices: total signal mean, signal Y-intercept, signal slope, signal margin, etc. It can be applied for the study of relationship of sequencing phasing issue and rolony/cluster size distribution.
  • the quantitative evaluation matrices include mean of percentage of GC contents for each defined size bin. It can be applied for investigating RCA amplification bias related to the rolony size distribution.
  • In some embodiments, the quantitative evaluation matrices can include one, some or all of the above described matrices. An example of usage is listed in Table 1.
  • Table 1. Example of matrices for defined targets or study purpose.
  • the quantitative evaluation matrices can be based on the analysis from single, multiple or all tiles on a flow cell.
  • the quantitative evaluation matrices can be displayed as a table and/or graph. In some embodiments, the quantitative evaluation matrices can be applied for sequencing run-to-run comparison and/or tile-to-tile comparison for the same run.
  • the quantitative evaluation matrices can be applied for clonal amplification condition optimization.
  • the optimization comprises amplification primers, amplification buffers and time, etc.
  • the quantitative evaluation matrices can be applied for upstream library construction optimization.
  • the optimization comprises oligo design schema for library construction with corresponding different bridge oligos for circularization and different RCA amplification primers, etc.
  • the quantitative evaluation matrices can be applied before or after the sample demultiplexing process. If they are applied after demultiplexing, the corresponding FQ, FASTQ and SAM files are those of each individual sample.
  • Example 1 Evaluation matrices for rolony size distribution dependent mapped reads density
  • Step 1 Rolony preparation: Different RCA amplification conditions were applied to generate rolonies from a library input (a human genome panel), including different buffer components and amplification time (total 14 different test RCA conditions, each condition was tested either in duplicates or triplicates).
  • Step 2 Rolonies were seeded on the flow cell surface with the same protocol, and then sequenced on sequencer (GeneReader 1.2) with the same sequencing chemistry for 51 cycles; Total 44 sequencing flow cells.
  • Step 3 Data process through the workflow: primary analysis, sizeStats and quantitative evaluation matrices. Three size bins were defined as:
  • Example 2 Evaluation matrices for rolony quality
  • Clusters from non-specific RCA amplification take up surface space on the flow cell but cannot produce mapped reads. Therefore, non-specific amplification can reduce rolony quality and reduce overall mapped reads density. It is necessary to optimize and quality-control the RCA process to ensure rolony quality. Percentage of mapped objects/rolonies and raw error rate per size distribution are important criteria to evaluate rolony quality.
  • RCA amplification is based on circularized templates. Circularization is based on a bridge oligo guided ligation process.
  • Bridge oligo design is one of the critical parameters which impact rolony quality.
  • Two different bridge oligos were designed for circularization on an E. coli library. One of the bridge oligos had one mismatch site to the circularized template; the other had a perfect match to the circularized template and served as a control.
  • Three rolony batches (R1, R2 and R3, prepared from the mismatch bridge oligo) and the control rolony batch (from the perfect match bridge oligo) were seeded and sequenced with the same sequencing chemistry for 157 cycles (R1 was sequenced twice). Compared to the control rolony batch, R1, R2 and R3 have reduced mapped reads density (from around 400 K/mm² to 250 K/mm²) and increased raw error rate (from around 5% to 10%).
  • Rolony seeding, sequencing chemistry and instrument are the major confounding variables after the RCA clonal amplification process. It is necessary to establish evaluation matrices that differentiate the root causes for troubleshooting. The mean of total objects for each defined size bin is an important specification for seeding quantification.
  • Step 1 For that particular sequencing run, three flow cells were sequenced; the mapped reads density and raw error rate were within specification for the other two flow cells. This ruled out issues with the sequencing chemistry and instrument. In some embodiments, if a sequencing control is included in the process of RCA, seeding and sequencing, it can be used as an internal control reference to rule out issues with the sequencing chemistry and instrument based on the specification of raw error rate.
  • Step 2 The quantitative evaluation matrix was generated for the problem data point; the results are shown in Table 4 (Test 1). The mean of total objects for each defined size bin was clearly reduced (highlighted in bold), without obvious differences in the other evaluation matrices. Seeding quality might therefore be the root cause of the reduced mapped reads density.
  • Step 3 The sequencing process was repeated on the same rolony sample; the results are shown in Table 4 (Test 2), with the mapped reads density (670 K/mm2) in the acceptable range. This demonstrates that the mean of total objects can help identify root causes related to seeding quantity. The percent of total objects for each defined size bin can also be affected by seeding quantity: at a low seeding amount there is less competition for surface space, so relatively more of the larger rolonies are seeded.
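The three troubleshooting steps above can be read as a small decision procedure. The following Python sketch is illustrative only: the function name, the metric keys ("mean_total_objects", "pct_mapped") and the 20% tolerance are assumptions made for the example and are not values taken from this disclosure.

```python
def seeding_root_cause(other_flow_cells_in_spec, test_bins, reference_bins, tolerance=0.2):
    """Suggest a root cause for reduced mapped reads density.

    test_bins / reference_bins map a size-bin label to a dict of evaluation
    metrics. The tolerance (relative change) is a hypothetical threshold.
    """
    # Step 1: other flow cells of the same run being within specification
    # rules out the sequencing chemistry and the instrument.
    if not other_flow_cells_in_spec:
        return "chemistry or instrument not ruled out"

    def relative_change(bin_label, metric):
        reference = reference_bins[bin_label][metric]
        return (test_bins[bin_label][metric] - reference) / reference

    # Step 2: mean of total objects reduced in every bin while the other
    # matrices (here only percent mapped) stay essentially unchanged.
    objects_reduced = all(
        relative_change(b, "mean_total_objects") < -tolerance for b in reference_bins
    )
    mapping_unchanged = all(
        abs(relative_change(b, "pct_mapped")) <= tolerance for b in reference_bins
    )
    if objects_reduced and mapping_unchanged:
        # Step 3 (not coded): confirm by re-sequencing the same rolony sample.
        return "seeding quantity"
    return "undetermined"
```

A real implementation would compare every evaluation matrix rather than only two, and would take its acceptance limits from the run specification.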
  • Example 4 Evaluation matrices for sequencing chemistry quality control
  • Sequencing chemistry is an important parameter that affects final sequencing performance in terms of mapped reads density, raw error rate, Q-scores, read length, etc. These effects are reflected in the quantitative evaluation matrices.
  • An example is listed here.
  • A rolony (from E. coli library input) showed acceptable quality in a previous sequencing test, with around 300 K/mm2 mapped reads density and a 6% raw error rate at 157 sequencing cycles.
  • The pH of one of the sequencing chemistry buffers (extend buffer) was increased by 0.5 compared to the standard condition. The sequencing results showed around 70 K/mm2 mapped reads density and a 17% raw error rate.
  • The evaluation matrices (Table 5) show that, with increased pH, the percent of mapped reads and the Q-scores are significantly reduced and the raw error rates are significantly increased for each defined size bin. Moreover, more rolonies are detected at smaller sizes (reflected by the percent of total objects) by the sequencing chemistry with increased pH in the extend buffer. This demonstrates that the evaluation matrix can be applied to sequencing chemistry quality control or optimization during development activities, using pre-qualified or pre-evaluated rolony samples.
  • Table 5 pH impacts on the evaluation matrices.

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This disclosure describes, in one aspect, a method and algorithm for quantitative evaluation of clonally amplified products (for example, but not limited to, rolonies) and sequencing qualities (including sequencing chemistry and instrument) on solid surfaces (for example, flow cells). The sequencing data are processed by primary analysis software to generate files comprising FQ file, SAM file, FASTQ file, and surface map output and intensity files. The statistical data are extracted from the above files to generate a comprehensive evaluation matrix based on cluster size from each tile, from selected tile subsets or from all of the tiles, including the number of total cluster objects, mapped percentage, mapped quality (Q-scores), mapped length, error rate, GC content, signal intensity, etc. The size distribution based statistical evaluation matrix can be applied to differentiate the root causes of sequencing quality variation, including clonal amplification quality and quantity, seeding conditions of clonally amplified products, sequencing chemistry quality and instrument variations.

Description

Methods and usage for quantitative evaluation of clonal amplified products and sequencing qualities
Field of the Invention
This invention relates in general to novel methods, algorithms and usages of extracting quantitative comprehensive data matrices based on cluster size distribution on solid surfaces, like flow cells.
Background
Clonal amplification is a critical step for next generation sequencing (NGS) technology. The traditional methods are either bead-based emulsion PCR (454 and ABI) or bridge amplification on the surface of the flow cell (Illumina, patent US7972820B2). The clonal amplification occurs on solid surfaces (flow cell or beads) for both technologies. An example of clonal amplification is bridge amplification for colony generation. Bridge amplification is performed after the hybridization of oligonucleotides complementary to primers immobilized on a glass surface. The primers form a basis for the extension, and the free end of each single-stranded molecule can anneal to a second immobilized primer in close spatial proximity, forming a "bridge" that acts as a template for a second round of amplification, leading to four linear molecules. Repeated cycles lead to the formation of clusters of clonal populations. For the technologies based on bridge amplification, the density of the clonal clusters from the solid surface amplification has a large impact on sequencing performance in terms of run quality and total data output. While under-clustering maintains high data quality, it results in reduced data output. Conversely, over-clustering can lead to poor run performance, reduced Q30 scores, the possible introduction of sequencing artifacts, and, counterintuitively, reduced total data output. Optimization of cluster density involves identifying a balance between under- and over-clustering. Several quantitative analysis matrices have been developed by Illumina as part of the software output for sequencing quality control, optimization, and troubleshooting, for example signal intensity drop rate by cycle, percentage of reads above Q30 by cycle, density box plots comparing the raw cluster density to the %PF cluster density by lane of the flow cell, image profile, etc.
(https://www.illumina.com/content/dam/illumina-marketing/documents/products/other/miseq-overclustering-primer-770-2014-038.pdf). Nevertheless, with the known tools, it is not possible to study rolony and/or cluster size distribution.
Rolling-circle amplification (RCA) driven by DNA polymerase can replicate circular oligonucleotide templates. A rolony, clonally amplified through the RCA process, is a single-stranded DNA concatemer containing multiple copies of the template sequence. Rolonies can be generated either on solid surfaces (Clonal rolling circle amplification for on-chip DNA cluster generation, Biology Methods and Protocols, 2017) or in solution. If a rolony is generated in solution, it is seeded/loaded on a flow cell surface for the subsequent sequencing process. Cluster formation by rolonies is affected by multiple factors, which include the clonal amplification itself and the post-amplification steps of seeding and sequencing. Therefore, it is critical to understand the relationship between clonal amplification efficiency, quality and quantity in solution and the cluster density produced on the solid surface by the seeding process. Rolony based sequencing is a process with in-solution clonal amplification first and cluster formation on solid surfaces by seeding afterwards. This differs from bridge amplification, in which the DNA fragments are seeded first and the clusters are then generated by clonal amplification on the solid surface (Figure 1). In other words, bridge amplification is performed on the solid surface, while rolony amplification is not. The methods and algorithms developed for cluster evaluation based on bridge amplification therefore cannot meet the needs of the rolony based sequencing platform.
RCA, an isothermal amplification method, has been well developed and applied for many years. However, all currently available analytical methods have limitations in quantitatively evaluating the amplified products, i.e. the rolonies, in terms of quality and quantity and their relationship to cluster density on the flow cell surface. Intercalating-dye-based real-time detection is not specific. Probe-based real-time detection can provide some information on amplification kinetics. However, it cannot distinguish whether the increased fluorescent signals come from probes binding to the same rolony or from probes binding to different rolonies. It is also not quantitative. Real-time PCR based technology to evaluate RCA amplification (Clonal rolling circle amplification for on-chip DNA cluster generation, Biology Methods and Protocols, 2017) can only provide limited information for some particular targets, which is not comprehensive. FACS is a method to evaluate rolony size distribution. However, it cannot directly support an understanding of the relationship between rolony size distribution in solution and the desired cluster density or mapped reads density on solid surfaces. None of the above analytical methods can establish the linkage among RCA amplification conditions, rolony size distribution and mapped reads density.
Cluster density and mapped reads density are critical quality criteria for a rolony based sequencing platform. However, without suitable quantitative evaluation matrices, it is challenging to differentiate the following root causes of sequencing quality variations: rolony quantification (or rolony concentration), rolony quality (amplification specificity), seeding efficiency and reproducibility, sequencing chemistries and instrument variations. The methods and usages disclosed in this invention provide novel solutions to the above issues. The quantitative evaluation matrices generated by the algorithms can establish the relationships among RCA amplification conditions, rolony size distribution, cluster density and mapped reads density. The evaluation matrices can be part of the data output of sequencing runs. The methods can be used to identify the root causes of sequencing quality variations. They can also be applied to clonal amplification optimization, sequencing platform optimization, troubleshooting and run quality control.
Summary of the Invention
This disclosure describes, in some aspects, methods for generating quantitative evaluation matrices on a sequencing platform based on the rolony (DNA nanoball), which is a ssDNA template created by rolling circle amplification (RCA) from circularized library inputs. The rolony is a DNA concatemer including amplified DNA sequences for sequencing primer (single or multiple) hybridization sites, sample index, barcodes (UMI) and targeted insert sequences (regions of interest).
Generally, the method includes the following data process: 1) Primary data analysis to generate the output files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files; 2) Extracting cluster/rolony size based statistical information from the above files by the sizeStats data process; 3) Establishing comprehensive quantitative evaluation matrices from the sizeStats outputs and applying the matrices to determine rolony quality, cluster quantity and sequencing run quality (Figure 2).
In some embodiments of either aspect, the quantitative evaluation matrices comprise one, some or all of the following matrices for each defined size bin, where a size bin is a certain rolony or cluster size range:
1. Percentage of total objects (without specific mapping information),
2. Mean of percentage of mapped objects,
3. Mean of total objects,
4. Mean of Q-score of objects,
5. Mean of raw error rate,
6. Mean of read length,
7. Mean of total signal intensity and signal decay (slope), and/or
8. Mean of percentage of GC contents
In some embodiments of either aspect, the rolony or cluster size can be defined as either camera px or ROI px. The conversion is 1 µm2 = 10 camera px = 40 ROI px.
In some embodiments of either aspect, the DNA fragment for sequencing is clonally amplified by rolling circle amplification prior to being seeded on the flow cell and sequenced.
The methods disclosed here address the special needs of rolony based sequencing, in which clusters are formed on the flow cell after clonal amplification in solution. They solve the challenge of identifying the critical parameters (from clonal amplification to sequencing) that affect sequencing quality in relation to size distribution.
The above summary of the present invention is not intended to describe each disclosed embodiment or every implementation of the present invention. The description that follows more particularly exemplifies illustrative embodiments. In several places throughout the disclosure, guidance is provided through lists of examples, which examples can be used in various combinations. In each instance, the recited list serves only as a representative group and should not be interpreted as an exclusive list.
Brief description of the figures
Non-limiting embodiments of the present invention will be described by way of example with reference to the accompanying figures, which are schematic and are not intended to be drawn to scale. In the figures, each identical or nearly identical component illustrated is typically represented by a single numeral. For purposes of clarity, not every component is labeled in every figure, nor is every component of each embodiment of the invention shown where illustration is not necessary to allow those of ordinary skill in the art to understand the invention. In the figures:
Figure 1. Comparison of sequencing platforms based on bridge amplification and rolony (clonal amplification in solution).
Figure 2. Data process workflow to generate quantitative evaluation matrices based on size distribution.
Figure 3. Rolony size distribution dependent mapped reads density. The percentage of total objects of large rolonies with ROI px more than 60 px vs. overall mapped reads density (K/mm2).
Figure 4. Evaluation matrices for rolony quality. For each defined size bin, poor quality rolonies (R1, R2 and R3) show a reduced percentage of mapped objects and an increased raw error rate compared to the control rolony. 4A: Percentage of mapped objects for rolony quality. 4B: Mean of raw error rates for rolony quality.
Figure 5. Example of root cause decision algorithm for seeding issues.
Detailed description of illustrative embodiments
The present invention provides a novel method useful for quantitative evaluation on rolony (clonal amplification in solution) and sequencing quality based on cluster size distribution on a flow cell.
Terms and definitions
In some embodiments, as used herein, "rolony" and "nanoball" may be used interchangeably, and generally refer to DNA sequences amplified or created by rolling circle amplification (RCA) of a circularized DNA fragment. Rolonies may be sequenced using sequencing-by-synthesis (SBS) and/or sequencing-by-ligation (SBL, International Patent Application Publication No. WO2011/044437). RCA based clonal amplification provides a simple solution that often can eliminate the need for emulsion PCR (ePCR) and thereby provide the option of eliminating an often expensive and labor-intensive step in many next generation sequencing methods.
In some embodiments, as used herein, "index", "index sequence" and "sample index" may be used interchangeably, and generally refer to a region of an adapter nucleic acid sequence that is useful as an identifier for the population to which the ligated nucleic acid sequence belongs. In some embodiments, an index comprises a fixed nucleic acid sequence that may be used to identify a collection of sequences belonging to a common library. In some embodiments, index sequences enable sequencing of multiple different samples in a single reaction (e.g., performed in a single flow cell). In some embodiments, an index sequence can be used to orientate a sequence imager for purposes of detecting individual sequencing reactions. In some embodiments, an index sequence may be 2 to 25 nucleotides in length.
In some embodiments, as used herein, "seeding" generally refers to the process of loading and hybridizing rolonies on the flow cell surface.
In some embodiments, as used herein, "object" generally refers to the detected rolony or cluster on the flow cell surface.
In some embodiments, as used herein, "object size" generally refers to the detected rolony or cluster size on the flow cell surface.
In some embodiments, as used herein, "camera px" generally refers to the area of 1 pixel in the images generated by the camera integrated into the sequencer. It is a standard size unit, independent from different types of flow cells and sequencers. The relationship is 1 µm2 = 10 camera px.
In some embodiments, as used herein, "ROI px" generally refers to the area of 1 pixel in up-sampled images used during sequence map generation. The original camera images are up-sampled by a factor of 2, so an original image pixel is split into 2x2 pixels using 2D bilinear interpolation. This operation occurs at an early stage in the data processing chain as part of the sequence map generation process. The size and distributions are calculated based on objects detected within ROI. The relationship is 1 µm2 = 40 ROI px.
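The pixel relationships above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the area unit in the stated relationships is the square micrometre; the constant and function names are illustrative and not part of the disclosure.

```python
CAMERA_PX_PER_UM2 = 10                           # stated: 1 um^2 = 10 camera px
UPSAMPLE_FACTOR = 2                              # stated: images up-sampled by a factor of 2
ROI_PX_PER_CAMERA_PX = UPSAMPLE_FACTOR ** 2      # each camera pixel becomes 2x2 ROI pixels
ROI_PX_PER_UM2 = CAMERA_PX_PER_UM2 * ROI_PX_PER_CAMERA_PX


def camera_px_to_roi_px(camera_px):
    """Convert an object area from camera pixels to ROI pixels."""
    return camera_px * ROI_PX_PER_CAMERA_PX


def roi_px_to_um2(roi_px):
    """Convert an object area from ROI pixels to square micrometres."""
    return roi_px / ROI_PX_PER_UM2
```

Under these assumptions the factor-of-2 up-sampling accounts exactly for the ratio between the two pixel units (10 camera px and 40 ROI px per unit area), and an object of 60 ROI px covers 1.5 area units.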
In some embodiments, as used herein, object "size range" and "size bin" may be used interchangeably, and generally refers to a certain range of object size. For example, 0-10 ROI px, 11-60 ROI px, > 60 ROI px.
In some embodiments, as used herein, "% of total", "percent of total" and "percentage of total" may be used interchangeably, and generally refer to the total number of objects in the defined object size range divided by the number of all objects in the overall object size range. It is a normalization process, minimizing the impacts of seeding concentration and efficiency, which are associated with the total count of objects.
In some embodiments, as used herein, "% of mapped", "percent of mapped" and "percentage of mapped" may be used interchangeably, and generally refer to, for each defined object size range, the mean of the percentage of mapped objects per object size. It can reflect rolony quality and sequencing run quality, including sequencing chemistry and instrument.
In some embodiments, as used herein, "rolony size" and "cluster size" may be used interchangeably, and generally refer to the rolony size on the surfaces of flow cells. Each rolony forms a cluster by the seeding process.
In some embodiments, as used herein, "mean of total objects" generally refers to, for each defined object size range, the mean of total objects per size. It is an indicator of the seeding amount on flow cells.
In some embodiments, as used herein, "cluster density" generally refers to the number of objects per mm2 on the flow cell surface.
In some embodiments, as used herein, "mapped reads density" generally refers to the number of mapped objects per mm2 on the flow cell surface.
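The per-bin definitions above can be illustrated with a short Python sketch. It is not the sizeStats implementation: the function names, the input layout (one `(size_roi_px, is_mapped)` pair per detected object) and the choice to average only over sizes that were actually observed are assumptions made for the example.

```python
from collections import defaultdict


def bin_metrics(objects, bins):
    """Compute per-bin evaluation values.

    objects: iterable of (size_roi_px, is_mapped) pairs, one per detected object.
    bins: list of (label, low, high) with inclusive bounds; high=None means open-ended.
    """
    total_by_size = defaultdict(int)
    mapped_by_size = defaultdict(int)
    for size, is_mapped in objects:
        total_by_size[size] += 1
        if is_mapped:
            mapped_by_size[size] += 1
    all_objects = sum(total_by_size.values())

    result = {}
    for label, low, high in bins:
        sizes = [s for s in total_by_size if s >= low and (high is None or s <= high)]
        if not sizes:
            result[label] = None  # no object observed in this bin
            continue
        n_in_bin = sum(total_by_size[s] for s in sizes)
        result[label] = {
            # objects in the bin divided by all objects
            "pct_of_total": 100.0 * n_in_bin / all_objects,
            # mean, over observed sizes in the bin, of the object count per size
            "mean_total_objects": n_in_bin / len(sizes),
            # mean, over observed sizes in the bin, of the percent mapped per size
            "mean_pct_mapped": sum(
                100.0 * mapped_by_size[s] / total_by_size[s] for s in sizes
            ) / len(sizes),
        }
    return result


def density_per_mm2(n_objects, imaged_area_mm2):
    """Cluster density (all objects) or mapped reads density (mapped objects only)."""
    return n_objects / imaged_area_mm2
```

Passing the count of all objects to `density_per_mm2` gives cluster density, while passing only the mapped objects gives mapped reads density.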
In some embodiments, as used herein, "filtered size" generally refers to calculations using a filtered or limited size range (e.g. 2 to 20 px) instead of all identified size ranges.
In some embodiments, as used herein, "trimmed" generally refers to calculations using a trimmed or limited sequencing cycle range (e.g. initial cycles 7 to 15) instead of the entire cycle range.
In some embodiments, as used herein, "untrimmed" generally refers to calculations using the entire sequencing cycle range.
In some embodiments, as used herein, "sequence map output" generally refers to the output which reports the measured area on the flow cell surface uniquely occupied by an individual rolony, also referred to as "rolony size", which produced a read sequence.
In some embodiments, as used herein, "intensity file" generally refers to the file for storing cluster intensity information, which is extracted from the image file.
In some embodiments, as used herein, "Q-score" generally refers to a quality score which is a prediction of the probability of an error in base calling. It serves as a compact way to communicate very small error probabilities. A high quality score indicates that a base call is more reliable and less likely to be incorrect. For example, for base calls with a quality score of Q30, one base call in 1,000 is predicted to be incorrect.
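The Q30 example above is consistent with the standard Phred convention, Q = -10 log10(P), where P is the predicted probability of an incorrect base call. The sketch below assumes that convention applies to the Q-scores used here; the function names are illustrative.

```python
import math


def q_to_error_probability(q):
    """Predicted probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_q(p):
    """Phred quality score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this convention Q20 corresponds to one predicted error in 100 base calls and Q30 to one in 1,000.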
In some embodiments, as used herein, "raw error rate" generally refers to the percent of base calls in a single read or collection of reads that do not match the reference sequence, without the removal of any base calls within the read(s) based on predicted quality score.
In some embodiments, as used herein, "SAM file" generally refers to Sequence Alignment Map, a text-based format for storing biological sequences aligned to a reference sequence.
In some embodiments, as used herein, "FQ file" generally refers to a text file in FASTQ file format where no base calls have been removed due to predicted quality score. The FASTQ file format is a text file format (human readable) that provides 4 lines of data per sequence: sequence identifier, the sequence, comments, quality score. It is commonly used to store sequencing reads.
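As an illustration of the four-line record layout described above, the following sketch reads FASTQ-formatted text and computes a mean quality per read. It assumes unwrapped records (exactly four lines each) and the common Phred+33 character encoding; neither assumption is stated in this disclosure, and the function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, comment, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\n")
        comment = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # drop the leading '@' of the identifier line
        yield header.rstrip("\n")[1:], sequence, comment, quality


def mean_q(quality, offset=33):
    """Mean Q-score of a quality string, assuming Phred+offset encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!II\n"
records = list(read_fastq(io.StringIO(example)))
```

The same reader applies to both FQ and FASTQ files as defined here, since both use the FASTQ layout and differ only in whether quality-based removal has been applied.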
In some embodiments, as used herein, "FASTQ file" generally refers to a text file in FASTQ file format, created from an FQ file, where entire sequences or individual base calls within sequences have been removed from the original FQ file based on predicted quality score.
As used herein, "multi dimension data matrices" is a term for a data structure. A multi-dimensional array is an array of arrays. Two-dimensional arrays are the most commonly used. They are used to store data in a tabular manner.
As used herein, "SizeStat", "sizeStat", "SizeStats" and "sizeStats" may be used interchangeably and refer to a software package developed by QIAGEN for the extraction and combination of cluster/rolony size based statistical information from the output files of primary analysis (comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files). The outputs of the process are multi-dimension data matrices related to the size distribution. The output is comprehensive and independent of the sequencing platform and targets, for example, different input library panels or different target organisms.
Overall data process workflow
The invention relates to a method for determining rolony quality, cluster quantity and/or sequencing run quality carried out by SizeStat data processing based on the sequencing results of rolonies on a flow cell surface after rolony preparation and seeding, the method comprising the steps of:
a. generating a plurality of data files, from sequencing raw output image data, comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files;
b. processing the plurality of data files by SizeStat, thereby generating multi dimension data matrices based on rolony/cluster size distributions; and
c. generating comprehensive quantitative evaluation matrices from said multi dimension data matrices based on size distributions, wherein the size distributions comprise at least two ranges/bins.
According to the invention, the at least two ranges/bins can consist of three ranges/bins, namely the small noises bin range, the functional bin range, and the competition bin range, for example as shown in example 1.
According to the invention, the quantitative evaluation matrices can be part of the sequencing instrument output for the following applications:
a. quality control;
b. troubleshooting; and/or
c. performance optimization,
wherein the quality control is as exemplified in example 2, the troubleshooting is as exemplified in example 3, and the performance optimization is as exemplified in example 1 for optimization of cluster/rolony density and in example 4 for optimization of sequencing chemistry performance.
In one embodiment, the rolony preparation uses rolony technology, which is well known in the art, wherein the DNA is circularized and amplified for further sequencing.
In one embodiment, different RCA amplification conditions are applied to generate rolonies from a library input.
In one embodiment, the rolonies obtained by RCA are seeded on a flow cell surface for the subsequent sequencing process, wherein each rolony forms a cluster by the seeding process.
In some embodiments, the method includes the following data process workflow (Figure 2):
a. Primary data analysis: The input is the raw image data output by the sequencer; the outputs of the process are files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files. The sequencing quality information is embedded in the FQ and FASTQ files, including Q-score, raw error rate, read length, etc. The SAM file contains the sequencing mapped reads information. The surface map output has the raw information related to rolony/cluster size distribution. The intensity files have the information for signal intensity (total intensity, signal margin, chastity, etc.).
b. SizeStats analysis: Extraction and combination of cluster/rolony size based statistical information from the above files by the sizeStats data process. The outputs of the process are multi-dimension data matrices related to the size distribution. The output is comprehensive and independent of the sequencing platform and targets, for example, different input library panels or different target organisms.
c. Evaluation matrices for rolony and sequencing quality and quantity: Comprehensive quantitative evaluation matrices are established from the sizeStats outputs and applied to determine rolony quality, cluster quantity and sequencing run quality. Although the fundamental evaluation concept is universal across different sequencing platforms and panels (E. coli vs. human genome, BRCA vs. lung cancer, etc.), the matrices might be platform and panel specific in order to address the needs of optimization and troubleshooting.
In some embodiments, the above data process can be implemented in different programming languages, for example MATLAB, Python, Ruby, Java, Perl, R, etc.
The invention further includes the use of the sizeStats data process for generating quantitative evaluation matrices on the sequencing platform based on clonal amplification from rolling circle amplification, comprising the following data process workflow:
a. Primary data analysis to generate the files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files;
b. Extract cluster/rolony size based statistical information from above files by sizeStats data process;
c. Generate comprehensive quantitative evaluation matrices from sizeStats outputs, apply the matrices to determine the rolony quality, cluster quantity and/or sequencing run quality.
The use of the sizeStats data process can be applied in combination with any of the herein disclosed embodiments.
SizeStats as a critical central unit to transfer the primary data analysis output to quantitative evaluation matrices
In some embodiments, the sizeStats data process algorithms comprise the following components:
a. imports image-based rolony surface area information (from surface map outputs) and sequencing data (FQ, FASTQ, SAM, intensity file, etc.);
b. aggregates surface area and sequencing information (sequence, alignment, quality, error, signal, GC content) to relate size information to core key performance indicators (KPIs) for each rolony/cluster;
c. creates a variety of outputs summarizing the distribution of cluster sizes and sequencing performance matrices.
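The import and aggregation components listed above amount to joining per-object size information with per-read sequencing metrics and summarizing by size. The sketch below shows that idea on in-memory dictionaries; the function name, the read identifier used as the join key and the metric names are hypothetical and do not describe the actual sizeStats file handling.

```python
def aggregate_by_size(size_by_read, metrics_by_read):
    """Join per-read object size with per-read metrics and average each metric per size.

    size_by_read: {read_id: size_roi_px}, e.g. taken from a surface map output.
    metrics_by_read: {read_id: {metric_name: value}}, e.g. derived from FQ/SAM files.
    """
    grouped = {}
    for read_id, size in size_by_read.items():
        metrics = metrics_by_read.get(read_id)
        if metrics is None:
            continue  # object without a matching sequencing record
        grouped.setdefault(size, []).append(metrics)

    summary = {}
    for size, rows in sorted(grouped.items()):
        row_summary = {"count": len(rows)}
        for name in rows[0]:
            row_summary[name] = sum(r[name] for r in rows) / len(rows)
        summary[size] = row_summary
    return summary
```

The resulting size-indexed table is one simple instance of a two-dimensional data matrix; further filters (size, alignment, KPI) can be applied to the rows before averaging.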
In some embodiments, the sizeStats data process comprises
a. Relating rolony size to quality metrics through extracting and aggregating rolony size and quality information from various data sources;
b. Data reshaping, modeling, and statistical calculations using various filters, for example size, alignment, KPI;
c. Feature engineering and data reduction to summarize data.
According to the invention, the various data sources comprise FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files.
Also, according to the invention, the various filters comprise size, alignment (mapped reads), sequencing KPI (such as mapped reads density, error rate, read length, etc.), Q-score based filter, and read-length based trim.
In some embodiments, the sizeStats matrices include the following categories or levels of information: rolony/cluster, tile and/or flow cell. Accordingly, the sizeStats data process extracts and aggregates statistical information from a rolony or cluster; a tile within a flow cell; and/or a flow cell.
The quantitative evaluation matrices comprise the data processed from single, or multiple or full set of tiles on a flow cell.
In some embodiments, the sizeStats matrices at rolony/cluster level represent the properties of each cluster, including, but not limited to, the following matrices: size (surface area), alignment, Q-score (initial cycles and total cycles), probability of error, error rate, read length, GC content, signal (total intensity, signal margin, chastity) decay rate.
In some embodiments, the sizeStats matrices at tile level represent the consistency within a flow cell, including, but not limited to, the following matrices:
1) Quantity and quality metrics for total, aligned, and unaligned reads aggregated per tile include distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) for rolony count, linearized fits of size distribution (slope and Y-intercepts), read length, error rate, GC content.
2) Quantity and quality metrics aggregated per size and alignment per tile include average and standard deviation of rolony density, Q-score, error rate, read length, GC content, signal decay (total intensity, signal margin, chastity).
3) Quantity and quality metrics binned by Q-score for all reads include calculation of the total count, aligned count, unaligned count and percent aligned for the entire read sequence and for user-specified initial cycle ranges, and average read length.
4) Quantity and quality metrics binned by Q-score for mapped reads include calculation of the mean, median, and standard deviation for the rolony size, error rate, and read length.
In some embodiments, the sizeStats matrices at flow cell level represent the overall key performance indicators (KPIs), including, but not limited to, the following matrices:
i. Quantity and quality metrics aggregated per size per flow cell (ROI) for total, aligned, and unaligned reads include rolony count, density, probability, percent of surface area covered, percent alignment, distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) calculations for Q-score, read length, GC content, error rate, and average and standard deviation of fits to signal (total intensity, signal margin, chastity) decay.
ii. Summary statistics per flow cell for total, aligned, and unaligned reads include rolony count, density, linearized fits of size distribution (slope and Y-intercepts), distribution statistics (25%, 50%, 75%, IQR, SID, median, standard deviation) for rolony size, Q-score, fits of signal (total intensity, signal margin, chastity) decay, read length, GC content.
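The distribution statistics named above (quartiles, IQR, median, standard deviation) can be computed with the Python standard library. This is a sketch only: the function name is illustrative, and the quantile interpolation method ("inclusive") and the use of the sample standard deviation are assumptions, since the disclosure does not specify them.

```python
import statistics


def distribution_stats(values):
    """Quartiles, interquartile range, median and sample standard deviation of a metric."""
    # "inclusive" treats the data as the full population range when interpolating
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "IQR": q75 - q25,
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }
```

The same helper can be applied to any per-object metric (rolony size, Q-score, read length, GC content) at tile or flow cell level.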
In some embodiments, the sizeStats outputs are comprehensive data matrices, which can be universal, and shared across different sequencing platforms and library panels.
Quantitative evaluation matrices based on rolony/cluster size distribution
Clusters on the surface of flow cell are formed by rolonies through the seeding process. The rolony size in solution might be different from the cluster size on flow cells. However, the surface based rolony size distribution has direct impacts on cluster density and mapped reads density. In order to identify the relationship of cluster size distribution to the desired cluster density and mapped reads density, it is necessary to divide the size distribution into different size ranges or size bins.
In some embodiments, the size distribution can be divided into three representative ranges: a small noises bin/range, a functional bin/range, and a competition bin/range. In the small noises range, for example 0-10 ROI px, the clusters are either rolonies that are too small or non-specific amplified products. The sequencing quality (mapped reads density, raw error rate, Q-score, read length, etc.) is always poor in this range, so the number of clusters in this size range is to be minimized for performance optimization. In the functional range, for example 11-60 ROI px, the clusters have the desired sequencing quality, so the number of clusters in this size range is to be maximized for performance optimization. In the competition range, for example more than 60 ROI px, the cluster size is considered too big. Although the clusters/rolonies in this range have the desired sequencing quality in some cases, each cluster takes up too much surface space, so the overall cluster density and mapped reads density are reduced (Example 1). The number of clusters in this size range is to be minimized for performance optimization.
In some embodiments, the size ranges of the bins described above can be different. The size range of each bin can be adjusted depending on the needs of the application (quality control, troubleshooting, optimization, etc.) and the case (sequencing platform/instrument, input library panel, sequencing chemistry, etc.). For example, the ranges depend on the sequencing chemistry performance and the measurement capability of the instrument (sequencer). A high-quality sequencing chemistry or an improved sequencing instrument can produce high-quality sequencing results from smaller clusters. The functional range can then be reduced to, for example, 10-40 ROI px, which increases the overall cluster density and mapped reads density and, in turn, the final output of high-quality data.
In some embodiments, the number of size bins described above can be different. The number of defined size bins can be adjusted depending on the needs of the application (quality control, troubleshooting, optimization, etc.) and the case (sequencing platform/instrument, input library panel, sequencing chemistry, etc.). Examples include, but are not limited to, 2 bins, 4 bins, or 5 bins.
In some embodiments, the quantitative evaluation matrices are established based on the size bins described above.
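The binning described above can be illustrated with a minimal Python sketch. The function name `assign_size_bin`, the label strings, and the representation of bins as a tuple of inclusive upper bounds are assumptions made for illustration; only the example ranges (0-10, 11-60, and more than 60 ROI px) come from the text.

```python
from bisect import bisect_left

# Inclusive upper bounds (ROI px) of every bin except the last, open-ended one.
# Defaults follow the example ranges in the text; both tuples are adjustable.
DEFAULT_UPPER_BOUNDS = (10, 60)
DEFAULT_LABELS = ("small_noises", "functional", "competition")


def assign_size_bin(size_px, upper_bounds=DEFAULT_UPPER_BOUNDS, labels=DEFAULT_LABELS):
    """Return the label of the size bin that contains size_px (hypothetical helper)."""
    if len(labels) != len(upper_bounds) + 1:
        raise ValueError("need exactly one more label than upper bounds")
    if size_px < 0:
        raise ValueError("size must be non-negative")
    # bisect_left keeps each upper bound inside its own bin:
    # with bounds (10, 60), size 10 maps to index 0 and size 11 to index 1.
    return labels[bisect_left(upper_bounds, size_px)]
```

Passing a different tuple of bounds, for example `(9, 40)` for a functional range of 10-40 ROI px, or longer tuples for more bins, changes the binning without changing the function.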
In some embodiments, the quantitative evaluation matrices include the percentage of total objects for each defined size bin. This is a normalization that minimizes the impact of seeding concentration and efficiency. For example, the percentage of total objects in the "competition range" needs to be minimized or controlled in order to increase the cluster density or mapped reads density.
In some embodiments, the quantitative evaluation matrices include the percentage of mapped reads for each defined size bin. This reflects rolony quality and sequencing run quality, including sequencing chemistry and instrument setup and measurement capabilities. For example, when non-specific amplification occurs during RCA, the percentage of mapped reads in the "functional range" is reduced. In this case, even though the cluster density is high, the overall mapped reads density is reduced, which is not the desired sequencing quality (Example 2).
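As a hedged illustration of the two percentages described above, the following Python sketch computes, for each size bin, the share of all detected objects that fall in the bin and the share of objects within the bin that are mapped. The function name `percent_per_bin` and the input layout (one `(bin_label, is_mapped)` pair per object) are assumptions; the text does not specify how sizeStats stores these values.

```python
from collections import Counter


def percent_per_bin(objects):
    """Summarize objects given as (bin_label, is_mapped) pairs (hypothetical layout).

    Returns {bin_label: (percent_of_total_objects, percent_mapped_within_bin)}.
    """
    total = Counter()
    mapped = Counter()
    for label, is_mapped in objects:
        total[label] += 1
        if is_mapped:
            mapped[label] += 1
    n_objects = sum(total.values())
    # Both values are normalized, so they do not scale with the seeded amount.
    return {
        label: (100.0 * count / n_objects, 100.0 * mapped[label] / count)
        for label, count in total.items()
    }
```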
In some embodiments, the quantitative evaluation matrices include the mean of total objects for each defined size bin. This is an indicator of the actual amount seeded on the flow cell and is not affected by the number of tiles analyzed. For example, if a lower amount of rolonies is seeded on the flow cell, the counts of total objects are reduced for each defined size bin. It can be applied to troubleshooting, especially in the case where the overall mapped reads density of a sequencing result is lower than expected but the other performance KPIs are in the acceptable range. In such a case, the quantitative evaluation matrices show: 1) percentage of total objects: the larger rolonies in the competition range are limited and under control; 2) percentage of mapped objects: good rolony quality, with the desired mapped percentage for each defined size bin; 3) mean of total objects: significantly reduced for each defined size bin. Rolony seeding, reflected by the reduced mean of total objects in each defined size bin, is then most likely the root cause of the issue (Example 3).
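A minimal sketch of the mean of total objects follows, assuming that the mean is taken over the analyzed tiles (the text states only that the value is not affected by the number of tiles analyzed). The function name `mean_objects_per_tile` and the nested-dictionary input are hypothetical.

```python
from collections import defaultdict


def mean_objects_per_tile(tile_counts):
    """tile_counts: {tile_id: {bin_label: object_count}} for the analyzed tiles.

    Returns {bin_label: mean object count per tile}.
    A bin that is absent from a tile contributes zero objects for that tile.
    """
    if not tile_counts:
        return {}
    sums = defaultdict(int)
    for counts in tile_counts.values():
        for label, count in counts.items():
            sums[label] += count
    n_tiles = len(tile_counts)
    # Dividing by the number of tiles makes runs with different tile counts comparable.
    return {label: total / n_tiles for label, total in sums.items()}
```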
In some embodiments, the quantitative evaluation matrices include the mean of the Q-scores of reads for each defined size bin. This reflects the sequencing quality for each flow cell. It can be applied to rolony size distribution optimization to increase the overall Q-scores.
In some embodiments, the quantitative evaluation matrices include the mean of the raw error rate (from the FQ file) or the filtered error rate (from the FASTQ file) for each defined size bin. It can be applied to rolony/cluster size distribution optimization to minimize either the overall raw error rate or the overall filtered error rate.
In some embodiments, the quantitative evaluation matrices include the mean of read lengths for each defined size bin. It can be applied to rolony/cluster size distribution optimization to maximize the overall read length.
In some embodiments, the quantitative evaluation matrices include signal intensity information for each defined size bin. The signal intensity information can include some or all of the following matrices: total signal mean, signal Y-intercept, signal slope, signal margin, etc. It can be applied to study the relationship between sequencing phasing issues and the rolony/cluster size distribution.
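The signal slope and signal Y-intercept can be illustrated with an ordinary least-squares line through intensity versus cycle number. This is a sketch under an assumption: the text names a slope and a Y-intercept but does not state the fitting model, so the straight-line fit and the function name `fit_signal_decay` are illustrative only.

```python
def fit_signal_decay(intensities):
    """Fit a straight line to (cycle, intensity), with cycles numbered from 1.

    Returns (slope, y_intercept). A negative slope indicates signal decay.
    """
    n = len(intensities)
    if n < 2:
        raise ValueError("need at least two cycles to fit a line")
    cycles = range(1, n + 1)
    mean_x = sum(cycles) / n
    mean_y = sum(intensities) / n
    # Standard least-squares estimates for slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, intensities))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Applying the fit to the mean intensity trace of each size bin would give one slope and one Y-intercept per bin.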
In some embodiments, the quantitative evaluation matrices include the mean of the percentage of GC content for each defined size bin. It can be applied to investigate RCA amplification bias related to the rolony size distribution.
In some embodiments, the quantitative evaluation matrices can include one, some, or all of the matrices described above. An example of usage is listed in Table 1.
Table 1. Example of matrices for defined targets or study purpose.
Figure imgf000017_0001
In some embodiments, the quantitative evaluation matrices can be based on the analysis from single, multiple or all tiles on a flow cell.
In some embodiments, the quantitative evaluation matrices can be displayed as a table and/or a graph. In some embodiments, the quantitative evaluation matrices can be applied to sequencing run-to-run comparison and/or tile-to-tile comparison within the same run.
In some embodiments, the quantitative evaluation matrices can be applied for clonal amplification condition optimization. The optimization comprises amplification primers, amplification buffers and time, etc.
In some embodiments, the quantitative evaluation matrices can be applied for upstream library construction optimization. The optimization comprises oligo design schema for library construction with corresponding different bridge oligos for circularization and different RCA amplification primers, etc.
In some embodiments, the quantitative evaluation matrices can be applied before or after the sample demultiplexing process. If they are applied after demultiplexing, the corresponding FQ, FASTQ and SAM files are those of each individual sample.
Examples
Example 1: Evaluation matrices for rolony size distribution dependent mapped reads density
In order to develop and optimize the sequencing platform based on rolony, it is critical to understand and establish the linkage among RCA amplification conditions, rolony size distribution and mapped reads density. An example study is disclosed here.
Step 1. Rolony preparation: Different RCA amplification conditions, including different buffer components and amplification times, were applied to generate rolonies from a library input (a human genome panel). In total, 14 different RCA conditions were tested, each in duplicate or triplicate.
Step 2. Rolonies were seeded on the flow cell surface with the same protocol and then sequenced on a sequencer (GeneReader 1.2) with the same sequencing chemistry for 51 cycles, for a total of 44 sequencing flow cells.
Step 3. Data were processed through the workflow: primary analysis, sizeStats and quantitative evaluation matrices. Three size bins were defined as:
- Small noises: 0-10 ROI px, inclusive
- Functional: 11-60 ROI px, inclusive
- Competition: more than 60 ROI px
The results of the evaluation matrices are shown in Table 2. In terms of rolony quality (percent mapped) and sequencing quality (raw error rate and Q-score), there is no significant difference across the 44 sequencing results from the 14 tested RCA amplification conditions. However, there is a significant difference in the rolony/cluster size distribution (highlighted in bold). Figure 3 shows the relationship between the percent of total objects in the competition size bin and the overall mapped reads density, with a correlation p-value < 0.0001. This demonstrates that mapped reads density can be increased by reducing the number of larger rolonies or larger clusters, and that the evaluation matrices can guide the optimization of sequencing performance, for example mapped reads density.
Table 2: Evaluation matrices for Example 1.
Figure imgf000019_0001
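The strength of the relationship between the percent of total objects in the competition bin and the mapped reads density can be quantified with a correlation coefficient. The sketch below computes a Pearson coefficient in plain Python. The choice of Pearson correlation is an assumption, since the text reports a p-value without naming the statistical test, and the sketch does not compute a p-value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

Here `xs` would hold one competition-bin percentage per flow cell and `ys` the matching mapped reads densities. A coefficient close to -1 would be consistent with the trend described for Figure 3.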
Example 2: Evaluation matrices for rolony quality
Clusters from non-specific RCA amplification take up surface space on the flow cell but cannot produce mapped reads. Therefore, non-specific amplification reduces rolony quality and the overall mapped reads density. It is necessary to optimize and quality control the RCA process to ensure rolony quality. The percentage of mapped objects/rolonies and the raw error rate per size bin are important criteria for evaluating rolony quality.
RCA amplification is based on circularized templates. Circularization is based on a bridge oligo guided ligation process. Bridge oligo design is one of the critical parameters that affect rolony quality. Two different bridge oligos were designed for circularization of an E. coli library. One of the bridge oligos had one mismatch site to the circularized template; the other had a perfect match to the circularized template and served as a control. Three rolony batches (R1, R2 and R3, prepared with the mismatch bridge oligo) and the control rolony batch (prepared with the perfect match bridge oligo) were seeded and sequenced with the same sequencing chemistry for 157 cycles (R1 was sequenced twice). Compared to the control rolony batch, R1, R2 and R3 had reduced mapped reads density (from around 400 K/mm2 to 250 K/mm2) and increased raw error rate (from around 5% to 10%).
To identify the root cause of the poor sequencing performance of rolony batches R1, R2 and R3, quantitative evaluation matrices were designed based on the following 7 size bins (ROI px): 0-10px, 11-20px, 21-40px, 41-60px, 61-80px, 81-100px, and more than 100px. Figure 4 shows that, for each defined size bin, the rolonies generated with the mismatch bridge oligo have a reduced percentage of mapped reads (4A) and an increased raw error rate (4B) compared to the control rolony batch. Therefore, rolony quality is the root cause of the poor sequencing performance. The percentage of mapped reads and the mean raw error rate for each defined size bin can be used as evaluation criteria to determine rolony quality.
Example 3: Evaluation matrices for seeding quantity
Rolony seeding, sequencing chemistry and instrument (sequencer) are the major confounding variables after the RCA clonal amplification process. It is necessary to establish evaluation matrices to differentiate the root causes for troubleshooting. Mean of total objects for each defined size bin is an important specification for seeding quantification.
To establish a specification for seeding quantification, a sufficiently large training data set is needed. The study was designed as follows: with a defined RCA protocol and sequencing protocol, multiple operators generated rolonies from multiple library inputs (a human genome panel) and performed sequencing on multiple sequencers. The quantitative evaluation matrix was established based on these data sets. Table 3 is an example matrix based on 12 sequencing runs from 12 rolonies, prepared from 2 different library inputs by 3 operators, with an average mapped reads density of 536 +/- 127 K/mm2. With the same RCA and sequencing protocol, more rolonies were generated and sequenced later on. One data point (Test 1 in Table 4) showed a mapped reads density reduced to 250 K/mm2 without a difference in raw error rate. The following procedure was performed to identify the root cause; the corresponding decision algorithm is shown in Figure 5.
Step 1: For that particular sequencing run, three flow cells were sequenced, and the mapped reads density and raw error rate were within specification for the other two flow cells. This ruled out issues from the sequencing chemistry and the instrument. In some embodiments, if a sequencing control is included in the process of RCA, seeding and sequencing, it can be used as an internal control reference to rule out issues of sequencing chemistry and instrument based on the specification of raw error rate.
Step 2: The quantitative evaluation matrix was computed for the problem data point; the results are shown in Table 4 (Test 1). The mean of total objects was clearly reduced for each defined size bin (highlighted in bold), without an obvious difference in the other evaluation matrices. Seeding quality might therefore be the root cause of the reduced mapped reads density.
Step 3: The sequencing process was repeated on the same rolony sample; the results are shown in Table 4 (Test 2), with the mapped reads density (670 K/mm2) in the acceptable range. This demonstrates that the mean of total objects can help identify root causes related to seeding quantity. The percent of total objects for each defined size bin can also be affected by seeding quantity: at a low seeding amount there is less competition for surface space, so relatively more of the larger rolonies are seeded.
Table 3. Evaluation matrices for run-to-run variations in Example 3.
Figure imgf000021_0001
Table 4. Seeding quantity effects on the same rolony.
Figure imgf000022_0001
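The troubleshooting steps of Example 3 can be summarized as a small decision function. The sketch below follows the order of the steps in the text and the rolony quality criteria of Example 2. It is not a reproduction of Figure 5. The function name, the boolean inputs, and the returned strings are assumptions, and the pass/fail judgement for each input is left to the caller because the text gives no numeric thresholds.

```python
def diagnose_low_density(other_flow_cells_in_spec,
                         percent_mapped_in_spec,
                         raw_error_rate_in_spec,
                         mean_total_objects_reduced):
    """Suggest a likely root cause for a run with low mapped reads density."""
    # Step 1: flow cells from the same run (or a sequencing control) that are
    # within specification rule out the sequencing chemistry and the instrument.
    if not other_flow_cells_in_spec:
        return "sequencing chemistry or instrument"
    # Rolony quality shows up as a low mapped percentage or a high raw error
    # rate in each size bin (the criteria used in Example 2).
    if not (percent_mapped_in_spec and raw_error_rate_in_spec):
        return "rolony quality"
    # Step 2: a reduced mean of total objects in every bin points to seeding.
    if mean_total_objects_reduced:
        return "seeding quantity"
    return "undetermined"
```

For the Test 1 data point described above, the other flow cells were within specification, the per-bin quality matrices showed no obvious difference, and the mean of total objects was reduced, so the function would return "seeding quantity".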
Example 4: Evaluation matrices for sequencing chemistry quality control
Sequencing chemistry is an important parameter affecting final sequencing performance in terms of mapped reads density, raw error rate, Q-scores, read length, etc., and its impact is reflected in the quantitative evaluation matrices. An example is given here. A rolony sample (from an E. coli library input) showed acceptable quality in a previous sequencing test, with around 300 K/mm2 mapped reads density and 6% raw error rate at 157 sequencing cycles. The pH of one of the sequencing chemistry buffers (extend buffer) was then increased by 0.5 pH units compared to the standard condition. The sequencing results showed around 70 K/mm2 mapped reads density and 17% raw error rate. The evaluation matrices (Table 5) show that, with increased pH, the percent of mapped reads and the Q-scores are significantly reduced and the raw error rates are significantly increased for each defined size bin. Moreover, more rolonies are detected at smaller sizes (reflected by the percent of total objects) when the pH of the extend buffer is increased. This demonstrates that the evaluation matrix can be applied to sequencing chemistry quality control or optimization during development activities, using pre-qualified or pre-evaluated rolony samples.
Table 5: pH impacts on the evaluation matrices.
Figure imgf000023_0001

Claims

1. A method for determining rolony quality, cluster quantity and/or sequencing run quality carried out by SizeStat data processing based on the sequencing results of rolonies on a flow cell surface after rolony preparation and seeding, the method comprising the steps of:
a. generating a plurality of data files, from sequencing raw output image data, comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output, and intensity files;
b. processing the plurality of data files by SizeStat, thereby generating multi dimension data matrices based on rolony/cluster size distributions; and
c. generating comprehensive quantitative evaluation matrices from said multi dimension data matrices based on size distributions, wherein the size distributions comprise at least two ranges/bins.
2. The method of claim 1, wherein the at least two ranges/bins consist of three ranges/bins, the small noises bin range, the functional bin range, and the competition bin range.
3. The method of claim 1, wherein the number of size bins can be adjusted on a case dependent basis, wherein the cases comprise sequence platform/instrument, input library panel, sequencing chemistry.
4. The method of claim 1, wherein the range of each bin can be adjusted on an application dependent basis, wherein the applications comprise quality control, trouble-shooting, optimization.
5. The method according to claim 2, wherein the small noises bin range is defined as a range with non-specific amplification, small clusters or small rolonies; the functional bin range is defined as a range wherein the clusters have the desired sequencing quality and the competition bin range is defined as a range with oversized clusters or rolonies.
6. The method of claim 1, wherein said sizeStats data process, comprises:
a. Relating rolony size to quality metrics through extracting and aggregating rolony size and quality information from various data sources;
b. Data reshaping, modeling, and statistical calculations using various filters (size, alignment, KPI);
c. Feature engineering and data reduction to summarize data.
7. The method of claim 1, wherein said sizeStats data process, extracts and aggregates statistical information from:
a. a rolony or cluster;
b. a tile within a flow cell; and/or
c. a flow cell.
8. The method of claim 1, wherein said quantitative evaluation matrices, includes grouping the clusters based on size to generate predefined size bins, wherein the size is ROI px or camera px.
9. The method of claim 1, wherein said quantitative evaluation matrices, comprise one, or some or all of the following matrices for each defined size bin:
a. Percentage of total objects (without specific mapping information);
b. Mean of percentage of mapped objects;
c. Mean of total objects;
d. Mean of Q-score of objects;
e. Mean of raw error rate;
f. Mean of read length;
g. Mean of total signal intensity and signal decay (slope); and/or
h. Mean of percentage of GC contents.
10. The method of claim 1, wherein said quantitative evaluation matrices comprise the data processed from single, multiple, or full set of tiles on a flow cell.
11. The method of claim 1, wherein said quantitative evaluation matrices, can be part of sequencing instrument output for the following applications:
a. quality control;
b. troubleshooting; and/or
c. performance optimization.
12. The method of claim 1, wherein said quantitative evaluation matrices, can be displayed as table, and/or graph.
13. The method of claim 1, wherein said quantitative evaluation matrices, can be applied for sequencing run-to-run comparison and/or tile-to-tile comparison for the same run.
14. The method of claim 1, wherein said quantitative evaluation matrices, can be applied for clonal amplification condition optimization and/or for library construction optimization.
15. The method of claim 1, wherein said quantitative evaluation matrices, can be applied before or after sample demultiplex process.
16. Use of the Sizestats data process for generating quantitative evaluation matrices on the sequencing platform based on clonal amplification from rolling circle amplification, comprising the following data process workflow:
a. Primary data analysis to generate the files comprising FQ, FQ based SAM file, FASTQ, FASTQ based SAM file, surface map output and intensity files;
b. Extract cluster/rolony size based statistical information from above files by sizeStats data process;
c. Generate comprehensive quantitative evaluation matrices from sizeStats outputs, apply the matrices to determine the rolony quality, cluster quantity and/or sequencing run quality.
17. The use of claim 16, wherein said sizeStats data process comprises:
a. Relating rolony size to quality metrics through extracting and aggregating rolony size and quality information from various data sources;
b. Data reshaping, modeling, and statistical calculations using various filters (size, alignment, KPI);
c. Feature engineering and data reduction to summarize data.
18. The use of claim 16, wherein said sizeStats data process, extracts and aggregates statistical information from:
a. a rolony or cluster;
b. a tile within a flow cell; and/or
c. a flow cell.
19. The use of claim 16, wherein said quantitative evaluation matrices, includes grouping the clusters based on size (either ROI px or camera px) to generate predefined size bins.
20. The use of claim 16, wherein said quantitative evaluation matrices, includes one, or some or all of the following matrices for each defined size bin:
a. Percentage of total objects (without specific mapping information);
b. Mean of percentage of mapped objects;
c. Mean of total objects;
d. Mean of Q-score of objects;
e. Mean of raw error rate;
f. Mean of read length;
g. Mean of total signal intensity and signal decay (slope);
h. Mean of percentage of GC contents
PCT/US2020/026892 2019-05-24 2020-04-06 Methods and usage for quantitative evaluation of clonal amplified products and sequencing qualities WO2020242603A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962852596P 2019-05-24 2019-05-24
US62/852,596 2019-05-24

Publications (1)

Publication Number Publication Date
WO2020242603A1 true WO2020242603A1 (en) 2020-12-03

Family

ID=73552874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/026892 WO2020242603A1 (en) 2019-05-24 2020-04-06 Methods and usage for quantitative evaluation of clonal amplified products and sequencing qualities

Country Status (1)

Country Link
WO (1) WO2020242603A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031844B2 (en) * 2002-03-18 2006-04-18 The Board Of Regents Of The University Of Nebraska Cluster analysis of genetic microarray images
WO2015040591A1 (en) * 2013-09-20 2015-03-26 The Chinese University Of Hong Kong Sequencing analysis of circulating dna to detect and monitor autoimmune diseases
WO2016154584A1 (en) * 2015-03-26 2016-09-29 Quest Diagnostics Investments Incorporated Alignment and variant sequencing analysis pipeline

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SENA, JA ET AL.: "Unique Molecular Identifiers reveal a novel sequencing artefact with implications for RNA-Seq based gene expression analysis", SCIENTIFIC REPORTS, vol. 8, 3 September 2018 (2018-09-03), XP055763985, DOI: 10.1038/s41598-018-31064-7 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092585A (en) * 2023-01-30 2023-05-09 上海睿璟生物科技有限公司 Multiple PCR amplification optimization method, system, equipment and medium based on machine learning
CN116092585B (en) * 2023-01-30 2024-04-19 上海睿璟生物科技有限公司 Multiple PCR amplification optimization method, system, equipment and medium based on machine learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20814123

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20814123

Country of ref document: EP

Kind code of ref document: A1