WO2020210876A1 - Quality measurement of next generation sequencing reads - Google Patents

Quality measurement of next generation sequencing reads Download PDF

Info

Publication number
WO2020210876A1
WO2020210876A1 PCT/AU2020/050382 AU2020050382W WO2020210876A1 WO 2020210876 A1 WO2020210876 A1 WO 2020210876A1 AU 2020050382 W AU2020050382 W AU 2020050382W WO 2020210876 A1 WO2020210876 A1 WO 2020210876A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
data
condition
sequencer
quality
Prior art date
Application number
PCT/AU2020/050382
Other languages
French (fr)
Inventor
John Pearson
Nic WADDELL
Sarah KEMPE
Original Assignee
genomiQa Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2019901349A external-priority patent/AU2019901349A0/en
Application filed by genomiQa Pty Ltd filed Critical genomiQa Pty Ltd
Publication of WO2020210876A1 publication Critical patent/WO2020210876A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This disclosure relates to a high-throughput, next-generation sequencing system. The sequencing system comprises a sequencer performing an operation on a biological sample to generate sequencing data for the biological sample and a computer system configured to receive the sequencing data for the biological sample, align the sequencing data to a reference sequence to thereby calculate an alignment score indicative of a quality of the alignment, and identify a condition of the operation of the sequencer based on the alignment score. Proposed quality metrics are determined based on the sequencing data for the biological sample instead of control sequences. This allows the quality assessment of the actual 'payload' data instead of inferred quality where the basis of the inference is not always strong.

Description

QUALITY MEASUREMENT OF NEXT GENERATION SEQUENCING READS
Cross reference to related applications
[0001] The present application claims priority from Australian Provisional Patent Application 2019901349 filed on 18 April 2019, the contents of which are incorporated herein by reference in their entirety.
Technical Field
[0002] This disclosure relates to a high-throughput, next-generation sequencing system.
Background
[0003] The development of high-throughput, next generation sequencing has enabled a large range of research in the field of genomics. Further, the research results are increasingly translated into clinical applications, such as diagnostics and personalised medicine.
[0004] However, it remains a challenge to accurately measure the quality of the sequencing data. For example, there are quality metrics available at the stage of base calling, including PHRED scores. However, these scores are often aggregated over a large number of bases or difficult to interpret.
[0005] There are also quality controls using control spike-ins, like PhiX controls. However, the sequencing of those controls can only measure the quality of that sequencing instance and in many cases, good quality sequencing of the PhiX controls does not mean that the payload/sample data is also of good quality.
[0006] Therefore, there is a need for improved quality control mechanisms in high- throughput, next-generation sequencing (NGS) systems. [0007] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.
[0008] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
Summary
[0009] This disclosure provides a system that assesses the quality of the sequencing data from the sequencer (i.e. the base calls) using not only the quality metrics from each individual base call or statistics over those metrics, but also metrics from the downstream alignment step where the reads are aligned to a reference sequence. Such a system may link the aligned reads to the actual sequencing cycles on the sequencer and therefore extract a more meaningful and more accurate quality assessment.
[0010] As NGS systems have evolved rapidly, many investigations have used a mix of different technologies or machines. While this has been driven by the desire to reduce the cost per sample or per read, it makes it more difficult to distil a process for calculating a quality metric for one particular machine. However, when a large number of samples are sequenced by the same machine over a period of time (months or years) it becomes possible to identify patterns from post-alignment data. In turn, these patterns allow the determination of operation conditions which can then be adjusted to increase the quality of the sequencing. Importantly, the proposed quality metrics are determined based on the sequencing data for the biological sample instead of control sequences. This allows the quality assessment of the actual‘payload’ data instead of inferred quality where the basis of the inference is not always strong. [0011] A sequencing system comprises:
a sequencer performing an operation on a biological sample to generate sequencing data for the biological sample;
a computer system configured to:
receive the sequencing data for the biological sample;
align the sequencing data to a reference sequence to thereby calculate an alignment score indicative of a quality of the alignment; and
identify a condition of the operation of the sequencer based on the alignment score.
[0012] It is an advantage that the alignment score provides additional information that is not available for the sequencing data directly. Therefore, the method is able to monitor more conditions of the operation of the sequencer. As a result, by adjusting the operation accordingly, a high quality of the sequencing data can be ensured continuously.
[0013] The computer system may be further configured to disregard sequencing data of a sequencing run in response to identifying that the alignment score across the sequencing data of the sequencing run is below a predefined threshold.
[0014] The condition may relate to a part of the sequencing data and identifying a condition is based on that part.
[0015] The step of aligning the sequencing data may comprise aligning each of multiple reads with the reference sequence and maintaining an association between each of the multiple reads with the sequencing data from the sequencer.
[0016] Identifying a condition of the operation may comprise analysing the sequencing data from the sequencer at a position on the reference sequence where multiple reads that are aligned to that position show a similar quality related pattern. [0017] Maintaining an association may comprise maintaining for each aligned read an association with meta-data of the sequencing data.
[0018] The meta-data may comprise an indication of one or more of:
a cycle number;
a flowcell lane;
a tile number within the flowcell lane;
an‘x’ -coordinate of a cluster represented by the read within the tile;
a‘y’ -coordinate of a cluster represented by the read within the tile;
an index number for a multiplexed sample; and
a member index of a pair for paired-end reads.
[0019] Identifying the condition may comprise identifying common metadata elements for multiple reads that show a similar quality related pattern. Identifying the condition may comprise identifying one or more common meta-data elements of the multiple reads at the position on the reference sequence where multiple reads that are aligned to that position show a similar quality related pattern. Identifying the condition may comprise identifying the condition based on the one or more common meta-data elements.
[0020] The condition may be related to an elapsed time within a current run. The elapsed time may be indicative of a period of time a chemical has been used during the sequencing.
[0021] The condition may be related to a location within the sequencer. The condition may be indicative of a defect of the flowcell.
[0022] The computer system may be further configured to aggregate the alignment score for each cycle of multiple cycles of the sequencer and the condition is based on the aggregated alignment score. [0023] The condition may be based on a number of mismatches in each of the multiple cycles.
[0024] The processor may be further configured to automatically adjust the operation of the sequencer based on the condition to improve a quality of the sequencing data.
[0025] A method for assessing sequencing data comprises:
receiving the sequencing data for the biological sample;
aligning the sequencing data to a reference sequence to thereby calculate an alignment score indicative of a quality of the alignment; and
identify a quality of the sequencing data based on the alignment score.
[0026] Optional features described of any aspect of method, computer readable medium or computer system, where appropriate, similarly apply to the other aspects also described here.
Brief Description of Drawings
[0027] An example will now be described with reference to the following figures:
[0028] Fig. 1 illustrates a sequencing system.
[0029] Fig. 2 illustrates a method 200 for sequencing data analysis.
[0030] Fig. 3 is a schematic to illustrate the aggregation process in more detail.
[0031] Fig. 4 illustrates the aggregation of alignment scores, such as mismatches.
[0032] Fig. 5 illustrates a schematic data plot that visualises the accumulated numbers from Fig. 4. [0033] Fig. 6 illustrates a similar data plot to Fig. 5 but where the mismatch numbers are accumulated per base, resulting in four different mismatch numbers for each cycle as indicated by the shading of the bars.
[0034] Fig. 7 is a data plot similar to Fig. 6 but from experimental sequencing data for first reads of read pairs.
[0035] Fig. 8 is a data plot similar to Fig. 6 but from experimental sequencing data for second reads of read pairs.
[0036] Fig. 9 illustrates the computer system from Fig. 1 in more detail.
Description of Embodiments
[0037] Fig. 1 illustrates a sequencing system 100 that comprises a sequencer 101 and a computer system 102. The sequencer 101 performs an operation on a biological sample, such as a polynucleotide that may be human DNA (i.e. the human genome) or RNA, to generate sequencing data for the biological sample. For example, the sequencer 101 may perform a sequencing operation, such as sequencing -by-synthesis of DNA, also known as Illumina sequencing and available in Illumina’s HiSeq X Ten sequencers. Other sequencing system examples include Ion Torrent, Pacific
Biosciences and Oxford Nanopore. In the example of Illumina sequencing, during fragmentation the DNA is broken into many relatively short segments (“fragments”). During amplification, each fragment is then attached to a flow cell and cloned many times. This results in many clusters of identical fragments (the fragments in each cluster are identical). The fragments are connected to the flow cell at one end and the other end is free.
[0038] The sequencing process now starts at the end connected to the flow cell. The flow cell is now flooded with individual bases that are tagged with different fluorescent markers, so that identical bases have identical colour markers (e.g. red and green). As the flow cell is flooded, the tagged bases attach to the fragment at the first (bottom) location. Which tagged base is attached depends on the base of the fragment at that location. Consequently, each segment is tagged with a fluorescent colour depending on the base at the bottom of the fragment (red, greed, both/orange, none). Since there are many identical fragments in one cluster, there are many bases that are tagged and since they are all tagged with the same colour, there is a bright coloured spot at the position of the cluster (when laser light stimulates the attached fluorescent marker). This bright coloured spot can be captured by a camera. In fact, the camera captures an image of the entire flow cell and each spot on the image represents the base of a cluster at a position on the flow cell that corresponds to the respective pixel of the image.
[0039] Once the bottom bases of the segments are captured, this process is repeated for the next base from the bottom, such that over time, each cluster’s fragments are read base-by-base“bottom up” and each base is captured by an image of the flow cell.
Since there are many clusters captured at the same time, there is a parallel operation, which drastically reduces sequencing time. Each repetition of these steps of single base synthesis and image capture is referred to as a“cycle” of the sequencer. The sequencing can be stopped after a sufficient number of cycles and the resulting sequence of bases at a cluster is referred to as a“read”. A sufficient number of cycles generally means that the length of the read is sufficient to map the read against a reference sequence (e.g. 150bp). The sequencing process is repeated so that the other end of the same fragment is sequenced to produce a second read from the same cluster. The two reads from these two opposite‘directions’ of a cluster are referred to as a “read pair”. The aim of this disclosure is to provide a quality assessment and potential data filtering or automatic adjustment of the sequencer.
[0040] Fig. 2 illustrates a method 200 for sequencing data analysis. The computer system 102 is programmed to perform the method. In particular, computer system 102 receives 201 the sequencing data for the biological sample. This may be a data file generated by the sequencer, such as a FASTQ file. [0041] While the FASTQ file contains a quality score for each base, there are cases where that base quality score indicates good quality while the read is actually of a poor quality. This can be addressed by the solution disclosed herein.
[0042] In particular, computer system 102 aligns the sequencing data to a reference sequence. This may use state of the art alignment methods including GATK or others, which typically write the alignment results into a BAM file. Computer system 102 maintains an association between each of the multiple reads with the sequencing data from the sequencer. This means that even after alignment, computer system 102 can access the sequencing data (FASTQ file) that was generated by the sequencer. In particular, after alignment, computer system 102 has access to the cycle number for each base of the aligned reads.
[0043] Typically, alignment tools calculate an alignment score indicative of a quality of the alignment. In one example, the alignment score is a binary score that indicates a match or a mismatch with the reference sequence. Mismatch means that the read cannot be mapped against the reference sequence with sufficient accuracy, such as sufficient number of bases that match the reference sequence. In other examples, the alignment score may be a separate score for each base of the read. In yet another example, the alignment score may indicate the number of bases that are different between the reference sequence and the read. Qprofiler
(https://sourceforge.net/p/adamajava/wiki/qprofiler%202.0/) can be used to extract data. Qprofiler is a standalone Java app that takes in BAM or FASTQ files and provides an XML file containing basic summary statistics for each file.
[0044] Importantly, since computer system 102 maintains the association to the cycle number of each aligned base, it can now aggregate the alignment score for each cycle. This means that the computer system 102 maintains the cycle number for each base when alignment is performed. This maintaining of an association may be achieved by using meta-data of the sequencing data where the cycle number is stored for each cycle Computer system 102 then groups bases by cycle number and determines the number of mismatches in bases labelled with the same number. [0045] Fig. 3 is a schematic to illustrate the alignment process in more detail. In particular, Fig. 3 shows a reference sequence 301, which is separated into multiple segments simply for easy of illustration. There are also four reads 302, 303, 304 and 305 which have been aligned to the reference sequence so that a significant number of bases are identical between the reference sequence and the respective read. Each of the reads 302, 303, 304, 305 have four bases in this example represented by squares. Of course, the reads in most practical applications would have more bases, such as 150.
As explained above, each base is sequenced in one cycle of the sequencer and the cycle corresponding to each base in Fig. 3 is indicated by the pattern in the square. For example, first base 312 of first read 302 as well as first base 313 of second read 303 were both sequenced in parallel in the first cycle.
[0046] Fig. 3 also indicates whether there is a match or a mismatch between the bases within reads 302, 303, 304, 305 and the reference sequence 301. In particular, a thick outline, such as around base 312 indicates a mismatch, where a thin outline, such as around base 313 indicates a match. The number of mismatches is exaggerated and matches removed in Fig. 3 for illustrative purposes. The two reasons that base 312 would not match the reference 301 are because there is a true biological variant in the read 302 at base 312, or the sequencing machine made a mistake. If it is assumed that for whole genome sequencing, approximately 1 in a 1000 positions in a given human genome will differ from the reference 301, then the true variant rate should be approximately 0.1%. Error rates above that number are probably due to machine error.
[0047] While it is possible to simply count the overall number of mismatches over matches for all reads to determine a percentage of mismatches, this may not give an accurate insight into the operation of the sequencer because this would ignore that many bases are read in the same cycle in parallel. Therefore, once the reads are aligned, the computer system 102 can reference each read back to the cycle number and accumulate mismatches for each cycle.
[0048] Fig. 4 illustrates the aggregation process in more detail. In particular, Fig. 4 only shows the mismatches from Fig. 3 at unchanged horizontal position. However, the vertical position has changed to group the mismatches according to their cycle numbers. So, for example, mismatch 312 from Fig. 3 is shown again in Fig. 4 in the row for cycle 1. Mismatch 314 from Fig. 3 is shown again in Fig. 4 in the row for cycle 2, and so on. This means that read 302 has three mismatches in cycles 1, 2 and 4, read 303 has two mismatches in cycles 3 and 4, read 304 has also two mismatches in cycles 3 and 4 and finally, read 305 has three mismatches in cycles 2, 3 and 4.
[0049] Computer system 102 can now simply count the number of mismatches in each row which is shown in column 401.
[0050] Fig. 5 illustrates a data plot that visualises the accumulated numbers from Fig. 4. The numbers of mismatches are sorted by cycle number, which is useful because in many cases the numbers show a trend during the run. In this example of Fig. 5, it can be seen that the number of mismatches increases steadily with the number of cycles. It is also possible to set a threshold 502, such as 1%, so that this run can be discarded if a set count or proportion of the cycles of this run exceeds the threshold.
[0051] It is noted again that this percentage cannot be determined based on the sequencing data alone, that is, from the FASTQ file since the match/mismatch property is determined based on the results from the alignment step. At the same time, the percentage of mismatches per cycle is an important value that gives valuable insight into the operation condition of the sequencer. In other words, the condition of the sequencer (such as“normal operation” or“malfunction”) is based on an aggregated score, which may be indicative of the number of mismatches in each of the multiple cycles.
[0052] Fig. 6 illustrates another example data plot, where the mismatch numbers are accumulated per base, resulting in four different mismatch numbers for each cycle. For example, the first mismatch number for cycle 4 may represent the number of mismatches where the mismatched base is“G” , the second number where the mismatched base is“A”, the third number where the mismatched base is“C” and finally the fourth number where the mismatched base is“T”. It is now possible to identify problems associated with particular bases. For example, in Fig. 6 cycle 3 shows that the mismatches for base“T” are significantly more than for the other bases. This could indicate that at this cycle there was a problem with the laser or camera or sequencing chemistry such that the sequencer had an increased and erroneous tendency to call a given base as“T” when it was not.
[0053] In response to the determination of the condition of the sequencer, it is possible to adjust the operation of the sequencer. Further, it is possible to filter the sequencing data based on the determined condition. For example, the sequencing data may comprise data from multiple runs or multiple read groups. Computer system 102 can therefore remove runs or read groups where the number of mismatches is above the threshold for at least one cycle. Computer system 102 may then use the filtered data in subsequent steps, such as variant calling or machine learning. While this reduces the overall sequencing depth, it improves the overall outcome since the error introduced by a low-quality run may be more detrimental to the outcome than removing that run altogether. In other words, the machine learning will provide a better prediction accuracy when the poor quality runs are excluded from the training samples and/or not used during the evaluation phase.
[0054] Fig. 7 is a data plot from experimental sequencing data. This plot again shows the number and type of base mismatches found at each cycle. The early cycles are shown at the top and the later at the bottom. In this case, the run has two reads per fragment, so Fig. 7 shows the numbers for the first read while Fig. 8 shows the numbers or the second read in the pair.
[0055] This plot helps identify the base error rate of the run. For each cycle, the various colours represent how many bases of that type were observed that did not match the reference sequence. For example, it is possible to disregard sequencing data of a sequencing run in response to identifying that the alignment score across the sequencing data of the sequencing run is below a predefined threshold (which is equivalent to the mismatch score/count being above the threshold). [0056] As explained above, the condition relates to a part of the sequencing data, such as one run or one cycle, which means that computer system 102 reads only sequencing data from the input fde that relates to that run. Consequently, identifying a condition is based on that part.
[0057] Further, computer system 102 may analyse the sequencing data from the sequencer at a position on the reference sequence where multiple reads that are aligned to that position show a similar quality related pattern. For example, computer system 102 may identify locations in the reference sequence where there is a large number of low quality reads and may flag this in an output report.
[0058] Instead of, or in addition to the cycle number, the meta-data that is maintained together with the aligned reads may comprise an indication of a flowcell lane, a tile number within the flowcell lane, an‘x’ -coordinate of a cluster represented by the read within the tile, a‘y’ -coordinate of a cluster represented by the read within the tile, an index number for a multiplexed sample, a member index of a pair for paired-end reads. As a result, any problems (e.g. where the mismatch percentage is above the threshold) can be isolated to one of the above indications in the sense that computer system 102 identifies the condition of the sequencer by identifying common metadata elements for multiple reads that show a similar quality related pattern. For example, computer system 102 may generate the data shown in Fig. 6 for all different flowcell lanes (or other locations in the sequencer/on the chip) instead of cycle numbers. This would readily identify any flowcell lanes (or locations) that generate bad (i.e. high mismatches) sequencing data based on the alignment.
[0059] It is noted that cycle numbers are used herein to indicate the progress of the run along the sequenced segments but equally, the elapsed time within the current run may be used to accumulate and plot the mismatch score. In that sense, the elapsed time (and/or the cycle number) is indicative of a period of time a chemical has been used during the sequencing. [0060] While the data can be used for automatic correction of the sequencer (such as adjusting chemicals or other parameters), it is also useful to visualise the data. This can be achieved using the qvisualise2 tool (https://github.com/ChristinaXu2017/annotate- overlap/tree/master/qvisualise2).
Further metrics
[0061] The following section provides further metrics that can be used to determine the operational condition of the sequencer. These metrics can be determined by computer system 102 and alerts can be generated if the suggested standards are violated. Further, computer system 102 can automatically filter the sequencing data in response to detecting violating parts of the sequencing data by removing those parts.
[0062] Computer system 102 may provide information about the run, and performance metrics for each readgroup. The first table below shows that 61 cycles had more than 1% mismatch to the reference, both reads had an average length of 150 bases, and 15 million reads were discarded out of atotal of 2.18 billion.
Figure imgf000014_0001
, ,
[0063] The second table below shows that there were 9 readgroups in this BAM file and the amount of reads from each is quite different - 7 readgroups had about 170 million reads, one had 288 million reads and one had 687 million reads. There are large differences between the relative contributions of different readgroups, which is unusual and can be problematic if there is a problem with the large readgroup.
Figure imgf000015_0001
[0064] The table shows average and maximum read lengths across the readgroups and the modal TLEN (Template LENgth). These 3 values should be highly similar across all readgroups or the variant calling may be compromised, and in particular any substantial variation (>10%) in the TLEN values will dramatically impact the ability to call structural variants. The rest of the columns count bases that are lost to analysis for a variety of reasons including duplicate, overlapping and unmapped reads. If the total bases lost is below 20% then we call this an acceptable run however any amount of loss that causes the average read depth to drop below the level paid for (3 Ox for germline and 60x for tumour) is unacceptable. Base loss above 20% is an indication to redo the run.
• Non-canonical pairs can be real and represent structural variants however the reads due to structural variants (SV's) is usually infinitesimal so any value here can be considered primarily due to library preparation.
• Duplicate reads - if there was too little source DNA or the library preparation failed, this may lead to a library that lacks complexity and a high level of duplicate reads. Because only one read from a set of duplicates can be kept for analysis, a high level of duplicate reads can quickly erode the read depth available for variant calling. The duplicate read rate should ideally be below a threshold of 10% so the 14% average shown is not optimal.
• Within ReadPair Overlap Sequence quality score
Base distribution by cycle.
[0065] In humans, the genome is approximately 40% GC so in WGS it can be expected to sequence approximate base distributions of 30% A (red), 20% C (blue), 20% G (black) and 30% T (red). Ideally, this is consistent across all of the cycles. It can also be expected to see some disruption of the pattern in about 10 cycles at the start and end of each read. Also note that in the later cycles some extra truncation of the reads is expected, leading to fewer bases.
[0066] Also note that exomes and panels may have a different default GC% as they are primarily sequencing exons and genes tends to have higher GC% than introns and intergenic regions. Platforms that use barcodes often show much larger disruptions of the base distribution in the first dozen cycles. Depending on the sequencing technology, it is common to see barcodes used when sequencing panels and exomes.
Kmer-2 distribution
[0067] Computer system 102 may provide a kmer analyses at 1,2,3 and 6 bases. The kmers are calculated as a sliding window starting from the first cycle to the last. So for a kmer-3 analysis computer system 102 performs the following steps: for every read, the cycle 1 distribution uses cycles 1,2,3 to construct a 3 -base kmer and then total the different 3-base kmers. System 102 then starts at cycle 2 and using the bases in cycles 2,3,4, it constructs another distribution, etc. For a 100-base read, the last kmer-3 distribution is cycles 98 which uses the bases from cycles 98,99,100.
[0068] An expected pattern comprises AA and TT as the most common kmer-2's and CG as the least common with decent separation from the other 2-mers which are in a single band. Deviation from this pattern could mean a problem with the sequencer or library preparation or the sample is not human WGS. Small deviations are OK as the bias will be removed during alignment and variant-calling.
OUAL
Quality score distribution by cycle
[0069] Sequencer 101 typically assigns a quality score to each sequenced base, usually in the range 0 to 60. The quality score is modelled on the Phred quality score from the days of gel and capillary sequencing. It may be 10 * the log 10 of the error probability. So for a 1-in-a-thousand error, that would be 1 in 10L3 so log 10 of the error probability is 3 and 10 * 3 = 30. So a Phred score of 30 represents a 1 in 1000 error probability, Phred 20 is 1 in 100, and Phred 60 is 1 in 1,000,000.
[0070] System 102 may provide the first and second reads separately from the first to the last cycle. It is expected that there are many high quality reads and as little as possible low quality reads. Quality tends to degrade in later cycles so there is likely an increase in low quality reads in the later cycles There is often a discontinuity in the middle of the read where the quality resets to be slightly better and this is because many platforms reload reagents in the middle of the run and this briefly improves the quality. The second read is done after the first so the whole of the second read is of lower quality than the first read.
Tag RG
[0071] Readgroups are the basic collective unit for sequencing reads and they map directly to one or two FASTQ fdes for single-ended and paired-end/mate-pair sequencing respectively. A readgroup should come from one (or zero) barcodes, from one lane/channel on one flowcell/slide. While this is an important rule, same operators may conflate fdes when they should not.
RG Tally
[0072] Computer system 102 may provide the relative read count for each of the readgroups in the BAM fde. Ideally all readgroups would have the same approximate read count but big deviations may not necessarily represent a problem. Different readgroups mean that the sequencing is from multiple different lanes, or slides or barcodes. Multiple readgroups can be thought of as an advantage in that a problem in one readgroup might not be present in the others, lessening the effect of the problem. On the flipside, multiple readgroups may mean more opportunities for error. In either case, the quality of each readgroup can be considered independently and if a readgroup shows a significant problem, system 102 can remerge the BAM without the problematic readgroup. This reduces the effective read depth but lower read depth give betters variant calling than leaving in a readgroup that is damaged or biased.
Read Name Analysis
[0073] As mentioned above, a readgroup should come from one (or zero) barcodes, lane/channel on one flowcell/slide. So one FASTQ file (or pair of readl/read2 files) represents a readgroup. Some operators may merge FASTQ files from different readgroups into a single unit prior to processing. This is typically undesirable because information may be lost that the merged readgroups may have been from different library preparations which could affect TLE, base distribution and other
parameters. Readgroups should be kept separate and intact so that the quality of each one can be considered separately.
[0074] Given the rules on readgroups, it follows that within a readgroup, there is only one sequencer, one runID, one flowcell/slide, and one lane/channel. There should however be multiple tiles representing the multiple photographic fields within a lane/channel. An expected pattern would be that some readgroups are from the same machine/run/slide/lane but no readgroup comes from more than one
machine/run/slide/lane .
SOP
[0075] This section provides criteria that may be applied to interpret sequencing quality. It may be used in creation of an SOP for BAM file quality assessment.
[0076] First table (see above):
• The "Number of cycles with >1% mismatches" should not be more than one third of the sum of "Average length of first-in-pair reads" and "Average length of second-in-pair reads" [0077] Second table (see above):
• No individual ReadGroup should have more than 30% "Total Bases Lost"
• The "overall" ReadGroup should have less than 20% "Total Bases Lost"
[0078] For human sequencing 25 reference sequences should be shown as having reads aligned to them. These are chromosomes 1-22, X, Y and the mitochondrial sequence (MT).
[0079] For each figure present ("FirstOfPair SEQ Cycles", "SecondOfPair SEQ Cycles"), for each of the Cycles shown:
• the approximate distribution of A, C, G, and T should be 30%, 20%, 20%, 30% for human sequencing
• it is acceptable for the last 10% of cycles to show a reduced number of bases in the cycle as the read quality drops and there is a higher likelihood of clipping by the aligner
• it is acceptable for the base % to vary slightly in the first 12 cycles
• it is acceptable for the C and G percentages to increase slightly in the latter cycles
[0080] 75% of reads should be full length
[0081] 75% of reads should have zero bad bases
[0082] In relation to the distribution of kmers across runs
• The A and T lines should be approximately the same • The G and C lines should be approximately the same
• The A and T lines should be above and separated from the G and C lines
[0083] There may be a significantly less separation between the A/T lines and the G/C lines.
[0084] Fig. 9 illustrates the computer system 102 from Fig. 1 in more detail. The computer system 102 comprises a processor 901 connected to a program memory 902, a data memory 903 and a communication port 905. The program memory 902 is a non- transitory computer readable medium, such as a hard drive, a solid state disk or CD- ROM. Software, that is, an executable program stored on program memory 902 causes the processor 901 to perform the method in Fig. 2, that is, receive sequencing data, align the sequencing data, determine an alignment score (e.g., per cycle) and determine a condition of the sequencer based on the alignment score. The term“determining a condition” refers to calculating a value that is indicative of the condition. This also applies to related terms.
[0085] The processor 901 may then store the condition on data store 903, such as on RAM or a processor register. Processor 901 may also send the determined condition via communication port 905 to a server, such as a sequencer control server.
[0086] The processor 901 may receive data, such as sequencing reads, from data memory 902, data base 904 as well as from the communications port 905. The communication port 905 may be connected to a display that shows a visual representation of the quality scores to a user. In one example, the processor 901 receives and processes the sequencing in real time. This means that the processor 901 determines the quality score every time sequencing data is received from sequencer 101 and completes this calculation before the sequencer 101 finishes the sequencing.
[0087] Although communications port 905 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 901, or logical ports, such as IP sockets or parameters of functions stored on program memory 903 and executed by processor 901. These parameters may be stored on data memory 903 and may be handled by-value or by-reference, that is, as a pointer, in the source code.
[0088] The processor 901 may receive data through all these interfaces, which includes memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage. The computer system 102 may further be implemented within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.
[0089] It is to be understood that any receiving step may be preceded by the processor 901 determining or computing the data that is later received. For example, the processor 901 filters the sequencing data and stores the filtered sequencing data in data memory 903, such as RAM or a processor register. The processor 901 then requests the data from the data memory 903, such as by providing a read signal together with a memory address. The data memory 903 provides the data as a voltage signal on a physical bit line and the processor 901 receives the sequencing data via a memory interface.
[0090] It is to be understood that throughout this disclosure unless stated otherwise, nodes, edges, graphs, solutions, variables, scores and the like refer to data structures, which are physically stored on data memory 903 or processed by processor 901.
Further, for the sake of brevity when reference is made to particular variable names, such as“period of time” or“quantity of movement” this is to be understood to refer to values of variables stored as physical data in computer system 102.
[0091] Further, Fig. 2 is to be understood as a blueprint for the software program and may be implemented step-by-step, such that each step in Fig. 2 is represented by a function in a programming language, such as C++ or Java. The resulting source code is then compiled and stored as computer executable instructions on program memory 902. [0092] It is noted that for most humans performing the method 200 manually, that is, without the help of a computer, would be practically impossible simply because sequencing data is typically large, such as over 100 MB or even close to or over 1 TB. Therefore, the use of a computer is part of the substance of the invention and allows performing the necessary calculations that would otherwise not be possible due to the large amount of data and the large number of calculations that are involved.
[0093] It is further noted that computer system 102 may generate a report including the metrics disclosed herein and including data plots as shown in Figs. 5 to 8. These allow direct and efficient visual inspection by an operator to locate malfunctions if the sequencer quickly. The report may be created as a Word, PDF file or as a browse-able HTML page with multiple tabs, for example, to navigate through the different metrics.
[0094] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the specific embodiments without departing from the scope as defined in the claims.
[0095] It should be understood that the techniques of the present disclosure might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g. RAM) and/or non-volatile (e.g. ROM, disk) memory, carrier waves and transmission media. Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data steams along a local network or a publically accessible network such as the internet.
[0096] It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "estimating" or "processing" or "computing" or "calculating", "optimizing" or "determining" or "displaying" or “maximising” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:
1. A sequencing system comprising:
a sequencer performing an operation on a biological sample to generate sequencing data for the biological sample;
a computer system configured to:
receive the sequencing data for the biological sample;
align the sequencing data to a reference sequence to thereby calculate an alignment score indicative of a quality of the alignment; and
identify a condition of the operation of the sequencer based on the alignment score.
2. The sequencing system of claim 1, wherein the computer system is further configured to disregard sequencing data of a sequencing run in response to identifying that the alignment score across the sequencing data of the sequencing run is below a predefined threshold.
3. The sequencing system of claim 1 or 2, wherein the condition relates to a part of the sequencing data and identifying a condition is based on that part.
4. The sequencing system of any one of the preceding claims, wherein the step of aligning the sequencing data comprises aligning each of multiple reads with the reference sequence and maintaining an association between each of the multiple reads with the sequencing data from the sequencer.
5. The sequencing system of claim 4, wherein identifying a condition of the operation comprises analysing the sequencing data from the sequencer at a position on the reference sequence where multiple reads that are aligned to that position show a similar quality related pattern.
6. The sequencing system of claim 4 or 5, wherein maintaining an association comprises maintaining for each aligned read an association with meta-data of the sequencing data.
7. The sequencing system of claim 6, wherein the meta-data comprises an indication of one or more of:
a cycle number;
a flowcell lane;
a tile number within the flowcell lane;
an‘x’ -coordinate of a cluster represented by the read within the tile;
a‘y’ -coordinate of a cluster represented by the read within the tile;
an index number for a multiplexed sample; and
a member index of a pair for paired-end reads.
8. The sequencing system of claim 6 or 7, wherein identifying the condition comprises identifying common metadata elements for multiple reads that show a similar quality related pattern.
9. The sequencing system of claim 8, wherein identifying the condition comprises identifying one or more common meta-data elements of the multiple reads at the position on the reference sequence where multiple reads that are aligned to that position show a similar quality related pattern.
10. The sequencing system of claim 8 or 9, wherein identifying the condition comprises identifying the condition based on the one or more common meta-data elements.
11. The sequencing system of any one of the preceding claims, wherein the condition is related to an elapsed time within a current run.
12. The sequencing system of claim 11, wherein the elapsed time is indicative of a period of time a chemical has been used during the sequencing.
13. The sequencing system of any one of the preceding claims, wherein the condition is related to a location within the sequencer.
14. The sequencing system of claim 13, wherein the condition is indicative of a defect of the flowcell.
15. The sequencing system of any one of the preceding claims, wherein the computer system is further configured to aggregate the alignment score for each cycle of multiple cycles of the sequencer and the condition is based on the aggregated alignment score.
16. The sequencing system of claim 15, wherein the condition is based on a number of mismatches in each of the multiple cycles.
17. The sequencing system of any one of the proceeding claims, wherein the processor is further configured to automatically adjust the operation of the sequencer based on the condition to improve a quality of the sequencing data.
18. A method for assessing sequencing data, the method comprising:
receiving the sequencing data for the biological sample;
aligning the sequencing data to a reference sequence to thereby calculate an alignment score indicative of a quality of the alignment; and
identify a quality of the sequencing data based on the alignment score.
PCT/AU2020/050382 2019-04-18 2020-04-17 Quality measurement of next generation sequencing reads WO2020210876A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2019901349A AU2019901349A0 (en) 2019-04-18 Quality measurement of next generation sequencing reads
AU2019901349 2019-04-18

Publications (1)

Publication Number Publication Date
WO2020210876A1 true WO2020210876A1 (en) 2020-10-22

Family

ID=72836751

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2020/050382 WO2020210876A1 (en) 2019-04-18 2020-04-17 Quality measurement of next generation sequencing reads

Country Status (1)

Country Link
WO (1) WO2020210876A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270245A1 (en) * 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270245A1 (en) * 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PFEIFER, S. P.: "From Next-Generation Resequencing Reads to a High-Quality Variant Data Set", HEREDITY, vol. 118, February 2017 (2017-02-01), pages 111 - 124, XP055599952, DOI: 10.1038/hdy.2016.102 *
SHENG, Q. ET AL.: "Multi-Perspective Quality Control of Illumina RNA Sequencing Data Analysis", BRIEFINGS IN FUNCTIONAL GENOMICS, vol. 16, no. 4, July 2017 (2017-07-01), pages 194 - 204, XP055748992, DOI: 10.1093/bfgp/elw035 *

Similar Documents

Publication Publication Date Title
Schafer et al. Alternative splicing signatures in RNA‐seq data: percent spliced in (PSI)
Lee et al. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score
Li et al. Phenomics‐based GWAS analysis reveals the genetic architecture for drought resistance in cotton
CN111341383B (en) Method, device and storage medium for detecting copy number variation
DeHaven et al. Software techniques for enabling high-throughput analysis of metabolomic datasets
Zhang et al. SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data
Wang et al. Genomic evidence for homoploid hybrid speciation between ancestors of two different genera
CN111718982A (en) Tumor tissue single sample somatic mutation detection method and device
Marsh et al. Bioinformatic analysis of bacteria and host cell dual RNA-sequencing experiments
Poplin et al. Creating a universal SNP and small indel variant caller with deep neural networks
CN110021355B (en) Haploid typing and variation detection method and device for diploid genome sequencing segment
Kearse et al. The Geneious 6.0. 3 read mapper
CN115691672A (en) Base quality value correction method, base quality value correction device, electronic device and storage medium for sequencing platform features
CN115083521A (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
Singh et al. Next-generation sequencing (NGS) tools and impact in plant breeding
CN107967411B (en) Method and device for detecting off-target site and terminal equipment
Comte et al. PhylteR: efficient identification of outlier sequences in phylogenomic datasets
WO2020210876A1 (en) Quality measurement of next generation sequencing reads
Bayer et al. Exome capture for variant discovery and analysis in barley
Camiolo et al. Altools: a user friendly NGS data analyser
CN114067908B (en) Method, device and storage medium for evaluating single-sample homologous recombination defects
US20160026756A1 (en) Method and apparatus for separating quality levels in sequence data and sequencing longer reads
AU2018391843B2 (en) Sequencing data-based ITD mutation ratio detecting apparatus and method
Söylev et al. CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data
US20230093253A1 (en) Automatically identifying failure sources in nucleotide sequencing from base-call-error patterns

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20791720

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20791720

Country of ref document: EP

Kind code of ref document: A1