US20240004838A1 - Quality score compression for improving downstream genotyping accuracy - Google Patents

Quality score compression for improving downstream genotyping accuracy

Info

Publication number
US20240004838A1
US20240004838A1
Authority
US
United States
Prior art keywords
compression
mer
read
quality
tool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US18/370,299
Inventor
Bonnie Berger Leighton
Deniz Yorukoglu
Yun William Yu
Jian Peng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US18/370,299 priority Critical patent/US20240004838A1/en
Publication of US20240004838A1 publication Critical patent/US20240004838A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/174 Redundancy elimination performed by the file system
    • G06F 16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B 30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B 50/50 Compression of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 99/00 Subject matter not provided for in other groups of this subclass


Abstract

This disclosure provides for a highly-efficient and scalable compression tool that compresses quality scores, preferably by capitalizing on sequence redundancy. In one embodiment, compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in read sequences. The approach exploits the intuition that any divergent base in a k-mer likely corresponds to either a single-nucleotide polymorphism (SNP) or a sequencing error; thus, a preferred approach is to preserve quality scores only for probable variant locations and to compress the quality scores of concordant bases, preferably by resetting them to a default value. By viewing individual read datasets through the lens of k-mer frequencies in a corpus of reads, the approach herein ensures that compression “lossiness” does not affect accuracy in a deleterious way.

Description

    BACKGROUND OF THE INVENTION
    Technical Field
  • This disclosure relates generally to next-generation sequencing (NGS) technologies and, in particular, technologies to store, transmit and process genomic data.
  • Background of the Related Art
  • Advances in next-generation sequencing (NGS) technologies have produced a deluge of genomic information, outpacing even increases in our computational resources. In addition to facilitating the widespread use of NGS data throughout biotechnology, this avalanche of data enables novel, large-scale population studies (e.g., maps of human genetic variation, reconstruction of human population history, and uncovering cell lineage relationships). To fully capitalize on these advances, however, better technologies to store, transmit, and process genomic data need to be developed.
  • The bulk of NGS data typically consists of read sequences, where each base call is associated with a corresponding quality score; these quality scores consume at least as much storage space as the base calls themselves. Quality scores are primarily used, and often essential, for assessing sequence quality, filtering low-quality reads, assembling genomic sequences, mapping reads to a reference sequence, and performing accurate genotyping. Quality scores are a major bottleneck in any sequence analysis pipeline, impacting genomic medicine, environmental genomics, and the ability to find signatures of selection within large sets of closely-related sequenced individuals.
  • At the expense of downstream analysis, biomedical researchers have typically discarded quality scores or turned to compression, which has been moderately successful when applied to genomic sequence data. Quality score compression is usually lossy, meaning that maximum compression is achieved at the expense of the ability to reconstruct the original quality values. Due to the resulting decline in downstream accuracy, such methods are unsuitable for both transmission and indefinite storage of quality scores. To address these limitations, several newly-developed methods exploit sequence data to boost quality score compression, either using alignments to a reference genome or using raw read datasets without reference alignment. Such solutions have not proven satisfactory. In particular, reference-based compression requires runtime-costly whole-genome alignments of the NGS dataset, while alignment-free compression applies costly indexing methods directly to the read dataset.
  • There remains a need for an efficient and scalable method that handles very large (e.g., terabyte-sized) NGS datasets and that addresses the degradation of downstream genotyping accuracy that results from lossy compression.
  • BRIEF SUMMARY
  • Generally, this disclosure provides for a highly-efficient and scalable compression tool that compresses quality scores by capitalizing on sequence redundancy. In one embodiment, compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in the read sequences. The approach exploits the intuition that any divergent base in a k-mer likely corresponds to either a single-nucleotide polymorphism (SNP) or a sequencing error; thus, a preferred approach is to preserve quality scores only for probable variant locations and to compress the quality scores of concordant bases, preferably by resetting them to a default value.
  • In particular, and to achieve scalable analyses, the approach herein preferably takes advantage of redundancy inherent in NGS read data. Intuitively, the more often a read sequence is seen in a dataset, the more confidence one has in its correctness; thus, its quality scores are less informative and useful. For longer read sequences (e.g., >100 bp), however, the probability of a read appearing multiple times is extremely low. For such long reads, shorter substrings (k-mers) can instead be used as a proxy to estimate sequence redundancy. By viewing individual read datasets through the lens of k-mer frequencies in a corpus of reads, the approach herein ensures that compression “lossiness” does not deleteriously affect accuracy.
  • The foregoing has outlined some of the more pertinent features of the disclosed subject matter. These features should be construed to be merely illustrative. Many other beneficial results can be attained by applying the disclosed subject matter in a different manner or by modifying the subject matter as will be described.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
  • FIG. 1a depicts generation of a dictionary of common k-mers in a corpus of NGS reads;
  • FIG. 1b depicts how each read sequence R is broken up into overlapping supporting k-mers;
  • FIG. 1c depicts identification of dictionary k-mers that are within one (for example) mismatch from the supporting k-mers;
  • FIG. 1d depicts how quality scores are smoothed and the scores of all high-quality positions (i.e., bases) are set to a default value;
  • FIG. 1e depicts how the tool of this disclosure fits into an existing genotyping analysis pipeline, preferably as an additional processing step between acquisition of raw reads and mapping and genotyping;
  • FIG. 2a illustrates a first scaled receiver operating characteristic (ROC) curve of genotyping accuracy according to the technique of this disclosure;
  • FIG. 2b illustrates a second scaled receiver operating characteristic (ROC) curve of genotyping accuracy according to the technique of this disclosure; and
  • FIG. 3 depicts a counting algorithm that is useful for generating a dictionary according to this disclosure.
  • DETAILED DESCRIPTION
  • For descriptive purposes, the compression tool of this disclosure is referred to as “Quartz” (“QUAlity score Reduction at Terabyte scale”), but this nomenclature is not intended to be limiting. Typically, the tool is implemented as computer software executing in one or more hardware processors. The computer software may be implemented as one or more programs, processes, instruction sets, execution threads, scripts, or the like.
  • According to this disclosure, the tool preferably compresses quality scores by capitalizing on sequence redundancy. Compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in the read sequences. As is known, the term k-mer typically refers to a substring of length k contained in a string; in computational genomics, the k-mers of a read obtained through DNA sequencing are all of its possible subsequences of length k. As noted above, the technique herein is based on the intuition that any divergent base in a k-mer likely corresponds to either a SNP or a sequencing error; thus, the tool preferably only preserves quality scores for probable variant locations and compresses the quality scores of concordant bases by resetting them to a default value. While k-mer frequencies have been used to infer knowledge about the error content of a read sequence (in known sequence-correction and assembly methods), they have not been used for quality score compression.
  • As seen in FIG. 1a, a dictionary of common k-mers in a corpus of NGS reads is generated. Typically, the dictionary is generated once for any species. At FIG. 1b, each read sequence R is broken up into overlapping supporting k-mers. At FIG. 1c, dictionary k-mers that are within one (for example) mismatch from the supporting k-mers are identified. Preferably, and as indicated at the top, every position different from a dictionary k-mer is annotated as a possible variant unless covered by a dictionary k-mer corresponding to a different supporting k-mer, in which case it will nevertheless be marked for correction. Other covered positions preferably are also marked for correction as high quality. As indicated at the bottom, when two dictionary k-mers correspond to the same supporting k-mer, preferably all mismatches are preserved, unless the mismatch portion is covered by a dictionary k-mer corresponding to a different supporting k-mer as shown (at top). Uncovered bases preferably also are annotated. At FIG. 1d, quality scores are smoothed and the scores of all high-quality positions (i.e., bases) are set to a default value. Scores of uncovered and possible variant loci are kept. At FIG. 1e, the tool is fit into an existing genotyping analysis pipeline, preferably as an additional processing step between acquisition of raw reads and mapping and genotyping.
  • To generate the dictionary, in one embodiment a list of commonly occurring 32-mers in a population-size corpus of sequencing reads is compiled. For the corpus, and for exemplary purposes, 194 FASTQ files taken from the 1000 Genomes Project were used. This data corresponds to paired-end reads from 97 human individuals. Read lengths in the dataset ranged from 91 to 103 bp, with a total depth of coverage of about 700×. To construct the dictionary of k-mers memory-efficiently, a counting algorithm depicted in FIG. 3 is used. This algorithm is a modified Misra-Gries approximate counting algorithm. It uses a collection of hash buckets instead of doubly-linked lists to minimize cache misses. The number of hash buckets is arbitrarily limited. Memory usage is controlled by maintaining an invariant that (for example) the total number of k-mers stored never exceeds a memory-usage parameter m. The approximation error is bounded by a decrement counter D. After partitioning the space of 32-mers into 256 subsets by the identity of the middle 4 bases, the algorithm was run in multiple passes on each of the subsets. By controlling the approximation ratio of the Misra-Gries algorithm, all 32-mers in the dictionary were guaranteed to occur at least 200 times in the corpus, and all 32-mers in the corpus that occurred at least 240 times were included in the dictionary. A simplified sketch of this counting scheme follows.
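  • By way of illustration, the following is a minimal sketch of an approximate heavy-hitters counter in the spirit of the modified Misra-Gries algorithm just described. It is not the actual FIG. 3 implementation (which uses a collection of hash buckets rather than a single table); the class and method names are hypothetical, and k-mers are assumed to be 2-bit-packed 64-bit integers. The memory-usage parameter m and decrement counter D follow the text.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Approximate k-mer counter (Misra-Gries style). Stored counts never
// overshoot the true counts and undershoot by at most D, so with a
// reporting threshold of 200 and D <= 40, every k-mer occurring at
// least 240 times in the corpus is guaranteed to be reported.
class KmerCounter {
 public:
  explicit KmerCounter(std::size_t m) : m_(m) {}

  void add(std::uint64_t kmer) {
    auto it = counts_.find(kmer);
    if (it != counts_.end()) {
      ++it->second;
    } else if (counts_.size() < m_) {
      counts_.emplace(kmer, 1);  // invariant: at most m k-mers stored
    } else {
      // Table full: decrement every count, evict zeros, bump D.
      ++D_;
      for (auto jt = counts_.begin(); jt != counts_.end();) {
        if (--jt->second == 0)
          jt = counts_.erase(jt);
        else
          ++jt;
      }
    }
  }

  // k-mers whose stored (under)count clears the threshold.
  std::vector<std::uint64_t> frequent(std::uint64_t threshold) const {
    std::vector<std::uint64_t> out;
    for (const auto& kv : counts_)
      if (kv.second >= threshold) out.push_back(kv.first);
    return out;
  }

  std::uint64_t decrements() const { return D_; }

 private:
  std::size_t m_;
  std::uint64_t D_ = 0;
  std::unordered_map<std::uint64_t, std::uint64_t> counts_;
};
```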
  • The quality score compression may be carried out as follows. As input, the quality score compression algorithm Quartz requires a dictionary such as described above. Preferably, the dictionary comprises commonly-occurring k-mers extracted from a population-sized read dataset. Preferably, and as described above, the dictionary is designed in such a way that any given read dataset can be mostly covered by these k-mers within a small Hamming distance, or number of mismatches. Using this dictionary, the tool compresses the quality scores in any given read dataset by identifying k-mers from each read within a small Hamming distance from k-mers in the dictionary. Any quality score value corresponding to a position that is concordant with at least one supporting k-mer is set to a default value, whereas the quality score value at any position that is divergent from all supporting k-mers is kept. This coarse representation greatly reduces the storage requirement for the quality scores of read datasets because the smoothed quality score values are substantially more compressible than the original values. An illustrative sketch of this per-read pass follows.
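  • For concreteness, the following is a simplified sketch of the smoothing rule under our reading of the text: a position's score is reset to the default only if some supporting 32-mer covering it has at least one dictionary Hamming neighbor and all such neighbors agree with the read base there (see also the parameter discussion below). The Dictionary type and all names are hypothetical, the neighbor search is brute force here, and the additional annotation cases of FIGS. 1b-1d (e.g., uncovered bases) are elided.

```cpp
#include <string>
#include <vector>

constexpr int K = 32;                                  // k-mer length
constexpr char SMOOTHED = static_cast<char>(33 + 50);  // Phred+33 for the default Q = 50
constexpr char SPECIAL = static_cast<char>(33 + 2);    // special score 2, never modified

// Toy dictionary of common 32-mers with a brute-force Hamming-1 search;
// the real tool uses the hashing scheme described later.
struct Dictionary {
  std::vector<std::string> entries;

  std::vector<std::string> neighbors(const std::string& query) const {
    std::vector<std::string> out;
    for (const std::string& e : entries) {
      int d = 0;
      for (int i = 0; i < K && d <= 1; ++i) d += (e[i] != query[i]);
      if (d <= 1) out.push_back(e);
    }
    return out;
  }
};

// Smooth one read's quality string in place.
void smoothRead(const std::string& seq, std::string& qual, const Dictionary& dict) {
  std::vector<bool> concordant(seq.size(), false);
  for (std::size_t start = 0; start + K <= seq.size(); ++start) {
    const std::string support = seq.substr(start, K);
    const auto nbrs = dict.neighbors(support);
    if (nbrs.empty()) continue;  // this supporting k-mer is uncovered
    for (int i = 0; i < K; ++i) {
      bool allAgree = true;
      for (const auto& n : nbrs)
        if (n[i] != support[i]) { allAgree = false; break; }
      if (allAgree) concordant[start + i] = true;  // concordant under this k-mer
    }
  }
  for (std::size_t p = 0; p < qual.size(); ++p)
    if (concordant[p] && qual[p] != SPECIAL)
      qual[p] = SMOOTHED;  // divergent and special scores are kept as-is
}
```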
  • Although not intended to be limiting, because each read can be independently processed, the tool of this disclosure may be parallelized (e.g., using OpenMP), thereby achieving speedups nearly linear in the number of cores used, up to the I/O bound.
  • Parameter selection for the compression technique may be as follows. Efficiency concerns sufficiently constrain the parameters of k-mer length and Hamming distance that near-optimal choices may be hard-coded. Thus, for example, the k-mer length parameter k is 32 bp, although this is not a limitation, as other lengths may be used. Typically, there are several criteria to consider when selecting the k-mer length within the Quartz quality score compression framework: (1) k-mers should be long enough to ensure that the number of all possible k-mers is much larger than the number of unique k-mers in the genome, so that incidental collisions between unrelated k-mers are rare; (2) because within the Quartz framework k-mer neighbors preferably are defined to be within a Hamming distance of 1, k-mers should also be short enough that the probability of a k-mer containing more than one sequencing error is low (this criterion ensures that k-mer sequences within reads originating from the same genomic region are highly likely to be detected as neighbors); and (3) k-mer length should ideally be a multiple of four, because a 4 bp DNA sequence can be represented by a single byte. A 32 bp k-mer satisfies all of these criteria. In particular, it is represented by a single 64-bit integer (as sketched below), has a relatively low probability of containing more than one sequencing error with Illumina sequences, and results in few k-mer collisions.
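  • The following sketch makes criterion (3) concrete: with the conventional 2-bit base encoding, four bases fit in one byte and a 32-mer packs into a single 64-bit integer. This is a standard encoding offered for illustration (handling of ambiguous 'N' bases and reverse complements is omitted); it is not necessarily Quartz's exact layout.

```cpp
#include <cstdint>
#include <string_view>

// Pack a 32 bp k-mer into one 64-bit integer, 2 bits per base
// (A=0, C=1, G=2, T=3); assumes kmer.size() == 32 and ACGT only.
std::uint64_t pack32mer(std::string_view kmer) {
  std::uint64_t packed = 0;
  for (char base : kmer) {
    std::uint64_t code = 0;  // 'A'
    switch (base) {
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      case 'T': code = 3; break;
    }
    packed = (packed << 2) | code;  // 4 bases per byte, 32 per 64-bit word
  }
  return packed;
}
```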
  • For the experimental results presented, a default replacement value (Q) for smoothed quality scores was selected as 50. In the datasets studied, the average quality score, excluding the special value of 2 (which was not modified), was at least 35. Preferably, and as noted above, Quartz only replaces a quality score within a 32-mer if all Hamming neighbors agree on a position. Thus, assuming that all 32-mers within a read are variants of 32-mers in the dictionary, the method only smooths a quality score incorrectly if there are two read errors: one at the quality score's position and one at one of the other 31 locations. Then, as a first-order approximation, the error probability for a smoothed location is $31 \times 10^{-3.5} \times 10^{-3.5} < 10^{-5}$, justifying the choice of Q=50. This parameter may be adjusted.
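  • Spelled out on the Phred scale (assuming, per the datasets above, per-base error rates of at most $10^{-3.5}$, i.e., quality of at least 35), the first-order bound and the resulting choice of Q read:

```latex
% Both the smoothed position and one of the other 31 positions in the
% 32-mer must carry an error for the smoothing to be incorrect:
P(\text{incorrect smoothing}) \le 31 \times 10^{-3.5} \times 10^{-3.5}
  \approx 3.1 \times 10^{-6} < 10^{-5},
\qquad
Q = -10 \log_{10}\!\left(10^{-5}\right) = 50.
```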
  • The following describes a Hamming neighbor search in one embodiment. As Hamming neighbors are defined as all k-mers in the dictionary that differ from a query k-mer by no more than 1 base substitution, the tool needs to be able to quickly search the dictionary for all 3k+1 potential Hamming neighbors. To this end, a sorted list or hash tables may be used, but such techniques are CPU- or memory-inefficient, and they incur many cache misses. A preferred approach is to use locality-sensitive hashing (LSH). Recall that an $(R, cR, P_1, P_2)$-sensitive LSH family $F$ of hash functions $h: M \to S$ is one in which, for all $p, q \in M$, a uniformly random $h \in F$ satisfies two conditions:

  1. if $\|p - q\| \le R$, then $\operatorname{Prob}(h(p) = h(q)) \ge P_1$; and

  2. if $\|p - q\| \ge cR$, then $\operatorname{Prob}(h(p) = h(q)) \le P_2$.
  • Because the Hamming metric is discrete, a k-mer within Hamming distance 1 of a query differs from it in only one position, so it must match the query exactly under at least one of two projections that split the k-mer into disjoint halves. Thus, by double hashing, all Hamming neighbors can be found by looking in just two hash buckets, one for each hash function, giving much better cache-efficiency. Additionally, as genomic sequences are not chosen in an adversarial manner, instead of using random projections, projecting onto the front and back halves of the k-mer suffices in practice.
  • As the dictionary is static and read-only, both hash tables can be efficiently implemented by sorting lexicographically, first by the key (i.e., the projected half) and then by the rest, saving pointers to the first dictionary entry in each “bucket,” and then discarding the projected half of each k-mer (as all k-mers are only accessed through the bucket pointers, the hash key is implied). This scheme also allows for binary search within each bucket, further speeding up lookups. For the default k=32, an array of length $2^{32}$ of unsigned 32-bit integers is used for the pointers, and an array of unsigned 32-bit integers of length $|D|$, the dictionary size, for the remaining half of the k-mers not specified by the pointers. A simplified illustration follows.
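  • For illustration, a compact sketch of this double-hashing lookup, assuming the 2-bit packing shown earlier. It uses unordered maps keyed by the projected halves rather than the flat sorted arrays with $2^{32}$ bucket pointers described above, and all type and function names are hypothetical.

```cpp
#include <bit>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Number of differing bases between two 2-bit-packed 32-mers.
int baseHamming(std::uint64_t a, std::uint64_t b) {
  std::uint64_t x = a ^ b;
  // Collapse each 2-bit base field to one flag bit, then count flags.
  std::uint64_t diff = (x | (x >> 1)) & 0x5555555555555555ULL;
  return std::popcount(diff);  // C++20
}

// Two hash tables, keyed by the front and back 16-base halves. Any
// k-mer within Hamming distance 1 of a query matches it exactly on at
// least one half, so only two buckets ever need to be scanned.
struct NeighborIndex {
  std::unordered_map<std::uint32_t, std::vector<std::uint64_t>> byFront, byBack;

  void insert(std::uint64_t kmer) {
    byFront[static_cast<std::uint32_t>(kmer >> 32)].push_back(kmer);
    byBack[static_cast<std::uint32_t>(kmer)].push_back(kmer);
  }

  std::vector<std::uint64_t> neighbors(std::uint64_t query) const {
    std::vector<std::uint64_t> out;
    scan(byFront, static_cast<std::uint32_t>(query >> 32), query, false, out);
    scan(byBack, static_cast<std::uint32_t>(query), query, true, out);
    return out;
  }

 private:
  static void scan(
      const std::unordered_map<std::uint32_t, std::vector<std::uint64_t>>& table,
      std::uint32_t key, std::uint64_t query, bool skipExact,
      std::vector<std::uint64_t>& out) {
    auto it = table.find(key);
    if (it == table.end()) return;
    for (std::uint64_t cand : it->second) {
      // An exact hit matches both keys; skip it on the second scan
      // so it is reported only once.
      if (skipExact && cand == query) continue;
      if (baseHamming(cand, query) <= 1) out.push_back(cand);
    }
  }
};
```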
  • Thus, the technique described above provides for a highly-efficient and scalable compression tool that compresses quality scores by capitalizing on sequence redundancy. As described, compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhood of their corresponding positions in the read sequences. The technique preferably exploits the fact that any divergent base in a k-mer likely corresponds to either a SNP or a sequencing error; thus, the tool preferably only preserves quality scores for probable variant locations and compresses the quality scores of concordant bases by resetting them to a default value. Unlike other quality score compression methods, the tool and the approach herein maintain genotyping accuracy while achieving substantially higher compression ratios. Generally, the compression is made possible by Quartz's coarse representation of quality scores, which allows it to store quality scores at roughly 0.4 bits per value (from the original size of 8 bits in FASTQ format).
  • The above-described compression framework, despite eliminating a significant percentage (e.g., even more than 95%) of the quality score information, has been shown to achieve improved genotyping accuracy over the original, uncompressed quality scores, as compared to a trio-validated, gold-standard variant dataset for the NA12878 genome from the GATK “best-practices” bundle. In particular, both the GATK and SAMtools pipelines were applied to the compressed quality scores generated by Quartz on a gold-standard NA12878 read dataset from the 1000 Genomes Project. The genotyping accuracy based on Quartz's compressed data consistently outperforms that based on the uncompressed raw quality scores, as measured by the area under the receiver operating characteristic (ROC) curve. For the experiments, Quartz compression decreases the number of false positives in the million highest-quality variant calls by over 4.5%. In FIG. 2, curves are shown for NA12878 before and after Quartz compression using (a) Bowtie 2 and GATK UnifiedGenotyper (FIG. 2a), and (b) BWA and SAMtools mpileup (FIG. 2b). Accuracy is improved using both variant-calling pipelines. Although this improvement is most pronounced for SNP calls, indel-calling accuracy is also maintained, if not improved, by Quartz compression. This result emerges from the discovery, through the application of Quartz, that quality score values within an NGS dataset are implicitly encoded in the genomic sequence information with 95% redundancy, so they often do not have to be stored separately. This improvement further indicates that the compression technique herein reduces the noise within the raw quality scores, thus leading to better genotyping results.
  • Quartz is highly scalable on large-scale, whole-genome datasets. After a one-time construction of the k-mer dictionary for any given species, quality compression is orders of magnitude faster than read mapping, genotyping, and other quality compression methods. Additionally, Quartz is especially applicable for large-scale cohort-based sequencing projects because its improvements in genotyping accuracy are particularly useful for lower depths of sequencing coverage (e.g., 2×-4×).
  • Quartz alters quality scores and improves downstream genotyping accuracy, in common with existing base quality score recalibration tools that make use of human genome variation from population-scale sequencing. Although currently-available recalibration tools and Quartz both use distilled information from a corpus of genomes, they differ in several key points. Most importantly, quality score recalibration tools do not apply compression; in fact, GATK recalibration tends to greatly increase the amount of storage needed. Quartz avoids losing compressibility, preferably by using a single default replacement quality value. Furthermore, because recalibration tools such as the GATK BaseRecalibrator employ a list of known SNP locations, reads must first be mapped to the reference. As Quartz uses only k-mer frequencies and Hamming distances, it is possible to apply Quartz compression upstream of mapping, which is important when the task at hand is either assembly or mapping itself.
  • Quartz is a scalable, sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets, thereby solving both the problems of indefinite storage and transmission of quality scores. The experimental results further indicate that a significant proportion of quality score data, despite having been thought entirely essential to downstream analysis, are less informative than the k-mer sequence profiles, and thus can be discarded without weakening, and actually improving, downstream analysis. Thus, even with lossy data compression, it is possible to preserve biologically important data.
  • A Quartz compression step can be readily added to nearly any pre-existing NGS data processing pipeline. In one embodiment, and as described, Quartz takes as input a FASTQ file (the standard format for read data) and outputs a smoothed FASTQ file, which can in turn be input into any compression program (e.g., BZIP2 or GZIP) for efficient storage and transmission, or any read mapper (e.g., BWA or Bowtie 2). Further analysis steps such as variant calling (e.g., using SAMTools, GATK) can follow in the usual way.
  • With improvements in sequencing technologies increasing the pace at which genomic data is generated, quality scores will require ever greater amounts of storage space; compressive quality scores thus will become crucial to fully realizing the potential of large-scale genomics. The technique herein establishes that the twin goals of compression and accuracy do not have to be at odds with each other. Although total compression comes at the cost of accuracy, and quality score recalibration generally decreases compressibility, there is a happy medium at which one can get both good compression and improved accuracy. The techniques herein will greatly benefit any researchers who are generating, storing, mapping, or analyzing large amounts of DNA, RNA, ChIP-seq, or exome sequencing data.
  • In summary, the technique described herein provides significant advantages for genomic data analysis. It provides for a tool to compress quality scores in a lossy manner, drastically reducing storage requirements while still preserving biologically important information. The automated approach not only preserves, but also improves, downstream genotyping accuracy. The tool thus provides broad applicability and can be readily incorporated into existing downstream analysis pipelines, at orders of magnitude faster speed than both mapping and variant calling. The tool addresses a critical and currently unmet need for reduced storage, increased transmission, and more efficient processing of next-generation sequencing reads, which will facilitate collaboration, enable even larger datasets, and allow future re-analyses on existing data.
  • While the disclosed subject matter will be described in the context of a method or process, the subject matter also relates to apparatus for performing the operations herein. This apparatus may be a particular machine that is specially constructed for the required purposes, or it may comprise a computer otherwise selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including an optical disk, a CD-ROM, and a magneto-optical disk, a read-only memory (ROM), a random access memory (RAM), a magnetic or optical card, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. A given implementation of the disclosure is software written in a given programming language that runs in conjunction with a proxy on a standard Intel hardware platform running an operating system such as Linux. The functionality may be built into other code, or it may be executed as an adjunct to that code. A machine implementing the techniques herein comprises a processor and computer memory holding instructions that are executed by the processor to perform the above-described methods, functions or operations. Functions described herein may be implemented across multiple machines, processes, programs, systems and the like, co-located or remote from one another.
  • While given components of the system have been described separately, one of ordinary skill will appreciate that some of the functions may be combined or shared in given instructions, program sequences, code portions, and the like.
  • Processing (computation, storage or otherwise) as described herein may be implemented in whole or in part using cloud computing. As is well-known, cloud computing is a model of service delivery for enabling on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. Available service models that may be leveraged in whole or in part include: Software as a Service (SaaS) (the provider's applications running on cloud infrastructure); Platform as a Service (PaaS) (the customer deploys applications that may be created using provider tools onto the cloud infrastructure); and Infrastructure as a Service (IaaS) (the customer provisions its own processing, storage, networks and other computing resources and can deploy and run operating systems and applications). A cloud platform may comprise co-located hardware and software resources, or resources that are physically, logically, virtually and/or geographically distinct. Communication networks used to communicate to and from the platform services may be packet-based, non-packet-based, and secure or non-secure, or some combination thereof.
  • Thus, according to this disclosure, NGS quality scores are smoothed or removed, e.g., using one or more algorithms that determine, based on the sequence data, which scores to remove, while at the same time maintaining processing speed.
  • The techniques herein provide improved technologies to store, transmit and process genomic data.
  • The techniques herein provide for improvements to another technology or technical field, namely, next-generation sequencing (NGS), as well as improvements to the functioning of compression techniques.
  • Having described our invention, what is claimed follows below.

Claims (1)

1. A genotyping analysis pipeline, comprising:
one or more computing systems comprising computer hardware, and computer software stored in memory and executing on the computer hardware, the computer software configured as a set of computer program code, the computer program code comprising:
first program code configured as a read data set acquisition tool;
second program code configured as a genotyping tool; and
third program code configured as a quality score compression tool and positioned between the read data set acquisition tool and the genotyping tool, the quality score compression tool configured to:
receive, as a dictionary, a set of data comprising commonly-occurring k-mers extracted from the read dataset by the read data set acquisition tool;
compress quality scores in a given read dataset by identifying k-mers from each read within a given mismatch distance from other k-mers in the dictionary to generate compressed quality scores; and
provide the compressed quality scores for further processing by the genotyping tool.
US18/370,299 2014-04-26 2023-09-19 Quality score compression for improving downstream genotyping accuracy Pending US20240004838A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/370,299 US20240004838A1 (en) 2014-04-26 2023-09-19 Quality score compression for improving downstream genotyping accuracy

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201461984743P 2014-04-26 2014-04-26
US14/697,114 US10198454B2 (en) 2014-04-26 2015-04-27 Quality score compression for improving downstream genotyping accuracy
US16/267,778 US11762813B2 (en) 2014-04-26 2019-02-05 Quality score compression apparatus and method for improving downstream accuracy
US18/370,299 US20240004838A1 (en) 2014-04-26 2023-09-19 Quality score compression for improving downstream genotyping accuracy

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/267,778 Continuation US11762813B2 (en) 2014-04-26 2019-02-05 Quality score compression apparatus and method for improving downstream accuracy

Publications (1)

Publication Number Publication Date
US20240004838A1 true US20240004838A1 (en) 2024-01-04

Family

ID=58721625

Family Applications (3)

Application Number Title Priority Date Filing Date
US14/697,114 Active 2035-11-16 US10198454B2 (en) 2014-04-26 2015-04-27 Quality score compression for improving downstream genotyping accuracy
US16/267,778 Active 2038-03-13 US11762813B2 (en) 2014-04-26 2019-02-05 Quality score compression apparatus and method for improving downstream accuracy
US18/370,299 Pending US20240004838A1 (en) 2014-04-26 2023-09-19 Quality score compression for improving downstream genotyping accuracy

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US14/697,114 Active 2035-11-16 US10198454B2 (en) 2014-04-26 2015-04-27 Quality score compression for improving downstream genotyping accuracy
US16/267,778 Active 2038-03-13 US11762813B2 (en) 2014-04-26 2019-02-05 Quality score compression apparatus and method for improving downstream accuracy

Country Status (1)

Country Link
US (3) US10198454B2 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013096620A1 (en) * 2011-12-20 2013-06-27 Baym Michael H Compressing, storing and searching sequence data
GB2543068A (en) * 2015-10-06 2017-04-12 Fonleap Ltd System for generating genomics data, and device, method and software product for use therein
WO2019076177A1 (en) * 2017-10-20 2019-04-25 人和未来生物科技(长沙)有限公司 Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
US20190172553A1 (en) * 2017-11-08 2019-06-06 Koninklijke Philips N.V. Using k-mers for rapid quality control of sequencing data without alignment
CN108470113B (en) * 2018-03-14 2019-05-17 四川大学 Several species do not occur the calculating of k-mer subsequence and characteristic analysis method and system
US20200058379A1 (en) * 2018-08-20 2020-02-20 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Compressing Genetic Sequencing Data and Uses Thereof
CN110066862B (en) * 2019-05-22 2021-02-12 中南大学 Repeated DNA sequence identification method based on high-throughput sequencing reading
DE102021100199A1 (en) 2020-01-12 2021-07-15 Gottfried Wilhelm Leibniz Universität Hannover, Körperschaft des öffentlichen Rechts Method for compressing, method for decompressing information from hereditary information, electronic storage device and data processing system
BR112022024127A2 (en) * 2020-09-15 2023-03-28 Illumina Inc ACCELERATED GENOMIC READING MAPPING BY SOFTWARE
CN113496762B (en) * 2021-05-20 2022-09-27 山东大学 Biological gene sequence summary data generation method and system

Also Published As

Publication number Publication date
US10198454B2 (en) 2019-02-05
US11762813B2 (en) 2023-09-19
US20170147597A1 (en) 2017-05-25
US20190171625A1 (en) 2019-06-06


Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION