WO2019169117A1 - Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets - Google Patents

Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets Download PDF

Info

Publication number
WO2019169117A1
WO2019169117A1 PCT/US2019/020024 US2019020024W WO2019169117A1 WO 2019169117 A1 WO2019169117 A1 WO 2019169117A1 US 2019020024 W US2019020024 W US 2019020024W WO 2019169117 A1 WO2019169117 A1 WO 2019169117A1
Authority
WO
WIPO (PCT)
Prior art keywords
reads
sequence
reference sequence
genome
remainder
Prior art date
Application number
PCT/US2019/020024
Other languages
French (fr)
Inventor
Scott C. Blanchard
Matthew M. PARKS
Original Assignee
Cornell University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cornell University filed Critical Cornell University
Publication of WO2019169117A1 publication Critical patent/WO2019169117A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12NMICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09Recombinant DNA-technology
    • C12N15/10Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034Isolating an individual clone by screening libraries
    • C12N15/1089Design, preparation, screening or analysis of libraries using computer algorithms
    • CCHEMISTRY; METALLURGY
    • C40COMBINATORIAL TECHNOLOGY
    • C40BCOMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00Libraries per se, e.g. arrays, mixtures
    • C40B40/04Libraries containing only organic compounds
    • C40B40/06Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • Repetitive sequences are patterns of nucleic acids (DNA or RNA) that occur in multiple copies throughout the genome. Repetitive sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next- generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed.
  • ribosomal DNA regions in mammals for instance - for which there may be 20 to 2000 distinct rDNA operon elements within each individual (mouse or human) that exhibit related, but unknown sequences - there is no existing product specifically designed for or capable of detecting rDNA sequence variants in the encoded operons or the ribosomal RNA (rRNA) expressed from them nor estimating their copy number.
  • Human and mouse genomes encode hundreds of copies of the rDNA operon, which are arranged in tandem arrays on five chromosomes (chromosomes 13, 14, 15, 21, 22 in human; chromosomes 12, 15, 17, 18, 19 in mouse).
  • Each rDNA operon encodes a pre-rRNA that is post-transcriptionally processed into the 18S rRNA of the small nbosomai subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit.
  • Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively.
  • sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities.
  • enhancer elements in the rDNA intergenie spacers determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycledependent expression patterns and cellular differentiation.
  • IGS rDNA intergenie spacers
  • Ribosomopathies are diseases arising from aberrations m the assembly, composition, or function of the ribosome, the two-subunit RNA-protein complex responsible for cellular protein synthesis (K. L McCann, et al., Science , (2013) 341, 849- 850). Varied and tissue-specific disease phenotypes associated with the perturbation of ribosomal proteins have inspired hypotheses of functionally distinct ribosome sub- populations in the cell (S. Xue, et al, Nat Rev Mol Cell Biol., (2012) 13, 355-369; Z. Shi, et al. , Annu Rev Cell Dev Biol., (2015) 31, 31-54; E. W. Mills, et al., Science., (2017)
  • rRNA ribosomal RNA
  • rDNA ribosomal DNA
  • Ribosomeassociated proteins the "ribo-interaetome” have been implicated in ribosome-mediated impacts on gene expression (D. Simsek et al. Cell, (2017), 169, 1051-1065. el 8). While physical heterogeneities arising from ribosomal protein mutations and ribosomal protein and ribosome-associated protein stoichiometry have recently drawn significant attention (M. L. Holland et al., Science, (2016), 353, 495- 498; A. I.
  • Each rDNA operon encodes a 47S pre-rRNA that is post-transcriptional!y processed into the 18S rRNA of the small ribosomal subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit.
  • Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively (FIG. 1 A) (D. M. Stults, et al, Genome Res., (2008), 18, 13-18).
  • sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities (J. G. Gibbons, et al, Proc Natl Acad Sci USA., (2015), 112, 2485-2490; B. A. Kuo, et al, Nucleic Acids Res., (1996),
  • enhancer elements in the rDNA intergenic spacers determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycle-dependent expression paterns and cellular differentiation (S. Zhang, et al PLoS ONE, (2007), 2, e902; R. Santoro, et al, EMBO Rep., (2010), 11, 52- 58).
  • Evidence of altered rDNA operon expressi on patterns in the lifecycles of both bacteria and parasites (E. W. Mills, et al, Science., (2017) 358) argue that such control mechanisms may be evolutionarily conserved.
  • [QQQ7] The notion that ribosomes are homogeneous and consequently passive players in the protein synthesis mechanism (K. L. McCann, et a!., Science (2013) 341, 849-850; A.
  • An aspect of the present disclosure is directed to a processor programmed to perform:
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
  • the processor is further programmed to perform:
  • the processor is further programmed to perform
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
  • the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
  • the present disclosure is directed to a computer-readable storage device, comprising instructions to perform:
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to perform:
  • the present disclosure is directed to a computer-readable storage device, further comprising instructions to perform determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads
  • the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used m generating the first reference sequence
  • Another aspect of the disclosure is directed to a method comprising
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity:
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the method further comprises performing:
  • the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
  • the set of high-throughput sequencing reads comprises DNA sequencing reads (e.g., DNA-seq).
  • RNA sequencing reads e.g , RNA-seq.
  • the contiguous stretch of the reference sequence is at least 25 nucleotides long.
  • the contiguous stretch of the reference genome is at least 25 nucleotides long.
  • the first reference sequence is a ribosoma! sequence.
  • the ribosomal sequence comprises a 47S/45S rDNA prototype sequence.
  • the first reference sequence is selected from the group consisting of NCBI reference sequences NR_G03278 3, NR_003279. l, NR__0Q3280.2, and NR_030686.1.
  • the first reference sequence is selected from the group consisting of NCBI reference sequences NR_003285.2, NR_003286.2, NR_003287.2, and NR_023379.1.
  • the first reference sequence is selected from the group consisting of GenBank IDs U13369.1 and X12811.1.
  • the first reference sequence is selected from the group consisting of BK000964.3 and Mus mus cuius chromosome 8 region [123538334, 123539354] [0029]
  • the second reference sequence comprises a sequence generated by sequencing ribosomes.
  • the organism is a prokaryote.
  • the organism is a mammal .
  • the mammal is selected from the group consisting of Homo sapiens, Mus mus cuius , and Ralius norvegicus.
  • FIGS. 1A - IB rDNA operons in the human genome.
  • A Chromosomal locations of rDNA operons and organization of the rRNA genes within the 47 S rDNA operon m the human genome. Together with the 5S rRNA, the IBS, 5.8S, and 28S rRNA form the RNA core of the ribosome
  • B Per-individual rDNA copy number estimation in humans, grouped by population.
  • FIGS. 2A - 2B High frequency genomic rRNA variants detected in human.
  • FIGS. 3.4 - 3D rRNA variants with population-stratified intra-individual AF.
  • B -(D). Tertiary structures of the ribosome showing strongly population-stratified variants (Vst>0.5) with high intra-individual AF (>20%) in any individual analyzed that are located (B) in expansion segments (ES6S and ESS) in the surface near ribosoma!
  • Ribosome tertiary structures show rRNA (tan), ribosomal proteins (green), and key structural landmarks.
  • FIGS. 4A - 4C Evolutionary conserved rRNA variants in functionally important centers of the human ribosome.
  • A 18S C543U variant (red) of helix hl6 contacts DHX29 (yellow), a component of translation initiation.
  • B 18S G480A variant (red) of helix h5 occurs near contact points with residues 256-260 within Domain 2 (yellow) of eEFlA (blue).
  • C 28S G1764A variant (red) of helix H38 in the central protuberance.
  • the structural models shown are based on EMD-3056-3058(45), PDB IDs 5LVS (B E.
  • FIG. 5 Tissue-specific expression of rRNA variants.
  • rRNA variant expression heatmap and hierarchical clustering of the 26 variants detected to be differentially expressed among pairs of tissues.
  • Each row represents a biological replicate. Rows are grouped by tissue source (3 biological replicates, i.e., rows, per tissue source).
  • Each column represents an rRNA variant. Expression is normalized per rRNA variant (i.e., by column), across all replicates and tissues (i.e., 12 samples per each column). For example, the rRNA variant represented by the leftmost column has higher relative expression in brain, while the variant represented by the rightmost column has lowest relative expression in liver.
  • FIGS. 6A - 61 Bioinformatics strategy for rRNA copy number estimation and variant discover ⁇ ' .
  • A The rDNA prototype and RNA-seq reads from actively translating ribosomes are used to identify whole genome sequencing (WGS) reads putatively generated from rDNA regions by
  • B computational hybridization, wherein paired-end reads are selected if at least one of the mates contains a contiguous stretch of 30 nt of perfect identity to the prototype or any RNA-seq read.
  • C Candidate rRNA WGS reads are then aligned to the reference genome and the rDNA prototype, separately, and (D) only paired-end reads which do not have better alignments to the reference are retained.
  • FIG. 7 Block diagram of the system in accordance with the aspects of the disclosure.
  • CPU Central Processing Unit
  • FIG, 8 Flow- chart of an embodiment identifying repetitive sequence reads among high throughput sequencing reads.
  • FIG, 9. Flowchart of embodiments for detecting rare sequence variants, and for determining the number of copies a repetitive sequence has in the genome.
  • base quality score refers to per-base estimates of error emitted by the sequencing machines, which express how confident the machine was that it called the correct base each time. For example, if the machine reads an A (Adenosine) nucleotide, and assigns a quality score of Q20— in Phred-scale, which means it's 99% sure it identified the base correctly. This means that one ca expect it to be wrong in one case out of 100: so if there are several billion base calls (one gets ⁇ 90 billion in a 3 Ox genome), at that rate the machine would make the wrong call in 900 million bases.
  • A Addenosine
  • base quality score recalibration refers to a process in which machine learning is used to model the per-base estimates of error emitted by the sequencing machines empirically and to adjust the quality scores accordingly. For example, for a given run, it may be identified that whenever two A nucleotides are called in a row, the next base called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. Adjusting the quality scores is done over several different covariables (e.g., mainly sequence context and position in read, or cycle) in a way that is additive. As a result, the same base may have its quality score increased for one reason and decreased for another.
  • covariables e.g., mainly sequence context and position in read, or cycle
  • Base calls themselves are not corrected, i.e., one can't determine whether that low-quality A should actually have been a T but recalibration allows the variant caller more accurately hotv far it can trust that A.
  • computer readable medium refers to a computer readable storage device or a computer readable signal medium.
  • a computer readable storage device may be, for example, a magnetic, optical, electronic, electromagnetic, infrared, or
  • the computer readable storage device is not limited to these examples except a computer readable storage device excludes computer readable signal medium.
  • Additional examples of the computer readable storage device can include: a portable computer diskette, a hard disk, a magnetic storage device, a portable compact disc read-only memory (CD-ROM), a random access memory (RAM), a read-only memory' (ROM), an erasable programmable read-only memory' (EPROM or Flash memory ), an optical storage device, or any appropriate combination of the foregoing; however, the computer readable storage device is also not limited to these examples. Any tangible medium that can contain, or store, a program for use by or in connection with an instruction execution system, apparatus, or device could be a computer readable storage device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, such as, but not limited to, m baseband or as part of a carrier wave.
  • a propagated signal may take any of a plurality of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium (exclusive of computer readable storage device) that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • GATK Best Practices refers to step-by-step recommendations from the Broad Institute for performing variant discovery' analysis in high-throughp t sequencing (HTS) data.
  • HTS high-throughp t sequencing
  • the Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencin machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses.
  • high homology' refers to a sequence that demonstrates a high level of sequence identity to another given sequence.
  • “high homolog ⁇ '” refers to at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95% or at least 99% sequence identity.
  • two sequences ith “high homology" are substantially complementary' to each other.
  • nucleic acid fragment is capable of hybridizing to at least one nucleic acid strand or duplex even if less than ail nucleobases do not base pair with a counterpart nucleobase.
  • a“substantially complementary’' nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 77%, 8%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double stranded nucleic acid molecule during hybridization.
  • mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another.
  • sequence alignments of proteins the degree of similarity between amino acids occupying a particular position in the sequence can he interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages.
  • DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
  • high throughput sequencing refers to sequencing technologies having increased throughput as compared to the traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands or millions of relatively short sequence reads at a time.
  • next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
  • next generations sequencing methods include, but are not limited to, pyrosequencing as used by the GS Junior and GS FLX Systems (454 Life Sciences, Bradford, Conn.); sequencing by synthesis as used by Miseq and Solexa system (Illumina, Inc., San Diego, Calif.); the SOLiDTM (Sequencing by Oligonucleotide Ligation and Detection) system and Ion Torrent Sequencing systems such as the Personal Genome Machine or the Proton Sequencer (Thermo Fisher Scientific, Waltham, Mass.), and nanopore sequencing systems (Oxford Nanopore Technologies, Oxford, united Kingdom).
  • the term "local indel realignment” refers to a process to locally realign reads such that the number of mismatching bases is minimized across all the reads.
  • a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual's genome with respect to the reference genome.
  • Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs.
  • read mapping algorithms operate on each read independently, it is impossibl e to place reads on the reference genome such at mismatches are minimized across all reads.
  • Local realignment around indels allows correction of mapping errors made by genome aligners. Local realignment makes read alignments more consistent in regions that contain indels. Genome aligners can only consider each read independently, and the scoring strategies they use to align reads relative to the reference limit their ability to align reads well in the presence of indels. Depending on the variant event and its relative location within a read, the aligner may favor alignments with mismatches or soft-clips instead of opening a gap in either the read or the reference sequence. In addition, the aligner's scoring scheme may use arbitrary tie-breaking, leading to different, non- parsimonious representations of the event in di fferent reads. In contrast, local realignment considers all reads spanning a given position. This makes it possible to achieve a high- scoring consensus that supports the presence of an indel event. It also produces a more parsimonious representation of the data in the region.
  • mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly.
  • the term "memory” as used herein comprises program memory and working memory.
  • the program memory may have one or more programs or software modules.
  • the working memory stores data or information used by the CPU in executing the functionality described herein.
  • processor may include a single core processor, a multi-core processor, multiple processors located m a single device, or multiple processors in wared or wireless communication with each other and distnaded over a network of devices, the Internet, or the cloud.
  • functions, features or instructions performed or configured to be performed by a '‘processor” may include the performance of the functions, features or instructions by a single core processor, may include performance of the functions, features or instructions collectively or collaborative!) by multiple cores of a multi-core processor, or may include performance of the functions, features or instructions collectively or coliaboratively by multiple processors, where each processor or core is not required to perform every function, feature or instruction individually.
  • the processor may be a CPU (central processing unit).
  • the processor may comprise other types of processors such as a GPU (graphical processing unit).
  • the processor may be an ASIC (application-specific integrated circuit), analog circuit or other functional logic, such as a FPGA (field-programmable gate array), PAL (Phase
  • PLA programmable logic array
  • the CP U is configured to execute programs (also described herein as modules or instructions) stored in a program memory to perform the functionality described herein.
  • the memory may be, but not limited to, RAM (random access memeory), ROM (read only memory') and persistent storage.
  • the memory is any piece of hardware that is capable of storing information, such as, for example without limitation, data, programs, instructions, program code, and/or other suitable information, either on a temporary basis and/or a permanent basis.
  • terminal read quality reduction refers to reducing the base quality scores for the 8 terminal nucleotides of both ends of ever) ' read to Q0 (meaning 0% accuracy, 100% error rate).
  • variants refer to alternative forms of a gene, genetic locus or gene sequence. Each variant has a distinct nucleic acid sequence at the locus of interest. For example, the inventors have discovered 1,790 variant alleles were detected at 1,662 positions of the 7,184 nucleotides (23%) in human 5S, 5.8S, 18S and 28S rDNA.
  • a variant of a gene or DNA sequence can show a sequence identity , e.g., 60%, 65%, 70%, 75%, 78%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 95%, 97%,
  • Sequence identity refers to the percent of exact matches between the nucleic acids of two sequences which are being compared.
  • the processor, the computer-readable storage device or the method of the present disclosure (“the technolog ⁇ ' described herein") are applied to discover sequence diversity, sequence variations and to estimate copy number of repetitive sequences within any multi-copy gene or genomic region.
  • the technology described herein entails the use of a "prototype" template sequence ("a reference sequence") for a specific gene type of interest that is then used to robustly and reliably quantify the subset of high-throughput whole genome sequencing (WGS) reads corresponding to this region of genomic DNA using a rigorous computational hybridization approach.
  • a reference sequence a "prototype” template sequence
  • WGS whole genome sequencing
  • the gene type of interest is ribosomal DM A.
  • multi-copy genes, genomic regions or repetitive sequences used in the technology described herein include ribosomal DNA, centromeric DNA, telomeric DNA, the human leukocyte antigen (HLA) complex of genes, the major histocompatibility complex (MHC) of genes, segmental duplications (Bailey, Jeffrey A., et al., Science , 297.5583 (2002): 1003-1007), and retrotransposons such as Alu, LI, Bl, LINE, and SINE elements.
  • HLA human leukocyte antigen
  • MHC major histocompatibility complex
  • retrotransposons such as Alu, LI, Bl, LINE, and SINE elements.
  • the technology described herein is used to rapidly assess the diversity of species present in a mixed biological sample by extracting and analyzing WGS reads from the mixture and computationally mapping these reads to a prototype (reference) sequence.
  • the technology' described herein is applied to the assessment of the diversity of rDNA operons present within a microhiome sample to rapidly characterize the gut flora of a patient.
  • the technology described herein is used to assess the diversity' of sequence variations in expressed RNA transcripts arising from RNA editing of post-transcriptional modifications - coming from a distinct region of the genome.
  • the rDNA sequence variants identified by the disclosed technology' are used to distinguish local populations or ethnic groups and to predict the ancestry' of an individual using sequencing data from a biological sample.
  • the discovery' and identification of sequence variants in rDNA with the disclosed technology is used to screen, diagnose, or predict the onset, progression, severity, life expectancy, or general health of an individual.
  • the high-throughput sequencing method used is DNA sequencing (DNA-seq). In some embodiments the high-throughput sequencing used is RNA sequencing (RNA-seq).
  • Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied or stored in a computer or machine usable or readable medium, or a group of media which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine.
  • a progra storage device readable by a machine e.g., a computer readable medium, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure is also provided.
  • the present disclosure includes a system comprising a CPU, a display, a network interface, a user interface, a memory, a program memory and a working memor' (FIG. 7), where the system is programmed to execute a program, software, or computer instructions directed to methods or processes of the instant disclosure.
  • the organism used in the present disclosure is selected from the kingdoms consisting of Monera (bacteria). Protista, Fungi, Plantae, and Animalia. In a specific embodiment, the organism is a member of the kingdom Animalia selected from the group consisting of a human ⁇ Homo Sapiens ), a macaque (Macaca fuscaia), a horse ( Equus cahallus), a cow ( Bos taurus), a sheep (Ovis aries), a dog ( Canis lupus familiar is), a cat (Felis cams), a mouse (Mus rnusculus), and a rat ( Raitus norvegicus). In some embodiments, the organism is a bacteria.
  • the reference genome used in the present disclosure is the genome of the organism on which the high-throughput sequencing was performed.
  • At least one sequence is designated as the predefined reference sequence (aka. "prototype sequence") (labeled as "Input 2: First reference sequence” in FIG. 8).
  • the predefined reference sequence displays high homology with the putatively existing copies of the gene/a!le!e/genetic element.
  • the sequence chosen to be the prototype can be any one of the alleles/copies of the gene or genetic element from any individual, or it can be any of the alleles/copies appearing in a reference genome or other genome assembly, or it can be a generalization such as consensus sequence. As an increasing number of genome sequences become available and as the depth and quality of genome sequencing data for repetitive sequences improves, the prototype may be updated and improved as well.
  • the repetitive sequence is rDNA and the prototype comprises published rDNA sequences (e.g , Genbank ID U13369, or sequences in Gonzales et al., Genomics, 1995 May 20; 27(2): 320-8) or corresponding rRNA sequences.
  • the prototype comprises rDNA sequence reads generated from the sequencing of pure ribosomes or polysomes.
  • the organism is mouse (Mus musculus) and the rDNA reference sequence is selected from the group consisting of CBI reference sequences NR 003278.3, NR 003279.1 , NR 003280.2, and NR 030686.1.
  • the mouse rDNA reference sequence is selected from the group consisting of genomic region BK000964.3 and chromosome 8 region [123538334, 123539354]
  • the organism is human ( Homo sapiens) and the rD A reference sequences are selected from the group consisting of NCBI reference sequences NR 003285.2, NR 003286.2, NR 003287.2. NR 023379.1
  • the human rDNA reference sequence is selected from the group consisting of genomic region U13369.1 and Xl2811. l.
  • the process outlined in FIG. 8 begins by identifying reads in a high-throughput sequencing data (which comprise a plurality of reads) that map to a predefined reference sequence (a "prototype" - as described above) with a contiguous stretch of 100% sequence identity (See FIG. 8).
  • a predefined reference sequence a "prototype" - as described above
  • the contiguous stretch of 100% sequence identity comprises at least a stretch of contiguous 20
  • nucleotides 21 nucleotides, 22 nucleotides, 23 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27, nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 35 nucleotides, or 40 nucleotides that map to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the identification of reads that map to a predefined reference sequence with a contiguous stretch of 100% sequence identity is achieved by using the command "-mumreference" from the software package "MUMmer” (Kurtz et al, Genome Biol. 2004; 5(2):R12).
  • sequence mapping is achieved by the sequence aligner "bowtie2" with the "— end-to-end” option. In some embodiments, mapping is achieved by the sequence aligner “BWA.” In some embodiments, mapping is achieved by a sequence aligner selected from the group consisting of bowtie2, BWA, mrFast & mrsFast, and novoAlign.
  • some reads that map to the reference sequence may have been generated from non-repeiitive regions of the genome. Some non-repetitive regions of the genome may display homology/sequence identity to subregi ons of the reference sequence (prototype) representing the repetitive region.
  • This procedure includes mapping reads to the non-repetitive regions of the published genome sequence of the organism and to the reference sequence, separately, and discarding a read if it maps to the non-repetitive regions of the genome with fewer errors than to the reference sequence (See FIG. 8). In some embodiments, a read is discarded if it maps to the non-repetiti ve regions of the genome with smaller edit distance.
  • the total edit distance of an aligned pairs is calculated as the sum of the edit distance of each mate, and only concordant alignments are considered.
  • regions of the reference genome with significant homology to the reference sequence are masked by aligning contiguous subsequences of the reference sequence to the reference genome and masking regions which have > 75% homology to the reference sequence.
  • regions of the reference genome which significant homology to the reference sequence are not considered for evaluating if a read maps better to the non-repetitive regions of the genome than to the reference sequence.
  • read sequences that map to the reference sequence, and not to the non- repetitive regions of the genome represent "remainder" reads (aka. candidate reads).
  • a "local indel realignment” is performed to obtain a realignment of the remainder reads (See FIG. 9).
  • local indel realignment locally realigns reads such that the number of mismatching bases is minimized across all the reads.
  • indel realignment is performed using the Genome analysis Toolkit (GATK) software package (DePristo et a!., Nat Genet , 201 1 May; 43(5):49l-8).
  • the GATK software package is used to perform the "IndelRealigner” function.
  • the LoFreq software package is used for indel realignment.
  • a "base quality score recalibration" is performed after the realignment (See FIG. 9).
  • the base quality score recalibration involves using machine learning algorithms to model the per-base estimates of error emitted by the sequencing machines empirically and adjusting the quality scores accordingly.
  • base quality score recalibration is performed using the GATK software package, where the recalibration table is obtained by aligning reads to the reference genome and following the GATK Best Practices (DePristo et al., Nat Genet., 201 1 May; 43(5):491 ⁇ 8).
  • base quality score recalibration is performed using the“Pr tlleads” function of the GATK software package.
  • the base quality score recalibration is achieved using software selected from the group consisting of GATK, BBMap and FreeBayes.
  • a "terminal read quality reduction" is performed after the base quality score recalibration (See FIG. 9).
  • the base quality scores of 6 terminal read nucleotides, at both ends of every read are reduced to Q0 (Phred scale).
  • the base quality scores of 7 terminal read nucleotides, at both ends of every read are reduced to Q0.
  • the base quality scores of 8 terminal read nucleotides, at both ends of ever ⁇ ' read are reduced to Q0.
  • the base quality scores of 9 terminal read nucleotides, at both ends of every read are reduced to Q0
  • the base quality scores of 10 terminal read nucleotides, at both ends of every read are reduced to Q0.
  • a "detection of rare sequence variants” is performed on the reads that have gone through terminal read quality reduction (See FIG. 9).
  • the detection of rare sequence variants is achieved by LoFreq (Wilm et af, Nucleic Acids Res., 2012 Dec; 40(22): 11189-201), a software for detecting rare sequence variants from a heterogeneous cell population in non-repetitive regions of the genome.
  • the detection of rare sequence variants is achieved by achieved by variant calling algorithms selected from LoFreq, samtools, GATK, and SOAPsnp.
  • Another aspect of the disclosure includes determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence (See FIG. 9)
  • the number of reads expected to map to the reference sequence is determined by calculating GC-content specific fragmentation rates.
  • an estimation of insert size L is computed.
  • L refers to the median insert size estimated from all concordantly aligned paired-end reads.
  • L is estimated empirically from the median insert size of all properly paired paired-end reads mapped to the genome.
  • insert sizes are estimated from library fractionation protocol parameters.
  • the insert size is estimated by the median insert size as computed by PICARD tool "CollectlnsertSizeMetrics.”
  • known repetitive regions or regions that may have experienced structural variation should be excluded.
  • regions can be identified as those regions with structural variation as reported by the 1000 Genomes Project or Mouse Genomes Project. Repeats such as segmental duplications (Bailey, Jeffrey A., et al, Science , 297.5583 (2002): 1003-1007) constitute repeats which should be excluded.
  • the fragmentation rate is calculated as follows.
  • the per- position GC-content of the hypothetical fragment of length L at p to be the number of guanine (G) and cytosine (C) nucleotides in the region j p, p+L-lJ.
  • G guanine
  • C cytosine
  • the estimation of a copy number is achieved by comparing the number of observed reads mapping to the prototype to the number of expected reads as estimated by the GC-content specific fragmentation rates.
  • the expected number of fragments generated at p i.e. with 5 end at p
  • G(p) gives the GC-content of the region [p, p+L-1].
  • the division by 2 is necessary for diploid organisms where the genome assembly aligned to is haploid.
  • this formula one can estimate the copy number of any region of the genome or the copy number of any repetitive element using a prototype. Namely, given a region or prototype R with observed fragment count X, then the copy number C of R is estimated as
  • the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to the average coverage of reads across the genome. In some embodiments, the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to an estimate of coverage derived from selected regions of the genome, such as the exons of single-copy genes.
  • the disclosure is directed to a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads.
  • a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
  • the processor is programmed to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the processor is further programmed to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
  • the processor is further programmed to perform a base quality score recalibration on the remainder reads after the realignment.
  • the processor is further programmed to perform a terminal read quality reduction on the remainder reads after the base quality score recalibration.
  • the processor is further programmed to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the disclosure provides a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by- excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a processor programmed to detect the number of copies of a repetitive sequence in a genome.
  • a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the processor is programmed to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27, nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the processor is further programmed to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC -con tent specific fragmentation rate analysis of the remainder reads.
  • the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity .
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
  • the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor causes the processor to detect rare sequence variants in repetitive sequences.
  • a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
  • the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identify comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
  • the computer-readable storage device further comprises instructions to perform a base quality score recalibration on the remainder reads after the realignment.
  • the computer-readable storage device further comprises instructions perform a terminal read quality reduction on the remainder reads after the base qualify score recalibration.
  • the computer-readable storage device further comprises instructions to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the disclosure provides a computer-readable storage device comprising instructions to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rR A read identification by minimal perfect homology, and comparison between alignments to the non -rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor causes the processor to detect the number of copies of a repetitive sequence in a genome.
  • a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
  • the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity.
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
  • the disclosure is directed to a method for detecting rare sequence variants in repetitive sequences.
  • the method comprises mapping a set of high- throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
  • the method comprises identifying candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identify.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
  • the method further comprises performing a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads. [00127] In some embodiments, the method further comprises performing a base quality score recalibration on the remainder reads after the realignment.
  • the method further comprises performing a terminal read quality reduction on the remainder reads after the base quality score recalibration.
  • the method further comprises detecting rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the discl osure provides a method of detecting rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences fro high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a method of detecting the number of copies of a repetitive sequence in a genome.
  • the method comprises performing mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the method comprises identifying candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27, nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence winch maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the mul tiple copies of the first reference sequence in the genome of the organism.
  • the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
  • the method further comprises mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity.
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
  • Mouse mammary epithelial cells were obtained from ATCC and were grown in T225 flasks (Corning) in high glucose (4 5 g/L) DMEM (Gibco) supplemented with 10% fetal bovine serum (Atlanta Bio!ogieals), 1%
  • HEK293T cells penicillin/streptomycin (Gibco), and insulin (10 pg/mL) HEK293T cells were grown in the same manner and media with the exception of insulin . Both cell lines were routinely checked for mycoplasma using the MycoAlertTM Mycoplasma Detection Kit (Lonza). To stabilize polysomes prior to harvest, cells were treated with 350 mM cycloheximide (Sigma Aldrich) for 30 minutes at 37 °C. Cells were then released from the surface using 0 05% Trypsin-EDTA (Gibco), centrifuged at 750 rpm for 5 min.
  • cycloheximide Sigma Aldrich
  • lysis buffer (1 g cells/mL lysis buffer) containing 20 mM Tri-HCl (pH :::: 7.5), 5 mM MgC12, 10 mM KC1, 1 mM DTT, 5 mM putrescine, 350 mM cycloheximide (Sigma Aldrich),
  • Clarified lysate was then loaded onto pre-cooled 10-50% sucrose gradients buffered in the same manner as the lysis buffer, but without RNase inhibitors, protease inhibitors, or DNase.
  • Sucrose gradients 'ere spun at 30k rpm for 3 h. at +4 °C m an SW32 rotor m an OptimaTM XPN-100 ultracentrifuge (Beckman Coulter). Gradients were then fractionated using a BR-188 density gradient fractionation system (Brandel). Fractions corresponding to polysomes were collected and pelleted at 35.7k rpm for 18 h.
  • NMuMG polysomes isolated as described above m biological triplicate, were denatured, reduced, and alkylated followed by proteolytic digestion with LysC (Wako Chemicals) and trypsin (Promega). Approximately 120 femtomol of each desalted (S. Shao et al., Cell., (2016), 167, 1229-1240. el 5sample was analyzed by nano-LC- MS/MS system (Ultimate 3000 coupled to a QExactive Plus, Thermo Scientific). Peptides were separated using a 12 cm x 75pm Cl 8 column (3 pm particles, Nikkyo Technos Co., Ltd.
  • RNA samples Prior to sequencing, the quantity and concentration of all RNA samples was determined using a Nano VueTM Spectrophotometer (GE Healthcare). High RNA integrity was verified using a 2100 Bioanalyzer (Agilent). Illumina-compatible libraries were prepared using a modified TruSeq RNA Sample Preparation method (I!!umina) that omitted the poly-A enrichment step. Sequencing w3 ⁇ 4s performed using 100 nucleotide paired-end sequencing on a HiSeq 4000 (illumina) in the Genomics Resources Core Facility at Weill Cornell Medicine.
  • Genbank IDs of the rRNA prototypes (reference sequences) used are:
  • Genbank IDs of rDNA prototypes used were: U13369.1 and X12811.1 (human), BK000964.3 and chromosome 8 region [123538334, 123539354] (mouse). This region of chromosome 8 wus chosen as a representative 5S rRNA and operon for mouse, as used previously (D M. Stulls, et al. Genome Res., (2008), 18, 13—18), since no such sequence could be found in Genbank.
  • the whole genome reference sequence assemblies used were GRCm38 for mouse and GRCh38 for human, downloaded from Ensembl. dbSNP release bl47 was used for known variant locations in the reference genome assemblies. Read libraries from the 1000 Genomes Project and Mouse Genomes Project were downloaded from NCBI SRA using the sra-toolkit.
  • BaseRecalibrator tool from GATK with mismatch context size of 3, and using the dbSNP and individuai-/strainspecific variant sites for the "-knownSites" option, as well as the reference regions declared to be unannotated rRNAs, as defined above. Paired-end read insert size distributions were computed with the CollectlnsertSizeMetrics tool of Picard.
  • a paired-end read was discarded if it did not map concordantly to the prototype rDNA, or if there existed a concordant alignment in the reference genome on one of the fully assembled chromosomes that did not overlap with any of the unannotated rRNA regions which had a strictly smaller edit distance than the edit distance of the paired-end read on the rDNA prototype.
  • the edit distance was computed as the sum of the edit distances for each mate.
  • the base qualities for the 8 nt on each end of each mate were manually reduced to Q0.
  • variants were called with LoFreq, restricted to the regions of the rDNA consisting of the transcribed rRNA.
  • the variant discover)' strategy detected a variant allele defined by variant nucleotide v at position x m individual A.
  • f the fraction of reads for individual B overlapping position x which exhibit nucleotide v
  • n the estimated rDNA copy number of individual B. Then, the variant AF in individual B is f if f*n > 1, and 0 otherwise.
  • Short-read sequencing data read depth distribution is biased according to the GC-content of the fragment that generate the read(s) (J. G. Gibbons, et al., Nat Commun., (2014), 5, 4850).
  • the median fragment length was computed with Picard "CollectlnsertSizeMetrics”.
  • RNA-seq reads from were aligned against the rRNA prototypes with bowtie2 with parameters as described above ("Calling rRNA variants").
  • Calling rRNA variants To prevent false positive variant calls due to alignment artifacts and poorly aligned read termini, reads were realigned with GATK and the quality scores of read termini were manually reduced to Q0 to preclude their contribution to variant calling, again as described above ("Calling rRNA variants").
  • LoFreq which implements the strand bias test and accounts for qualify scores and has been proven to have an extremely low' false positive rate, was then used to call sequence variants. Per-samp!e allele frequencies for detected variants were then computed from read alignment pileups.
  • each sample represents RNA-seq reads obtained from a single organ of a single mouse.
  • each organ there were three observations - one from each of the three mice analyzed.
  • Differential expression of variant allele frequencies between pairs of tissues were calculated using Lirnma, which uses a moderated t-statistic to account for common biological variation, with an false discovery rate (FDR) threshold of 0.05.
  • FDR false discovery rate
  • RNA-seq Rationalizing that the full di versify of rRNA valiants is best captured by actively translating ribosomes, RNA-seq was performed on polysomes isolated from proliferating mouse and human cells to use as bait. Polysomes represent the cellular fraction of ribosomes actively engaged in translating single mRNAs (D. J. Adams, et al, Mamin Genome ., (2015), 26, 403-412; M. Reschke et al., Cell Rep., (2013), 4, 1276- 1287; H. Chasse, et al., Nucleic Acids Res., (2017), 45, e!5; E.
  • Example 3 rDNA copy number varies widely across individuals in human and mouse
  • this variant discovery strategy detected the known rDNA sequence variants using WGS data from Escherichia coli K-12 with high specificity and sensitivity, where the estimated allel e frequency (AF) was highly correlated with the true AF (r>0.98, p ⁇ le-6).
  • Example 5 Humans exhibit pervasive inter - and intra-individual sequence variation in the rRNA genes
  • intra-individual rRNA variant AFs ranged over a broad spectrum, from a single copy to all rDNA copies in an individual’s genome, with numerous variants having extreme penetrance-to-population relationships. At one extreme, 19 variants were observed occurring in >50% of humans with a maximum intra-individual AF ⁇ 5%. In other words, these variants were found in over half of individuals tested, but w ithin any single individual, at most 5% of the individual’s rDNA operons contained the variant. For instance, the G928A 18S rRNA variant occurred with a maximum intra-individual AF of 3.7%.
  • 62 rRNA variants were found to occur with maximum intra-individual AF >75%, meaning that at least 75% of the rDNA operons within at least one individual contained the variant. Further, 22 variants in the 18S and 28S rRNA w'ere found to occur at a minimum intra-individual AF >10% in more than 500 individuals. Such high intra-individual AFs suggest that the majority of expressed ribosomes in these respective individuals likely contain these variants.
  • Eukaryotic rRNA is post-transcriptionally modified in functional centers of the ribosome(I B. Lomakin, et a! ., Nature, (2013), 500, 307-311; W. A. Decatur, et al , Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., JMol Biol, (1997), 266, 246-268).
  • I B. Lomakin, et a! . Nature, (2013), 500, 307-311; W. A. Decatur, et al , Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., JMol Biol, (1997), 266, 246-268).
  • 1,790 rDNA variants identified in the 5S, 5.8S, 18S and 28S rRNA 61 localized to 59 of the 256 (23%) positions known to be modified (I. B. Lomakin, e
  • the C462T variant in 18S rRNA a post-transcriptionally modified position that is a known site of interaction with eukaryotic release factor 1 (eRFl) during translation termination and ribosome release (MADEN, B E H, Progress in nucleic acid research and molecular biology ⁇ , (1990)), is found with intra-individual AF up to 61%.
  • the Al 183G variant in 18S rRNA found with intra-individual AF as high as 27%, occurs at a position located in intersubunit bridge eB14 near the decoding center of the ribosome that is 2 , -0-methylated by the small-nucleolar RNA (snoRNA) snR41 (1.
  • rRNA variants may impact ribosome function
  • rRNA variants with high intra-individual AF >20%) localized to intersubunit bridge elements or to the binding sites of ribosomal proteins that mediate them (Bla, B2b, B2c, B2e, B4, B7b, B7c, eBl 1, eB14, eB8b).
  • no variants were observed in intersubunit bridges B2a, B5, or B5b.
  • Intersubunit bridge Bla often referred to as the A-site finger (ASF)
  • ASF A-site finger
  • Bridge B2c contributes to the central intersubunit bridge structure, bridge B2, that links the site of peptide-bond formation within the large subunit to the site responsible for ammoacyl-tRNA decoding in the small subunit decoding center (H.
  • Bridge eBl 3 located at the leading edge of the translating ribosome and established through ribosomal protein eL24, extends from the base of the large subunit to ribosomal protein eS6, a small subunit protein implicated in mTQR-dependent regulation of the translation mechanism (T. Budkevich et al., Mol Cell. , (201 1), 44, 214-224; A. C. Hsieh et al., Nature, (2012), 485, 55-61).
  • variants spanning bridge el ements may impact the process of translation initiation by altering intersubunit bridge formation or the dynamic remodeling of intersubunit bridge elements that accompany the mechanism of protein synthesis (H. Chasse, et al .Nucleic Acids Res., (2017), 45, el 5; H. Khatter, et al , Nature, (2015), 520, 640-645; I. A. Dunkle et al., Science, (2011), 332, 981-984).
  • Variants on both sides of an intersubunit bridge may also template a mechanism for coordinating the pairing of specific 40S and 60S subunits comprised of distinct rRNA alleles in the cellular milieu.
  • Example 8 rRNA variants stratify by ancestry
  • rRNA variants to impact ribosome function may be greater if they occur on the same rRNA or in the same ribosome. Unambiguous determination of the occurrence of rR A variants on the same operon is prohibited by the large number of highly homologous rDNA operons in the genome and the short insert size of the paired-end reads (R. Heilig. Nature , (2004), 431, 931-945). Therefore correlation between intraindividual rRNA variant AFs were assessed to provide a first approximation of whether variants may be genetically linked. To do so, all pairwise intra-individual AF correlations were calculated for variants occurring in >75% of 1 ,887 unrelated individuals.
  • Correlations of this kind may exist if variants occurred within the same ancestral rDNA operon, which has since experienced extensive duplication and deletion rearrangements through meiotic recombination, or if direct or indirect functional linkages exist between these distal regions of the ribosome.
  • Example 10 rRNA variants are evolutionarily conserved
  • Table 1 Summary of rRNA variants detected in mouse strains from the Mouse Genomes Project.
  • the C543T variant in hi 6 of 18S rRNA, winch has an intra- individual AF as high as 22% in humans, is thought to directly contact the mRNA helicase DHX29 (D. F. Gudbjartsson et al., Scientific data, (2015), 2, 150011 ), an extra-ribosomal factor implicated in translation initiation (FIG. 4A).
  • the G480A variant in h5 of 18S rRNA which has an intra-individual AF as high as 65% in humans, resides at a position thought to contribute to GTP hydrolysis during tRNA selection (O.
  • Example 10 rRNA variants are differentially expressed between organs
  • RNA-seq without rRNA depletion (Methods).
  • 70 distinct rRNA variants were reproducibly detected: 19 in the 18S, 1 in the 5.8S, and 50 in the 28S. Consistent with a direct relationship between rDNA variant copy number and expression level, 31 of these variants (44%) were also detected in the rDNA of the BALB/cJ mouse strain.
  • ES27 Half of the ten variants showing the largest tissue-specific expression differences were found in ES27 (H63) located on the back side of the 60S subunit.
  • This highly dynamic ES plays an important yet unknown role in stabilizing the ribosome-associated complex (RAC) and coordinating extra-nbosomal factors implicated in co-transiationai protein folding near the nascent peptide exit tunnel (G. Serin et al., Biochirnie., (1996), 78, 530-538; R. Freitas, et al., Kinematic Seif-Replicating Machines (Landes Bioscience, 2004)).
  • the full repertoire of ES27 contributions to the translation mechanism has, however, yet to be elucidated. Intriguingly, variants in these regions were also found in human at high intra-individual AF, suggesting that their tissue-specific expression is conserved.
  • Example 11 rRNA variants are incorporated into actively translating ribosomes
  • 62 rRNA variants were detected, of wfiich 30 were found in the rDNA of at least one of the mouse strains analyzed and another 19 coincided with known positions of rRNA modification.
  • Ribosomal DNA is currently "dark matter" that is missing from contemporary genome assemblies (I. L. Gonzalez, et al., Nucleic Acids Res., (1988), 16, 10213-10224). Consequently, the contribution of rRNA heterogeneities to
  • ribosomopathies or functional distinctions m the pool of assembled ribosomes has yet to be explored.
  • the rigorous rDNA copy number estimation and rRNA sequence variant discovery strategies that are described here reveal pervasive intraand inter-individual rRNA sequence variation in the human genome, with at least 1,662 of the 7,184 nt of rRNA (23%) exhibiting sequence variation in the global population, and variation in rDNA copy number between individuals that spans nearly two orders of magnitude.
  • Tissue-specific rRNA variant expression and extensive overlap between the discovered rRNA variants and positions of known functional importance in the assembled ribosome, including sites identified as being post- transcriptionally modified were further observed (W. A.
  • rDNA operon-specific regulation or 2] tissue-to-tissue variation in the copy number of sub-classes of rDNA operons bearing specific variant alleles. While differences in total rDNA copy number between mouse tissues have been reported (H. F. Noller, Philos Trans R Soc Lond, B, Biol Sci., (2017), 372), it is unknown how the copy number of specific sub-classes of rDNA operons varies by tissue. Notably, larger variations in tissue-specific rRNA variant expression were observed than expected from rDNA copy variation alone.
  • chromatin structure and epigenetic modifications of the rDNA promoter contribute to the specificity of rDNA operon transcription (R. M. Voorhees, et al, Science , (2010), 330, 835-838; C. Lei dig et al, Nat Struct Mol Biol, (2013), 20, 23-28; R. Santoro, et al., Mol Cell, (2001), 8, 719- 725).
  • the expression of rDNA is also correlated with nucleotide variants within the IGS region, rDNA promoter, and the 5’ ETS region upstream of the rRNA genes (J. W. Briggs et al, Mol Cell, (2017) 67, 3-4; B. A.
  • tissue-specific rRNA variant expression is likely also regulated by rDNA promoter, IGS, and ETS sequence heterogeneities.
  • IGS regions contain variable-length tandem repeats, transposable A!u repeat elements that are highly abundant in the human genome, simple sequence microsatellites (B. Xu et al., PLoS Genet., (2017), 13, el006771), and exhibit substantial structural variation (P. Grozdanov, et al., Genomics , (2003), 82, 637-643; R.
  • compositions a feature that has been recently connected to the translation of distinct mRNAs (S. Caburet et al , Genome Res., (2005), 15, 1079-1085).
  • rDNA copy number and rRNA sequence variation may template specialized functions that enable the ribosome to actively participate in determining the cellular proteome (M. Bama, FASEB J (2017); J. W. Briggs, et al., Mol Cell, (2017) 67, 3-4).
  • the present analysis likely represents a lower bound on the true diversity of rDNA sequences in the mouse and human genomes and thus sequence variation within the assembled ribosome.
  • Read coverage is low in regions of extremely high GC-content (35), which applies to rRNA in general and especially for regions of the 28S (>80% GC) for which we observed nearly zero read depth. These regions include those identified as mter-species variable domains that are expressed and present in actively translating ribosomes (J. G. Gibbons, et al., Proc Natl Acad Sci USA., (2015), 1 12, 2485- 2490: H. Tseng et al, PLoS ONE, (2008), 3, el 843; H.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medicinal Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Theoretical Computer Science (AREA)
  • Plant Pathology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present disclosure is directed to processors, computer readable-media and methods for detecting rare sequence variants in repetitive sequences in a genome. The present disclosure is also directed to processors, computer readable-media and methods for determining the copy number of a given repetitive sequence in a genome.

Description

DETECTING VARIANT ALLELES IN COMPLEX, REPETITIVE SEQUENCES WITHIN WHOLE GENOME SEQUENCING DATA SETS
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of priority from U.S. Provisional Application No. 62/636,466, filed February 28, 2018, the entire contents of which are incorporated herein by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[QQQ2] Tins invention was made with government support under Grant No. R01- GM079238, awarded by the National Institute of Health. The government has certain rights m the invention.
BACKGROUND
[0003] Repetitive sequences (also known as repeat elements, repeating units or repeats) are patterns of nucleic acids (DNA or RNA) that occur in multiple copies throughout the genome. Repetitive sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next- generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed.
[0004] For the specific case of ribosomal DNA (rDNA) regions in mammals for instance - for which there may be 20 to 2000 distinct rDNA operon elements within each individual (mouse or human) that exhibit related, but unknown sequences - there is no existing product specifically designed for or capable of detecting rDNA sequence variants in the encoded operons or the ribosomal RNA (rRNA) expressed from them nor estimating their copy number. Human and mouse genomes encode hundreds of copies of the rDNA operon, which are arranged in tandem arrays on five chromosomes (chromosomes 13, 14, 15, 21, 22 in human; chromosomes 12, 15, 17, 18, 19 in mouse). Each rDNA operon encodes a pre-rRNA that is post-transcriptionally processed into the 18S rRNA of the small nbosomai subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit. Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively. In humans, sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities. In mouse, enhancer elements in the rDNA intergenie spacers (IGS) determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycledependent expression patterns and cellular differentiation. Evidence of altered rDNA operon expression patterns in the lifecycles of both bacteria and parasites argue that such control mechanisms may be evolutionanly conserved. While algorithms and software have been described for detecting sequence variants in DNA for regions other than rDNA, these products are, by their own admission, not suitable for reliable detection of sequence variants in repetitive regions, of which rDNA qualifies.
[0005] Ribosomopathies are diseases arising from aberrations m the assembly, composition, or function of the ribosome, the two-subunit RNA-protein complex responsible for cellular protein synthesis (K. L McCann, et al., Science , (2013) 341, 849- 850). Varied and tissue-specific disease phenotypes associated with the perturbation of ribosomal proteins have inspired hypotheses of functionally distinct ribosome sub- populations in the cell (S. Xue, et al, Nat Rev Mol Cell Biol., (2012) 13, 355-369; Z. Shi, et al. , Annu Rev Cell Dev Biol., (2015) 31, 31-54; E. W. Mills, et al., Science., (2017)
358), often referred to as specialized ribosomes (M. Sauert, et al., Biochimie., (2015) 114, 39-47). Abnormalities in ribosomal proteins and post-transcriptional and -translational modifications are associated with disease conditions (K. L. McCann, et al, Science,
(2013) 341, 849-850; M. Bama, FASEB J (2017),
htp://www.fasebj.Org/content/31/1 JSupplement/387.1.short). However, little is presently known about the extent of genetically encoded sequence v ariation in ribosomal RNA (rRNA), the core component of the assembl ed ribosome, and whether such heterogeneities may serve as the basis for ribosome dysfunction or functional specialization. Recent studies suggest a potentially critical role for rRNA sequence variation in gene expression. For example, nutritional stress during pregnancy impacts the epigenetic status of a subset of ribosomal DNA (rDNA) operons delineated by a variant in the rDNA promoter sequence (J. W. Briggs, et al., Mol Cell, (2017) 67, 3-4). Variant rRNA alleles are also hypothesized to play a role in long-term memory formation (T. Preiss, Trends Biochem Sci., (2016), 41, 121-123). Somatic genome rearrangements of rDNA observed in >50% of adult solid tumors (N Slavov, et al, Cell Rep., (2015), 13, 865-873) also raise the possibility that rRNA sequence and copy number variation contribute to cancer (H. F. Noller, Philos Trans R Soc Land, B, Biol Sci., (2017), 372).
[0006] As the integration point for the regulation of protein synthesis in all cells, the ribosome interacts with myriad extra-ribosoma! factors when synthesizing proteins from messenger RNA (mRNA). Ribosomeassociated proteins, the "ribo-interaetome", have been implicated in ribosome-mediated impacts on gene expression (D. Simsek et al. Cell, (2017), 169, 1051-1065. el 8). While physical heterogeneities arising from ribosomal protein mutations and ribosomal protein and ribosome-associated protein stoichiometry have recently drawn significant attention (M. L. Holland et al., Science, (2016), 353, 495- 498; A. I. Hernandez, et al, Commun Integr Biol, (2015), 8, el 017163; D. M. Stubs et al. Cancer Res., (2009), 69, 9096-9104), the existence and contribution of sequence variation in rRNA has received less consideration. Human and mouse genomes encode hundreds of copies of the rDNA operon, which are arranged in tandem arrays on five chromosomes (13, 14, 15, 21, 22 m human; 12, 15, 17, 18, 19 in mouse) (FIG. 1A) (B. Xu et al, PLoS Genet., (2017), 13, el00677l; P. Grozdanov, et al, Genomics, (2003), 82, 637-643). Each rDNA operon encodes a 47S pre-rRNA that is post-transcriptional!y processed into the 18S rRNA of the small ribosomal subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit. Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively (FIG. 1 A) (D. M. Stults, et al, Genome Res., (2008), 18, 13-18). In humans, sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities (J. G. Gibbons, et al, Proc Natl Acad Sci USA., (2015), 112, 2485-2490; B. A. Kuo, et al, Nucleic Acids Res., (1996),
24, 4817-4824). In mouse, enhancer elements in the rDNA intergenic spacers (IGS) determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycle-dependent expression paterns and cellular differentiation (S. Zhang, et al PLoS ONE, (2007), 2, e902; R. Santoro, et al, EMBO Rep., (2010), 11, 52- 58). Evidence of altered rDNA operon expressi on patterns in the lifecycles of both bacteria and parasites (E. W. Mills, et al, Science., (2017) 358) argue that such control mechanisms may be evolutionarily conserved. [QQQ7] The notion that ribosomes are homogeneous and consequently passive players in the protein synthesis mechanism (K. L. McCann, et a!., Science (2013) 341, 849-850; A.
I. Hernandez, et a!., Commun Integr Biol., (2015), 8, el017163; G E. Zentner, et a!., G3 (Bethesda), (2014), 4, 243-254) is based in part on early restriction digest and nuclease protection studies demonstrating relatively low' levels of rDNA sequence variation (J. G Gibbons, et al, Proc Natl Acad Sci USA., (2015), 112, 2485-2490; M. L Holland, EBioMedicine (2017); H. Tseng et al, PLoS ONE, (2008), 3, e!843; H. Leffers, et al., Nucleic Acids Res., (1993), 21, 1449-1455). Contemporary efforts to discover rDNA sequence variation with high-throughput sequencing technologies are challenged by the absence of rDNA sequences from mammalian reference genome assemblies (I. L.
Gonzalez, et al., Nucleic Acids Res., (1988), 16, 10213-10224) and the inherent difficulty of variant discovery and sequence assembly in highly repetitive regions (R. Heilig. Nature, (2004), 431 , 931-945)
SUMMARY OF THE DISCLOSURE
[0008] An aspect of the present disclosure is directed to a processor programmed to perform:
i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
li) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity; and
iii) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
[0009] In some embodiments, the processor is further programmed to perform:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base quality score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality score recalibration; and vii) detecting rare sequence variants on the reads that have gone through terminal read quality reduction.
[0010] in some embodiments, the processor is further programmed to perform
determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence. In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
[0011] In some embodiments, the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
[0012] In another aspect, the present disclosure is directed to a computer-readable storage device, comprising instructions to perform:
i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
ii) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity; and
iii) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
[0013] In some embodiments, the computer-readable storage device further comprises instructions to perform:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base qualify score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality score recalibration; and
vii) detecting rare sequence variants on the reads that have gone through terminal read quality reduction
[0014] In some embodiments, the present disclosure is directed to a computer-readable storage device, further comprising instructions to perform determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence. In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads
[0015] In some embodiments, the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used m generating the first reference sequence
[0016] Another aspect of the disclosure is directed to a method comprising
i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
ii) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity: and
hi) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
[0017] In some embodiments, the method further comprises performing:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base quality score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality score recalibration; and
vii) detecting rare sequence variants on the reads that have gone through terminal read quality reduction
[0018] In some embodiments, the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence. In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
[0019] In some embodiments, the set of high-throughput sequencing reads comprises DNA sequencing reads (e.g., DNA-seq).
[0020] In some embodiments, wherein the set of high-throughput sequencing reads comprises RNA sequencing reads (e.g , RNA-seq).
[0021] In some embodiments, the contiguous stretch of the reference sequence is at least 25 nucleotides long.
[0022] In some embodiments, the contiguous stretch of the reference genome is at least 25 nucleotides long.
[0023] In some embodiments, the first reference sequence is a ribosoma! sequence.
[0024] In some embodiments, the ribosomal sequence comprises a 47S/45S rDNA prototype sequence.
[QQ25] In some embodiments, the first reference sequence is selected from the group consisting of NCBI reference sequences NR_G03278 3, NR_003279. l, NR__0Q3280.2, and NR_030686.1.
[0026] In some embodiments, the first reference sequence is selected from the group consisting of NCBI reference sequences NR_003285.2, NR_003286.2, NR_003287.2, and NR_023379.1.
[0027] In some embodiments, the first reference sequence is selected from the group consisting of GenBank IDs U13369.1 and X12811.1.
[0028] In some embodiments, the first reference sequence is selected from the group consisting of BK000964.3 and Mus mus cuius chromosome 8 region [123538334, 123539354] [0029] In some embodiments, the second reference sequence comprises a sequence generated by sequencing ribosomes.
[0030] in some embodiments, the organism is a prokaryote.
[0031] In some embodiments, the organism is a mammal .
[0032] In some embodiments, the mammal is selected from the group consisting of Homo sapiens, Mus mus cuius , and Ralius norvegicus.
BRIEF DESCRIPTION OF 'THE DRAWINGS
[0033] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and pay ment of the necessary fee.
[0034] FIGS. 1A - IB. rDNA operons in the human genome. (A) Chromosomal locations of rDNA operons and organization of the rRNA genes within the 47 S rDNA operon m the human genome. Together with the 5S rRNA, the IBS, 5.8S, and 28S rRNA form the RNA core of the ribosome (B) Per-individual rDNA copy number estimation in humans, grouped by population.
[0035] FIGS. 2A - 2B. High frequency genomic rRNA variants detected in human. Subunit interface views of the (A) small subunit (40S) and B) large subunit (60S) tertiary structures of the human ribosome showing positions (red) with variant alleles exhibiting intra-individual AF>20% in any individual analyzed. Structural features within the 40S and 60S subunits are indicated.
[0036] FIGS. 3.4 - 3D. rRNA variants with population-stratified intra-individual AF. (A) Intra-individual allele frequency of 28S rRNA sequence variant A2538G, by population (max V st=0.56). x-axis has log2 scaling. (B)-(D). Tertiary structures of the ribosome showing strongly population-stratified variants (Vst>0.5) with high intra-individual AF (>20%) in any individual analyzed that are located (B) in expansion segments (ES6S and ESS) in the surface near ribosoma! protein eS6, (C) proximal to the binding site for ribosomal protein eS19, and (D) proximal to the binding site for ribosomal protein eL38. Ribosome tertiary structures (PDB ID: 4V6X (S. Shao, ei al. , Mol Cell., (2015), 57, 433- 444)) show rRNA (tan), ribosomal proteins (green), and key structural landmarks.
[QQ37] FIGS. 4A - 4C. Evolutionary conserved rRNA variants in functionally important centers of the human ribosome. (A) 18S C543U variant (red) of helix hl6 contacts DHX29 (yellow), a component of translation initiation. (B) 18S G480A variant (red) of helix h5 occurs near contact points with residues 256-260 within Domain 2 (yellow) of eEFlA (blue). (C) 28S G1764A variant (red) of helix H38 in the central protuberance. The structural models shown are based on EMD-3056-3058(45), PDB IDs 5LVS (B E. Maden et al , Biochem J., (1987), 246, 519-527) and 4V6X (S. Shao, et ai„ Mol Cell, (2015), 57, 433-444), respectively, where eIF3 and DHX29 from EMD-3056- 3058 were modeled onto PDB ID 5LVS.
[0038] FIG, 5, Tissue-specific expression of rRNA variants. rRNA variant expression heatmap and hierarchical clustering of the 26 variants detected to be differentially expressed among pairs of tissues. Each row represents a biological replicate. Rows are grouped by tissue source (3 biological replicates, i.e., rows, per tissue source). Each column represents an rRNA variant. Expression is normalized per rRNA variant (i.e., by column), across all replicates and tissues (i.e., 12 samples per each column). For example, the rRNA variant represented by the leftmost column has higher relative expression in brain, while the variant represented by the rightmost column has lowest relative expression in liver.
[0039] FIGS. 6A - 61. Bioinformatics strategy for rRNA copy number estimation and variant discover}'. (A) The rDNA prototype and RNA-seq reads from actively translating ribosomes are used to identify whole genome sequencing (WGS) reads putatively generated from rDNA regions by (B) computational hybridization, wherein paired-end reads are selected if at least one of the mates contains a contiguous stretch of 30 nt of perfect identity to the prototype or any RNA-seq read. (C) Candidate rRNA WGS reads are then aligned to the reference genome and the rDNA prototype, separately, and (D) only paired-end reads which do not have better alignments to the reference are retained.
(E) Sample-specific biases in read depth distribution according to GC-content are computed to (F) estimate rDNA copy number from the identified rRNA WGS reads. (G) WGS rRNA reads are aligned to the prototype rDNA, indei realignment and base quality score recalibration are performed via GATK, H) base qualities at terminal 8nt on both ends of the reads are manually reduced to Q0, and (I) variants are called using LoFreq.
[0040] FIG, 7. Block diagram of the system in accordance with the aspects of the disclosure. CPU: Central Processing Unit ("processor")
[0041] FIG, 8, Flow- chart of an embodiment identifying repetitive sequence reads among high throughput sequencing reads.
0 [0042] FIG, 9. Flowchart of embodiments for detecting rare sequence variants, and for determining the number of copies a repetitive sequence has in the genome.
Figure imgf000012_0001
Definitions
[0043] As used herein, the term "about" refers to a variation within approximately ±10% from a given value.
[0044] The phrase "base quality score" (or "Phred quality score") refers to per-base estimates of error emitted by the sequencing machines, which express how confident the machine was that it called the correct base each time. For example, if the machine reads an A (Adenosine) nucleotide, and assigns a quality score of Q20— in Phred-scale, which means it's 99% sure it identified the base correctly. This means that one ca expect it to be wrong in one case out of 100: so if there are several billion base calls (one gets ~90 billion in a 3 Ox genome), at that rate the machine would make the wrong call in 900 million bases.
[0045] The phrase "base quality score recalibration (BQSR)" refers to a process in which machine learning is used to model the per-base estimates of error emitted by the sequencing machines empirically and to adjust the quality scores accordingly. For example, for a given run, it may be identified that whenever two A nucleotides are called in a row, the next base called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. Adjusting the quality scores is done over several different covariables (e.g., mainly sequence context and position in read, or cycle) in a way that is additive. As a result, the same base may have its quality score increased for one reason and decreased for another. This allows getting more accurate base qualities overall, which in turn improves the accuracy of the variant calls. Base calls themselves are not corrected, i.e., one can't determine whether that low-quality A should actually have been a T but recalibration allows the variant caller more accurately hotv far it can trust that A.
[0046] The phrase "computer readable medium" refers to a computer readable storage device or a computer readable signal medium. A computer readable storage device, may be, for example, a magnetic, optical, electronic, electromagnetic, infrared, or
semiconductor system, apparatus, or device, or any suitable combination of the foregoing; however, the computer readable storage device is not limited to these examples except a computer readable storage device excludes computer readable signal medium. Additional examples of the computer readable storage device can include: a portable computer diskette, a hard disk, a magnetic storage device, a portable compact disc read-only memory (CD-ROM), a random access memory (RAM), a read-only memory' (ROM), an erasable programmable read-only memory' (EPROM or Flash memory ), an optical storage device, or any appropriate combination of the foregoing; however, the computer readable storage device is also not limited to these examples. Any tangible medium that can contain, or store, a program for use by or in connection with an instruction execution system, apparatus, or device could be a computer readable storage device.
[0047] A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, such as, but not limited to, m baseband or as part of a carrier wave. A propagated signal may take any of a plurality of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium (exclusive of computer readable storage device) that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device.
Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
[0048] The phrase "GATK Best Practices" refers to step-by-step recommendations from the Broad Institute for performing variant discovery' analysis in high-throughp t sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencin machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses.
[0049] The phrase "high homology'" refers to a sequence that demonstrates a high level of sequence identity to another given sequence. In some embodiments, "high homolog}'" refers to at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95% or at least 99% sequence identity. In some embodiments, two sequences ith "high homology " are substantially complementary' to each other. By "substantially
complementary" it is meant that the nucleic acid fragment is capable of hybridizing to at least one nucleic acid strand or duplex even if less than ail nucleobases do not base pair with a counterpart nucleobase. In certain embodiments, a“substantially complementary’' nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 77%, 8%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double stranded nucleic acid molecule during hybridization. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can he interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
[0050] As used herein, the phrase "high throughput sequencing" (or "next generation sequencing") refers to sequencing technologies having increased throughput as compared to the traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands or millions of relatively short sequence reads at a time. Examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. Examples of next generations sequencing methods include, but are not limited to, pyrosequencing as used by the GS Junior and GS FLX Systems (454 Life Sciences, Bradford, Conn.); sequencing by synthesis as used by Miseq and Solexa system (Illumina, Inc., San Diego, Calif.); the SOLiD™ (Sequencing by Oligonucleotide Ligation and Detection) system and Ion Torrent Sequencing systems such as the Personal Genome Machine or the Proton Sequencer (Thermo Fisher Scientific, Waltham, Mass.), and nanopore sequencing systems (Oxford Nanopore Technologies, Oxford, united Kingdom). [0051] The term "local indel realignment" refers to a process to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossibl e to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel, also requiring realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.
[0052] Local realignment around indels allows correction of mapping errors made by genome aligners. Local realignment makes read alignments more consistent in regions that contain indels. Genome aligners can only consider each read independently, and the scoring strategies they use to align reads relative to the reference limit their ability to align reads well in the presence of indels. Depending on the variant event and its relative location within a read, the aligner may favor alignments with mismatches or soft-clips instead of opening a gap in either the read or the reference sequence. In addition, the aligner's scoring scheme may use arbitrary tie-breaking, leading to different, non- parsimonious representations of the event in di fferent reads. In contrast, local realignment considers all reads spanning a given position. This makes it possible to achieve a high- scoring consensus that supports the presence of an indel event. It also produces a more parsimonious representation of the data in the region.
[0053] The term "mapping" refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly.
[0054] The term "memory" as used herein comprises program memory and working memory. The program memory may have one or more programs or software modules.
The working memory stores data or information used by the CPU in executing the functionality described herein.
[0055] The term“processor” may include a single core processor, a multi-core processor, multiple processors located m a single device, or multiple processors in wared or wireless communication with each other and distnbuted over a network of devices, the Internet, or the cloud. Accordingly, as used herein, functions, features or instructions performed or configured to be performed by a '‘processor”, may include the performance of the functions, features or instructions by a single core processor, may include performance of the functions, features or instructions collectively or collaborative!) by multiple cores of a multi-core processor, or may include performance of the functions, features or instructions collectively or coliaboratively by multiple processors, where each processor or core is not required to perform every function, feature or instruction individually. The processor may be a CPU (central processing unit). The processor may comprise other types of processors such as a GPU (graphical processing unit). In other aspects of the disclosure, instead of or in addition to a CPU executing instructions that are programmed in the program memory , the processor may be an ASIC (application-specific integrated circuit), analog circuit or other functional logic, such as a FPGA (field-programmable gate array), PAL (Phase
Alternating Line) or PLA (programmable logic array).
[0056] The CP U is configured to execute programs (also described herein as modules or instructions) stored in a program memory to perform the functionality described herein. The memory may be, but not limited to, RAM (random access memeory), ROM (read only memory') and persistent storage. The memory is any piece of hardware that is capable of storing information, such as, for example without limitation, data, programs, instructions, program code, and/or other suitable information, either on a temporary basis and/or a permanent basis.
[0057] The phrase "terminal read quality reduction" refers to reducing the base quality scores for the 8 terminal nucleotides of both ends of ever)' read to Q0 (meaning 0% accuracy, 100% error rate).
[0058] As used herein, the term "variants" refer to alternative forms of a gene, genetic locus or gene sequence. Each variant has a distinct nucleic acid sequence at the locus of interest. For example, the inventors have discovered 1,790 variant alleles were detected at 1,662 positions of the 7,184 nucleotides (23%) in human 5S, 5.8S, 18S and 28S rDNA. A variant of a gene or DNA sequence can show a sequence identity , e.g., 60%, 65%, 70%, 75%, 78%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 95%, 97%,
98% or 99% sequence identity, to the reference sequence (prototype) of said gene.
Sequence identity refers to the percent of exact matches between the nucleic acids of two sequences which are being compared. General Description
[0059] In some embodiments, the processor, the computer-readable storage device or the method of the present disclosure ("the technolog}' described herein") are applied to discover sequence diversity, sequence variations and to estimate copy number of repetitive sequences within any multi-copy gene or genomic region.
[0060] In some embodiments, the technology described herein entails the use of a "prototype" template sequence ("a reference sequence") for a specific gene type of interest that is then used to robustly and reliably quantify the subset of high-throughput whole genome sequencing (WGS) reads corresponding to this region of genomic DNA using a rigorous computational hybridization approach. In a specific embodiment, the gene type of interest is ribosomal DM A.
[0061] In some embodiments, multi-copy genes, genomic regions or repetitive sequences used in the technology described herein include ribosomal DNA, centromeric DNA, telomeric DNA, the human leukocyte antigen (HLA) complex of genes, the major histocompatibility complex (MHC) of genes, segmental duplications (Bailey, Jeffrey A., et al., Science , 297.5583 (2002): 1003-1007), and retrotransposons such as Alu, LI, Bl, LINE, and SINE elements.
[0062] In some embodiments, the technology described herein is used to rapidly assess the diversity of species present in a mixed biological sample by extracting and analyzing WGS reads from the mixture and computationally mapping these reads to a prototype (reference) sequence.
[0063] In some embodiments, the technology' described herein is applied to the assessment of the diversity of rDNA operons present within a microhiome sample to rapidly characterize the gut flora of a patient. In some embodiments, the technology described herein is used to assess the diversity' of sequence variations in expressed RNA transcripts arising from RNA editing of post-transcriptional modifications - coming from a distinct region of the genome.
[0064] In some embodiments, the rDNA sequence variants identified by the disclosed technology' are used to distinguish local populations or ethnic groups and to predict the ancestry' of an individual using sequencing data from a biological sample.
[0065] In some embodiments, the discovery' and identification of sequence variants in rDNA with the disclosed technology is used to screen, diagnose, or predict the onset, progression, severity, life expectancy, or general health of an individual. [0066] In some embodiments, the high-throughput sequencing method used is DNA sequencing (DNA-seq). In some embodiments the high-throughput sequencing used is RNA sequencing (RNA-seq).
[0067] Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied or stored in a computer or machine usable or readable medium, or a group of media which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine. A progra storage device readable by a machine, e.g., a computer readable medium, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure is also provided.
[0068] In some embodiments, the present disclosure includes a system comprising a CPU, a display, a network interface, a user interface, a memory, a program memory and a working memor' (FIG. 7), where the system is programmed to execute a program, software, or computer instructions directed to methods or processes of the instant disclosure.
Organism and Reference genome
[0069] In some embodiments, the organism used in the present disclosure is selected from the kingdoms consisting of Monera (bacteria). Protista, Fungi, Plantae, and Animalia. In a specific embodiment, the organism is a member of the kingdom Animalia selected from the group consisting of a human {Homo Sapiens ), a macaque (Macaca fuscaia), a horse ( Equus cahallus), a cow ( Bos taurus), a sheep (Ovis aries), a dog ( Canis lupus familiar is), a cat (Felis cams), a mouse (Mus rnusculus), and a rat ( Raitus norvegicus). In some embodiments, the organism is a bacteria.
[0070] In some embodiments, the reference genome used in the present disclosure is the genome of the organism on which the high-throughput sequencing was performed.
Reference Sequences
[0071] In some embodiments, at least one sequence is designated as the predefined reference sequence (aka. "prototype sequence") (labeled as "Input 2: First reference sequence" in FIG. 8). The predefined reference sequence displays high homology with the putatively existing copies of the gene/a!le!e/genetic element. In some embodiments, the sequence chosen to be the prototype can be any one of the alleles/copies of the gene or genetic element from any individual, or it can be any of the alleles/copies appearing in a reference genome or other genome assembly, or it can be a generalization such as consensus sequence. As an increasing number of genome sequences become available and as the depth and quality of genome sequencing data for repetitive sequences improves, the prototype may be updated and improved as well.
[0072] In some embodiments, the repetitive sequence is rDNA and the prototype comprises published rDNA sequences (e.g , Genbank ID U13369, or sequences in Gonzales et al., Genomics, 1995 May 20; 27(2): 320-8) or corresponding rRNA sequences. In some embodiments, the prototype comprises rDNA sequence reads generated from the sequencing of pure ribosomes or polysomes.
[QQ73] In some embodiments, the organism is mouse (Mus musculus) and the rDNA reference sequence is selected from the group consisting of CBI reference sequences NR 003278.3, NR 003279.1 , NR 003280.2, and NR 030686.1. In some embodiments, the mouse rDNA reference sequence is selected from the group consisting of genomic region BK000964.3 and chromosome 8 region [123538334, 123539354]
[QQ74] In some embodiments, the organism is human ( Homo sapiens) and the rD A reference sequences are selected from the group consisting of NCBI reference sequences NR 003285.2, NR 003286.2, NR 003287.2. NR 023379.1 In some embodiments, the human rDNA reference sequence is selected from the group consisting of genomic region U13369.1 and Xl2811. l.
[0075] In some embodiments, the process outlined in FIG. 8 begins by identifying reads in a high-throughput sequencing data (which comprise a plurality of reads) that map to a predefined reference sequence (a "prototype" - as described above) with a contiguous stretch of 100% sequence identity (See FIG. 8). In some embodiments, the contiguous stretch of 100% sequence identity comprises at least a stretch of contiguous 20
nucleotides, 21 nucleotides, 22 nucleotides, 23 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27, nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 35 nucleotides, or 40 nucleotides that map to a contiguous stretch of the reference sequence with 100% sequence identity. In a specific embodiment, the identification of reads that map to a predefined reference sequence with a contiguous stretch of 100% sequence identity is achieved by using the command "-mumreference" from the software package "MUMmer" (Kurtz et al, Genome Biol. 2004; 5(2):R12). In some embodiments, sequence mapping is achieved by the sequence aligner "bowtie2" with the "— end-to-end" option. In some embodiments, mapping is achieved by the sequence aligner "BWA." In some embodiments, mapping is achieved by a sequence aligner selected from the group consisting of bowtie2, BWA, mrFast & mrsFast, and novoAlign.
Removal of reads that Map to Nan-Repetitive Regions of the Reference Genome
[0076] In some embodiments, some reads that map to the reference sequence may have been generated from non-repeiitive regions of the genome. Some non-repetitive regions of the genome may display homology/sequence identity to subregi ons of the reference sequence (prototype) representing the repetitive region. This procedure includes mapping reads to the non-repetitive regions of the published genome sequence of the organism and to the reference sequence, separately, and discarding a read if it maps to the non-repetitive regions of the genome with fewer errors than to the reference sequence (See FIG. 8). In some embodiments, a read is discarded if it maps to the non-repetiti ve regions of the genome with smaller edit distance. In some embodiments, for paired-end reads, the total edit distance of an aligned pairs is calculated as the sum of the edit distance of each mate, and only concordant alignments are considered. In some embodiments, regions of the reference genome with significant homology to the reference sequence are masked by aligning contiguous subsequences of the reference sequence to the reference genome and masking regions which have > 75% homology to the reference sequence.
[0077] In some embodiments, regions of the reference genome which significant homology to the reference sequence are not considered for evaluating if a read maps better to the non-repetitive regions of the genome than to the reference sequence. In some embodiments, read sequences that map to the reference sequence, and not to the non- repetitive regions of the genome, represent "remainder" reads (aka. candidate reads).
Figure imgf000020_0001
[0078] In some embodiments, a "local indel realignment" is performed to obtain a realignment of the remainder reads (See FIG. 9). In some embodiments, local indel realignment locally realigns reads such that the number of mismatching bases is minimized across all the reads. In a specific embodiment, indel realignment is performed using the Genome analysis Toolkit (GATK) software package (DePristo et a!., Nat Genet , 201 1 May; 43(5):49l-8). In a specific embodiment, the GATK software package is used to perform the "IndelRealigner" function. In a specific embodiment, the LoFreq software package is used for indel realignment.
Base Quality Score Recalibration
[0079] In some embodiments, a "base quality score recalibration" is performed after the realignment (See FIG. 9). In some embodiments, the base quality score recalibration involves using machine learning algorithms to model the per-base estimates of error emitted by the sequencing machines empirically and adjusting the quality scores accordingly. In a specific embodiment, base quality score recalibration is performed using the GATK software package, where the recalibration table is obtained by aligning reads to the reference genome and following the GATK Best Practices (DePristo et al., Nat Genet., 201 1 May; 43(5):491~8). In a specific embodiment, base quality score recalibration is performed using the“Pr tlleads” function of the GATK software package. In some embodiments, the base quality score recalibration is achieved using software selected from the group consisting of GATK, BBMap and FreeBayes.
Figure imgf000021_0001
[0080] In some embodiments, a "terminal read quality reduction" is performed after the base quality score recalibration (See FIG. 9). In a specific embodiment, the base quality scores of 6 terminal read nucleotides, at both ends of every read are reduced to Q0 (Phred scale). In a specific embodiment, the base quality scores of 7 terminal read nucleotides, at both ends of every read are reduced to Q0. In a specific embodiment, the base quality scores of 8 terminal read nucleotides, at both ends of ever}' read are reduced to Q0. In a specific embodiment, the base quality scores of 9 terminal read nucleotides, at both ends of every read are reduced to Q0 In a specific embodiment, the base quality scores of 10 terminal read nucleotides, at both ends of every read are reduced to Q0.
Detection of Rare Sequence Variants
[0081] In some embodiments, a "detection of rare sequence variants" is performed on the reads that have gone through terminal read quality reduction (See FIG. 9). In a specific embodiment, the detection of rare sequence variants is achieved by LoFreq (Wilm et af, Nucleic Acids Res., 2012 Dec; 40(22): 11189-201), a software for detecting rare sequence variants from a heterogeneous cell population in non-repetitive regions of the genome. In some embodiments, the detection of rare sequence variants is achieved by achieved by variant calling algorithms selected from LoFreq, samtools, GATK, and SOAPsnp.
Figure imgf000022_0001
[0082] Another aspect of the disclosure includes determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence (See FIG. 9)
[QQ83] In some embodiments, the number of reads expected to map to the reference sequence is determined by calculating GC-content specific fragmentation rates.
[0084] in some embodiments, an estimation of insert size L is computed. In a specific embodiment, L refers to the median insert size estimated from all concordantly aligned paired-end reads. In some embodiments, L is estimated empirically from the median insert size of all properly paired paired-end reads mapped to the genome. In some embodiments, for single-end reads, insert sizes are estimated from library fractionation protocol parameters. In a specific embodiment, the insert size is estimated by the median insert size as computed by PICARD tool "CollectlnsertSizeMetrics."
[0085] In some embodiments, GC-content specific fragmentation rates Xg (as defined in Benjamim et al. (Nucleic Acids Res., 2012 May; 40(10):e72)) are empirically computed for each possible GC-content g = 0, L using all properly aligned reads. Here, known repetitive regions or regions that may have experienced structural variation should be excluded. For human and mouse, such regions can be identified as those regions with structural variation as reported by the 1000 Genomes Project or Mouse Genomes Project. Repeats such as segmental duplications (Bailey, Jeffrey A., et al, Science , 297.5583 (2002): 1003-1007) constitute repeats which should be excluded. The fragmentation rate is calculated as follows. Given a specific position p in a DNA sequence, define the per- position GC-content of the hypothetical fragment of length L at p to be the number of guanine (G) and cytosine (C) nucleotides in the region j p, p+L-lJ. For each GC content g = 0, ... , L, the number of positions in the genome with per-position GC-content = g can thereby be computed. Cali this value Rg. For each GC content g = 0, ... , L, we can therefore also compute the number of observed properly paired paired-end reads whose 5’ most position of alignment (with regards to both mates) in the genome occurs at a genomic position with positionwise GC-content g. This value is called Eg The GC-content specific fragmentation rate l8 can therefore be computed as:
Figure imgf000023_0001
[0086] In some embodiments, the estimation of a copy number is achieved by comparing the number of observed reads mapping to the prototype to the number of expected reads as estimated by the GC-content specific fragmentation rates. Here, given a specific position p in the genome with GC-content x m the region [p, p+L-1], the expected number of fragments generated at p (i.e. with 5 end at p) is lc/2. Then, the expected number of fragments generated from a region R = fj¾,
Figure imgf000023_0002
(i.e., the number of fragments with 5’ end in R) is then given by
Figure imgf000023_0003
[0087] where G(p) gives the GC-content of the region [p, p+L-1]. The division by 2 is necessary for diploid organisms where the genome assembly aligned to is haploid. With this formula, one can estimate the copy number of any region of the genome or the copy number of any repetitive element using a prototype. Namely, given a region or prototype R with observed fragment count X, then the copy number C of R is estimated as
Figure imgf000023_0004
[0088] In some embodiments, the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to the average coverage of reads across the genome. In some embodiments, the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to an estimate of coverage derived from selected regions of the genome, such as the exons of single-copy genes.
Processors Programmed to Detect Rare Sequence Variants in Repetitive Sequences [0089] In one aspect, the disclosure is directed to a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads. [0090] In some embodiments, a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
[0091] In some embodiments, the processor is programmed to identify candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
[QQ92] In some embodiments, the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
[0093] In some embodiments, the processor is further programmed to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
[QQ94] In some embodiments, the processor is further programmed to perform a base quality score recalibration on the remainder reads after the realignment.
[0095] In some embodiments, the processor is further programmed to perform a terminal read quality reduction on the remainder reads after the base quality score recalibration.
[0096] In some embodiments, the processor is further programmed to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
[0097] In another aspect, the disclosure provides a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by- excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence. In some embodiments, the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
[0098] However, omission or modification of the steps of the procedure described in this patent is likely to increase the number of false positives and false negatives. This is because the steps described here are specifically designed to account for numerous established sources of bias and error in analysis of sequencing information. Additionally, the careful and rigorous use of base quality scores in combination with LoFreq is specifically designed to detect low frequency variants that other approaches missed.
Figure imgf000025_0001
[0099] Another aspect of the disclosure is directed to a processor programmed to detect the number of copies of a repetitive sequence in a genome.
[00100] In some embodiments, a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is present in multiple copies in the genome of the organism.
[00101] In some embodiments, the processor is programmed to identify candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27, nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
[00102] In some embodiments, the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism. [00103] In some embodiments, the processor is further programmed to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
[00104] In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC -con tent specific fragmentation rate analysis of the remainder reads.
[QQ105] In some embodiments, the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity . In some embodiments, the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
Computer-readable Storage Devices to Detect Rare Sequence Variants in Repetitive Sequences
[00106] In one aspect, the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor causes the processor to detect rare sequence variants in repetitive sequences.
[00107] In some embodiments, a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
[00108] In some embodiments, the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identify comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
[00109] In some embodiments, the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
[00110] In some embodiments, the computer-readable storage device further comprises instructions to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
[00111] In some embodiments, the computer-readable storage device further comprises instructions to perform a base quality score recalibration on the remainder reads after the realignment.
[00112] In some embodiments, the computer-readable storage device further comprises instructions perform a terminal read quality reduction on the remainder reads after the base qualify score recalibration.
[00113] In some embodiments, the computer-readable storage device further comprises instructions to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
[00114] In another aspect, the disclosure provides a computer-readable storage device comprising instructions to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rR A read identification by minimal perfect homology, and comparison between alignments to the non -rDNA reference genome and the reference sequence. In some embodiments, the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded. Computer-readable Storage Devices to Detect Number of Copies of a Repetitive
Sequence
[00115] Another aspect of the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor causes the processor to detect the number of copies of a repetitive sequence in a genome.
[00116] In some embodiments, a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is present in multiple copies in the genome of the organism.
[00117] In some embodiments, the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
[00118] In some embodiments, the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
[00119] In some embodiments, the computer-readable storage device further comprises instructions to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
[QQ120] In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads. [00121] In some embodiments, the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity. In some embodiments, the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
Methods for Detecting Rare Sequence Variants in Repetitive Sequences
[QQ122] In one aspect, the disclosure is directed to a method for detecting rare sequence variants in repetitive sequences.
|00123] In some embodiments, the method comprises mapping a set of high- throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is a repetitive sequence present m multiple copies in the genome of the organism.
[QQ124] In some embodiments, the method comprises identifying candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identify. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
[00125] In some embodiments, the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
[QQ126] In some embodiments, the method further comprises performing a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads. [00127] In some embodiments, the method further comprises performing a base quality score recalibration on the remainder reads after the realignment.
[00128] In some embodiments, the method further comprises performing a terminal read quality reduction on the remainder reads after the base quality score recalibration.
[00129] In some embodiments, the method further comprises detecting rare sequence variants on the reads that have gone through terminal read quality reduction.
[00130] In another aspect, the discl osure provides a method of detecting rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence. In some embodiments, the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences fro high throughput sequencing reads as compared to instances where no process is excluded.
Methods of Detecting the Number of Copies of a Repetitive Sequence
[00131] Another aspect of the disclosure is directed to a method of detecting the number of copies of a repetitive sequence in a genome.
[00132] In some embodiments, the method comprises performing mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment. In some embodiments, the first reference sequence is present in multiple copies in the genome of the organism.
[00133] In some embodiments, the method comprises identifying candidate reads within the set based on the initial alignment. In some embodiments, a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity. In some embodiments, the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27, nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides. [00134] In some embodiments, the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence winch maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the mul tiple copies of the first reference sequence in the genome of the organism.
[00135] In some embodiments, the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
[00136] In some embodiments, the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
[00137] In some embodiments, the method further comprises mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity. In some embodiments, the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
[00138] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one skilled in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, the preferred methods and materials are now' described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
[00139] The examples listed below are only illustrative and by no means limiting. Example 1: Materials and Methods
RNA isolation from mouse epithelial cell polysomes
[00140] Mouse mammary epithelial cells, NMuMG cells, were obtained from ATCC and were grown in T225 flasks (Corning) in high glucose (4 5 g/L) DMEM (Gibco) supplemented with 10% fetal bovine serum (Atlanta Bio!ogieals), 1%
penicillin/streptomycin (Gibco), and insulin (10 pg/mL) HEK293T cells were grown in the same manner and media with the exception of insulin . Both cell lines were routinely checked for mycoplasma using the MycoAlert™ Mycoplasma Detection Kit (Lonza). To stabilize polysomes prior to harvest, cells were treated with 350 mM cycloheximide (Sigma Aldrich) for 30 minutes at 37 °C. Cells were then released from the surface using 0 05% Trypsin-EDTA (Gibco), centrifuged at 750 rpm for 5 min. at 4 °C in an Allegra X- 30R centrifuge (Beckman Coulter), washed in cold IX phosphate-buffered saline (pH 7 4) (Gibco), pelleted at 10k rpm for 1 minute at +4 °C in a microcentrifuge, and flash frozen on liquid nitrogen. Frozen cell pellets were thawed on ice and resuspended in lysis buffer (1 g cells/mL lysis buffer) containing 20 mM Tri-HCl (pH :::: 7.5), 5 mM MgC12, 10 mM KC1, 1 mM DTT, 5 mM putrescine, 350 mM cycloheximide (Sigma Aldrich),
RNaseOUT™ (1: 10,000) (Thermo Scientific), IX Halt™ Protease Inhibitor Cocktail (EDTA free) (Thermo Scientific). Resuspended cells were incubated on ice for 5 minutes before the addition of NP-40 (0.5% final concentration), sodium deoxycholate (0.5% final concentration), and TURBO™ DNase (1: 100, Invitrogen). Samples were then incubated on a rotator for 20 minutes at +4 °C before lysate w¾s clarified in a microcentrifuge (14k rpm for 15 min. at +4 °C). Clarified lysate was then loaded onto pre-cooled 10-50% sucrose gradients buffered in the same manner as the lysis buffer, but without RNase inhibitors, protease inhibitors, or DNase. Sucrose gradients 'ere spun at 30k rpm for 3 h. at +4 °C m an SW32 rotor m an Optima™ XPN-100 ultracentrifuge (Beckman Coulter). Gradients were then fractionated using a BR-188 density gradient fractionation system (Brandel). Fractions corresponding to polysomes were collected and pelleted at 35.7k rpm for 18 h. at +4 °C in a Ti70 rotor in an Optima XPN-100 ultracentrifuge (Beckman Coulter). Polysome pellets were resuspended in buffer before subunits were separated by increasing KC! concentration to 0.5M and treating with 5 mM puromycin for 30 min. on ice and 15 min. at 37 °C. Separated subunits were layed atop a 37% sucrose cushion containing 20 mM Tris-HCl (pH = 7.5), 50 mM KC1, 1.5 mM MgC12, and 1 mM DTT and spun at 100k rpm for 6 h. at +4 °C. in TLA 100.3 rotor in a Optima™ MAX -XP ultracentrifuge (Beckman Coulter). Pelleted subunits were resuspended in buffer and RNA was extracted using QIAzol® Lysis Reagent (Qiagen) and purified using the RNeasy (Qiagen).
Mass Spectrometry of mouse epithelial cell polysomes
[00141] NMuMG polysomes, isolated as described above m biological triplicate, were denatured, reduced, and alkylated followed by proteolytic digestion with LysC (Wako Chemicals) and trypsin (Promega). Approximately 120 femtomol of each desalted (S. Shao et al., Cell., (2016), 167, 1229-1240. el 5sample was analyzed by nano-LC- MS/MS system (Ultimate 3000 coupled to a QExactive Plus, Thermo Scientific). Peptides were separated using a 12 cm x 75pm Cl 8 column (3 pm particles, Nikkyo Technos Co., Ltd. Japan) at a flow rate of 200 nL/min, with a 5-40% gradient over 130 minutes (buffer A: 0.1% formic acid, buffer B: 0.1% formic acid in acetonitrile). Data were quantified and searched against E. coli database (July 2014) using MaxQuant (version 1.5.3.28) (J.
Rappsilber, et al, Nat Protoc., (2007), 2, 1896-1906). Oxidation of methionine and protein N-terminal acetylation were allowed as variable modifications, cysteine carbamidomethyl was set as a fixed modification and two missed cleavages were allowed. "Match between runs" option was enabled, and false discovery rates for proteins and peptides were set to 1%. Protein abundances were expressed as LFQ (label free quantitation) values.
Total RNA isolation from mouse tissues
[00142] All mouse work was performed in accordance with institutional, lACIJC and AAALAS guidelines, by the animal protocol 0709-666 A. Brain, lung, liver, and ovary tissue was dissected from three 8 week old BALB/cJ littermates and flash frozen before storage at -80 °C. Tissue (50-100 mg) wus mechanically lysed using a Mixer Mill 400 (Retsch) in 1.5 mL microcentrifuge tubes containing 1.4 mm ceramic beads (Omni) and Lysis/Binding Buffer (Ambion). Samples were lysed 6 x 1 min. at 30 Hz with a short incubation on ice between each cycle. Lysate was clarified for 30 seconds at 10k rpm and +4 °C in a microcentrifuge and RNA was extracted from clarified lysate using the mirVana miRNA Isolation Kit (Ambion). RNA was then treated with TURBO™ DNase (Invitrogen) and purified using the RNeasy Mini Kit (Qiagen).
RNA Sequencing
[00143] Prior to sequencing, the quantity and concentration of all RNA samples was determined using a Nano Vue™ Spectrophotometer (GE Healthcare). High RNA integrity was verified using a 2100 Bioanalyzer (Agilent). Illumina-compatible libraries were prepared using a modified TruSeq RNA Sample Preparation method (I!!umina) that omitted the poly-A enrichment step. Sequencing w¾s performed using 100 nucleotide paired-end sequencing on a HiSeq 4000 (illumina) in the Genomics Resources Core Facility at Weill Cornell Medicine.
References, prototypes, and databases
[00144] Genbank IDs of the rRNA prototypes (reference sequences) used are:
N R 003278.3. NR 003279.1, NR 003280.2, NR 030686.1 (mouse), and NR_003285.2, NR_003286.2, NR_003287.2, NR_023379.1 (human). Genbank IDs of rDNA prototypes used were: U13369.1 and X12811.1 (human), BK000964.3 and chromosome 8 region [123538334, 123539354] (mouse). This region of chromosome 8 wus chosen as a representative 5S rRNA and operon for mouse, as used previously (D M. Stulls, et al. Genome Res., (2008), 18, 13—18), since no such sequence could be found in Genbank.
The whole genome reference sequence assemblies used were GRCm38 for mouse and GRCh38 for human, downloaded from Ensembl. dbSNP release bl47 was used for known variant locations in the reference genome assemblies. Read libraries from the 1000 Genomes Project and Mouse Genomes Project were downloaded from NCBI SRA using the sra-toolkit.
identification of putatively unannotated rRNA in the reference genome
[00145] The rRNA components of the rDNA prototypes were aligned to the reference genome using BLAST. For the 18S and 28S, all hits to the assembled chromosomes with sequence identity > 75% and length > 75% with respect to the respective prototype were considered to be unannotated rRNA within the reference genome. The same criteria was used for the 5.8s and 5s, except with a length threshold of 85% of the respective prototype. Recalibration of base qualities
[QQ146] Reads were mapped to the whole genome reference with BWA mem with the "~M" flag, sorted, and converted to BAM format with samtools. The pre-processing steps of the Genome Analysis Toolkit (GATK) Best Practices pipeline were then followed for valiant discovery. Namely, the BAN! file was further prepared with the CleanS am and AddOrReplaceReadGroups tools in Picard, read duplicates were marked with hiobambam. Reads were realigned using the RealignerTargetCreator tool from GATK, where both the dbSNP variant database and the individual-zstrain-specific variant sites were used for the "-known" option. Base quality score recaiibration tables were obtained using the
BaseRecalibrator tool from GATK with mismatch context size of 3, and using the dbSNP and individuai-/strainspecific variant sites for the "-knownSites" option, as well as the reference regions declared to be unannotated rRNAs, as defined above. Paired-end read insert size distributions were computed with the CollectlnsertSizeMetrics tool of Picard.
Figure imgf000035_0001
[00147] Only read libraries containing paired-end reads with each mate having length > 70 were used. For each read library, reads with an exact, contiguous 30 nt match to any of the rRNA prototypes for that species or to the reads obtained from RNA-seq of polysomes for that species (above) were identified using MUMmer. For all reads satisfying the criteria, both the read and its mate were extracted for downstream analysis.
Calling rRNA variants
[00148] Candidate rRNA paired-end reads identified by computational
hybridization were aligned to the 47S/45S and 5S operon rDNA prototypes for the appropriate species using Bowtie2 with parameters "-q— phred33 -N 1 -L 10 -1 C,3,0— n- ceil 0— dpad 30— ignore-quals—end-to-end— mp 1,1— rdg 4,1— rfg 4,1— score-min L,G,- 0.12 -I a -X b— fr ", where a and b were set as 4 median absolute deviations below and above the insert size mean, respectively, as calculated from the whole genome data (abo ve). The options were set as such to ensure that the entire length of the read aligned to the prototype, up to a certain edit distance, and that the edit distance calculation was independent of base qualify . Indels in the reads were normalized using LeftAlignlndels and IndelRealigner of GATK. Read base qualities were then recalibrated using the recaiibration tables obtained as described above. [00149] Reads identified via computational hybridization were also aligned to the reference genome using bowtie2 with parameters "-q ~phred33 -I a -X b— fr -end-to- end", where a and b were set as 4 median absolute deviations below and above the insert size mean, respectively, as calculated from the whole genome data (above). A paired-end read was discarded if it did not map concordantly to the prototype rDNA, or if there existed a concordant alignment in the reference genome on one of the fully assembled chromosomes that did not overlap with any of the unannotated rRNA regions which had a strictly smaller edit distance than the edit distance of the paired-end read on the rDNA prototype. Here, the edit distance was computed as the sum of the edit distances for each mate. Of the remaining reads, the base qualities for the 8 nt on each end of each mate were manually reduced to Q0. Finally, variants were called with LoFreq, restricted to the regions of the rDNA consisting of the transcribed rRNA.
[00150] The accuracy of variant detection was evaluated using publicly available Illumina whole genome sequence libraries SRR3226042, SRR447638, SRR447643, SRR447662 for Escherichia eoli K-12 MG1655. rRNA variants and their true allele frequencies (AFs) w¾re defined from multiple alignments of each rRNA component (16S, 5S, 23S) of the annotated rDNA operons in the MG1655 genome, designating rDNA operon rmB as the "prototype". For each librar , no false positive variant calls were made, and all true SNP variants were detected with minor exceptions (below). The estimated AF and true AF were highly correlated (r > 0.98, p < l e-6) for all libraries tested. These results are consistent with the reported extremely high specifi city of LoFreq (Z. Peng et al., Nat BiotechnoL, (2012), 30, 253-260).
[00151] Several "true" MG1655 variants were not detected by the pipeline.
Positions 2547, 2548, and 2549 of the 23S, reported to be "GAT" in one of the operons, zero reads were found matching this sequence, indicating that it is an error in the reference genome for MG1655. Additionally, variants at positions 3 and 117 of the 5S were not detected due to edge-effects; namely, being on the extremities of the 5S, variants at these positi ons are supported by nucleotides in the ends of reads, but read termini are excluded from supporting variants by the manual reduction of read termini base qualities to Q0. Thus, these false negatives are expected from the conservative design of the method. Imputation ofrRNA variant AFs
[00152] As expected from LoFreqs limited sensitivity at low sequencing depth (Z. Peng et al. , Nat Biotechnol., (2012), 30, 253-260), our variant detection strategy had limited statistical power to detect all rDNA variants in each individual due to 1] the low sequencing depth of most individuals from the 1000 Genomes Project; 2] our restriction to paired-end reads with each read > 70 nt in length; and 3] low' relative read depth in GC- rich regions which includes all rRNA. To prevent false negative from biasing inter individual comparisons such population-stratification analyses, rRNA variant AFs were imputed for the purposes of inter-individual comparisons. Variant AFs were imputed as follows. Suppose the variant discover)' strategy detected a variant allele defined by variant nucleotide v at position x m individual A. For another individual B, let f be the fraction of reads for individual B overlapping position x which exhibit nucleotide v, and let n be the estimated rDNA copy number of individual B. Then, the variant AF in individual B is f if f*n > 1, and 0 otherwise.
Copy Number Estimation
[00153] Short-read sequencing data read depth distribution is biased according to the GC-content of the fragment that generate the read(s) (J. G. Gibbons, et al., Nat Commun., (2014), 5, 4850). For each library, the median fragment length was computed with Picard "CollectlnsertSizeMetrics". GC-content specific fragmentation rates kg as defined in Benjamini & Speed 2012 (J. G. Gibbons, et al, Nat Commun., (2014), 5, 4850) were empirically computed for each possible GC-content g = 0, ... , L using all properly aligned reads on the autosomes, and excluding regions with structural variation as reported by the 1000 Genomes Project or Mouse Genomes Project for the given sample, or annotated as segmental duplications (Segmental Duplication Database, University of Washington; Mouse Paralogy Server, University of Washington) or with homology to rRNA (above). Thus, given a specific position p in the haploid genome with GC-content x in the region [p, p+L-l], the expected number of fragments generated at p (i.e., with 5’ end at p) is lc/2. It is assumed that the number of fragments generated at a position p of the haploid genome with GC-content x in the region [p, p+L-l] follows a Poisson distribution and that the process of fragment generation is conditionally independent gi ven the reference genome:
Figure imgf000038_0001
where G(p) gives the GC-content of the region [p, p+L-1]
The expected number of fragments generated from a region R ------ [pi, pn] (i.e. the number of fragments with 5’ end in R) is then given by
Figure imgf000038_0002
[00155] With this model, one can estimate the copy number of any region of the genome. amely, given a region R with observed fragment count X, then the maximum likelihood estimate of the copy number C of R is given by:
Figure imgf000038_0003
[00156] This method was validated in several ways. For each individual analyzed, 50,000 regions of width 4,000 nt on autosome that were not overlapping segmental duplications or regions reported to have structural variation for that individual were randomly sampled. Hence, these regions are all putatively of copy number C = 2. Copy number predictions for each sample were computed as described above. For each individual, the Pearson correlation coefficient r for correlation between predicted copy number and GC-content of the region was computed using the 50,000 randomly sampled regions for that individual. These correlations were low' for all individuals. This reconfirms that the GC-content specific fragmentation rate method implemented here indeed controlled for GC-content specific biases in read libraries. Second, for each individual, the 90% quantile q of the absolute error in copy number estimates of the 50,000 sampled regions was computed, showing that for nearly every' single individual analyzed, the copy number estimates of 90% of the randomly sampled control regions differed from the true copy number by less than 0.5. Third, rDNA copy number predictions based on 18S and 28S, separately, were highly correlated per individual (r=0.99) and uncorrelated with library depth (r=-0.02, p=0.33) despite the distinct GC contents of the 18S and 28S.
Differential expression of rRNA variants in mouse organs
[00157] RNA-seq reads from were aligned against the rRNA prototypes with bowtie2 with parameters as described above ("Calling rRNA variants"). To prevent false positive variant calls due to alignment artifacts and poorly aligned read termini, reads were realigned with GATK and the quality scores of read termini were manually reduced to Q0 to preclude their contribution to variant calling, again as described above ("Calling rRNA variants"). LoFreq, which implements the strand bias test and accounts for qualify scores and has been proven to have an extremely low' false positive rate, was then used to call sequence variants. Per-samp!e allele frequencies for detected variants were then computed from read alignment pileups. Here, each sample represents RNA-seq reads obtained from a single organ of a single mouse. Thus, for each organ there were three observations - one from each of the three mice analyzed. Differential expression of variant allele frequencies between pairs of tissues were calculated using Lirnma, which uses a moderated t-statistic to account for common biological variation, with an false discovery rate (FDR) threshold of 0.05.
Figure imgf000039_0001
[00158] Modem analyses of genome variation based on high-throughput sequencing data, including such comprehensive studies as the 1000 Genomes Project (28) and the Mouse Genomes Project (Consortium et a!.. Nature , (2015), 526, 68-74systematically exclude rDNA from consideration due to its absence from reference genome assemblies. Therefore a bioinformatics strategy was developed to: 1 ] identify reads from rRNA genes in whole genome sequencing (WGS) data; 2] reliably estimate rDNA copy number, and;
3] detect sequence variation, while controlling for errors, systematic bias, mapping artifacts inherent to short-read data analysis, and homology to regions of the genome outside of rDNA (FIGS. 6A- 61).
[00159] Rationalizing that the full di versify of rRNA valiants is best captured by actively translating ribosomes, RNA-seq was performed on polysomes isolated from proliferating mouse and human cells to use as bait. Polysomes represent the cellular fraction of ribosomes actively engaged in translating single mRNAs (D. J. Adams, et al, Mamin Genome ., (2015), 26, 403-412; M. Reschke et al., Cell Rep., (2013), 4, 1276- 1287; H. Chasse, et al., Nucleic Acids Res., (2017), 45, e!5; E. Behrmann et al, Cell., (2015), 161, 845-857) and are physically separated from the individual ribosomal subunits and precursors of ribosome biogenesis by sucrose gradient fractionation (E. Behrmann et al, Cell. , (2015), 161, 845-857). Candidate WGS reads from rRNA regions were then extracted by a computational hybridization approach designed to identify those reads from mouse and human with homology to the RNA-seq reads from translating ribosomes or the corresponding "prototype" 5S, 5.8S, 18S, and 28S rRNA sequences (FIGS. 6A,6B). Reads that mapped with fewer errors to the assembled reference genome were discarded to minimize false positive calls arising from reads generated outside of rDNA operons but incorrectly mapped to the rDNA prototypes, such as those generated from regions with homology to rDNA outside of the known, unassembled clusters of rDNA (A. Ferguson et al., Mol Cell. , (2015), 60, 475-486) (FIGS 6C, 6D)
Example 3: rDNA copy number varies widely across individuals in human and mouse
[00160] Experimental estimates of rDNA copy number derived from PCR amplieons from 27 human tissue samples and computational analyses of 168 individuals suggest that humans possess between 200 and 600 rDNA copies (P. Grozdanov, et al, Genomics, (2003), 82, 637-643; A. Ferguson et al., Mol Cell., (2015), 60, 475-486). As rDNA regions have been reported to have >10% meiotic rearrangement frequency (P. Grozdanov, et al., Genomics, (2003), 82, 637-643) and to be hotspots for genome rearrangement (N. Slavov, et al., Cell Rep., (2015), 13, 865-873), the full extent of human rDNA copy number diversity was examined by assessing all 2,546 human genomes from the 1000 Genomes Project. The approach we employed accounts for biases in read depth in short-read data, where regions exhibiting high GC-content, such as rRNA, are systematically underrepresented (I. G. Gibbons, et al, Nat Coinmun., (2014), 5, 4850).
[00161] To estimate rDNA copy number in a single individual, a statistically rigorous method was developed that accounts for sample-specific GC content biases that impact read coverage in short-read sequencing data (FIGS 6E,6F) (I G Gibbons, et al., Nat Commun., (2014), 5, 4850). This approach was validated by 1] accurate estimation of 50,000 randomly sampled internal control regions of the genome of known copy number, 2] the high correlation (r=0.99) observed between rDNA copy number estimates obtained from 18S, 5.8S and 28S rDNA genes, and 3] verification that rDNA copy number estimates were not correlated with GC content. Applying this method to the analysis of the 2,546 sequenced human genomes, rDNA copy number was estimated in humans to vary over a range spanning two orders of magnitude, from 61 to 1 ,590 copies per individual (mean=315, sd=104, median=301). These findings are in strong agreement with early experimental estimates of 320 rDNA operons per individual (Y. Benjamini, et al, Nucleic Acids Res.,( 2012), 40, e72) and with recent results obtained for
subpopulations (P. Grozdanov, et al., Genomics, (2003), 82, 637-643; A. Ferguson et al., Mol Cell, (2015), 60, 475-486).
[QQ162] Allele frequencies that differ by population, termed population
stratification, may indicate genetic loci under selection. Using the Vst statistic (R. D. Schmickel, Pediatr Res , (1973), 7, 5-12), an analog of the fixation index that quantifies multi-copy number population stratification, it was found that 19 of 26 populations represented were differentiated by rDNA copy number (Vst>0.2, max Vst=0.40) (FIG.
IB). Consistent with rDNA existing only on autosomes, rDNA copy number did not stratify by sex. Interestingly, however, rDNA copy number was not stratified by continental populations (comparing South Asian, European, East Asian, American and African) (max Vst= 0.09), suggesting that rDNA copy number is dynamic, changing on a relatively rapid scale such that differentiation is only apparent between smaller, local populations.
[00163] In line with previous results (H. F. Noller, Philos Trans R Soc Lond, B, Biol Sci., (2017), 372; D. M. Stubs, et al, Genome Res , (2008), 18, 13—18) and as expected from the relationship between rDNA copy number and genome size (R. Redon et al, Nature, (2006), 444, 444-454), it ivas found mouse to have fewer rDNA copies than human on average, varying over 5.7-fold across 32 strains, from 74 to 419 rDNA copies per strain (mean=222, sd=74, median=218). The wide range of rDNA copy numbers observed in both mouse and human, representing differences of approximately 66 million nucleotides of rDNA between individuals at the extremes, together with the observed population stratification, are consistent with high rates of meiotic rDNA recombination (N. Slavov, et al., Cell Rep., (2015), 13, 865-873; P. Grozdanov, et al, Genomics, (2003), 82, 637-643). Example 4: rDNA sequence variation ca be detected from whole genome sequencing
[00164] Next, the extent of rDNA sequence variation was determined from whole genome sequencing data. To minimize spurious variant calls arising from instrument error, the base qualities of the extracted reads were empirically recalibrated (FIG. 6G) (C. D. Prokopowich, et a!.. Genome, (2003), 46, 48-50; G. A. Van der Auwera et ai., Curr Protoc Bioinformatics., (2013), 11, 11.10.1-11.10.33). To prevent false positive variant calls arising from ambiguities in read alignments, especially at read termini (M. A.
DePristo et a!., Nat Genet., (2011), 43, 491-498), reads were locally re-aligned around putative indels and the base qualities of the 8 nucleotides (nt) at each read end were reduced to quality score zero (Q0) to preclude read termini from analysis and thus the possibility of misalignment contributing to variant calls (FIG. 6H). Then LoFreq (Z. Peng et al., Nat Biolechnol, (2012), 30, 253-260), a base quality -aware algorithm designed for conservative detection of rare sequence variants that implements a strand bias test with Bonferrom correction, was used to call statistically significant variants in rDNA (FIG. 61). As a positive control, this variant discovery strategy detected the known rDNA sequence variants using WGS data from Escherichia coli K-12 with high specificity and sensitivity, where the estimated allel e frequency (AF) was highly correlated with the true AF (r>0.98, p<le-6).
Example 5 : Humans exhibit pervasive inter - and intra-individual sequence variation in the rRNA genes
[00165] Applying this variant discover}' strategy to WGS data from the 1000 Genomes Project, 1,790 variant alleles were detected at 1,662 positions of the 7,184 nucleotides (23%) in human 5S, 5.8S, 18S and 28S rDNA. Measuring the intra-individual AF of each rRNA sequence variant, i.e., the fraction of operons in an individual’s genome containing a specific rRNA sequence variant, 497 variants that occur in at least one individual with intra-individual AF>20% were identified (FIG. 2). Single individuals were found to encode 32 high frequency variants (intra-individual AF>20%) on average. Across individuals, 128 positions of the 5S, 18S and 28S were found to be multi-allelic. The vast majority of variants (1,739 of 1,790) were single nucleotide polymorphisms (SNPs), which were overrepresented at adenine (A) (p=2.4e-8) and cytosine (C) (p=1.5e- 5) positions and underrepresented at guanine (G) (p:=2 2e-19) positions. In line with reliable SNP identification (40), the mean Transition/Transversion (Ti/Tv) ratio per individual was 1.96, substantially higher than the Ti/Tv ratio for SNPs observed in exomes, introns, and intergenic regions of similarly extreme GC content (43, 44).
Accounting for low complexity regions of the rRNA for which read depth was extremely low due to GC-content bias, variant positions were overrepresented in 18S rRNA (p=2.7e- 14) and underrepresented in 28S (p=9.5e-l7) both per individual and across populations. Consistent with previous results (24, 25), the 51 distinct indel variants, ranging in length from 1 to 12 nt insertions or 1 to 9 nt deletions, largely consisted of GC-rich repeat sequences. These findings reveal pervasive inter- and intra-individual rRNA sequence variation in the human population.
[00166] intra-individual rRNA variant AFs ranged over a broad spectrum, from a single copy to all rDNA copies in an individual’s genome, with numerous variants having extreme penetrance-to-population relationships. At one extreme, 19 variants were observed occurring in >50% of humans with a maximum intra-individual AF <5%. In other words, these variants were found in over half of individuals tested, but w ithin any single individual, at most 5% of the individual’s rDNA operons contained the variant. For instance, the G928A 18S rRNA variant occurred with a maximum intra-individual AF of 3.7%. This residue occurs in helix 22 (h22) of 18S rRNA at a known contact point with the C subunit (Gly344) of initiation factor eIF3 (D. F. Gudbjartsson et al., Scientific data, (2015), 2, 150011). Similarly, the G1233A 18S rRNA variant, which occurs with a maximum intra-individual AF of 4.6%, is found in h30, a component of the peptidyl- †RNA binding domain (A. des Georges et al.. Nature, (2015), 525, 491-495). These findings suggest that low? intra-individual AFs at these functional positions may be evolutionarily advantageous. At the other extreme, 62 rRNA variants were found to occur with maximum intra-individual AF >75%, meaning that at least 75% of the rDNA operons within at least one individual contained the variant. Further, 22 variants in the 18S and 28S rRNA w'ere found to occur at a minimum intra-individual AF >10% in more than 500 individuals. Such high intra-individual AFs suggest that the majority of expressed ribosomes in these respective individuals likely contain these variants.
Figure imgf000044_0001
[00167] Eukaryotic rRNA is post-transcriptionally modified in functional centers of the ribosome(I B. Lomakin, et a! ., Nature, (2013), 500, 307-311; W. A. Decatur, et al , Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., JMol Biol, (1997), 266, 246-268). Of the 1,790 rDNA variants identified in the 5S, 5.8S, 18S and 28S rRNA, 61 localized to 59 of the 256 (23%) positions known to be modified (I. B. Lomakin, et al., Nature , (2013), 500, 307-311; W. A. Decatur, et al., Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., JMol Biol, (1997), 266, 246-268; K. E. Sloan et al., RNA Biol, (2017), 14, 1138-1152). Variants at these positions were identified with high intra- individual AF. For instance, the C462T variant in 18S rRNA, a post-transcriptionally modified position that is a known site of interaction with eukaryotic release factor 1 (eRFl) during translation termination and ribosome release (MADEN, B E H, Progress in nucleic acid research and molecular biology ·, (1990)), is found with intra-individual AF up to 61%. The Al 183G variant in 18S rRNA, found with intra-individual AF as high as 27%, occurs at a position located in intersubunit bridge eB14 near the decoding center of the ribosome that is 2,-0-methylated by the small-nucleolar RNA (snoRNA) snR41 (1. Ofengand, et al, JMol Biol, (1997), 266, 246-268; M. Muhs et al , Mol Cell, (2015), 57, 422-432). Interestingly, 29 positions reported as sites of pseudouridine modification exhibited SNP variants with intra-individual AF as high as 41%, suggesting that a substantial population of ribosomes in human cells lack this functionally important modification (J. Ofengand, et al, J Mol Biol, (1997), 266, 246-268). These findings collectively suggest that rRNA modifications in these individuals may be absent or made on nucleotides of distinct identities, perhaps implying alternative ribosome biogenesis machineiy.
Example 7: Variants occur in intersubunit bridge elements
[00168] The formation, breakage, and dynamic remodeling of the intersubunit bridges that establish the binding interface linking the small and large ribosomal subunits influence nearly every' aspect of the translati on mechanism (H. Chasse, et al, Nucleic Acids Res., (2017), 45, el5; H. Khatter, et al., Nature, (2015), 520, 640-645; J. A. Dunkle et al., Science, (2011), 332, 981 -984). Lending support to the notion that rRNA variants may impact ribosome function, rRNA variants with high intra-individual AF (>20%) localized to intersubunit bridge elements or to the binding sites of ribosomal proteins that mediate them (Bla, B2b, B2c, B2e, B4, B7b, B7c, eBl 1, eB14, eB8b). By contrast, no variants were observed in intersubunit bridges B2a, B5, or B5b. Notably, variants were found on both sides of intersubunit bridges Bla, B2c, B2e, and eB13 Intersubunit bridge Bla, often referred to as the A-site finger (ASF), extends from the 28S rRNA across the tRNA binding sites to the small subunit head domain and is implicated in peptidyl-tRNA positioning within the mammalian Ammoacyl (A) site and substrate translocation (S. Melnikov et al., Nat Struct Mol Biol., (2012), 19, 560-567; L. Wang, et al, RNA, (2011), 17, 2189-2200). Bridge B2c contributes to the central intersubunit bridge structure, bridge B2, that links the site of peptide-bond formation within the large subunit to the site responsible for ammoacyl-tRNA decoding in the small subunit decoding center (H.
Khatter, et al, Nature, (2015), 520, 640-645). Bridge eBl 3, located at the leading edge of the translating ribosome and established through ribosomal protein eL24, extends from the base of the large subunit to ribosomal protein eS6, a small subunit protein implicated in mTQR-dependent regulation of the translation mechanism (T. Budkevich et al., Mol Cell. , (201 1), 44, 214-224; A. C. Hsieh et al., Nature, (2012), 485, 55-61). Taken together, it is possible that variants spanning bridge el ements may impact the process of translation initiation by altering intersubunit bridge formation or the dynamic remodeling of intersubunit bridge elements that accompany the mechanism of protein synthesis (H. Chasse, et al .Nucleic Acids Res., (2017), 45, el 5; H. Khatter, et al , Nature, (2015), 520, 640-645; I. A. Dunkle et al., Science, (2011), 332, 981-984). Variants on both sides of an intersubunit bridge may also template a mechanism for coordinating the pairing of specific 40S and 60S subunits comprised of distinct rRNA alleles in the cellular milieu.
Example 8: rRNA variants stratify by ancestry
[00169] SNP and copy number variations outside of rDNA regions are known to stratify by ancestry, which can be indicative of selective pressures (T. 1. Treangen, et al, Nat Rev Genet., (201 1), 13, 36-46; I. Li et at., Cancer Res., (2016), 76, 4816-4827). Although the genes encoding rRNA represent only -0.1 % of the human genome, they are a known hotspot for genomic rearrangements (N. Slavov, et al., Cell Rep., (2015), 13, 865-873; P. Grozdanov, et al, Genomics, (2003), 82, 637-643). Therefore the relationship between intra-individual rRNA variant AF and population was investigated. Strikingly, the intra-individual AF of 327 rRNA variants (18% of the total) were found to stratify' by population (V st>0.2). Population-stratified variants spanned all four rRNA genes. In total, 44 variants exhibited signatures of extreme population stratification (Vst>0.5), such as the A2538G 28S variant (max Vst=0.56) (FIG. 3 A), of which 32 were from 18S rRNA and 12 were from the 28S rRNA. Only four rRNA variants were stratified by continental groups (max Vst=0.30).
[00170] In total, 239 out of all 325 possible population pairs (74%) stratified at least one rRNA variant. Any two populations stratified 30 to 239 variants. The number of stratified variants across populations was not unusually distributed by continental group (Kruskal -Wallis test, p=0. l l). By contrast, population studies of variation outside of rDNA regions have found African populations to exhibit more variation compared to other populations because of higher genetic diversity among Afri cans and the derivation of the reference genome from European lineage samples (T. i . Treangen, et al, Nat Rev Genet., (2011), 13, 36-46; J. Li et al., Cancer Res., (2016), 76, 4816-4827). Consistent with a >10% recombination rate per rDNA cluster per meiotic cycle (P. Grozdanov, et al., Genomics, (2003), 82, 637-643), the observed disparity m stratification at the population and continental levels suggests that genetic variation m rDNA loci occurs at relatively high rates such that isolated populations quickly diverge with respect to rRNA sequence.
[00171] Notably, 24 of the most highly population-stratified vari ants occurring with intra-individual AF >20% in at least one individual cluster in three distinct regions of the ribosome. Seven populationstratified variants cluster within the rRNA region comprising the base of the small subunit body domain that includes expansion segments (ESs) ES6S and ES3 as well as the surface surrounding ribosomal protein eS6, which is implicated in myriad translational dysfunctions in cancer (FIG. 3B) (P. H. Sudmant et al., Science, (2015), 349, aab376l). Eight populationstratified variants cluster on the apical, solvent- exposed surface of the small subunit head domain, immediately proximal to ribosomal protein eS19, which is causally linked to Diamond-Blackfan anemia (FIG. 3C) (X. M. Ma, et al., Nat Rev Mol Cell Biol, (2009), 10, 307-318; R. Horos et al, Blood, (2012), 1 19, 262-272). Finally, seven population-stratified variants cluster in immediate proximity of the ribosomal protein eL38 binding site, which is implicated in the translation of Hox genes and body patern formation during development (FIG. 3D) (J. Flygare et al., Blood, (2007), 109, 980-986). These observations support the notion that rRNA sequence variants may contribute to the regulation of gene expression in human physiology and disease.
Example 9: rRNA variant allele frequencies are correlated
[00172] The potential for rRNA variants to impact ribosome function may be greater if they occur on the same rRNA or in the same ribosome. Unambiguous determination of the occurrence of rR A variants on the same operon is prohibited by the large number of highly homologous rDNA operons in the genome and the short insert size of the paired-end reads (R. Heilig. Nature , (2004), 431, 931-945). Therefore correlation between intraindividual rRNA variant AFs were assessed to provide a first approximation of whether variants may be genetically linked. To do so, all pairwise intra-individual AF correlations were calculated for variants occurring in >75% of 1 ,887 unrelated individuals. By this method, 1 18 highly correlated (|rj>0.75) variants were found in 18S rRNA that were separated by less than 500 nt, winch formed multiple nexus of correlated variants within each 18S rRNA subdomain. Another 16 variants were highly correlated (jrj>G.75) and separated by more than 500 nt. For instance, two highly correlated variant clusters link a -200 nt region near li2 i in the small subunit body domain to a -150 nt region spanning h34 through h40 in the small subunit head domain, which are distant in both primary and tertiary structure. Correlations of this kind may exist if variants occurred within the same ancestral rDNA operon, which has since experienced extensive duplication and deletion rearrangements through meiotic recombination, or if direct or indirect functional linkages exist between these distal regions of the ribosome.
Example 10: rRNA variants are evolutionarily conserved
[00173] Human and mouse reference rRNA sequences are highly homologous (percent identity : 5S=!00%, 5.8S=99%, 18S=99%, 288=85%) Applying the instant variant discovery strategy to the 32 strains sequenced by the Mouse Genomes Project, 285 distinct variant alleles were detected at 276 positions of the rRNA genes in mouse rDNA (Table 1 ). Individual strains possessed between 39 and 191 rRNA variants (mean=T25; sd=31; median=131). Given the relatively small number of available mouse strains and their limited depth of coverage, these results likely represent a lower bound for rDNA heterogeneity.
Figure imgf000048_0001
Table 1: Summary of rRNA variants detected in mouse strains from the Mouse Genomes Project.
|00174] Eighty of the 276 rRNA variant positions identified in mouse (29%) were also variant in human. Of these, 48 had the same variant alleles, the majority of which (45/48; -94%) also had the same prototype allele. Many of these 45 variants occurred at positions of known functional relevance, including interaction sites between rRNA and extra-ribosomal factors involved m a wide range of translation-related processes, including: mRNA quality control and degradation (Dom34, Hbs! ) (N Kondrashov et al , Cell, (2011), 145, 383-397), ribosome-associated protein quality control (nuclear export mediator factor NEMF) (T. Becker et al., Nat Struct Mol Biol, (2011), 18, 715-720), translation elongation (eEF2) (S. Shao, et a I., Moί Cell, (2015), 57, 433-444), translation initiation (eIF5B, elFl A/elFl) (A. des Georges et al. , Nature, (2015), 525, 491 -495; A.
M. Anger et al., Nature, (2013), 497, 80-85), ribosome biogenesis (snR87, HBII-10, U16, and Ul03)( FI. Yamamoto et al., Nat Rev Microbiol, (2014), 12, 89- 100), translation termination (eRFl) (R W. van Nues et al, EMBO J., (2011), 30, 2420-2430), and selenoprotein synthesis (SBP2) (A. des Georges et al, Nucleic Acids Res., (2014), 42, 3409-3418). For instance, the C543T variant in hi 6 of 18S rRNA, winch has an intra- individual AF as high as 22% in humans, is thought to directly contact the mRNA helicase DHX29 (D. F. Gudbjartsson et al., Scientific data, (2015), 2, 150011 ), an extra-ribosomal factor implicated in translation initiation (FIG. 4A). The G480A variant in h5 of 18S rRNA, which has an intra-individual AF as high as 65% in humans, resides at a position thought to contribute to GTP hydrolysis during tRNA selection (O. Kossinova, et al., RNA, (2014), 20, 1046-1056), and thus the rate and fidelity of protein synthesis (FIG. 4B). The G1764A variant of H38 of the 28S rRNA occurs in the ASF (intersubunit bridge Bla), which is implicated in peptidyl-tRNA positioning and substrate translocation (S. Melnikov et al., Nat Struct Mol Biol, (2012), 19, 560-567; L. Wang, et al., RNA, (2011), 17, 2189- 2200) (FIG. 4C). These findings suggest that rDNA sequence variants occurring in functionally important regions of the ribosome are conserved across species.
Example 10: rRNA variants are differentially expressed between organs
[00175] To investigate whether the identified rDNA variants are expressed, total RNA was harvested from four major organs - brain, lung, liver, and ovary - representing all three germ layers from three 8-weekold BALB/cJ littermates and performed RNA-seq without rRNA depletion (Methods). Among the 4 tissue types examined, 70 distinct rRNA variants were reproducibly detected: 19 in the 18S, 1 in the 5.8S, and 50 in the 28S. Consistent with a direct relationship between rDNA variant copy number and expression level, 31 of these variants (44%) were also detected in the rDNA of the BALB/cJ mouse strain. For these variants, intra-individual genomic AF and rRNA expression level of variants were highly correlated (r=0 97, p=1.8e~19). Hence, consistent with their high intra-individual AF, sequence variants present in rRNA genes are actively expressed. Another 14 expressed rRNA variants were non-overlapping with mouse rDNA variants and instead coincided with positions of known post-transcriptional modification (I. B. Lomakin, et al., Nature, (2013), 500, 307-311; W. A. Decatur, et al , Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., J Mol Biol., (1997), 266, 246-268; K. E. Sloan et al, RNA Biol., (2017), 14, 1 138—1152). The remaining 25 variants wore neither detected in BALB/cJ rDNA nor found to coincide with positions of known modification. This subset may reflect undetected positions of rDNA variation or posttranscriptionaily modified positions in rRNA that have yet to be annotated.
[00176] Tissue-specific histone marks on rDNA (R. Santoro, et al., EMBO Rep., (2010), 11, 52-58; R. M. Voorhees, et al, Science, (2010), 330, 835-838), the methylation of non-expressed rDNA regions (J. W. Briggs, et al , Mol Cell., (2017) 67, 3-4; B. A. Kuo, et al, Nucleic Acids Res., (1996), 24, 4817-4824), and tissue-specific expression of variants in the ETS region of rDNA (M. L Holland, EBioMedicine (2017)) are well documented. This precedent suggests that rDNA gene variants, and thus ribosomes assembled from distinct rRNA sequences, may exhibit tissue-specific expression.
Consistent with such a model, it was observed that 26 of the 70 expressed rRNA variants (37%) were differentially expressed in at least one tissue (FIG. 5; Table 3) Each pair of tissues was distinguished by 6 to 17 differentially expressed rRNA variants. These differentially expressed rRNA variants consisted of substitutions, deletions, and insertions that localized to regions of functional and structural importance, including intersubunit bridge B4 and the binding sites for eIF3 (ES7), deacylated tRNA, ribosoma! proteins eL33 and eL44, and ribosome biogenesis factors nucleolin and Arxl (G. E. Zentner, et al., Nucleic Acids Res., (2011), 39, 4949-4960; G. Serin et al., Biochmie., (1996), 78, 530- 538; B. Bradatsch et al., Nat Struct Mol Biol, (2012), 19, 1234-1241 ). Of the 26 differentially expressed rRNA variants identified, 15 (58%) were detected in the BALB/cJ rDNA and another 4 coincided with known positions of rRN A modification: positions 1032 and 1248 of the 18S, and 1136 and 4083 of the 28S (I. Ofengand, et al., JMol Biol. , (1997), 266, 246-268; K. E. Sloan et al, RNA Biol, (2017), 14, 1138-1 152). As reverse transcriptase error profiles are dependent on modification type (A. S. Petrov et al. , Proc Natl Acad Sei USA., (2014), 111 , 10251-10256) and multiple types of modification have been observed at 18S positions 1032 and 1248 and 28S position 1136 (W. A. Decatur, et al., Trends Bi.och.em Sci., (2002), 27, 344-351; 1. Ofengand, et al., JMol Biol, (1997), 266, 246-268; K. E. Sloan et al., RNA Biol, (2017), 14, 1138-1152), these results suggest that modification profiles at these positions differ across tissues, either by overall modifi cation levels or by the distribution of modification types at the same position.
Figure imgf000051_0002
Figure imgf000051_0001
Figure imgf000051_0003
Figure imgf000052_0001
[00177] In absolute terms, these findings show that a substantial fraction of the total pool of expressed ribosomes differs between organs (Table 3). For instance, given the estimate of 10 million ribosomes per mammalian cell (M. Helm, et al. , Nat Rev Genet., (2017), 18, 275-291), there are over 3 million more ribosomes (30% of the total) bearing a U insertion m 28S rRNA at position 2234 in lung compared to ovary. Interestingly, variants with the largest estimated differences in absolute number of ribosomes between tissues were found almost entirely in eukaryoticspeeific expansion segments, whose functional relevance have yet to be thoroughly elucidated (Table 3). Half of the ten variants showing the largest tissue-specific expression differences were found in ES27 (H63) located on the back side of the 60S subunit. This highly dynamic ES plays an important yet unknown role in stabilizing the ribosome-associated complex (RAC) and coordinating extra-nbosomal factors implicated in co-transiationai protein folding near the nascent peptide exit tunnel (G. Serin et al., Biochirnie., (1996), 78, 530-538; R. Freitas, et al., Kinematic Seif-Replicating Machines (Landes Bioscience, 2004)). The full repertoire of ES27 contributions to the translation mechanism has, however, yet to be elucidated. Intriguingly, variants in these regions were also found in human at high intra-individual AF, suggesting that their tissue-specific expression is conserved.
Example 11: rRNA variants are incorporated into actively translating ribosomes
[00178] To assess whether expressed rRNA variants assemble into functional ribosomes, rRNA from polysomes isolated from mouse mammary epithelial cells were isolated and sequenced, which could be readily grown in sufficient quantities for quantitative polysome analyses (Methods). Consistent with expectation (D J. Adams, et al., Mamm Genome., (2015), 26, 403-412), quantitative mass spectrometry' confirmed that these polysome fractions were principally ribosome components and devoid of ribosome biogenesis-related proteins (Methods). Within this subpopulation of the total cellular ribosome pool, 62 rRNA variants were detected, of wfiich 30 were found in the rDNA of at least one of the mouse strains analyzed and another 19 coincided with known positions of rRNA modification. The expression levels of the rRNA variants detected in the actively translating ribosome pool were also found to be highly correlated with intraindividual genomic variant AF (r=0.96, p=1.2e-17) (Methods). Moreover, 42 of the 70 variants identified as expressed in tissues were detected in actively translating ribosomes, where expression levels were again highly correlated (r=0.99, p=2.0e-36). Importantly, 8 of the 15 differentially expressed rRN A variants observed in both mouse tissues and BALB/cJ rDNA (53%) were also present in the actively translating ribosome pool. The substantial overlap between the variants detected in polysomes with those found in tissues and in genomic DNA, together with the high correlation between genomica!ly encoded and expressed AF, provide compelling evidence that variant rRNAs are present in ribosomes actively engaged in protein synthesis.
[00179] To further elucidate the correspondence between rRNA variant levels in expressed versus actively translating ribosomes, total RNA isolated from the same mouse epithelial cells were sequenced (Methods). It was found that 38 out of 40 rRNA variants detected in expressed ribosomes were also present in actively translating ribosomes from the same cells, again exhibiting highly correlated expression levels (r=0.98, p=3.0e-28). The strong correlations in rRNA variant AF among genomic, expressed, and actively translating ribosomes provide further evidence that genomically encoded rRNA variants are present in actively translating ribosomes at similar levels as their frequency in the genome. These findings support the notion that genomically encoded rRNA variants are assembled into functional ribosomes and have the potential to impact the translation process and consequently gene expression. The observed distinctions in the rRNA sequence composition of actively translating ribosomes may impact translation by affecting dynamic aspects of the protein synthesis mechanism, subcellular localization of the ribosome, or the ribosome’s association with extra-ribosomal factors, the "ribo- interactome" (M. Sauert, et al., Biochimie., (2015) 1 14, 39-47; D. Sirnsek el al., Cell., (2017), 169, 105 l-1065.eI8).
[00180] Ribosomal DNA is currently "dark matter" that is missing from contemporary genome assemblies (I. L. Gonzalez, et al., Nucleic Acids Res., (1988), 16, 10213-10224). Consequently, the contribution of rRNA heterogeneities to
ribosomopathies or functional distinctions m the pool of assembled ribosomes has yet to be explored. The rigorous rDNA copy number estimation and rRNA sequence variant discovery strategies that are described here reveal pervasive intraand inter-individual rRNA sequence variation in the human genome, with at least 1,662 of the 7,184 nt of rRNA (23%) exhibiting sequence variation in the global population, and variation in rDNA copy number between individuals that spans nearly two orders of magnitude. Tissue-specific rRNA variant expression and extensive overlap between the discovered rRNA variants and positions of known functional importance in the assembled ribosome, including sites identified as being post- transcriptionally modified were further observed (W. A. Decatur, et a!., Trends Biochem Sci ., (2002), 27, 344-351; J. Ofengand, et a!., J Mol Biol, (1997), 266, 246-268; K. E. Sloan et al, RNA Biol , (2017), 14, 1 138-1152). These findings collectively suggest that tissue-specific expression of distinct rDNA operons may contribute to important aspects of cellular physiology.
[00181] There are at least two possible mechanisms of differential rDNA operon expression: T| rDNA operon-specific regulation or 2] tissue-to-tissue variation in the copy number of sub-classes of rDNA operons bearing specific variant alleles. While differences in total rDNA copy number between mouse tissues have been reported (H. F. Noller, Philos Trans R Soc Lond, B, Biol Sci., (2017), 372), it is unknown how the copy number of specific sub-classes of rDNA operons varies by tissue. Notably, larger variations in tissue-specific rRNA variant expression were observed than expected from rDNA copy variation alone. In this context, the inventors noted that chromatin structure and epigenetic modifications of the rDNA promoter contribute to the specificity of rDNA operon transcription (R. M. Voorhees, et al, Science , (2010), 330, 835-838; C. Lei dig et al, Nat Struct Mol Biol, (2013), 20, 23-28; R. Santoro, et al., Mol Cell, (2001), 8, 719- 725). The expression of rDNA is also correlated with nucleotide variants within the IGS region, rDNA promoter, and the 5’ ETS region upstream of the rRNA genes (J. W. Briggs et al, Mol Cell, (2017) 67, 3-4; B. A. Kuo, et al., Nucleic Acids Res., (1996), 24, 4817- 4824; M. L. Holland, EBioMedicine (2017)). Hence, tissue-specific rRNA variant expression is likely also regulated by rDNA promoter, IGS, and ETS sequence heterogeneities. IGS regions contain variable-length tandem repeats, transposable A!u repeat elements that are highly abundant in the human genome, simple sequence microsatellites (B. Xu et al., PLoS Genet., (2017), 13, el006771), and exhibit substantial structural variation (P. Grozdanov, et al., Genomics , (2003), 82, 637-643; R. Santoro, et al., Nat Genet., (2002), 32, 393-396). While these properties complicate the detection of sequence and structural variation in these regions (R. Heilig. Nature, (2004), 431, 931- 945), the present findings reveal that knowledge of the full repertoire of complete rDNA operon sequences will be necessary to comprehend the true diversity of these complex elements and the mechanisms by which IGS and ETS regions contribute to the regulation of rDNA operon expression.
[00182] The extensive catalog of rRNA variants revealed by the present analysis solidify and substantially extend reports of rRNA sequence heterogeneity in mammals (1. G. Gibbons, et al., Proc Natl Acad Sci USA., (2015), 112, 2485-2490; H. Tseng et al., PLoS ONE, (2008), 3, el843; H. Leffers, et al., Nucleic Acids Res., (1993), 21, 1449- 1455). Despite the conservative nature of the analytic approach applied, it was observed that nearly 25% of the functional rRNA sequences can vary. Nearly 7% of the detected variants are more than 20% penetrant in at least one sequenced individual. Given the observed correlation between genomic AF and expression, these data suggest an unanticipated level of rRNA sequence variation in the actively translating ribosome pool. Our finding that positions of conserved sequence variation map to functional regions of the assembled ribosome, including the binding sites for ribosomal proteins implicated in ribosomopathies, suggests that specific alleles may be conserved for yet unknown reasons. Heterogeneities in rRNA sequence may impact ribosome function by altering the binding of core ribosomal proteins, transient associations of the ribosome with extra-ribosomal factors, or conformational changes in the ribosome that are important to the mechanism of protein synthesis. For instance, rRNA sequence variants in ribosomal protein binding domains may template the biogenesis of ribosomes with distinct core protein
compositions, a feature that has been recently connected to the translation of distinct mRNAs (S. Caburet et al , Genome Res., (2005), 15, 1079-1085). As heterogeneities of this kind are sufficient to define subclasses of physically distinct ribosomes, the inventors conclude that rDNA copy number and rRNA sequence variation may template specialized functions that enable the ribosome to actively participate in determining the cellular proteome (M. Bama, FASEB J (2017); J. W. Briggs, et al., Mol Cell, (2017) 67, 3-4).
[00183] importantly, the present analysis likely represents a lower bound on the true diversity of rDNA sequences in the mouse and human genomes and thus sequence variation within the assembled ribosome. Read coverage is low in regions of extremely high GC-content (35), which applies to rRNA in general and especially for regions of the 28S (>80% GC) for which we observed nearly zero read depth. These regions include those identified as mter-species variable domains that are expressed and present in actively translating ribosomes (J. G. Gibbons, et al., Proc Natl Acad Sci USA., (2015), 1 12, 2485- 2490: H. Tseng et al, PLoS ONE, (2008), 3, el 843; H. Leffers, et al., Nucleic Acids Res., (1993), 21, 1449-1455; Z. Shi et al, Mol Cell, (2017), 67, 71-83.e7). These considerations argue for in-depth investigations into the nature and extent of sequence diversity among the multitude of full-length rDNA operons. They also warrant focused investigations into the mechanisms regulating rDNA operon expression and rRNA assembly into functional ribosomes as well as how, and to what extent, such variation contributes to gene expression and disease.

Claims

WHAT IS CLAIMED IS:
1. A processor programmed to perform:
i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
ii) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity; and
iii) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
2. The processor of claim 1, the processor is further programmed to perform:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base qualify score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality' score recalibration; and
vii) detecting rare sequence variants on the reads that have gone through terminal read quality reduction.
3. The processor of claim 1 , the processor is further programmed to perform:
iv) determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
4. The processor of claim 3, wherein the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
5. The processor of claim 1, the processor is further programmed to perform:
mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
6. The processor of claim 1, wherein the set of high-throughput sequencing reads comprises DNA sequencing reads.
7. The processor of claim 1, wherein the set of high-throughput sequencing reads comprises RNA sequencing reads.
8. The processor of claim 1, wherein the contiguous stretch of the reference sequence is at least 25 nucleotides long.
9. The processor of claim 1, wherein the contiguous stretch of the reference genome is at least 25 nucleotides long
10. The processor of claim 1, wherein the first reference sequence is a nbosomal sequence.
11. The processor of claim 10, wherein the ribosomal sequence comprises a 47S/45S rDNA prototype sequence.
12. The processor of claim 10, wherein the first reference sequence is selected from the group consisting of NCB! reference sequences NR_003278.3, NR_003279.1, NR_003280.2, and NR_030686.1.
13. The processor of claim 10, wherein the first reference sequence is selected from the group consisting of NCBI reference sequences NR 003285.2, NR 003286.2, NR_003287.2, and NR 023379 1.
14. The processor of claim 10, wherein the first reference sequence is selected from the group consisting of GenBank IDs U13369.1 and C128G1.1.
15. The processor of claim 10, wherein the first reference sequence is selected from the group consisting of BK000964.3 and Mus mus cuius chromosome 8 region
[123538334, 123539354]
16. The processor of claim 5, wherein the second reference sequence comprises a sequence generated by sequencing ribosomes.
17. The processor of claim 1, wherein the organism is a prokaryote.
18. The processor of claim 1, wherein the organism is a mammal.
19. The processor of claim 18, wherein the mammal is selected from the group consisting of Homo sapiens , Mus musculus, and Rattus norvegicus.
20. A computer-readable storage device, comprising instructions to perform: i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
h) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity; and
hi) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
21. The computer-readable storage device of claim 20, further comprising instructions to perform:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base quality score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality score recalibration; and
vii) detecting rare sequence variants on the reads that have gone through terminal read quality reduction
22 The computer-readable storage device of claim 20, further comprising instructions to perform:
iv) determining the number of copies of the first reference sequence the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence
23. The computer-readable storage device of claim 22, wherein the number of reads expected to map to the reference sequence is calculated from a GC -content specific fragmentation rate analysis of the remainder reads.
24 The computer-readable storage device of claim 20, further comprising instructions to perform:
mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequen ce which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
25. The computer-readable storage device of claim 20, wherein the set of high- throughput sequencing reads comprises DNA sequencing reads.
26. The computer-readable storage device of claim 20, wherein the set of high- throughput sequencing reads comprises RNA sequencing reads.
27. The computer-readable storage device of claim 20, wherein the contiguous stretch of the reference sequence is at least 25 nucleotides long.
28. The computer-readable storage device of claim 20, wherein the contiguous stretch of the reference genome is at least 25 nucleotides long.
29. The computer-readable storage device of claim 20, wherein the first reference sequence is a ribosomal sequence.
30. The computer-readable storage device of claim 29, wherein the ribosomal sequence comprises a 47S/45S rDNA prototype sequence.
31. The computer-readable storage device of claim 29, wherein the first reference sequence is selected from the group consisting ofNCBI reference sequences
\R 0032.78.3. NR 003279.1, NR 003280.2, and NR 030686. 1
32. The computer-readable storage device of claim 29, wherein the first reference sequence is selected from the group consisting ofNCBI reference sequences
NR 003285.2, NR 003286.2, NR_003287.2, and NR 023379 1.
33. The computer-readable storage device of claim 29, wherein the first reference sequence is selected from the group consisting of GenBank IDs U13369.1 and XI2811.1.
34. The computer-readable storage device of claim 29, wherein the first reference sequence is selected from the group consisting of BK000964.3 and Mus mus cuius chromosome 8 region [123538334, 123539354]
35. The computer-readable storage device of claim 24, wherein the second reference sequence comprises a sequence generated by sequencing ribosomes.
36. The computer-readable storage device of claim 20, wherein the organism is a prokaryote.
37. The computer-readable storage device of claim 20, wherein the organism is a mammal.
38. The computer-readable storage device of claim 37, wherein the mammal is selected from the group consisting of Homo sapiens , Mas musculus, and Raitus norvegicus.
39. A method, comprising:
i) mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment, wherein the first reference sequence is present in multiple copies in the genome of the organism;
ii) identifying candidate reads within the set based on the initial alignment, wherein each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity; and
hi) mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multipl e copies of the first reference sequence in the genome of the organism.
40. The method of claim 39, further comprising performing:
iv) a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads;
v) a base quality score recalibration on the remainder reads after the realignment; vi) a terminal read quality reduction on the remainder reads after the base quality score recalibration; and
vii) detecting rare sequence variants on the reads that have gone through terminal read qualify' reduction.
41. The method of claim 39, further comprising:
iv) determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
42. The computer-readable storage device of claim 41, wherein the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
43. The method of claim 39, further comprising:
mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence that the second reference sequence was generated using a different method than that was used in generating the first reference sequence.
44. The method of claim 39, wherein the set of high-throughput sequencing reads comprises DNA sequencing reads.
45. The method of claim 39, wherein the set of high-throughput sequencing reads comprises RNA sequencing reads.
46. The method of claim 39, wherein the contiguous stretch of the reference sequence is at least 25 nucleotides long
47. The method of claim 39, wherein the contiguous stretch of the reference genome is at least 25 nucleotides long.
48. The method of claim 39, wherein the first reference sequence is a nbosomai sequence.
49. The method of claim 48, wherein the ribosomai sequence comprises a 47S/45S rDNA prototype sequence.
50. The method of claim 48, wherein the first reference sequence is selected from the group consisting ofNCBI reference sequences NR_003278.3, NR_003279.1,
NR 003280.2, and NR 030686.1.
51. The method of claim 48, wherein the first reference sequence is selected from the group consisting ofNCBI reference sequences NR_003285.2, NR_003286.2,
NR 003287 2. and NR 023379 i .
52. The method of claim 48, wherein the first reference sequence is selected from the group consisting of GenBank IDs Ul 3369 1 and C128G1.1.
53. The method of claim 48, wherein the first reference sequence is selected from the group consisting of BK000964.3 and Mus musculus chromosome 8 region [123538334, 123539354]
54. The method of claim 43, wherein the second reference sequence comprises a sequence generated by sequencing ribosomes.
55. The method of claim 39, wherein the organism is a prokaryote.
56. The method of claim 39, wherein the organism is a mammal.
57. The method of claim 37, wherein the mammal is selected from the group consisting of Homo sapiens , Mus musculus, and Rattus norvegicus.
PCT/US2019/020024 2018-02-28 2019-02-28 Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets WO2019169117A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862636466P 2018-02-28 2018-02-28
US62/636,466 2018-02-28

Publications (1)

Publication Number Publication Date
WO2019169117A1 true WO2019169117A1 (en) 2019-09-06

Family

ID=67805916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/020024 WO2019169117A1 (en) 2018-02-28 2019-02-28 Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets

Country Status (1)

Country Link
WO (1) WO2019169117A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113308540A (en) * 2020-02-27 2021-08-27 上海鹍远生物技术有限公司 Thyroid nodule-related rDNA methylation marker and application thereof
CN113436683A (en) * 2020-03-23 2021-09-24 北京合生基因科技有限公司 Method and system for screening candidate inserts
WO2021195594A1 (en) * 2020-03-26 2021-09-30 San Diego State University (SDSU) Foundation, dba San Diego State University Research Foundation Compositions and methods for treating or ameliorating infections
CN113571131A (en) * 2021-08-06 2021-10-29 广东省农业科学院水稻研究所 Pangenome construction method and corresponding structural variation mining method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083004A1 (en) * 2013-12-02 2015-06-11 Population Genetics Technologies Ltd. Method for evaluating minority variants in a sample

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083004A1 (en) * 2013-12-02 2015-06-11 Population Genetics Technologies Ltd. Method for evaluating minority variants in a sample

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BENJAMINI, Y ET AL.: "Summarizing and Correcting the GC Content Bias in High-Throughput Sequencing", NUCLEIC ACIDS RESEARCH, vol. 40, no. 10, 9 February 2012 (2012-02-09), pages 1 - 14, XP055162924, doi:10.1093/nar/gks001 *
GROZDANOV, P ET AL.: "Complete Sequence of the 45-Kb Mouse Ribosomal DNA Repeat: Analysis of the Intergenic Spacer", GENOMICS, vol. 82, no. 6, December 2003 (2003-12-01), pages 637 - 643, XP004472311, doi:10.1016/S0888-7543(03)00199-X *
PARKS, MM ET AL.: "Variant Ribosomal RNA Alleles Are Conserved And Exhibit Tissue-Specific Expression", SCIENCE ADVANCES, vol. 4, no. 2, 28 February 2018 (2018-02-28), pages 1 - 13, XP055635181 *
WILM, A ET AL.: "Lofreq: A Sequence-Quality Aware, Ultra-Sensitive Variant Caller For Uncovering Cell -Population Heterogeneity From High-Throughput Sequencing Datasets", NUCLEIC ACIDS RESEARCH, vol. 40, no. 22, 12 October 2012 (2012-10-12), pages 11189 - 11201 *
YU , S ET AL.: "A Portrait of Ribosomal DNA Contacts with Hi-C Reveals 5S and 45S rDNA Anchoring Points in the Folded Human Genome", GENOME BIOLOGY AND EVOLUTION, vol. 8, no. 11, 25 October 2016 (2016-10-25), pages 3545 - 3558, XP055635180 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113308540A (en) * 2020-02-27 2021-08-27 上海鹍远生物技术有限公司 Thyroid nodule-related rDNA methylation marker and application thereof
CN113436683A (en) * 2020-03-23 2021-09-24 北京合生基因科技有限公司 Method and system for screening candidate inserts
WO2021195594A1 (en) * 2020-03-26 2021-09-30 San Diego State University (SDSU) Foundation, dba San Diego State University Research Foundation Compositions and methods for treating or ameliorating infections
CN113571131A (en) * 2021-08-06 2021-10-29 广东省农业科学院水稻研究所 Pangenome construction method and corresponding structural variation mining method

Similar Documents

Publication Publication Date Title
Lee et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing
Plassais et al. Whole genome sequencing of canids reveals genomic regions under selection and variants influencing morphology
Swart et al. The Oxytricha trifallax macronuclear genome: a complex eukaryotic genome with 16,000 tiny chromosomes
Parks et al. Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression
Shukla et al. Comprehensive analysis of cancer-associated somatic mutations in class I HLA genes
Shibata et al. Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection
Duan et al. Adaptation of A-to-I RNA editing in Drosophila
WO2019169117A1 (en) Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets
Guttman et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs
Bottomly et al. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays
Zhang et al. Genes that escape X-inactivation in humans have high intraspecific variability in expression, are associated with mental impairment but are not slow evolving
Bayega et al. Current and future methods for mRNA analysis: a drive toward single molecule sequencing
Zhang et al. Isoform evolution in primates through independent combination of alternative RNA processing events
He et al. The conservation and signatures of lincRNAs in Marek’s disease of chicken
Nowick et al. Gain, loss and divergence in primate zinc-finger genes: a rich resource for evolution of gene regulatory differences between species
JP2017537646A (en) Sequencing control
WO2018144782A1 (en) Methods of detecting somatic and germline variants in impure tumors
Hassan et al. De novo reconstruction of the Toxoplasma gondii transcriptome improves on the current genome annotation and reveals alternatively spliced transcripts and putative long non-coding RNAs
Chen et al. RNASEQR—a streamlined and accurate RNA-seq sequence analysis program
Orozco et al. Intergenerational genomic DNA methylation patterns in mouse hybrid strains
Yurtman et al. Archaeogenetic analysis of Neolithic sheep from Anatolia suggests a complex demographic history since domestication
US20220392570A1 (en) Method for screening ivf embryos
Johnson et al. Genetic analysis of the cardiac methylome at single nucleotide resolution in a model of human cardiovascular disease
Chen et al. Systematic investigation of insertional and deletional RNA-DNA differences in the human transcriptome
EP3588506B1 (en) Systems and methods for genomic and genetic analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19761542

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19761542

Country of ref document: EP

Kind code of ref document: A1