WO2019028189A2 - Determination of str length by short read sequencing - Google Patents

Determination of STR length by short read sequencing

Info

Publication number
WO2019028189A2
Authority
WO
WIPO (PCT)
Prior art keywords
chromosome
reads
repeat
read
str
Prior art date
Application number
PCT/US2018/044889
Other languages
French (fr)
Other versions
WO2019028189A3 (en)
Inventor
Haibao TANG
Original Assignee
Human Longevity, Inc.
Priority date
Filing date
Publication date
Application filed by Human Longevity, Inc.
Publication of WO2019028189A2
Publication of WO2019028189A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • Short tandem repeats are hyper-mutable sequences in the human genome that are often used in forensics and population genetics, and are also the underlying cause of many genetic diseases.
  • NGS Next Generation Sequencing
  • accurate detection of pathological STR expansion is limited by the sequence read length during whole genome analysis.
  • variant calling software, for example, Manta, Isaac, GATK, and lobSTR
  • Some of these software tools seek to identify STR variants by specifically examining the sequencing reads that are piled around a target STR region.
  • lobSTR uses three separate steps: sensing, alignment, and allelotyping, which explicitly model two possible alleles (diploid) as well as sequencing errors typically associated with STRs (due to stutter noise).
  • lobSTR only considers reads that fully span a STR locus.
  • STRViper The short length of Illumina reads (100-150 bases) imposes a major limitation on the length of STR alleles that can be identified.
  • an estimate of length variation at an STR can also be calculated by combining information from a prior estimate and the observed sizes of paired-end sequence fragments spanning the STR.
  • STRViper assumes a single allele at each site, which is a significant limitation for quantitating STR from diploid human calls.
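The paired-end idea above can be sketched with simple arithmetic. When a fragment spans an STR that is longer in the sample than in the reference, its two mates map closer together on the reference than the library's typical fragment size, and the shortfall approximates the extra repeat length. This is only an illustration, not STRViper's implementation: the function name and the numbers are hypothetical, and the prior estimate mentioned above is deliberately ignored.

```python
from statistics import mean


def estimate_length_change(observed_spans, library_mean):
    """Estimate how many bases longer (positive) or shorter (negative) the
    sample allele is than the reference allele.

    observed_spans: distances between mates as mapped on the reference,
                    for fragments that span the STR (hypothetical input).
    library_mean:   mean fragment size of the sequencing library.
    """
    if not observed_spans:
        raise ValueError("need at least one spanning fragment")
    # Extra bases in the sample are invisible on the reference, so mapped
    # spans shrink by roughly the size of the expansion.
    return library_mean - mean(observed_spans)


# Hypothetical numbers: a 500 bp library whose spanning fragments map
# about 450 bp apart.
change = estimate_length_change([440, 450, 460], library_mean=500)
print(change)
```

A single mean cannot separate two alleles of different length, which is the limitation noted above for diploid samples.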
  • a method of determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and aligning the extracted nucleic acid sequence reads.
  • the method can further comprise parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads.
  • the method can further comprise determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups, and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
  • a program is stored for causing a computer to perform a method for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus.
  • the method can comprise extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and aligning the extracted nucleic acid sequence reads.
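The extraction criterion just described can be sketched as a simple predicate over read records. The `Read` record, its field names, and the `should_extract` helper are hypothetical and only illustrate the two conditions (a read mapped within one read length of the locus, or a read whose mate maps within about 2 kb of it); the coordinates are those listed in this document for the chromosome 4 locus.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Minimal, hypothetical read record (not a real library type)."""
    chrom: str
    pos: int         # leftmost mapped position of the read
    length: int      # read length in bases
    mate_chrom: str
    mate_pos: int    # leftmost mapped position of the mate


def should_extract(read, chrom, str_start, str_end, mate_window=2000):
    """Return True if the read is worth extracting for the given STR locus."""
    # Condition 1: the read itself maps within one read length of the locus.
    near_locus = (
        read.chrom == chrom
        and str_start - read.length <= read.pos <= str_end + read.length
    )
    # Condition 2: the mate maps within about 2 kb of the locus.
    mate_near = (
        read.mate_chrom == chrom
        and str_start - mate_window <= read.mate_pos <= str_end + mate_window
    )
    return near_locus or mate_near


# Locus coordinates taken from the list in this document (chromosome 4).
LOCUS = ("4", 3074877, 3074933)
reads = [
    Read("4", 3074800, 150, "4", 3075100),  # overlaps the locus
    Read("7", 1000, 150, "4", 3076000),     # only its mate is nearby
    Read("4", 1000, 150, "4", 1400),        # unrelated
]
kept = [r for r in reads if should_extract(r, *LOCUS)]
print(len(kept))
```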
  • the method can further comprise parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads.
  • the method can further comprise determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups, and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
  • a system for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus.
  • the system can comprise a sequencing unit configured to generate nucleic acid sequence reads, and an alignment engine configured to extract the nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and align the extracted nucleic acid sequence reads.
  • the system can further comprise a diagnosing unit comprising a repeat length determination engine configured to receive aligned reads and (1) parse the reads into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads, and (2) determine the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups.
  • the diagnosing unit can further comprise a risk assessment engine configured to determine a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
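One way to read the risk step above is as the share of probability mass at or beyond the threshold. The sketch below assumes the probabilistic model yields a discrete distribution over repeat counts; the `risk_probability` name, the example distribution, and the cutoff value are hypothetical and are not taken from this document.

```python
def risk_probability(posterior, cutoff):
    """Fraction of probability mass at or beyond the risk cutoff.

    posterior: dict mapping repeat count -> (possibly unnormalized) probability.
    cutoff:    hypothetical locus-specific risk threshold, in repeat units.
    """
    total = sum(posterior.values())
    if total <= 0:
        raise ValueError("posterior must have positive total mass")
    at_risk = sum(p for units, p in posterior.items() if units >= cutoff)
    return at_risk / total


# Hypothetical distribution for the longer allele at one locus.
posterior = {38: 0.1, 39: 0.2, 40: 0.4, 41: 0.3}
print(risk_probability(posterior, cutoff=40))
```

Reporting a probability rather than a yes/no call keeps the uncertainty of the length estimate visible when the predicted length sits close to the threshold.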
  • a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: (a) extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; (b) creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (c) parsing the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (d) determining the repeat length of an STR sequence by applying a probabilistic model to
  • the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
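A minimal sketch of how single reads might be sorted into the groups named above, using only interval overlap on reference coordinates. This is a simplification under stated assumptions: the `classify_read` helper and the "flanking" label are hypothetical, intervals are half-open, and a real classifier would work from the local alignment rather than from reference coordinates alone. Paired-end evidence concerns the distance between mates and is not modeled here.

```python
def classify_read(read_start, read_end, str_start, str_end):
    """Classify one aligned read relative to an STR interval.

    All intervals are half-open [start, end) on the reference.
    """
    if read_end <= str_start or read_start >= str_end:
        return "flanking"          # no overlap with the repeat at all
    covers_left = read_start < str_start    # reaches into the left flank
    covers_right = read_end > str_end       # reaches into the right flank
    if covers_left and covers_right:
        return "spanning"          # both flanks and the whole repeat
    if covers_left or covers_right:
        return "partial"           # one flank plus part of the repeat
    return "repeat-only"           # lies entirely inside the repeat


# Hypothetical STR occupying [100, 160) on the reference.
for start, end in [(90, 200), (90, 130), (110, 150), (0, 50)]:
    print(start, end, classify_read(start, end, 100, 160))
```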
  • the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater.
  • the first alignment to the genome is at an average sequence depth of 20 or greater.
  • the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome.
  • the second alignment uses a Smith-Waterman algorithm or variation thereof.
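To illustrate why a lower gap penalty matters for the second alignment, the sketch below scores a read carrying two extra CAG units against a shorter reference with a textbook Smith-Waterman recurrence and a linear gap penalty. The scoring values and sequences are hypothetical, and this is not the aligner used in the described method.

```python
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-5):
    """Best local alignment score between strings a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # match or mismatch
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


reference = "ACGT" + "CAG" * 3 + "TTGA"
read = "ACGT" + "CAG" * 5 + "TTGA"   # two extra repeat units

strict = smith_waterman_score(read, reference, gap=-5)
relaxed = smith_waterman_score(read, reference, gap=-1)
print(strict, relaxed)
```

With the stricter penalty the best local alignment stops at the end of the shared repeat, while the cheaper gaps let it extend through the inserted units and reach the far flank, which is what is needed to measure the repeat.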
  • the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
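The two-allele maximum-likelihood idea can be sketched with a toy model that uses only repeat counts observed in spanning reads. The error model (a fixed, unnormalized "stutter" probability of being off by one unit), the function names, and the observations are all hypothetical; the model described in this document combines several evidence types and is not reproduced here.

```python
import math
from itertools import combinations_with_replacement


def read_prob(observed, allele, stutter=0.1):
    """Toy, unnormalized probability of observing a repeat count given an allele."""
    diff = abs(observed - allele)
    if diff == 0:
        return 1.0 - stutter
    if diff == 1:
        return stutter / 2.0
    return 1e-6  # small floor so the log-likelihood stays finite


def ml_genotype(observations, max_units, ploidy=2):
    """Return the allele tuple (sorted ascending) with the highest log-likelihood."""
    best, best_ll = None, -math.inf
    for genotype in combinations_with_replacement(range(1, max_units + 1), ploidy):
        # Each read is assumed to come from any allele with equal probability.
        ll = sum(
            math.log(sum(read_prob(obs, allele) for allele in genotype) / ploidy)
            for obs in observations
        )
        if ll > best_ll:
            best, best_ll = genotype, ll
    return best


# Diploid locus (autosome, or X in an individual with two X chromosomes).
print(ml_genotype([17] * 5 + [21] * 5 + [18], max_units=40, ploidy=2))
# Single-copy locus (for example X in an individual with one X chromosome).
print(ml_genotype([30, 30, 31, 30], max_units=40, ploidy=1))
```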
  • STR short tandem repeat
  • the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63
  • the method is able to accurately quantitate an STR read length greater than 120 base pairs. In certain embodiments, the method further comprises determining a ploidy for the X chromosome from the extracted reads. In certain embodiments, the method further comprises delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus. In certain embodiments, the report is in electronic format.
  • a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: (a) a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; (b) a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (c) a software module configured to parse the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (d) a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater.
  • the first alignment to the genome is at an average sequence depth of 20 or greater.
  • the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome.
  • the second alignment uses a Smith-Waterman algorithm or variation thereof.
  • the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:6391
  • the system is able to accurately quantitate an STR read length greater than 120 base pairs.
  • the system further comprises a software module configured to determine a ploidy for the X chromosome from the extracted reads.
  • the system further comprises a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus.
  • the report is in electronic format.
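The document does not say how X-chromosome ploidy is determined from the reads. One common heuristic, shown here purely as an assumption, compares mean read depth on the X chromosome with mean autosomal depth: a ratio near 1 suggests two copies and a ratio near 0.5 suggests one. The function name and the depth values are hypothetical.

```python
def infer_x_ploidy(x_depth, autosome_depth):
    """Estimate the X chromosome copy number from mean read depths.

    Autosomes are assumed to be diploid, so depth per copy is
    autosome_depth / 2.
    """
    if autosome_depth <= 0:
        raise ValueError("autosome depth must be positive")
    copies = round(2 * x_depth / autosome_depth)
    return max(1, copies)  # report at least one copy


print(infer_x_ploidy(x_depth=15.2, autosome_depth=30.0))
print(infer_x_ploidy(x_depth=29.1, autosome_depth=30.0))
```

The result would select between the one-allele and two-allele likelihood for loci on the X chromosome.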
  • a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: (a) creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (b) parsing the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (c) determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater.
  • the first alignment to the genome is at an average sequence depth of 20 or greater.
  • the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome.
  • the second alignment uses a Smith-Waterman algorithm or variation thereof.
  • the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome
  • the method is able to accurately quantitate an STR read length greater than 120 base pairs. In certain embodiments, the method further comprises determining a ploidy for the X chromosome from the extracted reads. In certain embodiments, the method further comprises delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus. In certain embodiments, the report is in electronic format.
  • a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: (a) a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (b) a software module configured to parse the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (c) a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the
  • the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater.
  • the first alignment to the genome is at an average sequence depth of 20 or greater.
  • the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome.
  • the second alignment uses a Smith-Waterman algorithm or variation thereof.
  • the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:
  • the system is able to accurately quantitate an STR read length greater than 120 base pairs.
  • the system further comprises a software module configured to determine a ploidy for the X chromosome from the extracted reads.
  • the system further comprises a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus.
  • the report is in electronic format.
  • FIG. 1 shows a non-limiting example of a workflow for determining STR length, in accordance with various embodiments.
  • Figs. 2A-2B show a comparison of two sequence alignment methods exploiting the periodicity of the STR sequences, in accordance with various embodiments.
  • Figs. 3A-3E show an integrated probabilistic model to call STRs with four types of evidence, in accordance with various embodiments.
  • FIG. 4 shows a non-limiting example of a digital processing device, in accordance with various embodiments.
  • FIG. 5 shows a non-limiting example of a web/mobile application provision system, in accordance with various embodiments.
  • FIG. 6 shows a non-limiting example of a cloud-based web/mobile application provision system, in accordance with various embodiments.
  • Figs. 7A-7D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus tested against several variant callers, including (A) Manta (B) Isaac (C) GATK, (D) lobSTR, in accordance with various embodiments.
  • Figs. 8A-8D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus, in accordance with various embodiments.
  • Figs. 9A and 9B show examples of posterior probability density function based on the integrated model to call STRs, in accordance with various embodiments.
  • Figs. 10A-10D illustrate the individual contribution of four types of evidence to a final STR call, namely spanning reads, partial reads, repeat reads and paired-end distance, in accordance with various embodiments.
  • FIG. 11 shows an example summary of testing and validation on 12,632 whole genome sequences, in accordance with various embodiments.
  • Figs. 12A-12C show an example of validation of calls using Sanger and Oxford Nanopore sequencing, in accordance with various embodiments.
  • Figs. 13A-13C show individuals with risk alleles at Huntington disease (HD) locus in whole genome samples, in accordance with various embodiments.
  • HD Huntington disease
  • Figs. 14A-14D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus using several known variant callers, including (A) Manta (B) Isaac (C) GATK, (D) lobSTR, in accordance with various embodiments.
  • Fig. 15 is a flow chart illustrating a method for determining that a subject is at risk for having a disease or disorder, in accordance with various embodiments.
  • Fig. 16 is a schematic diagram illustrating a system for determining that a subject is at risk for having a disease or disorder, in accordance with various embodiments.
  • the terms “comprise”, “comprises”, “comprising”, “contain”, “contains”, “containing”, “have”, “having”, “include”, “includes”, and “including” and their variants are not intended to be limiting, are inclusive or open-ended and do not exclude additional, unrecited additives, components, integers, elements or method steps.
  • a process, method, system, composition, kit, or apparatus that comprises a list of features is not necessarily limited only to those features but may include other features not expressly listed or inherent to such process, method, system, composition, kit, or apparatus.
  • Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein.
  • the techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al, Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000).
  • the nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well known and commonly used in the art.
  • DNA deoxyribonucleic acid
  • A adenine
  • T thymine
  • C cytosine
  • G guanine
  • RNA ribonucleic acid
  • U uracil
  • G guanine
  • nucleic acid sequencing data denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.
  • sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
  • a sequence alignment method can align a fragment sequence to a reference sequence or another fragment sequence.
  • the fragment sequence can be obtained from a fragment library, a paired-end library, a mate-pair library, a concatenated fragment library, or another type of library that may be reflected or represented by nucleic acid sequence information including for example, RNA, DNA, and protein based sequence information.
  • the length of the fragment sequence can be substantially less than the length of the reference sequence.
  • the fragment sequence and the reference sequence can each include a sequence of symbols.
  • the alignment of the fragment sequence and the reference sequence can include a limited number of mismatches between the symbols of the fragment sequence and the symbols of the reference sequence.
  • the fragment sequence can be aligned to a portion of the reference sequence to minimize the number of mismatches between the fragment sequence and the reference sequence.
  • the symbols of the fragment sequence and the reference sequence can represent the composition of biomolecules.
  • the symbols can correspond to identity of nucleotides in a nucleic acid, such as RNA or DNA, or the identity of amino acids in a protein.
  • the symbols can have a direct correlation to these subcomponents of the biomolecules.
  • each symbol can represent a single base of a polynucleotide.
  • each symbol can represent two or more adjacent subcomponents of the biomolecules, such as two adjacent bases of a polynucleotide.
  • the symbols can represent overlapping sets of adjacent subcomponents or distinct sets of adjacent subcomponents.
  • each symbol represents two adjacent bases of a polynucleotide
  • two adjacent symbols representing overlapping sets can correspond to three bases of polynucleotide sequence
  • two adjacent symbols representing distinct sets can represent a sequence of four bases.
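The two-base symbol example above can be checked with a few lines of code. The helper names are hypothetical; the sketch only shows that two adjacent overlapping symbols cover three bases, while two adjacent distinct (non-overlapping) symbols cover four.

```python
def overlapping_symbols(seq, k=2):
    """Symbols made of k adjacent bases, each shifted by one base."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distinct_symbols(seq, k=2):
    """Symbols made of k adjacent bases, with no base shared between symbols."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


print(overlapping_symbols("ACG"))   # two symbols drawn from three bases
print(distinct_symbols("ACGT"))     # two symbols drawn from four bases
```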
  • the symbols can correspond directly to the subcomponents, such as nucleotides, or they can correspond to a color call or other indirect measure of the subcomponents.
  • the symbols can correspond to an incorporation or non-incorporation for a particular nucleotide flow.
  • Microsatellites or short tandem repeats (STRs) are stretches of simple nucleotide repetitions in the genome, with typical repeat units of 1 to 6 bp in length. Short tandem repeats are often polymorphic due to strand slippage during DNA replication, and are a common source of rare genetic diseases.
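The definition above (repeat units of 1 to 6 bp repeated in tandem) can be sketched with a regular expression that uses a backreference. The `find_strs` helper and the minimum of three copies are assumptions for illustration; only perfect repeats are found, and this is not how the described method locates STR loci.

```python
import re


def find_strs(seq, max_unit=6, min_copies=3):
    """Find perfect tandem repeats with units of 1..max_unit bases.

    Returns a list of (start, unit, copies) tuples. The lazy quantifier
    makes the shortest repeating unit win (for example "A" rather than "AA").
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits


# Hypothetical sequence: five CAG units between two short flanks.
print(find_strs("TTGA" + "CAG" * 5 + "TCCA"))
```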
  • the mutation rates of STRs are typically on the order of ~10^-4 mutations per generation per site, as compared to point mutation rates that are on the order of ~10^-8 mutations per generation per site for single nucleotide variants (SNVs). Because of the higher mutation rate, STRs offer a different level of resolution to study kinship and trait variations among individuals.
  • STRs can be currently used in forensics to identify suspects from DNA traces left at a crime scene.
  • the amplification targets the 13 CODIS (Combined DNA Index System) STR loci and the sizes of the amplicons are analyzed by electrophoresis. The repeat number at each locus is inferred by the size of the amplicon and a DNA profile is generated.
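Inferring a repeat number from amplicon size is a short calculation: subtract the non-repeat flanking sequence included in the amplicon and divide by the repeat unit length. The function name and all numbers below are hypothetical and do not describe any particular CODIS locus or kit.

```python
def repeats_from_amplicon(amplicon_bp, flank_bp, unit_bp):
    """Return (full repeat units, leftover bases) for a sized amplicon.

    amplicon_bp: amplicon size measured by electrophoresis.
    flank_bp:    total non-repeat sequence in the amplicon (primers + flanks).
    unit_bp:     length of one repeat unit.
    """
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0 or unit_bp <= 0:
        raise ValueError("inconsistent amplicon, flank, or unit size")
    return divmod(repeat_bp, unit_bp)


# Hypothetical tetranucleotide locus with 100 bp of flanking sequence.
print(repeats_from_amplicon(160, flank_bp=100, unit_bp=4))
print(repeats_from_amplicon(162, flank_bp=100, unit_bp=4))
```

A nonzero leftover indicates a partial repeat unit or a sizing error.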
  • STRs also have a role in inferring genealogy. For example, STR loci on the Y-chromosomes (Y-STRs) are used to define haplotypes that predated the use of Y-SNPs.
  • the STR data, coupled with public genealogy databases like Y-search, can be used for "surname inference."
  • STRs have been shown to be involved in several human genetic diseases.
  • Several neural-degenerative disorders, known as the "polyglutamine" (PolyQ) diseases, are caused by variable stretches of the repeated trinucleotide CAG within protein-coding exons.
  • PolyQ diseases include Huntington's disease (HD) and several forms of spinocerebellar ataxia (SCA).
  • Huntington's disease is caused by an expansion of the CAG repeats in the first exon of the Huntingtin gene (HTT).
  • HTT Huntingtin gene
  • Individuals carrying an expanded allele have motor, cognitive and psychological symptoms that typically appear at the age of 40 years old or older, depending on the number of repeats.
  • STRs also occur in non-coding regions and can regulate gene expression and histone modifications, affecting the expression of nearby genes in cis to the STR sites. Examples of these repeat disorders include Myotonic dystrophy (DM1) with CTG repeats, Friedreich Ataxia (FRDA) with GAA repeats, and Fragile X syndrome with CGG repeats. STRs that regulate gene expression (e-STRs) are mostly enriched in genes responsible for cognitive functions and autoimmune responses.
  • DM1 Myotonic dystrophy
  • FRDA Friedreich Ataxia
  • Due to their unstable nature and costly testing procedures, STR loci have so far been mostly under-utilized in population efforts to assess STR disease diagnoses, risks, and prevalence.
  • the methods, systems and software (also referred to as TREDPARSE herein) enable simultaneous identification of many STR loci, using whole genome sequencing data.
  • the whole genome approach offers advantages over conventional STR testing by limiting the potential bias introduced during the amplification step, reducing cost, and increasing efficiency by analyzing multiple loci simultaneously. With full genome sequencing becoming more accessible across a large number of individuals, it is anticipated that STR-related diseases will be of more interest to clinicians and researchers.
  • a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising, extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • STR short tandem repeat
  • a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus
  • a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, the
  • a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • STR repeat length determination allows for accurate quantitation of repetitive indels, such as short tandem repeats (STRs).
  • the methods described herein determine each allele length at pre-defined STR loci using short read whole-genome sequence data that are sampled at sufficient depth. Given a set of observed reads that are mapped around a particular STR locus, the method can estimate up to two haplotypes $h_1$ and $h_2$, where $1 \le h_1 \le h_2 \le h_{max}$, that represent the numbers of repeat units from an individual that maximize the likelihood in the model.
  • Fig. 1 is a flow chart illustrating a general workflow or system 100 for determining STR length, in accordance with various embodiments. Note that, as this is a general workflow, much of the discussion related to the steps of Fig. 1 will be provided in greater detail with regard to embodiments described in Figs. 15 and 16. Those detailed descriptions are still applicable to the embodiments of Fig. 1. Referring to Fig. 1, the method optionally determines the correct ploidy level.
  • reads previously aligned are realigned by, for example, using an alignment algorithm that is adapted to better determine STR length 102.
  • Reads to determine the nucleic acid sequence of a whole genome, whole-exome, or large portions of the genome are usually aligned using a method that penalizes gaps (e.g., indels) that are longer than a few nucleotides in length. This is because most sequences in the genome are not STR sequences and must be properly aligned at the outset to locate the STR boundaries.
  • a dynamic programming algorithm (e.g., Smith-Waterman)
  • the method allows for determination of repeat lengths by short read sequencing when the reads average between 300 and 50, 250 and 100, 150 and 100, 100 and 30, or 100 and 50 nucleotides in length. In various aspects, the method allows for determination of repeat lengths by short read sequencing greater than 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 nucleotides, including increments therein, or greater.
  • the method allows for determination of repeat lengths by short read sequencing equal to at least 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 nucleotides, including increments therein, or greater.
  • repeat length is equal to the repeat unit K multiplied by the number of repeats of the particular unit R.
  • STR loci are autosomal or X-linked.
  • for autosomal loci, both alleles of the diploid locus should be taken into account, depending on whether the disease is dominant or recessive.
  • ploidy should also be taken into account, with the assumption that an individual possessing only a single copy of a locus (a male) will be afflicted by the presence of a single disease allele even if the disease is recessive.
  • ploidy can be determined by sequence reads if the sex of the individual is not known from other sources such as a questionnaire, medical form, or interview.
  • Autosomal STRs are modeled as diploid loci, allowing two alleles to be inferred per locus.
  • Short read sequencing technologies include sequencing by synthesis, pyrosequencing, or ion semi-conductor sequencing.
  • Short read sequencing is in contrast to long read technologies such as Sanger sequencing, single molecule real-time sequencing, or nanopore sequencing. These technologies, due to their long read length, can sequence long STRs in a single read.
  • long read sequencing technologies produce reads greater than 200, 300, 400, 500, or more base pairs, for example.
  • Short read sequencing technologies produce short nucleic acid reads generally in the range of 20-400 base pairs in length, with read lengths of 35 to 150 base pairs being most typical. For many STR-based diseases, a single short read may not encompass the STR. Because short read technologies can sequence a fragment from both ends, they produce paired-end 5' to 3' reads. Paired-end sequencing produces a first read and a second read taken from opposite ends of the same fragment, in reverse-complement orientation to each other. These reads can overlap or be separated by 1 to several hundred base pairs of sequenced nucleic acid. Since these reads can in effect bracket an STR, they are useful for the methods described herein. In certain embodiments, TREDPARSE requires paired-end reads.
  • TREDPARSE utilizes reads of less than 200, 150, 140, 130, 120, 110, 100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, or 30 base pairs in length, including increments therein.
  • Short read data for use with the STR repeat length determination can be any nucleic acid sequenceable by short read technologies.
  • the nucleic acid sequence can be derived from DNA or cDNA (by way of reverse transcription from RNA).
  • the DNA is genomic DNA derived from a biological sample taken from an individual including, but not limited to, saliva, blood, plasma (including cell-free), serum, tissue biopsy, extracted from circulating peripheral blood mononuclear cells, stool, urine, or semen.
  • the nucleic acid can be prepared by any method known in the art for preparation of sequencing libraries. This can include preparation for paired-end sequencing.
  • Initial alignments can be performed in many ways, and this step can serve to align reads to their proper genomic location.
  • the initial alignment is sometimes not focused on accurately quantitating STR repeat length, but on properly aligning the many millions of short reads to their proper genomic loci.
  • a Burrows-Wheeler alignment and variations thereof is an example of a suitable initial alignment method, but other technologies capable of aligning greater than 1 million 35 base pair shorter reads can be employed.
  • the initial alignment method has a higher gap penalty than a method that is used in a realignment to quantitate STR length.
  • the initial alignment method differs from a method used in a subsequent realignment to quantitate STR length.
  • Reads that are mapped around the STR region are extracted and realigned. These reads can be extracted for example from a BAM or a SAM file.
  • a goal for the re-alignment is to obtain an accurate count of the occurrences of the repeat motifs.
  • Most read mapping methods, when aligning reads to a reference, have a high penalty for long indels. This often results in alignment misses or misalignments, leading to false predictions. The quality of sequence alignment can therefore be crucial in accurately counting the repeats in STR regions.
  • Reads are often aligned using variations of a Burrows-Wheeler alignment (BWA). In certain embodiments, reads that are within 1, 2, 3, 4, or 5 read lengths of an STR locus are extracted for realignment.
  • BWA Burrows-Wheeler alignment
  • the read length is greater than at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides, including increments therein. In various embodiments, the read length is less than at least 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or 210 nucleotides, including increments therein. In various embodiments, reads that are within 1, 2, 3, 4, or 5 kilobases of an STR locus are extracted for realignment.
  • realignment methods can be ones that are adapted to more accurately count the number of STR repeats.
  • dynamic programming with the Smith-Waterman (SW) algorithm to count the number of repeats is used for realignment.
  • a Single-Instruction-Multiple-Data (SIMD) Smith-Waterman library for fast alignment is used for realignment.
  • the realignment method utilizes a striped Smith-Waterman algorithm.
  • the realignment method utilizes a multiple templates method, whereby the method aligns a read to a series of templates embedded with varying number of repeats, using standard SW alignment with a fixed scoring scheme.
  • Figs. 2A and 2B disclose the advantages of a "multiple templates method" compared to a "periodic" Smith-Waterman alignment.
  • Fig. 2A discloses SEQ ID NOS 18, 18 and 19, respectively, in order of appearance.
  • Fig. 2B also discloses SEQ ID NO: 19.
  • In Figs. 2A and 2B, a comparison of two sequence alignment methods exploiting the periodicity of the STR sequences is illustrated. Determination of a hypothetical sequence AAGTCCTTCCAGCAGCAGCAACAGCCG (SEQ ID NO: 1) is modeled.
  • A) "Periodic Smith-Waterman" method modifies the recurrence table when performing the dynamic programming step so that repeat units are not penalized during matching;
  • B) "Multiple templates" method aligns the read to a series of templates (SEQ ID NOS: 2 to 6) embedded with varying numbers of repeats, using standard SW alignment with a fixed scoring scheme. This yields a series of alignments with different scores, which are then compared to determine the repeat size that corresponds to the highest score.
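  • The "multiple templates" idea can be illustrated with a minimal sketch. It assumes a plain Smith-Waterman score with a linear gap penalty; the scoring values, the flanking sequences, and the function names are hypothetical choices for illustration and are not taken from TREDPARSE.

```python
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_repeat_count(read, prefix, motif, suffix, max_repeats=10):
    """Score the read against templates with 1..max_repeats repeat units."""
    scores = {
        n: smith_waterman_score(read, prefix + motif * n + suffix)
        for n in range(1, max_repeats + 1)
    }
    # The template with the highest score gives the inferred repeat count.
    return max(scores, key=scores.get), scores


# Hypothetical flanks around a CAG repeat; the read carries 5 repeat units.
PREFIX, MOTIF, SUFFIX = "AAGTCCTTC", "CAG", "CCGCCACCG"
read = PREFIX + MOTIF * 5 + SUFFIX
best_n, scores = best_repeat_count(read, PREFIX, MOTIF, SUFFIX)
print(best_n, scores[best_n])
```

  • Because every template is scored with the same fixed scheme, the scores are directly comparable, and only the template whose repeat count matches the read aligns without gaps.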
  • TREDPARSE can realign two types of reads extracted from the BAM file: 1) reads that are mapped within a read length from the repeat location; and 2) reads that are unmapped but whose mate is mapped within a distance of, for example, about 1 kb from the repeat location. Distances can also include, for example, about 2 kb, 3 kb, 4 kb, 5 kb, and so on.
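  • The two selection rules above can be sketched as a simple filter. This sketch assumes reads have already been loaded into plain records on the same chromosome as the locus; the `Read` record, its fields, and `select_reads` are hypothetical names, and no particular BAM library is implied.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    name: str
    pos: Optional[int]       # leftmost mapped position, None if unmapped
    mate_pos: Optional[int]  # leftmost mapped position of the mate, None if unmapped


def select_reads(reads, locus_start, locus_end, read_len, mate_dist=1000):
    """Keep reads mapped within a read length of the locus, plus unmapped
    reads whose mate maps within mate_dist of the locus."""
    def near(pos, dist):
        return pos is not None and locus_start - dist <= pos <= locus_end + dist

    picked = []
    for r in reads:
        if near(r.pos, read_len):
            picked.append(r)          # rule 1: mapped near the repeat
        elif r.pos is None and near(r.mate_pos, mate_dist):
            picked.append(r)          # rule 2: unmapped, mate mapped nearby
    return picked


reads = [
    Read("a", 900, 1200),    # mapped within a read length of the locus
    Read("b", None, 1800),   # unmapped, mate within 1 kb
    Read("c", 5000, 1000),   # mapped far away
    Read("d", None, 9000),   # unmapped, mate far away
]
selected = [r.name for r in select_reads(reads, 1000, 1060, read_len=150)]
print(selected)
```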
  • the number of repeats is then determined for each read in the STR region.
  • the number of base pairs required to call the existence of a flank is 9 bp, and this threshold plays an important role in the classification of the various types of reads.
  • each read is classified as a prefix read (read with flanking sequence left of the repeats) or a suffix read (read with flanking sequence right of the repeats) depending on the positions where the alignments start or end on the read.
  • reads with both a prefix and a suffix are classified as spanning reads; reads with either a prefix or a suffix, but not both, are classified as partial reads.
  • Reads that only consist of repeats are repeat-only reads. These reads are sorted into a set of observations that are integrated in a probabilistic model for STR size inference (Figs. 2A and 2B).
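  • The classification rules above can be sketched as follows. The sketch assumes the realignment step has already reported how many flanking bases each read shows on either side of the repeat; `classify_read` and its arguments are hypothetical names, and the 9 bp threshold is the value stated above.

```python
MIN_FLANK = 9  # bp required to call the existence of a flank


def classify_read(prefix_bp, suffix_bp, min_flank=MIN_FLANK):
    """Classify a realigned read by the flanking sequence it shows."""
    has_prefix = prefix_bp >= min_flank   # flank left of the repeats
    has_suffix = suffix_bp >= min_flank   # flank right of the repeats
    if has_prefix and has_suffix:
        return "spanning"
    if has_prefix or has_suffix:
        return "partial"
    return "repeat-only"


print(classify_read(20, 15), classify_read(20, 3), classify_read(0, 0))
```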
  • Figs. 3A-3E show an integrated probabilistic model to call STRs with four types of evidence.
  • A Model based on spanning reads;
  • B Model based on partial reads;
  • C Model based on repeat-only reads;
  • D Model based on paired-end reads;
  • E Predictive power for each of the four evidence types on the range of STR repeat lengths.
  • Fig. 3A shows spanning reads S (arrows).
  • the spanning reads are the reads that show both left and right flanking sequences.
  • Flanking sequences are non-STR genomic sequences that are derived from the read.
  • a flanking sequence can be any number of nucleotides, but there is an optimum amount which allows for quantitating the longest STR repeat. In various embodiments, a flanking sequence is between 5-20, 6-18, 7-16, 8-14, 9-12, or equal to 8, 9, 10, 11, or 12 base pairs. Since a spanning read encompasses the entire STR locus, inference on the number of repeat units by spanning reads is straightforward, with the counted size matching or close to the true size.
  • the spanning reads would show exactly the size of the underlying allele if there is no noise due to stuttering. Stuttering noise can impact the size determination from this read. Stuttering occurs due to polymerase or template slipping in highly repetitive regions and can result in deletions or insertions of repeats that are observed but not actually present.
  • a stuttering model which considers the periodicity of the repeat as well as the GC content, can allow a certain proportion of the spanning reads to show a different size than the true allele size.
  • the stuttering model is applied when analyzing any one or all of spanning, partial or repeat only reads.
  • the stuttering model can return a distribution or confidence interval for the actual repeat length.
  • Fig. 3B shows partial reads T (arrows).
  • the partial reads do not align all the way across the repeat region and comprise only one flanking sequence.
  • the partial reads have a probability mass function of discrete uniform distribution between a single repeat unit up to the true repeat length. Therefore, unlike the full spanning reads which show exactly or close to (in case of stuttering error) the number of repeat units of the underlying allele, the partial reads show a lower bound for the number of repeat units of the underlying allele.
  • the inference task is to infer the maximum number of repeats, given observed allele sizes from partial reads. The inference is analogous to the "German tank problem" but with replacement, under the condition that the allele cannot exceed the read length minus the length of the flanking sequence.
  • Fig. 3C shows repeat-only reads U (shaded arrows). Reads that consist almost entirely of repeat units are repeat-only reads. Each repeat-only read often has a relatively unique mate that allows it to be mapped. Repeat-only reads are possible only when the repeat length is the same as or longer than the read length. Assuming each read is equally likely to start anywhere in the genome, the expected number of repeat-only reads that fall in a certain region follows a Poisson distribution. These repeat-only reads are typically mapped in the STR region because they have a read pair that mapped to a flanking site. The repeat-only reads can be critical since they allow the inference of repeats longer than the read length.
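  • A minimal sketch of the Poisson reasoning for repeat-only reads follows. It assumes h repeat units of length K, read length L, and a per-haplotype depth D, so that about D/L reads start at each base; the resulting mean D(hK - L)/L is an assumption derived from uniform read starts for this illustration, not a formula quoted from the source.

```python
import math


def expected_repeat_only_reads(h, K, L, D):
    """Expected number of repeat-only reads for one allele (assumed model).

    A repeat-only read must start inside a window of size h*K - L, and
    reads start at a rate of D / L per base under uniform coverage.
    """
    window = max(h * K - L, 0)
    return D * window / L


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson with mean lam."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam_long = expected_repeat_only_reads(h=100, K=3, L=150, D=15)
lam_short = expected_repeat_only_reads(h=40, K=3, L=150, D=15)
print(lam_long, lam_short, poisson_pmf(0, lam_short))
```

  • Under this model an allele shorter than the read length yields no repeat-only reads, so observing even one such read points to a repeat tract longer than a read.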
  • Fig. 3D shows paired-end reads V (arrows). Additional information can be gathered from the group of paired end reads (sometimes called "mates") that span the STR region. The observed distance between the two mate reads typically can follow a distribution p(V) for a specific sequencing library. This distribution can be inferred by compiling the distances between all (or a representative subset of) the paired-end reads across the genome. For alleles without indels in the STR region, the distribution of the observed distances can be distributed identically to p(V).
  • Fig. 3D discloses SEQ ID NOS 12-13, respectively, in order of appearance.
  • a probabilistic model can be implemented predicting the size of STRs, based on evidence from a selected combination of spanning reads, partial reads, repeat-only reads and spanning pairs.
  • the spanning reads align to both flanks of the target repeat; the partial reads align to only one flank of the repeat; the repeat-only reads align entirely within the repeat tract and thus consist entirely of repeat units; the spanning pairs are read pairs that span the STR region, i.e. with one end on each side of the repeat.
  • D: haplotype depth, i.e., the average sequencing depth divided by ploidy. For a diploid locus, it is equal to half of the sequencing depth.
  • the repeat length is equal to the number of repeat units multiplied by the repeat unit length.
  • the human reference genome hg38
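  • observations are a set of $l$ spanning reads with repeat units $S_1, \ldots, S_l$; $m$ partial reads with repeat units $T_1, \ldots, T_m$; $U$ repeat-only reads; and $n$ paired-end reads with observed distances $V_1, \ldots, V_n$.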
  • the spanning reads are the reads that show both left and right flanking sequences of at least F bp.
  • the spanning reads are quite straightforward, with the counted size matching or close to the true size.
  • the spanning reads would show the size of the underlying allele if there is no noise due to stuttering. The sharp peak becomes 'fuzzier' after incorporating the stuttering noise.
  • using the stuttering model trained by lobSTR, which considers the periodicity of the repeat as well as the GC content, a certain proportion of the spanning reads is allowed to show a different size than the true allele size.
  • the following example model can be utilized.
  • With probability $\pi(K)$, the read is a product of stutter noise, which is dependent on the repeat unit length K and also the GC content of the locus.
  • If a read is a product of stutter, then with probability $\mathrm{Poisson}(s; \lambda_K)$ the noisy read deviates by s units from the original allele, where $\mathrm{Poisson}(s; \lambda_K)$ is a Poisson distribution with mean $\lambda_K$.
  • Deviation can be either positive or negative with equal probability $\pi(K)/2$.
  • Parameters $\pi(K)$ and $\lambda_K$ were previously trained by lobSTR for a range of values K.
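  • The stutter model described above can be sketched as a probability of observing a given repeat count from a spanning read. Here `pi` stands for the stutter probability and `lam` for the Poisson mean of the deviation; the numeric values used below are hypothetical placeholders rather than the lobSTR-trained parameters, and folding a zero-sized stutter step into the "no deviation" case is an assumption of this sketch.

```python
import math


def poisson_pmf(k, lam):
    """Probability of k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def stutter_prob(observed, true, pi, lam):
    """P(observed repeat count | true allele) for a spanning read.

    With probability 1 - pi the read shows the true size. With probability
    pi it is stutter: the step size is Poisson(lam) and the direction is
    positive or negative with equal probability.
    """
    step = abs(observed - true)
    if step == 0:
        return (1 - pi) + pi * poisson_pmf(0, lam)
    return pi * poisson_pmf(step, lam) / 2


# Hypothetical parameters, for illustration only.
p_exact = stutter_prob(20, 20, pi=0.1, lam=0.5)
p_minus = stutter_prob(19, 20, pi=0.1, lam=0.5)
p_plus = stutter_prob(21, 20, pi=0.1, lam=0.5)
print(p_exact, p_minus, p_plus)
```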
  • Partial reads do not align all the way across the repeat region and show only one flanking sequence.
  • the partial reads have a probability mass function of discrete uniform distribution. Unlike the full spanning reads, which show exactly the repeat units of the underlying allele, partial reads only show a lower bound for the number of repeat units of the underlying allele. This inference is analogous to the "German tank problem" but with replacement, under the condition that the allele cannot exceed L-F.
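  • The "German tank" style inference from partial reads can be sketched as follows, assuming a single allele and that each partial read shows a repeat count drawn uniformly (with replacement) from 1 up to the true number of repeat units. The function names and the example counts are hypothetical.

```python
import math


def partial_loglik(observed, h):
    """Log-likelihood of partial-read repeat counts given h true repeat units.

    Each count is uniform on 1..h, so any count above h is impossible.
    """
    if max(observed) > h:
        return -math.inf
    return -len(observed) * math.log(h)


def mle_from_partial(observed, h_max):
    """Maximum likelihood estimate of h over the grid 1..h_max."""
    return max(range(1, h_max + 1), key=lambda h: partial_loglik(observed, h))


counts = [3, 7, 12, 5]
estimate = mle_from_partial(counts, h_max=300)
print(estimate)
```

  • The estimate equals the largest observed count, which reflects that partial reads supply only a lower bound on the allele size.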
  • the longer allele typically has a larger contribution to the number of observations.
  • reads that consist only of repeat units are called "repeat-only" reads.
  • the repeat length hK is the same or longer than a read length L
  • repeat-only reads are possible if they start in a region with size hK-L. Repeat-only reads allow the inference of repeats longer than the read length. Assuming each read can start anywhere in the genome, the expected number of repeat-only reads follows a Poisson distribution.
  • Paired-end Reads
  • Additional information can be gathered from the group of paired-end reads (also called mate pairs) that span the STR region.
  • the observed distance between two mate reads typically follows a distribution p(V) for a specific sequencing library. This distribution can be inferred by compiling the distances between all (or a representative subset of) the paired-end reads across the genome. For alleles without indels in the STR region, the distribution of the observed distances should be distributed identically to p(V). If there is a homozygous insertion or deletion in the STR region, the distribution of p(V) would shift to p(V+RK-hK).
  • the paired-end distance is also useful to extend the prediction of allele size beyond the length of a typical sequencing read, since the paired-end distance is often longer than the read length.
  • the paired-end mode is only enabled when there are at least 5 spanning pairs across the STR locus. With too few observations, the variance of our maximum likelihood estimates based on spanning pairs alone can be substantial.
  • Each of the four types of read evidence, spanning reads, partial reads, repeat-only reads, and paired-end reads has its own range of predictive power across the range of likely STR repeat length, as either limited by read length or paired-end distance of the sequencing library. This is shown graphically in Fig. 3E.
  • data from spanning reads, partial reads, repeat-only reads, and spanning pairs can be combined as a combination of two, three or four read groups. For example, with four read groups, the data can be combined under the assumption that each type of evidence is independent given the true repeat numbers:
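  • As a sketch of what conditional independence implies (the symbols S, T, U, and V denote the spanning, partial, repeat-only, and paired-end observations, and $h_1$, $h_2$ the two allele sizes; the exact form used by the authors is not reproduced here), the joint likelihood factorizes into one term per evidence type, and the estimate maximizes that product:

```latex
P(S, T, U, V \mid h_1, h_2)
  = P(S \mid h_1, h_2)\, P(T \mid h_1, h_2)\, P(U \mid h_1, h_2)\, P(V \mid h_1, h_2),
\qquad
(\hat{h}_1, \hat{h}_2) = \arg\max_{1 \le h_1 \le h_2 \le h_{max}} P(S, T, U, V \mid h_1, h_2)
```

  • When only two or three read groups are used, the terms for the omitted groups are simply dropped from the product.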
  • the maximum likelihood estimates can be obtained from the model through a grid search.
  • $h_{max}$ = 300 (e.g., the limit of repeat length quantitation), so the full grid search would cover 300 values for haploid loci and 300 x 300 combinations for diploid loci.
  • a different $h_{max}$ can be used, for example between 100 and 400, 150 and 350, 200 and 300, 250 and 300, 200 and 250, or greater than 100, 150, 200, 250, 300, 350, or 400, including increments therein.
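  • A grid search of this kind can be sketched as below. The `grid_search` helper and the toy likelihood are hypothetical; the toy likelihood assumes each spanning read comes from either allele with equal probability and uses a crude flat error term in place of the full model.

```python
import math
from itertools import combinations_with_replacement


def grid_search(loglik, h_max=300, ploidy=2):
    """Return the allele tuple that maximizes loglik over the grid."""
    if ploidy == 1:
        candidates = ((h,) for h in range(1, h_max + 1))
    else:
        # Ordered pairs (h1, h2) with 1 <= h1 <= h2 <= h_max.
        candidates = combinations_with_replacement(range(1, h_max + 1), 2)
    return max(candidates, key=loglik)


def toy_loglik(alleles, reads=(20, 20, 45, 44, 45, 20), err=0.05):
    """Toy spanning-read likelihood: a read shows one allele's size with
    probability 1 - err, otherwise a flat error term (illustrative only)."""
    total = 0.0
    for s in reads:
        p = sum((1 - err) if s == h else err / 300 for h in alleles) / len(alleles)
        total += math.log(p)
    return total


best_diploid = grid_search(toy_loglik, h_max=60)
best_haploid = grid_search(lambda a: -abs(a[0] - 17), h_max=60, ploidy=1)
print(best_diploid, best_haploid)
```

  • Restricting the diploid grid to $h_1 \le h_2$ visits 300 x 301 / 2 = 45,150 pairs for $h_{max}$ = 300 rather than the full 300 x 300, since the two alleles are interchangeable.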
  • the CI of a distribution with a parameter $\theta$ is defined as:
  • the 95% CI is not unique on a posterior distribution.
  • One can use the 95% CI for which there is equal $(1-\alpha)/2$ = 2.5% mass on each tail.
  • the 95% confidence interval (95% CI) for the estimated repeat number can be computed.
  • the confidence interval calculated can vary depending upon the requirements of the method. In various embodiments, the confidence interval can be within 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%.
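  • An equal-tailed interval of this kind can be sketched from a discrete distribution over repeat counts. The `equal_tailed_interval` function and the example probabilities are hypothetical and only illustrate placing equal mass in each tail.

```python
def equal_tailed_interval(pmf, level=0.95):
    """Equal-tailed interval from a dict {repeat count: probability}.

    Returns the smallest count whose cumulative mass reaches the lower
    tail and the smallest count whose cumulative mass reaches 1 - tail.
    """
    tail = (1 - level) / 2
    cumulative = 0.0
    lower = upper = None
    for h in sorted(pmf):
        cumulative += pmf[h]
        if lower is None and cumulative >= tail:
            lower = h
        if upper is None and cumulative >= 1 - tail:
            upper = h
    return lower, upper


posterior = {18: 0.01, 19: 0.04, 20: 0.90, 21: 0.04, 22: 0.01}
interval = equal_tailed_interval(posterior)
print(interval)
```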
  • Sequencing depth is a key driver of the confidence interval, and in various aspects the method for STR repeat length determination is performed using sequencing data with a sequencing depth greater than 10x, 20x, 30x, 40x, 50x, 60x, 70x, 80x, 90x, or more, including increments therein.
  • the STR calculations using the method herein are highly accurate compared to other methods.
  • the method is more accurate than the lobSTR caller by at least 2-fold, 3-fold, 4-fold, or 5-fold based upon a root-mean-square deviation (RMSD) from the true value over a range of repeat lengths.
  • RMSD root-mean squared deviation
  • the method has an absolute accuracy about equal to or less than 50 RMSD, 40 RMSD, 35 RMSD, 30 RMSD, or 25 RMSD when based upon simulated data for a trinucleotide repeat. See Example 1, for example.
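  • For reference, RMSD between predicted and true repeat lengths can be computed as in this short sketch; the function name and the sample values are illustrative and do not come from the reported benchmarks.

```python
import math


def rmsd(predicted, truth):
    """Root-mean-square deviation between predicted and true values."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(squared / len(truth))


print(rmsd([42, 38], [40, 40]))
```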
  • a determination of pathogenicity can be computed, given dominance and recessive inheritance models under the assumption of complete penetrance and a point cutoff of size c.
  • For example, Huntington's disease would have a cutoff at a repeat length of 120:
  • a patient can be defined as "at risk" if PP > 50%, 60%, 70%, 80%, or 90%, including increments therein, or more.
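  • One way to turn a distribution over allele sizes into a pathogenicity probability PP is sketched below. It assumes a distribution over ordered allele pairs ($h_1 \le h_2$), complete penetrance, and that an allele at or above the cutoff c counts as pathogenic; whether the comparison is inclusive, the function name, and the example probabilities are all assumptions of this sketch.

```python
def pathogenic_probability(posterior, cutoff, inheritance="dominant"):
    """Sum the probability of allele pairs that meet the risk criterion.

    posterior: dict mapping (h1, h2) with h1 <= h2 to a probability.
    Dominant: the longer allele reaches the cutoff.
    Recessive: the shorter allele (hence both) reaches the cutoff.
    """
    pp = 0.0
    for (h1, h2), prob in posterior.items():
        deciding_allele = h2 if inheritance == "dominant" else h1
        if deciding_allele >= cutoff:
            pp += prob
    return pp


# Hypothetical distribution for a CAG locus; cutoff of 40 units = 120 bp.
posterior = {(17, 38): 0.2, (17, 41): 0.7, (17, 42): 0.1}
pp_dominant = pathogenic_probability(posterior, cutoff=40)
pp_recessive = pathogenic_probability(posterior, cutoff=40, inheritance="recessive")
print(pp_dominant, pp_recessive)
```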
  • Fig. 15 illustrates a workflow or method 1100 for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a STR sequence at a given STR locus.
  • nucleic acid sequence reads are extracted, the reads mapped, for example, within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location.
  • the nucleic acid reads can additionally be subject to an initial alignment.
  • the extracted nucleic acid sequence reads are aligned.
  • alignment can be described as a realignment.
  • reads are often aligned (or realigned) using variations of a Burrows-Wheeler alignment (BWA).
  • reads that are within 1, 2, 3, 4, or 5 read lengths of an STR locus are extracted for realignment. This number can vary depending upon the read length of the initial sequencing data.
  • the read length is greater than at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides, including increments therein.
  • the read length is less than at least 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or 210 nucleotides, including increments therein.
  • reads that are within 1, 2, 3, 4, or 5 kilobases of an STR locus are extracted for realignment.
  • realignment methods are ones that are adapted to more accurately count the number of STR repeats.
  • dynamic programming with the Smith-Waterman (SW) algorithm to count the number of repeats is used for realignment.
  • a Single-Instruction-Multiple-Data (SIMD) Smith-Waterman library for fast alignment is used for realignment.
  • the realignment method utilizes a striped Smith-Waterman algorithm. Examples of the use of such realignment methods are discussed herein and applicable at least at this step.
  • the reads from the alignment can be parsed into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group is selected from the list consisting of spanning reads, partial reads, and repeat-only reads. Discussion of these read types is provided in detail herein and applicable at least at this step.
  • repeat length of the STR sequence is determined by applying a probabilistic model to the at least two informative read groups. Discussion of various probabilistic models is provided herein and applicable at least at this step.
  • a risk probability is determined, where the risk probability can constitute the risk of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
  • the calculation or determination of this risk probability is discussed herein and applicable at least at this step. For example, determination of PP can serve as the risk probability or can inform an associated risk probability.
  • the preceding embodiments can be provided, whole or in part, as a system of components integrated to perform the methods described.
  • the workflow of FIG. 15 can be provided as a system of components or stations, illustrated for example in Fig. 16, for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a STR sequence at a given STR locus.
  • sequencing data for use in the workflow can be generated from a sequencing unit 1202.
  • the extracting and aligning steps (1110-1120) can be performed in an alignment engine 1204.
  • Alignment engine 1204 is illustrated as a separate component and thus can be part of a larger alignment unit.
  • alignment engine can be provided as a component of sequencing unit 1202 or even part of a diagnosing unit 1206, discussed below.
  • diagnosing unit 1206 can include a repeat length determination engine 1208 and risk assessment engine 1210.
  • the parsing step 1130 and determination of repeat length step 1140 can be performed by repeat length determination engine 1208. Though steps 1130 and 1140 are illustrated as part of a single repeat length determination engine 1208, steps 1130 and 1140 can be performed by separate engines such that a parsing engine can be provided in diagnosing unit 1206.
  • the determining the risk probability step 1150 can be performed by risk assessment engine 1210.
  • an input/output device 1212 is provided.
  • Device 1212 can be configured and arranged to receive the risk probability score on one hand, and also be configured and arranged to deliver inputs that may assist in allowing the system to perform its function.
  • the methods, software and systems described herein are for use in determining a clinically relevant repeat length for STRs that cause human disease.
  • Most STR diseases possess a repeat length at which the disease becomes fully penetrant.
  • Huntington's disease is fully penetrant after 40 repeats of the CAG trinucleotide (SEQ ID NO: 9) (e.g., repeat length of 120).
  • the methods herein can provide diagnosis, determination of a risk-group, or an individual's status as a carrier if the STR related disease is recessive.
  • Table 1 lists common STR diseases, repeat motif and method of inheritance.
  • the STR based disease determined is any one or more of Myotonic dystrophy 1, Myotonic dystrophy 2, Dentatorubro-pallidoluysian atrophy, Fragile X-associated tremor/ataxia, Fragile X syndrome, Mental retardation, FRAXE type, Friedreich ataxia, Huntington disease, Huntington disease-like 2, Unverricht-Lundborg Disease, Oculopharyngeal muscular dystrophy, Spinal and bulbar muscular atrophy, Spinocerebellar ataxia 1, Spinocerebellar ataxia 2, Spinocerebellar ataxia 3, Spinocerebellar ataxia 6, Spinocerebellar ataxia 7, Spinocerebellar ataxia 8, Spinocerebellar ataxia 10, Spinocerebellar ataxia 12, Spinocerebellar ataxia 17, Spinocerebellar ataxia 36, Epileptic encephalopathy, early infantile, 1, Blepharophimosis, epicanthus in
  • the methods described herein can apply to loci not listed in Table 1, and one would easily be able to add additional loci to the list, including loci that are not currently known or described herein.
  • a minimal set of information for a new locus requires the genomic coordinates for the repeats and the disease risk cutoff (e.g., number of repeats and repeat length) based on clinical studies, in order to determine the probability of the disease, assuming a full penetrance model.
  • chromosome X:67545318-67545383 is the locus that is associated with spinal and bulbar muscular atrophy when the number of repeats exceeds the high risk cutoff (i.e., >36 repeats).
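  • The minimal locus description above can be sketched as a small record. The `StrLocus` class and its field names are hypothetical; the coordinates and the >36 cutoff are those given above, while the CAG motif and the 1-based inclusive coordinate convention are assumptions of this sketch because Table 1 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrLocus:
    name: str
    chrom: str
    start: int            # assumed 1-based, inclusive
    end: int              # assumed 1-based, inclusive
    motif: str            # repeat unit (assumed)
    cutoff_repeats: int   # at risk when the repeat count exceeds this value

    @property
    def reference_repeats(self):
        """Number of repeat units spanned by the reference coordinates."""
        return (self.end - self.start + 1) // len(self.motif)

    def is_at_risk(self, repeats):
        """Apply the point cutoff under a full penetrance assumption."""
        return repeats > self.cutoff_repeats


sbma = StrLocus("SBMA", "chrX", 67545318, 67545383, "CAG", 36)
print(sbma.reference_repeats, sbma.is_at_risk(37), sbma.is_at_risk(36))
```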
  • the platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include a digital processing device, or use of the same.
  • the digital processing device includes one or more hardware central processing units (CPUs) or general purpose graphics processing units (GPUs) that carry out the device's functions.
  • the digital processing device further comprises an operating system configured to perform executable instructions.
  • the digital processing device is optionally connected to a computer network.
  • the digital processing device is optionally connected to the Internet such that it accesses the World Wide Web.
  • the digital processing device is optionally connected to a cloud computing infrastructure.
  • the digital processing device may be connected to an intranet and may be connected to a data storage device.
  • suitable digital processing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, Internet appliances, tablet computers, and mobile smartphones. Those of skill in the art will recognize that many smartphones are suitable for use in the system described herein.
  • Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of skill in the art.
  • the digital processing device includes an operating system configured to perform executable instructions.
  • the operating system is, for example, software, including programs and data, which manages the device's hardware and provides services for execution of applications.
  • suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD ® , Linux, Apple ® Mac OS X Server ® , Oracle ® Solaris ® , Windows Server, and Novell ® NetWare ® .
  • suitable personal computer operating systems include, by way of non-limiting examples, Microsoft ® Windows ® , Apple ® Mac OS X ® , UNIX ® , and UNIX-like operating systems such as GNU/Linux ® .
  • the operating system is provided by cloud computing.
  • suitable mobile smart phone operating systems include, by way of non-limiting examples, Nokia ® Symbian ® OS, Apple ® iOS ® , Research In Motion ® BlackBerry OS ® , Google ® Android ® , Microsoft ® Windows Phone ® OS, Microsoft ® Windows Mobile ® OS, Linux ® , and Palm ® WebOS ® .
  • the digital processing device includes a storage and/or memory device.
  • the storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis.
  • the device is volatile memory and requires power to maintain stored information.
  • the device is non-volatile memory and retains stored information when the digital processing device is not powered.
  • the non-volatile memory comprises flash memory.
  • the volatile memory comprises dynamic random-access memory (DRAM).
  • the non-volatile memory comprises ferroelectric random access memory (FRAM).
  • the non-volatile memory comprises phase-change random access memory (PRAM).
  • the device is a storage device including, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing based storage.
  • the storage and/or memory device is a combination of devices such as those disclosed herein.
  • the digital processing device may include a display to send visual information to a user.
  • the display is a liquid crystal display (LCD).
  • the display is a thin film transistor liquid crystal display (TFT-LCD).
  • the display is an organic light emitting diode (OLED) display.
  • an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display.
  • the display is a plasma display.
  • the display is a video projector.
  • the display is a head-mounted display in communication with the digital processing device, such as a VR headset.
  • suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like.
  • the display is a combination of devices such as those disclosed herein.
  • the digital processing device may include an input device to receive information from a user.
  • the input device is a keyboard.
  • the input device is a pointing device including, by way of non-limiting examples, a mouse, trackball, track pad, joystick, game controller, or stylus.
  • the input device is a touch screen or a multi-touch screen.
  • the input device is a microphone to capture voice or other sound input.
  • the input device is a video camera or other sensor to capture motion or visual input.
  • the input device is a Kinect, Leap Motion, or the like.
  • the input device is a combination of devices such as those disclosed herein.
  • an exemplary digital processing device 401 is programmed or otherwise configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parse the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
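The extraction criterion described above can be sketched as a simple filter. This is an illustration only: the `Read` tuple is a simplified stand-in for an alignment record, and `keep_read`, `read_len`, and `mate_window` are hypothetical names; it is not the implementation used by the disclosed device. The example coordinates are the chromosome 4 locus listed among the embodiments.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    """Simplified stand-in for one aligned read (not a real BAM record)."""
    name: str
    chrom: Optional[str]       # None when the read itself is unmapped
    pos: Optional[int]         # leftmost mapped position, None when unmapped
    mate_chrom: Optional[str]
    mate_pos: Optional[int]


def keep_read(read, chrom, start, end, read_len=150, mate_window=2000):
    """Return True if the read should be extracted for local realignment."""
    # Case 1: the read maps within one read length of the STR locus.
    near_locus = (
        read.chrom == chrom
        and read.pos is not None
        and start - read_len <= read.pos <= end + read_len
    )
    if near_locus:
        return True
    # Case 2: the read is not at the locus, but its mate maps within ~2 kb.
    return (
        read.mate_chrom == chrom
        and read.mate_pos is not None
        and start - mate_window <= read.mate_pos <= end + mate_window
    )


reads = [
    Read("r1", "4", 3074800, "4", 3075100),   # maps next to the locus
    Read("r2", None, None, "4", 3076000),     # unmapped, mate within 2 kb
    Read("r3", "1", 1000, "1", 1300),         # unrelated pair
]
kept = [r.name for r in reads if keep_read(r, "4", 3074877, 3074933)]
print(kept)  # ['r1', 'r2']
```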
  • the device 401 can regulate various aspects to accurately determine a repeat length of a short tandem repeat (STR) sequence of the present disclosure, such as, for example, extracting reads from an alignment or BAM file, realigning reads in a local alignment, parsing reads into different read categories, and running a probabilistic model to determine STR length.
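As a toy illustration of the final step, the sketch below picks the pair of allele sizes that maximizes the likelihood of repeat counts observed in spanning reads. Everything here is an assumption made for illustration: the ±1 stutter error model, the small likelihood floor (which leaves the model unnormalized), the equal chance of sampling either allele, and the names `read_likelihood` and `ml_diploid`. The disclosed probabilistic model combines several read groups and is not reproduced here.

```python
import math
from itertools import combinations_with_replacement


def read_likelihood(obs, allele, err=0.1):
    """P(observed repeat count | true allele) under a toy +/-1 stutter model."""
    if obs == allele:
        return 1.0 - err
    if abs(obs - allele) == 1:
        return err / 2.0
    return 1e-6  # small floor so unexpected counts do not zero the likelihood


def ml_diploid(observed, max_units=60):
    """Grid-search the allele pair (h1 <= h2) with the highest log-likelihood."""
    best_pair, best_ll = None, -math.inf
    for h1, h2 in combinations_with_replacement(range(1, max_units + 1), 2):
        # Each read is assumed equally likely to come from either allele.
        ll = sum(
            math.log(0.5 * read_likelihood(o, h1) + 0.5 * read_likelihood(o, h2))
            for o in observed
        )
        if ll > best_ll:
            best_pair, best_ll = (h1, h2), ll
    return best_pair


counts = [17, 17, 18, 17, 41, 41, 40, 41]  # made-up spanning-read counts
print(ml_diploid(counts))  # (17, 41)
```

A grid search is used only because it is easy to follow; it also makes clear why spanning reads alone cannot call alleles longer than a read, which is where the other read groups contribute.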
  • the digital processing device 401 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 405, which can be a single core or multi core processor, or a plurality of processors for parallel processing.
  • the digital processing device 401 also includes memory or memory location 410 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 415 (e.g., hard disk), communication interface 420 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 425, such as cache, other memory, data storage and/or electronic display adapters.
  • the memory 410, storage unit 415, interface 420 and peripheral devices 425 are in communication with the CPU 405 through a communication bus (solid lines), such as a motherboard.
  • the storage unit 415 can be a data storage unit (or data repository) for storing data.
  • the digital processing device 401 can be operatively coupled to a computer network ("network") 430 with the aid of the communication interface 420.
  • the network 430 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
  • the network 430 in some cases is a telecommunication and/or data network.
  • the network 430 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
  • the network 430, in some cases with the aid of the device 401, can implement a peer-to-peer network, which may enable devices coupled to the device 401 to behave as a client or a server.
  • the CPU 405 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
  • the instructions may be stored in a memory location, such as the memory 410.
  • the instructions can be directed to the CPU 405, which can subsequently program or otherwise configure the CPU 405 to implement methods of the present disclosure. Examples of operations performed by the CPU 405 can include fetch, decode, execute, and write back.
  • the CPU 405 can be part of a circuit, such as an integrated circuit. One or more other components of the device 401 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
  • the storage unit 415 can store files, such as drivers, libraries and saved programs.
  • the storage unit 415 can store user data, e.g., user preferences and user programs.
  • the digital processing device 401 in some cases can include one or more additional data storage units that are external, such as located on a remote server that is in communication through an intranet or the Internet.
  • the digital processing device 401 can communicate with one or more remote computer systems through the network 430.
  • the device 401 can communicate with a remote computer system of a user.
  • remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple ® iPad, Samsung ® Galaxy Tab), telephones, Smart phones (e.g., Apple ® iPhone, Android-enabled device, Blackberry ® ), or personal digital assistants.
  • Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the digital processing device 401, such as, for example, on the memory 410 or electronic storage unit 415.
  • the machine executable or machine readable code can be provided in the form of software.
  • the code can be executed by the processor 405.
  • the code can be retrieved from the storage unit 415 and stored on the memory 410 for ready access by the processor 405.
  • the electronic storage unit 415 can be precluded, and machine-executable instructions are stored on memory 410.
  • Non-transitory computer readable storage medium
  • the platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device.
  • a computer readable storage medium is a tangible component of a digital processing device.
  • a computer readable storage medium is optionally removable from a digital processing device.
  • a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like.
  • the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
  • the platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include at least one computer program, or use of the same.
  • a computer program includes a sequence of instructions, executable in the digital processing device's CPU, written to perform a specified task.
  • Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types.
  • a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
  • a computer program may include a web application.
  • a web application in various embodiments, utilizes one or more software frameworks and one or more database systems.
  • a web application is created upon a software framework such as Microsoft ® .NET or Ruby on Rails (RoR).
  • a web application utilizes one or more database systems including, by way of non-limiting examples, relational, non-relational, object oriented, associative, and XML database systems.
  • suitable relational database systems include, by way of non-limiting examples, Microsoft ® SQL Server, MySQL™, and Oracle ® .
  • a web application in various embodiments, is written in one or more versions of one or more languages.
  • a web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof.
  • a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or Extensible Markup Language (XML).
  • a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS).
  • a web application is written to some extent in a client-side scripting language such as Asynchronous Javascript and XML (AJAX), Flash ® Actionscript, Javascript, or Silverlight ® .
  • a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion ® , Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA ® , or Groovy.
  • a web application is written to some extent in a database query language such as Structured Query Language (SQL).
  • a web application integrates enterprise server products such as IBM ® Lotus Domino ® .
  • a web application includes a media player element.
  • a media player element utilizes one or more of many suitable multimedia technologies including, by way of non-limiting examples, Adobe ® Flash ® , HTML 5, Apple ® QuickTime ® , Microsoft ® Silverlight ® , Java™, and Unity ® .
  • an application provision system comprises one or more databases 500 accessed by a relational database management system (RDBMS) 510.
  • Suitable RDBMSs include Firebird, MySQL, PostgreSQL, SQLite, Oracle Database, Microsoft SQL Server, IBM DB2, IBM Informix, SAP Sybase, Teradata, and the like.
  • the application provision system further comprises one or more application servers 520 (such as Java servers, .NET servers, PHP servers, and the like) and one or more web servers 530 (such as Apache, IIS, GWS and the like).
  • the web server(s) optionally expose one or more web services via application programming interfaces (APIs) 540.
  • an application provision system alternatively has a distributed, cloud-based architecture 600 and comprises elastically load balanced, auto-scaling web server resources 610 and application server resources 620 as well as synchronously replicated databases 630.
  • the computer program may include a mobile application provided to a mobile digital processing device.
  • the mobile application is provided to a mobile digital processing device at the time it is manufactured.
  • the mobile application is provided to a mobile digital processing device via the computer network described herein.
  • a mobile application is created by techniques known to those of skill in the art using hardware, languages, and development environments known to the art. Those of skill in the art will recognize that mobile applications are written in several languages.
  • Suitable programming languages include, by way of non-limiting examples, C, C++, C#, Objective-C, Java™, Javascript, Pascal, Object Pascal, Python™, Ruby, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.
  • Suitable mobile application development environments are available from several sources. Commercially available development environments include, by way of non-limiting examples, AirplaySDK, alcheMo, Appcelerator ® , Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform. Other development environments are available without cost including, by way of non-limiting examples, Lazarus, MobiFlex, MoSync, and Phonegap. Also, mobile device manufacturers distribute software developer kits including, by way of non-limiting examples, iPhone and iPad (iOS) SDK, Android™ SDK, BlackBerry ® SDK, BREW SDK, Palm ® OS SDK, Symbian SDK, webOS SDK, and Windows ® Mobile SDK.
  • the computer program may include a standalone application, which is a program that is run as an independent computer process, not an add-on to an existing process, e.g., not a plug-in.
  • a compiler is a computer program(s) that transforms source code written in a programming language into binary object code such as assembly language or machine code. Suitable compiled programming languages include, by way of non-limiting examples, C, C++, Objective-C, COBOL, Delphi, Eiffel, Java™, Lisp, Python™, Visual Basic, and VB.NET, or combinations thereof. Compilation is often performed, at least in part, to create an executable program.
  • a computer program includes one or more executable compiled applications.
  • a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof.
  • a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof.
  • the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application.
  • software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on cloud computing platforms. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
  • the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same.
  • suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase.
  • a database is internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing-based.
  • a database is based on one or more local computer storage devices.
  • the methods described herein encompass delivering one or more reports detailing STR repeat length, statistics, and/or raw data for any one or more of the loci in Table 1. Reports can be delivered over the internet or through the mail to a health care provider, physician, or consumer. Reports can be delivered by e-mail, a secure network, or downloaded from a secure site. The reports can be hard-copy physical reports or in electronic format.
  • Example 1 - TREDPARSE is accurate on simulated data
  • TREDPARSE out-performs many other callers of short tandem repeats.
  • TREDPARSE was compared with commonly used general-purpose variant callers, including Manta, Isaac, and GATK. Not surprisingly, they perform poorly on the simulated datasets, as shown in Figs. 7A-7D. Manta (Fig. 7A), Isaac (Fig. 7B), and GATK (Fig. 7C) are unable to call a repeat length that would be useful for clinically determining Huntington's disease, while lobSTR is unable to call any STR diseases above the cutoff for Huntington's disease repeat length (e.g., greater than 120).
  • variant callers can detect small indels, but in most cases fail to recover the length of long alleles (i.e., large indels). Additionally, the indels could occur at different locations within the repeat tract; it is not sufficient to construct locus-based callers that inspect indels collectively, making direct calling of the repeat size difficult without further post-processing. Based on these comparisons, it was found that most tools tested thus far were not effective at quantifying the number of repeats.
  • lobSTR, a tool specifically designed for STR variant calling, performed better than other variant callers at short allele size ranges, up to 40 CAGs (SEQ ID NO: 9), which is close to the risk threshold for HD, but below the risk threshold of 12 other STR diseases (Table 1).
  • Figs. 8A-8D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus.
  • (A) Performance comparison of TREDPARSE and lobSTR on simulated haploid with one single allele with h number of CAGs, where h varies from 1 to 300 (SEQ ID NO: 7);
  • (B) Performance comparison of TREDPARSE and lobSTR on simulated diploid with two alleles, one allele fixed with 20 CAGs (SEQ ID NO: 8), another allele with h units of CAGs;
  • (C) Performance of TREDPARSE on simulated diploid with low haploid depth of 5x;
  • (D) Performance of TREDPARSE on simulated diploid with high haploid depth of 80x. Shaded regions represent the 95% credible interval for TREDPARSE estimates of h. RMSD represents root-mean-square deviation between the estimated and true values of h. Figs. 8A-8D also disclose "40xCAGs" as SEQ ID NO: 9.
  • the TREDPARSE caller extended the calling of the size of the allele beyond a typical read length.
  • TREDPARSE predicted the long allele sizes with lower RMSD (root-mean-square deviation) in the simulated diploid cases.
  • Increasing average sequence depth decreased the CI and RMSD (Fig. 8C, 15.20 RMSD at 5x depth; Fig. 8B, 9.27 RMSD; Fig. 8D, 8.77 RMSD at 80x depth). Additionally, most truth values fell within the 95% credible intervals as shown in Fig.
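For reference, the RMSD figures quoted above can be computed as in the sketch below. It uses the conventional definition of root-mean-square deviation (the square root of the mean squared difference between estimated and true repeat counts), which is assumed here to be the one intended; the function name `rmsd` and the sample values are illustrative only.

```python
import math


def rmsd(estimates, truths):
    """Root-mean-square deviation between estimated and true repeat counts."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example: three simulated alleles and their estimated sizes.
true_h = [20, 40, 100]
estimated_h = [20, 41, 98]
print(round(rmsd(estimated_h, true_h), 3))  # 1.291
```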
  • TREDPARSE extends the limit of STR size detection well beyond the physical read length. This extension is critical in many cases since several of the disease risk cutoffs are close to or beyond the read length (150 bp for mainstream Illumina sequencers). Based on our simulations, the current detection limit for TREDPARSE is around 500 bp, which is roughly equal to the paired-end distance in Illumina HiSeq sequencing libraries. This detection limit enables detection of risk alleles for most loci listed in Table 1.
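To put these base-pair limits in terms of repeat units, the short sketch below converts a span in base pairs to the number of complete repeat units it can hold. The helper `max_full_units` is hypothetical, and a 3 bp motif (such as CAG) is assumed for the example.

```python
def max_full_units(span_bp: int, motif_len: int) -> int:
    """Largest number of complete repeat units that fit in span_bp bases."""
    if motif_len <= 0:
        raise ValueError("motif_len must be positive")
    return span_bp // motif_len


READ_LEN_BP = 150         # mainstream Illumina read length, from the text
DETECTION_LIMIT_BP = 500  # approximate TREDPARSE limit, from the text

print(max_full_units(READ_LEN_BP, 3))         # 50
print(max_full_units(DETECTION_LIMIT_BP, 3))  # 166
```

Under these assumptions a single 150 bp read can contain at most 50 trinucleotide units even with no flanking sequence, whereas the roughly 500 bp limit corresponds to about 166 units.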
  • Each of the four types of read evidence available for use has its own range of predictive power across the spectrum of likely STR repeat lengths. Overall, the maximum repeat length that each type of evidence can identify increases from spanning reads to partial reads, paired-end reads, and repeat-only reads, as shown in Fig. 3E. The repeat-only reads often cover the longest range in a typical Illumina sequencing experiment, bounded by the paired-end distance.
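A rough sketch of how single reads might be sorted into three of these groups is given below. The working definitions are assumptions made for illustration: a spanning read contains both flanking sequences, a partial read contains one flank plus part of the repeat, and a repeat-only read consists entirely of the repeat motif. The paired-end group depends on the distance between the two mates of a pair and is not handled by this per-read function. The flank sequences and function names are hypothetical.

```python
def is_pure_repeat(seq: str, motif: str) -> bool:
    """True if seq is a substring of a long tandem run of motif."""
    if not seq:
        return False
    tract = motif * (len(seq) // len(motif) + 2)
    return seq in tract


def classify_read(seq: str, left_flank: str, right_flank: str, motif: str) -> str:
    """Assign one read to a toy evidence group based on flank content."""
    has_left = left_flank in seq
    has_right = right_flank in seq
    if has_left and has_right:
        return "spanning"
    if (has_left or has_right) and motif in seq:
        return "partial"
    if is_pure_repeat(seq, motif):
        return "repeat-only"
    return "uninformative"


LEFT, RIGHT, MOTIF = "ACGTT", "GGATC", "CAG"  # made-up flanks, CAG motif
print(classify_read(LEFT + MOTIF * 3 + RIGHT, LEFT, RIGHT, MOTIF))  # spanning
print(classify_read(LEFT + MOTIF * 5, LEFT, RIGHT, MOTIF))          # partial
print(classify_read("AGCAGCAGCAGC", LEFT, RIGHT, MOTIF))            # repeat-only
```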
  • TREDPARSE was run on sequence data from 12,632 individuals and identified a total of 138 individuals with risk alleles at 15 disease loci (Fig. 5), as well as 54 individuals inferred to be 'carriers' who are capable of passing a recessive risk allele onto their offspring. Specifically, 15 DM1, 2 FXTAS, 5 HD, 8 OPMD, 1 SBMA, 26 SCA1, 4 SCA2, 2 SCA6, 3 SCA8, 52 SCA17, 1 BPES, 5 CCD, 11 CCHS, 2 HFG and 1 SD5 at-risk individuals were inferred (Table 1).
  • a subset (n = 19) of the 138 individuals that were reported by TREDPARSE to contain a risk allele was selected for confirmation by an orthogonal method, as summarized in Fig. 11. The cases for which sufficient DNA was confirmed to be available were subjected to CLIA Sanger sequencing (Table 2). Out of 19 cases, 11 had identical lengths for Sanger and TREDPARSE, 4 did not match exactly but were called "at risk" by both Sanger and TREDPARSE, and 4 were discordant (an example is given in Figs. 12A-12C). In all 4 discordant cases, Sanger identified only the shorter allele, leaving an inference that these cases only contain shorter allele(s).
  • Figs. 12A-12C show an example of validation of TREDPARSE calls using Sanger and Oxford Nanopore sequencing.
  • (A) TREDPARSE called two alleles, of sizes 17 and 84;
  • (B) Sanger sequencing identified only the allele with size 17 (Fig. 12B discloses SEQ ID NOS 14-15, respectively, in order of appearance);
  • (C) Oxford Nanopore sequencing confirms the longer allele, showing two peaks of allele sizes that both match the prediction of TREDPARSE. Sample mean coverage of the input BAM is 33x.
  • TREDs that are considered reliable, with at least 1 validated sample, included HD, DM1, SBMA, SCA1, SCA2, SCA8, SCA17, and FXTAS. These are the most confident STR loci, since at-risk individuals were observed in our HLI samples, the calls were experimentally validated, and they had support from in silico simulation. There were a total of 8 TRED diseases in our list for which we have observed risk alleles but have not obtained experimental validation because of lack of DNA material. Nonetheless, these loci have good simulation support and concordant calls within families. These loci included OPMD, SCA6, BPES, CCD, CCHS, HFG, FRDA and SD5.
  • the first family had a father-to-daughter transmission of a risk allele for the Huntington locus, which has 41 CAG repeats (SEQ ID NO: 11) in the father and 40 repeats (SEQ ID NO: 9) in the daughter as shown in Fig. 13A (Fig. 13A discloses SEQ ID NO: 9). These alleles have been experimentally validated through Sanger sequencing (Table 2).
  • the second family showed a putative DM1 risk allele transmitted from the mother to both children while the father was unaffected, as shown in Fig. 13B (Fig. 13B discloses SEQ ID NO: 16).
  • GeT-RM Genetic Testing Reference Materials Coordination Program
  • GeT-RM provides cell lines or DNA that can be used as reference materials for genotyping inherited diseases, including Myotonic Dystrophy, Fragile X syndrome, and Huntington disease.
  • TREDPARSE was able to predict risk alleles for 5 out of the 6 cell lines.
  • Sample NA20236, which is known to have allele sizes of 31/53 in the FXTAS locus, was missed by TREDPARSE;
  • sample NA05164, which is known to have allele sizes of 21/340 in the DM1 locus, has the size of the long allele under-predicted by TREDPARSE.
  • the predictions on the four other cell lines exactly or closely match the known truth (Table 3).
  • lobSTR failed to predict long alleles in all cases, and failed to generate any predictions for the two FXTAS cases.
  • ExpansionHunter Gene Res. 27: 1895-1903
  • Embodiment 1 A method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • Embodiment 2 The method of Embodiment 1, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 3 The method of Embodiments 1 or 2, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 4 The method of any one of Embodiments 1 to 3, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
  • Embodiment 5 The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
  • Embodiment 6 The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
  • Embodiment 7 The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
  • Embodiment 8 The method of any one of Embodiments 1 to 7, wherein the second alignment uses a Smith-Waterman algorithm.
  • Embodiment 9 The method of any one of Embodiments 1 to 7, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 10 The method of any one of Embodiments 1 to 9, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034;
  • Embodiment 12 The method of any one of Embodiments 1 to 11, further comprising determining a ploidy for an X chromosome from the extracted reads.
  • Embodiment 13 The method of any one of Embodiments 1 to 12, further comprising delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
  • Embodiment 14 The method of Embodiment 13, wherein the report is in electronic format.
  • Embodiment 15 A computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, a first read group comprising paired-
  • Embodiment 16 The computer-implemented system of Embodiment 15, wherein the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 17 The computer-implemented system of Embodiments 15 or 16, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 18 The computer-implemented system of any one of Embodiments 15 to 17, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
  • Embodiment 19 The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
  • Embodiment 20 The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
  • Embodiment 21 The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
  • Embodiment 22 The computer-implemented system of any one of Embodiments 15 to 21, wherein the second alignment uses a Smith-Waterman algorithm.
  • Embodiment 23 The computer-implemented system of any one of Embodiments 15 to 22, wherein the probabilistic model estimates a maximum likelihood of the repeat length of an STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 24 The computer-implemented system of any one of Embodiments 15 to 23, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011
  • Embodiment 25 The computer-implemented system of any one of Embodiments 15 to 24, wherein an STR read length greater than 120 base pairs is accurately quantitated.
  • Embodiment 26 The computer-implemented system of any one of Embodiments 15 to 25, further comprising a software module configured to determine a ploidy for the X chromosome from the extracted reads.
  • Embodiment 27 The computer-implemented system of any one of Embodiments 15 to 26, further comprising a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
  • Embodiment 28 The computer-implemented system of Embodiment 27, wherein the report is in electronic format.
  • Embodiment 29 A method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • Embodiment 30 The method of Embodiment 29, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 31 The method of Embodiments 29 or 30, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 32 The method of any one of Embodiments 29 to 31, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
  • Embodiment 33 The method of any one of Embodiments 29 to 32, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
  • Embodiment 34 The method of any one of Embodiments 29 to 33, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
  • Embodiment 35 The method of any one of Embodiments 29 to 34, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
  • Embodiment 36 The method of any one of Embodiments 29 to 35, wherein the second alignment uses a Smith-Waterman algorithm.
  • Embodiment 37 The method of any one of Embodiments 29 to 36, wherein the probabilistic model estimates a maximum likelihood of the repeat length of an STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 38 The method of any one of Embodiments 29 to 37, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-9207
  • Embodiment 39 The method of any one of Embodiments 29 to 38, wherein an STR read length greater than 120 base pairs is accurately quantitated.
  • Embodiment 40 The method of any one of Embodiments 29 to 39, further comprising determining a ploidy for the X chromosome from the extracted reads.
  • Embodiment 41 The method of any one of Embodiments 29 to 40, further comprising delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
  • Embodiment 42 The method of Embodiment 41, wherein the report is in electronic format.
  • Embodiment 43 A computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
  • Embodiment 44 The computer-implemented system of Embodiment 43, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 45 The computer-implemented system of Embodiments 43 or 44, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 46 The computer-implemented system of any one of Embodiments 43 to 45, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
  • Embodiment 47 The computer-implemented system of any one of Embodiments 43 to 46, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
  • Embodiment 48 The computer-implemented system of any one of Embodiments 43 to 47, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
  • Embodiment 49 The computer-implemented system of any one of Embodiments 43 to 48, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
  • Embodiment 50 The computer-implemented system of any one of Embodiments 43 to 49, wherein the second alignment uses a Smith-Waterman algorithm.
  • Embodiment 51 The computer-implemented system of any one of Embodiments 43 to 50, wherein the probabilistic model estimates a maximum likelihood of the repeat length of an STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 52 The computer-implemented system of any one of Embodiments 43 to 51, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:920710
  • Embodiment 53 The computer-implemented system of any one of Embodiments 43 to 52, wherein an STR read length greater than 120 base pairs is accurately quantitated.
  • Embodiment 54 The computer-implemented system of any one of Embodiments 43 to 53, further comprising a software module configured to determine a ploidy for the X chromosome from the extracted reads.
  • Embodiment 55 The computer-implemented system of any one of Embodiments 43 to 54, further comprising a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
  • Embodiment 56 The computer-implemented system of Embodiment 55, wherein the report is in electronic format.
  • Embodiment 57 A method of determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location; aligning the extracted nucleic acid sequence reads; parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
  • Embodiment 58 The method of Embodiment 57, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 59 The method of Embodiments 57 and 58, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 60 The method of any one of Embodiments 57 to 59, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 61 The method of any one of Embodiments 57 to 60, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011
  • Embodiment 62 The method of any one of Embodiments 57 to 61, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
  • Embodiment 63 The method of any one of Embodiments 57 to 62, further comprising determining a ploidy for an X chromosome from the extracted reads.
  • Embodiment 64 A non-transitory computer-readable medium in which a program is stored for causing a computer to perform a method for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus, the method comprising extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location; aligning the extracted nucleic acid sequence reads; parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and determining a risk probability of having a specific disease or disorder associated with the
  • Embodiment 65 The method of Embodiment 64, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 66 The method of Embodiments 64 and 65, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 67 The method of any one of Embodiments 64 to 66, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 68 The method of any one of Embodiments 64 to 67, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011
  • Embodiment 69 The method of any one of Embodiments 64 to 68, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
  • Embodiment 70 The method of any one of Embodiments 64 to 69, further comprising determining a ploidy for an X chromosome from the extracted reads.
  • Embodiment 71 A system for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: a sequencing unit configured to generate nucleic acid sequence reads; an alignment engine configured to extract the nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and align the extracted nucleic acid sequence reads; and a diagnosing unit comprising a repeat length determination engine configured to receive aligned reads and (1) parse the reads into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads, and (2) determine the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and a
  • Embodiment 72 The system of Embodiment 71, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
  • Embodiment 73 The system of Embodiments 71 and 72, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
  • Embodiment 74 The method of any one of Embodiments 71 to 73, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
  • Embodiment 75 wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19
  • Embodiment 76 The system of any one of Embodiments 71 to 75, further comprising determining a ploidy for an X chromosome from the extracted reads.
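
The embodiments above repeatedly refer to parsing locally aligned reads into spanning, partial, repeat-only, and paired-end groups without defining them in this section. The following Python sketch shows one plausible way to assign those labels. The group definitions (flank on both sides, flank on one side, no flank, mates on opposite sides of the repeat), the `LocalAlignment` record, the `min_flank` cutoff, and all function names are assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum


class ReadGroup(Enum):
    SPANNING = "spanning"            # assumed: flank on both sides of the repeat
    PARTIAL = "partial"              # assumed: flank on one side only
    REPEAT_ONLY = "repeat_only"      # assumed: read lies entirely in the repeat
    UNINFORMATIVE = "uninformative"  # no repeat bases at all


@dataclass
class LocalAlignment:
    """Hypothetical summary of one read after local alignment to the STR locus."""
    left_flank_bases: int   # bases aligned to the flank before the repeat
    repeat_bases: int       # bases aligned to the repeat tract
    right_flank_bases: int  # bases aligned to the flank after the repeat


def classify(aln: LocalAlignment, min_flank: int = 5) -> ReadGroup:
    """Assign a single read to one read group (min_flank is an assumed cutoff)."""
    if aln.repeat_bases == 0:
        return ReadGroup.UNINFORMATIVE
    has_left = aln.left_flank_bases >= min_flank
    has_right = aln.right_flank_bases >= min_flank
    if has_left and has_right:
        return ReadGroup.SPANNING
    if has_left or has_right:
        return ReadGroup.PARTIAL
    return ReadGroup.REPEAT_ONLY


def is_flanking_pair(left_mate_end: int, right_mate_start: int,
                     str_start: int, str_end: int) -> bool:
    """Assumed paired-end criterion: the two mates sit on opposite sides of the repeat."""
    return left_mate_end < str_start and right_mate_start > str_end


if __name__ == "__main__":
    print(classify(LocalAlignment(20, 30, 20)))
    print(classify(LocalAlignment(0, 150, 0)))
```

Single-read groups and the pair-level group are kept in separate functions because one read can belong to both, for example a partial read whose mate lands on the far side of the repeat.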

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Hall/Mr Elements (AREA)

Abstract

Described herein are methods, software, and systems for determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from nucleic acid sequence reads generated by short read technology.

Description

DETERMINATION OF STR LENGTH BY SHORT READ SEQUENCING
CROSS-REFERENCE TO RELATED APPLICATION
[0001] The present application claims the benefit of priority under 35 USC §119 to U.S. Provisional Patent Application Serial No. 62/539,896 entitled "DETERMINATION OF STR LENGTH BY SHORT READ SEQUENCING" filed on August 1, 2017, the disclosure of which is hereby incorporated by reference in its entirety for all purposes.
SEQUENCE LISTING
[0002] The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on August 1, 2018, is named 095515-0093_SL.txt and is 5,772 bytes in size.
BACKGROUND
[0003] Short tandem repeats (STRs) are hyper-mutable sequences in the human genome that are often used in forensics and population genetics, and are also the underlying cause of many genetic diseases. There are challenges associated with accurately determining the length polymorphism of STR loci in the genome using, for example, Next Generation Sequencing (NGS). In particular, accurate detection of pathological STR expansion is limited by the sequence read length during whole genome analysis.
[0004] Moreover, although many variant-calling software tools (for example, Manta, Isaac, GATK, and lobSTR) can identify some short indels in reads that span STRs, all have functional limitations. Some of these tools seek to identify STR variants by specifically examining the sequencing reads that pile up around a target STR region. For example, lobSTR uses three separate steps: sensing, alignment, and allelotyping, which together explicitly model two possible alleles (diploid) as well as the sequencing errors typically associated with STRs (due to stutter noise). However, lobSTR only considers reads that fully span an STR locus, so the short length of Illumina reads (100-150 bases) imposes a major limitation on the length of STR alleles that can be identified. Another tool, STRViper, estimates length variation at an STR by combining information from a prior estimate with the observed sizes of paired-end sequence fragments spanning the STR. However, STRViper assumes a single allele at each site, which is a significant limitation for quantitating STRs from diploid human calls.
[0005] Long sequence reads, such as those from single molecule real-time sequencing or nanopore sequencing, could potentially increase both the precision and the range of detectable variants. Long read sequencing, however, is both expensive and inefficient. The per-base cost of long read technologies is greater than that of short read technologies for whole genome sequencing, limiting their utility for typing STRs. At the same time, high throughput genotyping of STRs by short read technologies remains limited due to low effective coverage, sequencing stutters, and a lack of robust models that perform both haploid and diploid calls while distinguishing true variation from technical artifacts.
[0006] A need therefore exists to provide methods, software and systems for using whole-genome sequencing data and/or sequencing reads that map to STR loci to better predict allele lengths for STRs, of both short and long length, including disease and forensic loci. In particular, a need exists to provide methods, software and systems that yield highly accurate typing of many disease-related STRs. Moreover, a need exists to provide methods for determining STR length that allow use of more efficient next-generation sequencing technology, for example taking into account ploidy and allowing modeling of both shorter and longer stretches of repeats.
SUMMARY
[0007] In one aspect, a method of determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus is provided. The method can comprise extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and aligning the extracted nucleic acid sequence reads. The method can further comprise parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. The method can further comprise determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups, and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
[0008] In another aspect, provided is a non-transitory computer-readable medium in which a program is stored for causing a computer to perform a method for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus. The method can comprise extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and aligning the extracted nucleic acid sequence reads. The method can further comprise parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. The method can further comprise determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups, and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
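
The extraction criterion described above (reads mapped within a read length of the STR locus, and/or reads whose mate maps within about 2 kb of the repeat location) can be sketched as a simple predicate. This is a minimal illustration under stated assumptions: the `Read` record, the function name `should_extract`, the 150-base default read length, and the chromosome naming are hypothetical; only the chromosome 4:3074877-3074933 coordinates and the 2 kb window come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    """Hypothetical minimal record for one read of a pair in the first alignment."""
    chrom: Optional[str]        # None if the read is unmapped
    start: Optional[int]
    end: Optional[int]
    mate_chrom: Optional[str]   # None if the mate is unmapped
    mate_start: Optional[int]


def should_extract(read: Read, locus_chrom: str, locus_start: int, locus_end: int,
                   read_length: int = 150, mate_window: int = 2000) -> bool:
    """Return True if the read should be pulled out for local realignment."""
    # Case 1: the read itself maps within one read length of the STR locus.
    if read.chrom == locus_chrom and read.start is not None and read.end is not None:
        if (read.end >= locus_start - read_length
                and read.start <= locus_end + read_length):
            return True
    # Case 2: the read is not mapped near the locus, but its mate maps
    # within about 2 kb of the repeat location.
    if read.mate_chrom == locus_chrom and read.mate_start is not None:
        return locus_start - mate_window <= read.mate_start <= locus_end + mate_window
    return False


if __name__ == "__main__":
    # One of the loci listed in the text: chromosome 4:3074877-3074933.
    locus = ("4", 3074877, 3074933)
    overlapping = Read("4", 3074800, 3074950, "4", 3075200)
    unmapped_with_near_mate = Read(None, None, None, "4", 3073500)
    unrelated = Read("1", 1000, 1150, "1", 1400)
    for r in (overlapping, unmapped_with_near_mate, unrelated):
        print(should_extract(r, *locus))
```

The second case matters for long expansions: a read made up entirely of repeat units may fail to map in the first alignment, and its mate is then the only evidence tying it to the locus.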
[0009] In yet another aspect, a system is provided for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus. The system can comprise a sequencing unit configured to generate nucleic acid sequence reads, and an alignment engine configured to extract the nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and align the extracted nucleic acid sequence reads. The system can further comprise a diagnosing unit comprising a repeat length determination engine configured to receive aligned reads and (1) parse the reads into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads, and (2) determine the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups. The diagnosing unit can further comprise a risk assessment engine configured to determine a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
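
One way to read the risk assessment step is as the probability mass that the repeat-length estimate places at or beyond the locus-specific threshold. The sketch below illustrates that interpretation only; the function name, the representation of the estimate as a mapping from repeat count to probability, and the example threshold of 40 repeat units are all assumptions and do not come from the application.

```python
from typing import Mapping


def risk_probability(length_posterior: Mapping[int, float], threshold: int) -> float:
    """Fraction of probability mass on repeat counts at or beyond the threshold.

    length_posterior maps a repeat count to its (possibly unnormalized)
    probability for the longer allele; both inputs are hypothetical here.
    """
    total = sum(length_posterior.values())
    if total <= 0:
        raise ValueError("posterior must have positive total mass")
    beyond = sum(p for units, p in length_posterior.items() if units >= threshold)
    return beyond / total


if __name__ == "__main__":
    # Hypothetical posterior over repeat counts and a hypothetical threshold.
    posterior = {38: 0.2, 40: 0.5, 42: 0.3}
    print(round(risk_probability(posterior, threshold=40), 3))
```

Reporting a probability rather than a yes/no flag keeps the uncertainty of the length estimate visible when the call sits close to the threshold.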
[0010] In another aspect, provided is a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: (a) extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; (b) creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (c) parsing the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (d) determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups. In certain embodiments, the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads. In certain embodiments, the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 20 or greater. 
In certain embodiments, the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome. In certain embodiments, the second alignment uses a Smith-Waterman algorithm or variation thereof. In certain embodiments, the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome. In certain embodiments, the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546. In certain embodiments, the method is able to accurately quantitate an STR read length greater than 120 base pairs. In certain embodiments, the method further comprises determining a ploidy for the X chromosome from the extracted reads.
In certain embodiments, the method further comprises delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus. In certain embodiments, the report is in electronic format.
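As an illustrative sketch only of the local realignment step described above, the following Python function implements a basic Smith-Waterman scoring recursion with a linear gap penalty. The function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration and are not taken from this disclosure. Lowering the magnitude of `gap` makes the aligner more tolerant of the length differences expected within a repeat tract.

```python
def smith_waterman_score(read, ref, match=2, mismatch=-3, gap=-2):
    """Return the best local alignment score between read and ref.

    Uses a linear gap penalty (each gapped base costs `gap`).
    All scoring values are illustrative assumptions.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            # Score for aligning read[i-1] against ref[j-1]
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # match or mismatch
                prev[j] + gap,      # gap in the reference
                curr[j - 1] + gap,  # gap in the read
            )
            best = max(best, curr[j])
        prev = curr
    return best


# A read carrying 3 CAG units against a reference carrying 4 CAG units
read = "TT" + "CAG" * 3 + "GG"
ref = "TT" + "CAG" * 4 + "GG"
strict = smith_waterman_score(read, ref, gap=-10)
lenient = smith_waterman_score(read, ref, gap=-1)
```

With the lenient gap penalty, the alignment can bridge the missing repeat unit and keep both flanks aligned, so it scores higher than the strict setting, which can only align one flank plus the repeat.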
[0011] In another aspect, provided is a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: (a) a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; (b) a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (c) a software module configured to parse the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (d) a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups. In certain embodiments, the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads. In certain embodiments, the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 20 or greater. In certain embodiments, the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome. In certain embodiments, the second alignment uses a Smith-Waterman algorithm or variation thereof.
In certain embodiments, the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome. In certain embodiments, the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546. In certain embodiments, the system is able to accurately quantitate an STR read length greater than 120 base pairs. In certain embodiments, the system further comprises a software module configured to determine a ploidy for the X chromosome from the extracted reads. In certain embodiments, the system further comprises a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus.
In certain embodiments, the report is in electronic format.
[0012] In another aspect, provided is a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: (a) creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (b) parsing the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (c) determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups. In certain embodiments, the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads. In certain embodiments, the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 20 or greater. In certain embodiments, the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome.
In certain embodiments, the second alignment uses a Smith-Waterman algorithm or variation thereof. In certain embodiments, the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome. In certain embodiments, the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546. In certain embodiments, the method is able to accurately quantitate an STR read length greater than 120 base pairs. In certain embodiments, the method further comprises determining a ploidy for the X chromosome from the extracted reads.
In certain embodiments, the method further comprises delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus. In certain embodiments, the report is in electronic format.
[0013] In another aspect, provided is a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: (a) a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; (b) a software module configured to parse the reads from the second alignment into at least two informative read groups, the at least two informative read groups comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and (c) a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups. In certain embodiments, the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads. In certain embodiments, the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads. In certain embodiments, the nucleic acid sequence reads from the first alignment comprise paired-end reads. In certain embodiments, the first alignment to the genome is at an average sequence depth of 5 or greater. In certain embodiments, the first alignment to the genome is at an average sequence depth of 10 or greater. 
In certain embodiments, the first alignment to the genome is at an average sequence depth of 20 or greater. In certain embodiments, the second alignment is aligned using a method that comprises a lower gap penalty than a method used for the first alignment to the genome. In certain embodiments, the second alignment uses a Smith-Waterman algorithm or variation thereof. In certain embodiments, the probabilistic model estimates a maximum likelihood of the repeat length of a short tandem repeat (STR) sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome. In certain embodiments, the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546. In certain embodiments, the system is able to accurately quantitate an STR read length greater than 120 base pairs.
In certain embodiments, the system further comprises a software module configured to determine a ploidy for the X chromosome from the extracted reads. In certain embodiments, the system further comprises a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the short tandem repeat (STR) sequence at the given STR locus. In certain embodiments, the report is in electronic format.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Fig. 1 shows a non-limiting example of a workflow for determining STR length, in accordance with various embodiments.
[0015] Figs. 2A-2B show a comparison of two sequence alignment methods exploiting the periodicity of the STR sequences, in accordance with various embodiments.
[0016] Figs. 3A-3E show an integrated probabilistic model to call STRs with four types of evidence, in accordance with various embodiments.
[0017] Fig. 4 shows a non-limiting example of a digital processing device, in accordance with various embodiments.
[0018] Fig. 5 shows a non-limiting example of a web/mobile application provision system, in accordance with various embodiments.
[0019] Fig. 6 shows a non-limiting example of a cloud-based web/mobile application provision system, in accordance with various embodiments.
[0020] Figs. 7A-7D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus tested against several variant callers, including (A) Manta (B) Isaac (C) GATK, (D) lobSTR, in accordance with various embodiments.
[0021] Figs. 8A-8D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus, in accordance with various embodiments.
[0022] Figs. 9A and 9B show examples of posterior probability density function based on the integrated model to call STRs, in accordance with various embodiments.
[0023] Figs. 10A-10D illustrate the individual contribution of four types of evidence to a final STR call, namely spanning reads, partial reads, repeat reads and paired-end distance, in accordance with various embodiments.
[0024] Fig. 11 shows an example summary of testing and validation on 12,632 whole genome sequences, in accordance with various embodiments.
[0025] Figs. 12A-12C show an example of validation of calls using Sanger and Oxford Nanopore sequencing, in accordance with various embodiments.
[0026] Figs. 13A-13C show individuals with risk alleles at Huntington disease (HD) locus in whole genome samples, in accordance with various embodiments.
[0027] Figs. 14A-14D show simulations with synthetic datasets of implanted STR alleles at Huntington (HD) locus using several known variant callers, including (A) Manta (B) Isaac (C) GATK, (D) lobSTR, in accordance with various embodiments.
[0028] Fig. 15 is a flow chart illustrating a method for determining that a subject is at risk for having a disease or disorder, in accordance with various embodiments.
[0029] Fig. 16 is a schematic diagram illustrating a system for determining that a subject is at risk for having a disease or disorder, in accordance with various embodiments.
[0030] It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended only to bring clarity and understanding to various embodiments of the methods, apparatuses, and systems disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.
DETAILED DESCRIPTION
[0031] This specification describes exemplary embodiments and applications of the disclosure. The disclosure, however, is not limited to these exemplary embodiments and applications or to the manner in which the exemplary embodiments and applications operate or are described herein. Other embodiments, features, objects, and advantages of the present teachings will be apparent from the description and accompanying drawings, and from the claims. Moreover, the figures may show simplified or partial views, and the dimensions of elements in the figures may be exaggerated or otherwise not in proportion.
[0032] As used herein, the terms "comprise", "comprises", "comprising", "contain", "contains", "containing", "have", "having" "include", "includes", and "including" and their variants are not intended to be limiting, are inclusive or open-ended and do not exclude additional, unrecited additives, components, integers, elements or method steps. For example, a process, method, system, composition, kit, or apparatus that comprises a list of features is not necessarily limited only to those features but may include other features not expressly listed or inherent to such process, method, system, composition, kit, or apparatus.
[0033] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Any reference to "or" herein is intended to encompass "and/or" unless otherwise stated.
[0034] Unless otherwise defined, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by those of ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligo- or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al, Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well known and commonly used in the art.
[0035] DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine), and RNA (ribonucleic acid) is comprised of 4 types of nucleotides: A, U (uracil), G, and C. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, "nucleic acid sequencing data," "nucleic acid sequencing information," "nucleic acid sequence," "genomic sequence," "genetic sequence," "fragment sequence," or "nucleic acid sequencing read" denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA. It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
[0036] In various embodiments, a sequence alignment method can align a fragment sequence to a reference sequence or another fragment sequence. The fragment sequence can be obtained from a fragment library, a paired-end library, a mate-pair library, a concatenated fragment library, or another type of library that may be reflected or represented by nucleic acid sequence information including for example, RNA, DNA, and protein based sequence information. Generally, the length of the fragment sequence can be substantially less than the length of the reference sequence. The fragment sequence and the reference sequence can each include a sequence of symbols. The alignment of the fragment sequence and the reference sequence can include a limited number of mismatches between the symbols of the fragment sequence and the symbols of the reference sequence. Generally, the fragment sequence can be aligned to a portion of the reference sequence to minimize the number of mismatches between the fragment sequence and the reference sequence.
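The mismatch-minimizing placement described above can be sketched in Python as follows. This is a minimal illustration that considers only ungapped placements and counts mismatches directly; the function name is hypothetical, and practical aligners use indexed and gapped methods rather than this exhaustive scan.

```python
def best_ungapped_placement(fragment, reference):
    """Return (offset, mismatches) for the window of `reference` that
    differs from `fragment` at the fewest positions.

    Ties are resolved in favor of the leftmost window.
    """
    if len(fragment) > len(reference):
        raise ValueError("fragment must not be longer than reference")
    best_offset, best_mismatches = 0, len(fragment) + 1
    for offset in range(len(reference) - len(fragment) + 1):
        window = reference[offset:offset + len(fragment)]
        # Count symbol-by-symbol disagreements for this placement
        mismatches = sum(a != b for a, b in zip(fragment, window))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


placement = best_ungapped_placement("CAGT", "TTACAGCAGG")
```

Here the fragment is placed at offset 3 of the reference with a single mismatch, which is the smallest mismatch count over all windows.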
[0037] In particular embodiments, the symbols of the fragment sequence and the reference sequence can represent the composition of biomolecules. For example, the symbols can correspond to identity of nucleotides in a nucleic acid, such as RNA or DNA, or the identity of amino acids in a protein. In some embodiments, the symbols can have a direct correlation to these subcomponents of the biomolecules. For example, each symbol can represent a single base of a polynucleotide. In other embodiments, each symbol can represent two or more adjacent subcomponents of the biomolecules, such as two adjacent bases of a polynucleotide. Additionally, the symbols can represent overlapping sets of adjacent subcomponents or distinct sets of adjacent subcomponents. For example, when each symbol represents two adjacent bases of a polynucleotide, two adjacent symbols representing overlapping sets can correspond to three bases of polynucleotide sequence, whereas two adjacent symbols representing distinct sets can represent a sequence of four bases. Further, the symbols can correspond directly to the subcomponents, such as nucleotides, or they can correspond to a color call or other indirect measure of the subcomponents. For example, the symbols can correspond to an incorporation or non-incorporation for a particular nucleotide flow.
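The difference between overlapping and distinct two-base symbols can be made concrete with a short Python sketch. The function name is hypothetical and the symbols are represented simply as two-character strings, which is an assumption made for readability rather than an encoding taken from this disclosure.

```python
def dibase_symbols(bases, overlapping=True):
    """Split a base string into two-base symbols.

    overlapping=True  -> adjacent symbols share one base (stride 1)
    overlapping=False -> adjacent symbols share no base (stride 2);
                         a trailing unpaired base is dropped
    """
    step = 1 if overlapping else 2
    return [bases[i:i + 2] for i in range(0, len(bases) - 1, step)]


overlapping_pair = dibase_symbols("ACG", overlapping=True)
distinct_pair = dibase_symbols("ACGT", overlapping=False)
```

Two overlapping symbols (`AC`, `CG`) cover three bases, whereas two distinct symbols (`AC`, `GT`) cover four bases, matching the counts given above.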
[0038] Microsatellites, or short tandem repeats (STRs), are stretches of simple nucleotide repetitions in the genome, with typical repeat units of 1 to 6 bp in length. Short tandem repeats are often polymorphic due to strand slippage during DNA replication, and are a common source of rare genetic diseases. The mutation rates of STRs are typically on the order of ~10^-4 mutations per generation per site, as compared to point mutation rates that are on the order of ~10^-8 mutations per generation per site for single nucleotide variants (SNVs). Because of the higher mutation rate, STRs offer a different level of resolution to study kinship and trait variations among individuals.
[0039] STRs are currently used in forensics to identify suspects from DNA traces left at a crime scene. The amplification targets the 13 CODIS (Combined DNA Index System) STR loci and the sizes of the amplicons are analyzed by electrophoresis. The repeat number at each locus is inferred from the size of the amplicon and a DNA profile is generated. STRs also have a role in inferring genealogy. For example, STR loci on the Y-chromosomes (Y-STRs) are used to define haplotypes that predated the use of Y-SNPs. The STR data, coupled with public genealogy databases like Y-search, can be used for "surname inference."
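The inference of repeat number from amplicon size mentioned above for forensic typing amounts to simple arithmetic, sketched here in Python. The function name and the example sizes are hypothetical; in particular, the combined length of the non-repeat flanking sequence in an amplicon depends on the primers used and is treated here as a known input.

```python
def repeats_from_amplicon(amplicon_bp, flank_bp, unit_bp):
    """Infer the repeat count from an amplicon size.

    amplicon_bp: measured amplicon length in base pairs
    flank_bp:    combined length of non-repeat sequence in the amplicon
                 (assumed known from the primer design)
    unit_bp:     length of one repeat unit
    Returns (full_repeats, leftover_bp); a nonzero leftover suggests a
    partial repeat or a sizing error.
    """
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0 or unit_bp <= 0:
        raise ValueError("inconsistent sizes")
    return divmod(repeat_bp, unit_bp)


# Hypothetical tetranucleotide locus with 100 bp of flanking sequence
call = repeats_from_amplicon(amplicon_bp=160, flank_bp=100, unit_bp=4)
```

In this hypothetical case a 160 bp amplicon corresponds to 15 complete four-base repeat units.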
[0040] STRs have been shown to be involved in several human genetic diseases. Several neural-degenerative disorders, known as the "polyglutamine" (PolyQ) diseases, are caused by variable stretches of the repeated trinucleotide CAG within protein-coding exons. Examples of PolyQ diseases are Huntington's disease (HD) and several forms of Spinocerebellar ataxia (SCA). Huntington's disease is caused by an expansion of the CAG repeats in the first exon of the Huntingtin gene (HTT). Individuals carrying an expanded allele have motor, cognitive and psychological symptoms that typically appear at the age of 40 years old or older, depending on the number of repeats.
[0041] STRs also occur in non-coding regions and can regulate gene expression and histone modifications, affecting the expression of nearby genes in cis to the STR sites. Examples of these repeat disorders include Myotonic dystrophy (DM1) with CTG repeats, Friedreich Ataxia (FRDA) with GAA repeats, and Fragile X syndrome with CGG repeats. STRs that regulate gene expression (e-STRs) are mostly enriched in genes responsible for cognitive functions and autoimmune responses.
[0042] Herein are described methods, software and systems for using, for example, whole-genome sequencing data and/or sequencing reads that map to STR loci to predict allele lengths for STRs. The methods, software and systems described herein, can incorporate various cues from read alignment and paired-end distance distribution, as well as a sequence stutter model using a probabilistic framework to infer the repeat sizes for STR loci, of both short and long read length, including disease and forensic loci.
[0043] Testing on both simulated datasets and more than 10,000 sequenced full human genomes demonstrates that the methods, software and systems described herein (also termed TREDPARSE) yield highly accurate typing of many disease-related STRs. The methods described herein solve problems associated with other methods for determining STR length by allowing use of more efficient next-generation sequencing technology, taking into account ploidy, and allowing modeling of both shorter and longer stretches of repeats.
[0044] Due to their unstable nature and costly testing procedures, STR loci have so far been mostly under-utilized in population efforts to assess STR disease diagnoses, risks, and prevalence. The methods, systems and software (also referred to as TREDPARSE herein) enable simultaneous identification of many STR loci, using whole genome sequencing data. The whole genome approach offers advantages over conventional STR testing: it limits the potential bias introduced during the amplification step, reduces cost, and improves efficiency by analyzing multiple loci simultaneously. With full genome sequencing becoming more accessible across large numbers of individuals, it is anticipated that STR-related diseases will be of more interest to clinicians and researchers.
[0045] Described herein, in accordance with embodiments, is a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising, extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
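The read extraction step of the method above can be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed implementation: the `Read` record, the function names, the 0-based half-open coordinates, the default read length of 150, and the use of a read's own length for its mate are all hypothetical choices made for the example. A production system would typically read these fields from an alignment file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    name: str
    chrom: Optional[str]       # None if the read itself is unmapped
    start: Optional[int]
    length: int
    mate_chrom: Optional[str]  # None if the mate is unmapped
    mate_start: Optional[int]


def overlaps(chrom, start, length, locus_chrom, lo, hi):
    """True if [start, start + length) on chrom intersects [lo, hi)."""
    return (chrom == locus_chrom and start is not None
            and start + length > lo and start < hi)


def extract_candidates(reads, locus_chrom, locus_start, locus_end,
                       read_length=150, mate_window=2000):
    """Keep reads mapped within one read length of the locus, or reads
    whose mate maps within `mate_window` bp of the locus."""
    selected = []
    for r in reads:
        near_locus = overlaps(r.chrom, r.start, r.length, locus_chrom,
                              locus_start - read_length,
                              locus_end + read_length)
        mate_near = overlaps(r.mate_chrom, r.mate_start, r.length,
                             locus_chrom, locus_start - mate_window,
                             locus_end + mate_window)
        if near_locus or mate_near:
            selected.append(r)
    return selected


reads = [
    Read("r1", "chr4", 3074800, 150, "chr4", 3075100),  # maps at the locus
    Read("r2", None, None, 150, "chr4", 3073500),       # unmapped, mate nearby
    Read("r3", "chr4", 3000000, 150, "chr4", 3000300),  # far from the locus
]
kept = [r.name for r in extract_candidates(reads, "chr4", 3074877, 3074933)]
```

In this example the locus coordinates are those listed herein for chromosome 4, and reads `r1` and `r2` are retained for the second, local alignment while `r3` is discarded.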
[0046] Also described herein, in accordance with embodiments, is a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0047] Also described herein, in accordance with embodiments, is a method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0048] Also described herein, in accordance with embodiments, is a computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
Repeat length determination and disease diagnosis (TREDPARSE)
[0049] Various methods for STR repeat length determination described herein, in accordance with various embodiments, also known as TREDPARSE, allow for accurate quantitation of repetitive indels, such as short tandem repeats (STRs). The methods described herein determine each allele length at pre-defined STR loci using short read whole-genome sequence data that are sampled at sufficient depth. Given a set of observed reads that are mapped around a particular STR locus, the method can estimate up to two haplotypes h1 and h2, where 1 ≤ h1 < h2 ≤ hmax, that represent the numbers of repeat units from an individual and that maximize the likelihood in our model.
[0050] Fig. 1 is a flow chart illustrating a general workflow or system 100 for determining STR length, in accordance with various embodiments. Note that, as this is a general workflow, the steps of Fig. 1 are discussed in greater detail with regard to the embodiments described in Figs. 15 and 16. Those detailed descriptions are still applicable to the embodiments of Fig. 1. Referring to Fig. 1, the method optionally determines the correct ploidy level 101 to account for X-linked and autosomal loci. First, previously aligned reads are realigned by, for example, using an alignment algorithm that is adapted to better determine STR length 102. Reads used to determine the nucleic acid sequence of a whole genome, whole exome, or large portions of the genome are usually aligned using a method that penalizes gaps (e.g., indels) that are longer than a few nucleotides in length. This is because most sequences in the genome are not STR sequences and must be properly aligned at the outset to locate the STR boundaries. For example, an initial Burrows-Wheeler alignment is realigned using a dynamic programming algorithm (e.g., Smith-Waterman). This realignment leads to a more precise counting of repeat elements. Four types of reads can be accounted for in this extraction: spanning reads 103, partial reads 104, repeat-only reads 105, and paired-end reads 106. These four types of reads (evidence) are incorporated into a probabilistic model 107. Finally, one can optionally compute the likelihood of disease using the proper inheritance model (dominant or recessive) 108. The combination of all or some of these features enables TREDPARSE to make clinically relevant profiling of STR-related diseases. The full probabilistic model can be partitioned into four major sources of evidence (the four read types that are considered together or in sub-combination of two or more types). Consequently, a maximum likelihood estimate of h1, h2 for the likelihood function P(observations | h1, h2) is determined. In various aspects, the method allows for determination of repeat lengths by short read sequencing when the reads average between 300 and 50, 250 and 100, 150 and 100, 100 and 30, or 100 and 50 nucleotides in length.
In various aspects, the method allows for determination of repeat lengths by short read sequencing greater than 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 nucleotides, including increments therein, or greater. In various aspects, the method allows for determination of repeat lengths by short read sequencing equal to at least 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 nucleotides, including increments therein, or greater. As used herein "repeat length" is equal to the repeat unit K multiplied by the number of repeats of the particular unit R. For example, the human reference genome (hg38) has R = 19 for the Huntington locus, which is a repeat of "CAG" (K = 3), so RK = 57 is the total repeat length in nucleotides or base pairs.
Ploidy determination
[0051] STR loci are autosomal or X-linked. For autosomal loci, a diploid allele should be taken into account depending on whether the disease is dominant or recessive. For X-chromosome loci, ploidy also should be taken into account with the assumption that an individual possessing only a single locus (a male) will be afflicted by the presence of a single disease allele even if the disease is recessive. Optionally, ploidy can be determined by sequence reads if the sex of the individual is not known from other sources such as a questionnaire, medical form, or interview. Autosomal STRs are modeled as diploid loci, allowing two alleles to be inferred per locus. For STRs on the X chromosome (X-linked), a simple example method to infer the gender for the given sample is to compute the median read depth on selected unique regions of the Y chromosome. If the median depth on the Y chromosome is less than 1, then the gender is inferred to be female and ploidy(X) = 2, ploidy(Y) = 0 is defined; otherwise, ploidy(X) = 1, ploidy(Y) = 1 is defined, which is consistent with the expected ploidy number of sex chromosomes of a male individual. In various embodiments, ploidy determination is required and can be performed before, during, after or in parallel with repeat length determination.
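The median-depth rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `infer_sex_ploidy`, the `threshold` parameter, and the example depth values are assumptions for the sketch and are not part of the described method.

```python
from statistics import median

def infer_sex_ploidy(y_depths, threshold=1.0):
    """Return (ploidy_X, ploidy_Y) from read depths sampled on unique Y-chromosome regions.

    y_depths: per-region read depths (hypothetical input; how the unique
    regions are chosen is outside the scope of this sketch).
    """
    if median(y_depths) < threshold:
        return 2, 0  # median Y depth below 1: consistent with a female sample
    return 1, 1      # otherwise: consistent with a male sample

# Made-up depth values, for illustration only
print(infer_sex_ploidy([0, 0, 1, 0, 0]))
print(infer_sex_ploidy([14, 17, 15, 16, 18]))
```

Using the median rather than the mean keeps a few spuriously mapped reads on the Y chromosome from flipping the call.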
Short-read sequencing technologies
[0052] The methods, software, and systems developed herein can be used with short read sequence data. Exemplary short read sequencing technologies include sequencing by synthesis, pyrosequencing, or ion semi-conductor sequencing. Short read sequencing is in contrast to long read technologies such as Sanger sequencing, single molecule real-time sequencing, or nanopore sequencing. These technologies, due to their long read length, can sequence long STRs in a single read. However, as mentioned before, there are significant drawbacks to these long read technologies in the health care setting, primarily: high cost and overall inefficiency. In various embodiments, long read sequencing technologies produce reads greater than 200, 300, 400, 500, or more base pairs, for example.
[0053] Short read sequencing technologies produce short nucleic acid reads generally in the range of 20-400 base pairs in length, with 35 to 150 base pair read lengths being most typical. For many STR-based diseases, a short read sequencing technology may not encompass the STR in one read. Since short read technologies can sequence from both ends, they produce paired-end 5' to 3' reads. Paired-end reads comprise a first read and a second read which are reverse complements of the same strand. These reads can overlap or be separated by 1 to several hundred base pairs of sequenced nucleic acid. Since these reads can in effect bracket an STR, they are useful for methods described herein. In certain embodiments, TREDPARSE requires paired-end reads. In various embodiments, TREDPARSE utilizes reads of less than 200, 150, 140, 130, 120, 110, 100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, or 30 base pairs in length, including increments therein.
[0054] Short read data for use with the STR repeat length determination described herein can be any nucleic acid sequenceable by short read technologies. The nucleic acid sequence can be derived from DNA or cDNA (by way of reverse transcription from RNA). In certain aspects, the DNA is genomic DNA derived from a biological sample taken from an individual including, but not limited to, saliva, blood, plasma (including cell-free), serum, tissue biopsy, extracted from circulating peripheral blood mononuclear cells, stool, urine, or semen. The nucleic acid can be prepared by any method known in the art for preparation of sequencing libraries. This can include preparation for paired-end sequencing.
Initial alignments
[0055] Initial alignments can be performed in many ways, and this step can serve to align reads to their proper genomic location. The initial alignment is sometimes not focused on accurately quantitating STR read length, but on properly aligning the many millions of short reads to a proper genomic locus. A Burrows-Wheeler alignment and variations thereof is an example of a suitable initial alignment method, but other technologies capable of aligning greater than 1 million 35 base pair shorter reads can be employed. In various embodiments, the initial alignment method has a higher gap penalty than a method that is used in a realignment to quantitate STR length. In various embodiments, the initial alignment method differs from a method used in a subsequent realignment to quantitate STR length.
Realignment
[0056] Reads that are mapped around the STR region are extracted and realigned. These reads can be extracted, for example, from a BAM or a SAM file. A goal of the realignment is to obtain an accurate count of the occurrences of the repeat motifs. Most read mapping methods, when aligning reads to a reference, have a high penalty for long indels. This often results in alignment misses or misalignments leading to false predictions. The quality of sequence alignment is therefore crucial in accurately counting the repeats in STR regions. Reads are often aligned using variations of a Burrows-Wheeler alignment (BWA). In certain embodiments, reads that are within 1, 2, 3, 4, or 5 read lengths of an STR locus are extracted for realignment. This number can vary depending upon the read length of the initial sequencing data. In various embodiments, the read length is greater than at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides, including increments therein. In various embodiments, the read length is less than at least 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or 210 nucleotides, including increments therein. In various embodiments, reads that are within 1, 2, 3, 4, or 5 kilobases of an STR locus are extracted for realignment.
[0057] In TREDPARSE, realignment methods can be ones that are adapted to more accurately count the number of STR repeats. In various embodiments, dynamic programming with the Smith-Waterman (SW) algorithm to count the number of repeats is used for realignment. In various embodiments, a Single-Instruction-Multiple-Data (SIMD) Smith-Waterman library for fast alignment is used for realignment. In various embodiments, the realignment method utilizes a striped Smith-Waterman algorithm. An exemplary scoring scheme is: match = 1, mismatch = 5, gap_open = 7, gap_extend = 2, with a higher penalty for mismatches compared to the default alignment settings in the BWA aligner. In various embodiments, the realignment method utilizes a multiple templates method, whereby the method aligns a read to a series of templates embedded with varying numbers of repeats, using standard SW alignment with a fixed scoring scheme. The advantages of a "multiple templates method" compared to a "periodic" Smith-Waterman alignment are shown in Figs. 2A and 2B. Fig. 2A discloses SEQ ID NOS 18, 18 and 19, respectively, in order of appearance. Fig. 2B also discloses SEQ ID NO: 19.
[0058] In Figs. 2A and 2B, a comparison of two sequence alignment methods exploiting the periodicity of the STR sequences is illustrated. Determination of a hypothetical sequence AAGTCCTTCCAGCAGCAGCAACAGCCG (SEQ ID NO: 1) is modeled. (A) A "Periodic Smith-Waterman" method modifies the recurrence table when performing the dynamic programming step so that repeat units are not penalized during matching; (B) a "Multiple templates" method aligns the read to a series of templates (SEQ ID NOS: 2 to 6) embedded with varying numbers of repeats, using standard SW alignment with a fixed scoring scheme. The alignment yields a series of alignments with different scores, which are then compared to determine the repeat size that corresponds to the highest score.
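The "multiple templates" idea can be illustrated with a small, unoptimized Python sketch. It scores the hypothetical read of SEQ ID NO: 1 against templates built as prefix + n repeat units + suffix and keeps the n with the highest local alignment score, using the exemplary scoring scheme (match = 1, mismatch = 5, gap_open = 7, gap_extend = 2). Several details are assumptions made for illustration: the split of the read into the flanks `AAGTCCTTC` and `CCG`, the function names, and the gap convention (a gap of length k costs gap_open + (k - 1) * gap_extend). A production implementation would use a SIMD striped Smith-Waterman library rather than this pure-Python loop.

```python
def sw_score(read, template, match=1, mismatch=5, gap_open=7, gap_extend=2):
    """Best local alignment score (Smith-Waterman with affine gaps)."""
    neg_inf = float("-inf")
    n = len(template)
    h_prev = [0] * (n + 1)        # best score ending at (i-1, j)
    f_prev = [neg_inf] * (n + 1)  # best score ending in a gap in the template
    best = 0
    for i in range(1, len(read) + 1):
        h = [0] * (n + 1)
        f = [neg_inf] * (n + 1)
        e = neg_inf               # best score ending in a gap in the read
        for j in range(1, n + 1):
            e = max(e - gap_extend, h[j - 1] - gap_open)
            f[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if read[i - 1] == template[j - 1] else -mismatch
            h[j] = max(0, h_prev[j - 1] + sub, e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best

def best_repeat_count(read, prefix, unit, suffix, max_units):
    """Align the read to templates with 1..max_units repeats; return the best n and all scores."""
    scores = {n: sw_score(read, prefix + unit * n + suffix)
              for n in range(1, max_units + 1)}
    return max(scores, key=scores.get), scores

READ = "AAGTCCTTCCAGCAGCAGCAACAGCCG"  # hypothetical read from the text (SEQ ID NO: 1)
best_n, scores = best_repeat_count(READ, "AAGTCCTTC", "CAG", "CCG", 8)
print(best_n, scores)
```

Because every template is scored with the same fixed scheme, the scores are directly comparable, and the interrupted unit (CAA) is absorbed as a single mismatch in the best-scoring template instead of breaking the repeat count.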
[0059] TREDPARSE can realign two types of reads extracted from the BAM file: 1) reads that are mapped within a read length from the repeat location; and 2) reads that are unmapped but whose mate is mapped within a distance of, for example, about 1 kb from the repeat location. Distances can also include, for example, about 2 kb, 3 kb, 4 kb, 5 kb, and so on. The number of repeats is then determined for each read in the STR region. The number of base pairs required to call the existence of a flank is 9 bp, and this threshold plays an important role in the classification of the various types of reads. During the alignment, each read is classified as a prefix read (read with flanking sequence left of the repeats) or a suffix read (read with flanking sequence right of the repeats) depending on the positions where the alignments start or end on the read. Based on the alignment information, reads with both prefix and suffix are classified as spanning reads, and reads with either prefix or suffix but not both are classified as partial reads. Reads that consist only of repeats are repeat-only reads. These reads are sorted into a set of observations that are integrated in a probabilistic model for STR size inference (Figs. 2A and 2B).
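The classification rule described above can be written as a short Python sketch. It assumes the realignment step has already reported how many flanking base pairs were matched on each side of the repeat tract; the function name and argument names are hypothetical, and only the 9 bp threshold comes from the text.

```python
def classify_read(left_flank_bp, right_flank_bp, min_flank=9):
    """Classify a read that overlaps the STR by the flanking sequence it shows.

    left_flank_bp / right_flank_bp: matched flanking bases left / right of the
    repeats, as reported by a realignment step (assumed input).
    """
    has_prefix = left_flank_bp >= min_flank
    has_suffix = right_flank_bp >= min_flank
    if has_prefix and has_suffix:
        return "spanning"      # both flanks present
    if has_prefix or has_suffix:
        return "partial"       # exactly one flank present
    return "repeat-only"       # no callable flank on either side

print(classify_read(30, 25))   # both flanks
print(classify_read(30, 4))    # right flank too short to call
print(classify_read(0, 0))     # repeats only
```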
Type of Reads
[0060] Reads extracted from an initial alignment can be parsed into 4 different categories. These reads and their utility in determining STR length are detailed in Figs. 3A-3E. Figs. 3A-3E show an integrated probabilistic model to call STRs with four types of evidence. (A) Model based on spanning reads; (B) Model based on partial reads; (C) Model based on repeat-only reads; (D) Model based on paired-end reads; (E) Predictive power for each of the four evidence types on the range of STR repeat lengths.
[0061] Fig. 3A shows spanning reads S (arrows). The spanning reads are the reads that show both left and right flanking sequences. Flanking sequences are non-STR genomic sequences that are derived from the read. A flanking sequence can be any number of nucleotides, but there is an optimum amount which allows for quantitating the longest STR repeat. In various embodiments, a flanking sequence is between 5-20, 6-18, 7-16, 8-14, 9-12, or equal to 8, 9, 10, 11, or 12 base pairs. Since a spanning read encompasses the entire STR locus, inference on the number of repeat units by spanning reads is straightforward, with the counted size matching or close to the true size. The spanning reads would show exactly the size of the underlying allele if there is no noise due to stuttering. Stuttering noise can impact the size determination from this read. Stuttering occurs due to polymerase or template slipping in highly repetitive regions and can result in deletions or insertions of repeats that are observed but not actually present. A stuttering model, which considers the periodicity of the repeat as well as the GC content, can allow a certain proportion of the spanning reads to show a different size than the true allele size. In various embodiments, the stuttering model is applied when analyzing any one or all of spanning, partial or repeat-only reads. In various embodiments, the stuttering model can return a distribution or confidence interval for the actual repeat length.
[0062] Fig. 3B shows partial reads T (arrows). The partial reads do not align all the way across the repeat region and comprise only one flanking sequence. The partial reads have a probability mass function of discrete uniform distribution between a single repeat unit up to the true repeat length. Therefore, unlike the full spanning reads which show exactly or close to (in case of stuttering error) the number of repeat units of the underlying allele, the partial reads show a lower bound for the number of repeat units of the underlying allele. The inference task is to infer the maximum number of repeats, given observed allele sizes from partial reads. The inference is analogous to the "German tank problem" but with replacement, under the condition that the allele cannot exceed the read length minus the length of the flanking sequence.
[0063] Fig. 3C shows repeat-only reads U (shaded arrows). Reads that consist almost entirely of repeat units are repeat-only reads. Each repeat-only read often has a relatively unique mate that allows it to be mapped. Repeat-only reads are possible only when the repeat length is the same as or longer than the read length. Assuming each read is equally likely to start anywhere in the genome, the expected number of repeat-only reads that fall in a certain region follows a Poisson distribution. These repeat-only reads are typically mapped in the STR region because they have a read pair that mapped to a flanking site. The repeat-only reads can be critical since they allow the inference of repeats longer than the read length.
[0064] Fig. 3D shows paired-end reads V (arrows). Additional information can be gathered from the group of paired-end reads (sometimes called "mates") that span the STR region. The observed distance between the two mate reads typically follows a distribution p(V) for a specific sequencing library. This distribution can be inferred by compiling the distances between all (or a representative subset of) the paired-end reads across the genome. For alleles without indels in the STR region, the observed distances can be distributed identically to p(V). If there is a homozygous insertion or deletion of repeat units in the STR region, the distribution p(V) would shift to p(V + RK - hK), where K is the repeat unit length, R is the number of repeats in the reference, and h is the number of repeats in the sampled allele, where
Figure imgf000024_0001
Expanded repeats (or longer h), when mapped onto the reference, show a compression of paired-end distances; conversely, shortened repeats (or shorter h) show an expansion of paired-end distances. The shift (R - h)K between the two distributions, p(V) and p(V + RK - hK), should indicate the difference in repeat length relative to the reference genome. Like repeat-only reads, the paired-end distance is also useful to extend the prediction of allele size beyond the length of a typical sequencing read, since the paired-end distance is often longer than the read length. Fig. 3D discloses SEQ ID NOS 12-13, respectively, in order of appearance.
Exemplary Integrated Probabilistic Model
[0065] To fully model the uncertainties of observing a set of reads that are generated by a certain repeat size, a probabilistic model can be implemented predicting the size of STRs, based on evidence from a selected combination of spanning reads, partial reads, repeat-only reads and spanning pairs. The spanning reads align to both flanks of the target repeat; the partial reads align to only one flank of the repeat; the repeat-only reads align entirely within the repeat tract and thus consist entirely of repeat units; the spanning pairs are read pairs that span the STR region, i.e. with one end on each side of the repeat. A non-limiting example is described below, with the following notations:
• L: read length in base pairs (bp), e.g. L = 150 for 150 base pairs reads
• D: haplotype depth, i.e. average sequencing depth divided by ploidy. For a diploid locus, it is equal to half of the sequencing depth
• F: number of base pairs required to call flanking sequence. By default, this can be set to at least 9 bp when matching flanking sequences so we have F = 9
• R: number of repeat units in the reference sequence
• K: repeat unit length, e.g. K = 3 for triplet 'CAG' repeats
• S: observed number of repeat units in a spanning read
• T: observed number of repeat units in a partial read
• U: number of repeat-only reads, i.e. reads that consist entirely of repeats
• V : observed paired-end distance in bp for a spanning pair
• h1, h2: number of repeat units in two alleles, respectively. Without loss of generality, we assume 1 ≤ h1 < h2 ≤ hmax. For a haploid locus (such as the X-linked locus in a male), we have
[0066] The repeat length is equal to the number of repeat units multiplied by the repeat unit length. For example, the human reference genome (hg38) has R = 19 for the Huntington locus, which is a repeat of "CAG" (K = 3), so RK = 57 is the total repeat length in base pairs. Formally, the observations are a set of $l$ spanning reads with repeat units $S_1, \ldots, S_l$; $m$ partial reads with repeat units $T_1, \ldots, T_m$; $U$ repeat-only reads; and $n$ spanning pairs with paired-end distances in base pairs $V_1, \ldots, V_n$. The goal is to estimate $h_1, h_2$ that maximize the likelihood of the set of observations, $P(\text{observations} \mid h_1, h_2)$.
Spanning reads
[0067] The spanning reads are the reads that show both left and right flanking sequences of at least F bp. The spanning reads are quite straightforward, with the counted size matching or close to the true size. The spanning reads would show the size of the underlying allele if there is no noise due to stuttering. The sharp peak becomes 'fuzzier' after incorporating the stuttering noise. Using the stuttering model trained by lobSTR, which considers the periodicity of the repeat as well as the GC content, the stuttering model allows a certain proportion of the spanning reads to show a different size than the true allele size.
[0068] To account for stutter noise, the following example model can be utilized. With probability π(K), the read is a product of stutter noise, which is dependent on the repeat unit length K and also the GC content of the locus. If a read is a product of stutter, then with probability Poisson(s; λK), the noisy read deviates by s units from the original allele, where Poisson(s; λK) is a Poisson distribution with mean λK. Deviation can be either positive or negative with equal probability π(K)/2. Parameters π(K) and λK were previously trained by lobSTR for a range of values K. Hence, the probability of generating a spanning read with S observed repeat units in the STR region from a hemizygous locus with an STR with h repeat units is:
Figure imgf000026_0001
Note that there may not be any spanning reads expected when s(h1) = s(h2) = 0 if both allele lengths are longer than L - 2F. In that case, we set We then have the mixing distribution:
Figure imgf000026_0003
Figure imgf000026_0002
In the case of spanning reads, the longer allele typically has a smaller contribution to the number of observations.
Partial reads
[0069] Partial reads do not align all the way across the repeat region and show only one flanking sequence. The partial reads have a probability mass function of discrete uniform distribution. Unlike the full spanning reads, which show exactly the repeat units of the underlying allele, partial reads only show a lower bound for the number of repeat units of the underlying allele. This inference is analogous to the "German tank problem" but with replacement, under the condition that the allele cannot exceed L - F. The probability of generating a partial read with T observed repeat units in the STR region from a hemizygous locus with an STR with h repeat units is:
Figure imgf000027_0004
For a diploid STR locus with h1 and h2 repeat units, we have a mixed distribution with mixing rate
where
Figure imgf000027_0001
We then have the mixing distribution:
Figure imgf000027_0002
In the case of partial reads, the longer allele typically has a larger contribution to the number of observations.
Repeat-only reads
[0070] Reads that consist only of repeat units are called "repeat-only" reads. When the repeat length hK is the same as or longer than the read length L, repeat-only reads are possible if they start in a region with size hK - L. Repeat-only reads allow the inference of repeats longer than the read length. Assuming each read can start anywhere in the genome, the expected number of repeat-only reads follows a Poisson distribution: where
Figure imgf000027_0003
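As a rough illustration of how a Poisson model for repeat-only reads behaves, the Python sketch below assumes a rate of D(hK - L)/L, i.e. the haplotype depth D converted to read starts per base (D/L) and multiplied by the window of size hK - L in which a repeat-only read can start. This rate is an assumption made for the sketch, not a formula taken from the equation in the text, and the function names are hypothetical.

```python
import math

def repeat_only_rate(h, K, L, D):
    """Assumed Poisson rate: read starts per base (D / L) times the window hK - L."""
    window = max(h * K - L, 0)   # no repeat-only reads if the repeat is shorter than a read
    return D * window / L

def poisson_pmf(u, lam):
    """P(U = u) for a Poisson distribution with mean lam."""
    if lam == 0:
        return 1.0 if u == 0 else 0.0
    return math.exp(-lam) * lam ** u / math.factorial(u)

# Example with made-up values: CAG repeat (K = 3), 150 bp reads, haplotype depth 15
for h in (19, 60, 100):
    lam = repeat_only_rate(h, K=3, L=150, D=15)
    print(h, lam, poisson_pmf(0, lam))
```

Under this assumption, an allele shorter than the read length predicts zero repeat-only reads with certainty, so observing even one such read is strong evidence for a long allele.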
Paired-end Reads
[0071] Additional information can be gathered from the group of paired-end reads (also called mate pairs) that span the STR region. The observed distance between two mate reads typically follows a distribution p(V) for a specific sequencing library. This distribution can be inferred by compiling the distances between all (or a representative subset of) the paired-end reads across the genome. For alleles without indels in the STR region, the observed distances should be distributed identically to p(V). If there is a homozygous insertion or deletion in the STR region, the distribution p(V) would shift to p(V + RK - hK). Expanded repeats (or longer h), when mapped onto the reference, show a compression of paired-end distances. The shift (R - h)K between the two distributions, p(V) and p(V + RK - hK), should indicate the difference in repeat length relative to the reference genome. Then we have:
Figure imgf000028_0001
where C is a normalizing constant to ensure that PV(V | h) sums to 1. Like repeat-only reads, the paired-end distance is also useful to extend the prediction of allele size beyond the length of a typical sequencing read, since the paired-end distance is often longer than the read length. For a diploid STR locus with h1 and h2 repeat units, we have a mixed distribution with mixing rate πV:
where
Figure imgf000028_0002
We then have the mixing distribution:
Figure imgf000028_0003
The paired-end mode is only enabled when there are at least 5 spanning pairs across the STR locus. With too few observations, the variance of our maximum likelihood estimates based on spanning pairs alone can be substantial.
Integrated Probabilistic Model
[0072] Each of the four types of read evidence, spanning reads, partial reads, repeat-only reads, and paired-end reads has its own range of predictive power across the range of likely STR repeat length, as either limited by read length or paired-end distance of the sequencing library. This is shown graphically in Fig. 3E. Taken altogether in various combinations, data from spanning reads, partial reads, repeat-only reads, and spanning pairs can be combined as a combination of two, three or four read groups. For example, with four read groups, the data can be combined under the assumption that each type of evidence is independent given the true repeat numbers:
Figure imgf000029_0001
[0073] The maximum likelihood estimates can be obtained from the model through a grid search. Examples of typical likelihood surfaces can be seen in Figs. 5A and 5B. A suitable hmax can be set at the detection limit of the method given the exact types and lengths of the reads deployed. In a certain embodiment, hmax = 300 (e.g., limit of repeat length quantitation), so the full grid search would be 300 for haploid and 300 x 300 for diploid loci. However, a different hmax can be used, for example between 100 and 400, 150 and 350, 200 and 300, 250 and 300, 200 and 250, or greater than 100, 150, 200, 250, 300, 350, or 400, including increments therein.
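A grid search of this kind can be sketched generically in Python. The sketch takes any log-likelihood function of (h1, h2) and returns the maximizing pair. The function names and the toy log-likelihood are hypothetical stand-ins, not the integrated model itself. As a simplifying assumption, the diploid case enumerates only ordered pairs h1 ≤ h2 (including the homozygous case) rather than the full hmax x hmax grid, and the haploid case searches h1 = h2 only.

```python
def grid_search_mle(log_likelihood, h_max=300, ploidy=2):
    """Exhaustively search (h1, h2) for the highest log-likelihood.

    log_likelihood: callable (h1, h2) -> float, e.g. the log of the combined
    likelihood of all read evidence (supplied by the caller).
    """
    best, best_ll = None, float("-inf")
    for h1 in range(1, h_max + 1):
        # Haploid locus: a single allele, represented here as h1 == h2
        h2_values = [h1] if ploidy == 1 else range(h1, h_max + 1)
        for h2 in h2_values:
            ll = log_likelihood(h1, h2)
            if ll > best_ll:
                best, best_ll = (h1, h2), ll
    return best, best_ll

# Toy log-likelihood peaked at (19, 45), for illustration only
def toy_loglik(h1, h2):
    return -((h1 - 19) ** 2) - ((h2 - 45) ** 2)

print(grid_search_mle(toy_loglik, h_max=300))
```

With hmax = 300 the diploid search evaluates a few tens of thousands of pairs, which is why an exhaustive search is practical here and avoids getting trapped in a local optimum of the likelihood surface.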
Confidence of calls
[0074] Combining the evidence in the integrated probabilistic model, the marginal distributions P(h1 | observations) and P(h2 | observations) are computed, where:
Figure imgf000029_0002
Figure imgf000030_0001
From these marginal distributions, one can compute the 95% confidence intervals (CI) for
Figure imgf000030_0002
The
Figure imgf000030_0003
CI of distribution with a parameter Θ is defined as:
Figure imgf000030_0004
The 95% CI is not unique on a posterior distribution. One can use the 95% CI for which there is equal (1 - α)/2 = 2.5% mass on each tail.
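An equal-tailed interval of this kind can be read off a discrete marginal distribution as in the Python sketch below. The function name, the dictionary representation of the marginal, and the example probabilities are assumptions for illustration. The bounds are taken as the smallest and largest repeat counts at which the accumulated tail mass first exceeds (1 - level)/2.

```python
def equal_tailed_ci(marginal, level=0.95):
    """Equal-tailed interval from a discrete marginal distribution.

    marginal: dict mapping repeat count -> probability (assumed to sum to 1).
    """
    tail = (1 - level) / 2
    values = sorted(marginal)

    lower, cum = values[0], 0.0
    for v in values:                # accumulate mass from the low end
        cum += marginal[v]
        if cum > tail:
            lower = v
            break

    upper, cum = values[-1], 0.0
    for v in reversed(values):      # accumulate mass from the high end
        cum += marginal[v]
        if cum > tail:
            upper = v
            break
    return lower, upper

# Made-up marginal for one allele, for illustration only
marginal_h1 = {18: 0.01, 19: 0.02, 20: 0.94, 21: 0.02, 22: 0.01}
print(equal_tailed_ci(marginal_h1))
```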
STR calculations
[0075] All the read evidence can be combined in the integrated model and the point estimates are computed based on maximum likelihood. The marginal distributions P(h1 | observations) and P(h2 | observations) can be computed. From these marginal distributions, the 95% confidence interval (95% CI) for h1 and h2 can be computed. The confidence interval calculated can vary depending upon the requirements of the method. In various embodiments, the confidence interval can be within 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%.
Confidence intervals for the estimates are typically much wider for larger repeat sizes and are tighter for data with high sequencing depth or shorter repeat sizes. Sequencing depth is a key driver of the confidence interval, and in various aspects the method for STR repeat length determination is performed using sequencing data with a sequencing depth greater than 10x, 20x, 30x, 40x, 50x, 60x, 70x, 80x, 90x, or more, including increments therein. The STR calculations using the method herein are highly accurate compared to other methods. In various embodiments, the method is more accurate than the lobSTR caller by at least 2-fold, 3-fold, 4-fold, or 5-fold based upon a root-mean squared deviation (RMSD) from the true value over a range of repeat lengths. See Gymrek et al. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res 22, 1154-1162 (2012), for the lobSTR method. In various embodiments, the method has an absolute accuracy about equal to or less than 50 RMSD, 40 RMSD, 35 RMSD, 30 RMSD, or 25 RMSD when based upon simulated data for a trinucleotide repeat. See example 1, for example.
[0076] Optionally, a determination of pathogenicity can be computed, given dominant and recessive inheritance models under the assumption of complete penetrance and a point cutoff of size c. For example, Huntington's disease would have a cutoff at repeat length 120:
Figure imgf000031_0001
[0077] where Z is a normalizing constant. The inheritance model and risk cutoff size c (Table 1) are both important in the calculation of PP. Recessive inheritance requires the shorter allele h1 to be greater than or equal to the risk cutoff size c, while dominant inheritance requires the longer allele h2 to be greater than or equal to c. For X-linked recessive inheritance, only one allele needs to be greater than or equal to c in order to show pathology. Both the 95% CI and the pathological probability PP reflect the confidence of repeat size inference, with PP more pertinent for a clinical statement. In various embodiments, a patient can be defined as "at risk" if PP > 50%, 60%, 70%, 80%, or 90%, including increments therein, or more.
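The role of the inheritance model and the cutoff c can be illustrated with a short Python sketch that sums likelihood mass over the allele pairs satisfying the rule stated above and divides by a normalizing constant Z. The function name, the dictionary of weights keyed by (h1, h2) with h1 ≤ h2, the example weights, and the cutoff of 40 repeat units are all hypothetical; the weights stand in for whatever likelihood or posterior values the integrated model produces.

```python
def pathological_probability(joint, cutoff, inheritance):
    """Share of total mass on allele pairs that meet the risk rule.

    joint: dict {(h1, h2): non-negative weight}, with h1 <= h2 (assumed input).
    cutoff: risk cutoff c, in repeat units.
    inheritance: "dominant" (longer allele h2 >= c) or
                 "recessive" (shorter allele h1 >= c).
    """
    if inheritance == "dominant":
        at_risk = lambda h1, h2: h2 >= cutoff
    elif inheritance == "recessive":
        at_risk = lambda h1, h2: h1 >= cutoff
    else:
        raise ValueError("inheritance must be 'dominant' or 'recessive'")
    z = sum(joint.values())  # normalizing constant Z
    return sum(w for (h1, h2), w in joint.items() if at_risk(h1, h2)) / z

# Made-up weights over three candidate allele pairs, for illustration only
joint = {(19, 38): 1.0, (19, 41): 3.0, (41, 45): 1.0}
print(pathological_probability(joint, cutoff=40, inheritance="dominant"))
print(pathological_probability(joint, cutoff=40, inheritance="recessive"))
```

For a haploid locus represented as h1 = h2, the two rules coincide, which matches the statement that a single allele at or above c suffices for X-linked recessive inheritance in a male.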
[0078] Accordingly, in accordance with various embodiments, Fig. 15 illustrates a workflow or method 1100 for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a STR sequence at a given STR locus.
[0079] At step 1110, nucleic acid sequence reads are extracted. The extracted reads are, for example, reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location. As disclosed herein, the nucleic acid reads can additionally be subject to an initial alignment.
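As a minimal sketch of the extraction criteria in step 1110, the following Python example filters simplified read records. The `Read` record, the function name `extract_reads`, the half-open coordinate convention, and the `mate_window` parameter are assumptions made for illustration; an actual implementation would typically read these fields from an alignment (e.g., BAM) file.

```python
from collections import namedtuple

# Hypothetical, simplified read record; coordinates are 0-based, half-open.
Read = namedtuple("Read", "name chrom start end mate_chrom mate_start")


def extract_reads(reads, chrom, locus_start, locus_end, read_length,
                  mate_window=2000):
    """Keep reads near the STR locus or whose mate maps within ~2 kb of it."""
    near_start = locus_start - read_length
    near_end = locus_end + read_length
    selected = []
    for r in reads:
        # Criterion 1: the read itself maps within one read length of the locus.
        near_locus = (r.chrom == chrom
                      and r.start < near_end
                      and r.end > near_start)
        # Criterion 2: the mate maps within mate_window of the repeat location.
        mate_near = (r.mate_chrom == chrom
                     and r.mate_start is not None
                     and locus_start - mate_window
                     <= r.mate_start
                     <= locus_end + mate_window)
        if near_locus or mate_near:
            selected.append(r)
    return selected


reads = [
    Read("r1", "X", 900, 1050, "X", 1200),     # overlaps the locus window
    Read("r2", "X", 5000, 5150, "X", 1500),    # far away, but mate is near
    Read("r3", "X", 5000, 5150, "X", 9000),    # read and mate both far
    Read("r4", "1", 900, 1050, "1", 1200),     # different chromosome
    Read("r5", None, None, None, "X", 1100),   # unmapped read, mate is near
]
kept = [r.name for r in extract_reads(reads, "X", 1000, 1060, read_length=150)]
```

Keeping the unmapped read `r5` reflects the idea that reads consisting largely of repeat sequence may fail to map while their mates anchor them near the locus.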
[0080] At step 1120, the extracted nucleic acid sequence reads are aligned. As disclosed herein, such alignment can be described as a realignment. As discussed herein, reads are often aligned (or realigned) using variations of a Burrows-Wheeler alignment (BWA). In various embodiments, reads that are within 1, 2, 3, 4, or 5 read lengths of an STR locus are extracted for realignment. This number can vary depending upon the read length of the initial sequencing data. In various embodiments, the read length is at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides, including increments therein. In various embodiments, the read length is less than 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or 210 nucleotides, including increments therein. In various embodiments, reads that are within 1, 2, 3, 4, or 5 kilobases of an STR locus are extracted for realignment.
[0081] In TREDPARSE, realignment methods are ones that are adapted to more accurately count the number of STR repeats. In various embodiments, dynamic programming with the Smith-Waterman (SW) algorithm to count the number of repeats is used for realignment. In various embodiments, a Single-Instruction-Multiple-Data (SIMD) Smith-Waterman library for fast alignment is used for realignment. In various embodiments, the realignment method utilizes a striped Smith-Waterman algorithm. Examples of the use of such realignment methods are discussed herein and applicable at least at this step.
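One way to picture repeat counting by Smith-Waterman realignment is sketched below in Python. This is a plain, scalar Smith-Waterman with a linear gap penalty, not the SIMD or striped variant mentioned above, and it is not presented as the TREDPARSE implementation. The function names, scoring parameters, and flank sequences are assumptions chosen for illustration. The read is scored against candidate references of the form left flank + k motif copies + right flank, and the best-scoring k is reported; this is meaningful only for reads that span the whole repeat.

```python
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best local alignment score of a against b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def count_repeats(read, left_flank, right_flank, motif, max_units):
    """Return the repeat count k whose synthetic reference best matches read."""
    best_k, best_score = 0, -1
    for k in range(max_units + 1):
        reference = left_flank + motif * k + right_flank
        score = smith_waterman_score(read, reference)
        if score > best_score:  # ties keep the smaller k
            best_k, best_score = k, score
    return best_k


# Hypothetical flanks and a spanning read that carries five CAG units.
LEFT, RIGHT, MOTIF = "ACGTTGCA", "TTGACCGT", "CAG"
spanning_read = LEFT + MOTIF * 5 + RIGHT
units = count_repeats(spanning_read, LEFT, RIGHT, MOTIF, max_units=10)
```

The quadratic-time loop is adequate for a sketch; a vectorized library would be the practical choice when many reads and candidate lengths are scored.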
[0082] At step 1130, the reads from the alignment can be parsed into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group is selected from the list consisting of spanning reads, partial reads, and repeat-only reads. Discussion of these read types is provided in detail herein and applicable at least at this step.
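The parsing in step 1130 can be illustrated with a simplified Python classifier. The working definitions used here are assumptions for illustration only: a spanning read contains both flanks, a partial read contains one flank plus at least one motif copy, and a repeat-only read consists entirely of motif copies. The detailed definitions are those provided elsewhere herein. Exact substring matching stands in for alignment, and the names `classify_read` and `is_repeat_only` are hypothetical.

```python
def is_repeat_only(read, motif):
    """True if read is made up entirely of the motif, in any phase."""
    if len(read) < len(motif):
        return False
    # A tract long enough to contain the read at every possible phase offset.
    tract = motif * (len(read) // len(motif) + 2)
    return read in tract


def classify_read(read, left_flank, right_flank, motif):
    """Assign a read to a coarse evidence group using exact matching."""
    has_left = left_flank in read
    has_right = right_flank in read
    if has_left and has_right:
        return "spanning"
    if (has_left or has_right) and motif in read:
        return "partial"
    if is_repeat_only(read, motif):
        return "repeat_only"
    # Remaining reads may still contribute paired-end distance evidence.
    return "other"


LEFT, RIGHT, MOTIF = "ACGTTGCA", "TTGACCGT", "CAG"  # hypothetical sequences
labels = [
    classify_read(LEFT + MOTIF * 3 + RIGHT, LEFT, RIGHT, MOTIF),
    classify_read(LEFT + MOTIF * 4, LEFT, RIGHT, MOTIF),
    classify_read("AGCAGCAGCAGC", LEFT, RIGHT, MOTIF),
    classify_read("GGGGTTTTAAAA", LEFT, RIGHT, MOTIF),
]
```

A production classifier would tolerate sequencing errors and would use the realignment coordinates rather than exact string containment.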
[0083] At step 1140, repeat length of the STR sequence is determined by applying a probabilistic model to the at least two informative read groups. Discussion of various probabilistic models is provided herein and applicable at least at this step.
[0084] At step 1150, a risk probability is determined, where the risk probability can constitute the risk of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus. The calculation or determination of this risk probability is discussed herein and applicable at least at this step. For example, determination of PP can serve as the risk probability or can inform an associated risk probability.
[0085] It should be understood that the preceding embodiments can be provided, whole or in part, as a system of components integrated to perform the methods described. For example, the workflow of FIG. 15 can be provided as a system of components or stations, illustrated for example in Fig. 16, for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a STR sequence at a given STR locus.
[0086] For example, referring to Fig. 16, sequencing data for use in the workflow can be generated from a sequencing unit 1202. The extracting and aligning steps (1110-1120) can be performed in an alignment engine 1204. Alignment engine 1204 is illustrated as a separate component and thus can be part of a larger alignment unit. Alternatively, alignment engine 1204 can be provided as a component of sequencing unit 1202 or even part of a diagnosing unit 1206, discussed below.
[0087] Referring to Fig. 16, diagnosing unit 1206 can include a repeat length determination engine 1208 and risk assessment engine 1210. The parsing step 1130 and determination of repeat length step 1140 can be performed by repeat length determination engine 1208. Though steps 1130 and 1140 are illustrated as part of a single repeat length determination engine 1208, steps 1130 and 1140 can be performed by separate engines such that a parsing engine can be provided in diagnosing unit 1206.
[0088] Referring to Fig. 16, the determining the risk probability step 1150 can be performed by risk assessment engine 1210.
[0089] It should be noted that these steps, in accordance with various embodiments, carried out on these system components in the illustrated arrangement or alternative arrangement, can be performed on an engine, module, or component that can be implemented as computer hardware, firmware, software, or any combination thereof. It should further be noted that these steps, in accordance with various embodiments or portions of the embodiments of the present teachings, can be implemented on a computer system such as that illustrated in Figs. 4-6.
[0090] Finally, an input/output device 1212 is provided. Device 1212 can be configured and arranged to receive the risk probability score on one hand, and also be configured and arranged to deliver inputs that may assist in allowing the system to perform its function.
STR Based diseases
[0091] The methods, software and systems described herein are for use in determining a clinically relevant repeat length for STRs that cause human disease. Most STR diseases possess a repeat length at which the disease becomes fully penetrant. For example, Huntington's disease is fully penetrant after 40 repeats of the CAG trinucleotide (SEQ ID NO: 9) (e.g., repeat length of 120). The methods herein can provide diagnosis, determination of a risk-group, or an individual's status as a carrier if the STR related disease is recessive. Table 1 lists common STR diseases, repeat motif and method of inheritance. In a certain aspect the STR based disease determined is any one or more of Myotonic dystrophy 1, Myotonic dystrophy 2, Dentatorubro-pallidoluysian atrophy, Fragile X-associated tremor/ataxia, Fragile X syndrome, Mental retardation, FRAXE type, Friedreich ataxia, Huntington disease, Huntington disease-like 2, Unverricht-Lundborg Disease, Oculopharyngeal muscular dystrophy, Spinal and bulbar muscular atrophy, Spinocerebellar ataxia 1, Spinocerebellar ataxia 2, Spinocerebellar ataxia 3, Spinocerebellar ataxia 6, Spinocerebellar ataxia 7, Spinocerebellar ataxia 8, Spinocerebellar ataxia 10, Spinocerebellar ataxia 12, Spinocerebellar ataxia 17, Spinocerebellar ataxia 36, Epileptic encephalopathy, early infantile, 1, Blepharophimosis, epicanthus inversus, and ptosis, Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, Mental retardation, X-linked, Amyotrophic lateral sclerosis.
[0092] It should be noted that the systems, software and methods discussed herein can apply to loci not listed in Table 1, and that one would be easily able to add additional loci to the list including loci that are not currently known or described herein. A minimal set of information for a new locus requires the genomic coordinates for the repeats, and the disease risk cutoff (e.g., number of repeats and repeat length) based on clinical studies in order to determine the probability of the disease, assuming a full penetrance model.
[0093] For example, chromosome X:67545318-67545383, as illustrated in Table 1 below, is the locus that is associated with spinal and bulbar muscular atrophy when the number of repeats exceeds the high risk cutoff (i.e., >36 repeats).
Figure imgf000034_0001
Figure imgf000035_0001
Figure imgf000036_0001
Figure imgf000037_0001
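The minimal set of information for a locus described above (genomic coordinates and a disease risk cutoff) can be captured in a small record, sketched here in Python. The class name `StrLocus`, its field names, and the region-string format are hypothetical. The coordinates and the cutoff of 36 repeats are those given above for spinal and bulbar muscular atrophy, whereas the CAG motif and the X-linked recessive label for that locus are assumptions supplied for illustration rather than values quoted from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrLocus:
    """Minimal description of an STR disease locus (hypothetical schema)."""
    chrom: str
    start: int
    end: int
    motif: str
    cutoff_repeats: int
    inheritance: str

    @classmethod
    def from_region(cls, region, motif, cutoff_repeats, inheritance):
        # Parse a region string such as "X:67545318-67545383".
        chrom, span = region.split(":")
        start, end = (int(x) for x in span.split("-"))
        if start > end:
            raise ValueError("region start exceeds region end")
        return cls(chrom, start, end, motif, cutoff_repeats, inheritance)

    def cutoff_length_bp(self):
        # Convert the cutoff from repeat units to base pairs.
        return self.cutoff_repeats * len(self.motif)


# Motif and inheritance label below are assumptions for illustration.
sbma = StrLocus.from_region("X:67545318-67545383", motif="CAG",
                            cutoff_repeats=36, inheritance="x_linked_recessive")
sbma_cutoff_bp = sbma.cutoff_length_bp()
```

The same unit conversion reproduces the Huntington's disease figure quoted above, where 40 repeats of a trinucleotide correspond to a repeat length of 120.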
Digital processing device
[0094] The platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include a digital processing device, or use of the same. The digital processing device includes one or more hardware central processing units (CPUs) or general purpose graphics processing units (GPUs) that carry out the device's functions. The digital processing device further comprises an operating system configured to perform executable instructions. In some cases, the digital processing device is optionally connected to a computer network. In further cases, the digital processing device is optionally connected to the Internet such that it accesses the World Wide Web. In some cases, the digital processing device is optionally connected to a cloud computing infrastructure. The digital processing device may be connected to an intranet and may be connected to a data storage device.
[0095] In accordance with the description herein, suitable digital processing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, Internet appliances, tablet computers, and mobile smartphones. Those of skill in the art will recognize that many smartphones are suitable for use in the system described herein. Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of skill in the art.
[0096] The digital processing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device's hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smart phone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
[0097] The digital processing device includes a storage and/or memory device. The storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some embodiments, the device is volatile memory and requires power to maintain stored information. In some embodiments, the device is non-volatile memory and retains stored information when the digital processing device is not powered. In further embodiments, the non-volatile memory comprises flash memory. In some embodiments, the volatile memory comprises dynamic random-access memory (DRAM). In some embodiments, the non-volatile memory comprises ferroelectric random access memory (FRAM). In some embodiments, the non-volatile memory comprises phase-change random access memory (PRAM). In other embodiments, the device is a storage device including, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing based storage. In further embodiments, the storage and/or memory device is a combination of devices such as those disclosed herein.
[0098] The digital processing device may include a display to send visual information to a user. In some embodiments, the display is a liquid crystal display (LCD). In further embodiments, the display is a thin film transistor liquid crystal display (TFT-LCD). In some embodiments, the display is an organic light emitting diode (OLED) display. In various further embodiments, an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some embodiments, the display is a plasma display. In other embodiments, the display is a video projector. In yet other embodiments, the display is a head-mounted display in communication with the digital processing device, such as a VR headset. In further embodiments, suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like. In still further embodiments, the display is a combination of devices such as those disclosed herein.
[0099] The digital processing device may include an input device to receive information from a user. In some embodiments, the input device is a keyboard. In some embodiments, the input device is a pointing device including, by way of non-limiting examples, a mouse, trackball, track pad, joystick, game controller, or stylus. In some embodiments, the input device is a touch screen or a multi-touch screen. In other embodiments, the input device is a microphone to capture voice or other sound input. In other embodiments, the input device is a video camera or other sensor to capture motion or visual input. In further embodiments, the input device is a Kinect, Leap Motion, or the like. In still further embodiments, the input device is a combination of devices such as those disclosed herein.
[0100] Referring to Fig. 4, in a particular embodiment, an exemplary digital processing device 401 is programmed or otherwise configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parse the reads from the second alignment into at least two informative read groups, the read groups selected from the list consisting of spanning reads, partial reads, paired-end reads, and repeat-only reads; and determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups. The device 401 can regulate various aspects to accurately determine a repeat length of a short tandem repeat (STR) sequence of the present disclosure, such as, for example, extracting reads from an alignment or BAM file, realigning reads in a local alignment, parsing reads into different read categories, and running a probabilistic model to determine STR length. In this embodiment, the digital processing device 401 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 405, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The digital processing device 401 also includes memory or memory location 410 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 415 (e.g., hard disk), communication interface 420 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 425, such as cache, other memory, data storage and/or electronic display adapters.
The memory 410, storage unit 415, interface 420 and peripheral devices 425 are in communication with the CPU 405 through a communication bus (solid lines), such as a motherboard. The storage unit 415 can be a data storage unit (or data repository) for storing data. The digital processing device 401 can be operatively coupled to a computer network ("network") 430 with the aid of the communication interface 420. The network 430 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 430 in some cases is a telecommunication and/or data network. The network 430 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 430, in some cases with the aid of the device 401, can implement a peer-to-peer network, which may enable devices coupled to the device 401 to behave as a client or a server.
[0101] Continuing to refer to Fig. 4, the CPU 405 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 410. The instructions can be directed to the CPU 405, which can subsequently program or otherwise configure the CPU 405 to implement methods of the present disclosure. Examples of operations performed by the CPU 405 can include fetch, decode, execute, and write back. The CPU 405 can be part of a circuit, such as an integrated circuit. One or more other components of the device 401 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
[0102] Continuing to refer to Fig. 4, the storage unit 415 can store files, such as drivers, libraries and saved programs. The storage unit 415 can store user data, e.g., user preferences and user programs. The digital processing device 401 in some cases can include one or more additional data storage units that are external, such as located on a remote server that is in communication through an intranet or the Internet.
[0103] Continuing to refer to Fig. 4, the digital processing device 401 can communicate with one or more remote computer systems through the network 430. For instance, the device 401 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
[0104] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the digital processing device 401, such as, for example, on the memory 410 or electronic storage unit 415. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 405. In some cases, the code can be retrieved from the storage unit 415 and stored on the memory 410 for ready access by the processor 405. In some situations, the electronic storage unit 415 can be precluded, and machine-executable instructions are stored on memory 410.
Non-transitory computer readable storage medium
[0105] The platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In some cases, a computer readable storage medium is a tangible component of a digital processing device. In other cases, a computer readable storage medium is optionally removable from a digital processing device. In some embodiments, a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
Computer program
[0106] The platforms, systems, media, and methods for determining and quantitating STR repeats described herein, in some cases, include at least one computer program, or use of the same. A computer program includes a sequence of instructions, executable in the digital processing device's CPU, written to perform a specified task. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
[0107] The functionality of the computer readable instructions may be combined or distributed as desired in various environments. In some embodiments, a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
Web application
[0108] A computer program may include a web application. In light of the disclosure provided herein, those of skill in the art will recognize that a web application, in various embodiments, utilizes one or more software frameworks and one or more database systems. In some embodiments, a web application is created upon a software framework such as Microsoft® .NET or Ruby on Rails (RoR). In some embodiments, a web application utilizes one or more database systems including, by way of non-limiting examples, relational, non-relational, object oriented, associative, and XML database systems. In further embodiments, suitable relational database systems include, by way of non-limiting examples, Microsoft® SQL Server, mySQL™, and Oracle®. Those of skill in the art will also recognize that a web application, in various embodiments, is written in one or more versions of one or more languages. A web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof. In some embodiments, a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or eXtensible Markup Language (XML). In some embodiments, a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS). In some embodiments, a web application is written to some extent in a client-side scripting language such as Asynchronous Javascript and XML (AJAX), Flash® Actionscript, Javascript, or Silverlight®. In some embodiments, a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion®, Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA®, or Groovy.
In some embodiments, a web application is written to some extent in a database query language such as Structured Query Language (SQL). In some embodiments, a web application integrates enterprise server products such as IBM® Lotus Domino®. In some embodiments, a web application includes a media player element. In various further embodiments, a media player element utilizes one or more of many suitable multimedia technologies including, by way of non-limiting examples, Adobe® Flash®, HTML 5, Apple® QuickTime®, Microsoft® Silverlight®, Java™, and Unity®.
[0109] Referring to Fig. 5, in a particular embodiment, an application provision system comprises one or more databases 500 accessed by a relational database management system (RDBMS) 510. Suitable RDBMSs include Firebird, MySQL, PostgreSQL, SQLite, Oracle Database, Microsoft SQL Server, IBM DB2, IBM Informix, SAP Sybase, Teradata, and the like. In this embodiment, the application provision system further comprises one or more application servers 520 (such as Java servers, .NET servers, PHP servers, and the like) and one or more web servers 530 (such as Apache, IIS, GWS and the like). The web server(s) optionally expose one or more web services via application programming interfaces (APIs) 540. Via a network, such as the Internet, the system provides browser-based and/or mobile native user interfaces.
[0110] Referring to Fig. 6, in a particular embodiment, an application provision system alternatively has a distributed, cloud-based architecture 600 and comprises elastically load balanced, auto-scaling web server resources 610 and application server resources 620 as well as synchronously replicated databases 630.
Mobile application
[0111] The computer program may include a mobile application provided to a mobile digital processing device. In some embodiments, the mobile application is provided to a mobile digital processing device at the time it is manufactured. In other embodiments, the mobile application is provided to a mobile digital processing device via the computer network described herein.
[0112] In view of the disclosure provided herein, a mobile application is created by techniques known to those of skill in the art using hardware, languages, and development environments known to the art. Those of skill in the art will recognize that mobile applications are written in several languages. Suitable programming languages include, by way of non-limiting examples, C, C++, C#, Objective-C, Java™, Javascript, Pascal, Object Pascal, Python™, Ruby, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.
[0113] Suitable mobile application development environments are available from several sources. Commercially available development environments include, by way of non-limiting examples, AirplaySDK, alcheMo, Appcelerator®, Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform. Other development environments are available without cost including, by way of non-limiting examples, Lazarus, MobiFlex, MoSync, and Phonegap. Also, mobile device manufacturers distribute software developer kits including, by way of non-limiting examples, iPhone and iPad (iOS) SDK, Android™ SDK, BlackBerry® SDK, BREW SDK, Palm® OS SDK, Symbian SDK, webOS SDK, and Windows® Mobile SDK.
[0114] Those of skill in the art will recognize that several commercial forums are available for distribution of mobile applications including, by way of non-limiting examples, Apple® App Store, Google® Play, Chrome WebStore, BlackBerry® App World, App Store for Palm devices, App Catalog for webOS, Windows® Marketplace for Mobile, Ovi Store for Nokia® devices, Samsung® Apps, and Nintendo® DSi Shop.
Standalone application
[0115] The computer program may include a standalone application, which is a program that is run as an independent computer process, not an add-on to an existing process, e.g., not a plug-in. Those of skill in the art will recognize that standalone applications are often compiled. A compiler is a computer program(s) that transforms source code written in a programming language into binary object code such as assembly language or machine code. Suitable compiled programming languages include, by way of non-limiting examples, C, C++, Objective-C, COBOL, Delphi, Eiffel, Java™, Lisp, Python™, Visual Basic, and VB .NET, or combinations thereof. Compilation is often performed, at least in part, to create an executable program. In some embodiments, a computer program includes one or more executable compiled applications.
Software modules
[0116] The platforms, systems, media, and methods disclosed herein, in some cases, include software, server, and/or database modules, or use of the same. In view of the disclosure provided herein, software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art. The software modules disclosed herein are implemented in a multitude of ways. In various embodiments, a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof. In various embodiments, the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application. In some embodiments, software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on cloud computing platforms. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
Databases
[0117] The platforms, systems, media, and methods disclosed herein, in some cases, include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for storage and retrieval of nucleic acid read information. In various embodiments, suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase. In some embodiments, a database is internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing-based. In other embodiments, a database is based on one or more local computer storage devices.
[0118] The methods described herein encompass delivering one or more reports detailing STR repeat length, statistics, and/or raw data for any one or more of the loci in Table 1. Reports can be delivered over the internet or through the mail to a health care provider, physician, or consumer. Reports can be delivered by e-mail, a secure network, or downloaded from a secure site. The reports can be hard-copy physical reports or in electronic format.
EXAMPLES
[0119] The following illustrative examples are representative of embodiments of the software applications, systems, and methods described herein and are not meant to be limiting in any way.
[0120] The methods, software, and systems disclosed herein outperform other methods and software. Analyzing the full genome sequences of 12,632 individuals that were sampled to an average read depth of ~30-40x with Illumina HiSeqX allowed identification of a total of 138 individuals with risk alleles at 15 STR disease loci. A representative subset of the samples (n = 19) were validated by Sanger and by Oxford Nanopore sequencing. Importantly, methods, software and systems described herein extend the limit of STR size detection beyond the physical sequence read length. This extension is critical since many of the disease risk cutoffs are close to or beyond the short sequence read length of 100 to 150 bases.
Example 1 - TREDPARSE is accurate on simulated data
[0121] Simulation with synthetic data shows that TREDPARSE out-performs many other callers of short tandem repeats. TREDPARSE was compared with commonly used general-purpose variant callers, including Manta, Isaac, and GATK. Not surprisingly, they perform poorly on the simulated datasets as shown in Figs. 7A-7D. Manta (Fig. 7A), Isaac (Fig. 7B), and GATK Fig. 7C, are unable to call a repeat length that would be useful for clinically determining Huntington's disease, while lobSTR is unable to call any STR diseases above the cutoff for Huntington's disease repeat length (e.g., greater than 120). These variant callers can detect small indels, but in most cases fail to recover the length of long alleles (i.e., large indels). Additionally, the indels could occur at different locations within the repeat tract, it is not sufficient to construct locus-based callers that inspect indels collectively, making direct calling of the repeat size difficult without further post-processing. Based on these comparisons, it was found that most tools tested thus far were not effective at quantifying the number of repeats. [0122] A tool that was specifically designed for STR variant calling, lobSTR performed better than other variant callers at short allele size ranges, up to 40 CAGs (SEQ ID NO: 9), which is close to the risk threshold for HD, but below the risk threshold of 12 other STR diseases (Table 1). It was found that TREDPARSE out-performed lobSTR at longer allele lengths typically above the risk threshold, which were more critical in assessing disease status, in either a haploid setting or a diploid setting. Since the HD risk threshold (40 CAGs=120bp (SEQ ID NO: 9)) is very close to a read length, lobSTR was unable to correctly predict risk alleles, whereas TREDPARSE calls were close to the truth, and identified all long HD alleles as risk alleles.
[0123] Figs. 8A-8D show simulations with synthetic datasets of implanted STR alleles at the Huntington (HD) locus. (A) Performance comparison of TREDPARSE and lobSTR on a simulated haploid with a single allele with h CAGs, where h varies between 1 and 300 (SEQ ID NO: 7); (B) Performance comparison of TREDPARSE and lobSTR on a simulated diploid with two alleles, one allele fixed at 20 CAGs (SEQ ID NO: 8), the other allele with h units of CAGs; (C) Performance of TREDPARSE on a simulated diploid with a low haploid depth of 5x; (D) Performance of TREDPARSE on a simulated diploid with a high haploid depth of 80x. Shaded regions represent the 95% credible interval for TREDPARSE estimates of h. RMSD represents root-mean-square deviation, calculated as shown in the equation following this paragraph. Figs. 8A-8D also disclose "40xCAGs" as SEQ ID NO: 9.
Figure imgf000047_0001
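As a reading aid, the root-mean-square deviation referred to above can be sketched in a few lines of Python. This is a minimal sketch that assumes the standard definition of RMSD (the square root of the mean squared difference between called and true repeat counts); the function name `rmsd` and the example values are illustrative and are not taken from the TREDPARSE software or from the figures.

```python
import math

def rmsd(called, truth):
    """Root-mean-square deviation between called and true repeat counts."""
    if len(called) != len(truth) or not called:
        raise ValueError("inputs must be non-empty and of equal length")
    # Squared difference for each simulated allele.
    squared_errors = [(c - t) ** 2 for c, t in zip(called, truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical calls for three simulated alleles (not data from Figs. 8A-8D).
example = rmsd([40, 98, 205], [40, 100, 200])
```

A lower value indicates that the called repeat counts track the simulated truth more closely across the range of h.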
[0124] The TREDPARSE caller extended the calling of the size of the allele beyond a typical read length. In simulations with both the 'haploid' (ploidy = 1) model shown in Fig. 8A and the 'diploid' (ploidy = 2) model shown in Figs. 8B-8D, TREDPARSE predicted the long allele sizes with lower RMSD (root-mean-square deviation) in the simulated diploid cases. Increasing average sequence depth decreased the CI and RMSD (Fig. 8C, 15.20 RMSD at 5x depth; Fig. 8B, 9.27 RMSD; Fig. 8D, 8.77 RMSD at 80x depth). Additionally, most truth values fell within the 95% credible intervals, as shown in Fig. 8B. For longer allele sizes, the calls did not precisely match the true values, but were nonetheless close. lobSTR performed poorly at all sequencing depths compared to TREDPARSE (RMSD greater than 70 in each of Figs. 8A-8D), especially nearing and above a repeat length of 120. The main source of errors was evidence based on either repeat-only reads or paired-end distances, which have much more variation than full spanning reads.
[0125] Most importantly, TREDPARSE extends the limit of STR size detection well beyond the physical read length. This extension is critical in many cases since several of the disease risk cutoffs are close to or beyond the read length (150 bp for mainstream Illumina sequencers). Based on our simulations, the current detection limit for TREDPARSE is around 500 bp, which is roughly equal to the paired-end distance in Illumina HiSeq sequencing libraries. This detection limit enables detection of risk alleles for most loci listed in Table 1.
Methods
[0126] Read data from individuals that have a Huntington disease (HD) locus of varying lengths was simulated. The simulation was performed using EAGLE (https://github.com/sequencing/EAGLE), which is designed to simulate the behavior of sequencing instruments by introducing various errors that are characteristic of the Illumina sequencing platform. Simulations used 2x150 bp reads with a paired-end distance of 500 bp (with a standard deviation of 50 bp), at varying levels of sequencing depth.
[0127] Given the simulated data, reads were mapped using the BWA aligner, and TREDPARSE and popular variant calling software (e.g., Manta, Isaac, GATK and lobSTR) were then run, comparing the inferred lengths with the true ones. For the HD locus, the pathological threshold (full penetrance allele) is established at 40 CAGs (SEQ ID NO: 9), so there was significant interest in identifying expansions of the STR that are greater than or equal to 40.
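The relationship between repeat units, repeat length in base pairs, and the HD threshold described above can be illustrated as follows. This is a minimal sketch, not part of TREDPARSE: the function names `repeat_length_bp` and `is_at_risk` are hypothetical, and the 150 bp read length is the value used in the simulation settings.

```python
def repeat_length_bp(units, motif="CAG"):
    """Length of the repeat tract in base pairs for a given number of units."""
    return units * len(motif)

def is_at_risk(units, cutoff_units=40):
    """True when the allele meets or exceeds the full-penetrance cutoff."""
    return units >= cutoff_units

READ_LENGTH_BP = 150  # read length used in the simulations

for h in (20, 40, 60):
    tract_bp = repeat_length_bp(h)
    # A tract longer than the read cannot be covered by a single spanning read.
    print(h, tract_bp, is_at_risk(h), tract_bp > READ_LENGTH_BP)
```

At the cutoff of 40 CAGs the tract is 120 bp, which leaves little flanking sequence within a 150 bp read; this is why alleles near the threshold are difficult for callers that rely only on spanning reads.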
[0128] We simulated various combinations of h1 and h2 and studied the joint posterior probability distributions that our model generated. In most simulated cases, particularly when h1 was relatively short, there was little dependence between h1 and h2, so that the joint distribution could well be represented by two marginal distributions over h1 and h2, respectively, as shown in Fig. 9A. However, when both h1 and h2 were longer than the paired-end distance, and since both alleles contributed to the repeat-only reads, which were the only signal left to be identified, our model could not accurately distribute the fixed signal among the two alleles, so there appeared to be a strong negative correlation between h1 and h2. A weaker negative dependency between h1 and h2 could also be seen at lower allele sizes, as shown in Fig. 9B. Of course, cases where both h1 and h2 are expanded pathological alleles are rare.
Example 2 - Influence on STR prediction by different types of reads
[0129] Each of the four types of read evidence available for use has its own range of predictive power across the spectrum of likely STR repeat lengths. Overall, the maximum repeat length that each type of evidence can identify increases from spanning reads to partial reads, paired-end reads, and repeat-only reads, as shown in Fig. 3E. The repeat-only reads often cover the longest range in a typical Illumina sequencing experiment, bounded by the paired-end distance.
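To make the four evidence types concrete, the sketch below labels a single read by its position relative to the repeat tract. It is a simplified illustration under stated assumptions, not the TREDPARSE implementation: coordinates are 0-based half-open intervals on a hypothetical local reference, the function name `classify_read` is invented for this example, and reads that lie entirely in the flank are only noted as candidates for paired-end evidence, which additionally depends on the mate's position.

```python
def classify_read(read_start, read_end, repeat_start, repeat_end):
    """Label a read by how it overlaps the repeat tract [repeat_start, repeat_end)."""
    if read_start < repeat_start and read_end > repeat_end:
        # Flanking sequence on both sides: the read covers the whole tract.
        return "spanning"
    if read_start < repeat_start < read_end <= repeat_end:
        # Left flank plus part (or all) of the tract, no right flank.
        return "partial"
    if repeat_start <= read_start < repeat_end < read_end:
        # Part (or all) of the tract plus right flank, no left flank.
        return "partial"
    if repeat_start <= read_start and read_end <= repeat_end:
        # The read consists entirely of repeat sequence.
        return "repeat-only"
    # No overlap with the tract; may still contribute paired-end evidence.
    return "flanking"

# Hypothetical 60 bp tract at positions 100-160 of a local reference.
labels = [classify_read(s, e, 100, 160)
          for s, e in [(90, 200), (90, 150), (110, 150), (0, 50)]]
```

In this toy layout the four reads are labeled spanning, partial, repeat-only, and flanking, respectively.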
[0130] In an effort to understand the contribution of each type of evidence to a final STR call, simulation experiments were rerun using just a single type of evidence, for example, using only spanning reads and ignoring all other evidence. This experiment permitted isolation of the contribution of each type of evidence. Results of these simulations are shown in Figs. 10A-10D. As expected, the predictive power of spanning reads (Fig. 10A) and partial reads (Fig. 10B) was limited by roughly the read length, while paired-end reads (Fig. 10D) and repeat-only reads (Fig. 10C) were limited by roughly the paired-end distance. Notably, no single type of evidence covered the complete range of STR repeat sizes, so it was important to make use of multiple types of evidence for accurate estimates.
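The rough limits described above can be summarized in a short sketch. It assumes hard thresholds at the read length and paired-end distance used in the simulations (150 bp and 500 bp), whereas the text describes these limits only as approximate; the function name `usable_evidence` is hypothetical.

```python
def usable_evidence(repeat_bp, read_length=150, paired_end_distance=500):
    """Evidence types expected to remain informative for a tract of repeat_bp."""
    evidence = []
    if repeat_bp < read_length:
        # Spanning and partial reads are limited by roughly the read length.
        evidence += ["spanning", "partial"]
    if repeat_bp < paired_end_distance:
        # Paired-end and repeat-only reads are limited by roughly the
        # paired-end distance.
        evidence += ["paired-end", "repeat-only"]
    return evidence

for tract_bp in (120, 300, 600):
    print(tract_bp, usable_evidence(tract_bp))
```

Under these assumptions a 120 bp tract is supported by all four evidence types, a 300 bp tract only by paired-end and repeat-only reads, and a 600 bp tract by none, which mirrors the reported detection limit of around 500 bp.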
Example 3 - Determination of STR length in 12,632 genomes
[0131] TREDPARSE was run on sequence data from 12,632 individuals and identified a total of 138 individuals with risk alleles at 15 disease loci (Fig. 5), as well as 54 individuals inferred to be 'carriers' who are capable of passing a recessive risk allele on to their offspring. Specifically, 15 DM1, 2 FXTAS, 5 HD, 8 OPMD, 1 SBMA, 26 SCA1, 4 SCA2, 2 SCA6, 3 SCA8, 52 SCA17, 1 BPES, 5 CCD, 11 CCHS, 2 HFG and 1 SD5 at-risk individuals were inferred (Table 1).
[0132] To understand the strength of correlation of each signal to the inferred allele size, the evidence available after analysis of WGS data from these individuals, who were sequenced at ~30-40x with Illumina instruments, was measured. The evidence per sample is correlated with the sequencing coverage of each individual and the length of the inferred longer STR allele. All correlations increased linearly with the sample mean coverage. However, the increase was more pronounced for paired-end reads than for partial reads and spanning reads. As expected, long STR repeat alleles have fewer full spanning reads and more partial reads. The amount of paired-end evidence was largely unaffected by the repeat length. These observations on read depth vs. allele size support our probabilistic model.
Example 4 - Validation of TREDPARSE with Sanger and Oxford Nanopore sequencing
[0133] A subset (n = 19) of the 138 individuals that were reported by TREDPARSE to contain a risk allele was selected for confirmation by an orthogonal method, as summarized in Fig. 11. The cases for which sufficient DNA was confirmed to be available were subjected to CLIA Sanger sequencing (Table 2). Out of 19 cases, 11 had identical lengths for Sanger and TREDPARSE, 4 did not match exactly but were called "at risk" by both Sanger and TREDPARSE, and 4 were discordant (an example is given in Figs. 12A-12C). In all 4 discordant cases, Sanger identified only the shorter allele, leaving an inference that these cases only contain shorter allele(s).
[0134] In particular, Figs. 12A-12C show an example of validation of TREDPARSE calls using Sanger and Oxford Nanopore sequencing. In this example, there is disagreement between (A) TREDPARSE (two alleles 17 and 84); and (B) Sanger sequencing which identified only the allele with size 17 (Fig. 12B discloses SEQ ID NOS 14-15, respectively, in order of appearance). (C) Oxford Nanopore sequencing confirms the longer allele, showing two peaks of allele sizes that both match the prediction of TREDPARSE. Sample mean coverage of the input BAM is 33 X .
[0135] To resolve the discrepancy between TREDPARSE predictions and Sanger validation results, Oxford Nanopore sequencing (ONP) was run on samples that failed Sanger validation (Fig. 10C). Oxford Nanopore sequencing yielded an approximation of the repeat size, but nonetheless qualitatively validated the existence of long alleles for validation purposes (Table 2). Overall, TREDPARSE was validated in all 19 cases. In contrast, only 4 of the 19 validated cases were called with lobSTR, with long alleles missing from all inferences (Table 2).
Figure imgf000050_0001
Figure imgf000051_0001
Figure imgf000052_0001
[0136] Due to a lack of outlier samples for some TREDs, not all TREDs were considered fully validated. TREDs that are considered reliable, with at least 1 validated sample, included HD, DM1, SBMA, SCA1, SCA2, SCA8, SCA17, and FXTAS. These are the most confident STR loci, since at-risk individuals were observed in our HLI samples, were experimentally validated, and had support from in silico simulation. There were a total of 8 TRED diseases in our list for which we have observed risk alleles but have not obtained experimental validation because of a lack of DNA material. Nonetheless, these loci have good simulation support and concordant calls within families. These loci included OPMD, SCA6, BPES, CCD, CCHS, HFG, FRDA and SD5. Additionally, there were a total of 15 diseases in our list for which we did not identify any at-risk individuals. Finally, for TRED disease loci FXS, FRAXE, SCA10 and SCA36, the risk allele exceeded the 500 bp detection limit of the software.
[0137] Even though the inference of STR alleles is independent across individuals when using TREDPARSE, we confirmed that the pathological alleles were consistently called within pedigrees. Among the 12,632 individuals, we have a total of 2,257 families with at least two members sequenced, as well as 6,527 single individuals with no related family members. In total there are 8,784 families plus single individuals that are unrelated to one another and could be viewed as independent. Some families contain more than one individual inferred by TREDPARSE to be "at risk" at a given locus (Table 1).
[0138] Described here are three pedigrees. The first family had a father-to-daughter transmission of a risk allele for the Huntington locus, which has 41 CAG repeats (SEQ ID NO: 11) in the father and 40 repeats (SEQ ID NO: 9) in the daughter, as shown in Fig. 13A (Fig. 13A discloses SEQ ID NO: 9). These alleles have been experimentally validated through Sanger sequencing (Table 2). The second family showed a putative DM1 risk allele transmitted from the mother to both children while the father was unaffected, as shown in Fig. 13B (Fig. 13B discloses SEQ ID NO: 16). Although the exact size estimates for the putative risk allele were different due to uncertainties associated with repeat alleles exceeding the read length, the 95% CIs of these estimates were overlapping. The third family showed the putative SCA17 risk allele transmitted from the father to both children while the mother was unaffected, as shown in Fig. 13C (Fig. 13C discloses SEQ ID NO: 17). None of the "at risk" individuals in these families had reported phenotypes associated with symptoms of HD, DM1, or SCA17.
[0139] Population-scale analyses enable better estimates of STR mutation rates and allele frequencies. We use both alleles in the computation of the allele frequencies for diploid loci. Allele frequencies can display either a single peak or multiple peaks, reflecting population structure within the human population. Although inferred to harbor abnormally long STR alleles that would be "at risk" according to current understanding, most of the individuals that we identified with risk alleles are asymptomatic.
[0140] There are two possible explanations for the lack of disease symptoms in the study population. First, in order to show the disease phenotype for the samples that are determined to be "at risk", the disease needs to have a high penetrance. For example, the Huntington's disease mutation is genetically dominant and thought to be fully penetrant with one allele with 40 or more CAG repeats 35. Even so, it might be worthwhile to look for cases of reduced penetrance due to protective alleles somewhere else in the genome among these asymptomatic individuals, so called "resilience". Second, it may also be due to the late onset of the disease, i.e., these individuals have not reached the age of onset.
[0141] We have observed an inflation of STR disease prevalence in our samples when compared to the known prevalence estimates based on literature review. For example, Huntington's disease was previously estimated to have a population frequency of 6.5-15 per 100,000 individuals in the United States. The inferred prevalence of Huntington's disease of 5 out of 12,632 was higher than previous estimates. After correcting for relatedness among families (i.e., the family in Fig. 13A), we observed a frequency of 4 of 8,784 independent families plus single individuals (Table 1). This implies an inflation of 3x compared to the known prevalence. Overall, among the STR diseases that have a reported prevalence based on prior studies, DM1, HD, SBMA, and SCA6 have prevalence estimates in the present study that are similar to those reported. However, our estimated prevalence for SCA1, SCA17 and CCD is orders of magnitude higher than the known prevalence for these diseases.
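The arithmetic behind the stated inflation can be checked with a short sketch. The counts (4 of 8,784 independent units) and the literature range (6.5-15 per 100,000) come from the paragraph above; the function name `per_100k` and the choice to compare against the upper end of the literature range are assumptions made for illustration.

```python
def per_100k(cases, population):
    """Prevalence expressed as cases per 100,000 individuals."""
    return cases / population * 100_000

# Inferred HD prevalence after correcting for relatedness.
observed = per_100k(4, 8784)

# Previously estimated range for the United States, per 100,000.
literature_low, literature_high = 6.5, 15.0

# Fold inflation relative to the upper end of the literature range.
fold_vs_upper = observed / literature_high
print(round(observed, 1), round(fold_vs_upper, 1))
```

The observed frequency is roughly 45.5 per 100,000, which is about three times the upper literature estimate of 15 per 100,000; the inflation would be larger relative to the lower estimate.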
[0142] The number of inferred at-risk individuals is most heavily influenced by the exact size cutoff that was chosen for the full penetrance allele. For some diseases, there have been conflicting estimates in the literature regarding both the size cutoff for full penetrance alleles and the prevalence in the human population. The inconsistencies are partially due to the fact that penetrance and prevalence of many STR diseases are highly variable among different ethnicities and geographic locations, due to potential founder effects. Since TREDPARSE generates a full joint posterior density during probabilistic inference that is completely independent of the chosen size cutoff, our inference could be revised accordingly if different cutoffs were used.
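Because the posterior does not depend on the cutoff, a risk call can be recomputed for any threshold by summing posterior mass. The sketch below illustrates this idea under stated assumptions: the joint posterior is represented as a hypothetical dictionary keyed by (h1, h2) repeat counts, the disease is treated as dominant (the longer allele decides risk), and the function name `prob_at_risk` and the probability values are invented for the example rather than taken from TREDPARSE.

```python
def prob_at_risk(joint_posterior, cutoff):
    """Posterior probability that the longer allele meets or exceeds the cutoff."""
    total = sum(joint_posterior.values())
    # Sum the mass of every (h1, h2) pair whose longer allele reaches the cutoff.
    mass = sum(p for (h1, h2), p in joint_posterior.items()
               if max(h1, h2) >= cutoff)
    return mass / total

# Hypothetical joint posterior over (h1, h2) for one individual.
posterior = {(20, 38): 0.1, (20, 40): 0.5, (20, 42): 0.4}

# The same posterior can be re-evaluated under different size cutoffs.
for cutoff in (36, 40, 41):
    print(cutoff, prob_at_risk(posterior, cutoff))
```

Only the final summation changes when a different cutoff is adopted; the inference that produced the posterior does not need to be rerun.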
Example 5 - Validation with GeT-RM cell line reference materials
[0143] The Genetic Testing Reference Materials Coordination Program (GeT-RM) has characterized reference materials for quality control, test development and validation. GeT-RM provides cell lines or DNA that can be used as reference materials for genotyping inherited diseases, including Myotonic Dystrophy, Fragile X syndrome, and Huntington disease. We sequenced 6 cell lines obtained from GeT-RM, including 2 DM1, 2 FXTAS/FXS and 2 HD cases with known true allele sizes confirmed by several different labs.
[0144] TREDPARSE was able to predict risk alleles for 5 out of the 6 cell lines. Sample NA20236, which is known to have allele sizes of 31/53 at the FXTAS locus, was missed by TREDPARSE; sample NA05164, which is known to have allele sizes of 21/340 at the DM1 locus, had the size of the long allele under-predicted by TREDPARSE. The predictions on the four other cell lines exactly or closely match the known truth (Table 3). In contrast, lobSTR failed to predict long alleles in all cases, and failed to generate any predictions for the two FXTAS cases. Recently, a new tool for sizing STRs, ExpansionHunter, was described (Genome Res. 27: 1895-1903). ExpansionHunter predictions were close to the truth on the HD cases, as shown in Figs. 14A-14D, but failed on both DM1 cases as well as the FXTAS cases, where TREDPARSE yielded predictions much closer to the truth, as shown in Table 3.
Figure imgf000054_0001
Figure imgf000055_0001
[0145] In describing the various embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.
Figure imgf000056_0001
Recitation of Embodiments
[0146] Embodiment 1. A method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: extracting nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0147] Embodiment 2. The method of Embodiment 1, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0148] Embodiment 3. The method of Embodiments 1 or 2, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0149] Embodiment 4. The method of any one of Embodiments 1 to 3, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
[0150] Embodiment 5. The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
[0151] Embodiment 6. The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
[0152] Embodiment 7. The method of any one of Embodiments 1 to 3, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
[0153] Embodiment 8. The method of any one of Embodiments 1 to 7, wherein the second alignment uses a Smith-Waterman algorithm.
[0154] Embodiment 9. The method of any one of Embodiments 1 to 7, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0155] Embodiment 10. The method of any one of Embodiments 1 to 9, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0156] Embodiment 11. The method of any one of Embodiments 1 to 10, wherein an STR read length greater than 120 base pairs is accurately quantitated.
[0157] Embodiment 12. The method of any one of Embodiments 1 to 11, further comprising determining a ploidy for an X chromosome from the extracted reads.
[0158] Embodiment 13. The method of any one of Embodiments 1 to 12, further comprising delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
[0159] Embodiment 14. The method of Embodiment 13, wherein the report is in electronic format.
[0160] Embodiment 15. A computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: a software module configured to extract nucleic acid sequence reads from a first alignment to a genome, wherein the extracted reads are reads mapped within a read length of the STR locus and/or reads that are not mapped to the STR locus but have a mate-pair mapped within a distance of about 2 kb from the repeat location; a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0161] Embodiment 16. The computer-implemented system of Embodiment 15, wherein the reads are parsed into at least three informative read groups, the at least three informative read groups comprising paired-end reads and a second and a third read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads.
[0162] Embodiment 17. The computer-implemented system of Embodiments 15 or 16, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0163] Embodiment 18. The computer-implemented system of any one of Embodiments 15 to 17, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
[0164] Embodiment 19. The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
[0165] Embodiment 20. The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
[0166] Embodiment 21. The computer-implemented system of any one of Embodiments 15 to 18, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
[0167] Embodiment 22. The computer-implemented system of any one of Embodiments 15 to 21, wherein the second alignment uses a Smith-Waterman algorithm.
[0168] Embodiment 23. The computer-implemented system of any one of Embodiments 15 to 22, wherein the probabilistic model estimates a maximum likelihood of a repeat length of an STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0169] Embodiment 24. The computer-implemented system of any one of Embodiments 15 to 23, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0170] Embodiment 25. The computer-implemented system of any one of Embodiments 15 to 24, wherein an STR read length greater than 120 base pairs is accurately quantitated.
[0171] Embodiment 26. The computer-implemented system of any one of Embodiments 15 to 25, further comprising a software module configured to determine a ploidy for the X chromosome from the extracted reads.
[0172] Embodiment 27. The computer-implemented system of any one of Embodiments 15 to 26, further comprising a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
[0173] Embodiment 28. The computer-implemented system of Embodiment 27, wherein the report is in electronic format.
[0174] Embodiment 29. A method of accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: creating a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; parsing the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and determining the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0175] Embodiment 30. The method of Embodiment 29, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0176] Embodiment 31. The method of Embodiments 29 or 30, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0177] Embodiment 32. The method of any one of Embodiments 29 to 31, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
[0178] Embodiment 33. The method of any one of Embodiments 29 to 32, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
[0179] Embodiment 34. The method of any one of Embodiments 29 to 33, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
[0180] Embodiment 35. The method of any one of Embodiments 29 to 34, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
[0181] Embodiment 36. The method of any one of Embodiments 29 to 35, wherein the second alignment uses a Smith-Waterman algorithm.
[0182] Embodiment 37. The method of any one of Embodiments 29 to 36, wherein the probabilistic model estimates a maximum likelihood of a repeat length of an STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0183] Embodiment 38. The method of any one of Embodiments 29 to 37, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0184] Embodiment 39. The method of any one of Embodiments 29 to 38, wherein an STR read length greater than 120 base pairs is accurately quantitated.
[0185] Embodiment 40. The method of any one of Embodiments 29 to 39, further comprising determining a ploidy for the X chromosome from the extracted reads.
[0186] Embodiment 41. The method of any one of Embodiments 29 to 40, further comprising delivering a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
[0187] Embodiment 42. The method of Embodiment 41, wherein the report is in electronic format.
[0188] Embodiment 43. A computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application to accurately determine a repeat length of a short tandem repeat (STR) sequence at a given STR locus from extracted nucleic acid sequence reads from a first alignment to a genome comprising: a software module configured to create a second alignment utilizing the nucleic acid sequence reads from the first alignment, wherein the second alignment is a local alignment to the STR locus; a software module configured to parse the reads from the second alignment into at least two informative read groups, a first read group comprising paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; and a software module configured to determine the repeat length of an STR sequence by applying a probabilistic model to the at least two informative read groups.
[0189] Embodiment 44. The computer-implemented system of Embodiment 43, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0190] Embodiment 45. The computer-implemented system of Embodiments 43 or 44, wherein the reads are parsed into at least four informative read groups, the read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0191] Embodiment 46. The computer-implemented system of any one of Embodiments 43 to 45, wherein the nucleic acid sequence reads from the first alignment comprise paired-end reads.
[0192] Embodiment 47. The computer-implemented system of any one of Embodiments 43 to 46, wherein the first alignment to the genome is at an average sequence depth of 5 or greater.
[0193] Embodiment 48. The computer-implemented system of any one of Embodiments 43 to 47, wherein the first alignment to the genome is at an average sequence depth of 10 or greater.
[0194] Embodiment 49. The computer-implemented system of any one of Embodiments 43 to 48, wherein the first alignment to the genome is at an average sequence depth of 20 or greater.
[0195] Embodiment 50. The computer-implemented system of any one of Embodiments 43 to 49, wherein the second alignment uses a Smith-Waterman algorithm.
[0196] Embodiment 51. The computer-implemented system of any one of Embodiments 43 to 50, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0197] Embodiment 52. The computer-implemented system of any one of Embodiments 43 to 51, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0198] Embodiment 53. The computer-implemented system of any one of Embodiments 43 to 52, wherein an STR read length greater than 120 base pairs is accurately quantitated.
[0199] Embodiment 54. The computer-implemented system of any one of Embodiments 43 to 53, further comprising a software module configured to determine a ploidy for the X chromosome from the extracted reads.
[0200] Embodiment 55. The computer-implemented system of any one of Embodiments 43 to 54, further comprising a software module configured to deliver a report to a consumer or health care provider, wherein the report comprises information on the repeat length of the STR sequence at the given STR locus.
[0201] Embodiment 56. The computer-implemented system of Embodiment 55, wherein the report is in electronic format.
[0202] Embodiment 57. A method of determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location; aligning the extracted nucleic acid sequence reads; parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
[0203] Embodiment 58. The method of Embodiment 57, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0204] Embodiment 59. The method of Embodiments 57 and 58, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0205] Embodiment 60. The method of any one of Embodiments 57 to 59, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0206] Embodiment 61. The method of any one of Embodiments 57 to 60, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0207] Embodiment 62. The method of any one of Embodiments 57 to 61, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
[0208] Embodiment 63. The method of any one of Embodiments 57 to 62, further comprising determining a ploidy for an X chromosome from the extracted reads.
[0209] Embodiment 64. A non-transitory computer-readable medium in which a program is stored for causing a computer to perform a method for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus, the method comprising extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location; aligning the extracted nucleic acid sequence reads; parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads; determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
[0210] Embodiment 65. The method of Embodiment 64, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0211] Embodiment 66. The method of Embodiments 64 and 65, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0212] Embodiment 67. The method of any one of Embodiments 64 to 66, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0213] Embodiment 68. The method of any one of Embodiments 64 to 67, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0214] Embodiment 69. The method of any one of Embodiments 64 to 68, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
[0215] Embodiment 70. The method of any one of Embodiments 64 to 69, further comprising determining a ploidy for an X chromosome from the extracted reads.
[0216] Embodiment 71. A system for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising: a sequencing unit configured to generate nucleic acid sequence reads; an alignment engine configured to extract the nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and align the extracted nucleic acid sequence reads; and a diagnosing unit comprising a repeat length determination engine configured to receive aligned reads and (1) parse the reads into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads, and (2) determine the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and a risk assessment engine configured to determine a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
[0217] Embodiment 72. The system of Embodiment 71, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
[0218] Embodiment 73. The system of Embodiments 71 and 72, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
[0219] Embodiment 74. The system of any one of Embodiments 71 to 73, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
[0220] Embodiment 75. The system of any one of Embodiments 71 to 74, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
[0221] Embodiment 76. The system of any one of Embodiments 71 to 75, further comprising determining a ploidy for an X chromosome from the extracted reads.
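The extraction step recited in Embodiments 57, 64, and 71 can be illustrated with a minimal Python sketch. It assumes one reading of "mapped within a read length of the STR locus" (the read overlaps the locus padded by one read length on each side) and measures the mate's distance from its leftmost mapped position. The `Read` record, the function names, the `chr4` naming style, and the 150 bp default read length are hypothetical choices made for illustration and do not come from the specification; only the roughly 2 kb mate window is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal alignment record (1-based, inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    mate_chrom: Optional[str] = None
    mate_start: Optional[int] = None


def distance_to_interval(pos: int, lo: int, hi: int) -> int:
    """Distance from a position to the closed interval [lo, hi]; 0 if inside."""
    if pos < lo:
        return lo - pos
    if pos > hi:
        return pos - hi
    return 0


def is_extracted(read: Read, chrom: str, str_start: int, str_end: int,
                 read_length: int = 150, mate_window: int = 2000) -> bool:
    """Keep a read if it maps within one read length of the STR locus,
    and/or if its mate maps within about 2 kb of the repeat location."""
    near_locus = (
        read.chrom == chrom
        and read.end >= str_start - read_length
        and read.start <= str_end + read_length
    )
    mate_near = (
        read.mate_chrom == chrom
        and read.mate_start is not None
        and distance_to_interval(read.mate_start, str_start, str_end) <= mate_window
    )
    return near_locus or mate_near


# Example locus from the list above: chromosome 4:3074877-3074933.
LOCUS = ("chr4", 3074877, 3074933)
overlapping = Read("chr4", 3074800, 3074949)
mate_rescued = Read("chr4", 3080000, 3080149, "chr4", 3075500)
far_away = Read("chr4", 3090000, 3090149, "chr4", 3089700)
```

In this sketch a read that itself lies far from the locus is still retained when its mate falls inside the 2 kb window, which is what lets such reads contribute to the later grouping step.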

Claims

WHAT IS CLAIMED IS:
1. A method of determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising:
a) extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location;
b) aligning the extracted nucleic acid sequence reads;
c) parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads;
d) determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and
e) determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
2. The method of claim 1, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
3. The method of claim 1, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
4. The method of claim 1, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
5. The method of claim 1, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
6. The method of claim 1, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
7. The method of claim 1, further comprising determining a ploidy for an X chromosome from the extracted reads.
8. A non-transitory computer-readable medium in which a program is stored for causing a computer to perform a method for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus, the method comprising
a) extracting nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location;
b) aligning the extracted nucleic acid sequence reads;
c) parsing the reads from the alignment into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads;
d) determining the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and
e) determining a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
9. The method of claim 8, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
10. The method of claim 8, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
11. The method of claim 8, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
12. The method of claim 8, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
13. The method of claim 8, wherein the disease or disorder is selected from the group consisting of myotonic dystrophy, Dentatorubro-pallidoluysian atrophy, Fragile X syndrome, Mental retardation, Friedreich ataxia, Huntington disease, prostate cancer, Unverricht-Lundborg Disease, muscular dystrophy, Spinocerebellar ataxia, Epileptic encephalopathy, Blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), Cleidocranial dysplasia, Central hypoventilation syndrome, Hand-foot-uterus syndrome, Holoprosencephaly-5, Syndactyly, and Amyotrophic lateral sclerosis.
14. The method of claim 8, further comprising determining a ploidy for an X chromosome from the extracted reads.
15. A system for determining that a subject is at risk for having a disease or disorder by accurately determining a repeat length of a short tandem repeat (STR) sequence at a given STR locus comprising:
a) a sequencing unit configured to generate nucleic acid sequence reads;
b) an alignment engine configured to extract the nucleic acid sequence reads mapped within a read length of the STR locus and/or reads that have a mate-pair mapped within a distance of about 2 kb from the repeat location, and align the extracted nucleic acid sequence reads; and
c) a diagnosing unit comprising
a. a repeat length determination engine configured to receive aligned reads and (1) parse the reads into at least two informative read groups, wherein a first read group comprises paired-end reads and a second read group selected from the list consisting of spanning reads, partial reads, and repeat-only reads, and (2) determine the repeat length of the STR sequence by applying a probabilistic model to the at least two informative read groups; and
b. a risk assessment engine configured to determine a risk probability of having a specific disease or disorder associated with the given STR locus when the predicted repeat length falls beyond a predetermined risk threshold for the given STR locus.
16. The system of claim 15, wherein the reads are parsed into at least three informative read groups, the first read group comprising paired-end reads, the second group selected from the group consisting of spanning reads, partial reads, and repeat-only reads, and a third read group selected from the group consisting of spanning reads, partial reads, and repeat-only reads.
17. The system of claim 15, wherein the reads are parsed into at least four informative read groups, the at least four read groups comprising spanning reads, partial reads, paired-end reads, and repeat-only reads.
18. The system of claim 15, wherein the probabilistic model estimates a maximum likelihood of the repeat length of the STR sequence for two alleles if the given STR locus is located on any one of chromosomes 1 to 22 and/or the X chromosome, if the nucleic acid sequence reads were derived from an individual with more than 1 X chromosome.
19. The system of claim 15, wherein the STR locus is selected from the group consisting of chromosome 19:45770205-45770264; chromosome 3:129172577-129172656; chromosome 12:6936729-6936773; chromosome X:147912051-147912110; chromosome X:148500638-148500682; chromosome 9:69037287-69037304; chromosome 4:3074877-3074933; chromosome 16:87604288-87604329; chromosome 21:43776444-43776479; chromosome 14:23321473-23321502; chromosome X:67545318-67545383; chromosome 6:16327636-16327722; chromosome 12:111598951-111599019; chromosome 14:92071011-92071034; chromosome 19:13207859-13207897; chromosome 3:63912686-63912715; chromosome 13:70139384-70139428; chromosome 22:45795355-45795424; chromosome 5:146878729-146878758; chromosome 6:170561908-170562021; chromosome 20:2652734-2652757; chromosome X:25013662-25013691; chromosome 3:138946021-138946062; chromosome 6:45422751-45422801; chromosome 4:41745972-41746031; chromosome 7:27199925-27199966; chromosome 13:99985449-99985493; chromosome 2:176093059-176093103; chromosome X:140504317-140504361; and chromosome 9:27573529-27573546.
20. The system of claim 15, further comprising determining a ploidy for an X chromosome from the extracted reads.
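Claims 4, 11, and 18 recite a maximum likelihood estimate over two alleles for autosomal loci and for X-linked loci when the individual has more than one X chromosome. The Python sketch below illustrates that idea with a deliberately simplified toy model, which is not the probabilistic model of the specification: it uses only per-read repeat counts (as spanning reads might provide), assumes each read is drawn from either allele with equal probability, and assumes a hypothetical error model in which a read reports the true count with probability 0.9 and is off by one unit otherwise. Modeling a single allele in all remaining cases is also an assumption, since the claims do not state it. All function names and parameters are illustrative.

```python
import math
from itertools import combinations_with_replacement
from typing import Sequence, Tuple


def alleles_to_model(chrom: str, x_copies: int) -> int:
    """Two alleles for chromosomes 1 to 22, or for X when more than one X
    chromosome is present; one allele otherwise (an assumption)."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit() and 1 <= int(name) <= 22:
        return 2
    if name == "X" and x_copies > 1:
        return 2
    return 1


def read_likelihood(observed: int, allele: int, error: float = 0.1) -> float:
    """Toy error model (hypothetical): exact count with probability 1 - error,
    off by one unit with probability error / 2 per direction, and a tiny
    floor for anything else so that the logarithm stays finite."""
    diff = abs(observed - allele)
    if diff == 0:
        return 1.0 - error
    if diff == 1:
        return error / 2.0
    return 1e-9


def max_likelihood_genotype(observed_counts: Sequence[int], n_alleles: int,
                            max_units: int = 60) -> Tuple[int, ...]:
    """Grid search over unordered genotypes for the highest log-likelihood."""
    best: Tuple[int, ...] = ()
    best_ll = -math.inf
    for genotype in combinations_with_replacement(range(1, max_units + 1), n_alleles):
        ll = 0.0
        for obs in observed_counts:
            # Each read is assumed to come from any allele with equal probability.
            p = sum(read_likelihood(obs, a) for a in genotype) / n_alleles
            ll += math.log(p)
        if ll > best_ll:
            best, best_ll = genotype, ll
    return best


n = alleles_to_model("chr4", x_copies=1)
genotype = max_likelihood_genotype([10, 10, 10, 20, 20, 21], n)
```

With these made-up counts the search settles on one allele of 10 units and one of 20 units; the single read reporting 21 is absorbed by the off-by-one error term rather than being called as a separate allele.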

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762539896P 2017-08-01 2017-08-01
US62/539,896 2017-08-01

Publications (2)

Publication Number Publication Date
WO2019028189A2 true WO2019028189A2 (en) 2019-02-07
WO2019028189A3 WO2019028189A3 (en) 2019-02-28

Family

ID=65234183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/044889 WO2019028189A2 (en) 2017-08-01 2018-08-01 Determination of str length by short read sequencing

Country Status (1)

Country Link
WO (1) WO2019028189A2 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013123330A1 (en) * 2012-02-15 2013-08-22 Battelle Memorial Institute Methods and compositions for identifying repeating sequences in nucleic acids
US20140163900A1 (en) * 2012-06-02 2014-06-12 Whitehead Institute For Biomedical Research Analyzing short tandem repeats from high throughput sequencing data for genetic applications
US20150337388A1 (en) * 2012-12-17 2015-11-26 Virginia Tech Intellectual Properties, Inc. Methods and compositions for identifying global microsatellite instability and for characterizing informative microsatellite loci
CN107077537B (en) * 2014-09-12 2021-06-22 伊鲁米纳剑桥有限公司 Detection of repeat amplification with short read sequencing data
JP6762932B2 (en) * 2014-10-10 2020-09-30 インヴァイティ コーポレイションInvitae Corporation Methods, systems, and processes for de novo assembly of sequencing leads

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112195228A (en) * 2020-09-28 2021-01-08 苏州阅微基因技术有限公司 X-STR fluorescent amplification system, kit and application
CN112195228B (en) * 2020-09-28 2022-02-22 苏州阅微基因技术有限公司 X-STR fluorescent amplification system, kit and application

Also Published As

Publication number Publication date
WO2019028189A3 (en) 2019-02-28

Similar Documents

Publication Publication Date Title
Wenger et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
Chen et al. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
Suwinski et al. Advancing personalized medicine through the application of whole exome sequencing and big data analytics
Bakhtiari et al. Variable number tandem repeats mediate the expression of proximal genes
Tang et al. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes
Li Toward better understanding of artifacts in variant calling from high-coverage samples
Wang et al. Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions
Kelley et al. Quake: quality-aware detection and correction of sequencing errors
KR102385062B1 (en) Methods and processes for non-invasive assessment of genetic variations
Corbett-Detig et al. Natural selection constrains neutral diversity across a wide range of species
Highnam et al. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles
Zhao et al. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives
Derrien et al. Fast computation and applications of genome mappability
Van Leeuwen et al. Population-specific genotype imputations using minimac or IMPUTE2
Mezlini et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data
Li et al. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph
Seifuddin et al. lncRNAKB, a knowledgebase of tissue-specific functional annotation and trait association of long noncoding RNA
US9773091B2 (en) Systems and methods for genomic annotation and distributed variant interpretation
Chen et al. Exact algorithms for haplotype assembly from whole-genome sequence data
Allhoff et al. Discovering motifs that induce sequencing errors
Dalca et al. Genome variation discovery with high-throughput sequencing data
Szatkiewicz et al. Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation
Pajuste et al. FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads
Watanabe et al. Analysis of whole Y-chromosome sequences reveals the Japanese population history in the Jomon period
Li et al. Single nucleotide polymorphism (SNP) detection and genotype calling from massively parallel sequencing (MPS) data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18840987

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18840987

Country of ref document: EP

Kind code of ref document: A2