EP2971114A2 - Methods and compositions for evaluating genetic markers - Google Patents

Methods and compositions for evaluating genetic markers

Info

Publication number
EP2971114A2
Authority
EP
European Patent Office
Prior art keywords
nucleic acid
sequence
target
capture
probes
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP14762322.7A
Other languages
German (de)
English (en)
Other versions
EP2971114A4 (fr
Inventor
Gregory Porreca
Mark UMBARGER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Good Start Genetics Inc
Original Assignee
Good Start Genetics Inc
Priority claimed from US13/934,093 (US20130337447A1)
Application filed by Good Start Genetics Inc
Publication of EP2971114A2
Publication of EP2971114A4

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Definitions

  • the invention relates to methods and compositions for determining genotypes in patient samples.
  • Medical advice is increasingly personalized, with individual decisions and recommendations being based on specific genetic information. Information about the type and number of alleles at one or more genetic loci impacts disease risk, prognosis, therapeutic options, and genetic counseling amongst other healthcare considerations.
  • aspects of the invention relate to preparative and analytical methods and compositions for evaluating genotypes, and in particular, for determining the allelic identity (or identities in a diploid organism) of one or more genetic loci in a subject.
  • aspects of the invention are based, in part, on the identification of different sources of ambiguity and error in genetic analyses, and, in part, on the identification of one or more approaches to avoid, reduce, recognize, and/or resolve these errors and ambiguities at different stages in a genetic analysis.
  • certain types of genetic information can be under-represented or over-represented in a genetic analysis due to a combination of stochastic variation and systematic bias in any of the preparative stages (e.g., capture, amplification, etc.), determining stages (e.g., allele-specific detection, sequencing, etc.), data interpretation stages (e.g., determining whether the assay information is sufficient to identify a subject as homozygous or heterozygous), and/or other stages.
  • error or ambiguity may be apparent in a genetic analysis, but not readily resolved without running additional samples or more expensive assays (e.g., array-based assays may report no-calls due to noisy/low signal).
  • error or ambiguity may not be accounted for in a genetic analysis and incorrect base calls may be made even when the evidence for them is limited and/or not statistically significant (e.g., next-generation sequencing technologies may report base calls even if the evidence for them is not statistically significant).
  • error or ambiguity may be problematic for a multi-step genetic analysis because it is apparent but not readily resolved in one or more steps of the analysis and not apparent or accounted for in other steps of the analysis.
  • sources of error and ambiguity in one or more steps can be addressed by capturing and/or interrogating each target locus of interest with one or more sets of overlapping probes that are designed to overcome any systematic bias or stochastic effects that may impact the complexity and/or fidelity of the genetic information that is generated.
  • sources of error and ambiguity in one or more steps can be addressed by capturing and/or interrogating each target locus of interest with at least one set of probes, wherein different probes are labeled with different identifiers that can be used to track the assay reactions and determine whether certain types of genetic information are under-represented or over-represented in the information that is generated.
  • errors and ambiguities associated with the analysis of regions containing large numbers of sequence repeats are addressed by systematically analyzing frequencies of certain nucleic acids at particular stages in an assay (e.g., at a capture, sequencing, or detection stage). It should be appreciated that such techniques may be particularly useful in the context of a standardized protocol that is designed to allow many different loci to be evaluated in parallel without requiring different assay procedures for each locus.
  • the use of a single detection modality (e.g., sequencing) to assay multiple types of genetic lesions (e.g., point mutations, insertions/deletions, length polymorphisms).
  • aspects of the invention provide methods for overcoming preparative and/or analytical bias by combining two or more techniques, each having a different bias (e.g., a known bias towards under-representation or over-representation of one or more types of sequences), and using the resulting data to determine a genetic call for a subject with greater confidence.
  • multiplex diagnostic methods comprise capturing a plurality of genetic loci in parallel (e.g., one or more genetic loci from Table 1).
  • the genetic loci possess one or more polymorphisms (e.g., one or more polymorphisms from Table 2) the genotypes of which correspond to disease causing alleles.
  • the disclosure provides methods for assessing multiple heritable disorders in parallel.
  • methods are provided for diagnosing multiple heritable disorders in parallel at a pre-implantation, prenatal, perinatal, or postnatal stage.
  • the disclosure provides methods for analyzing multiple genetic loci (e.g., a plurality of target nucleic acids selected from Table 1) from a patient sample, such as a blood, pre-implantation embryo, chorionic villus or amniotic fluid sample, or other sample (e.g., other biological fluid or tissue sample such as a biopsy sample) as aspects of the invention are not limited in this respect.
  • a patient sample e.g., a tumor tissue or cell sample
  • a sample comprises cells from a non-host organism (e.g., bacterial or viral infections in a human subject) or a sample for environmental monitoring (e.g., bacterial, viral, fungal composition of a soil, water, or air sample).
  • aspects of the methods disclosed herein relate to genotyping a polymorphism of a target nucleic acid.
  • the genotyping may comprise determining that one or more alleles of the target nucleic acid are heterozygous or homozygous.
  • the genotyping may comprise determining the sequence of a polymorphism and comparing that sequence to a control sequence that is indicative of a disease risk.
  • the polymorphism is selected from a locus in Table 1 or Table 2. However, it should be appreciated that any locus associated with a disease or condition of interest may be used.
  • a diagnosis, prognosis, or disease risk assessment is provided to a subject based on a genotype determined for that subject at one or more genetic loci (e.g., based on the analysis of a biological sample obtained from that subject).
  • an assessment is provided to a couple, based on their respective genotypes at one or more genetic loci, of the risk of their having one or more children having a genotype associated with a disease or condition (e.g., a homozygous or heterozygous genotype associated with a disease or condition).
  • a subject or a couple may seek genetic or reproductive counseling in connection with a genotype determined according to embodiments of the invention.
  • genetic information from a tumor or circulating tumor cells is used to determine prognosis and guide selection of appropriate drugs/treatments.
  • aspects of the invention provide effective methods for overcoming challenges associated with systematic errors (bias) and/or stochastic effects in multiplex genomic capture and/or analysis (including sequencing analysis).
  • aspects of the invention are useful to avoid, reduce and/or account for variability in one or more sampling and/or analytical steps. For example, in some embodiments, variability in target nucleic acid representation and unequal sampling of heterozygous alleles in pools of captured target nucleic acids can be overcome.
  • the disclosure provides methods that reduce variability in the detection of target nucleic acids in multiplex capture methods.
  • methods improve allelic representation in a capture pool and, thus, improve variant detection outcomes.
  • the disclosure provides preparative methods for capturing target nucleic acids (e.g., genetic loci) that involve the use of different sets of multiple probes (e.g., molecular inversion probes, MIPs) that capture overlapping regions of a target nucleic acid to achieve a more uniform representation of the target nucleic acids in a capture pool compared with methods of the prior art.
  • methods reduce bias, or the risk of bias, associated with large scale parallel capture of genetic loci, e.g., for diagnostic purposes.
  • methods are provided for increasing reproducibility (e.g., by reducing the effect of polymorphisms on target nucleic acid capture) in the detection of a plurality of genetic loci in parallel.
  • methods are provided for reducing the effect of probe synthesis and/or probe amplification variability on the analysis of a plurality of genetic loci in parallel.
  • methods of analyzing a plurality of genetic loci comprise contacting each of a plurality of target nucleic acids with a probe set, wherein each probe set comprises a plurality of different probes, each probe having a central region flanked by a 5' region and a 3' region that are complementary to nucleic acids flanking the same strand of one of a plurality of subregions of the target nucleic acid, wherein the subregions of the target nucleic acid are different, and wherein each subregion overlaps with at least one other subregion, isolating a plurality of nucleic acids each having a nucleic acid sequence of a different subregion for each of the plurality of target nucleic acids, and analyzing the isolated nucleic acids.
  • methods comprise contacting each of a plurality of target nucleic acids with a probe set, wherein each probe set comprises a plurality of different probes, each probe having a central region flanked by a 5' region and a 3' region that are complementary to nucleic acids flanking the same strand of one of a plurality of subregions of the target nucleic acid, wherein the subregions of the target nucleic acid are different, and wherein a portion of the 5' region and a portion of the 3' region of a probe have, respectively, the sequence of the 5' region and the sequence of the 3' region of a different probe, isolating a plurality of nucleic acids each having a nucleic acid sequence of a different subregion for each of the plurality of target nucleic acids, and analyzing the isolated nucleic acids.
  • methods of the invention involve analyzing one or more genes with one or more molecular inversion probes provided in Appendix A.
  • those molecular inversion probes are used to capture various targets or subregions thereof on a gene selected from the group consisting of ABCC8, ASPA, BCKDHA, BCKDHB, BLM, CFTR, CLRN1, DLD, FANCC, G6PC, HEXA, IKBKAP, MCOLN1, PCDH15, and SMPD1.
  • a set of two or more molecular inversion probes provided in Appendix A may be used to tile across different, but overlapping sub-regions of one or more genes so that one or more targets on the one or more genes are captured by at least two molecular inversion probes of the set.
  • the number of molecular inversion probes used in a set for tile capture depends on the amount of overlapping coverage one desires for a certain target.
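As a rough illustration of tiled capture, the sketch below counts how many probe subregions cover each position of a target and reports the minimum depth. It is only a sketch under stated assumptions: the half-open coordinate intervals, the function names `coverage_depth` and `min_tiling_depth`, and the example coordinates are all hypothetical and do not come from Appendix A.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates (assumed convention)


def coverage_depth(target: Interval, subregions: List[Interval]) -> List[int]:
    """For each position in the target, count how many probe subregions cover it."""
    start, end = target
    depth = [0] * (end - start)
    for sub_start, sub_end in subregions:
        # Only the part of the subregion that falls inside the target counts.
        for pos in range(max(sub_start, start), min(sub_end, end)):
            depth[pos - start] += 1
    return depth


def min_tiling_depth(target: Interval, subregions: List[Interval]) -> int:
    """Smallest number of probes covering any single position of the target."""
    return min(coverage_depth(target, subregions))


# Hypothetical target and four overlapping probe subregions.
target = (100, 200)
probes = [(80, 160), (120, 210), (95, 150), (140, 205)]
print(min_tiling_depth(target, probes))  # every target position is covered by at least this many probes
```

A minimum depth of two or more would mean that every position of the target is captured by at least two probes of the set; adding probes raises the depth, which is the trade-off the preceding statement describes.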
  • a portion of one or more genes is captured using one or more molecular inversion probes in
  • One or more molecular inversion probes of Appendix A may also be chosen to capture particular regions of interest, such as coding or noncoding regions, of a gene. In addition, one or more molecular inversion probes may be chosen to capture regions specific to certain diseases.
  • the diseases may include, for example, Familial hyperinsulinism, Canavan disease, Maple syrup urine disease type Ia/Ib, Bloom syndrome, Cystic fibrosis, Usher syndrome type IIIA, Dihydrolipoamide dehydrogenase deficiency, Fanconi anemia group C, Glycogen storage disease type Ia, Tay-Sachs disease, Familial dysautonomia, Mucolipidosis type IV, Usher syndrome type IF, and Niemann-Pick disease type A/B.
  • aspects of the disclosure are based, in part, on the discovery of methods for overcoming problems associated with systematic and random errors (bias) in genome capture, amplification and sequencing methods, namely high variability in the capture and amplification of nucleic acids and disproportionate representation of heterozygous alleles in sequencing libraries.
  • the disclosure provides methods that reduce errors associated with the variability in the capture and amplification of nucleic acids.
  • the methods improve allelic representation in sequencing libraries and, thus, improve variant detection outcomes.
  • the disclosure provides preparative methods for capturing target nucleic acids (e.g., genetic loci) that involve the use of differentiator tag sequences to uniquely tag individual nucleic acid molecules.
  • the differentiator tag sequences permit the detection of bias based on the occurrence of combinations of differentiator tag and target sequences observed in a sequencing reaction.
  • the methods reduce errors caused by bias, or the risk of bias, associated with the capture, amplification and sequencing of genetic loci, e.g., for diagnostic purposes.
  • aspects of the invention relate to providing sequence tags (referred to as differentiator tags) that are useful to determine whether target nucleic acid sequences identified in an assay are from independently isolated target nucleic acids or from multiple copies of the same target nucleic acid molecule (e.g., due to bias in a preparative step, for example, amplification).
  • This information can be used to help analyze a threshold number of independently isolated target nucleic acids from a biological sample in order to obtain sequence information that is reliable and can be used to make a genotype conclusion (e.g., call) with a desired degree of confidence.
  • This information also can be used to detect bias in one or more nucleic acid preparative steps.
  • the methods disclosed herein are useful for any application where reduction of bias, e.g., associated with genomic isolation, amplification, sequencing, is important. Examples include detection of cancer mutations in a heterogeneous tissue sample, detection of mutations in maternally-circulating fetal DNA, and detection of mutations in cells isolated during a preimplantation genetic diagnostic procedure.
  • methods of genotyping a subject comprise determining the sequence of at least a threshold number of independently isolated nucleic acids, wherein the sequence of each isolated nucleic acid comprises a target nucleic acid sequence and a differentiator tag sequence, wherein the threshold number is a number of unique combinations of target nucleic acid and differentiator tag sequences, wherein the isolated nucleic acids are identified as independently isolated if they comprise unique combinations of target nucleic acid and differentiator tag sequences, and wherein the target nucleic acid sequence is the sequence of a genomic locus of a subject.
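The counting rule in the preceding statement can be sketched as follows: reads that share the same target sequence and differentiator tag are treated as copies of one isolated molecule, and only unique combinations count toward the threshold. This is a minimal sketch, not the disclosed implementation; the read layout `(locus, target_sequence, tag)`, the locus labels, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Read = Tuple[str, str, str]  # (locus label, target sequence, differentiator tag), assumed layout


def count_independent(reads: Iterable[Read]) -> Dict[str, int]:
    """Count unique (target sequence, tag) combinations observed at each locus."""
    unique = defaultdict(set)
    for locus, target_seq, tag in reads:
        unique[locus].add((target_seq, tag))
    return {locus: len(combos) for locus, combos in unique.items()}


def meets_threshold(reads: Iterable[Read], threshold: int) -> Dict[str, bool]:
    """Report, per locus, whether enough independently isolated molecules were seen."""
    return {locus: n >= threshold for locus, n in count_independent(reads).items()}


# Hypothetical reads; the second read duplicates the first and is not counted twice.
reads = [
    ("CFTR_locus", "ACGT", "AAT"),
    ("CFTR_locus", "ACGT", "AAT"),
    ("CFTR_locus", "ACGA", "AAT"),
    ("CFTR_locus", "ACGT", "GGC"),
    ("HEXA_locus", "TTGA", "CCA"),
]
print(count_independent(reads))
print(meets_threshold(reads, threshold=3))
```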
  • the isolated nucleic acids are products of a circularization selection-based preparative method, e.g., molecular inversion probe capture products. In other embodiments, the isolated nucleic acids are products of amplification-based preparative methods. In other embodiments, the isolated nucleic acids are products of hybridization-based preparative methods.
  • Circularization selection-based preparative methods selectively convert regions of interest (target nucleic acids) into a covalently-closed circular molecule which is then isolated typically by removal (usually enzymatic, e.g. with exonuclease) of any non-circularized linear nucleic acid.
  • Oligonucleotide probes e.g., molecular inversion probes
  • primer sites e.g., sequencing primer sites.
  • the probes are allowed to hybridize to the genomic target, and enzymes are used to first (optionally) fill in any gap between probe ends and second ligate the probe closed.
  • Circularization selection-based preparative methods include molecular inversion probe capture reactions and 'selector' capture reactions.
  • molecular inversion probe capture of a target nucleic acid is indicative of the presence of a polymorphism in the target nucleic acid.
  • genomic loci target nucleic acids
  • a polymerase chain reaction or ligase chain reaction or other amplification method
  • primers will be sufficiently complementary to the target sequence to hybridize with and prime amplification of the target nucleic acid. Any one of a variety of art known methods may be utilized for primer design and synthesis. One or more of the primers may be perfectly complementary to the target sequence. Degenerate primers may also be used.
  • Primers may also include additional nucleic acids that are not complementary to target sequences but that facilitate downstream applications, including for example restriction sites and differentiator tag sequences.
  • Amplification-based methods include amplification of a single target nucleic acid and multiplex amplification (amplification of multiple target nucleic acids in parallel).
  • Hybridization-based preparative methods involve selectively immobilizing target nucleic acids for further manipulation. It is to be understood that one or more oligonucleotides
  • immobilization oligonucleotides which comprise differentiator tag sequences, and which may be from 15 to 170 nucleotides in length, are used which hybridize along the length of a target region of a genetic locus to immobilize it.
  • oligonucleotides are either immobilized before hybridization is performed (e.g., Roche/Nimblegen 'sequence capture') or are prepared such that they include a moiety (e.g., biotin) which can be used to selectively immobilize the target nucleic acid after hybridization by binding to, e.g., streptavidin-coated microbeads (e.g., Agilent 'SureSelect').
  • hybridization based methods described herein may be used in connection with one or more of the tiling/staggering, tagging, size-detection, and/or sensitivity enhancing algorithms described herein.
  • the methods disclosed herein comprise determining the sequence of molecular inversion probe capture products, each comprising a molecular inversion probe and a target nucleic acid, wherein the sequence of the molecular inversion probe comprises a differentiator tag sequence and, optionally, a primer sequence, and wherein the target nucleic acid is a captured genomic locus of a subject, and genotyping the subject at the captured genomic locus based on the sequence of at least a threshold number of unique combinations of target nucleic acid and differentiator tag sequences of molecular inversion probe capture products.
  • the methods disclosed herein comprise obtaining molecular inversion probe capture products, each comprising a molecular inversion probe and a target nucleic acid, wherein the sequence of the molecular inversion probe comprises a differentiator tag sequence and, optionally, a primer sequence, wherein the target nucleic acid is a captured genomic locus of the subject, amplifying the molecular inversion probe capture products, and genotyping the subject by determining, for each target nucleic acid, the sequence of at least a threshold number of unique combinations of target nucleic acid and differentiator tag sequence of molecular inversion probe capture products.
  • obtaining comprises capturing target nucleic acids from a genomic sample of the subject with molecular inversion probes, each comprising a unique differentiator tag sequence.
  • capturing is performed under conditions wherein the likelihood of obtaining two or more molecular inversion probe capture products with identical combinations of target and differentiator tag sequences is equal to or less than a predetermined value, optionally wherein the predetermined value is about 0.05.
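One way to reason about the likelihood of identical target and tag combinations is a birthday-problem calculation. The sketch below assumes that tags are drawn uniformly at random from a pool of `n_tags` distinct sequences (for example 4**L for a random tag of length L) and that `n_captures` molecules of the same target are captured; neither assumption is stated in the source, and the function names are hypothetical.

```python
def collision_probability(n_tags: int, n_captures: int) -> float:
    """Probability that at least two of n_captures molecules receive the same tag."""
    if n_captures > n_tags:
        return 1.0  # more molecules than tags guarantees a repeat
    p_all_distinct = 1.0
    for i in range(n_captures):
        p_all_distinct *= (n_tags - i) / n_tags
    return 1.0 - p_all_distinct


def max_captures(n_tags: int, limit: float = 0.05) -> int:
    """Largest number of captures per target that keeps the collision probability at or below limit."""
    k = 1
    while collision_probability(n_tags, k + 1) <= limit:
        k += 1
    return k


# Hypothetical example: a 10-nucleotide random tag gives 4**10 possible sequences.
n_tags = 4 ** 10
k = max_captures(n_tags, limit=0.05)
print(k, collision_probability(n_tags, k))
```

Under this model, lengthening the tag or lowering the number of molecules captured per target both reduce the chance that two independent molecules are mistaken for copies of one.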
  • the threshold number for a specific target nucleic acid sequence is selected based on a desired statistical confidence for the genotype. In some embodiments, the methods further comprise determining a statistical confidence for the genotype based on the number of unique combinations of target nucleic acid and differentiator tag sequences.
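As one possible way to pick such a threshold, the sketch below uses a simple binomial sampling model that is an assumption of this example rather than anything specified in the source: each independently isolated molecule from a heterozygous subject comes from one allele with probability `allele_fraction`, and the threshold is the smallest number of molecules for which the chance of seeing only one of the two alleles is at most `alpha`. Sequencing error is ignored.

```python
def prob_miss_allele(n: int, allele_fraction: float = 0.5) -> float:
    """Probability that n independent molecules all come from the same allele of a heterozygote."""
    return allele_fraction ** n + (1.0 - allele_fraction) ** n


def threshold_for_confidence(alpha: float, allele_fraction: float = 0.5) -> int:
    """Smallest n for which the probability of missing one allele is at most alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if not 0.0 < allele_fraction < 1.0:
        raise ValueError("allele_fraction must be strictly between 0 and 1")
    n = 1
    while prob_miss_allele(n, allele_fraction) > alpha:
        n += 1
    return n


print(threshold_for_confidence(0.05))
print(threshold_for_confidence(0.001))
# Skewed capture (hypothetical 30/70 allele representation) requires more molecules.
print(threshold_for_confidence(0.05, allele_fraction=0.3))
```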
  • the methods comprise obtaining a plurality of molecular inversion probe capture products each comprising a molecular inversion probe and a target nucleic acid, wherein the sequence of the molecular inversion probe comprises a differentiator tag sequence and, optionally, a primer sequence (e.g., a sequence that is complementary to the sequence of a nucleic acid that is used as a primer for sequencing or other extension reaction), amplifying the plurality of molecular inversion probe capture products, determining numbers of occurrence of combinations of target nucleic acid and differentiator tag sequence of molecular inversion probe capture products in the amplified plurality, and if the number of occurrence of a specific combination of target nucleic acid sequence and differentiator tag sequence exceeds a predetermined value, detecting bias in the amplification of the molecular inversion probe comprising the specific combination.
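The bias check described above can be sketched as a tally of how often each target and tag combination occurs after amplification, flagging any combination whose count exceeds a predetermined value. The input layout `(target_sequence, tag)`, the cutoff value, and the function name are hypothetical choices for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Combo = Tuple[str, str]  # (target sequence, differentiator tag), assumed layout


def detect_amplification_bias(reads: Iterable[Combo], max_occurrences: int) -> Dict[Combo, int]:
    """Return the combinations that occur more often than max_occurrences, with their counts."""
    counts = Counter(reads)
    return {combo: n for combo, n in counts.items() if n > max_occurrences}


# Hypothetical amplified pool in which one capture product dominates.
reads = [("ACGT", "AAT")] * 40 + [("ACGT", "GGC")] * 3 + [("ACGA", "TTC")] * 2
print(detect_amplification_bias(reads, max_occurrences=10))
```

A combination that is flagged here contributes only one independent observation to a genotype call, however many reads it produced.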
  • the methods further comprise genotyping target sequences in the plurality
  • the target nucleic acid is a gene (or portion thereof) selected from Table 1.
  • the genotyping comprises determining the sequence of a target nucleic acid (e.g., a polymorphic sequence) at one or more (both) alleles of a genome (a diploid genome) of a subject.
  • the genotyping comprises determining the sequence of a target nucleic acid at both alleles of a diploid genome of a subject, wherein the target nucleic acid comprises, or consists of, a sequence of Table 1, Table 2, or other locus of interest.
  • aspects of the invention provide methods and compositions for identifying nucleic acid insertions or deletions in genomic regions of interest without
  • nucleotide sequences of these regions are particularly useful for detecting nucleic acid insertions or deletions in genomic regions containing nucleic acid sequence repeats (e.g., di- or tri-nucleotide repeats).
  • the invention is not limited to analyzing nucleic acid repeats and may be used to detect insertions or deletions in any target nucleic acid of interest.
  • aspects of the invention are particularly useful for analyzing multiple loci in a multiplex assay.
  • aspects of the invention relate to determining whether an amount of target nucleic acid that is captured in a genomic capture assay is higher or lower than expected. In some embodiments, a statistically significant deviation from an expected amount (e.g., higher or lower) is indicative of the presence of a nucleic acid insertion or deletion in the genomic region of interest. In some embodiments, the amount is a number of nucleic acid molecules that are captured. In some embodiments, the amount is a number of independently captured nucleic acid molecules in a sample. It should be appreciated that the captured nucleic acids may be literally captured from a sample, or their sequences may be captured without actually capturing the original nucleic acids in the sample. For example, nucleic acid sequences may be captured in an assay that involves a template-based extension of nucleic acids having the region of interest, in the sample.
  • aspects of the invention are based on the recognition that the efficiency of certain capture techniques is affected by the length of the nucleic acid being captured. Accordingly, an increase or decrease in the length of a target nucleic acid (e.g., due to an insertion or deletion of a repeated sequence) can alter the capture efficiency of that nucleic acid. In some embodiments, a difference in the capture efficiency (e.g., a statistically significant difference in the capture efficiency) of a target nucleic acid is indicative of an insertion or deletion in the target nucleic acid.
  • the capture efficiency for a target nucleic acid may be evaluated based on an amount of captured nucleic acid (e.g., number of captured nucleic acid molecules) relative to a control amount (e.g., based on an amount of control nucleic acid that is captured).
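A comparison of captured amounts against a control can be sketched as a test on the ratio of counts. The example below is only one possible approach and rests on assumptions not found in the source: the counts are numbers of independently captured molecules, they are treated as Poisson-distributed, the expected target-to-control ratio is taken as known without error, and a normal approximation on the log ratio is used.

```python
import math
from typing import Tuple


def capture_ratio_test(target_count: int, control_count: int,
                       expected_ratio: float) -> Tuple[float, float, float]:
    """Compare the observed target/control capture ratio with an expected ratio.

    Returns (observed_ratio, z_score, two_sided_p_value).
    """
    if target_count <= 0 or control_count <= 0 or expected_ratio <= 0:
        raise ValueError("counts and expected_ratio must be positive")
    observed = target_count / control_count
    # For Poisson counts, var(log(a / b)) is approximately 1/a + 1/b.
    standard_error = math.sqrt(1.0 / target_count + 1.0 / control_count)
    z = math.log(observed / expected_ratio) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return observed, z, p_value


# Hypothetical counts: the target is captured far less often than the control locus.
observed, z, p = capture_ratio_test(target_count=50, control_count=200, expected_ratio=1.0)
print(observed, z, p)
```

A significantly low ratio would be consistent with a change in target length that reduces capture efficiency, and a significantly high ratio with a change that increases it; which direction corresponds to an insertion or a deletion depends on the capture technique.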
  • the invention is not limited in this respect and other techniques for evaluating capture efficiency also may be used.
  • evaluating the capture efficiency as opposed to determining the sequence of the entire repeat region reduces errors associated with sequencing through repeat regions.
  • Repeat sequences often give rise to stutters or skips in sequencing reactions that make it very difficult to accurately determine the number of repeats in a target region without running multiple sequencing reactions under different conditions and carefully analyzing the results.
  • Such procedures are cumbersome and not readily scalable in a manner that is consistent with high throughput analyses of target nucleic acids.
  • repeat regions may be longer than the length of the individual sequence read, making length
  • aspects of the invention are useful to increase the sensitivity of detecting insertions or deletions in target regions, particularly target regions containing repeated sequences.
  • aspects of the invention relate to capturing genomic nucleic acid sequences using a molecular inversion probe (e.g., MIP or Padlock probe) technique, and determining whether the amount (e.g., number) of captured sequences is higher or lower than expected. In some embodiments, the amount (e.g., number) of captured sequences is compared to an amount (e.g., number) of sequences captured in a control assay.
  • the control assay may involve analyzing a control sample that contains a nucleic acid from the same genetic locus having a known sequence length (e.g., a known number of nucleic acid repeats).
  • a control may involve analyzing a second (e.g., different) genetic locus that is not expected to contain any insertions or deletions.
  • the second genetic locus may be analyzed in the same sample as the locus being interrogated or in a different sample where its length has been previously determined.
  • the second genetic locus may be a locus that is not characterized by the presence of nucleic acid repeats (and thus not expected to contain insertions or deletions of the repeat sequence).
  • a target nucleic acid region that is being evaluated may be determined by the identity of the targeting arms of a probe that is designed to capture the target region (or sequence thereof).
  • the targeting arms of a MIP probe may be designed to be complementary (e.g., sufficiently complementary for selective hybridization and/or polymerase extension and/or ligation) to genomic regions flanking a target region suspected of containing an insertion or deletion.
  • two targeting arms may be designed to be complementary (e.g., sufficiently complementary for selective hybridization and/or polymerase extension and/or ligation) to the two flanking regions that are immediately adjacent (e.g., immediately 5' and 3', respectively) to a region of a sequence repeat on one strand of a genomic nucleic acid.
  • one or both targeting arms may be designed to hybridize several bases (e.g., 1-5, 5-10, 10-25, 25-50, or more) upstream or downstream from the repeat region in such a way that the captured sequence includes a region of unique genomic sequence that is on one or both sides of the repeat region. This unique region can then be used to identify the captured target (e.g., based on sequence or hybridization information).
  • two or more (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10 or more) different loci may be interrogated in parallel in a single assay (e.g., in a multiplex assay).
  • the ratio of captured nucleic acids for each locus may be used to determine whether a nucleic acid insertion or deletion is present in one locus relative to the other. For example, the ratio may be compared to a control ratio that is representative of the two loci when neither one has an insertion or deletion relative to control sequences (e.g., sequences that are normal or known to be associated with healthy phenotypes for those loci). However, the amount of captured nucleic acids may be compared to any suitable control as discussed herein.
  • the locus of a captured sequence may be identified by determining a portion of unique sequence 5' and/or 3' to the repeat region in the target nucleic acid suspected of containing a deletion or insertion. This does not require sequencing the captured repeat region itself.
  • the repeat region also could be sequenced, as aspects of the invention are not limited in this respect.
  • aspects of the invention may be combined with one or more sequence-based assays (e.g., SNP detection assays), for example in a multiplex format, to determine the genotype of one or more regions of a subject.
  • methods of detecting a polymorphism in a nucleic acid in a biological sample comprise evaluating the efficiency of capture at one or more loci and determining whether one or both alleles at that locus contain an insertion or deletion relative to a control locus (e.g., a locus indicative of a length of repeat sequence that is associated with a healthy phenotype).
  • aspects of the invention relate to methods for determining whether a target nucleic acid has an abnormal length by evaluating the capture efficiency of a target nucleic acid in a biological sample from a subject, wherein a capture efficiency that is different from a reference capture efficiency is indicative of the presence, in the biological sample, of a target nucleic acid having an abnormal length.
  • a normal length is a length that is associated with a normal (e.g., healthy or non-carrier) phenotype.
  • an abnormal length is a length that is either shorter or longer than the normal length.
  • the presence of an abnormal length is indicative of an increased risk that the locus is associated with a disease or a disease carrier phenotype.
  • the abnormal length is indicative that the subject either has a disease or condition or is a carrier of a disease or condition (e.g., associated with the locus).
  • the description of embodiments relating to detecting the presence of an abnormal length also supports detecting the presence of a length that is different from an expected or control length.
  • aspects of the invention relate to estimating the length of a target nucleic acid (e.g., of a sub-target region within a target nucleic acid).
  • aspects of the invention relate to methods for estimating the length of a target nucleic acid by contacting the target nucleic acid with a plurality of detection probes under conditions that permit hybridization of the detection probes to the target nucleic acid, wherein each detection probe is a polynucleotide that comprises a first arm that hybridizes to a first region of the target nucleic acid and a second arm that hybridizes to a second region of the target nucleic acid, wherein the first and second regions are on a common strand of the target nucleic acid, and wherein the nucleotide sequence of the target between the 5' end of the first region and the 3' end of the second region is the nucleotide sequence of a sub-target nucleic acid; and capturing a plurality of sub-target nucleic acids that are
  • methods for estimating a nucleic acid length may involve comparing a capture efficiency for a target nucleic acid region to two or more reference efficiencies for known nucleic acid lengths in order to determine whether the target nucleic acid region is smaller, intermediate, or larger in size than the known control lengths.
  • a series of nucleic acids of known different lengths may be used to provide a calibration curve for evaluating the length of a target nucleic acid region of interest.
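The calibration-curve idea can be illustrated with a minimal sketch. It assumes, as the description suggests, that capture efficiency falls as target length grows, and it uses simple linear interpolation between control points. The function name `estimate_length` and the calibration values are hypothetical and are not taken from the specification.

```python
from bisect import bisect_left

def estimate_length(efficiency, calibration):
    """Estimate target length by linear interpolation on a calibration curve.

    calibration: list of (known_length_nt, capture_efficiency) pairs.
    Assumption (hypothetical): efficiency decreases strictly with length.
    """
    # Sort by efficiency ascending so bisect can locate the bracketing controls.
    points = sorted(calibration, key=lambda p: p[1])
    effs = [e for _, e in points]
    if efficiency <= effs[0]:
        return points[0][0]      # at or below the lowest-efficiency (longest) control
    if efficiency >= effs[-1]:
        return points[-1][0]     # at or above the highest-efficiency (shortest) control
    i = bisect_left(effs, efficiency)
    (len_lo, eff_lo), (len_hi, eff_hi) = points[i - 1], points[i]
    frac = (efficiency - eff_lo) / (eff_hi - eff_lo)
    return len_lo + frac * (len_hi - len_lo)

# Hypothetical controls: (length in nucleotides, relative capture efficiency).
controls = [(100, 1.0), (200, 0.6), (400, 0.2)]
```

With these illustrative controls, an observed efficiency midway between the 100 nt and 200 nt controls maps to an estimated length midway between them. Values outside the calibrated range are clamped to the nearest control rather than extrapolated.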
  • the capture efficiency of a target region suspected of having a deletion or insertion is determined by comparing the capture efficiency to a reference indicative of a normal capture efficiency. In some embodiments, the capture efficiency is lower than the reference capture efficiency. In some embodiments, the subject is identified as having an insertion in the target region. In some embodiments, the capture efficiency is higher than the reference capture efficiency. In some embodiments, the subject is identified as having a deletion in the target region. In some embodiments, the subject is identified as being heterozygous for the insertion. In some embodiments, the subject is identified as being heterozygous for the deletion.
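The direction of the call described above can be sketched as follows. This is only an illustration: the function name `call_indel` and the `tolerance` band are hypothetical, and a real assay would set its decision boundaries from empirical data.

```python
def call_indel(target_eff, reference_eff, tolerance=0.2):
    """Return 'insertion', 'deletion', or 'no call' from relative capture efficiency.

    tolerance is a hypothetical fractional band around the reference efficiency.
    """
    if reference_eff <= 0:
        raise ValueError("reference efficiency must be positive")
    ratio = target_eff / reference_eff
    if ratio < 1 - tolerance:
        return "insertion"   # captured less efficiently: longer than the reference
    if ratio > 1 + tolerance:
        return "deletion"    # captured more efficiently: shorter than the reference
    return "no call"
```

A heterozygous insertion or deletion would be expected to shift the ratio less than a homozygous one, so a fixed band like this one is a simplification.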
  • aspects of the invention relate to capturing a sub-target nucleic acid (or a sequence of a sub-target nucleic acid).
  • a molecular inversion probe technique is used.
  • a molecular inversion probe is a single linear strand of nucleic acid that comprises a first targeting arm at its 5' end and a second targeting arm at its 3' end, wherein the first targeting arm is capable of specifically hybridizing to a first region flanking one end of the sub-target nucleic acid, and wherein the second targeting arm is capable of specifically hybridizing to a second region flanking the other end of the sub-target nucleic acid on the same strand of the target nucleic acid.
  • the first and second targeting arms are between about 10 and about 100 nucleotides long. In some embodiments, the first and second targeting arms are about 10-20, 20-30, 30-40, or 40-50 nucleotides long.
  • the first and second targeting arms are about 20 nucleotides long. In some embodiments, the first and second targeting arms have the same length. In some embodiments, the first and second targeting arms have different lengths. In some embodiments, each pair of first and second targeting arms in a set of probes has the same length. Accordingly, if one of the targeting arms is longer, the other one is
  • a quality control step in some embodiments to confirm that all captured probe/target sequence products have the same length after a multiplexed plurality of capture reactions.
  • a set of probes may be designed to have the same length if the intervening region is varied to accommodate any differences in the length of either one or both of the first and second targeting arms.
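As a small worked example of keeping a probe set at a constant length, the intervening region can absorb any difference in arm lengths. The helper name `backbone_length`, the 70-nucleotide total, and the arm lengths below are hypothetical values chosen for illustration.

```python
def backbone_length(total_length, arm1_length, arm2_length):
    """Length of the intervening region that keeps a probe at total_length."""
    remaining = total_length - arm1_length - arm2_length
    if remaining < 0:
        raise ValueError("targeting arms exceed the requested probe length")
    return remaining

# Hypothetical (first arm, second arm) lengths for three probes, all 70 nt long.
arms = [(20, 20), (22, 18), (25, 20)]
backbones = [backbone_length(70, a, b) for a, b in arms]
```

Here the first two probes need the same intervening length because their arm lengths sum to the same value, while the third probe needs a shorter intervening region.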
  • the hybridization Tms of the first and second targeting arms are similar. In some embodiments, the hybridization Tms of the first and second targeting arms are within 2-5° C. of each other. In some embodiments, the hybridization Tms of the first and second targeting arms are identical. In some embodiments, the hybridization Tms of the first and second targeting arms are close to empirically-determined optima but not necessarily identical.
  • the first and second targeting arms of a molecular inversion probe have different Tms.
  • the Tm of the first targeting arm (at the 5' end of the molecular inversion probe) is higher than the Tm of the second targeting arm (at the 3' end of the molecular inversion probe).
  • a relatively high Tm for the first targeting arm may help avoid or prevent the first targeting arm from being displaced after hybridization by the extension product of the 3' end of the second targeting arm.
  • a reference to the Tm of a targeting arm as used herein relates to the Tm of hybridization of the targeting arm to a nucleic acid having the complementary sequence (e.g., the region of the target nucleic acid that has a sequence that is complementary to the sequence of the targeting arm). It also should be appreciated that the Tms of the targeting arms described herein may be calculated using any appropriate method.
  • an experimental method (e.g., a gel shift assay, a hybridization assay, or a melting curve analysis, for example in a PCR machine with a SYBR dye, stepping through a temperature ramp while monitoring the signal level from an intercalating dye bound to double-stranded DNA, etc.).
  • an optimal Tm may be determined by evaluating the number of products formed (e.g., for each of a plurality of MIP probes), and determining the optimal Tm as the center point in a histogram of Tm for all targeting arms.
  • a predictive algorithm may be used to determine a Tm theoretically.
  • a relatively simple predictive algorithm may be used based on the number of G/C and A/T base pairs when the sequence is hybridized to its target and/or the length of the hybridized product (e.g., 64.9+41*([G+C]-16.4)/(A+T+G+C); see, for example, Wallace, R. B., Shaffer, J., Murphy, R. F., Bonner, J., Hirose, T., and Itakura, K. (1979) Nucleic Acids Res 6:3543-3557).
  • a more complex algorithm may be used to account for the effects of base stacking entropy and enthalpy, ion concentration, and primer concentration (see, for example, SantaLucia J (1998), Proc Natl Acad Sci USA, 95: 1460-5).
  • an algorithm may use modified parameters (e.g., nearest-neighbor parameters for basepair entropy/enthalpy values). It should be appreciated that any suitable algorithm may be used as aspects of the invention are not limited in this respect. However, it also should be appreciated that different methodologies may result in different calculated or predicted Tms for the same sequences.
  • the same empirical and/or theoretical method is used to determine the Tms of different sequences for a set of probes to avoid a negative impact of any systematic difference in the Tm determination or prediction when designing a set of probes with predetermined similarities or differences for different Tms.
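The simple base-composition formula quoted above, 64.9+41*([G+C]-16.4)/(A+T+G+C), can be applied to both targeting arms with the same function, which keeps any systematic error consistent across a probe set. The sketch below is illustrative only: the function names `gc_tm` and `arm_tm_delta` and the example sequences are hypothetical, and nearest-neighbor methods would generally give different values.

```python
def gc_tm(seq):
    """Tm estimate in deg C: 64.9 + 41 * (G + C - 16.4) / (A + T + G + C)."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0 or total != len(seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    return 64.9 + 41 * (counts["G"] + counts["C"] - 16.4) / total

def arm_tm_delta(first_arm, second_arm):
    """Tm of the first (5') arm minus Tm of the second (3') arm, same formula for both."""
    return gc_tm(first_arm) - gc_tm(second_arm)

# Hypothetical 20-nt arm with 10 G/C bases:
# 64.9 + 41 * (10 - 16.4) / 20 = 51.78 deg C
example_tm = gc_tm("ATGCATGCATGCATGCATGC")
```

A positive `arm_tm_delta` corresponds to the embodiments in which the first targeting arm has the higher Tm.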
  • the Tm of the first targeting arm may be about 1° C, about 2° C, about 3° C, about 4° C, about 5° C, or more than about 5° C. higher than the Tm of the second targeting arm.
  • each probe in a plurality of probes (e.g., each probe in a set of 5-10, each probe in a set of at least 10, each probe in a set of 10-50, each probe in a set of 50-100, each probe in a set of 100-500, each probe in a set of 500-1,000, each probe in a set of 1,000-1,500, each probe in a set of 1,500-2,000, each probe in a set of 2,000-3,000, 3,000-5,000, 5,000-10,000 or each probe in a set of at least 5,000 different probes) has a unique first targeting arm (e.g., they all have different sequences) and a unique second targeting arm (e.g., they all have different sequences).
  • the first targeting arm has a Tm for its complementary sequence that is higher (e.g., about 1° C, about 2° C, about 3° C, about 4° C, about 5° C, or more than about 5° C. higher) than the Tm of the second targeting arm for its complementary sequence.
  • each of the first targeting arms have similar or identical Tms for their respective complementary sequences and each of the second targeting arms have similar or identical Tms for their respective
  • the Tm of the first arm(s) may be about 58° C. and the Tm of the second arm(s) may be about 56° C. In some embodiments, the Tm of the first arm(s) may be about 68° C, and the Tm of the second arm(s) may be about 65° C.
  • the similarity (e.g., within a range of 1° C, 2° C, 3° C, 4° C, 5° C) or identity of the Tms for the different targeting arms should be based either on empirical data for each arm or based on the same predictive algorithm for each arm (e.g., Wallace, R. B., Shaffer, J., Murphy, R. F., Bonner, J., Hirose, T., and Itakura, K. (1979) Nucleic Acids Res 6:3543-3557, SantaLucia J (1998), Proc Natl Acad Sci USA, 95: 1460-5, or other algorithm).
  • the Tm of the first targeting arm of a molecular inversion probe (at the 5' end of the molecular inversion probe) is selected to be sufficiently stable to prevent displacement of the first targeting arm from its complementary sequence on a target nucleic acid.
  • the Tm of the first targeting arm is 50-55° C, at least 55° C, 55-60° C, at least 60° C, 60-65° C, at least 65° C, at least 70° C, at least 75° C, or at least 80° C.
  • the Tm for a particular targeting arm may be determined empirically or theoretically.
  • each probe in a plurality of probes (e.g., each probe in a set of 5-10, each probe in a set of at least 10, each probe in a set of 10-50, each probe in a set of 50-100, each probe in a set of 100-500, or each probe in a set of at least 500 different probes) has a different first targeting arm (e.g., different sequences) but each different first targeting arm has a similar or identical Tm for its complementary sequence on a target nucleic acid.
  • the similarity (e.g., within a range of 1° C, 2° C, 3° C, 4° C, 5° C) or identity of the Tms for the different targeting arms should be based either on empirical data for each arm or based on the same predictive algorithm for each arm (e.g., Wallace, R. B., Shaffer, J., Murphy, R. F., Bonner, J., Hirose, T., and Itakura, K. (1979) Nucleic Acids Res 6:3543-3557, SantaLucia J (1998), Proc Natl Acad Sci USA, 95: 1460-5, or other algorithm).
  • the sub-target nucleic acid contains a nucleic acid repeat.
  • the nucleic acid repeat is a dinucleotide or trinucleotide repeat.
  • the sub-target nucleic acid contains 10-100 copies of the nucleic acid repeat in the absence of an abnormal increase or decrease in nucleic acid repeats.
  • the sub-target nucleic acid is a region of the Fragile-X locus that contains a nucleic acid repeat.
  • one or both targeting arms hybridize to a region on the target nucleic acid that is immediately adjacent to a region of nucleic acid repeats.
  • one or both targeting arms hybridize to a region on the target nucleic acid that is separated from a region of nucleic acid repeats by a region that does not contain any nucleic acid repeats.
  • the molecular inversion probe further comprises a primer-binding region that can be used to sequence the captured sub-target nucleic acid and optionally the first and/or second targeting arm.
  • aspects of the invention relate to evaluating the length of a plurality of different target nucleic acids in a biological sample.
  • the plurality of target nucleic acids are analyzed using a plurality of different molecular inversion probes.
  • each different molecular inversion probe comprises a different pair of first and second targeting arms at each of the 3' and 5' ends.
  • each different molecular inversion probe comprises the same primer-binding sequence.
  • aspects of the invention relate to analyzing nucleic acid from a biological sample obtained from a subject.
  • the biological sample is a blood sample.
  • the biological sample is a tissue sample, specific cell population, tumor sample, circulating tumor cells, or environmental sample.
  • the biological sample is a single cell.
  • nucleic acids are analyzed in biological samples obtained from a plurality of different subjects.
  • nucleic acids from a biological sample are analyzed in multiplex reactions. It should be appreciated that a biological sample contains a plurality of copies of a genome derived from a plurality of cells in the sample. Accordingly, a sample may contain a plurality of independent copies of a target nucleic acid region of interest, the capture efficiency of which can be used to evaluate its size as described herein.
  • aspects of the invention relate to evaluating a nucleic acid capture efficiency by determining an amount of target nucleic acid that is captured (e.g., an amount of sub-target nucleic acid sequences that are captured).
  • the amount of target nucleic acid that is captured is determined by determining a number of independently captured target nucleic acid molecules (e.g., the amount of independently captured molecules that have the sequence of the sub-target region).
  • the amount of target nucleic acid that is captured is compared to a reference amount of captured nucleic acid.
  • the reference amount is determined by determining a number of independently captured molecules of a reference nucleic acid.
  • the reference nucleic acid is a nucleic acid of a different locus in the biological sample that is not suspected of containing a deletion or insertion. In some embodiments, the reference nucleic acid is a nucleic acid of known size and amount that is added to the capture reaction. As described herein, a number of independently captured nucleic acid sequences can be determined by contacting a nucleic acid sample with a preparation of a probe (e.g., a MIP probe as described herein). It should be appreciated that the preparation may comprise a plurality of copies of the same probe and accordingly a plurality of independent copies of the target region may be captured by different probe molecules.
  • the number of probe molecules that actually capture a sequence can be evaluated by determining an amount or number of captured molecules using any suitable technique. This number is a reflection of both the number of target molecules in the sample and the efficiency of capture of those target molecules, which in turn is related to the size of the target molecules as described herein. Accordingly, the capture efficiency can be evaluated by controlling for the abundance of the target nucleic acid, for example by comparing the number or amount of captured target molecules to an appropriate control (e.g., a known size and amount of control nucleic acid, or a different locus that should be present in the same amount in the biological sample and is not expected to contain any insertions or deletions).
  • aspects of the invention relate to identifying a subject as having an insertion or deletion in one or more alleles of a genetic locus if the capture efficiency for that genetic locus is statistically significantly different than a reference capture efficiency.
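One simple way to make "statistically significantly different" concrete is to compare the observed capture efficiency with replicate reference measurements using a z-score. The specification does not prescribe a particular test, so this is only a sketch under that assumption. The function name `is_significantly_different`, the cutoff of 3 standard deviations, and the replicate values are hypothetical.

```python
from statistics import mean, stdev

def is_significantly_different(target_eff, reference_effs, z_cutoff=3.0):
    """Flag a capture efficiency that lies far from replicate reference values.

    z_cutoff is a hypothetical threshold, not a value from the specification.
    """
    if len(reference_effs) < 2:
        raise ValueError("need at least two reference measurements")
    mu = mean(reference_effs)
    sigma = stdev(reference_effs)   # sample standard deviation
    if sigma == 0:
        return target_eff != mu     # no spread: any deviation counts as different
    z = (target_eff - mu) / sigma
    return abs(z) > z_cutoff

# Hypothetical replicate reference efficiencies for a locus of normal length.
reference = [1.0, 1.1, 0.9, 1.05, 0.95]
```

With these illustrative replicates, an efficiency of 0.5 lies more than six standard deviations below the reference mean and is flagged, whereas 1.02 is not.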
  • hybridization conditions used for any of the capture techniques described herein can be based on known hybridization buffers and conditions.
  • aspects of the invention relate to basing a nucleic acid sequence analysis on results from two or more different nucleic acid preparatory techniques that have different systematic biases in the types of nucleic acids that they sample.
  • different techniques have different sequence biases that are systematic and not simply due to stochastic effects during nucleic acid capture or amplification. Accordingly, the degree of oversampling required to overcome variations in nucleic acid preparation needs to be sufficient to overcome the biases (e.g., an oversampling of 2-5 fold, 5-10 fold, 5-15 fold, 15-20 fold, 20-30 fold, 30-50 fold, or intermediate to higher fold).
  • different techniques have different characteristic or systematic biases. For example, one technique may bias a sample analysis towards one particular allele at a genetic locus of interest, whereas a different technique would bias the sample analysis towards a different allele at the same locus. Accordingly, the same sample may be identified as being different depending on the type of technique that is used to prepare nucleic acid for sequence analysis. This effectively represents a sensitivity limitation, because each technique has different relative sensitivities for polymorphic sequences of interest.
  • the sensitivity of a nucleic acid analysis can be increased by combining the sequences from different nucleic acid preparative steps and using the combined sequence information for a diagnostic assay (e.g., for making a call as to whether a subject is homozygous or heterozygous at a genetic locus of interest).
  • the invention provides a method of increasing the sensitivity of a nucleic acid detection assay by obtaining a first preparation of a target nucleic acid using a first preparative method on a biological sample, obtaining a second preparation of a target nucleic acid using a second preparative method on the biological sample, assaying the sequences obtained in both first and second nucleic acid preparations, and using the sequence information from both first and second nucleic acid preparations to determine the genotype of the target nucleic acid in the biological sample, wherein the first and second preparative methods have different systematic sequence biases.
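A minimal sketch of using sequence information from two preparations together is to pool allele-specific read counts before making a zygosity call. This is an illustration under stated assumptions, not the method of the specification: the function name `call_genotype`, the pooled alternate-allele fraction, and the 0.2 and 0.8 cutoffs are all hypothetical.

```python
def call_genotype(prep_counts, het_low=0.2, het_high=0.8):
    """Pool allele read counts across preparations and call a genotype.

    prep_counts: list of (ref_allele_reads, alt_allele_reads), one per preparation.
    het_low and het_high are hypothetical cutoffs on the pooled alt-allele fraction.
    """
    ref_total = sum(ref for ref, _ in prep_counts)
    alt_total = sum(alt for _, alt in prep_counts)
    depth = ref_total + alt_total
    if depth == 0:
        raise ValueError("no reads observed")
    alt_fraction = alt_total / depth
    if alt_fraction < het_low:
        return "homozygous reference"
    if alt_fraction > het_high:
        return "homozygous alternate"
    return "heterozygous"

# Hypothetical counts: preparation A under-samples the alternate allele,
# preparation B does not.
prep_a = (90, 10)
prep_b = (40, 60)
```

In this illustrative case, preparation A alone would be called homozygous reference, while pooling A and B gives an alternate-allele fraction of 0.35 and a heterozygous call.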
  • the first and second nucleic acid preparations are combined prior to performing a sequence assay.
  • the first preparative method is an amplification-based, a hybridization-based, or a circular probe-based preparative method.
  • the second method is an amplification-based, a hybridization-based, or a circular probe-based preparative method.
  • the first and second methods are of different types (e.g., only one of them is an amplification-based, a hybridization-based, or a circular probe-based preparative method, and the other one is one of the other two types of method).
  • the second preparative method is an amplification-based, a hybridization-based, or a circular probe-based preparative method, provided that the second method is different from the first method.
  • both methods may be of the same type, provided they are different methods (e.g., both are amplification based or hybridization-based, but are different types of amplification or hybridization methods, e.g., with different relative biases).
  • genomic loci target nucleic acids
  • a polymerase chain reaction or ligase chain reaction or other amplification method
  • primers will be sufficiently complementary to the target sequence to hybridize with and prime amplification of the target nucleic acid. Any one of a variety of art known methods may be utilized for primer design and synthesis. One or both of the primers may be perfectly complementary to the target sequence. Degenerate primers may also be used.
  • Primers may also include additional nucleic acids that are not complementary to target sequences but that facilitate downstream applications, including for example restriction sites and identifier sequences (e.g., source sequences).
  • PCR based methods may include amplification of a single target nucleic acid and multiplex amplification
  • Hybridization-based preparative methods may involve selectively immobilizing target nucleic acids for further manipulation. It is to be understood that one or more oligonucleotides (immobilization oligonucleotides), which in some embodiments may be from 10 to 200 nucleotides in length, are used which hybridize along the length of a target region of a genetic locus to immobilize it.
  • immobilization oligonucleotides are either immobilized before hybridization is performed (e.g., Roche/Nimblegen 'sequence capture'), or are prepared such that they include a moiety (e.g., biotin) which can be used to selectively immobilize the target nucleic acid after hybridization by binding to e.g., streptavidin-coated microbeads (e.g., Agilent 'SureSelect').
  • Circularization selection-based preparative methods selectively convert each region of interest into a covalently-closed circular molecule which is then isolated by removal (usually enzymatic, e.g., with exonuclease) of any non-circularized linear nucleic acid.
  • Oligonucleotide probes are designed which have ends that flank the region of interest. The probes are allowed to hybridize to the genomic target, and enzymes are used to first (optionally) fill in any gap between probe ends and second ligate the probe closed.
  • any remaining (non-target) linear nucleic acid can be removed, resulting in isolation (capture) of target nucleic acid.
  • Circularization selection-based preparative methods include molecular inversion probe capture reactions and 'selector' capture reactions. However, other techniques may be used as aspects of the invention are not limited in this respect.
  • molecular inversion probe capture of a target nucleic acid is indicative of the presence of a polymorphism in the target nucleic acid.
  • a variety of methods may be used to evaluate and compare bias profiles of each preparative technique.
  • Next-generation sequencing may be used to quantitatively measure the abundance of each isolated target nucleic acid obtained from a certain preparative method. This abundance may be compared to a control abundance value (e.g., a known starting abundance of the target nucleic acid) and/or with an abundance determined through the use of an alternative preparative method.
  • a set of target nucleic acids may be isolated by one or more of the three preparative methods; the target nucleic acid may be observed x times using the amplification technique, y times using the hybridization enrichment technique, and z times using the circularization selection technique.
  • a pairwise correlation coefficient may be computed between each abundance value (e.g., x and y, x and z, and y and z) to assess bias in nucleic acid isolation between pairs of preparative methods. Since the mechanisms of isolation are different in each approach, the abundances will usually be different and largely uncorrelated with each other.
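The pairwise comparison of abundances can be sketched with a Pearson correlation coefficient computed for each pair of preparative methods. The specification does not name a specific coefficient, so Pearson is an assumption here, and the function names and the per-target counts are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length abundance lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_a * var_b)

def pairwise_correlations(abundances):
    """abundances maps method name -> per-target counts; returns r for each pair."""
    names = sorted(abundances)
    return {
        (m, n): pearson(abundances[m], abundances[n])
        for i, m in enumerate(names)
        for n in names[i + 1:]
    }

# Hypothetical observation counts for the same four targets under three methods:
# x = amplification, y = hybridization enrichment, z = circularization selection.
counts = {"x": [120, 80, 40, 200], "y": [60, 150, 90, 70], "z": [30, 45, 110, 95]}
correlations = pairwise_correlations(counts)
```

Coefficients near zero for a pair of methods would be consistent with the two methods having largely independent biases, which is the situation in which combining them is described as most useful.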
  • the invention provides a method of obtaining a nucleic acid preparation that is representative of a target nucleic acid in a biological sample by obtaining a first preparation of a target nucleic acid using a first preparative method on a biological sample, obtaining a second preparation of a target nucleic acid using a second preparative method on the biological sample, and combining the first and second nucleic acid preparations to obtain a combined preparation that is representative of the target nucleic acid in the biological sample.
  • a third preparation of the target nucleic acid is obtained using a third preparative method that is different from the first and second preparative methods, wherein the first, second, and third preparative methods all have different systematic sequence biases.
  • the different preparative methods are used for a plurality of different loci in the biological sample to increase the sensitivity of a multiplex nucleic acid analysis.
  • the target nucleic acid has a sequence of a gene selected from Table 1.
  • a genotyping method of the invention may include several steps, each of which independently may involve one or more different preparative techniques described herein.
  • a nucleic acid preparation may be obtained using one or more (e.g., 2, 3, 4, 5, or more) different techniques described herein (e.g., amplification, hybridization capture, circular probe capture, etc., or any combination thereof) and the nucleic acid preparation may be analyzed using one or more different techniques (e.g., amplification, hybridization capture, circular probe capture, etc., or any combination thereof) that are selected independently of the techniques used for the initial preparation.
  • aspects of the invention also provide compositions, kits, devices, and analytical methods for increasing the sensitivity of nucleic acid assays. Aspects of the invention are particularly useful for increasing the confidence level of genotyping analyses. However, aspects of the invention may be used in the context of any suitable nucleic acid analysis, for example, but not limited to, a nucleic acid analysis that is designed to determine whether more than one sequence variant is present in a sample.
  • aspects of the invention relate to a plurality of nucleic acid probes (e.g., 10-50, 50-100, 100-250, 250-500, 500-1,000, 1,000-2,000, 2,000-5,000, 5,000-7,500, 7,500-10,000, or lower, higher, or intermediate number of different probes).
  • each probe or each of a subset of probes has a different first targeting arm.
  • each probe or each probe of a subset of probes has a different second targeting arm.
  • the first and second targeting arms are separated by the same intervening sequence.
  • the first and second targeting arms are complementary to target nucleic acid sequences that are separated by the same or a similar length (e.g., number of nucleic acids, for example, 0-25, 25-50, 50-100, 100-250, 250-500, 500-1,000, 1,000-2,500 or longer or intermediate number of nucleotides) on their respective target nucleic acids (e.g., genomic loci).
  • each probe or a subset of probes e.g., 10-25%, 25-50%, 50-75%, 75-90%, or 90-99%
  • the primer binding sequence is the same (e.g., it can be used to prime sequencing or other extension reaction).
  • each probe or a subset of probes includes a unique identifier sequence tag (e.g., that is predetermined and can be used to distinguish each probe).
  • the methods disclosed herein are useful for any application where sensitivity is important, for example, detection of cancer mutations in a heterogeneous tissue sample, detection of mutations in maternally-circulating fetal DNA, and detection of mutations in cells isolated during a preimplantation genetic diagnostic procedure.
  • the methods comprise obtaining a nucleic acid preparation using a preparative method (e.g., any of the preparative methods disclosed herein) on a biological sample, and performing a molecular inversion probe capture reaction on the nucleic acid preparation, wherein a molecular inversion probe capture (e.g., using a mutation-detection MIP) of a target nucleic acid of the nucleic acid preparation is indicative of the presence of a mutation (polymorphism) in the target nucleic acid, optionally wherein the polymorphism is selected from Table 2.
  • methods of genotyping a nucleic acid in a biological sample comprise obtaining a nucleic acid preparation using a preparative method on a biological sample, sequencing a target nucleic acid of the nucleic acid preparation, performing a molecular inversion probe capture reaction on the biological sample, wherein a molecular inversion probe capture of the target nucleic acid in the biological sample is indicative of the presence of a polymorphism in the target nucleic acid, and genotyping the target nucleic acid based on the results of the sequencing and the capture reaction.
  • the target nucleic acid has a sequence of a gene selected from Table 1. It should be appreciated that any one or more embodiments described herein may be used for evaluating multiple genetic markers in parallel. Accordingly, in some embodiments, aspects of the invention relate to determining the presence of one or more markers (e.g., one or more alleles) at multiple different genetic loci in parallel. Accordingly, the risk or presence of multiple heritable disorders may be evaluated in parallel. In some embodiments, the risk of having offspring with one or more heritable disorders may be evaluated. In some embodiments, an evaluation may be performed on a biological sample of a parent or a child (e.g., at a pre-implantation, prenatal, perinatal, or postnatal stage).
  • the disclosure provides methods for analyzing multiple genetic loci (e.g., a plurality of target nucleic acids selected from Table 1 or 2) from a patient sample, such as a blood, pre-implantation embryo, chorionic villus or amniotic fluid sample.
  • a patient or subject may be a human.
  • aspects of the invention are not limited to humans and may be applied to other species (e.g., mammals, birds, reptiles, other vertebrates or invertebrates) as aspects of the invention are not limited in this respect.
  • a subject or patient may be male or female.
  • samples from a male and female member of a couple may be analyzed.
  • samples from a plurality of male and female subjects may be analyzed to determine compatible or optimal breeding partners or strategies for particular traits or to avoid one or more diseases or conditions. Accordingly, reproductive risks may be determined and/or reproductive recommendations may be provided based on information derived from one or more embodiments of the invention.
  • aspects of the invention may be used in connection with any medical evaluation where the presence of one or more alleles at a genetic locus of interest is relevant to a medical determination (e.g., risk or detection of disease, disease prognosis, therapy selection, therapy monitoring, etc.). Further aspects of the invention may be used in connection with detection, in tumor tissue or circulating tumor cells, of mutations in cellular pathways that cause cancer or predict efficacy of treatment regimens, or with detection and identification of pathogenic organisms in the environment or a sample obtained from a subject, e.g., a human subject.
  • FIG. 1 illustrates a non-limiting embodiment of a tiled probe layout
  • FIG. 2 illustrates a non-limiting embodiment of a staggered probe layout
  • FIG. 3 illustrates a non-limiting embodiment of an alternating staggered probe layout
  • FIG. 4 panels a), b), and c) depict various non-limiting methods for combining differentiator tag sequence and target sequences (NNNN depicts a differentiator tag sequence);
  • FIG. 5 depicts a non-limiting method for genotyping based on target and differentiator tag sequences
  • FIG. 6 depicts non-limiting results of a simulation of a MIP capture reaction
  • FIG. 7 depicts a non-limiting graph of sequencing coverage
  • FIG. 8 illustrates that shorter sequences are captured with higher efficiency than longer sequences using MIPs
  • FIG. 9 illustrates a non-limiting scheme of padlock (MIP) capture of a region that includes both repetitive regions (thick wavy line) and the adjacent unique sequence (thick straight line);
  • FIG. 10 illustrates a non-limiting hypothetical relationship between target gap size and the relative number of reads of the repetitive region
  • FIG. 11A depicts MIP capture of FMR1 repeat regions from a diploid genome
  • FIG. 11B depicts preparative methods for biallelic resolution of FMR1 repeat region lengths in a diploid genome using MIP capture probes and unique differentiator tags;
  • FIG. 11C depicts an analysis of FMR1 repeat region lengths in a diploid genome
  • FIG. 12 is a schematic of an embodiment of an algorithm of the invention.
  • FIG. 13 illustrates a non-limiting example of a graph of per-target abundance with MIP capture
  • FIG. 14 shows a non-limiting graph of correlation between two MIP capture reactions.
  • FIGS. 15A-B show a SNaPshot validation of a putative Sanger variant call.
  • FIG. 15A discloses "GM17080" sequences as SEQ ID NO: 6328, 6329, and 6328 and FIG. 15B discloses the "GM17074" sequences as SEQ ID NO: 6328, 6328, and 6328, all respectively, in order of appearance.
  • FIGS. 16A-16D depict skewed allelic fractions in aneuploid cell line GM18540.
  • FIG. 16A depicts an IGV view of NGS data from GM18540 for the genotype call of interest (shown between vertical lines) (figure 16A discloses SEQ ID NO: 6330-6331).
  • FIG. 16B depicts bidirectional Sanger data for the variant-containing region.
  • FIG. 16C depicts a histogram of allele ratios for all non-reference genotype calls in chromosome 11 derived from whole-genome shotgun sequencing (WGSS) of GM18540 and control sample GM18537.
  • FIG. 16D depicts genome-wide relative coverage for GM18540. WGSS coverage data for each of the
  • FIGS. 17A-D depict detection of previously-uncharacterized mutations in samples from individuals affected with cystic fibrosis.
  • FIG. 17A depicts IGV of heterozygous splice site mutation c.3368-2A>T in sample GM12960 (figure 17A discloses SEQ ID NO: 6332-6333).
  • FIG. 17B depicts IGV of heterozygous premature stop codon mutation R1158X in sample GM18802 (figure 17B discloses SEQ ID NO: 6334-6335).
  • FIG. 17C depicts Sanger data confirming existence of mutation c.3368-2A>T in sample GM12960 (figure 17C discloses SEQ ID NO: 6336 and 6336).
  • FIG. 17D depicts Sanger data confirming existence of mutation R1158X in sample GM18802 (figure 17D discloses SEQ ID NO: 6337 and 6337).
  • FIGS. 18A-E depict next-generation DNA sequencing workflow according to certain embodiments.
  • FIG. 18B discloses (top panel) SEQ ID NO: 6338-6349, (left panel) SEQ ID NO: 6338-6343, and (right panel) SEQ ID NO: 6344-6349, all respectively, in order of appearance.
  • FIG. 18C discloses SEQ ID NO: 6350-6356, 6353, 6352, 6357, and 6357, respectively, in order of appearance.
  • FIG. 18D discloses (left panel) SEQ ID NO: 6352, 6358, 6350, 6352, 6358, 6350, 6359, and 6359, and (right panel) SEQ ID NO: 6360, 6361, 6355, 6360, 6361, 6355,
  • FIG. 18E discloses (left panel) SEQ ID NO: 6358, 6352, and 6350, (right panel) SEQ ID NO: 6360, 6361, and 6355, and (bottom panel) SEQ ID NO: 6364 and 6364, all respectively, in order of appearance.
  • FIGS. 19A-D depict data from genotyping by assembly template alignment (GATA).
  • GATA correctly genotypes insertions and deletions that are undetectable by the Alignment Only method.
  • each panel provides tracks for cumulative depth of coverage (vertical grey bars); representative MIP alignments (horizontal grey bars) with mismatches (letters), insertions (black bars), and gaps (dashed lines); chromatogram; reference DNA and amino acid sequence for FIG. 19A heterozygous BLM c.2207_2212delinsTAGATTC in sample GM04408 as well as several alleles in the first exon of SMPD1 (FIG. 19A discloses SEQ ID NO: 6365 and 6366) including FIG.
  • FIG. 19B a heterozygous 18 bp deletion in sample GM20342 (minus strand) (FIG. 19B discloses SEQ ID NO: 6367 and 6368)
  • FIG. 19C a heterozygous 12bp insertion and homozygous substitution in sample GM17282 (plus strand)
  • FIG. 19D compound heterozygous 6 and 12 bp deletions in sample GM00502 (minus strand) (FIG. 19D discloses SEQ ID NO: 6369 and 6370).
  • Chromatogram trace offsets corresponding to specific heterozygous insertion and deletion patterns are indicated with slanted lines color coded by reference base.
  • FIGS. 20A-1, 20A-2, 20A-3, 20B-1, 20B-2 and 20B-3 show NGS detection of allele dropout in Sanger reactions.
  • FIG. 20A-1 discloses SEQ ID NO: 6371, 6372, and 6372
  • FIG. 20A-2 discloses SEQ ID NO: 6371, 6371, 6372
  • FIG. 20A-3 discloses SEQ ID NO: 6373 and 6374, all respectively, in order of appearance.
  • FIG. 20B-1 discloses SEQ ID NO: 6371, 6372, and 6372
  • FIG. 20B-2 discloses SEQ ID NO: 6371, 6371, 6372
  • FIG. 20B-3 discloses SEQ ID NO: 6373 and 6374, all respectively, in order of appearance.
  • FIG. 21 diagrams use of methods of the invention to validate a genotyping by assembly-templated alignment (GATA) technique.
  • FIG. 22 illustrates obtaining sequence reads and inserting a simulated mutation.
  • FIG. 23 shows standard analysis of sequence reads for comparison to GATA.
  • FIG. 24 shows analysis by GATA.
DETAILED DESCRIPTION
  • aspects of the invention relate to preparative and analytical methods and compositions for evaluating genotypes, and in particular, for determining the allelic identity (or identities in a diploid organism) of one or more genetic loci in a subject. Aspects of the invention are based, in part, on the identification of different sources of ambiguity and error in genetic analyses, and, in part, on the identification of one or more approaches to avoid, reduce, recognize, and/or resolve these errors and ambiguities at different stages in a genetic analysis. Aspects of the invention relate to methods and compositions for addressing bias and/or stochastic variation associated with one or more preparative and/or analytical steps of a nucleic acid evaluation technology.
  • preparative methods can be adapted to avoid or reduce the risk of bias skewing the results of a genetic analysis.
  • analytical methods can be adapted to recognize and correct for data variations that may give rise to misinterpretation (e.g., incorrect calls such as homozygous when the subject is actually heterozygous or heterozygous when the subject is actually homozygous).
  • Methods of the invention may be used for any type of mutation, for example a single base change (e.g., insertion, deletion, transversion or transition, etc.), a multiple base insertion, deletion, duplication, inversion, and/or any other change or combination thereof.
  • additional or alternative techniques may be used to address loci characterized by multiple repeats of a core sequence where the length of the repeat is longer than a typical sequencing read thereby making it difficult to determine whether a deletion or duplication of one or more core sequence units has occurred based solely on a sequence read.
  • increased confidence in an assay result may be obtained by i) selecting two or more different preparative and/or analytical techniques that have different biases (e.g., known to have different biases), ii) evaluating a patient sample using the two or more different techniques, iii) comparing the results from the two or more different techniques, and/or iv) determining whether the results are consistent for the two or more different techniques.
  • step (iv) if determining in step (iv) indicates that the results are consistent (e.g., the same) then increased confidence in the assay result is obtained. In other embodiments, if determining in step (iv) indicates that the results are inconsistent (e.g., that the results are ambiguous) then one or more additional preparative and/or analytical techniques, which have a different bias (e.g., known to have a different bias) compared with the two or more different preparative and/or analytical techniques selected in step (i), are used to evaluate the patient sample, and the results of the one or more additional preparative and/or analytical techniques are compared with the results from step (ii) to resolve the inconsistency.
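  • By way of non-limiting illustration, the comparison of steps (i) to (iv) and the use of an additional technique to resolve an inconsistency can be sketched as follows. The function name, the representation of genotype calls as normalized strings, and the plurality rule used to break a disagreement are assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import Counter
from typing import Optional, Sequence


def reconcile_calls(primary: Sequence[str],
                    additional: Sequence[str] = ()) -> Optional[str]:
    """Return a consensus genotype call, or None if the result stays ambiguous.

    primary    -- calls from the two or more techniques selected in step (i)
    additional -- calls from further techniques having a different bias,
                  consulted only when the primary calls disagree
    Calls are assumed to be normalized strings (e.g. "A/G", never "G/A").
    """
    if len(primary) < 2:
        raise ValueError("at least two techniques are needed for a comparison")
    # Step (iv): consistent results give increased confidence in the call.
    if len(set(primary)) == 1:
        return primary[0]
    # Inconsistent results: add the extra technique(s) and take the most
    # frequent call (assumed rule); a tie is reported as unresolved.
    counts = Counter(list(primary) + list(additional))
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None
    return top
```

    A return value of None signals that the sample remains ambiguous and that a further technique or a repeat analysis would be needed.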
  • two or more independent samples may be obtained from a subject and independently analyzed. In some embodiments, two or more independent samples are obtained at approximately the same time point. In some embodiments, two or more independent samples are obtained at multiple different time points. In some embodiments, the use of two or more independent sample facilitates the elimination, normalization, and/or quantification of stochastic measurement noise. It is to be appreciated that two or more independent samples may be obtained in connection with any of the methods disclosed herein, including, for example, methods for pathogen profiling in a human or other animal subjects, monitoring tumor progression/regression, analyzing circulating tumor cells, analyzing fetal cells in maternal circulation, and analyzing/monitoring/profiling of environmental pathogens.
  • one or more of the techniques described herein may be combined in a single assay protocol for evaluating multiple patient samples in parallel.
  • aspects of the invention may be useful for high throughput, cost-effective, yet reliable, genotyping of multiple patient samples (e.g., in parallel, for example in multiplex reactions).
  • aspects of the invention are useful to reduce the error frequency in a multiplex analysis.
  • Certain embodiments may be particularly useful where multiple reactions (e.g., multiple loci and/or multiple patient samples) are being processed. For example, 10-25, 25-50, 50-75, 75-100 or more loci may be evaluated for each subject out of any number of subject samples that may be processed in parallel (e.g., 1-25, 25-50, 50-100, 100-500, 500-1,000, 1,000-2,500, 2,500-5,000 or more or intermediate numbers of patient samples).
  • different embodiments of the invention may involve conducting two or more target capture reactions and/or two or more patient sample analyses in parallel in a single multiplex reaction.
  • a plurality of capture reactions e.g., using different capture probes for different target loci
  • a plurality of captured nucleic acids from each one of a plurality of patient samples may be combined in a single multiplex analysis reaction.
  • samples from different subjects are tagged with subject-specific (e.g., patient-specific) tags (e.g., unique sequence tags) so that the information from each product can be assigned to an identified subject.
  • each of the different capture probes used for each patient sample has a common patient-specific tag.
  • the capture probes do not have patient-specific tags, but the captured products from each subject may be amplified using one or a pair of amplification primers that are labeled with a patient-specific tag.
  • Other techniques for associating a patient-specific tag with the captured product from a single patient sample may be used as aspects of the invention are not limited in this respect.
  • patient-specific tags as used herein may refer to unique tags that are assigned to identify patients in a particular assay. The same tags may be used in a separate multiplex analysis with a different set of patient samples (e.g., from different patients) each of which is assigned one of the tags.
  • different sets of unique tags may be used in sequential (e.g., alternating) multiplex reactions in order to reduce the risk of contamination from one assay to the next and allow contamination to be detected on the basis of the presence of tags that are not expected to be present in a particular assay.
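  • By way of non-limiting illustration, assignment of sequence reads to subjects by patient-specific tag, and detection of possible contamination from tags that are not expected in a given assay, can be sketched as follows. The sketch assumes, for simplicity, that the tag occupies a fixed number of bases at the start of each read and is matched exactly; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_subject, tag_len):
    """Split reads by leading tag.

    Returns (by_subject, unexpected):
      by_subject -- subject id -> list of read sequences with the tag removed
      unexpected -- tag -> count, for tags not assigned in this assay
                    (candidates for carry-over contamination)
    """
    by_subject = defaultdict(list)
    unexpected = defaultdict(int)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        subject = tag_to_subject.get(tag)
        if subject is None:
            unexpected[tag] += 1
        else:
            by_subject[subject].append(insert)
    return dict(by_subject), dict(unexpected)


# Toy data: two expected tags and one read carrying a tag from another set.
reads = ["ACGTTTGA", "TGCAGGCC", "ACGTCCAA", "GGGGAAAA"]
tags = {"ACGT": "patient_1", "TGCA": "patient_2"}
by_subject, unexpected = demultiplex(reads, tags, tag_len=4)
```

    A non-empty `unexpected` result is what would allow contamination from a previous assay to be noticed when alternating tag sets are used.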
  • Embodiments of the invention may be used for any of a number of different settings: reproductive settings, disease screening, identifying subjects having cancer, identifying subjects having increased risk for a disease, stratifying a population of subjects according to one or more of a number of factors, for example responsiveness to a particular drug, presence or absence of an adverse reaction (or risk thereof) to a particular drug, and/or providing information for medical records (e.g., homozygosity, heterozygosity at one or more loci).
  • the invention is not limited to genomic analysis of patient samples.
  • aspects of the invention may be useful for high throughput genetic analysis of environment samples to detect pathogens.
  • the methods disclosed herein are useful for diagnosis of one or more heritable disorders.
  • a heritable disorder that may be diagnosed with the methods disclosed herein is a genetic disorder that is prevalent in the Ashkenazi Jewish population.
  • the heritable disorders are selected from: 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia; ABCC8-Related Hyperinsulinism; Alpha-Thalassemia, includes Constant Spring, & MR associated; Arylsulfatase A Deficiency-Metachromatic Leukodystrophy; Biotinidase Deficiency-Holocarboxylase Synthetase
  • Hemoglobinopathies (beta-chain disorders); Glycogen Storage Disease Type 1A; Maple Syrup Urine Disease Types 1A, 1B, 2, 3; Medium Chain Acyl-Coenzyme A Dehydrogenase
  • Peroxisomal Bifunctional Enzyme Deficiencies including Zellweger, NALD, and/or infantile Refsums. However, all of these, subsets of these, other genes, or combinations thereof may be used.
  • the disclosure relates to multiplex diagnostic methods.
  • multiplex diagnostic methods comprise capturing a plurality of genetic loci in parallel (e.g., a genetic locus of Table 1).
  • genetic loci possess one or more polymorphisms (e.g., a polymorphism of Table 2) the genotypes of which correspond to disease causing alleles.
  • the disclosure provides methods for assessing multiple heritable disorders in parallel.
  • methods are provided for diagnosing multiple heritable disorders in parallel at a pre-implantation, prenatal, perinatal, or postnatal stage.
  • the disclosure provides methods for analyzing multiple genetic loci (e.g., a plurality of target nucleic acids selected from Table 1) from a patient sample, such as a blood, pre-implantation embryo, chorionic villus or amniotic fluid sample.
  • a patient or subject may be a human.
  • aspects of the invention are not limited to humans and may be applied to other species (e.g., mammals, birds, reptiles, other vertebrates or invertebrates) as aspects of the invention are not limited in this respect.
  • a subject or patient may be male or female.
  • samples from a male and female member of a couple may be analyzed.
  • samples from a plurality of male and female subjects may be analyzed to determine compatible or optimal breeding partners or strategies for particular traits or to avoid one or more diseases or conditions.
  • any other diseases may be studied and/or risk factors for diseases or disorders including, but not limited to allergies, responsiveness to treatment, cancer tumor profiling for treatment and prognosis, monitoring and identification of patient infections, and monitoring of environmental pathogens.
  • aspects of the invention relate to methods that reduce bias and increase reproducibility in multiplex detection of genetic loci, e.g., for diagnostic purposes.
  • Molecular inversion probe technology is used to detect or amplify particular nucleic acid sequences in potentially complex mixtures. Use of molecular inversion probes has been demonstrated for detection of single nucleotide polymorphisms (Hardenbol et al. 2005 Genome Res 15:269-75) and for preparative amplification of large sets of exons (Porreca et al. 2007 Nat Methods 4:931-6, Krishnakumar et al. 2008 Proc Natl Acad Sci USA 105:9296-301).
  • aspects of the disclosure are based, in part, on the discovery of effective methods for overcoming challenges associated with systematic errors (bias) in multiplex genomic capture and sequencing methods, namely high variability in target nucleic acid representation and unequal sampling of heterozygous alleles in pools of captured target nucleic acids (e.g., isolated from a biological sample). Accordingly, in some embodiments, the disclosure provides methods that reduce variability in the detection of target nucleic acids in multiplex capture methods. In other embodiments, methods improve allelic representation in a capture pool and, thus, improve variant detection outcomes.
  • the disclosure provides preparative methods for capturing target nucleic acids (e.g., genetic loci) that involve the use of different sets of multiple probes (e.g., molecular inversion probes, MIPs) that capture overlapping regions of a target nucleic acid to achieve a more uniform representation of the target nucleic acids in a capture pool compared with methods of the prior art.
  • methods reduce bias, or the risk of bias, associated with large scale parallel capture of genetic loci, e.g., for diagnostic purposes.
  • methods are provided for increasing reproducibility (e.g., by reducing the effect of polymorphisms on target nucleic acid capture) in the detection of a plurality of genetic loci in parallel.
  • methods are provided for reducing the effect of probe synthesis and/or probe amplification variability on the analysis of a plurality of genetic loci in parallel.
  • a 'probe' is a nucleic acid having a central region flanked by a 5' region and a 3' region that are complementary to nucleic acids flanking the same strand of a target nucleic acid or subregion thereof.
  • An exemplary probe is a molecular inversion probe (MIP).
  • a 'target nucleic acid' may be a genetic locus. Exemplary genetic loci are disclosed herein in Table 1 (RefSeqGene Column).
  • probes have been typically designed to meet certain constraints (e.g., melting temperature, G/C content, etc.) known to partially affect capture/amplification efficiency (Ball et al. (2009) Nat Biotech 27:361-8 and Deng et al. (2009) Nat Biotech 27:353-60), a set of constraints which is sufficient to ensure either largely uniform or highly reproducible
  • the disclosure provides multiple MIPs per target to be captured, where each MIP in a set designed for a given target nucleic acid has a central region and a 5' region and 3' region ('targeting arms') which hybridize to (at least partially) different nucleic acids in the target nucleic acid
  • the methods involve designing a single probe for each target (a target can be as small as a single base or as large as a kilobase or more of contiguous sequence). It may be preferable, in some cases, to design probes to capture molecules (e.g., target nucleic acids or subregions thereof) having lengths in the range of 1-200 bp (as used herein, a bp refers to a base pair on a double-stranded nucleic acid; however, where lengths are indicated in bps, it should be appreciated that single-stranded nucleic acids having the same number of bases, as opposed to base pairs, in length also are contemplated by the invention). However, probe design is not so limited.
  • probes can be designed to capture targets having lengths in the range of up to 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 1000, or more bps, in some cases.
  • the length of a capture molecule e.g., a target nucleic acid or subregion thereof
  • the target length should typically match the sequencing read-length so that shotgun library construction is not necessary.
  • captured nucleic acids may be sequenced using any suitable sequencing technique as aspects of the invention are not limited in this respect.
  • target nucleic acids are too large to be captured with one probe. Consequently, it may be necessary to capture multiple subregions of a target nucleic acid in order to analyze the full target.
  • a subregion of a target nucleic acid is at least 1 bp. In other embodiments, a subregion of a target nucleic acid is at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 bp or more. In other embodiments, a subregion of a target nucleic acid has a length that is up to 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or more percent of a target nucleic acid length.
  • MIPs are designed such that they are several hundred basepairs (e.g., up to 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 bp or more) longer than corresponding target (e.g., subregion of a target nucleic acid, target nucleic acid).
  • lengths of subregions of a target nucleic acid may differ. For example, if a target nucleic acid contains regions for which probe hybridization is not possible or inefficient, it may be necessary to use probes that capture subregions of one or more different lengths in order to avoid hybridization with problematic nucleic acids and capture nucleic acids that encompass a complete target nucleic acid.
  • the set of probes for a given target can be designed to 'tile' across the target, capturing the target as a series of shorter sub-targets.
  • some probes in the set capture flanking non-target sequence.
  • the set can be designed to 'stagger' the exact positions of the hybridization regions flanking the target, capturing the full target (and in some cases capturing flanking non-target sequence) with multiple probes having different targeting arms, obviating the need for tiling.
  • the particular approach chosen will depend on the nature of the target set. For example, if small regions are to be captured, a staggered-end approach might be appropriate, whereas if longer regions are desired, tiling might be chosen. In all cases, the amount of bias-tolerance for probes targeting pathological loci can be adjusted ('dialed in') by changing the number of different MIPs used to capture a given molecule.
  • the 'coverage factor', or number of probes used to capture a base pair in a molecule, is an important parameter to specify. Different numbers of probes per target are indicated depending on whether one is using the tiling approach (see, e.g., FIG. 1) or one of the staggered approaches (see, e.g., FIG. 2 or 3).
  • FIG. 1 illustrates a non-limiting embodiment of a tiled probe layout showing ten captured sub-targets tiled across a single target. Each position in the target is covered by three sub-targets such that MIP performance per base pair is averaged across three probes.
  • FIG. 2 illustrates a non-limiting embodiment of a staggered probe layout showing the targets captured by a set of three MIPs.
  • Each MIP captures the full target, shown in black, plus (in some cases) additional extra-target sequence, shown in gray, such that the targeting arms of each MIP fall on different sequence.
  • Each position in the target is covered by three sub-targets such that MIP performance per basepair is averaged across three probes.
  • Targeting arms land immediately adjacent to the black or gray regions shown. It should be appreciated that in some embodiments, the targeting arms (not shown) can be designed so that they do not overlap with each other.
  • FIG. 3 illustrates a non-limiting embodiment of an alternating staggered probe layout showing the targets captured by a set of three MIPs.
  • Each MIP captures the full target, shown in black, plus (in some cases) additional extra-target sequence, shown in gray, such that the targeting arms of each MIP fall on different sequence.
  • Each position in the target is covered by three sub-targets such that MIP performance per basepair is averaged across three probes.
  • Targeting arms land immediately adjacent to the black or gray regions shown.
  • the targeting arms on adjacent tiled or staggered probes may be designed to either overlap, not overlap, or overlap for only a subset of the probes.
  • a coverage factor of about 3 to about 10 is used.
  • the methods are not so limited and coverage factors of up to 2, 3, 4, 5, 6, 7, 8, 9, 10, 20 or more may be used. It is to be appreciated that the coverage factor selected may depend on the probe layout being employed.
  • the number of probes per target is typically a function of target length, sub-target length, and spacing between adjacent sub-target start locations (step size).
  • a 200 bp target with a start-site separation of 20 bp and sub-target length of 60 bp may be encompassed with 12 MIPs (FIG. 1).
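  • By way of non-limiting illustration, the relationship between target length, sub-target length, step size and number of probes in a tiled layout can be sketched as follows. The function names are hypothetical, coordinates are half-open intervals relative to the first base of the target, and the sketch assumes that the sub-target length is a whole multiple of the step size so that every target position receives the same coverage. Negative coordinates and coordinates beyond the target length correspond to flanking non-target sequence captured by the outermost probes.

```python
def tile_layout(target_len, sub_len, step):
    """Capture intervals (start, end) that tile a target of target_len bases.

    The first sub-target starts (sub_len - step) bases upstream of the target
    so that the first target base is already covered sub_len // step times.
    """
    if step <= 0 or sub_len % step:
        raise ValueError("sub_len must be a positive multiple of step")
    return [(s, s + sub_len) for s in range(step - sub_len, target_len, step)]


def coverage(intervals, target_len):
    """Number of capture intervals covering each target position."""
    return [sum(s <= p < e for s, e in intervals) for p in range(target_len)]


# The 200 bp target, 60 bp sub-target, 20 bp step example from the text.
layout = tile_layout(target_len=200, sub_len=60, step=20)
```

    For these values the sketch yields 12 capture intervals and a coverage factor of 3 (60 / 20) at every target position, in line with the 12 MIPs stated for the example.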
  • a specific coverage factor may be achieved by varying the number of probes per target nucleic acid and the length of the molecules captured.
  • a fixed-length target nucleic acid is captured as several subregions or as 'super-targets', which are molecules comprising the target nucleic acid and additional flanking nucleic acids, which may be of varying lengths.
  • a target of 50 bp can be captured at a coverage factor of 3 with 3 probes in either a 'staggered' (FIG. 2) or 'alternating staggered' configuration (FIG. 3).
  • the coverage factor will be driven by the extent to which detection bias is tolerable. In some cases, where the bias tolerance is small, it may be desirable to target more subregions of target nucleic acid with, perhaps, higher coverage factors. In some embodiments, the coverage factor is up to 2, 3, 4, 5, 6, 7, 8, 9, 10 or more.
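  • By way of non-limiting illustration, one possible way to obtain the property described for a staggered layout, namely that every probe captures the full target while the targeting arms of each probe fall on different sequence, is sketched below. The symmetric geometry (each successive probe extending the capture by a fixed offset on both sides) is an assumption made for this sketch; the layouts of FIG. 2 and FIG. 3 may differ. The function name and the offset value are hypothetical.

```python
def staggered_layout(target_len, n_probes, offset):
    """Capture intervals (start, end), half-open, relative to the target start.

    Probe i captures the whole target plus i * offset bases of flanking
    sequence on each side, so no two probes share a capture boundary and
    their targeting arms therefore land on different sequence.
    """
    if n_probes < 1 or offset < 1:
        raise ValueError("n_probes and offset must be positive")
    return [(-i * offset, target_len + i * offset) for i in range(n_probes)]


# A 50 bp target captured at a coverage factor of 3 with 3 probes.
layout = staggered_layout(target_len=50, n_probes=3, offset=10)
```

    Every interval in `layout` contains the full 50 bp target, so each target position is covered by all three probes.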
  • T = target length; S = sub-target length; C = coverage factor
  • the disclosure provides methods to increase the uniformity of amplification efficiency when multiple molecules are amplified in parallel; methods to increase the reproducibility of amplification efficiency; methods to reduce the contribution of targeting probe variability to amplification efficiency; methods to reduce the effect on a given target nucleic acid of polymorphisms in probe hybridization regions; and/or methods to simplify downstream workflows when multiplex amplification by MIPs is used as a preparative step for analysis by nucleic acid sequencing.
  • Polymorphisms in the target nucleic acid under the regions flanking a target can interfere with hybridization, polymerase fill-in, and/or ligation. Furthermore, this may occur for only one allele, resulting in allelic drop-out, which ultimately decreases downstream sequencing accuracy.
  • the probability of loss from polymorphism is substantially decreased because not all targeting arms in the set of MIPs will cover the location of the mutation.
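  • By way of non-limiting illustration, the reason a set of probes with different targeting arms is less exposed to a polymorphism can be sketched as follows. For a position of interest in the target, the sketch separates the probes whose targeting arm would hybridize over a given polymorphism from those whose arms lie elsewhere. The arm length of 20 bases, the tiled layout and the positions used are hypothetical values chosen for the example.

```python
def arms(capture, arm_len):
    """Intervals hybridized by the two targeting arms flanking a capture."""
    start, end = capture
    return [(start - arm_len, start), (end, end + arm_len)]


def split_by_polymorphism(captures, arm_len, query_pos, variant_pos):
    """Among probes covering query_pos, separate those with an arm on variant_pos.

    Returns (at_risk, unaffected) lists of capture intervals.
    """
    at_risk, unaffected = [], []
    for cap in captures:
        if not (cap[0] <= query_pos < cap[1]):
            continue  # this probe does not capture the position of interest
        hit = any(a <= variant_pos < b for a, b in arms(cap, arm_len))
        (at_risk if hit else unaffected).append(cap)
    return at_risk, unaffected


# Tiled layout: 60-base sub-targets every 20 bases across a 200-base target.
captures = [(s, s + 60) for s in range(-40, 200, 20)]
at_risk, unaffected = split_by_polymorphism(
    captures, arm_len=20, query_pos=30, variant_pos=70)
```

    In this example position 30 is captured by three probes, only one of which has an arm over the polymorphism at position 70, so two probes can still capture both alleles.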
  • Probes for MIP capture reactions may be synthesized on programmable microarrays because of the large number of sequences required. Because of the low synthesis yields of these methods, a subsequent amplification step is required to produce sufficient probe for the MIP amplification reaction. The combination of multiplex oligonucleotide synthesis and pooled amplification results in uneven synthesis error rates and representational biases. By synthesizing multiple probes for each target, variation from these sources may be averaged out because not all probes for a given target will have the same error rates and biases.
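  • By way of non-limiting illustration, the averaging effect can be quantified under a simplifying assumption that is not stated in the text: that each probe's efficiency is independent of the others and equally likely to take any of a few discrete levels. Under that assumption the variance of the mean efficiency over n probes is the single-probe variance divided by n. The sketch below checks this exactly by enumeration; the efficiency levels are hypothetical.

```python
from fractions import Fraction
from itertools import product


def variance_of_pooled_efficiency(levels, n_probes):
    """Exact variance of the mean efficiency of n_probes independent probes,
    each equally likely to have any of the efficiencies in `levels`."""
    means = [sum(combo, Fraction(0)) / n_probes
             for combo in product(levels, repeat=n_probes)]
    mu = sum(means, Fraction(0)) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)


# Hypothetical relative efficiencies: half, nominal, and one and a half times.
levels = [Fraction(1, 2), Fraction(1), Fraction(3, 2)]
single = variance_of_pooled_efficiency(levels, 1)   # Fraction(1, 6)
triple = variance_of_pooled_efficiency(levels, 3)   # Fraction(1, 18)
```

    With three probes per target the variance falls to one third of the single-probe value; correlated errors between probes would reduce this benefit.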
  • Multiplex amplification strategies disclosed herein may be used analytically, as in detection of SNPs, or preparatively, often for next-generation sequencing or other sequencing techniques.
  • the output of an amplification reaction is generally the input to a shotgun library protocol, which then becomes the input to the sequencing platform.
  • the shotgun library is necessary in part because next-generation sequencing yields reads significantly shorter than amplicons such as exons.
  • tiling also obviates the need for shotgun library preparation. Since the length of the capture molecule can be specified when the probes, e.g., MIPs, are designed, it can be chosen to match the readlength of the sequencer. In this way, reads can 'walk' across an exon by virtue of the start position of each capture molecule in the probe set for that exon.
  • Exemplary molecular inversion probes are provided in Appendix A. These molecular inversion probes are designed to capture targets or sub-regions thereof on one or more genes listed in Table 5 (provided in Example 8). In certain applications, the molecular inversion probes provided in Appendix A may be used to tile-capture targets or sub-regions thereof on the one or more genes provided in Table 5. In particular applications, two or more of the molecular inversion probes of Appendix A tile across different, but overlapping, sub-regions of one or more genes listed in Table 5 so that a target on the gene is captured by both of the two or more molecular inversion probes, as exemplified in Figure 1.
  • the molecular inversion probes of Appendix A that are chosen to tile-capture a target depend on the desired amount of overlapping coverage for the target.
  • two or more molecular inversion probes of Appendix A, being in directly ascending SEQ ID NO: order and corresponding to a target nucleic acid, will tile across the target nucleic acid with a period of 25 base pairs such that every genomic position of the target nucleic acid is captured by multiple probes with orthogonal targeting arm sequences. If less coverage is desired for a target nucleic acid, one may select, for example, every other molecular inversion probe of Appendix A, in ascending order, that corresponds to that target.
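  • The tiling arithmetic described above can be sketched as follows. This is a minimal illustration, assuming the 130 bp capture region and 25 bp period stated for Appendix A; the function names `tile_starts` and `coverage_at`, the example coordinates, and the choice of where the first window begins are hypothetical and are not taken from the appendices.

```python
def tile_starts(target_start, target_end, capture_len=130, period=25):
    """Start coordinates of capture windows tiling [target_start, target_end).

    The first window is placed early enough that the first target base is
    already covered by several windows (an illustrative choice).
    """
    first = target_start - capture_len + period
    return list(range(first, target_end, period))


def coverage_at(pos, starts, capture_len=130):
    """Number of capture windows [s, s + capture_len) that contain pos."""
    return sum(1 for s in starts if s <= pos < s + capture_len)


# Hypothetical 300 bp target at coordinates 1000..1299.
starts = tile_starts(1000, 1300)
depths = [coverage_at(p, starts) for p in range(1000, 1300)]

# Selecting every other probe halves the tiling density.
every_other = starts[::2]
sparse_depths = [coverage_at(p, every_other) for p in range(1000, 1300)]

print(min(depths), max(depths))
print(min(sparse_depths), max(sparse_depths))
```

  • Under these assumptions every target position falls inside several capture windows, and dropping alternate probes lowers the depth while still leaving each position covered by more than one probe.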
  • the first and second targeting arms of the molecular inversion probes are designed to hybridize to nucleotides upstream and downstream of a capture region of a gene (i.e. the targeting arms flank the region to be captured).
  • the capture region may be a target nucleic acid or a sub-region thereof.
  • Appendix B lists the capture regions of the genes that correspond to the molecular inversion probes listed in Appendix A.
  • Appendix A also specifies the upstream and downstream regions of the capture regions corresponding to each targeting arm of the molecular inversion probes.
  • the upstream and downstream regions of the capture region are between the start position and the end position coordinates, which are relative to the Human Genome 18 (HG 18).
  • the molecular inversion probes of Appendix A include a central region flanked by a 5' first targeting arm (i.e. ligation arm or left arm) and a 3' second targeting arm (i.e. extension arm or right arm).
  • the targeting arm sequences are shown in lowercase letters and the central region sequence is shown in uppercase letters.
  • the 5' first targeting arm and the 3' second targeting arm of the molecular inversion probes provided in Appendix A include a total of 40 nucleotides, and are designed to flank 130 bp capture regions.
  • the genes listed in Table 5 correspond to diseases, and as such, the molecular inversion probes listed in Appendix A can be utilized to analyze one or more of the diseases provided in Table 5.
  • the molecular inversion probes provided in Appendix A are described in more detail in Example 8.
  • While all of the molecular inversion probes provided in Appendix A may be used in a single assay to comprehensively examine several or all of the genomic regions of the genes provided in Table 5, one may also select one or more molecular inversion probes provided in Appendix A to evaluate one or more targets present in one gene or a combination of the genes provided in Table 5. For example, one may choose to only examine the coding regions of one or more of the genes listed in Table 5, and therefore use the one or more molecular inversion probes designed to capture those regions. In another example, one may choose to only examine the non-coding regions of one or more genes listed in Table 5, and therefore use the one or more molecular inversion probes designed to capture those regions.
  • the sequence of the central region of the molecular inversion probes may be different from the sequence of the central region provided in Appendix A without changing the capture region of the probe.
  • the sequence chosen for the central region is preferably the same across each molecular inversion probe in a set of probes. This allows the capture targets to be amplified with a single set of primers. It is also preferable that the central region is designed so that it is not complementary to the target sequences or any other sequence in the sample.
  • molecular inversion probes may be used to tile-capture different regions of the genes listed in Table 5.
  • Those molecular inversion probes may include a different first targeting arm, second targeting arm, and/or central region from the molecular inversion probes listed in Appendix A.
  • a modified molecular inversion probe may include the first targeting arm sequence of SEQ ID NO: 300, but have a different sequence for the central region and the second targeting arm. The specific sequences and length of the sequences chosen for the first targeting arm, second targeting arm, and/or central region depend on the desired capture region and coverage.
  • the molecular inversion probes for tile or staggered capture are selected to maximize performance with respect to both capture efficiency and robustness to common polymorphisms.
  • methods of the invention involve designing all possible probes capable of targeting a genomic interval and ranking the probes based on a number of score tuples or ranking factors.
  • the possible probes are assigned score tuples including, but not limited to: 1) presence of guanine or cytosine as the 5'-most base of the ligation arm, 2) the number of dbSNP (version 130) entries intersecting targeting arm sites, and 3) the root mean squared deviation of the targeting arms' predicted melting temperatures from optimal values derived from empirical studies of efficiencies.
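  • The ranking of candidate probes by score tuple can be sketched as below. This is an illustration only: the source names the three factors but does not state how they are ordered or weighted, whether a G or C at the 5'-most base is favorable, how melting temperature is predicted, or what the optimal values are. Here the Wallace rule stands in for a real melting temperature model, 60 °C is a placeholder optimum, and `CandidateProbe`, `score_tuple` and `rank_probes` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidateProbe:
    name: str
    ligation_arm: str    # 5'->3' sequence of the ligation arm
    extension_arm: str   # 5'->3' sequence of the extension arm
    dbsnp_hits: int      # dbSNP entries intersecting either arm site


def wallace_tm(seq):
    """Crude melting temperature estimate (Wallace rule); a stand-in only."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


def score_tuple(probe, optimal_lig_tm=60.0, optimal_ext_tm=60.0):
    """Build a tuple in which smaller values rank better (assumed ordering)."""
    starts_gc = probe.ligation_arm[0].upper() in "GC"
    dev_lig = wallace_tm(probe.ligation_arm) - optimal_lig_tm
    dev_ext = wallace_tm(probe.extension_arm) - optimal_ext_tm
    rmsd = math.sqrt((dev_lig ** 2 + dev_ext ** 2) / 2)
    # Assumption: a G/C 5'-most base is treated as favorable (0 sorts first).
    return (0 if starts_gc else 1, probe.dbsnp_hits, rmsd)


def rank_probes(probes):
    """Sort candidates lexicographically by their score tuples."""
    return sorted(probes, key=score_tuple)


candidates = [
    CandidateProbe("p1", "ATGCATGCATGCATGCATGC", "GGCATGCATGCATGCATGCA", 0),
    CandidateProbe("p2", "GTGCATGCATGCATGCATGC", "GGCATGCATGCATGCATGCA", 1),
    CandidateProbe("p3", "GTGCATGCATGCATGCATGC", "GGCATGCATGCATGCATGCA", 0),
]
ranked = [p.name for p in rank_probes(candidates)]
print(ranked)
```

  • Sorting on a tuple keeps the factors separate, so the priority among them can be changed by reordering the tuple elements rather than by tuning weights.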
  • methods of the invention provide for shearing or fragmenting genomic nucleic acid prior to performing capture with a molecular inversion probe (e.g., capture with one or more of the molecular inversion probes provided in Appendix A). Fragmenting the genomic nucleic acid prior to performing a capture reaction allows for greater exposure of a target site to a molecular inversion probe, which reduces failed capture and increases the percentage of molecular inversion probes that hybridize to targets within the genome. This advantageously yields a target abundance distribution that is significantly more uniform than if a native high molecular weight genomic nucleic acid is used.
  • Molecular inversion techniques involving a fragmenting step are described in co-owned and co-assigned U.S. Serial Number 13/448,961, having U.S. Publication No. 2012/0252020, entitled "Capture Reactions.”
  • Fragmenting the nucleic acid can be accomplished by any technique known in the art.
  • Exemplary techniques include mechanically fragmenting, chemically fragmenting, and/or enzymatically fragmenting.
  • Mechanical nucleic acid fragmentation can be, for example, sonication, nebulization, and hydro-shearing (e.g., point-sink shearing).
  • Enzymatic nucleic acid fragmenting includes, for example, use of nicking endonucleases or restriction endonucleases.
  • the nucleic acid can also be chemically fragmented by performing acid hydrolysis on the nucleic acid or treating the nucleic acid with alkali or other reagents. The fragment length can be adjusted based on the sizes of the nucleic acid targets to be captured.
  • the nucleic acid fragments can be of uniform length or of a distribution of lengths.
  • the nucleic acid is fragmented into nucleic acid fragments having a length of about 10 kb or 20 kb.
  • the nucleic acid fragments can range from between 1 kb to 20 kb, with various distributions.
  • the nucleic acid is also denatured, which may occur prior to, during, or after the fragmenting step.
  • the nucleic acid can be denatured using any means known in the art, such as pH-based denaturing, heat-based denaturing, formamide or urea, exonuclease degradation, or endonuclease nicking.
  • the use of pH, such as in acid hydrolysis, alone or in combination with heat, fragments and either partially or fully denatures the nucleic acid. This combined fragmenting and denaturing method can be used to fragment the nucleic acid for MIP capture or to fragment captured target nucleic acids or whole genomic DNA for shotgun library preparation.
  • a nucleic acid is fragmented by heating a nucleic acid immersed in a buffer system at a certain temperature for a certain period of time to initiate hydrolysis and thus fragment the nucleic acid.
  • the pH of the buffer system, duration of heating, and temperature can be varied to achieve a desired fragmentation of the nucleic acid.
  • once a genomic nucleic acid is purified, it is resuspended in a Tris-based buffer at a pH between 7.5 and 8.0, such as Qiagen's DNA hydrating solution. The resuspended genomic nucleic acid is then heated to 65°C and incubated overnight (about 16-24 hours) at 65°C.
  • Heating shifts the pH of the buffer into the low- to mid-6 range, which leads to acid hydrolysis.
  • the acid hydrolysis causes the genomic nucleic acid to fragment into single-stranded and/or double-stranded products.
  • the above method of fragmenting can be modified by increasing the temperature and reducing the heating time.
  • a nucleic acid is fragmented by incubating the nucleic acid in the Tris-based buffer at a pH between 7.5 and 8.0 for 15 minutes at 92°C.
  • the pH of the Tris-based buffer can be adjusted to achieve a desired nucleic acid fragmentation.
  • the captured target may further be subjected to an enzymatic gap-filling and ligation step, such that a copy of the target sequence is incorporated into a circle.
  • Capture efficiency of the MIP to the target sequence on the nucleic acid fragment can be improved by lengthening the hybridization and gap-filling incubation periods. (See, e.g., Turner EH, et al., Nat Methods. 2009 Apr 6: 1-2.)
  • the result of molecular inversion probe capture as described above is a library of circular target probes, which then can be processed in a variety of ways.
  • adaptors for sequencing can be attached during common linker-mediated PCR, resulting in a library with non-random, fixed starting points for sequencing.
  • a common linker-mediated PCR is performed on the circle target probes, and the post-capture amplicons are linearly concatenated, sheared, and attached to adaptors for sequencing.
  • Methods for shearing the linear concatenated captured targets can include any of the methods disclosed for fragmenting nucleic acids discussed above.
  • performing a hydrolysis reaction on the captured amplicons in the presence of heat is the desired method of shearing for library production.
  • Sequencing may be by any method known in the art.
  • DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing.
  • Separated molecules may be sequenced by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes.
  • Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5' and 3' ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 copies of single-stranded DNA molecules of the same template in each channel of the flow cell.
  • Primers, DNA polymerase and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3' terminators and fluorophores from each incorporated base are removed and the incorporation, detection and identification steps are repeated. Sequencing according to this technology is described in U.S. Pat. 7,960,120; U.S. Pat. 7,835,871; U.S. Pat. 7,232,656; U.S. Pat. 7,598,035; U.S. Pat.
  • Sequencing generates a plurality of reads.
  • Reads generally include sequences of nucleotide data less than about 150 bases in length, or less than about 90 bases in length. In certain embodiments, reads are between about 80 and about 90 bases, e.g., about 85 bases in length. In some embodiments, these are very short reads, i.e., less than about 50 or about 30 bases in length.
  • a set of sequence reads can be analyzed by any suitable method known in the art. For example, in some embodiments, sequence reads are analyzed by hardware or software provided as part of a sequence instrument. In some embodiments, individual sequence reads are reviewed by sight (e.g., on a computer monitor). A computer program may be written that pulls an observed genotype from individual reads. In certain embodiments, analyzing the reads includes assembling the sequence reads and then genotyping the assembled reads.
  • the sequences obtained using the molecular inversion probe techniques of the invention are analyzed using the methods for evaluating a genetic test, which are described in co-pending and co-owned U.S. Provisional Serial Number 61/723,508, entitled "Validation of Genetic Test."
  • the method involves obtaining a plurality of sequence reads, introducing a simulated mutation into at least one of the plurality of sequence reads, and analyzing the sequence reads to determine if the test identifies the simulated mutation.
  • the simulated mutation can be introduced into each of those sequence reads that span a location of the mutation with a probability of 0.5 (e.g., into about half of those sequence reads that should contain the location of the simulated mutation).
  • the simulated mutation can be introduced by manipulating a data field in the sequence read such as, for example, a base sequence field or quality data field.
  • the sequences can be manipulated by a computer program. For example, a program can be written using Java, Groovy, Python, Perl, or other languages, or a combination thereof, that can automatically insert simulated mutations into sequence reads. Computer-based methods can be used to automatically introduce a number of different simulated mutations into different ones of the plurality of sequence reads.
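  • A minimal Python sketch of such a program is given below, assuming a single-base substitution and reads that align to the reference without gaps. The read layout (a dict with `start`, `seq` and `qual` keys) and the function name `spike_in_substitution` are hypothetical; a real implementation would edit records in a sequence file format and could also alter the quality field.

```python
import random


def spike_in_substitution(reads, ref_pos, alt_base, prob=0.5, seed=None):
    """Return copies of reads with alt_base written at ref_pos.

    Each read is a dict with 'start' (0-based reference coordinate of its
    first base), 'seq' and 'qual'. Only reads spanning ref_pos are eligible,
    and each eligible read is edited with probability prob.
    """
    rng = random.Random(seed)
    edited = []
    for read in reads:
        new = dict(read)  # copy so the original reads are left untouched
        offset = ref_pos - read["start"]
        spans = 0 <= offset < len(read["seq"])
        if spans and rng.random() < prob:
            seq = read["seq"]
            new["seq"] = seq[:offset] + alt_base + seq[offset + 1:]
            new["simulated"] = True  # mark the read as carrying the spike-in
        edited.append(new)
    return edited


reads = [
    {"start": 0, "seq": "ACGTACGT", "qual": "IIIIIIII"},
    {"start": 100, "seq": "ACGT", "qual": "IIII"},
]
mutated = spike_in_substitution(reads, ref_pos=3, alt_base="A", prob=1.0)
print(mutated[0]["seq"], mutated[1]["seq"])
```

  • With `prob=0.5` roughly half of the spanning reads receive the substitution, which mimics a heterozygous variant; `prob=1.0` is used in the example only to make the result deterministic.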
  • sequence reads including the manipulated reads are analyzed to detect a genotype. Analysis can include any method known in the art, such as de novo assembly, alignment to a reference, or a combination thereof.
  • the sequence reads are assembled into a contig.
  • the contig can be aligned to a reference genome.
  • individual reads are then aligned back to the contig.
  • Sequence assembly can be done by methods known in the art including reference-based assemblies, de novo assemblies, assembly by alignment, or combination methods. Assembly can include methods described in U.S. Pat. 8,209,130 titled Sequence Assembly by Porreca and Kennedy, the contents of which are hereby incorporated by reference in their entirety for all purposes.
  • sequence assembly uses the low coverage sequence assembly software (LOCAS) tool described by Klein, et al., in LOCAS-A low coverage sequence assembly tool for re-sequencing projects, PLoS One 6(8) article 23455 (2011), the contents of which are hereby incorporated by reference in their entirety. Sequence assembly is described in U.S. Pat. 8,165,821; U.S. Pat. 7,809,509; U.S. Pat. 6,223,128; U.S. Pub.
  • genetic tests of the invention are validated using a genotyping by assembly-templated alignment (GATA) technique, which is also described in co-pending and co-owned U.S. Provisional Serial Number 61/723,508, entitled "Validation of Genetic Test."
  • FIG. 21 diagrams the validation of a genotyping by assembly-templated alignment (GATA) technique.
  • Genetic analysis by GATA-based methods includes obtaining 401 sequence reads and genotyping 421 to produce an observed genotype.
  • the GATA-based method is evaluated by introducing 403 at least one simulated mutation into the reads.
  • FIG. 22 illustrates obtaining sequence reads and inserting a simulated mutation.
  • the raw sequence reads may only include wild type sequence.
  • a mutation of interest may be known, for example, from the literature or it may be desirable to simply invent a difficult-to-detect mutation to use in methods of validating a genetic analysis.
  • a hypothetical 8 base pair deletion proximal to a C>A substitution is depicted.
  • the raw sequence reads are edited so that they include base sequence data, quality data, or both that would arise from sequencing the simulated mutation.
  • FIG. 23 shows an example in which a standard analytical method is performed for comparison to a GATA-based method. The standard analysis is demonstrated to not be able to detect a mutation.
  • FIG. 23 depicts a workflow in which edited sequence reads (e.g., as depicted in FIG. 22) are aligned to a reference genome (here, using BWA and GATK). The alignment software properly aligns the wild type sequence reads to the reference genome, finding a perfect match and giving a result indicating that the sample is the wild type. However, the alignment software finds no valid alignment for the edited sequence reads and is unable to produce a result.
  • Because the expected genotype of the edited sequence reads is known a priori (and, in fact, intentionally supplied by editing), an operator is able to identify that this analysis method, alignment of sequence reads to a reference genome, is incapable of detecting the mutation. For comparison, the sequence reads are also analyzed by a GATA-based method.
  • FIG. 24 shows analysis of sequence reads that include simulated mutations by GATA.
  • reads are assembled into contigs. Assembly can include any method including those discussed below.
  • each contig is aligned to a reference genome. Alignment can be by any method such as those discussed below, including, e.g., the bwa-sw algorithm implemented by BWA. As shown in FIG. 24, both align to the same reference position. Differences between the contig and the reference genome are identified and, as shown in FIG. 24, described by a CIGAR string.
  • step 3 raw reads are aligned to contigs (using any method such as, for example, BWA with bwa-short, and writing, for example, a CIGAR string).
  • step 4 raw read alignments are mapped from contig space to original reference space (e.g., via position and CIGAR
  • step 5 genotyping is performed using the translated, aligned reads from step 4 (e.g., including raw quality scores for substitutions).
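  • The translation from contig space to reference space can be illustrated with a short sketch that walks a CIGAR string. It assumes the standard SAM meaning of the CIGAR operations and a contig aligned to the reference starting at `ref_start`; the function name `contig_to_reference` and the example coordinates are hypothetical, and this is not presented as the implementation used in the GATA method.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def contig_to_reference(contig_pos, ref_start, cigar):
    """Map a 0-based contig offset to a 0-based reference coordinate.

    ref_start is the reference coordinate where the contig alignment begins.
    Returns None when the contig base is an insertion or soft clip, i.e. it
    has no counterpart on the reference.
    """
    ref, qry = ref_start, 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes both contig and reference
            if contig_pos < qry + n:
                return ref + (contig_pos - qry)
            ref += n
            qry += n
        elif op in "IS":     # consumes contig only
            if contig_pos < qry + n:
                return None
            qry += n
        elif op in "DN":     # consumes reference only
            ref += n
        # H and P consume neither sequence
    raise ValueError("contig_pos lies beyond the aligned contig")


# A contig carrying an 8 bp deletion relative to the reference.
before_deletion = contig_to_reference(9, ref_start=100, cigar="10M8D5M")
after_deletion = contig_to_reference(10, ref_start=100, cigar="10M8D5M")
print(before_deletion, after_deletion)
```

  • A read aligned to the contig at a given contig offset can be placed on the reference by passing that offset through this mapping, which is how a deletion present in the contig shifts the reference coordinates of downstream read bases.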
  • reads may be assembled into contigs by any method known in the art.
  • Algorithms for the de novo assembly of a plurality of sequence reads are known in the art.
  • One algorithm for assembling sequence reads is known as overlap consensus assembly. Assembly with overlap graphs is described, for example, in U.S. Pat. 6,714,874.
  • de novo assembly proceeds according to so-called greedy algorithms, as described in U.S. Pub. 2011/0257889, incorporated by reference in its entirety.
  • assembly proceeds by either exhaustive or heuristic pairwise alignment. Exhaustive pairwise alignment, sometimes called a "brute force" approach, calculates an alignment score for every possible alignment between every possible pair of sequences among a set. Assembly by heuristic multiple sequence alignment ignores certain mathematically unlikely combinations and can be computationally faster.
  • One heuristic method of assembly by multiple sequence alignment is the so-called "divide-and-conquer" heuristic, which is described, for example, in U.S. Pub.
  • assembly into contigs involves making a de Bruijn graph.
  • De Bruijn graphs reduce the computational effort by breaking reads into smaller sequences of DNA, called k-mers, where the parameter k denotes the length in bases of these sequences.
  • all reads are broken into k-mers (all subsequences of length k within the reads) and a path between the k-mers is calculated.
  • the reads are represented as a path through the k-mers.
  • the de Bruijn graph captures overlaps of length k-1 between these k-mers and not between the actual reads.
  • the de Bruijn graph reduces the high redundancy in short-read data sets.
  • Assembly of reads using de Bruijn graphs is described in U.S. Pub. 2011/0004413, U.S. Pub. 2011/0015863, and U.S. Pub. 2010/0063742, incorporated by reference in their entirety. Assembly of reads into contigs is further discussed in U.S. Pat. 6,223,128, U.S. Pub. 2009/0298064, U.S. Pub. 2010/0069263, and U.S. Pub. 2011/0257889, each of which is incorporated by reference herein in its entirety.
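  • The k-mer decomposition described above can be sketched in a few lines. This is a toy illustration, assuming error-free reads and treating each k-mer as a node joined to the next k-mer in the same read (an overlap of k-1 bases); the function names are hypothetical and the sketch omits the error correction and graph simplification that real assemblers perform.

```python
from collections import defaultdict


def kmers(read, k):
    """All subsequences of length k within a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph[km]  # touching the key registers the node
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)  # a[1:] == b[:-1], an overlap of k-1 bases
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


toy_reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_graph(toy_reads, k=4)
contig = walk_unbranched(graph, "ACGT")
print(contig)
```

  • The two toy reads share the k-mer GTAC, so they collapse onto one path through the graph even though the reads themselves are never compared pairwise.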
  • aspects of the invention relate to preparative steps in DNA sequencing-related technologies that reduce bias and increase the reliability and accuracy of downstream quantitative applications.
  • genomics assays that utilize next-generation (polony-based) sequencing to generate data, including genome resequencing, RNA-seq for gene expression, bisulphite sequencing for methylation, and Immune-seq, among others.
  • these methods utilize the counts of sequencing reads of a given genomic locus as a proxy for the representation of that sequence in the original sample of nucleic acids.
  • the majority of these techniques require a preparative step to construct a high-complexity library of DNA molecules that is representative of a sample of interest.
  • This may include chemical or biochemical treatment of the DNA (e.g., bisulphite treatment), capture of a specific subset of the genome (e.g., padlock probe capture, solution hybridization), and a variety of amplification techniques (e.g., polymerase chain reaction, whole genome amplification, rolling circle amplification).
  • a genomic sequencing library may contain an over- or under-representation of particular sequences from a source genome as a result of errors (bias) in the library construction process.
  • bias can be particularly problematic when it results in target sequences from a genome being absent or undetectable in the sequencing libraries.
  • an under-representation of particular allelic sequences (e.g., heterozygotic alleles)
  • because sequencing library quantification techniques depend on stochastic counting processes, these problems have typically been addressed by sampling enough (over-sampling) to obtain a minimum number of observations necessary to make statistically significant decisions.
  • aspects of the disclosure are based, in part, on the discovery of methods for overcoming problems associated with systematic and random errors (bias) in genome capture, amplification and sequencing methods, namely high variability in the capture and amplification of nucleic acids and disproportionate representation of heterozygous alleles in sequencing libraries. Accordingly, in some embodiments, the disclosure provides methods that reduce variability in the capture and amplification of nucleic acids. In other embodiments, the methods improve allelic representation in sequencing libraries and, thus, improve variant detection outcomes. In certain embodiments, the disclosure provides preparative methods for capturing target nucleic acids (e.g., genetic loci) that involve the use of differentiator tag sequences to uniquely tag individual nucleic acid molecules.
  • the differentiator tag sequence permits the detection of bias based on the frequency with which pairs of differentiator tag and target sequences are observed in a sequencing reaction.
  • the methods reduce errors caused by bias, or the risk of bias, associated with the capture, amplification and sequencing of genetic loci, e.g., for diagnostic purposes.
  • aspects of the invention relate to associating unique sequence tags (referred to as differentiator tag sequences) with individual target molecules that are independently captured and/or analyzed (e.g., prior to amplification or other process that may introduce bias). These tags are useful to distinguish independent target molecules from each other thereby allowing an analysis to be based on a known number of individual target molecules. For example, if each of a plurality of target molecule sequences obtained in an assay is associated with a different differentiator tag, then the target sequences can be considered to be independent of each other and a genotype likelihood can be determined based on this information.
  • if each of the plurality of target molecule sequences obtained in the assay is associated with the same differentiator tag, then they probably all originated from the same target molecule due to over-representation (e.g., due to biased amplification) of this target molecule in the assay.
  • This provides less information than the situation where each nucleic acid was associated with a different differentiator tag.
  • a threshold number of independently isolated molecules (e.g., unique combinations of differentiator tag and target sequences) is analyzed to determine the genotype of a subject.
  • the invention relates to compositions comprising pools (libraries) of preparative nucleic acids that each comprise "differentiator tag sequences" for detecting and reducing the effects of bias, and for genotyping target nucleic acid sequences.
  • a "differentiator tag sequence” is a sequence of a nucleic acid (a preparative nucleic acid), which in the context of a plurality of different isolated nucleic acids, identifies a unique, independently isolated nucleic acid.
  • differentiator tag sequences are used to identify the origin of a target nucleic acid at one or more stages of a nucleic acid preparative method.
  • differentiator tag sequences provide a basis for differentiating between multiple independent, target nucleic acid capture events.
  • differentiator tag sequences provide a basis for differentiating between multiple independent, primary amplicons of a target nucleic acid, for example.
  • combinations of target nucleic acid and differentiator tag sequence (target:differentiator tag sequences) of an isolated nucleic acid of a preparative method provide a basis for identifying unique, independently isolated target nucleic acids.
  • FIG. 4A-C depict various non-limiting examples of methods for combining differentiator tag sequence and target sequences.
  • differentiator tags may be synthesized using any one of a number of different methods known in the art. For example, differentiator tags may be synthesized by random nucleotide addition. Differentiator tag sequences are typically of a predefined length, which is selected to control the likelihood of producing unique target:differentiator tag sequences in a preparative reaction (e.g., an amplification-based reaction or a circularization selection-based reaction, e.g., a MIP reaction).
  • Differentiator tag sequences may be up to 5, up to 6, up to 7, up to 8, up to 9, up to 10, up to 11, up to 12, up to 13, up to 14, up to 15, up to 16, up to 17, up to 18, up to 19, up to 20, up to 21, up to 22, up to 23, up to 24, up to 25, or more nucleotides in length.
  • isolated nucleic acids are identified as independently isolated if they comprise unique combinations of target nucleic acid and differentiator tag sequences, and observance of threshold numbers of unique combinations of target nucleic acid and differentiator tag sequences provides a certain statistical confidence in the genotype.
  • each nucleic acid molecule may be tagged with a unique differentiator tag sequence in a configuration that permits the differentiator tag sequence to be sequenced along with the target nucleic acid sequence of interest (the nucleic acid sequence for which the library is being prepared, e.g., a polymorphic sequence).
  • a large library of unique differentiator tag sequences may be created by using
  • the differentiator tag sequences of the polynucleotides may be read at the final stage of the sequencing.
  • the observations of the differentiator tag sequences may be used to detect and correct biases in the final sequencing read-out of the library.
  • the total possible number of differentiator tag sequences, which may be produced, e.g., randomly, is 4^N, where N is the length of the differentiator tag sequence.
  • the length of the differentiator tag sequence may be adjusted such that the size of the population of MIPs having unique differentiator tag sequences is sufficient to produce a library of MIP capture products in which identical independent combinations of target nucleic acid and differentiator tag sequence are rare.
  • combinations of target nucleic acid and differentiator tag sequences may also be referred to as "target:differentiator tag sequences”.
  • each read may have an additional unique differentiator tag sequence.
  • all the unique differentiator tag sequences will be observed about an equal number of times. Accordingly, the number of occurrences of a differentiator tag sequence may follow a Poisson distribution.
  • overrepresentation of target:differentiator tag sequences in a pool of preparative nucleic acids is indicative of bias in the preparative process (e.g., bias in the amplification process).
  • target:differentiator tag sequence combinations that are statistically overrepresented are indicative of bias in the protocol at one or more steps between the incorporation of the differentiator tag sequences into MIPs and the actual sequencing of the MIP capture products.
  • the number of reads of a given target:differentiator tag sequence may be indicative (may serve as a proxy) of the amount of that target sequence present in the originating sample.
  • the number of occurrences of sequences in the originating sample is the quantity of interest.
  • the occurrence of differentiator tag sequences in a pool of MIPs may be predetermined (e.g., may be the same for all differentiator tag sequences). Accordingly, changes in the occurrence of differentiator tag sequences after amplification and sequencing may be indicative of bias in the protocol. Bias may be corrected to provide an accurate representation of the composition of the original MIP pool, e.g., for diagnostic purposes.
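The bias check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of the disclosure: it assumes the read counts of the tags observed for one target follow a Poisson distribution whose mean is the observed average, and it flags tags whose count is improbably high. The function names and the 0.001 cutoff are hypothetical.

```python
import math
from collections import Counter

def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), computed by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    lower = term
    for i in range(1, k):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        lower += term
    return max(0.0, 1.0 - lower)

def overrepresented_tags(tag_reads, alpha=0.001):
    """Return tags whose read count is unlikely under the Poisson model.

    tag_reads: iterable with one differentiator tag per sequencing read.
    alpha: hypothetical significance cutoff.
    """
    counts = Counter(tag_reads)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(t for t, c in counts.items() if poisson_upper_tail(c, mean) < alpha)
```

Because the mean is estimated from the same data, a strongly amplified tag inflates it; a median-based estimate might be more robust in practice.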
  • a library of preparative nucleic acid molecules may be constructed such that the number of nucleic acid molecules in the library is significantly larger than the number of prospective target nucleic acid molecules to be captured using the library. This ensures that products of the preparative methods include only unique target:differentiator tag sequences; e.g., in a MIP reaction the capture step would undersample the total population of unique differentiator tag sequences in the MIP library. For example, an experiment utilizing 1 μg of genomic DNA will contain about 150,000 copies of a diploid genome.
  • with each MIP in the library comprising a randomly produced 12-mer differentiator tag sequence (4^12, or ~16.8 million, possible unique differentiator tag sequences), there would be more than 100 unique differentiator tag sequences per genomic copy.
  • with each MIP in the library comprising a randomly produced 15-mer differentiator tag sequence (4^15, or ~1 billion, possible unique differentiator tag sequences), there would be more than 7000 unique differentiator tag sequences per genomic copy. Therefore, the probability of the same differentiator tag sequence being incorporated multiple times is very small.
  • the length of the differentiator tag sequence is to be selected based on the amount of target sequence in a MIP capture reaction and the desired probability for having multiple, independent occurrences of target:differentiator tag sequence combinations.
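The tag-length arithmetic above (4 to the power N possible tags, compared with about 150,000 diploid genome copies in 1 μg of DNA) can be checked with a short sketch. The function names are illustrative only.

```python
def tag_space(n):
    """Number of distinct random tags of length n over A, C, G, T."""
    return 4 ** n

def tags_per_genome_copy(n, genome_copies=150_000):
    """Distinct tags available per genome copy (default: ~1 ug of DNA, per the text)."""
    return tag_space(n) / genome_copies

for n in (12, 15):
    print(n, tag_space(n), round(tags_per_genome_copy(n)))
# 12 16777216 112
# 15 1073741824 7158
```

These values agree with the "more than 100" and "more than 7000" tags per genomic copy stated for 12-mer and 15-mer tags.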
  • FIG. 5 depicts a non-limiting method for genotyping based on target and differentiator tag sequences. Sequencing reads of target and differentiator tags sequences are collapsed to make diploid genotype calls.
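The collapsing step can be illustrated as follows. This is a sketch under assumptions, not the algorithm of FIG. 5: reads are taken to be (target, tag, allele) tuples, each unique target:tag pair is treated as one independently captured molecule, and the `min_molecules` threshold is a hypothetical parameter.

```python
from collections import defaultdict

def collapse_reads(reads):
    """Reduce reads to unique (target, tag) molecules, each with a majority allele.

    reads: iterable of (target, tag, allele) tuples.
    Returns {target: {allele: number_of_independent_molecules}}.
    """
    by_molecule = defaultdict(lambda: defaultdict(int))
    for target, tag, allele in reads:
        by_molecule[(target, tag)][allele] += 1
    collapsed = defaultdict(lambda: defaultdict(int))
    for (target, _tag), allele_counts in by_molecule.items():
        consensus = max(allele_counts, key=allele_counts.get)
        collapsed[target][consensus] += 1
    return collapsed

def call_genotype(molecule_counts, min_molecules=2):
    """Diploid call from alleles supported by >= min_molecules independent captures."""
    supported = sorted(a for a, n in molecule_counts.items() if n >= min_molecules)
    if len(supported) == 1:
        return (supported[0], supported[0])  # homozygous
    if len(supported) == 2:
        return tuple(supported)              # heterozygous
    return None                              # no confident call
```

With this approach, fifty reads that all carry the same tag count as a single observation, so a heavily amplified molecule cannot outvote the other allele.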
  • FIG. 6 depicts non-limiting results of a simulation of a MIP capture reaction in which MIP probes, each having a differentiator tag sequence of 15 nucleotides, are combined with 10000 target sequence copies (e.g., genome equivalents). In this simulated reaction, the probability of capturing one or more copies of a target sequence having the same differentiator tag sequence is 0.05.
  • the Y axis reflects the number of observations.
  • the X axis reflects the number of independent occurrences of target:differentiator tag combinations.
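One way to approximate the kind of probability reported for FIG. 6 is the birthday-problem formula. The sketch below assumes tags are drawn uniformly and independently from all 4^N sequences; the simulation behind FIG. 6 may have used a different model, and the function name is illustrative.

```python
import math

def collision_probability(n_captures, tag_length):
    """Approximate P(at least two captures of one target share a tag).

    Uses the birthday-problem approximation 1 - exp(-k(k-1) / (2M)),
    with k captures and M = 4**tag_length equally likely tags.
    """
    n_tags = 4 ** tag_length
    return 1.0 - math.exp(-n_captures * (n_captures - 1) / (2.0 * n_tags))

p = collision_probability(10_000, 15)
print(round(p, 3))  # 0.045
```

For 10,000 target copies and 15-nucleotide tags this gives roughly 0.045, close to the value of 0.05 stated for the simulated reaction.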
  • FIG. 7 depicts a non-limiting graph of sequencing coverage, which can help ensure that alleles are sampled to sufficient depth (e.g., either 10x or 20x minimum sampling per allele, assuming 1000 targets).
  • the X axis is total per-target coverage required.
  • the Y axis is the probability that a given total coverage will result in at least 10x or 20x coverage for each allele.
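A curve of this kind can be sketched with a binomial model. The assumptions here are mine, not stated in the figure description: reads at a heterozygous site are split between the two alleles as Binomial(n, 0.5), targets are independent, and "assuming 1000 targets" is read as requiring every target to meet the minimum.

```python
import math

def prob_both_alleles_covered(total_coverage, min_per_allele):
    """P(each allele of a heterozygous site gets >= min_per_allele reads)."""
    n, k = total_coverage, min_per_allele
    if n < 2 * k:
        return 0.0
    # Count splits in which the first allele gets between k and n - k reads.
    favorable = sum(math.comb(n, x) for x in range(k, n - k + 1))
    return favorable / 2 ** n

def prob_all_targets_covered(total_coverage, min_per_allele, n_targets=1000):
    """Probability that every one of n_targets independent sites meets the minimum."""
    return prob_both_alleles_covered(total_coverage, min_per_allele) ** n_targets
```

Sweeping `total_coverage` for `min_per_allele` of 10 and 20 produces two rising curves, which is the general shape the figure description suggests.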
  • adapters may be ligated onto the ends of the molecules of interest. Adapters often contain PCR primer sites (for amplification or emulsion PCR) and/or sequencing primer sites.
  • barcodes may be included, for example, to uniquely identify individual samples (e.g., patient samples) that may be mixed together.
  • nucleic acids comprising differentiator tag sequences may be incorporated by ligation. This is a flexible method, because molecules having differentiator tag sequence can be ligated to any blunt-ended nucleic acids.
  • the sequencing primers must be incorporated subsequently such that they sequence both the differentiator tag sequence and the target sequence.
  • the sequencing adaptors can be synthesized with the random differentiator tag sequences at their 3' end (as degenerate bases), so that only one ligation must be performed.
  • Another method is to incorporate the differentiator tag sequence into a PCR primer, such that the primer structure is arranged with the common adaptor sequence followed by the random differentiator tag sequence followed by the PCR priming sequence (in 5' to 3' order).
  • a differentiator tag sequence and adaptor sequence (which may contain the sequencing primer site) are incorporated as tags.
  • Another method to incorporate the differentiator tag sequences is to synthesize them into a padlock probe prior to performing a gene capture reaction. The differentiator tag sequence is incorporated 3' to the targeting arm but 5' to the amplification primer that will be used downstream in the protocol.
  • Another method to incorporate the differentiator tag sequences is as a tag on a gene-specific or poly-dT reverse-transcription primer. This allows the differentiator tag sequence to be incorporated directly at the cDNA level.
  • the distribution of differentiator tag sequences can be assumed to be uniform. In this case, bias in any part of the protocol would change the uniformity of this distribution, which can be observed after sequencing. This allows the differentiator tag sequence to be used in any preparative process where the ultimate output is sequencing of many molecules in parallel.
  • Differentiator tag sequences may be incorporated into probes (e.g., MIPs) of a plurality when they are synthesized on-chip in parallel, such that degeneracy of the incorporated nucleotides is sufficient to ensure near-uniform distribution in the plurality of probes.
  • amplification of a pool of unique differentiator tag sequences may itself introduce bias in the initial pool.
  • the scale of synthesis e.g., by column synthesis, chip based synthesis, etc.
  • potential bias may be minimized.
  • the differentiator tag sequences are used in genome re-sequencing. Considering that the raw accuracy of most next-generation sequencing instruments is relatively low, it is crucial to oversample the genomic loci of interest. Furthermore, since there are two alleles at every locus, it is important to sample enough to ensure that both alleles have been observed a sufficient number of times to determine with a sufficient degree of statistical confidence whether the sample is homozygous or heterozygous. Indeed, the sequencing is performed to sample the composition of molecules in the originating sample. However, after multiple reads have been collected for a given locus, it is possible that due to bias (e.g., caused by PCR amplification steps), a large fraction of the reads are derived from a single originating molecule.
  • sequences and corresponding distribution of differentiator tag sequences can be used as an additional input to the genotype-calling algorithm to significantly improve the accuracy and confidence of the genotype calls.
  • the disclosure provides methods for analyzing a plurality of target sequences which are genetic loci or portions of genetic loci (e.g., a genetic locus of Table 1).
  • the genetic loci may be analyzed by sequencing to obtain a genotype at one or more polymorphisms (e.g., SNPs).
  • Exemplary polymorphisms are disclosed in Table 2.
  • the skilled artisan will appreciate that other polymorphisms are known in the art and may be identified, for example, by querying the Entrez Single Nucleotide Polymorphism database, for example, by searching with a
  • the mutations listed in Table 2 are documented polymorphisms in several disease-associated genes (CFTR is mutated in cystic fibrosis, GBA is mutated in Gaucher disease, ASPA is mutated in Canavan disease, HEXA is mutated in Tay-Sachs disease).
  • the polymorphisms are of several types: insertion/deletion polymorphisms which will cause frameshifts (and thus generally interrupt protein function) unless the insertion/deletion length is a multiple of 3 bp, and substitutions which can alter the amino acid sequence of the protein and in some cases cause complete inactivation by introduction of a stop codon.
  • GBA 2629 rs3754485 GTTTCAGACCAGCCTGGCCAACATAG[C/T]GA 83
  • GBA 2629 rs2990225 GCGAATCCCAACCCCGACGCTCGTCG[C/T]CG 87
  • TACTTTTTTTCCAAATTGAAGGTTTTTGGC
  • ASPA 443 rs34680506 TTGAAGGTAAAATCATAGGGAGTTGG[-/G] 120
  • ASPA 443 rs17850703 CAGGGCTGGAGGTAAAACCATTTATT[A/G]CT 129
  • ASPA 443 rs17175228 CACAAGATCTCATTACTCAGGAGCTG[C/T]CCAAGTGTCTAATGTACTTAGTTAA 131
  • ASPA 443 rs16953074 TTCTGTGTAACATTTCATTTAAGCAA[A/G]GG 132
  • HEXA 3073 rs57733983 CATACCAAAGGGCAGCTGGAGGGATAC[C/T]A 152
  • HEXA 3073 rs34300017 ACACAGGTAATCCATGTTTATTATAG[-/A]AAAATGCCACATTACTCTTTATTGA 171
  • HEXA 3073 rs34110830 AATGAACTTACAGGAAGGTAATATAT[-/G] 173
  • aspects of the invention relate to methods for detecting nucleic acid deletions or insertions in regions containing nucleic acid sequence repeats.
  • Genomic regions that contain nucleic acid sequence repeats are often the site of genetic instability due to the amplification or contraction of the number of sequence repeats (e.g., the insertion or deletion of one or more units of the repeated sequence). Instability in the length of genomic regions that contain high numbers of repeat sequences has been associated with a number of hereditary and non hereditary diseases and conditions.
  • Fragile X syndrome, or Martin-Bell syndrome, is a genetic syndrome which results in a spectrum of characteristic physical, intellectual, emotional and behavioral features which range from severe to mild in manifestation.
  • the syndrome is associated with the expansion of a single trinucleotide gene sequence (CGG) on the X chromosome, and results in a failure to express the FMR-1 protein which is required for normal neural development.
  • cancer which has been associated with microsatellite instability
  • MSI genomic copy number of nucleic acid repeats at one or more microsatellite loci (e.g., BAT-25 and/or BAT-26).
  • sequencing-based assays for determining the number of nucleic acid sequence repeats at a particular locus and identifying the presence of nucleic acid insertions or deletions.
  • such techniques are not useful in a high throughput multiplex analysis where the entire length of a region may not be sequenced.
  • aspects of the invention relate to detecting the presence of an insertion or deletion at a genomic locus without requiring the locus to be sequenced (or without requiring the entire locus to be sequenced). Aspects of the invention are particularly useful for detecting an insertion or deletion in a nucleic acid region that contains high levels of sequence repeats.
  • the presence of sequence repeats at a genetic locus is often associated with relatively high levels of polymorphism in a population due to insertions or deletions of one or more of the sequence repeats at the locus.
  • the polymorphisms can be associated with diseases or predisposition to diseases (e.g., certain polymorphic alleles are recessive alleles associated with a disease or condition).
  • the presence of sequence repeats often complicates the analysis of a genetic locus and increases the risk of errors when using sequencing techniques to determine the precise sequence and number of repeats at that locus.
  • aspects of the invention relate to determining the size of a genetic locus by evaluating the capture frequency of a portion of that locus suspected of containing an insertion or deletion (e.g., due to the presence of sequence repeats) using a nucleic acid capture technique (e.g., a nucleic acid sequence capture technique based on molecular inversion probe technology).
  • a statistically significant difference in capture efficiency for a genetic locus of interest in different biological samples is indicative of different relative lengths in those samples. It should be appreciated that the length differences may be at one or both alleles of the genetic locus.
  • aspects of the invention may be used to identify polymorphisms regardless of whether the biological samples being interrogated are heterozygous or homozygous for the polymorphisms.
  • subjects that contain one or more loci with an insertion or deletion can be identified by analyzing capture efficiencies for nucleic acids obtained from one or more biological samples using appropriate controls (e.g., capture efficiencies for known nucleic acid sizes, capture efficiencies for other regions that are not suspected of containing an insertion or deletion in the biological sample(s), or predetermined reference capture efficiencies, or any combination thereof).
  • aspects of the invention are not limited by the nature or presence of the control.
  • a subject may be identified as being at risk for a disease or condition associated with insertions or deletions at that genetic locus.
  • the subject may be analyzed in greater detail in order to determine the precise nature of the insertion or deletion and whether the subject is heterozygous or homozygous for one or more insertions or deletions.
  • gel electrophoresis of an amplification (e.g., PCR) product of the locus, or Southern blotting, or any combination thereof can be used as an orthogonal approach to verify the length of the locus.
  • a more exhaustive and detailed sequence analysis of the locus can be performed to identify the number and types of insertions and deletions.
  • other techniques may be used to further analyze a locus identified as having an abnormal length according to aspects of the invention.
  • aspects of the invention relate to detecting abnormal nucleic acid lengths in genomic regions of interest.
  • the invention aims to estimate the size of genomic regions that are difficult to access, such as repetitive elements.
  • methods of the invention do not require that the precise length be estimated.
  • fragile X can be used to illustrate aspects of the invention where the size of trinucleotide repeats (genotype) is linked to a symptom (phenotype).
  • fragile X is a non-limiting example and similar analyses may be performed for other genetic loci (e.g., independently or simultaneously in multiplex analyses).
  • molecular inversion probes (MIPs)
  • aspects of the invention are based on the recognition that the effect of length on probe capturing efficiency can be used in the context of an assay (e.g., a high throughput and/or multiplex assay) to allow the length of sequences to be determined without requiring sequencing of the entire region being evaluated. This is particularly useful for repeat regions that are prone to changes in size.
  • FIG. 8, which is reproduced from Deng et al., Nature Biotech. 27:353-60 (see Supplemental FIG. 1G of Deng et al.), illustrates that shorter sequences are captured with higher efficiency than longer sequences using MIPs.
  • the statistical package R and its effects module were used for this analysis. A linear model was used, and each individual factor was assumed to be independent. The dashed lines represent a 95% confidence interval. Shorter target sequences were captured with higher efficiency than long target sequences (p < 2×10^-16). However, the use of this differential
  • polymerase fill-in and ligation reactions are performed to convert the hybridized probe to a covalently-closed, circular molecule containing the desired target.
  • PCR or rolling circle amplification plus exonuclease digestion of non-circularized material is performed to isolate and amplify the circular targets from the starting nucleic acid pool. Since one of the main benefits of the method is the potential for a high degree of multiplexing, generally thousands of targets are captured in a single reaction containing thousands of probes.
  • repetitive regions are surrounded by non-repetitive unique sequences, which can be used to amplify the repeat-containing regions using, for example, a PCR or padlock (MIP)-based method.
  • a probe e.g., a MIP or padlock probe
  • the amplicon can be end-sequenced so that the unique sequence can be identified and serve as the "representative" of the repetitive region as illustrated in FIG. 9.
  • FIG. 9 illustrates a non-limiting scheme of padlock
  • MIP multi-dimensional fingerprinting of a region that includes both repetitive regions (thick wavy line) and the adjacent unique sequence (thick straight line).
  • the regions of the probe are indicated with the targeting arms shown as regions "1" and "3."
  • An intervening region that may be, or include, a sequencing primer binding site is shown as "2."
  • After the padlock is circularized and amplified, it can be end-sequenced to obtain the unique sequence, which represents the repetitive region of interest.
  • although capture efficiency is overall negatively correlated with target length, different probe sequences may have unique features.
  • probes could be designed and tested so that an optimal one is chosen to be sensitive enough to differentiate repetitive sizes of roughly 0-150 bp, 150-600 bp, and beyond, which represent normal, premutation and full mutation of fragile X syndrome, respectively.
  • probe sizes and sequences can be designed, and optionally optimized, to distinguish a range of repeat region size differences (e.g., length differences of about 3-30 bases, about 30-60 bases, about 60-90 bases, about 90-120 bases, about 120-150 bases, about 150-300 bases, about 300-600 bases, about 600-900 bases, or any intermediate or longer length difference).
  • a length difference may be an increase in size or a decrease in size.
  • an initial determination of an unexpected capture frequency is indicative of the presence of size difference.
  • an increase in capture frequency is indicative of a deletion.
  • a decrease in capture frequency is indicative of an insertion.
  • a change in capture frequency can be associated with either an increase or decrease in target region length. In some embodiments, the precise nature of the change can be determined using one or more additional techniques as described herein.
  • a MIP probe includes a linear nucleic acid strand that contains two hybridization sequences or targeting arms, one at each end of the linear probe, wherein each of the hybridization sequences is complementary to a separate sequence on the same strand of a target nucleic acid, and wherein these sequences on the target nucleic acid flank the two ends of the target nucleic acid sequence of interest. It should be appreciated that upon hybridization, the two ends of the probe are inverted with respect to each other in the sense that both 5' and 3' ends of the probe hybridize to the same strand at separate regions flanking the target region (as illustrated in FIG. 9 for example).
  • the hybridization sequences are between about 10-100 nucleotides long, for example between about 10-30, about 30-60, about 60-90, or about 20, about 30, about 40, or about 50 nucleotides long. However, other lengths may be used depending on the application.
  • the hybridization Tms of both targeting arms of a probe are designed or selected to be similar.
  • the hybridization Tms of the targeting arms of a plurality of probes designed to capture different target regions are selected or designed to be similar so that they can be used together in a multiplex reaction. Accordingly, a typical size of a MIP probe prior to fill-in is about 60-80 nucleotides long.
  • MIP probes are designed to avoid sequence-dependent secondary structures.
  • MIP probes are designed such that the targeting arms do not overlap with known polymorphic regions.
  • targeting arms that can be used for capturing the repeat region of the Fragile X locus can have the following sequences, or sequences complementary to them, depending on the strand that is captured.
  • the typical captured size using these targeting arms is about 100 nucleotides in length (e.g., about 30 repeats of a tri-nucleotide repeat).
  • the number of reads obtained for the "representative" of the repetitive region is not informative to estimate the target length because it is dependent on the total number of reads obtained. To overcome this, it is useful to include one or more probes that target other "control" regions where no or minimal polymorphism exists among populations. Because of the systematic consistency of capturing efficiency (see, e.g., FIG. 9), the ratio of reads obtained for the repetitive "representative" to reads obtained for the control region(s) will be tuned using DNA with defined numbers of repeats. Ultimately, the ratio can serve as a measure of the repeat length as illustrated in FIG. 10. FIG.
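The normalization and calibration described above can be sketched as follows. This is an illustrative outline only: the helper names are hypothetical, the calibration points are assumed to come from DNA standards with defined repeat numbers, and linear interpolation between standards is a simplifying choice rather than anything stated in the disclosure.

```python
def normalized_ratio(repeat_reads, control_reads):
    """Reads for the repeat 'representative' divided by mean reads per control probe."""
    if not control_reads or sum(control_reads) == 0:
        raise ValueError("need at least one control region with reads")
    return repeat_reads / (sum(control_reads) / len(control_reads))

def estimate_repeats(ratio, calibration):
    """Linearly interpolate repeat number from (ratio, repeats) calibration points.

    calibration: list of (ratio, repeat_count) pairs with distinct ratios.
    Ratios outside the calibrated range are clamped to the nearest standard.
    """
    points = sorted(calibration)  # ascending ratio
    if ratio <= points[0][0]:
        return points[0][1]
    if ratio >= points[-1][0]:
        return points[-1][1]
    for (r0, n0), (r1, n1) in zip(points, points[1:]):
        if r0 <= ratio <= r1:
            return n0 + (n1 - n0) * (ratio - r0) / (r1 - r0)

# Hypothetical calibration: longer repeats are captured less efficiently.
standards = [(1.0, 30), (0.5, 100), (0.2, 200)]
ratio = normalized_ratio(75, [100, 100])
print(ratio, estimate_repeats(ratio, standards))  # 0.75 65.0
```

Dividing by control-region reads removes the dependence on total sequencing depth, which is why the raw read count alone is not informative.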
  • the whole repetitive region can be sequenced by making a shotgun library (e.g., by making a shotgun library from a captured sequence, for example a sequence captured using a MIP probe).
  • the expectation is that the number of reads from any given repeat will be a direct function of the number of repeats present.
  • a Poisson sampling-induced spread may need to be considered and in some embodiments may be sufficiently large to limit the resolution.
  • FIG. 11A-C shows the approach.
  • MIPs are synthesized to contain one of a large number of differentiator tags in their backbone such that the probability of any two MIPs in a reaction having the same differentiator tag sequence is low. MIP capture is performed on the sample; the reaction will be biased for shorter target lengths, and therefore the reaction product will comprise more 'short' circles than 'long' circles. Each circle should bear a unique differentiator tag sequence. Then, linear RCA (lRCA) is performed on the circles.
  • a sequencing technique e.g., a next-generation sequencing technique
  • a sequencing technique is used to sequence part of one or more captured targets (e.g., or amplicons thereof) and the sequences are used to count the number of different barcodes that are present.
  • aspects of the invention relate to a highly-multiplexed qPCR reaction.
  • loci at which insertions or deletions or repeat sequences may be associated with a disease or condition are provided in Tables 3 and 4. It should be appreciated that the presence of an abnormal length at any one or more of these loci may be evaluated according to aspects of the invention. In some embodiments, two or more of these loci or other loci may be evaluated in a single multiplex reaction using different probes designed to hybridize under the same reaction conditions to different target nucleic acid in a biological sample.
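  • FRAXA (Fragile X syndrome): gene FMR1, on the X chromosome; repeat CGG; normal 6-53; pathogenic 230+
  • FXTAS (Fragile X-associated tremor/ataxia syndrome): gene FMR1, on the X chromosome; repeat CGG; normal 6-53; pathogenic 55-200
  • FRAXE (Fragile XE mental retardation): gene AFF2 or FMR2, on the X chromosome; repeat GCC; normal 6-35; pathogenic 200+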
  • aspects of the invention relate to methods for increasing the sensitivity of nucleic acid detection assays.
  • genomic assays that utilize next-generation (e.g., polony-based) sequencing to generate data, including genome resequencing, RNA-seq for gene expression, bisulphite sequencing for methylation, and Immune-seq, among others.
  • these methods utilize the counts of sequencing reads of a given genomic locus as a proxy for the representation of that sequence in the original sample of nucleic acids.
  • the majority of these techniques require a preparative step to construct a high-complexity library of DNA molecules that is representative of a sample of interest.
  • nucleic acid preparative techniques (e.g., amplification, for example PCR-based amplification; sequence-specific capture, for example, using immobilized capture probes; or target capture into a circularized probe) followed by a sequence analysis step.
  • current methods involve oversampling a target nucleic acid preparation in order to increase the likelihood that all sequences that are present in the original nucleic acid sample will be represented in the final sequence data.
  • a genomic sequencing library may contain an over- or under- representation of particular sequences from a source nucleic acid sample (e.g., genome preparation) as a result of stochastic variations in the library construction process.
  • Such variations can be particularly problematic when they result in target sequences from a genome being absent or undetectable in a sequencing library.
  • an under-representation of particular allelic sequences (e.g., heterozygotic alleles) can result in an apparent homozygous representation in a sequencing library.
  • aspects of the invention relate to basing a nucleic acid sequence analysis on results from two or more different nucleic acid preparatory techniques that have different systematic biases in the types of nucleic acids that they sample rather than simply oversampling the target nucleic acid.
  • different techniques have different sequence biases that are systematic and not simply due to stochastic effects during nucleic acid capture or amplification.
  • the degree of oversampling required to overcome variations in nucleic acid preparation needs to be sufficient to overcome the biases.
  • the invention provides methods that reduce the need for oversampling by combining nucleic acid and/or sequence results obtained from two or more different nucleic acid preparative techniques that have different biases.
  • different techniques have different characteristic or systematic biases. For example, one technique may bias a sample analysis towards one particular allele at a genetic locus of interest, whereas a different technique would bias the sample analysis towards a different allele at the same locus. Accordingly, the same sample may be identified as being different depending on the type of technique that is used to prepare nucleic acid for sequence analysis. This effectively represents a sensitivity issue, because each technique has different relative sensitivities for polymorphic sequences of interest.
  • the sensitivity of a nucleic acid analysis can be increased by combining the sequences from different nucleic acid preparative steps and using the combined sequence information for a diagnostic assay (e.g., for making a call as to whether a subject is homozygous or heterozygous at a genetic locus of interest).
  • heterozygote base-calls for a diploid genome e.g. a human sample presented for molecular diagnostic sequencing
  • Sample preparative methods may fall into three classes: 1) single- or several-target amplification (e.g., uniplex PCR, 'multiplex' PCR), 2) multi-target hybridization enrichment (e.g., Agilent SureSelect 'hybrid capture' [Gnirke et al 2009, Nature Biotechnology 27:182-9], Roche/Nimblegen 'sequence capture' [Hodges et al 2007, Nature Genetics 39:1522-7]), and 3) multi-target circularization selection (e.g.
  • a skewed ratio is a particular issue that decreases the sensitivity of detecting mutations present in a heterogeneous tumor tissue.
  • the methods disclosed herein are based, in part, on the discovery that certain classes of isolation methods have different modes of bias.
  • the disclosure provides methods for increasing the sensitivity of the downstream sequencing by using a combination of multiple isolation methods (e.g., one or more from at least two of the classes disclosed herein) for a sample. This is particularly important in molecular diagnostics where high sensitivity is required to minimize the chances of 'missing' a disease-associated mutation. For example, given a nominal false-negative error rate of 1×10^-3 for sequencing following circularization selection, and a false-negative error rate of 1×10^-3 for sequencing following hybridization enrichment, one can achieve a final false-negative rate of 1×10^-6 by performing both techniques on the sample (assuming failures in each method are fully independent).
  • the number of missed carrier diagnoses would decrease from 1000 per million patients tested to 1 per million patients tested. Furthermore, if the testing was used in the context of prenatal carrier screening, the number of affected children born as a result of missing the carrier call in one parent would decrease from 25 per million to 25 per billion born.
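The arithmetic behind these figures can be made explicit with a small sketch. It takes each single-method false-negative rate as 1 in 1,000, the value implied by the 1000-per-million figure, and assumes, as the text does, that failures of the two methods are fully independent. The function names are illustrative.

```python
def combined_false_negative(*rates):
    """Probability that every method misses the variant, assuming independent failures."""
    combined = 1.0
    for rate in rates:
        combined *= rate
    return combined

def missed_per_million(rate):
    """Expected missed carrier calls per million patients tested."""
    return rate * 1_000_000

single = 1e-3
both = combined_false_negative(single, single)
print(round(missed_per_million(single)), round(missed_per_million(both)))  # 1000 1
```

The same thousand-fold reduction accounts for the drop from 25 per million to 25 per billion affected births; if the two methods share failure modes, the real improvement would be smaller.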
  • the disclosure provides combinations of preparative methods to effectively increase sequencing coverage in regions containing disease-associated alleles. Since heterozygote error rate is largely tied both to deviations from 50:50 allele representation and, in the case of next-generation DNA sequencing, to deviations from average abundance (such that less abundant isolated targets are more likely to be undersampled at one or both alleles), selectively increasing coverage in these regions will also selectively increase sensitivity.
  • MIPs that detect presence or absence of specific known disease-associated mutations can be used to increase sensitivity selectively.
  • these MIPs would have a targeting arm whose 3'-most region is complementary to the expected mutation, and has a fill-in length of 0 or more bp. Thus, the MIP will form only if the mutation is present, and its presence will be detected by sequencing.
  • algorithms disclosed herein may be used to determine base identity with varying levels of stringency depending on whether the given position has any known disease-associated alleles. Stringency can be reduced in such positions by decreasing the minimum number of observed mutant reads necessary to make a consensus base-call. This will effectively increase sensitivity for mutant allele detection at the cost of decreased specificity.
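The position-dependent stringency described above might look like the following sketch. The thresholds (8 reads by default, 3 reads for known disease-associated alleles) and the function name are hypothetical and chosen only for illustration.

```python
def call_base(ref, read_counts, known_disease_alleles=(), min_reads=8, relaxed_min_reads=3):
    """Return the set of alleles called at one position.

    read_counts: dict mapping base -> number of supporting reads.
    A base needs min_reads observations to be called, except that a
    non-reference base listed in known_disease_alleles needs only
    relaxed_min_reads.
    """
    called = set()
    for base, n in read_counts.items():
        relaxed = base != ref and base in known_disease_alleles
        threshold = relaxed_min_reads if relaxed else min_reads
        if n >= threshold:
            called.add(base)
    return called

counts = {"C": 40, "T": 4}
print(sorted(call_base("C", counts)))                               # ['C']
print(sorted(call_base("C", counts, known_disease_alleles={"T"})))  # ['C', 'T']
```

The second call reports the mutant allele from only four reads, which shows the trade-off in the text: more sensitivity at known disease positions, at the cost of more false positives there.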
  • An embodiment of the invention combines MIPs plus hybridization enrichment, plus optionally extra MIPs targeted to specific known, common disease-associated loci, e.g., to detect the presence of a polymorphism in a target nucleic acid.
  • FIG. 12 illustrates a schematic using MIPs plus hybridization enrichment, plus optionally extra MIPs targeted to specific known, common disease-associated loci, e.g., to detect the presence of a polymorphism in a target nucleic acid.
  • FIGS. 13 and 14 illustrate different capture efficiencies for MIP-based captures.
  • FIG. 13 shows a graph of per-target abundance with MIP capture.
  • bias largely drives the heterozygote error rate, since targets which are less abundant here are less likely to be covered in sufficient depth during sequencing to adequately sample both alleles. This is from Turner et al 2009, Nature methods 6:315-6.
  • Hybridization enrichment results in a qualitatively similar abundance distribution, but the abundance of a given target is likely not correlated between the two methods.
  • biases can be detected or overcome by
  • aspects of the invention involve preparing genomic nucleic acid and/or contacting it with one or more different probes (e.g., capture probes, hybridization probes, MIPs, etc.).
  • the amount of genomic nucleic acid used per subject ranges from 1 ng to 10 micrograms (e.g., 500 ng to 5 micrograms).
  • the amount of probe used per assay may be optimized for a particular application.
  • the ratio (molar ratio, for example measured as a concentration ratio) of probe to genome equivalent e.g., haploid or diploid genome equivalent, for example for each allele or for both alleles of a nucleic acid target or locus of interest
  • the ratio ranges from 1/100 to 1000/1 (e.g., 1/100, 1/10, 1/1, 10/1, 100/1, or 1000/1).
  • lower, higher, or intermediate ratios may be used.
  • the amount of target nucleic acid and probe used for each reaction is normalized to avoid any observed differences being caused by differences in concentrations or ratios.
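As a rough illustration of the probe-to-genome-equivalent ratio discussed above, the sketch below converts a mass of human genomic DNA into haploid genome equivalents and then into the probe amount needed for a chosen molar ratio. The haploid genome size (about 3.1e9 bp) and the average mass per base pair (about 650 g/mol) are common approximations introduced here as assumptions; they are not values from the text, and the function names are hypothetical.

```python
AVOGADRO = 6.022e23          # molecules per mole
HAPLOID_GENOME_BP = 3.1e9    # assumption: approximate human haploid genome size
AVG_BP_MASS = 650.0          # assumption: approximate g/mol per double-stranded bp


def haploid_genome_equivalents(dna_ng):
    """Approximate number of haploid genome copies in dna_ng nanograms of DNA."""
    grams = dna_ng * 1e-9
    grams_per_genome = HAPLOID_GENOME_BP * AVG_BP_MASS / AVOGADRO
    return grams / grams_per_genome


def probe_amount_fmol(dna_ng, probes_per_genome):
    """Femtomoles of one probe needed for the requested probe:genome ratio."""
    probe_molecules = haploid_genome_equivalents(dna_ng) * probes_per_genome
    return probe_molecules / AVOGADRO * 1e15
```

Under these assumptions, 1 microgram of genomic DNA corresponds to roughly 3e5 haploid genome equivalents, so a 100/1 ratio needs about 0.05 fmol of each probe.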
  • the genomic DNA concentration is read using a standard spectrophotometer or by fluorescence (e.g., using a fluorescent intercalating dye). The probe concentration may be determined experimentally or using information specified by the probe manufacturer.
  • a locus may be amplified and/or sequenced in a reaction involving one or more primers.
  • the amount of primer added for each reaction can range from 0.1 pmol to 1 nmol, or from 0.15 pmol to 1.5 nmol (for example, around 1.5 pmol). However, other amounts (e.g., lower, higher, or intermediate amounts) may be used.
  • one or more intervening sequences e.g., sequence between the first and second targeting arms on a MIP capture probe
  • identifier or tag sequences e.g., a target sequence
  • other probe sequences e.g., other probe sequences that may be in a biological sample.
  • these sequences may be designed to have a sufficient number of mismatches with any genomic sequence (e.g., at least 5, 10, 15, or more mismatches out of 30 bases) or to have a Tm (e.g., a mismatch Tm) that is lower (e.g., at least 5, 10, 15, 20, or more degrees C lower) than the hybridization reaction temperature.
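The mismatch criterion above can be checked computationally. The following is a minimal brute-force sketch, assuming ungapped comparison against every window of a reference sequence on both strands; the function names and the default threshold are illustrative, and a genome-scale screen would need an indexed aligner rather than this exhaustive scan.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_mismatches(candidate, reference):
    """Fewest mismatches between candidate and any reference window (both strands)."""
    k = len(candidate)
    if len(reference) < k:
        raise ValueError("reference is shorter than the candidate sequence")
    queries = (candidate, reverse_complement(candidate))
    return min(
        hamming(q, reference[i:i + k])
        for q in queries
        for i in range(len(reference) - k + 1)
    )


def passes_screen(candidate, reference, min_required=5):
    """True if the candidate differs from every window by at least min_required bases."""
    return min_mismatches(candidate, reference) >= min_required
```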
  • a targeting arm as used herein may be designed to hybridize (e.g., be complementary) to either strand of a genetic locus of interest if the nucleic acid being analyzed is DNA (e.g., genomic DNA).
  • a targeting arm should be designed to hybridize to the transcribed RNA.
  • MIP probes referred to herein as "capturing" a target sequence are actually capturing it by template-based synthesis rather than by capturing the actual target molecule (other than for example in the initial stage when the arms hybridize to it or in the sense that the target molecule can remain bound to the extended MIP product until it is denatured or otherwise removed).
  • a targeting arm may include a sequence that is complementary to one allele or mutation (e.g., a SNP or other polymorphism, a mutation, etc.) so that the probe will preferentially hybridize (and capture) target nucleic acids having that allele or mutation.
  • each targeting arm is designed to hybridize (e.g., be complementary) to a sequence that is not polymorphic in the subjects of a population that is being evaluated. This allows target sequences to be captured and/or sequenced for all alleles, and then the differences between subjects (e.g., calls of heterozygous or homozygous for one or more loci) can be based on the sequence information and/or the frequency as described herein.
  • sequence tags also referred to as barcodes
  • sequence tags may be designed to be unique in that they do not appear at other positions within a probe or a family of probes and they also do not appear within the sequences being targeted. Thus they can be used to uniquely identify (e.g., by sequencing or hybridization properties) particular probes having other characteristics (e.g., for particular subjects and/or for particular loci).
  • probes or regions of probes or other nucleic acids are described herein as comprising or including certain sequences or sequence characteristics (e.g., length, other properties, etc.). However, it should be appreciated that in some embodiments, any of the probes or regions of probes or other nucleic acids consist of those regions (e.g., arms, central regions, tags, primer sites, etc., or any combination thereof) or consist of those sequences or have sequences with characteristics that consist of one or more characteristics (e.g., length, or other properties, etc.) as described herein in the context of any of the embodiments (e.g., for tiled or staggered probes, tagged probes, length detection, sensitivity enhancing algorithms or any combination thereof).
  • nucleic acid refers to multiple linked nucleotides (i.e., molecules comprising a sugar (e.g., ribose or deoxyribose) linked to an exchangeable organic base, which is either a pyrimidine (e.g., cytosine (C), thymidine (T) or uracil (U)) or a purine (e.g., adenine (A) or guanine (G))).
  • Nucleic acid and “nucleic acid molecule” may be used interchangeably and refer to oligoribonucleotides as well as oligodeoxyribonucleotides.
  • the terms shall also include polynucleosides (i.e., a polynucleotide minus a phosphate) and any other organic base containing nucleic acid.
  • the organic bases include adenine, uracil, guanine, thymine, cytosine and inosine.
  • nucleic acids may be single or double stranded.
  • the nucleic acid may be naturally or non-naturally occurring.
  • Nucleic acids can be obtained from natural sources, or can be synthesized using a nucleic acid synthesizer (i.e., synthetic). Harvest and isolation of nucleic acids are routinely performed in the art and suitable methods can be found in standard molecular biology textbooks. (See, for example, Maniatis' Handbook of Molecular Biology.)
  • the nucleic acid may be DNA or RNA, such as genomic DNA, mitochondrial DNA, mRNA, cDNA, rRNA, miRNA, or a combination thereof.
  • Non-naturally occurring nucleic acids such as bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs) can also be used.
  • the invention also contemplates the use of nucleic acid derivatives.
  • nucleic acid derivatives may increase the stability of the nucleic acids of the invention by preventing their digestion, particularly when they are exposed to biological samples that may contain nucleases.
  • a nucleic acid derivative is a non-naturally occurring nucleic acid or a unit thereof.
  • Nucleic acid derivatives may contain non-naturally occurring elements such as non-naturally occurring nucleotides and non-naturally occurring backbone linkages.
  • Nucleic acid derivatives may contain backbone modifications such as but not limited to phosphorothioate linkages, phosphodiester modified nucleic acids, phosphorothiolate modifications, combinations of phosphodiester and phosphorothioate nucleic acid,
  • the backbone composition of the nucleic acids may be homogeneous or heterogeneous.
  • Nucleic acid derivatives may contain substitutions or modifications in the sugars and/or bases. For example, they include nucleic acids having backbone sugars which are covalently attached to low molecular weight organic groups other than a hydroxyl group at the 3' position and other than a phosphate group at the 5' position (e.g., a 2'-O-alkylated ribose group). Nucleic acid derivatives may include non-ribose sugars such as arabinose.
  • Nucleic acid derivatives may contain substituted purines and pyrimidines such as C-5 propyne modified bases, 5-methylcytosine, 2-aminopurine, 2-amino-6-chloropurine, 2,6-diaminopurine, hypoxanthine, 2-thiouracil and pseudoisocytosine.
  • substitution(s) may include one or more substitutions/modifications in the sugars/bases, groups attached to the base, including biotin, fluorescent groups (fluorescein, cyanine, rhodamine, etc), chemically-reactive groups including carboxyl, NHS, thiol, etc., or any combination thereof.
  • a nucleic acid may be a peptide nucleic acid (PNA), locked nucleic acid (LNA), DNA, RNA, or co-nucleic acids of the same such as DNA-LNA co-nucleic acids.
  • PNA are DNA analogs having their phosphate backbone replaced with 2-aminoethyl glycine residues linked to nucleotide bases through glycine amino nitrogen and methylenecarbonyl linkers.
  • PNA can bind to both DNA and RNA targets by Watson-Crick base pairing, and in so doing form stronger hybrids than would be possible with DNA or RNA based oligonucleotides in some cases.
  • PNA are synthesized from monomers connected by a peptide bond (Nielsen, P. E. et al. Peptide Nucleic Acids, Protocols and Applications, Norfolk: Horizon Scientific Press, p. 1-19 (1999)). They can be built with standard solid phase peptide synthesis technology. PNA chemistry and synthesis allows for inclusion of amino acids and polypeptide sequences in the PNA design. For example, lysine residues can be used to introduce positive charges in the PNA backbone. All chemical approaches available for the modifications of amino acid side chains are directly applicable to PNA. Several types of PNA designs exist, and these include single strand PNA (ssPNA), bisPNA and pseudocomplementary PNA (pcPNA).
  • ssPNA binds to single stranded DNA (ssDNA) preferably in antiparallel orientation (i.e., with the N-terminus of the ssPNA aligned with the 3' terminus of the ssDNA) and with a Watson-Crick pairing.
  • PNA also can bind to DNA with a Hoogsteen base pairing, and thereby forms triplexes with double stranded DNA (dsDNA) (Wittung, P. et al., Biochemistry 36:7973 (1997)).
  • LNA locked nucleic acid
  • LNAs form hybrids with DNA that are at least as stable as PNA/DNA hybrids (Braasch, D. A. et al., Chem. & Biol. 8(1): 1-7 (2001)). Therefore, LNA can be used just as PNA molecules would be. LNA binding efficiency can be increased in some embodiments by adding positive charges to it. LNAs have been reported to have inherently increased binding affinity.
  • All targets are captured as a set of partially-overlapping subtargets.
  • a 200 bp target exon might be captured as a set of 12 subtargets, each 60 bp in length (FIG. 1).
  • Each subtarget is chosen such that it partially overlaps two or three other subtargets.
  • all probes are composed of three regions: 1) a 20 bp 'targeting arm' comprised of sequence which hybridizes immediately upstream from the sub-target, 2) a 30 bp 'constant region' comprised of sequence used as a pair of amplification priming sites, and 3) a second 20 bp 'targeting arm' comprised of sequence which hybridizes immediately downstream from the sub-target.
  • Targeting arm sequences will be different for each capture probe in a set, while constant region sequence will be the same for all probes in the set, allowing all captured targets to be amplified with a single set of primers.
  • Targeting arm sequences should be designed such that any given pair of 20 bp sequences is unique in the target genome (to prevent spurious capture of undesired sites). Additionally, melting temperatures should be matched for all probes in the set such that hybridization efficiency is uniform for all probes at a constant temperature (e.g., 60°C). Targeting arm sequences should be computationally screened to ensure they do not form strong secondary structure that would impair their ability to basepair with the genomic target.
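The tiling of a target into partially overlapping subtargets, with a 20 bp targeting arm on each side, can be sketched in coordinates as below. This is an assumption-laden illustration: the even spacing of subtarget starts, the half-open coordinate convention, and the function names are choices made for the example, and uniqueness, melting-temperature, and secondary-structure screening are not modeled.

```python
def tile_subtargets(target_start, target_len=200, sub_len=60, n_sub=12):
    """Return n_sub evenly spaced, overlapping (start, end) subtarget intervals.

    Coordinates are half-open [start, end). Even spacing is an assumption;
    the text only states that subtargets partially overlap.
    """
    if n_sub < 2 or sub_len > target_len:
        raise ValueError("need at least 2 subtargets no longer than the target")
    last_offset = target_len - sub_len
    offsets = [round(i * last_offset / (n_sub - 1)) for i in range(n_sub)]
    return [(target_start + o, target_start + o + sub_len) for o in offsets]


def flanking_arm_coords(sub_start, sub_end, arm_len=20):
    """Coordinates of the regions immediately upstream and downstream of a subtarget."""
    upstream = (sub_start - arm_len, sub_start)
    downstream = (sub_end, sub_end + arm_len)
    return upstream, downstream
```

For a 200 bp target starting at coordinate 0, this yields 12 intervals of 60 bp, the first covering 0 to 60 and the last covering 140 to 200.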
  • the first step in performing the detection/correction is to determine how many differentiator tag sequences are necessary for the given sample.
  • 1000 genomic targets corresponding to 1000 exons were captured. Since the differentiator tag sequence is part of the probe, it will measure/report biases that occur from the earliest protocol steps. Also, being located in the backbone, the differentiator tag sequence can easily be sequenced from a separate priming site, and therefore not impact the total achievable read-length for the target sequence.
  • MIP probes are synthesized using standard column-based oligonucleotide synthesis by any number of vendors (e.g. IDT), and differentiator tag sequences are introduced as 'degenerate' positions in the backbone. Each degenerate position increases the total number of differentiator tag sequences synthesized by a factor of 4, so a 10 nt degenerate region implies a differentiator tag sequence complexity of ~1e6 species.
  • FIG. 5 depicts a method for making diploid genotype calls in which repeat
  • the number of differentiator tag sequences necessary to be confident (within some statistical bounds) that a certain differentiator tag sequence will not be observed more than once by chance in combination with a certain target sequence was determined.
  • the total number of unique differentiator tag sequences for a certain differentiator tag sequence length is determined as
  • N is the total number of possible unique differentiator tag sequences
  • M is the number of target sequence copies in the capture reaction.
  • in a MIP capture reaction in which MIP probes, each having a differentiator tag sequence of 15 nucleotides, are combined with 10000 target sequence copies (e.g., genome equivalents), the probability of capturing one or more copies of a target sequence having the same differentiator tag sequence is 0.05.
  • the MIP reaction will produce very few (usually 0, but occasionally 1 or more) targets where multiple copies are tagged with the same differentiator tag sequence.
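The probability formula itself does not survive in the text above, so the sketch below assumes the standard birthday-problem form: with N = 4^L equally likely tags and M target copies each receiving an independent random tag, the chance that at least two copies share a tag is 1 minus the product of (N - i)/N for i from 0 to M - 1. Under that assumption, a 15 nt tag and 10000 copies give about 0.046, consistent with the 0.05 quoted above. The function names are hypothetical.

```python
import math


def tag_space_size(tag_len):
    """Number of distinct tags from tag_len fully degenerate positions (4 per position)."""
    return 4 ** tag_len


def collision_probability(tag_len, copies):
    """Probability that at least two of `copies` molecules share a tag.

    Assumes each molecule draws its tag independently and uniformly at random
    (birthday-problem model); this is an assumption, not the source's formula.
    """
    n = tag_space_size(tag_len)
    if copies > n:
        return 1.0
    # Sum logs for numerical stability instead of multiplying many terms near 1.
    log_all_unique = sum(math.log1p(-i / n) for i in range(copies))
    return 1.0 - math.exp(log_all_unique)
```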
  • FIG. 6 depicts results of a simulation for 100000 capture reactions having 15 nucleotide differentiator tag sequences and 10000 target sequences.
  • Monte Carlo simulations were performed to determine sequencing coverage requirements.
  • the simulations assume 10000 genomic copies of a given locus (target), half 'mom' alleles and half 'dad' alleles.
  • the simulations further assume 1% efficiency of capture for the MIP reaction.
  • the simulation samples from a capture mix 100 times without replacement to create a set of 100 capture products.
  • the simulation samples from the set of 100 capture products with replacement (assuming unbiased amplification) to generate 'reads' from either mom or dad.
  • the number of reads sampled depends on the coverage.
  • the number of independent reads from both mom and dad necessary to make a high-quality base-call (assumed to be 10 or 20 reads) was then determined.
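The simulation steps above can be sketched as follows. This is one interpretation rather than the original simulation code: "independent reads" is taken to mean distinct capture products observed for each parental allele, and the function names, trial count, and seed are illustrative.

```python
import random


def simulate_capture(rng, n_reads, genomic_copies=10000, capture_efficiency=0.01):
    """Return (mom, dad) counts of distinct capture products seen among n_reads reads."""
    half = genomic_copies // 2  # copy ids below `half` are 'mom', the rest 'dad'
    n_captured = round(genomic_copies * capture_efficiency)
    # Capture: sample genomic copies without replacement.
    captured = rng.sample(range(genomic_copies), n_captured)
    # Sequencing after unbiased amplification: sample capture products with replacement.
    reads = rng.choices(captured, k=n_reads)
    seen = set(reads)
    mom = sum(1 for copy_id in seen if copy_id < half)
    return mom, len(seen) - mom


def fraction_callable(n_reads, min_per_allele=10, trials=2000, seed=0):
    """Fraction of simulated captures with enough independent reads from both parents."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        mom, dad = simulate_capture(rng, n_reads)
        if mom >= min_per_allele and dad >= min_per_allele:
            ok += 1
    return ok / trials
```

Sweeping `n_reads` and reading off where `fraction_callable` approaches 1 gives a coverage requirement for the chosen `min_per_allele` (10 or 20 in the text).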
  • At least three sets of control loci are captured in parallel that have a priori been shown to serve as proxies for various lengths of target locus. For example, if the target locus is expected to have a length between 50 and 1000 bp, then sets of control loci having lengths of 50, 250, and 1000 bp could be captured (e.g. 20 loci per set should provide adequate protection from outliers), and their abundance digitally measured by sequencing. These loci should be chosen such that minimal variation in efficiency between samples and on multiple runs of the same sample is observed (and are therefore 'efficiency invariant'). These will serve as 'reference' points that define the shape of the curve of abundance-vs-length. Determining the length of the target is then simply a matter of 'reading' the length from the appropriate point on the calibration curve.
  • the statistical confidence one has in the estimate of target length from this method is driven largely by three factors: 1) reproducibility/variation of the abundance data used to generate the calibration curve; 2) goodness of fit of the regression to the 'control' datapoints; 3) reproducibility of abundance data for the target locus being measured.
  • Statistical bounds on 1) and 2) will be known in advance, having been measured during development of the assay. Additionally, statistical bounds on 3) will be known in general in advance, since assay development should include adequate population sampling and measure of technical reproducibility. Standard statistical methods should be used to combine these three measures into a single P value for any given experimental measure of target abundance.
  • the regression can be used to predict the length value for n observations of the target locus whose length is unknown.
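A minimal sketch of reading a target length off a calibration line follows. It assumes, purely for illustration, that length can be regressed linearly on measured abundance; the real abundance-vs-length curve may need a transformation or a different model, and the control values in the usage note are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_length(control_abundances, control_lengths, target_abundance):
    """Predict the length of a target locus from its measured abundance."""
    slope, intercept = fit_line(control_abundances, control_lengths)
    return slope * target_abundance + intercept
```

For example, with hypothetical control sets of length 50, 250, and 1000 bp measured at abundances 1900, 1500, and 0, a target measured at abundance 1000 is predicted to be 500 bp long.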
  • the predicted response value, computed when n observations is substituted into the equation for the regressed line, will have arbitrary precision.
  • the confidence interval for a predicted response is calculated as:
  • the confidence interval for the predicted value is given by $\hat{y} \pm t^{*} s_{\hat{y}}$, where $\hat{y}$ is the fitted value corresponding to $x^{*}$.
  • the value $t^{*}$ is the upper $(1-C)/2$ critical value for the $t(n-2)$ distribution.
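The standard error term is not spelled out in the text above. Assuming the usual simple linear regression prediction interval for a single new observation at $x^{*}$ (an assumption, since the source expression is not shown), the interval can be written as:

```latex
\hat{y} \pm t^{*} s_{\hat{y}},
\qquad
s_{\hat{y}} = s \sqrt{1 + \frac{1}{n} + \frac{(x^{*} - \bar{x})^{2}}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{n - 2}}
```

Here $n$ is the number of calibration points, $\bar{x}$ is their mean, $s$ is the residual standard error of the fitted line, and $t^{*}$ is the upper $(1-C)/2$ critical value of the $t(n-2)$ distribution for confidence level $C$.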
  • a technique for analyzing a locus of interest can involve the following steps.
  • MIP probes are synthesized using standard column-based oligonucleotide synthesis by any number of vendors (e.g. IDT).
  • MIPs, hybridization, and mutation-detection MIPs are used to genotype a set of 1000 targets.
  • the protocol permits detection of any of 50 specific known point mutations
  • MIP capture reaction is performed essentially as described in Turner et al 2009, Nature methods 6:315-6.
  • a set of MIPs is designed such that each probe in the set flanks one of the 1000 targets.
  • a hybridization enrichment reaction is performed using the Agilent SureSelect procedure.
  • the genomic DNA to be enriched is converted into a shotgun sequencing library using Illumina's 'Fragment Library' kit and protocol.
  • Agilent's web interface is used to design a set of probes which will hybridize to the target nucleic acids. Separately, a set of probes are designed
  • Mutation-detection MIPs which will form MIPs only if mutations (e.g., specific polymorphisms) are present.
  • Each mutation-detection MIP has a 3'-most base identity that is specific for a single known mutation.
  • a reaction with this set of mutation-detection MIPs is performed to selectively detect the presence of any mutant alleles.
  • the two MIP reactions are combined (e.g., at potentially non-equimolar ratios to further increase sensitivity of mutation detection) into a single tube, and run as one sample on the next-generation DNA sequencing instrument.
  • the hybridization-enriched reaction is run as a separate sample on the next-generation DNA sequencing instrument.
  • Reads from each 'sample' are combined by a software algorithm which forms a consensus diploid genotype at each position in the target set by evaluating the total coverage at each position, the origin of each read in that total coverage, the quality score of each individual read, and the presence (or absence) of any reads derived from mutation-specific MIPs overlapping the region.
  • Carrier screening is performed either pre-conception or during pregnancy to determine a couple's risk of having a child with a recessive genetic disorder.
  • the number of individuals who could benefit from such screening is substantial, as roughly 2 million women give birth to their first child each year in the US.
  • the disorders for which testing is recommended vary based on a number of different patient-specific factors.
  • For instance, the American Congress of Obstetricians and Gynecologists recommends that screening for cystic fibrosis be offered to all women of reproductive age, and that testing be performed for additional disorders if indicated by family history, partner's carrier status, or ethnicity.
  • For next-generation DNA sequencing (NGS) to be used for carrier screening in a clinical setting, it must satisfy at least three requirements.
  • analytical accuracy must be both high and well characterized within the clinically relevant genes or regions.
  • the NGS workflow employed should yield data sufficient to cover the vast majority of targeted bases at a depth sufficient to make high-quality genotype calls.
  • the workflow combines automated, optimized molecular inversion probe target capture with molecular barcoding to maximize the sample throughput of a next-generation DNA sequencing machine, and employs a novel read assembly-based alignment method that enables accurate identification of both substitution and insertion/deletion lesions.
  • the workflow is applied to sequence the protein-coding regions of fifteen genes in which loss-of-function mutations cause recessive Mendelian disorders often included as part of routine carrier screening; realistic simulation and comparison to Sanger sequencing data demonstrate that the approach achieves high accuracies.
  • Probes were designed to capture the coding regions and certain well-characterized non-coding regions of 15 genes (See Table 5 below).
  • the 5' targeting arm (ligation arm) and 3' targeting arm (extension arm) comprised a total of 40 nucleotides, and were designed to flank 130 bp target regions. Probes were selected to maximize performance with respect to both capture efficiency and robustness to common polymorphisms.
  • probes targeting a genomic interval were designed and assigned score tuples consisting of: 1) presence of guanine or cytosine as the 5'-most base of the ligation arm, 2) the number of dbSNP (version 130) entries intersecting targeting arm sites, and 3) the root mean squared deviation of the arms' predicted melting temperatures from optimal values derived from empirical studies of capture efficiency. Using these tuples, probes were ranked sequentially by 1, 2, and 3, and the probe with the highest rank was chosen. Probes were designed to 'tile' across targets with a period of 25 bp such that multiple probes with orthogonal targeting arm sequences captured every genomic position. The molecular inversion probes are provided in Appendix A.
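The sequential ranking by score tuple can be sketched with a lexicographic sort key. The sketch assumes that a G or C at the 5'-most base of the ligation arm is preferred, that fewer dbSNP intersections are better, and that a smaller melting-temperature deviation is better; those directions, the `CandidateProbe` type, and the optimal Tm values are assumptions for illustration, not details confirmed by the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateProbe:
    name: str
    ligation_arm: str   # written 5' to 3'
    dbsnp_hits: int     # dbSNP entries intersecting the targeting arm sites
    arm_tms: tuple      # predicted Tm of (ligation arm, extension arm), in degrees C


def tm_rmsd(arm_tms, optimal_tms):
    """Root mean squared deviation of the arm Tms from their optimal values."""
    squared = [(t - o) ** 2 for t, o in zip(arm_tms, optimal_tms)]
    return math.sqrt(sum(squared) / len(squared))


def score_tuple(probe, optimal_tms):
    """Lexicographic key; smaller sorts first (criteria 1, 2, 3 in order)."""
    starts_gc = probe.ligation_arm[0].upper() in "GC"
    return (0 if starts_gc else 1, probe.dbsnp_hits, tm_rmsd(probe.arm_tms, optimal_tms))


def best_probe(candidates, optimal_tms):
    """Pick the top-ranked candidate for one genomic interval."""
    return min(candidates, key=lambda p: score_tuple(p, optimal_tms))
```

Because Python compares tuples element by element, later criteria only break ties left by earlier ones, which matches ranking "sequentially by 1, 2, and 3".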
  • Appendix A also includes the upstream and downstream regions corresponding to each molecular inversion probe, which is shown by the start position and end position coordinates of each targeting arm relative to the target sub-region's coordinates on the Human Genome 18 (HG 18).
  • Appendix B lists the genomic sub-regions targeted by the molecular inversion probes of Appendix A.
  • Table 5 shows diseases and genes the workflow is designed to interrogate, and the corresponding genes and nucleotides targeted.
  • Genomic DNA was purchased from the Coriell Cell Repositories (Camden, NJ) or isolated from whole blood by the Gentra Puregene method (Qiagen) modified to conclude with an overnight incubation at 65°C. Overnight incubation at an elevated temperature led to DNA shearing and an increased fraction of callable bases. All samples were considered "IRB Exempt" by Liberty IRB, our independent Institutional Review Board. On Tecan automation, 1.5 µg of genomic DNA was annealed with 1 µl of molecular inversion probe mix in 1X Ampligase buffer (Epicentre Biotechnologies) for 5 min at 95°C followed by 24 hr at 54°C.
  • Exonuclease I and 50 U Exonuclease III were then added by Tecan automation and incubated for 1 hr at 37 °C followed by 10 min at 98°C.
  • the capture reaction product was amplified in two separate PCR reactions designed to attach a molecular barcode and Illumina cluster amplification sequences to the ends of each molecule so as to enable sequencing from each end of the captured region.
  • Tecan automation was used to set up the PCR, which was carried out with 3.75 µl of capture product, 15 pmol of each primer, 10 nmol dNTPs, and 1 U VeraSeq polymerase (Enzymatics, Inc) in 1X VeraSeq buffer. Cycling conditions were: 98°C 30 sec, 17-22X (98°C 10 sec, 54°C 30 sec, 72°C 15 sec), 4°C forever.
  • Raw .bcl files were converted to qseq files using bclConverter (Illumina).
  • Fastq files were generated by 'de-barcoding' genomic reads using the associated barcode reads; reads for which barcodes yielded no exact match to an expected barcode, or contained one or more low-quality basecalls, were discarded.
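The de-barcoding rule above (exact barcode match, no low-quality barcode basecalls) can be sketched as below. The input layout, the function name, and the quality cutoff of 20 are assumptions chosen for illustration; the text does not state the threshold used.

```python
def debarcode(read_pairs, expected_barcodes, min_quality=20):
    """Group genomic reads by sample using their associated barcode reads.

    read_pairs: iterable of (genomic_read, barcode_seq, barcode_quals), where
        barcode_quals is a list of per-base quality scores (hypothetical layout).
    expected_barcodes: dict mapping barcode sequence to sample name.
    A read is discarded if its barcode has no exact match or if any barcode
    basecall falls below min_quality (an assumed cutoff).
    """
    by_sample = {}
    for genomic_read, barcode_seq, barcode_quals in read_pairs:
        sample = expected_barcodes.get(barcode_seq)
        if sample is None:
            continue  # no exact match to an expected barcode
        if any(q < min_quality for q in barcode_quals):
            continue  # one or more low-quality barcode basecalls
        by_sample.setdefault(sample, []).append(genomic_read)
    return by_sample
```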
  • the remaining reads were aligned to hg18 on a per-sample basis using BWA version 0.5.7 for short alignments and genotype calls were made using GATK version 1.0.4168 after base quality score re-calibration, realignment (with GATK version 1.0.5083) and targeting arm removal.
  • Clinical significance of variant calls was determined by matching against a VCF-formatted database of disease-causing mutations curated from the literature, with equivalent insertion/deletion regions calculated as previously described.
  • PCR was carried out with the genomic DNA described in Target capture, barcoding, and NGS using a modified version of the protocol from Zimmerman et al., using PCR primers from Jones et al., except M13 tails were removed.
  • Zimmerman RS, Cox S, Lakdawala NK, et al.
  • Mutation Surveyor (MS) software (SoftGenetics), version 4.0.5, was used in batch mode with default parameters to align ab1 files to the target reference sequence and make genotype calls. Positions where MS base calls did not match in the forward and reverse directions were removed from consideration. All high-quality NGS genotype calls within 10 bp (inclusive) of target exons were subjected to cross-validation against VCF-converted MS variant calls. This process is described in more detail below.
  • NGS calls were classified true positive (TP), discordant (non-reference) variant genotype (DVG), or false positive (FP) if they matched MS calls by (i-iii), (iii) only, or none of the above criteria, respectively.
  • MS variant calls with no corresponding NGS variant call were classified false negative (FN).
  • Indel calls classified as DVG were re-classified as TP because GATK 1.0.4168 does not report zygosity for such calls.
  • each NGS-detected variant allele is annotated for functional (clinical) significance by determining its relative position within the corresponding consensus coding sequence (CCDS).
  • CCDS consensus coding sequence
  • CCDS44531.1 ABCC8 (CCDS31437.1), HEXA (CCDS10243.1), BLM (CCDS10363.1), ASPA (CCDS11028.1), G6PC (CCDS11446.1), MCOLN1 (CCDS12180.1), BCKDHA
  • Clinically significant (reportable) mutations include alterations to the conserved 2 basepairs flanking each exon (splice site), the native start codon, or the last codon (readthrough), as well as truncating (nonsense and frameshift) mutations. Additionally, GATK occasionally reports alternate insertion patterns with non-native bases (e.g. 'N') chosen from a minority of reads. These were classified 'indeterminate' and reportable to prompt follow-up confirmation.
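The reportability rules above can be sketched as a small classifier. This is a simplified illustration: exon coordinates are assumed to be 1-based and inclusive, the effect labels (`nonsense`, `frameshift`, `start_lost`, `stop_lost`) are hypothetical names for the categories listed in the text, and a real annotation step would derive them from the CCDS transcript.

```python
# Hypothetical effect labels for the categories named in the text.
REPORTABLE_EFFECTS = {"nonsense", "frameshift", "start_lost", "stop_lost"}


def is_splice_site(pos, exon_start, exon_end, flank=2):
    """True if pos lies in the `flank` bases immediately outside either exon boundary.

    Exon coordinates are assumed 1-based and inclusive.
    """
    in_upstream_flank = exon_start - flank <= pos < exon_start
    in_downstream_flank = exon_end < pos <= exon_end + flank
    return in_upstream_flank or in_downstream_flank


def is_reportable(pos, exon_start, exon_end, effect=None, indeterminate=False):
    """Apply the reportable-mutation rules to one variant call."""
    if indeterminate:
        return True  # flagged to prompt follow-up confirmation
    if is_splice_site(pos, exon_start, exon_end):
        return True
    return effect in REPORTABLE_EFFECTS
```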
  • Table 6 shows the set of 94 samples derived from immortalized cell lines and 59 samples derived from whole blood.

Abstract

Aspects of the invention relate to methods and compositions that are useful for reducing bias and increasing the reproducibility of multiplex analysis of genetic loci. In some configurations, predetermined preparative steps and/or nucleic acid sequence analysis techniques are used in multiplex analyses for a plurality of genetic loci in a plurality of samples.
EP14762322.7A 2013-03-15 2014-03-14 Methods and compositions for evaluating genetic markers Withdrawn EP2971114A4 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201361789164P 2013-03-15 2013-03-15
US13/934,093 US20130337447A1 (en) 2009-04-30 2013-07-02 Methods and compositions for evaluating genetic markers
PCT/US2014/028212 WO2014143994A2 (fr) 2013-03-15 2014-03-14 Methods and compositions for evaluating genetic markers

Publications (2)

Publication Number Publication Date
EP2971114A2 true EP2971114A2 (fr) 2016-01-20
EP2971114A4 EP2971114A4 (fr) 2016-11-23

Family

ID=51538293

Family Applications (1)

Application Number Title Priority Date Filing Date
EP14762322.7A Withdrawn EP2971114A4 (fr) 2013-03-15 2014-03-14 Methods and compositions for evaluating genetic markers

Country Status (3)

Country Link
EP (1) EP2971114A4 (fr)
CA (1) CA2907177A1 (fr)
WO (1) WO2014143994A2 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3055338B1 (fr) * 2016-09-01 2020-05-29 Centre National De La Recherche Scientifique Method for the in vitro detection and identification of one or more microorganisms involved in the process of degradation, modification, or sequestration of one or more contaminants present in a biological sample
US11639521B2 (en) * 2019-06-03 2023-05-02 The Chinese University Of Hong Kong Method for determining the copy number of a tandem repeat sequence

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1685410A1 (fr) * 2003-11-19 2006-08-02 EVOTEC Neurosciences GmbH Diagnostic and therapeutic use of the human sgpl1 gene and protein for neurodegenerative diseases
EP2425240A4 (fr) * 2009-04-30 2012-12-12 Good Start Genetics Inc Methods and compositions for evaluating genetic markers
US20120165202A1 (en) * 2009-04-30 2012-06-28 Good Start Genetics, Inc. Methods and compositions for evaluating genetic markers
CA2888779A1 (fr) * 2012-11-07 2014-05-15 Good Start Genetics, Inc. Validation of genetic tests

Also Published As

Publication number Publication date
WO2014143994A2 (fr) 2014-09-18
CA2907177A1 (fr) 2014-09-18
WO2014143994A3 (fr) 2015-02-19
EP2971114A4 (fr) 2016-11-23

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20151013

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20161026

RIC1 Information provided on ipc code assigned before grant

Ipc: C12Q 1/02 20060101ALI20161020BHEP

Ipc: C12Q 1/68 20060101AFI20161020BHEP

Ipc: G01N 33/50 20060101ALI20161020BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20170523