WO2010039991A2 - Method of generating informative dna templates for high-throughput sequencing applications - Google Patents

Method of generating informative dna templates for high-throughput sequencing applications

Info

Publication number
WO2010039991A2
Authority
WO
WIPO (PCT)
Prior art keywords
adapter
dna
informative
templates
dna templates
Prior art date
Application number
PCT/US2009/059274
Other languages
French (fr)
Other versions
WO2010039991A3 (en
Inventor
John Mullet
Daryl Morishige
Original Assignee
The Texas A&M University System
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Texas A&M University System filed Critical The Texas A&M University System
Publication of WO2010039991A2 publication Critical patent/WO2010039991A2/en
Publication of WO2010039991A3 publication Critical patent/WO2010039991A3/en

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6811Selection methods for production or design of target specific oligonucleotides or binding molecules

Definitions

  • TAMC011WO_ST25.txt comprising nucleotide and/or amino acid sequences of the present invention submitted via EFS-Web.
  • the subject matter of the Sequence Listing is incorporated herein by reference in its entirety.
  • the present invention generally relates to the generation of DNA templates from specific sites in genomes for high throughput sequence-based analysis, for such applications as genotyping, marker-assisted breeding, genetic mapping, haplotyping, physical map construction, and gene mapping. More particularly, the invention relates to methods for isolating and enriching the population of informative DNA templates for such high throughput sequencing applications.
  • DNA polymorphisms A number of techniques have been described for discovering and detecting DNA polymorphisms. Most involve an initial DNA sequence polymorphism discovery step, which usually involves direct sequencing of cDNA or genes, or hybridization to oligonucleotide arrays, followed by the development of targeted single nucleotide polymorphism (SNP) or insertion/deletion (InDel) detection assays. Digestion of genomic DNA by restriction enzymes, and/or shearing, followed by adapter ligation, is a standard procedure for preparing amplifiable templates from genomes for a variety of uses, such as polymerase chain reaction (PCR)-based genotyping assays.
  • PCR polymerase chain reaction
  • AFLP amplified fragment length polymorphism
  • PCR PCR to amplify restriction fragments obtained from a complex mixture of DNA fragments that are prepared by the action of restriction endonucleases on genomic DNA.
  • the primers used for amplification of DNA, or to initiate sequence analysis of DNA are not directed against a known genomic DNA sequence, but rather are designed such that they are complementary to sequences in adapters ligated to the ends of the restriction fragments.
  • This strategy yields restriction site localized (RSL) DNA sequences, some of which include polymorphisms.
  • RSL restriction site localized
  • secondary sources of DNA such as chloroplast, mitochondria, and contaminating bacteria or fungal genomes, may be present at relatively high copy number in some samples of a nuclear DNA selected for analysis. These non-target sources of DNA restriction fragments reduce the efficiency of bulk amplification and sequencing procedures.
  • Embodiments of the present methods comprise the sequential use of restriction enzymes and adapter ligation for template generation, followed by selection of informative templates by hybridization to oligonucleotides attached to a solid matrix, or by targeted PCR amplification, with subsequent direct sequence analysis.
  • Embodiments of this specific combination of procedures provide a highly flexible, very low cost, and highly accurate way to obtain genotyping information.
  • Compositions and kits for carrying out such methods are also provided in accordance with some embodiments of the invention.
  • a method of generating informative DNA templates for sequencing comprises: (a) obtaining a fragmented genomic DNA sequence from a first individual, to provide a mixture of DNA fragments, wherein the genomic DNA comprises a plurality of polymorphisms; (b) ligating at least a first adapter to the DNA fragments, to provide a mixture comprising adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein each said informative DNA template comprises a unique sequence in a location compatible with high throughput DNA sequencing of the template, wherein the unique sequence comprises a unique polymorphic site in the species genome sequence; and (c) selecting adapter-modified informative DNA templates (e.g., by either hybridization-based selection or targeted PCR amplification of the adapter-modified informative templates), to obtain an enriched mixture of adapter-modified informative DNA templates.
  • a hybridization-based selection such as in step (c) involves forming hybridized complexes comprising the adapter-modified informative DNA templates, oligonucleotides and a solid matrix, and excluding the non-informative DNA templates.
  • the method further comprises (d) separating non-hybridized non-informative DNA templates from the hybridized complexes; and (e) releasing the informative DNA templates from the hybridized complexes, to obtain an enriched mixture of adapter-modified informative DNA templates.
  • forming the hybridized complexes such as in step (c) comprises hybridizing oligonucleotides to complementary sequences in the informative DNA templates.
  • targeted PCR amplification of informative templates such as in step (c) involves a first primer complementary to the first adapter, and a second primer (or plurality of second primers) complementary to one or more unique sequences in the informative DNA fragments, wherein each second primer is designed such that the resulting amplified DNA templates are of a predetermined sequence length (or range of lengths) and comprise informative DNA sequences.
  • ligating comprises ligating a second adapter to a terminus of each said DNA fragment opposite the first adapter, to provide the mixture of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates.
  • some templates are generated that comprise adapter A or adapter B ligated to both termini of some DNA fragments making them incompatible with some high throughput DNA sequencing technologies. Therefore, some embodiments further comprise the use of suppression PCR to amplify and enrich DNA templates comprising adapters A and B that are compatible with high throughput sequencing relative to templates containing only adapter A or adapter B that are not suitable for sequencing.
  • a method according to the invention further comprises selecting adapter-modified informative DNA templates having a predetermined sequence length that is compatible with DNA sequencing (e.g., bridge amplification in the case of ILLUMINA® SGAII™ sequencing) and with the read length of a selected high throughput sequencing process.
  • the method further comprises a step (d) for subjecting the enriched mixture of adapter-modified informative DNA templates to a high-throughput DNA sequencing procedure, to obtain the sequences of the informative DNA templates, and a step (e) comparing the sequences of the informative DNA templates to at least one set of reference genomic DNA sequences to identify the specific (e.g., polymorphic) allele sequence obtained from each template or site in the genome.
  • at least one set of reference genomic DNA sequences is obtained from at least one reference individual.
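The comparison in step (e) can be pictured as a lookup of each read among known reference allele sequences. The following is only a minimal Python sketch, assuming exact-match comparison. The site name, allele labels, and 30 bp sequences are hypothetical and are not taken from the Sequence Listing.

```python
def call_allele(read, reference_alleles):
    """Return (site, allele) for the reference allele the read matches exactly, else None."""
    for (site, allele), sequence in reference_alleles.items():
        if read == sequence:
            return site, allele
    return None


# Hypothetical reference set: one polymorphic site, two parental alleles (30 bp each),
# differing by a single base near the 3' end.
REFERENCE = {
    ("site_1", "parent_A"): "GCAGTTACGGATCCATTGACGTACCTGAAT",
    ("site_1", "parent_B"): "GCAGTTACGGATCCATTGACGTACTTGAAT",
}

print(call_allele("GCAGTTACGGATCCATTGACGTACTTGAAT", REFERENCE))
```

A practical pipeline would likely tolerate sequencing errors (for example, by allowing a mismatch outside the polymorphic position); exact matching is used here only to keep the idea visible.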
  • at least the first adapter comprises an indexing sequence that can be correlated to the first individual.
  • a reagent for selecting informative DNA templates, comprising a solid matrix (e.g., a plurality of beads) and a plurality of different oligonucleotides attached to the solid matrix.
  • each of the oligonucleotides may be in the range of about 17-60 nucleotides in length and is complementary to a unique sequence present in a respective informative DNA template.
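As an illustration of the oligonucleotide constraint above, a capture oligonucleotide can be derived as the reverse complement of the unique target sequence, with the 17-60 nucleotide range enforced. This is a sketch under stated assumptions: the function name and the example target are hypothetical, and real designs would also consider melting temperature and secondary structure.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def capture_oligo(target, min_len=17, max_len=60):
    """Return the reverse complement of `target`, written 5' to 3'.

    Raises ValueError if the target falls outside the allowed length range.
    """
    if not min_len <= len(target) <= max_len:
        raise ValueError(f"target length {len(target)} outside {min_len}-{max_len} nt")
    return target.upper().translate(_COMPLEMENT)[::-1]


# Hypothetical 18 nt unique sequence from an informative template.
print(capture_oligo("ACGTTGCAACGTTGCAAC"))  # GTTGCAACGTTGCAACGT
```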
  • each such informative DNA template comprises at least one polymorphism located within the read length of a selected high throughput sequencing process, wherein the location is measured from either terminus of the informative DNA template.
  • a hybrid DNA complex which comprises a reagent for selection of informative DNA templates as described herein and a plurality of adapter-modified informative DNA templates hybridized to the matrix-attached oligonucleotides.
  • each such informative DNA template may comprise at least one polymorphism located within the read length of a selected high throughput sequencing process, wherein the location is measured from either terminus of the informative DNA template.
  • the adapter-modified informative DNA templates comprise at least a first adapter ligated to the informative templates.
  • the informative DNA templates are obtained from a single individual, or are derived from a plurality of individuals, in which case the first adapter comprises a unique indexing sequence ligated to each of the informative DNA templates for matching it to the individual from which it was derived.
  • the invention provides a method for marker-assisted selection. For instance, fragmented genomic DNA is obtained from a plurality of individual plants or plant cells, to provide a plurality of genomic DNA fragments comprising a plurality of polymorphic sequences at least one of which is linked to a trait of interest. Fragmented genomic DNA is then ligated to at least a first adapter, to provide a plurality of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein each of said informative DNA templates comprises a polymorphic sequence and wherein said first adapter comprises an index sequence that can be correlated to genomic DNA of an individual plant or plant cell.
  • Adapter-modified informative DNA templates are selected by either hybridization-based selection or targeted PCR amplification of the adapter-modified informative DNA templates, to obtain an enriched mixture of adapter-modified informative DNA templates that can be sequenced. An individual plant or plant cell is then selected based on the presence, in its sequence, of at least one polymorphism linked to a trait of interest. For example, selecting an individual plant or plant cell may comprise selecting a plant cell for regeneration of a plant, and/or a plant comprising a trait of interest can be selected for commercial production or breeding.
  • a plant or plant cell for selection is a wheat, maize, rye, rice, oat, barley, turfgrass, sorghum, millet, sugarcane, tobacco, tomato, potato, soybean, cotton, canola, sunflower or alfalfa plant or plant cell.
  • the invention provides a method for marker-assisted selection of a genomic region or gene that regulates expression of a trait such as a trait of agronomic interest in a plant (e.g., a drought tolerance, enhanced yield, cold tolerance, pest resistance, insect resistance, salt tolerance or herbicide tolerance trait).
  • FIG. 1 is a schematic flow diagram of a method of generating DNA polymorphism-enriched DNA templates for sequencing, in accordance with certain embodiments of the invention.
  • FIG. 2 is a box flow diagram summarizing template preparation, and illustrating the structure of representative adapter- and primer-modified DNA templates and sequencing complexes, in accordance with certain embodiments of the invention.
  • FIG. 3 is a box flow diagram of a method of generating enriched informative DNA templates, commencing with restriction enzyme digestion of genomic DNA and/or shearing, in accordance with certain embodiments of the invention.
  • FIG. 4 is a box flow diagram of the steps following preparation and enrichment of informative templates that involve optional amplification, then sequencing and analysis of DNA sequences to identify polymorphisms and alleles.
  • FIG. 5 is a box flow diagram of a method of FIG. 1, including PCR amplification of informative DNA fragments, according to certain embodiments of the invention.
  • FIG. 6 illustrates an example of a specific 30 bp restriction site localized sequence from a sorghum genotype containing a unique polymorphism.
  • FIG. 7 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 8 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 9 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 10 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 11 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 12 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 13 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 14 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • FIG. 15 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
  • Restriction site localized (RSL)-sequences derived from nearly any genome will be a mixture of "unique" highly informative DNA sequences that occur once per genome, along with less informative DNA sequences that occur two or more times in the target genome.
  • Many existing bulk methods for isolating RSL-templates recover the entire collection of RSL-templates generated by a specific targeting restriction enzyme. However, not all of those recovered RSL-templates can be used to generate useful genotyping information. First, the sequences adjacent to many of these restriction sites will be repetitive and therefore not useful for genotyping.
  • Second, some RSL-templates will be too large or too small, or will contain DNA sequences with a propensity to form secondary structures that make them poor templates for bridge amplification, or will be unsuitable for some other reason. Such size considerations are discussed in more detail below.
  • the only informative sequences, and therefore informative RSL-templates, are those that correspond to DNA sequences spanning polymorphisms that distinguish parental genotypes. For example, if the parents used to create a population for genetic mapping purposes contain one sequence polymorphism per 1,000 bp, then only 1 specific template out of 40 RSL-templates sequenced at random will contain a polymorphic sequence, assuming 25 bp of sequence is obtained per template. In this instance, only the single DNA template containing the polymorphism is considered "informative," while the 39 other DNA templates in the mixture would be "uninformative," for the purposes of genetic mapping in the target population.
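The 1-in-40 figure follows directly from the read length and the polymorphism density. A short Python sketch of that arithmetic is given here, assuming polymorphisms are evenly spaced and that a read spans at most one of them; the function name is illustrative only.

```python
from fractions import Fraction


def informative_fraction(read_length_bp, bp_per_polymorphism):
    """Expected fraction of randomly sequenced templates whose read spans a polymorphism."""
    return Fraction(read_length_bp, bp_per_polymorphism)


frac = informative_fraction(read_length_bp=25, bp_per_polymorphism=1000)
print(frac)                                # 1/40
print(frac.denominator - frac.numerator)   # 39 uninformative templates per informative one
```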
  • direct hybridization e.g., to oligonucleotide-beads or microarrays
  • RSL-templates that contain polymorphisms (e.g., SNP/INDEL) for sequence-based genotyping.
  • a primer-targeted PCR-based enrichment is described that allows smaller numbers of polymorphism-containing RSL-templates to be amplified for sequence-based genotyping.
  • Embodiments of the present methods that provide targeted isolation of RSL-templates for sequencing, which contain informative polymorphic sequences, will potentially increase efficiency by approximately 100-fold or more (by removing repeats and monomorphic templates). This will potentially reduce the cost of genotyping considerably and expand the number and complexity of applications for this technology.
  • Some applications require the collection of information from a large number of polymorphic sites per genome (e.g., haplotype-based analysis, which typically requires analysis of about 5,000 to 20,000 or more polymorphic sites per genome depending on linkage disequilibrium), whereas other applications, such as DNA fingerprinting, early phase genetic mapping or pedigree analysis, require analysis of a relatively small number of polymorphic sites (approximately 500-2,000) per genome.
  • Fine mapping genetic studies require higher density analysis (a large number of polymorphic sites analyzed per Mbp), but only in one or a few regions of the genome.
  • Various embodiments of the methods described herein for sequence-based genotyping substantially enhance the output of these forms of genomic analysis.
  • Embodiments of the methods disclosed herein offer a flexible method to generate and perform a targeted analysis of subsets of RSL-templates generated by any specific set of restriction enzymes, allowing variation in the number analyzed per genome and their distribution across the genome (or based on their utility in larger population studies).
  • FIG. 1 is a box flow diagram of a method of generating a mixture or pool of informative DNA templates, and then sequencing those templates to identify DNA polymorphisms by comparison to a reference sequence, to generate genotypes and haplotypes.
  • the method generally comprises the following stages or a subset thereof:
  • Stage I Fragmentation of the genomic DNA of an individual to generate templates of a predetermined sequence length.
  • the fragmentation stage is designed to produce DNA fragments having a specified complexity, sequence length and information content.
  • Template size is dependent on two factors: (1) DNA fragment length suitable for processing by the high throughput sequencer used to collect sequence information (e.g., an optimal size of about 50 to 250 bp, for uniform bridge-amplification on the ILLUMINA® SGAII™), and (2) DNA template length sufficient to utilize the sequence read length of the high throughput sequencer (e.g., optimally, at least 35-50 bp for the SGAII™ and at least 200-400 bp for the Roche 454 GS-FLX™ Genome Analysis System).
  • this predetermined sequence length of the DNA fragments is less than that of a protein-coding region of a gene, and in most instances is also smaller than an average exon length.
  • the predetermined template size may vary somewhat based on whether the fragments are generated by restriction enzymes or shearing, or by a combination of those techniques. For example, for restriction fragments, the distance of the DNA polymorphism away from the restriction enzyme cleavage site (where adapter A is added) is the primary determinant of the minimum fragment sequence length (as can be appreciated in the schematic illustration of Stage IIIa in FIG. 1).
  • the "sequencer read length" of a given sequencing system refers to the number of bases of sequence information that can be acquired from each template sequenced.
  • the minimum sequence length of an informative DNA template is that which spans a DNA polymorphism and that can be uniquely located in a genome.
  • a sequence as small as about 17 bp is sufficient to span a polymorphism and is typically large enough to often be a unique sequence in a genome, to permit identification of its location on a genetic map or within a genome sequence.
  • Somewhat larger sequences (e.g., 25-35 bp) are employed in some embodiments, in that a relatively high percent of the sequences in this size range can be uniquely located in most genomes.
  • a high-throughput sequencer such as the ILLUMINA® SGAII™ or the ABI SOLID™ sequencer is used (e.g., in Stage V)
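Whether a sequence of a given length can be uniquely located depends on how often that sequence recurs in the genome. The sketch below, a simplification that counts forward-strand k-mers only, estimates the fraction of k-mer positions in a sequence that occur exactly once. The toy repeat-rich sequence is hypothetical and far shorter than any real genome.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of k-mer positions in `genome` whose k-mer occurs exactly once."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    unique = sum(1 for n in counts.values() if n == 1)
    return unique / total if total else 0.0


toy = "ACGTACGTAC"  # hypothetical sequence built from a short tandem repeat
print(unique_kmer_fraction(toy, 2))  # 0.0: every 2-mer recurs
print(unique_kmer_fraction(toy, 8))  # 1.0: every 8-mer is distinct
```

Longer k-mers are more likely to be unique, which is consistent with the preference stated above for sequences somewhat longer than the 17 bp minimum.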
  • the sequence read length is approximately 35-50bp.
  • the targeted DNA polymorphisms should be within about 35-50 bp of a restriction site where the adapter containing sequences used to initiate sequencing is located.
  • a high-throughput sequencer such as the Roche 454 sequencer is used as the sequencing system (e.g., in Stage V)
  • DNA polymorphisms within about 100-400 bp of restriction sites can be sequenced. If a DNA fragment is not within the read-length adjacent to the cleavage end obtained with one restriction enzyme, a second restriction enzyme with a different recognition sequence can be used in addition to the first enzyme, to generate template. Alternatively, genomic DNA can be sheared to generate template with a random set of termini.
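The read-length constraint can be checked in silico: a polymorphism is reachable only if it lies within one read length of a restriction site. The Python sketch below makes simplifying assumptions that are not in the source: the recognition site position stands in for the cut position, reads extend outward from either side of the site, and the sequence, SNP coordinates, and function names are all hypothetical.

```python
import re


def site_positions(sequence, recognition):
    """Start positions of every (possibly overlapping) occurrence of the recognition site."""
    pattern = f"(?={re.escape(recognition)})"
    return [m.start() for m in re.finditer(pattern, sequence)]


def reachable_polymorphisms(sequence, recognition, snp_positions, read_length):
    """SNP positions lying within `read_length` bases of either side of a site."""
    sites = site_positions(sequence, recognition)
    reachable = []
    for snp in snp_positions:
        for start in sites:
            end = start + len(recognition)
            downstream = end <= snp < end + read_length
            upstream = start - read_length <= snp < start
            if downstream or upstream:
                reachable.append(snp)
                break
    return reachable


# Hypothetical 116 bp sequence with one GAATTC site at position 10.
seq = "A" * 10 + "GAATTC" + "A" * 100
print(reachable_polymorphisms(seq, "GAATTC", [20, 60, 5], read_length=35))   # [20, 5]
print(reachable_polymorphisms(seq, "GAATTC", [20, 60, 5], read_length=400))  # [20, 60, 5]
```

With the shorter read length, the SNP at position 60 is out of range; this is the situation in which a second enzyme or shearing would be considered.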
  • Adapter A that is later used to initiate sequencing reactions in Stage V, is ligated to DNA termini generated by shearing, and the informative templates in the resulting mixture of sheared fragments are then enriched as described below (hybridization-based or PCR-based methods).
  • This approach to template preparation allows any region of any genome to be targeted for genotyping by direct sequence analysis.
  • the predetermined sequence length is in the range of about 35-50 bp (e.g., for the ILLUMINA® SGAII™ and ABI SOLID™ sequencers) or about 400 bp (e.g., for the Roche 454 sequencer, which has a comparatively longer sequence read length).
  • the informative DNA templates are those that can be used to generate sequences unique in the target genome, that span DNA polymorphisms, and that have adapters ligated sufficiently close to the DNA polymorphism so that sequences spanning the target polymorphic sequence are acquired.
  • templates generated specifically for use with the SGAII™ sequencer may be 50-200 bp in length, even though the SGAII™ only sequences about 35 bp from one end of the template.
  • templates generated specifically for use with the Roche 454 sequencer may be about 400-800 bp in length, even though only 200-400 bp of sequence are acquired from one end of the template.
  • larger templates may be generated for sequencing from both ends of the template (using the SGAII™ sequencer, for instance). It should be understood that the templates and hybridization complexes illustrated in FIG.
  • the genomic DNA is fragmented in Stage I to a desired range of sequence lengths and complexity by digesting the DNA with one or more restriction enzymes (as schematically shown in Step 1 of FIG. 3).
  • Complexity in this instance refers to the number of termini generated by digestion of a genome with a restriction enzyme and that correspond to informative sequences. This is accomplished, for example, with restriction enzymes having different recognition sequences (e.g., 4 bp, 6 bp, 7 bp, 8 bp) to vary the number, complexity, repetitiveness, and GC content of the DNA templates generated.
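The effect of recognition sequence length on the number of termini can be approximated with a simple model. Assuming a random sequence with equal base frequencies (an idealization; real genomes deviate because of GC content, repeats, and methylation), an n bp recognition sequence occurs on average once every 4^n bp. The genome size below is a hypothetical round number.

```python
def average_site_spacing(recognition_length_bp):
    """Average distance between sites for an n-bp recognition sequence (uniform-base model)."""
    return 4 ** recognition_length_bp


def expected_sites(genome_size_bp, recognition_length_bp):
    """Expected number of recognition sites in a genome of the given size."""
    return genome_size_bp / average_site_spacing(recognition_length_bp)


GENOME_SIZE = 1_000_000_000  # hypothetical 1 Gbp genome
for n in (4, 6, 7, 8):
    # Spacing is 256, 4096, 16384 and 65536 bp respectively.
    print(n, average_site_spacing(n), round(expected_sites(GENOME_SIZE, n)))
```

Each site yields two termini, so choosing a longer recognition sequence lowers the number of templates generated and, with it, the complexity of the pool.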
  • the DNA is digested by one or more restriction enzymes that are sensitive to DNA methylation to prepare DNA templates enriched in templates derived from the gene rich regions of a genome. Enrichment for unique gene rich templates occurs because, in many organisms, repeat sequences are differentially methylated and therefore no template will be generated from these sites.
  • DNA templates are provided for assaying differences in genome DNA methylation.
  • the DNA from two or more individuals is digested separately using each of two enzymes that recognize and digest the same DNA sequence.
  • One of the enzymes is sensitive to DNA methylation (i.e., will not cut these sites), and the other enzyme is not sensitive to DNA methylation (i.e., will cut all recognition sites, regardless of whether it is methylated or not).
  • the individual DNAs are analyzed and tracked using different indexing sequences that can be used for identification. Differences in the complement of sequences derived from the two restriction enzymes correspond to differences in sites of methylation in the genome.
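The comparison described above reduces to a set difference: templates recovered with the methylation-insensitive enzyme but missing from the methylation-sensitive digest point to methylated sites. A minimal Python sketch follows, in which the template identifiers and the function name are hypothetical and sequencing depth is assumed sufficient to observe every template that exists.

```python
def methylated_site_templates(insensitive_templates, sensitive_templates):
    """Templates produced only by the methylation-insensitive enzyme."""
    return set(insensitive_templates) - set(sensitive_templates)


# Hypothetical template identifiers for one individual.
insensitive = {"tmpl_01", "tmpl_02", "tmpl_03", "tmpl_04"}
sensitive = {"tmpl_01", "tmpl_03"}

print(sorted(methylated_site_templates(insensitive, sensitive)))  # ['tmpl_02', 'tmpl_04']
```

Running the same comparison per individual (tracked by index sequence) and then comparing the resulting sets would highlight sites whose methylation differs between individuals.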
  • a subset of the polymorphisms that distinguish accessions or parental genotypes reside in the recognition sites of the restriction enzymes used for template preparation. Polymorphisms in a genotype that change the recognition sequence of the restriction enzyme used for template generation will prevent DNA digestion and as a consequence, no templates will be generated adjacent to that site. DNA polymorphisms of this type are identified by the presence of the two template sequences derived adjacent to a digestion site in one genotype, while the same two template sequences are absent in genotypes that contain polymorphisms in that specific restriction enzyme site.
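This presence/absence logic can be written down explicitly. The sketch below classifies a restriction site in each genotype according to whether both flanking templates were observed. The genotype names, template identifiers, and the "ambiguous" category (one template seen, one missing, for example from low coverage) are assumptions added for illustration.

```python
def site_status(templates_by_genotype, flanking_pair):
    """Classify a restriction site per genotype from its two flanking templates."""
    left, right = flanking_pair
    status = {}
    for genotype, templates in templates_by_genotype.items():
        seen = (left in templates, right in templates)
        if all(seen):
            status[genotype] = "site intact"
        elif not any(seen):
            status[genotype] = "site disrupted"
        else:
            status[genotype] = "ambiguous"
    return status


# Hypothetical observations for two parental genotypes.
observed = {
    "genotype_1": {"left_tmpl", "right_tmpl", "other_tmpl"},
    "genotype_2": {"other_tmpl"},
}
print(site_status(observed, ("left_tmpl", "right_tmpl")))
```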
  • the selected restriction enzymes will have the characteristic of cleaving organelle or bacterial DNA infrequently or not at all. This feature is potentially helpful during the initial phase of discovering the unique polymorphic templates useful for genotyping applications.
  • genotyping assays involving enrichment of informative templates by hybridization in Stage IIIa
  • PCR in Stage IIIb of FIG. 1
  • Stage II Adapter ligation.
  • the restriction fragments (RFs) obtained in Stage I are of the desired sequence length(s) appropriate for serving as templates in the selected genotyping by sequencing system. Accordingly, a first adapter ("adapter A") is ligated to the fragments. In most cases, a second adapter ("adapter B") is ligated to the opposite end of each fragment (PCR-based enrichment does not require this).
  • Adapters A and B are different, unique double stranded oligonucleotide sequences about 17 bp or longer in length. There is a practical upper limit of the sequence length of about 60-90 bp, based on present costs; however, even longer lengths may be used in some instances.
  • the adapters need to be long enough to encode a sequence complementary to the primer used to initiate sequencing, the indexing sequence, and in some cases, sequences allowing binding to complementary termini generated by digestion of the target genome.
  • the relationship of the adapters, DNA fragment, index sequences and restriction termini, in some embodiments, is illustrated in FIG. 2.
  • One or both adapters may contain an indexing sequence of 2-6 base pairs.
  • the indexing sequence may be located in the adapter immediately adjacent to the nucleotide sequence (priming site) that will be used as the binding site for the sequencing primer that is used to initiate sequencing (in Stage V).
  • the indexing sequence serves as an identification tag that allows the informative DNA templates to be sorted according to their respective sources following sequence analysis (Stage VI).
  • Stage VI sequence analysis
  • adapter A is ligated to the restriction site end of the DNA fragment.
  • the "indexing sequence" comprised in adapter A is denoted "XXX".
  • the "restriction site overhang" comprised in adapter A is denoted "YYY."
  • the remaining portion of adapter A corresponds to a DNA sequencing primer ("primer A"), which is described below.
  • adapter B is ligated to the blunt end of the DNA fragment, after shearing and end repair.
  • a primer sequence (e.g., ILLUMINA® primer A), ligated to the 5' end of adapter A, is an italicized sequence in boxes 3 and 4 of FIG. 2 (SEQ ID NOs: 5-6).
  • This primer B sequence may be added to the adapter-modified template through the process of PCR.
  • the sequence of primer A (e.g., ILLUMINA® primer B), is complementary to a "primer binding site" sequence within adapter A, exclusive of the indexing sequence and restriction overhang.
  • Primers A and B are useful for bridge amplification during cluster generation in certain sequencing systems, such as the ILLUMINA® SGAII™ sequencer.
  • the DNA sequencing primer as shown in box 4 of FIG.
  • adapter sequences are typically provided by the manufacturer of the selected sequencing platform for sequencing the informative DNA templates obtained in Stage IV.
  • the adapter sequences may also comprise specific sequences added to the ends to facilitate a primary amplification step and selection on magnetic beads, for example. It should be understood that any other suitable adapter and primer sequences capable of functioning in a similar manner may be used instead of the examples that are shown in FIG. 2.
  • an indexing sequence may be included at the end(s) of every DNA fragment derived from the individual's genome.
  • Including a unique indexing sequence in each adapter-ligated set of DNA templates allows the user to pool the resulting index-modified adapter-ligated DNA templates derived from a large number of different individuals following ligation of an index-modified adapter, as discussed in more detail below in the section titled "Multiplexing DNA Templates for Sequencing." These samples can then be processed in bulk, sequenced, and the resulting sequences assigned to the individuals from which they were derived.
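Assigning pooled reads back to individuals is a prefix lookup on the indexing sequence. The Python sketch below assumes that every read begins with the index (because the index sits immediately next to the sequencing primer binding site), that all indexes have the same length, and that no sequencing errors occur in the index. The 3-base indexes, individual names, and reads are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, index_to_individual):
    """Group reads by leading index sequence, stripping the index from each read."""
    index_length = len(next(iter(index_to_individual)))
    by_individual = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        individual = index_to_individual.get(index, "unassigned")
        by_individual[individual].append(insert)
    return dict(by_individual)


INDEXES = {"ACG": "plant_1", "TGA": "plant_2"}  # hypothetical index assignments
pooled = ["ACGAATTCGGA", "TGAAATTCCCT", "ACGAATTCTTT", "GGGAATTCAAA"]
print(demultiplex(pooled, INDEXES))
```

Reads whose prefix matches no known index are kept under "unassigned" rather than discarded, so that index errors remain visible during analysis.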
  • the fragmenting of genomic DNA is accomplished by digesting the DNA molecules with a first restriction enzyme treatment, followed by ligation of a first adapter ("adapter A") to the ends of the resulting fragments.
  • these fragments are further digested by a second restriction enzyme treatment, to generate a pool of adapter-linked DNA templates with a predetermined range of shorter sequence lengths, as illustrated schematically in Steps 1-4 of FIG. 3.
  • adapter A is a nucleotide sequence that includes an indexing sequence and an adjacent nucleotide sequence (i.e., primer binding site) that can be used in a Stage IIIb PCR amplification step (shown in FIGs. 1 and 4) and/or to initiate sequencing in Stage V (Step 9, as shown in FIG. 4).
  • Adapter B is a different nucleotide sequence, with or without the indexing sequence, that includes a primer binding sequence that can be used to amplify templates containing adapters A and B (optional PCR amplification step indicated by dashed box in FIG.
  • Adapters A and B are unique and non-complementary to each other.
  • the adapter sequences may contain sequences that are complementary to oligonucleotides that are covalently attached to solid surfaces, to facilitate bridge amplification.
  • the genome is fragmented to a predetermined sequence length by a combination of digestion of DNA with restriction enzymes, and subsequent shearing of the restriction fragments.
  • the shearing step may be carried out by sonication or high pressure hydrodynamic shearing.
  • fragmenting of the genome DNA is accomplished by digesting the DNA molecules with a first restriction enzyme treatment, followed by ligation of a first adapter ("adapter A") to the ends of the resulting fragments, and then shearing the resulting restriction fragments to obtain a mixture or pool of informative and non-informative DNA templates.
  • adapter B is ligated to the blunt ends of the shear fragments, opposite adapter A.
  • the genomic DNA is fragmented with a single restriction enzyme that generates templates of an optimal size for a high throughput sequencer.
  • CspCI recognizes a specific 7bp sequence and digests DNA flanking that sequence at a distance such that DNA fragments of approximately 32bp are generated.
  • adapters A and B would be ligated to the DNA fragments at the same time, to generate DNA fragments containing adapter A or adapter B ligated to both termini of a given fragment, and DNA fragments that have adapter A ligated at one end and adapter B ligated to the other end of the fragment.
  • PCR amplification of this mixture of adapter ligated fragments with primers complementary to sequences in adapter A and adapter B will preferentially amplify fragments containing both adapters due to suppression PCR.
  • the optional step of limited amplification by suppression PCR is shown in the flow diagram of FIG. 3.
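As a rough illustration of why an enrichment step is useful after simultaneous ligation of adapters A and B, the sketch below enumerates the possible end combinations. It assumes, purely for illustration, that each fragment end independently receives adapter A with probability `p_a`; that assumption and the function name are not part of this disclosure.

```python
from collections import Counter
from itertools import product

def adapter_end_fractions(p_a=0.5):
    """Expected fraction of fragments carrying each pair of adapters (AA, AB, BB)."""
    probs = {"A": p_a, "B": 1.0 - p_a}
    fractions = Counter()
    for left, right in product("AB", repeat=2):
        # "AB" and "BA" describe the same kind of fragment, so sort the pair.
        fractions["".join(sorted(left + right))] += probs[left] * probs[right]
    return dict(fractions)

fractions = adapter_end_fractions()
```

Under this simple model only about half of the ligated fragments carry both adapters, which is the population that suppression PCR is described as preferentially amplifying.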
  • Shearing: genomic DNA is fragmented to a predetermined sequence length by shearing instead of, or in addition to, digestion with restriction enzyme(s), as shown in Step 1 of FIG. 3.
  • shearing treatments may be adjusted to generate DNA fragment lengths within a predetermined range useful for downstream analysis on different DNA sequencers.
  • Adapters A and B are ligated to the resulting DNA fragments (Step 2' of FIG. 3).
  • Fragments containing Adapter A ligated to one end of the fragment and Adapter B ligated to the other end of the same fragment can be enriched relative to fragments having the same adapter ligated to both ends of the fragment, by suppression PCR amplification of the DNA templates using primers complementary to adapters A and B (as indicated in FIG. 3).
  • size selection, i.e., enrichment of adapter-modified DNA fragments of specified sequence length, is performed before or after the limited amplification step.
  • shearing of the genomic DNA is less desirable than restriction enzyme digestion because (1) shearing produces templates with a distribution of different termini across the informative regions of the genome targeted for sequence analysis, (2) templates with high complexity are generated, (3) it is more labor intensive and less reproducible to shear and process a large number of different DNA samples compared to restriction digestion which can be done in 96-well or 384-well format, and (4) sheared DNA requires additional steps in template preparation compared to simple ligation of adapters to termini generated by restriction enzymes. Nevertheless, in some cases, a user may wish to employ shearing alone or in combination with a restriction enzyme digestion step, as outlined in FIG. 3, for ease of use or for specific applications.
  • the fragmentation protocol includes an initial shearing step followed by further reduction of fragment lengths by digestion with one or more restriction enzymes, as shown in Steps 1-4 of FIG. 3.
  • adapter A is biotinylated.
  • DNA templates are amplified using biotinylated primers that are complementary to sequences in adapters.
  • the biotinylated adapters are ligated to termini generated by a restriction enzyme, allowing purification of DNA templates on streptavidin beads.
  • the sstDNA may be further processed as described below. Purification using biotinylated adapters potentially eliminates or reduces the content of DNA fragments that are not linked to adapters.
  • template preparation also includes enriching the mixture of adapter-linked DNA templates for a specific range of sequence lengths. This may be done, for example, by electrophoresis of the adapter-linked DNA template on an agarose gel followed by extraction of the DNA templates of a predetermined size range, using standard techniques.
  • Selecting only the DNA templates of a predetermined range of sequence lengths potentially increases the efficiency of the overall method by reducing or eliminating DNA templates that are larger or smaller than a predetermined sequence length (e.g., longer than the sequence read length of a selected nucleotide sequencing system; e.g., 30-50 bp). Size selection may be carried out after adapter ligation, or after PCR of adapter-ligated templates as shown in FIG. 3, or after enrichment of informative templates depending on the application.
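The effect of size selection can be sketched computationally as a simple length filter. This is only an in silico analogy to gel-based selection; the function name and the example bounds are hypothetical.

```python
def size_select(templates, min_len, max_len):
    """Keep templates whose length falls within [min_len, max_len] and report the retained fraction."""
    kept = [t for t in templates if min_len <= len(t) <= max_len]
    fraction = len(kept) / len(templates) if templates else 0.0
    return kept, fraction

# Hypothetical pool of template sequences of varying length.
pool = ["A" * 20, "C" * 35, "G" * 48, "T" * 120]
kept, fraction = size_select(pool, 30, 50)
```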
  • an enrichment step may be utilized following ligation of adapters (Step II) and hybridization-based template selection (Step IIIa).
  • the adapter-linked DNA templates are amplified prior to performing Step 5, as indicated by the dashed box describing limited amplification and enrichment that involves suppression PCR.
  • the adapter A- and adapter B-linked DNA fragments from Steps 2' or 2-4 of FIG. 3 may be subjected to PCR amplification using primers that are complementary to the non-indexing portion of adapters A and B. This step increases the relative abundance of templates containing adapters A and B relative to templates containing only adapter A or adapter B due to suppression PCR.
  • Optional PCR amplification steps are indicated in FIG. 1 by dashed boxes labeled "PCR."
  • undesirable secondary sources of DNA such as chloroplast, mitochondrial, and bacterial DNA may be present at relatively high copy number in a sample of nuclear DNA targeted for analysis. In such cases, it may be desirable to enhance the efficiency of template preparation and sequencing of the individual's DNA.
  • background "noise" arising from multiple copies of extraneous DNA sequences may be selectively reduced after fragmentation.
  • the representation of those sequences during preparation of the individual's informative DNA templates may be reduced by including certain non-amplifiable primers in a PCR template preparation step (see FIG. 1) to suppress amplification of these amplicons. These non-amplifiable primers may extend into the adapter sequence and are complementary to the undesirable repeat sequences located adjacent to the restriction digestion site.
  • a set of oligonucleotides is developed corresponding to the set of informative DNA templates in the target genome under investigation in a particular application.
  • the complementary oligonucleotides used for enrichment may correspond to any unique genomic sequence in an informative DNA template. Therefore, in most instances, the adapter sequences and repetitive sequences are avoided, and, in some cases, polymorphic sequences per se are also avoided, thereby minimizing variable hybridization due to sequence mismatches.
  • the oligonucleotides are attached to the desired solid matrix (e.g., magnetic beads, planar or curved surfaces, tubes).
  • the chemistry and method of preparing the oligonucleotide-beads is carried out in accordance with the instructions of an oligonucleotide-bead manufacturer known to those in the art, or in accordance with other techniques for attaching oligonucleotides to solid substrates that are known and described in the literature.
  • the resulting set of complementary oligonucleotides (or substrate-bound oligonucleotides) may then be assembled in any of a variety of combinations depending on the parental genotypes being assayed or the region of the genome being analyzed. Thus, in many cases, the technical burden and costs of primer design may be distributed across multiple experiments.
  • the DNA templates are then enriched for informative DNA templates (i.e., the templates corresponding to a site that contains a polymorphism), as illustrated in Step IIIa of FIG. 1, and Step 5 of FIG. 3.
  • enrichment of the informative DNA templates includes hybridizing only the informative DNA templates to oligonucleotides having complementary sequences.
  • the mixed DNA templates are denatured to form single stranded DNAs, which are then hybridized to oligonucleotides that could range from about 17 to about 60 nucleotides in length.
  • the oligonucleotides contain sequences complementary to genomic DNA sequences present in the informative DNA templates and are designed to enhance a uniform specificity of binding under a standard hybridization condition.
  • the complementary oligonucleotides are attached to magnetic beads to facilitate the separation of hybridized informative DNA template from non-hybridized non-informative templates.
  • oligonucleotides may be attached to any other suitable solid surface or matrix.
  • the oligonucleotide-substrate complexes must be different than the oligonucleotide-substrate complexes that are used for bridge amplification or are used in the DNA sequencing procedure.
  • the hybridization-based selection procedure includes purifying the informative single stranded DNA templates by hybridization to a collection of oligonucleotides that are covalently attached to a solid matrix or surface (e.g., magnetic beads).
  • the mixture of adapter-linked DNA templates (from Stage II), optionally amplified and enriched for adapter-linked DNA templates of a specified size, is mixed with the oligonucleotide-beads under hybridization promoting conditions.
  • the hybridization-based enrichment procedure is improved by using oligonucleotides that hybridize to their complements at similar melting temperatures (Tm). However, small differences in DNA template selection efficiency will usually not affect the results significantly, because the purpose of this step is to enrich a subset of the mixture or pool of informative templates, not to discriminate between polymorphic sequences.
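One way to screen candidate capture oligonucleotides for similar melting temperatures is sketched below. It uses a common GC-content approximation for oligonucleotides longer than about 13 nucleotides, Tm = 64.9 + 41 x (G + C - 16.4) / N, which ignores salt concentration and nearest-neighbor effects. The formula choice, function names, and tolerance are assumptions for illustration and are not specified in this disclosure.

```python
def basic_tm(oligo):
    """Approximate Tm (degrees C) from GC content; a rough estimate only."""
    s = oligo.upper()
    gc = s.count("G") + s.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(s)

def select_by_tm(oligos, target_tm, tolerance=3.0):
    """Return the oligos whose estimated Tm lies within `tolerance` of `target_tm`."""
    return [o for o in oligos if abs(basic_tm(o) - target_tm) <= tolerance]

# Hypothetical candidate oligonucleotides (20 nt each).
candidates = ["ATGCATGCATGCATGCATGC", "A" * 20, "G" * 20]
matched = select_by_tm(candidates, target_tm=52.0)
```

A production design would more likely use a nearest-neighbor thermodynamic model under the actual hybridization buffer conditions.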
  • a low level of off target template selection will only introduce a low level of off target sequencing rather than creating a source of error.
  • a polymorphic marker is assayed within a given interval of the genetic map; however, any marker from the given region will provide similar information, and given a sufficient density of markers, missing data will not affect results.
  • the polymorphic site on the informative DNA template will be outside of the oligonucleotide:DNA template annealed region of the hybridized complex. In many embodiments this will be the preferred situation in order to avoid variation in selection of templates with perfect vs.
  • the oligos used for selection will overlap a sequence containing a polymorphism. For example, this may be the case when approximately 35bp templates are generated with the restriction enzyme CspCI, as further described below.
  • the resulting hybridized complexes are isolated by centrifugation, magnetic bead capture, filtering, or other suitable technique.
  • the hybridized complexes are washed with aqueous media containing buffer (pH 7-8) and salts, at a temperature and under conditions that will not disrupt interaction between oligonucleotides and templates, to remove the non-hybridized DNA templates (Stage III of FIG. 1 and Steps 5-6 of FIG. 3).
  • the exact hybridization and wash conditions and time will vary to some extent depending on the length and GC content of the oligonucleotides used for the selection step.
  • biotinylated oligonucleotides may be used to carry out selective enrichment of informative templates by hybridization (FIG. 1, Step IIIa).
  • the oligonucleotides complementary to informative templates are modified to contain biotin, allowing biotin/streptavidin bead-based capture to facilitate separating the hybridized informative DNA template from non-hybridized non-informative templates. More specifically, in some embodiments the oligonucleotide is modified, allowing oligonucleotide:informative template hybrids to be separated from non-hybridized template.
  • oligonucleotide:template hybrids may be purified by binding to a biotin:streptavidin bead or similarly modified surface.
  • In this procedure, single-stranded adapter A- and B-modified informative DNA templates are hybridized to complementary biotinylated oligonucleotides, followed by binding of the resulting complex to streptavidin beads, washing of beads to remove non-target template and other materials, and release of single strand template DNA (sstDNA) from the streptavidin beads for downstream sequencing (FIG. 1, Step V).
  • Stage IIIb Enrichment of Informative DNA Templates by Targeted PCR.
  • the enrichment of informative DNA templates is accomplished by targeted PCR amplification using a primer complementary to adapter A, and a primer complementary to a unique genomic sequence within each informative template targeted for analysis, designed so that the resulting amplified template is of the desired sequence length (compatible or optimal for the selected sequencing system) and the resulting amplified template spans informative sequences (i.e., DNA polymorphisms).
  • Embodiments of this alternative approach to selection of informative DNA templates are especially useful when the number of polymorphic sequences and informative templates targeted for analysis is less than about 2,000.
  • the PCR-based enrichment route is shown along the lower portion of FIG. 1 (Stages I, IIb, IIIb, IVb, and V).
  • Stage I DNA fragment generation by digestion with restriction enzymes and/or shearing is carried out in a similar manner for both PCR-based and hybridization-based enrichment pathways.
  • Step IIb involves ligation of a single adapter (A or B depending on embodiment) after which templates from different individuals may be pooled.
  • FIG. 1 shows the instance in which adapter A is ligated to one of the termini of a DNA fragment in Stage IIb.
  • adapter A may be ligated to both ends of a DNA fragment.
  • PCR amplification techniques are well known in the art and have been described in the literature. Briefly, PCR amplification includes selectively amplifying the single-stranded adapter A-modified informative templates using primers that hybridize to adapter A and a set of second informative template specific primers, each of which is complementary to a unique genomic DNA sequence flanking adapter A in an adapter A-modified informative DNA template.
  • the second primers that are unique to each informative template targeted for analysis are configured so that the amplified DNA contains a DNA sequence polymorphism. In this way, only the informative DNA templates are amplified and the non-informative templates are not, resulting in an enrichment of informative DNA templates.
  • primers that bind to adapter A and 10 or more different primers complementary to different informative templates are pooled (multiplexed) to streamline template preparation. In this way about 1,000-2,000 informative templates can be selectively amplified using a 96-well plate format for PCR, using pools of 10 or 20 different primers specific to informative templates per well.
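The plate arithmetic above (96 wells with pools of 10 or 20 primers giving roughly 1,000-2,000 templates) can be checked with a short sketch. The function names and the simple sequential assignment of primers to wells are illustrative assumptions.

```python
def templates_per_plate(wells=96, primers_per_well=10):
    """Number of informative templates targeted on one plate."""
    return wells * primers_per_well

def assign_wells(primer_ids, primers_per_well):
    """Split a list of template-specific primer IDs into consecutive per-well pools."""
    return [primer_ids[i:i + primers_per_well]
            for i in range(0, len(primer_ids), primers_per_well)]

low = templates_per_plate(96, 10)    # pools of 10 primers per well
high = templates_per_plate(96, 20)   # pools of 20 primers per well
pools = assign_wells(list(range(25)), 10)  # hypothetical 25 primers
```

In practice, pooling would also need to account for primer compatibility within each well, which this sketch does not attempt.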
  • the PCR amplification primers used to target informative templates in Stage IIIb are designed to include two primary features: (1) unique "targeting" sequences are present at the 3'-end of each primer, usually 17-30bp in length, that are complementary to respective informative templates targeted for analysis, and (2) a "universal" amplification sequence, usually 17-30bp in length, that is not present in the target genome (as shown in FIG. 4).
  • all templates may then be amplified with primers complementary to adapter A and a primer complementary to the "universal" sequence present in the mixture of informative templates.
  • the amplified templates are of a predetermined sequence length and contain informative sequences.
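The two-part primer design described above (a "universal" sequence plus a 3' "targeting" sequence) might be assembled as in the following sketch. The function names, the example sequences, and the convention that `target_site` is given as the 5'-to-3' sequence of the template strand to which the primer anneals are all assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def tailed_primer(universal, target_site, genome=""):
    """Build a primer with a 5' universal tail and a 3' targeting sequence."""
    if not 17 <= len(target_site) <= 30:
        raise ValueError("targeting sequence should be 17-30 nt long")
    universal = universal.upper()
    g = genome.upper()
    # The universal sequence should not occur in the target genome (either strand).
    if g and (universal in g or revcomp(universal) in g):
        raise ValueError("universal sequence is present in the target genome")
    return universal + revcomp(target_site)

# Hypothetical sequences, not taken from this disclosure.
primer = tailed_primer("ACGTACGTACGTACGTA", "AAAACCCCGGGGTTTTA")
```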
  • the "universal" sequence (e.g., primer B in FIG. 4) is selected to be compatible with template preparation for high throughput sequencing on the ILLUMINA® SGAII™, ABI SOLiD™, or the Roche 454.
  • one of the primers used to amplify the informative templates contains a biotin that allows purification of template prior to sequencing (Stage IVb).
  • Size selection may be performed either before or after the first PCR step, or, in some cases, later in the procedure. Size selection is desirable in many applications for increasing overall efficiency of a template generation and sequencing process. It is optional in some instances, however, such as when Stage IIIb PCR-based enrichment is employed.
  • in Stage IIIa, the DNA fragments have adapters A and B ligated to the fragments.
  • Stage IIIb PCR-based enrichment route
  • only one adapter is ligated in Stage II (either A or B, depending on the application).
  • the second priming site is incorporated during targeted amplification, as shown.
  • the primers are indicated by arrows in Stage IIIb, and the adapter B linked to the second primer is designated by an open box.
  • the PCR enrichment route (Stage IIIb) is also illustrated in FIG. 4.
  • the asterisk (*) denotes the site of a DNA polymorphism in the genome (Stage I) and DNA templates (Stages II-IV) derived from the genome.
  • the adapter-linked informative single stranded DNA templates are separated from the bead-bound complementary oligonucleotides by heat denaturation or treatment with alkali to release the selected templates (Stage IV of FIG. 1 and Step 7 of FIG. 3).
  • the enriched set of DNA templates is then ready for an optional further amplification, and/or direct sequencing and analysis.
  • the resulting single-stranded DNA templates are then sequenced (Step 9 of FIG. 4).
  • the informative DNA templates recovered in Stage IVa are further enriched by PCR or any other suitable amplification technique.
  • a PCR amplification procedure includes denaturing the hybridized DNA templates (as in Step 7) and PCR amplification of the resulting mixture of informative DNA templates using a primer that is complementary to adapter A and a second primer that is complementary to adapter B.
  • one such PCR amplification technique is the well-known solid-phase bridge amplification used to create DNA clusters for DNA sequencing on the ILLUMINA® SGAII™, in which adapter-linked single stranded DNA templates attach to a surface containing a multiplicity of regularly spaced single-stranded primers having sequences that are complementary to the primer sequences contained in adapters A and B.
  • the polymerase enzyme incorporates nucleotides to build double-stranded "bridges" between the spaced-apart primers on the solid surface. After amplification, the resulting double-stranded DNA sequences may be represented as shown in box 3 of FIG. 2.
  • the mixed DNA templates from Stage II, or the informative DNA templates from Stage IV, of one individual may be pooled with similarly prepared, but differently indexed, adapter-linked DNA templates derived from other individuals.
  • the pool of differently indexed adapter-linked informative DNA templates may then be further processed and sequenced together.
  • An indexing sequence (XXX) is shown in box 2 of FIG. 2, flanking a polymorphism-containing DNA fragment having restriction termini denoted by "YYYyy," at the 5' end of the fragment.
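After pooled sequencing, reads can be assigned back to individuals by their index. The sketch below assumes, for illustration only, that the index occupies the first bases of each read and that all indexes have the same length; the index sequences, sample names, and function name are hypothetical.

```python
def demultiplex(reads, index_to_sample):
    """Assign each read to a sample by its leading index and strip the index."""
    lengths = {len(index) for index in index_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes indexes of equal length")
    k = lengths.pop()
    bins = {sample: [] for sample in index_to_sample.values()}
    bins["unassigned"] = []
    for read in reads:
        sample = index_to_sample.get(read[:k])
        if sample is None:
            bins["unassigned"].append(read)   # unknown index: keep read intact
        else:
            bins[sample].append(read[k:])     # known index: remove it from the read
    return bins

# Hypothetical 3-base indexes and reads.
indexes = {"ACG": "plant_1", "TGA": "plant_2"}
bins = demultiplex(["ACGTTTTAAAA", "TGACCCCGGGG", "GGGAAAATTTT"], indexes)
```

Exact matching is used here; tolerating sequencing errors within the index would require indexes that differ from one another at more than one position.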
  • the recovered, and optionally amplified, adapter-linked informative DNA templates may then be directly sequenced using a high throughput DNA sequencing platform in which the optimal template sequence length for sequencing is in the range of about 50 bp to about 600 bp, in accordance with the instructions provided by the manufacturer of the chosen high throughput sequencing platform.
  • This high throughput sequencing step may also be referred to as "resequencing" of the informative DNA templates in instances and in situations in which a previously sequenced genome is the target genome under investigation.
  • a reference sequence will have been previously obtained for the same or a similar individual, and the polymorphisms in the reference sequence will have been previously identified.
  • a reference sequence may have been previously obtained as part of the process of preparing an above-mentioned set of complementary nucleotide sequences for use in the Stage III hybridization-based template selection process.
  • the informative DNA templates are sequenced using an ILLUMINA® SGAII™ sequencing system, in which the optimal template sequence length is about 50-250bp.
  • Optimal template size in this instance is related to a template size that will uniformly be amplified in the sequencer by bridge amplification.
  • the informative DNA templates are sequenced using an ABI DNA SOLiD™ System, in which the optimal template sequence length is about 50 bp.
  • the informative DNA templates are sequenced using a Roche 454 GS-FLX™ Genome Analysis System, in which the optimal template sequence length is at least 400 bp in cases where the read length of the sequencer is 400 bp.
  • a number of “sequencing by synthesis” reactions are used to elucidate the identity of a plurality of bases at target positions within the target sequence. Sequencing by synthesis techniques are well known in the art, and have been described in the literature. All of these reactions rely on the use of a target sequence comprising at least two domains, including a first domain (i.e., an adapter) to which a sequencing primer will hybridize, and an adjacent second domain, for which sequence information is desired (i.e., an informative DNA template of initially undetermined sequence).
  • enzymes are used to add dNTPs to the primer, and each addition of dNTP is "read" to determine the identity of the added dNTP. This may proceed for many cycles.
  • Sequencing primers specific to the adapters ligated to termini created by the "targeting" restriction enzyme are used to generate sequences adjacent to these sites.
  • RSL-sequencing is a flexible technology that will allow the investigator to vary the number of sites sequenced and depth of sequencing depending on application.
  • Another feature of sequence-based genotyping from RSL-templates is that sequencing always starts from the same sites in the genome, immediately adjacent to the targeting restriction enzyme recognition sequence. As a result, the location of a specific sequence polymorphism is always a specified number of bases away from the start site for DNA sequencing in each template selected for analysis. For many applications, this is a potentially valuable feature because the accuracy of sequencing decreases in a predictable way as a function of position in the sequence. Conversely, sequence accuracy is highest close to the sequencing primer. Therefore, in some cases a quality score can be assigned to sequence-based genotypes based in part on this information.
  • the depth and redundancy of RSL-sequencing may be modulated in any of several ways, which are briefly described as follows: (1) by selecting restriction enzymes that cleave the target genome with different frequencies depending on the recognition site; (2) by digesting the target genome with two or more restriction enzymes that recognize different 8-, 6-, or 4-base sequences, for example, to increase the number of different DNA segments targeted for sequencing; (3) by amplifying and pooling DNA from two or more different genotypes, each containing a specific sequence identification tag (indexing), to track the origin of the DNA sequences; and (4) by using restriction enzymes that are sensitive to the methylation state of DNA within their recognition sequences. All of these variations are further described elsewhere herein.
  • the flexible methods disclosed herein allow targeted analysis of subsets of RSL-templates generated by any specific set of restriction enzymes, by permitting variation in the number of informative DNA templates analyzed per genome and their distribution across the genome, or based on their utility in larger population studies.
  • a global analysis of the sequences of all RSL-templates generated by a particular restriction enzyme or combination of restriction enzymes from the parental genotypes of interest results in data that, in many embodiments, will identify: (1) the subset of informative RSL-templates that can be successfully sequenced at a reasonable frequency (differences in template utility are affected by size (i.e., sequence length), repetitiveness, presence of polymorphisms, and other factors); (2) the subset of RSL-templates that contain unique sequences in cases where a reference genome sequence is available; and (3) the subset of unique sequences that contain polymorphisms that distinguish parental genotypes.
  • the unique sequences may be mapped using bioinformatics.
  • the polymorphic sequences will need to be mapped through normal segregation analysis in a population.
  • Polymorphic RSL-sequences may be genetically mapped by analyzing mapping progeny followed by linkage analysis.
  • sequences of the recovered informative DNA templates may then be compared to those derived from one or more different individuals or to a reference set of sequences, to identify specific DNA polymorphisms (alleles) and to generate genotyping/haplotyping information.
  • Suitable software for analyzing the sequence data and for aligning the sequences is available from well-known commercial sources.
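Because RSL-template reads start at a fixed position next to the restriction site, a very simple comparison against a reference sequence can be sketched without a full aligner. The following is a minimal, ungapped illustration only: it assumes the read and reference are already in register and of equal length, ignores insertions and deletions, and uses hypothetical names and sequences.

```python
def find_snps(reference, read):
    """Return (position, reference_base, read_base) for each mismatch; 'N' calls are skipped."""
    if len(reference) != len(read):
        raise ValueError("this sketch requires sequences of equal length")
    return [(i, r, q)
            for i, (r, q) in enumerate(zip(reference.upper(), read.upper()))
            if r != q and q != "N"]

# Hypothetical reference and read from the same restriction-site-anchored template.
snps = find_snps("ACGTACGT", "ACGAACGT")
```

Positions are zero-based offsets from the start of the read, which corresponds to the fixed distance from the sequencing start site discussed above.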
  • an approximately 1,000 Mbp "genome" of random sequence is predicted to contain about 15,500 digestion sites for a restriction enzyme that recognizes a specific 8-base sequence.
  • the collection of about 100 bp sequences flanking this set of restriction enzyme digestion sites constitutes a specific approximately 3.1 Mbp sub-sample of the target "genome."
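The estimate above can be approximated with a short calculation. The sketch assumes a random sequence with equal base frequencies, so that a specific n-base site occurs on average once every 4^n bases, and it assumes two 100 bp flanks per site; both assumptions and the function names are illustrative. Under these assumptions the site count comes out near 15,300, in line with the roughly 15,500 stated above.

```python
def expected_sites(genome_bp, site_len):
    """Expected number of occurrences of one specific site in a random sequence."""
    return genome_bp / 4 ** site_len

def subsample_bp(n_sites, flank_bp=100, flanks_per_site=2):
    """Total bases covered by the flanking sequences of n_sites sites."""
    return n_sites * flank_bp * flanks_per_site

sites = expected_sites(1_000_000_000, 8)   # 8-base recognition sequence
covered = subsample_bp(15_500)             # using the site count quoted in the text
```

The same function shows how site frequency scales with recognition length, for example `expected_sites(1_000_000_000, 6)` for a 6-base cutter.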
  • RSL-sequencing allows the sequences flanking this set of restriction sites to be obtained from genomic DNA or a library of large insert bacterial artificial chromosome (BAC) clones prepared from that genome.
  • the RSL-sequences derived from large insert clones may potentially be used to build physical maps, as the overlapping clones will contain common RSL-sequences.
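The idea that overlapping clones share RSL-sequences can be sketched as a set intersection. The clone names, tag identifiers, and the `min_shared` threshold below are hypothetical, and a real physical map would also need to handle repeated sequences and sequencing errors.

```python
from itertools import combinations

def overlapping_clones(clone_tags, min_shared=1):
    """List clone pairs that share at least `min_shared` RSL-sequences."""
    pairs = []
    for a, b in combinations(sorted(clone_tags), 2):
        shared = set(clone_tags[a]) & set(clone_tags[b])
        if len(shared) >= min_shared:
            pairs.append((a, b, len(shared)))
    return pairs

# Hypothetical clones, each described by the set of RSL-sequences found in it.
clones = {"BAC1": {"t1", "t2"}, "BAC2": {"t2", "t3"}, "BAC3": {"t4"}}
overlaps = overlapping_clones(clones)
```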
  • RSL-sequences mapped onto a BAC-based physical map spanning a genome may potentially be used to locate gene sequences and whole genome sequence assemblies on the genome map.
  • RSL-sequences from different genotypes may be compared to identify DNA polymorphisms useful for the design of DNA marker assays and for diversity and haplotype analyses. Therefore, RSL-sequences obtained using any suitable high throughput sequencing platform will potentially integrate genome map building, genome sequencing, and diversity analysis.
  • an informative DNA template refers to a DNA template that has three properties: (1) the template is compatible with high throughput sequencing, (2) the template contains one or more sequences that can be mapped to a unique location in a genome, and (3) the sequence or sequences derived from the template are polymorphic in the target species (i.e., parental lines used for genetic mapping, lines being fingerprinted, individuals being analyzed as part of a diversity or haplotyping study).
  • an informative DNA template contains a sequence in a location suitable for high throughput DNA sequencing, and contains a unique polymorphic site in the species genome sequence.
  • although an informative DNA template must span a sequence that is polymorphic in the target species, it should be understood that it may or may not be polymorphic in the particular individual analyzed.
  • DNA templates generated using at least one restriction enzyme are referred to herein as "restriction site localized templates" or "RSL-templates."
  • DNA templates generated by shearing only, by digestion with restriction enzyme(s) only, or by a combination of shearing and restriction enzyme(s) digestion, are sometimes referred to herein as "RSL-templates." Accordingly, the term "RSL-templates" should be interpreted to include DNA templates generated by either restriction enzyme(s) digestion or shearing, or by a combination of those.
  • a “nuclear genome” is all the DNA or genetic material in the chromosomes of a eukaryotic organism. Eukaryotic organisms such as plants will also contain organellar genomes in their mitochondria and chloroplasts.
  • allelic form is a distinct DNA sequence or "spelling" of a chromosomal region.
  • first identified allelic form is arbitrarily designated as the reference form and other allelic forms are designated as alternative or variant alleles.
  • allelic form occurring most frequently in a selected population is sometimes referred to as the wild type form. Diploid organisms may be homozygous or heterozygous for allelic forms.
  • a diallelic polymorphism has two forms.
  • a triallelic polymorphism has three forms. Most organisms have multiple alleles of gene sequences in their naturally occurring populations (germplasm in the case of plant species).
  • Polymorphism refers to the occurrence of two or more alternative sequences or alleles in a species or population.
  • a "polymorphic site" is the locus or specific sequence location in a genome at which sequence divergence occurs (i.e., the site of variation between allelic sequences).
  • a polymorphism may comprise one or more base changes, a nucleotide insertion or deletion (INDEL), a nucleotide inversion, or variation in the size of a simple sequence repeat (SSR), relative to a reference allele.
  • a single base pair polymorphism, termed a "single nucleotide polymorphism" (SNP), occurs at a polymorphic site occupied by a single nucleotide.
  • a single nucleotide polymorphism usually arises due to substitution of one nucleotide for another at the polymorphic site.
  • a transition is the replacement of one purine by another purine or one pyrimidine by another pyrimidine.
  • a transversion is the replacement of a purine by a pyrimidine or vice versa.
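The transition/transversion definitions above translate directly into a small classifier. The function name and error handling are illustrative choices.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Classify a single-base substitution as a 'transition' or a 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    # Purine<->purine or pyrimidine<->pyrimidine is a transition; anything else is a transversion.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

kinds = [classify_substitution("A", "G"),
         classify_substitution("C", "T"),
         classify_substitution("A", "C")]
```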
  • a “genotype” is a collection of all the polymorphisms or alleles of an individual's genome.
  • a "haplotype" is a combination of alleles or polymorphisms at multiple loci that tend to be transmitted together, for example, neighboring polymorphisms that are inherited together on the same chromosome.
  • a "monomorphic template" does not contain DNA polymorphisms.
  • An “amplicon” refers to a fragment of DNA that can be amplified using specific priming sites located at each terminus, sequences usually added by ligation of adapters.
  • oligonucleotide is a relatively short nucleic acid sequence, such as DNA or RNA, and may be single- or double-stranded. Oligonucleotides are typically prepared by synthetic means; however, they may also be isolated from naturally occurring sources. For the purposes of this disclosure, oligonucleotides are usually in the range of about 17 to about 30 base pairs (bp) in length, and in some instances are about 30-90 bp long, for example.
  • hybridization refers to the non-covalent interaction or binding of two complementary single-stranded nucleic acid strands (i.e., DNA and/or RNA) into a single double-stranded molecule.
  • Two perfectly complementary strands will bind to each other readily (i.e., anneal or "hybridize") because the nucleotides of the complementary strands bind to their complements under normal hybridization conditions.
  • Hybridizations are usually performed under stringent conditions that are dependent on sequence length, GC content, temperature, salt, and other characteristics of the hybridizing media.
  • conditions of 5X SSPE (750 mM NaCl, 50 mM Na2PO4, 5 mM EDTA, pH 7.4) and a temperature of 25-30°C are suitable for SNP-specific oligonucleotide probe hybridizations.
  • for stringent hybridization conditions see, for example, Sambrook, Fritsch and Maniatis, "Molecular Cloning: A Laboratory Manual," 2nd Ed., Cold Spring Harbor Press (1989).
  • a “hybridization array” is an array comprising a solid support or matrix with attached oligonucleotide probes. Arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different locations. Substrates may be beads, planar or curved surfaces, fibers such as fiber optics, glass or any other suitable material or structure.
  • High throughput sequencing refers to instances where 400,000 to 70 million templates or more are sequenced in parallel in a single run generating up to 2 billion bases of sequence or more per run, in an automated nucleotide sequencing system.
  • references herein to "an individual” may apply to human beings, other mammals, plants, bacteria, or any other organism, as the context allows in this disclosure.
  • the term “individual” refers to a single member of any species. In most cases, the genotypes and templates derived from each individual of a species or of a different species will be different, but occasionally (e.g., twins or clones) the genotypes and corresponding templates will be the same. For example, different individuals may have different genotypes (e.g., Genbank accession numbers, ecotypes, germplasm accession numbers, etc.), with the exception of clones or genetically identical twins.
  • Example 1 DNA template preparation using one restriction enzyme.
  • Adapter A is a double-stranded nucleotide sequence containing short terminal sequences that allow binding and ligation to DNA fragments generated by one or more restriction enzymes, a selected 2-6 bp nucleotide sequence (index), and a unique primer binding sequence useful for amplification or to initiate DNA sequencing ("primer A"). Adapter B contains a unique primer binding sequence.
  • index: a selected 2-6 bp nucleotide sequence
  • primer A: a unique primer binding sequence useful for amplification or to initiate DNA sequencing
  • adapter B: contains a unique primer binding sequence.
  • the sets of uniquely indexed adapters A and adapters B can be prepared by individual users of the technology but are typically provided by the manufacturer of the selected sequencing platform to be used for sequencing the informative DNA templates obtained in Steps 1-5. If adapter A contains an indexing sequence, similarly prepared RFs from different individuals can be pooled to increase the efficiency of downstream processing.
  • Enrich for DNA templates with sequence lengths optimal for the DNA sequencer selected for analysis, i.e., in the range of about 50-200 bp for the ILLUMINA® SGAII™, or 400-600 bp for the Roche 454 sequencer.
  • Ligate adapter A to the resulting RFs. If adapter A contains an indexing sequence, similarly prepared RFs from different individuals can be pooled to increase the efficiency of downstream processing.
  • Fragment the adapter A-linked RFs by shearing to generate smaller fragments in the range of about 50 bp to about 200 bp sequence length for sequencing on the ILLUMINA® SGAII™, or 400-600 bp for sequencing on the Roche 454 sequencer.
  • Ligate adapter B to the DNA fragments.
  • the technical approaches described above for targeted sequencing and polymorphism discovery adjacent to restriction sites in large genomes were validated on the sorghum genome (800 Mbp).
  • the number of sites across the sorghum genome analyzed for SNP discovery was varied in several ways, as follows: (1) using restriction enzymes with 8- or 6-base recognition sites (FseI and KasI, respectively, were tested), (2) using a restriction enzyme sensitive to DNA methylation (FseI) and a restriction enzyme that is not sensitive to DNA methylation (SphI), and (3) using a 4 bp restriction enzyme or shearing to generate the second end of each amplicon. From this study it was concluded that the use of methylation-sensitive restriction enzymes significantly reduced the number of repeat sequences obtained, increasing data yield/sequence.
  • FseI/MseI RSL-templates were prepared from BTx623 and IS3620C, and approximately 250,000 RSL-sequences were obtained from each genotype.
  • approximately 11,000 different sequences acquired in the experiment excluding error containing sequences
  • approximately 5,000 templates containing unique sequences were sequenced 5X or more times from each genotype.
  • Comparison of the sequences from the two genotypes identified 200-400 SNPs/InDels within 27 bp of the FseI restriction site.
  • RSL-templates were prepared by digestion of BTx623 and IS3620C genomic DNA with FseI followed by ligation of adapter A. The resulting RFs were sheared and adapter B was ligated to create DNA templates. After PCR using primers complementary to sequences in adapter A and adapter B, and enrichment of templates of an optimal size, the templates were sequenced using priming sites in adapter A on an ILLUMINA® SGAII™ sequencer. In this experiment, approximately 13,000 different unique 27 bp sequences were obtained from both genotypes, revealing approximately 1,500 polymorphic sequences.
  • DNA templates generated using KasI/shearing allowed sequence analysis of approximately 50,000 different unique RSL-tags at 5X or greater depth through acquisition of about 3,000,000 sequences. It is estimated that this collection of sequences from IS3620C and BTx623 will reveal more than 5,000 SNPs/INDELs adjacent to FseI-sites when the data is fully analyzed.
  • When the CspCI restriction enzyme is used to digest a random 1,000 Mbp genome sequence, it is predicted that this size genome will contain about 62,500 sites for CspCI and generate 125,000 DNA templates.
  • the CspCI enzyme is not methylation sensitive, so nearly all sites would be available for digestion.
  • the resulting small (approximately 34-38 bp) DNA fragments may be purified by size selection on agarose gels, blunt ended, and ligated to adapters A and B (one of which contains the sequencing primer binding site plus an indexing sequence). PCR amplification will differentially amplify RSL-tags flanked by two different adapters (due to suppression PCR) and these may be loaded directly onto the sequencer or further purified prior to sequencing as necessary.
  • both strands of each DNA template will be sequenced, thereby eliminating or reducing the extent of the increased sequencing error rates that tend to occur towards the 3'-end of each ILLUMINA® sequencing run. If CspCI digestion cuts a random-sequence 1,000 Mbp genome at 62,500 sites creating 125,000 RSL-templates, and 50% of the 33 bp sequences derived from the resulting RSL-templates are unique, and if there is one polymorphism per 1,000 bp in a comparison of two genotypes, then analysis of this set of RSL-templates by sequencing would reveal approximately 2,062 polymorphic sequences. The subset of RSL-tags corresponding to unique sequences spanning polymorphic sites may then be mapped and used for genotyping, DNA fingerprinting, or haplotype analysis.
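The estimate of approximately 2,062 polymorphic sequences follows from simple arithmetic on the figures stated above. The following is a minimal sketch of that calculation; the function and parameter names are illustrative assumptions, and only the numeric inputs come from the text.

```python
def expected_polymorphic_sequences(n_templates, unique_fraction,
                                   read_length_bp, bp_per_polymorphism):
    """Estimate how many unique template reads span a polymorphism.

    Hypothetical helper: multiplies the number of unique templates by the
    bases read per template, then divides by the average spacing between
    polymorphisms.
    """
    unique_templates = n_templates * unique_fraction
    surveyed_bp = unique_templates * read_length_bp
    return surveyed_bp / bp_per_polymorphism


# Figures taken from the text: 62,500 CspCI sites, two templates per site.
sites = 62_500
templates = sites * 2

estimate = expected_polymorphic_sequences(
    n_templates=templates,        # 125,000 RSL-templates
    unique_fraction=0.5,          # 50% of the 33 bp sequences are unique
    read_length_bp=33,
    bp_per_polymorphism=1_000,    # one polymorphism per 1,000 bp
)
print(estimate)  # 2062.5, i.e. approximately 2,062 polymorphic sequences
```

Changing any one input (for example, the unique fraction in a more repetitive genome) shows how sensitive the expected yield is to that assumption.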
  • Example 5 Identification of RSL polymorphism and design of primers for amplification of the polymorphic region in sorghum.
  • SNPs single nucleotide polymorphisms
  • InDels: insertion/deletions
  • in FIG. 6, a 4 bp InDel polymorphism between BTx623 and IS3620c is identified.
  • a reverse primer is designed downstream of the polymorphism. This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing.
  • A) denotes the sequence identifier: (Coordinates on Sorghum pseudomolecule); (Genotype_Sequence ID_Number of sequences in the sequence contig).
  • B) is the alignment and comparison of BTx623 and IS3620c sequence contigs.
  • the InDel is bolded and underlined.
  • the next six bases correspond to the FseI half-site.
  • the Index Sequences are contained within the RSL adapter ligated to the FseI site.
  • C) the results of a BLAST similarity search are shown, comparing the BTx623 sequence with the assembled Phytozome sorghum pseudomolecule database (available on the world wide web at phytozome.net).
  • a 227 bp region, containing the 27 bp ILLUMINA® sequence and 200 bp downstream, is identified and downloaded.
  • an optimal reverse PCR primer is designed within the 227 bp sequence that will produce a PCR product of approximately 75-200 bp, when used with a forward primer specific for the FseI adapter.
  • D) the adapter-modified DNA fragment, and representations of the forward and reverse FseI adapter-specific primers are shown.
  • C) and D) the FseI half site is underlined.
  • the SNP or InDel is bolded and underlined, and the reverse oligonucleotide primer-binding site is italicized and underlined.
  • Example 6 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 14) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) and D) the FseI half site is underlined.
  • the InDel is bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • Example 7 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 18) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 8 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 22) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the InDel is bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 9 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 26) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 10 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 30) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 11 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum. As illustrated in FIG. 12, three SNP polymorphisms between BTx623 and IS3620c are identified.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 34) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • a three SNP polymorphism between BTx623 and IS3620c is identified.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 38) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 13 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • a 1 bp InDel and one SNP polymorphism between BTx623 and IS3620c are identified.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 42) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNP and InDel are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • Example 14 Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
  • A) denotes the sequence identifier.
  • B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence.
  • a reverse primer (SEQ ID NO: 46) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5.
  • C) the FseI half site is underlined.
  • the SNPs and InDel are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
  • the adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
  • DNA template generation using combinations of restriction enzymes and adapter ligation, as described in Example 1, above, has been tested in silico and in the laboratory based on the rice genome sequence containing about 400,000,000 bp.
  • DNA templates generated by the described technique were sequenced by using the Roche 454 Genome Sequencer 20 System. The approximately 250,000 template sequences generated per run from several sequencing runs were analyzed and compared to results predicted in silico. The results confirmed the feasibility of using restriction enzymes/adapter ligation for the reproducible generation of DNA templates for high throughput targeted DNA sequencing and resequencing using the Roche 454 sequencer.
  • Example 16 High Throughput Multiplex Sequencing of Informative DNA Templates.
  • 10,000 informative RSL-templates are targeted for selective amplification, capture and genotyping analysis across 100 accessions of a species germplasm.
  • a 10X depth of sequence analysis of the amplified, enriched informative DNA templates will require the acquisition of 10,000,000 sequences on the ILLUMINA® sequencer.
  • the ILLUMINA® SGAII™ is capable of sequencing approximately 50 million templates per run or approximately 6.25 million per channel.
  • the required 10,000,000 sequences may be distributed across several channels of the ILLUMINA® sequencer with sequencing done in parallel with other samples utilizing unique indexing sequences to assign the sequences to their accession of origin.
  • Genomic DNA from 100 different accessions may be digested with KasI, CspCI, or any other suitable targeting restriction enzyme, followed by ligation of adapters as described above (and illustrated in box 2 of FIG. 2).
  • the resulting indexed DNA templates are then pooled prior to amplification and enrichment of the informative DNA templates.
  • a potential advantage of the proposed approach, in addition to its lower cost, is the procedural flexibility and low barrier to entry. Many individual investigators will be able to obtain genotyping information at various depths depending on the requirements of the selected application.
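The sequencing requirement quoted for Example 16 (10,000 templates, 100 accessions, 10X depth) can be checked with a short calculation. This is a sketch only: the helper names are illustrative assumptions, the per-channel capacity is the approximate figure stated above, and the channel count is rounded up, ignoring any reads lost to errors or uneven pooling.

```python
import math


def reads_required(n_templates, n_accessions, depth):
    """Total reads needed to cover every template in every accession."""
    return n_templates * n_accessions * depth


def channels_required(total_reads, reads_per_channel):
    """Whole sequencer channels needed, rounding up."""
    return math.ceil(total_reads / reads_per_channel)


total = reads_required(n_templates=10_000, n_accessions=100, depth=10)
channels = channels_required(total, reads_per_channel=6_250_000)

print(total)     # 10000000 sequences, matching the figure in the text
print(channels)  # 2 channels at about 6.25 million templates per channel
```

Under these assumptions the 100-accession experiment fits in two channels, which is consistent with the statement that the remaining channels can carry other indexed samples in parallel.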

Abstract

Methods and compositions are disclosed for generating informative DNA templates for sequencing. The methods generally include fragmenting an individual's genomic DNA to generate templates with a predetermined range of sequence lengths; ligating adapters; selecting informative DNA templates, which contain a polymorphism; and recovering enriched informative DNA templates. The DNA templates may then be sequenced using a high throughput sequencing process, and the resulting sequences compared to a reference set of sequences obtained from the genomes of other individuals, to identify DNA polymorphisms and to generate genotypes.

Description

TITLE OF INVENTION
METHOD OF GENERATING INFORMATIVE DNA TEMPLATES FOR HIGH-THROUGHPUT SEQUENCING APPLICATIONS
REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Application No. 61/102,118, filed on October 2, 2008, which is incorporated herein by reference in its entirety.
INCORPORATION OF SEQUENCE LISTING
The Sequence Listing, which is a part of the present disclosure, includes a computer readable 11 KB file created on September 25, 2009 entitled "TAMC011WO_ST25.txt" comprising nucleotide and/or amino acid sequences of the present invention submitted via EFS-Web. The subject matter of the Sequence Listing is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present invention generally relates to the generation of DNA templates from specific sites in genomes for high throughput sequence-based analysis, for such applications as genotyping, marker-assisted breeding, genetic mapping, haplotyping, physical map construction, and gene mapping. More particularly, the invention relates to methods for isolating and enriching the population of informative DNA templates for such high throughput sequencing applications.
BACKGROUND
A number of techniques have been described for discovering and detecting DNA polymorphisms. Most involve an initial DNA sequence polymorphism discovery step, which usually involves direct sequencing of cDNA or genes, or involves hybridization to oligonucleotide arrays, followed by the development of targeted single nucleotide polymorphism (SNP) or insertion/deletion (InDel) detection assays. Digestion of genomic DNA by restriction enzymes, and/or shearing, followed by adapter ligation, is a standard procedure for preparing amplifiable templates from genomes for a variety of uses, such as polymerase chain reaction (PCR)-based genotyping assays. One recent methodology known as amplified fragment length polymorphism (AFLP) employs PCR to amplify restriction fragments obtained from a complex mixture of DNA fragments that are prepared by the action of restriction endonucleases on genomic DNA. In some methodologies, the primers used for amplification of DNA, or to initiate sequence analysis of DNA, are not directed against a known genomic DNA sequence, but rather are designed such that they are complementary to sequences in adapters ligated to the ends of the restriction fragments. This strategy yields restriction site localized (RSL) DNA sequences, some of which include polymorphisms.
Bulk methods for isolating restriction fragments that contain polymorphisms recover the entire collection of restriction fragments generated by the selected restriction enzyme. However, not all of those DNA fragments can be used to generate useful genotyping information. First, the sequences adjacent to many of the restriction sites will be repetitive and therefore not useful for genotyping. Secondly, some restriction site localized (RSL) fragments will be too large or too small or contain secondary structures that make them poor templates for bridge amplification and sequencing. Thirdly, in genetic mapping applications, the only informative restriction fragments are those that contain DNA sequences spanning polymorphisms that distinguish parental genotypes. In addition to those challenges, secondary sources of DNA such as chloroplast, mitochondria, and contaminating bacteria or fungal genomes, may be present at relatively high copy number in some samples of a nuclear DNA selected for analysis. These non-target sources of DNA restriction fragments reduce the efficiency of bulk amplification and sequencing procedures.
There is continuing interest in developing ways to rapidly identify a large number of polymorphisms and mutations distributed over the entire expanse of large genomes or targeted to specific sites, to facilitate their use in a wide range of applications, including DNA fingerprinting, marker-assisted plant and animal breeding, genetic/quantitative trait locus (QTL) mapping, haplotyping, and other applications.
SUMMARY
Embodiments of the present methods comprise the sequential use of restriction enzymes and adapter ligation for template generation, followed by selection of informative templates by hybridization to oligonucleotides attached to a solid matrix, or by targeted PCR amplification, with subsequent direct sequence analysis. Embodiments of this specific combination of procedures provide a highly flexible, very low cost, and highly accurate way to obtain genotyping information. Compositions and kits for carrying out such methods are also provided in accordance with some embodiments of the invention.
In accordance with certain embodiments, a method of generating informative DNA templates for sequencing is provided that comprises: (a) obtaining a fragmented genomic DNA sequence from a first individual, to provide a mixture of DNA fragments, wherein the genomic DNA comprises a plurality of polymorphisms; (b) ligating at least a first adapter to the DNA fragments, to provide a mixture comprising adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein each said informative DNA template comprises a unique sequence in a location compatible with high throughput DNA sequencing of the template, wherein the unique sequence comprises a unique polymorphic site in the species genome sequence; and (c) selecting adapter-modified informative DNA templates (e.g., by either hybridization-based selection or targeted PCR amplification of the adapter-modified informative templates), to obtain an enriched mixture of adapter-modified informative DNA templates. In some embodiments, a hybridization-based selection such as in step (c) involves forming hybridized complexes comprising the adapter-modified informative DNA templates, oligonucleotides and a solid matrix, and excluding the non-informative DNA templates. In some embodiments, the method further comprises (d) separating non-hybridized non-informative DNA templates from the hybridized complexes; and (e) releasing the informative DNA templates from the hybridized complexes, to obtain an enriched mixture of adapter-modified informative DNA templates. In some embodiments, forming the hybridized complexes such as in step (c) comprises hybridizing oligonucleotides to complementary sequences in the informative DNA templates.
In some embodiments, targeted PCR amplification of informative templates such as in step (c) involves a first primer complementary to the first adapter, and a second primer (or a plurality of second primers) complementary to one or more unique sequences in the informative DNA fragments, wherein each said second primer is designed such that the resulting amplified DNA templates are of a predetermined sequence length (or range of lengths) and comprise informative DNA sequences.
In some embodiments, in step (b), ligating comprises ligating a second adapter to a terminus of each said DNA fragment opposite the first adapter, to provide the mixture of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates. In the process, some templates are generated that comprise adapter A or adapter B ligated to both termini of some DNA fragments, making them incompatible with some high throughput DNA sequencing technologies. Therefore, some embodiments further comprise the use of suppression PCR to amplify and enrich DNA templates comprising adapters A and B that are compatible with high throughput sequencing relative to templates containing only adapter A or adapter B that are not suitable for sequencing.
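The distinction drawn here between usable and unusable ligation products can be illustrated with a small bookkeeping model. The sketch below is an assumption-laden illustration, not a model of suppression PCR chemistry: it represents each template only by the adapter found at each terminus, and `is_sequenceable` is a hypothetical helper name.

```python
def is_sequenceable(left_adapter, right_adapter):
    """Return True only when the two termini carry different adapters.

    Templates with adapter A on both ends, or adapter B on both ends,
    are the products the text describes as unsuitable for sequencing.
    """
    return {left_adapter, right_adapter} == {"A", "B"}


# The four possible ligation outcomes for a fragment with two termini.
templates = [("A", "B"), ("A", "A"), ("B", "A"), ("B", "B")]

kept = [t for t in templates if is_sequenceable(*t)]
print(kept)  # [('A', 'B'), ('B', 'A')]
```

In this simplified picture, enrichment amounts to retaining the A/B products and discarding the A/A and B/B products.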
In some embodiments, a method according to the invention further comprises selecting adapter-modified informative DNA templates having a predetermined sequence length that is compatible with DNA sequencing (e.g., bridge amplification in the case of ILLUMINA® SGAII™ sequencing) and with the read length of a selected high throughput sequencing process.
In some embodiments, the method further comprises a step (d) for subjecting the enriched mixture of adapter-modified informative DNA templates to a high-throughput DNA sequencing procedure, to obtain the sequences of the informative DNA templates, and a step (e) comparing the sequences of the informative DNA templates to at least one set of reference genomic DNA sequences to identify the specific (e.g., polymorphic) allele sequence obtained from each template or site in the genome. In certain embodiments, at least one set of reference genomic DNA sequences is obtained from at least one reference individual. In further embodiments, at least the first adapter comprises an indexing sequence that can be correlated to the first individual.
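Steps (d) and (e), together with the indexing sequence, suggest a simple downstream analysis: assign each read to an individual by its index, then compare the remaining sequence to a reference. The following is a minimal sketch under stated assumptions. It assumes the index occupies the first bases of each read, that reads and reference are the same length, and that only substitutions are of interest; the index sequences, read sequences, and function names are all hypothetical.

```python
def demultiplex(reads, index_to_individual, index_length):
    """Group reads by individual using a leading index sequence.

    Reads whose index is not recognized are skipped.
    """
    by_individual = {}
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        individual = index_to_individual.get(index)
        if individual is None:
            continue
        by_individual.setdefault(individual, []).append(insert)
    return by_individual


def mismatches(sequence, reference):
    """List (position, reference_base, observed_base) for each difference."""
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sequence))
            if r != s]


# Hypothetical 4 bp indexes (the text describes indexes of 2-6 bp).
index_to_individual = {"ACGT": "BTx623", "TGCA": "IS3620C"}
reads = ["ACGTGGCCGGTTAC", "TGCAGGCCGGTAAC"]  # made-up example reads
reference = "GGCCGGTTAC"                      # made-up reference sequence

grouped = demultiplex(reads, index_to_individual, index_length=4)
for individual, inserts in grouped.items():
    for insert in inserts:
        print(individual, mismatches(insert, reference))
# BTx623 []
# IS3620C [(7, 'T', 'A')]
```

A real analysis would also need to handle sequencing errors in the index, insertions and deletions, and read depth, none of which are modeled here.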
In accordance with still another embodiment of the invention, a reagent is provided for selecting informative DNA templates, comprising a solid matrix (e.g., a plurality of beads) and a plurality of different oligonucleotides attached to the solid matrix. For example, each of the oligonucleotides may be in the range of about 17-60 nucleotides in length and is complementary to a unique sequence present in a respective informative DNA template. In certain aspects, each such informative DNA template comprises at least one polymorphism located within the read length of a selected high throughput sequencing process, wherein the location is measured from either terminus of the informative DNA template.
In accordance with another embodiment of the invention, a hybrid DNA complex is provided which comprises a reagent for selection of informative DNA templates as described herein and a plurality of adapter-modified informative DNA templates hybridized to the matrix-attached oligonucleotides. For instance, each such informative DNA template may comprise at least one polymorphism located within the read length of a selected high throughput sequencing process, wherein the location is measured from either terminus of the informative DNA template. In some aspects, the adapter-modified informative DNA templates comprise at least a first adapter ligated to the informative templates. In various embodiments, the informative DNA templates are obtained from a single individual, or are derived from a plurality of individuals, in which case the first adapter comprises a unique indexing sequence ligated to each of the informative DNA templates for matching it to the individual from which it was derived. These and other embodiments and potential advantages will be apparent with reference to the following description and drawings.
In still a further embodiment, the invention provides a method for marker-assisted selection. For instance, fragmented genomic DNA is obtained from a plurality of individual plants or plant cells, to provide a plurality of genomic DNA fragments comprising a plurality of polymorphic sequences at least one of which is linked to a trait of interest. Fragmented genomic DNA is then ligated to at least a first adapter, to provide a plurality of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein each of said informative DNA templates comprises a polymorphic sequence and wherein said first adapter comprises an index sequence that can be correlated to genomic DNA of an individual plant or plant cell. Adapter-modified informative DNA templates are selected by either hybridization-based selection or targeted PCR amplification of the adapter-modified informative DNA templates, to obtain an enriched mixture of adapter-modified informative DNA templates that can be sequenced. Based on the resulting sequences, an individual plant or plant cell is selected for the presence of at least one polymorphism linked to a trait of interest. For example, selecting an individual plant or plant cell may comprise selecting a plant cell for regeneration of a plant, and/or a plant comprising a trait of interest can be selected for commercial production or breeding. In certain aspects, a plant or plant cell for selection is a wheat, maize, rye, rice, oat, barley, turfgrass, sorghum, millet, sugarcane, tobacco, tomato, potato, soybean, cotton, canola, sunflower or alfalfa plant or plant cell. Thus, in certain aspects, the invention provides a method for marker-assisted selection of a genomic region or gene that regulates expression of a trait, such as a trait of agronomic interest in a plant (e.g., a drought tolerance, enhanced yield, cold tolerance, pest resistance, insect resistance, salt tolerance or herbicide tolerance trait).
As used herein in the specification, "a" or "an" may mean one or more. As used herein in the claim(s), when used in conjunction with the word "comprising", the words "a" or "an" may mean one or more than one.
The use of the term "or" in the claims is used to mean "and/or" unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and "and/or." As used herein "another" may mean at least a second or more. Throughout this application, the term "about" is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.
Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
FIG. 1 is a schematic flow diagram of a method of generating DNA polymorphism-enriched DNA templates for sequencing, in accordance with certain embodiments of the invention.
FIG. 2 is a box flow diagram summarizing template preparation, and illustrating the structure of representative adapter- and primer-modified DNA templates and sequencing complexes, in accordance with certain embodiments of the invention.
FIG. 3 is a box flow diagram of a method of generating enriched informative DNA templates, commencing with restriction enzyme digestion of genomic DNA and/or shearing, in accordance with certain embodiments of the invention.
FIG. 4 is a box flow diagram of the steps following preparation and enrichment of informative templates that involve optional amplification, then sequencing and analysis of DNA sequences to identify polymorphisms and alleles.
FIG. 5 is a box flow diagram of a method of FIG. 1, including PCR amplification of informative DNA fragments, according to certain embodiments of the invention.
FIG. 6 illustrates an example of a specific 30 bp restriction site localized sequence from a sorghum genotype containing a unique polymorphism.
FIG. 7 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 8 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 9 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 10 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 11 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 12 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 13 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 14 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
FIG. 15 illustrates an example of another specific 30 bp restriction site localized sequence from a sorghum genotype containing another unique polymorphism.
DETAILED DESCRIPTION
Restriction site localized (RSL)-sequences derived from nearly any genome will be a mixture of "unique" highly informative DNA sequences that occur once per genome, along with less informative DNA sequences that occur two or more times in the target genome. Many existing bulk methods for isolating RSL-templates recover the entire collection of RSL-templates generated by a specific targeting restriction enzyme. However, not all of those recovered RSL-templates can be used to generate useful genotyping information. First, the sequences adjacent to many of these restriction sites will be repetitive and therefore not useful for genotyping. Secondly, some RSL-templates will be poor templates for bridge amplification because they are too large or too small, contain DNA sequences with a propensity to form secondary structures, or for some other reason. Such size considerations are discussed in more detail below. Thirdly, in genetic mapping applications, the only informative sequences, and therefore informative RSL-templates, are those that correspond to DNA sequences spanning polymorphisms that distinguish parental genotypes. For example, if the parents used to create a population for genetic mapping purposes contain one sequence polymorphism per 1,000 bp, then only 1 specific template out of 40 RSL-templates sequenced at random will contain a polymorphic sequence, assuming 25 bp of sequence is obtained per template. In this instance, only the single DNA template containing the polymorphism is considered "informative," while the 39 other DNA templates in the mixture would be "uninformative," for the purposes of genetic mapping in the target population.
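The 1-in-40 figure in the example above follows from multiplying the polymorphism density by the read length. The following is a minimal sketch of that arithmetic, assuming polymorphisms are spread evenly and that a simple linear approximation is adequate; the function name and parameters are illustrative only and are not part of the described method.

```python
def informative_fraction(polymorphisms_per_bp: float, read_length_bp: int) -> float:
    """Approximate fraction of randomly chosen templates whose read spans a polymorphism.

    Assumption (not from the specification): polymorphisms are evenly
    distributed, so the expected fraction is density * read length, capped at 1.
    """
    return min(1.0, polymorphisms_per_bp * read_length_bp)


# Values taken from the example in the text: 1 polymorphism per 1,000 bp, 25 bp reads.
fraction = informative_fraction(1 / 1000, 25)
templates_per_informative = round(1 / fraction)
print(f"fraction informative: {fraction:.3f}")
print(f"about 1 informative template per {templates_per_informative} sequenced")
```

Under the same assumption, a longer read length raises the informative fraction proportionally, which is one reason the read length of the chosen sequencer matters in the stages described below.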
In certain embodiments described herein, direct hybridization (e.g., to oligonucleotide-beads or microarrays) is utilized to select large numbers of RSL-templates that contain polymorphisms (e.g., SNP/INDEL) for sequence-based genotyping. In other embodiments, a primer-targeted PCR-based enrichment is described that allows smaller numbers of polymorphism-containing RSL-templates to be amplified for sequence-based genotyping.
Embodiments of the present methods that provide targeted isolation of RSL-templates for sequencing, which contain informative polymorphic sequences, will potentially increase efficiency by approximately 100-fold or more (by removing repeats and monomorphic templates). This will potentially reduce the cost of genotyping considerably and expand the number and complexity of applications for this technology. Some applications require the collection of information from a large number of polymorphic sites per genome (e.g., haplotype-based analysis, which typically requires analysis of about 5,000 to 20,000 or more polymorphic sites per genome depending on linkage disequilibrium), whereas other applications, such as DNA fingerprinting, early phase genetic mapping or pedigree analysis, require analysis of a relatively small number of polymorphic sites (approximately 500-2,000) per genome. Fine mapping genetic studies require higher density analysis (a large number of polymorphic sites analyzed per Mbp), but only in one or a few regions of the genome. Various embodiments of the methods described herein for sequence-based genotyping substantially enhance the output of these forms of genomic analysis.
Embodiments of the methods disclosed herein offer a flexible method to generate and perform a targeted analysis of subsets of RSL-templates generated by any specific set of restriction enzymes, allowing variation in the number analyzed per genome and their distribution across the genome (or based on their utility in larger population studies).
FIG. 1 is a box flow diagram of a method of generating a mixture or pool of informative DNA templates, and then sequencing those templates to identify DNA polymorphisms by comparison to a reference sequence, to generate genotypes and haplotypes. The method generally comprises the following stages or a subset thereof:
I. Fragment an individual's genomic DNA to generate templates with a predetermined range of sequence lengths.
II. Ligate adapters and select DNA template size.
III. Informative DNA template selection.
IV. Recovery of enriched informative DNA templates.
V. High throughput sequencing of the recovered DNA templates.
VI. Compare the resulting sequences to a reference set of sequences obtained from the genomes of one or more other individuals, to identify DNA polymorphisms and to generate a genotype.
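The comparison in Stage VI can be pictured as a position-by-position check of a sequenced read against the corresponding reference sequence. The sketch below is a deliberately simplified illustration that assumes the read and reference are already aligned and of equal length; the function name and the toy sequences are hypothetical and do not come from the specification.

```python
from typing import List, Tuple


def find_mismatches(read: str, reference: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, read_base) for every differing position.

    Assumption: read and reference are pre-aligned and equal in length,
    so only substitutions (not insertions or deletions) are reported.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    return [
        (pos, ref_base, read_base)
        for pos, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base != read_base
    ]


# Toy sequences for illustration only.
reference = "ACGTAACG"
read = "ACGTTACG"
print(find_mismatches(read, reference))
```

A real analysis would also need alignment and handling of insertions and deletions, which this sketch leaves out.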
Stage I. Fragmentation of the genomic DNA of an individual to generate templates of a predetermined sequence length.
The fragmentation stage is designed to produce DNA fragments having a specified complexity, sequence length and information content. Template size is dependent on two factors: (1) DNA fragment length suitable for processing by the high throughput sequencer used to collect sequence information (e.g., an optimal size of about 50 to 250 bp, for uniform bridge-amplification on the ILLUMINA® SGAII™), and (2) DNA template length sufficient to utilize the sequence read length of the high throughput sequencer (e.g., optimally, at least 35-50 bp for the SGAII™ and at least 200-400 bp for the Roche 454 GS-FLX™ Genome Analysis System). In general, this predetermined sequence length of the DNA fragments is less than that of a protein-coding region of a gene, and in most instances is also smaller than an average exon length. The predetermined template size may vary somewhat based on whether the fragments are generated by restriction enzymes or shearing, or by a combination of those techniques. For example, for restriction fragments, the distance of the DNA polymorphism away from the restriction enzyme cleavage site (where adapter A is added) is the primary determinant of the minimum fragment sequence length (as can be appreciated in the schematic illustration of Stage IIIa in FIG. 1). The "sequencer read length" of a given sequencing system refers to the number of bases of sequence information that can be acquired from each template sequenced.
The minimum sequence length of an informative DNA template is that which spans a DNA polymorphism and that can be uniquely located in a genome. A sequence as small as about 17 bp is sufficient to span a polymorphism and is often large enough to be a unique sequence in a genome, permitting identification of its location on a genetic map or within a genome sequence. Somewhat larger sequences (e.g., 25-35 bp) are employed in some embodiments, in that a relatively high percentage of the sequences in this size range can be uniquely located in most genomes. When a high-throughput sequencer such as the ILLUMINA® SGAII™ or the ABI SOLID™ sequencer is used (e.g., in Stage V), the sequence read length is approximately 35-50 bp. Accordingly, the targeted DNA polymorphisms should be within about 35-50 bp of a restriction site where the adapter containing sequences used to initiate sequencing is located. In instances in which a high-throughput sequencer such as the Roche 454 sequencer is used as the sequencing system (e.g., in Stage V), DNA polymorphisms within about 100-400 bp of restriction sites can be sequenced. If a DNA fragment is not within the read length adjacent to the cleavage end obtained with one restriction enzyme, a second restriction enzyme with a different recognition sequence can be used in addition to the first enzyme, to generate template. Alternatively, genomic DNA can be sheared to generate template with a random set of termini. Adapter A, which is later used to initiate sequencing reactions in Stage V, is ligated to DNA termini generated by shearing, and the informative templates in the resulting mixture of sheared fragments are then enriched as described below (hybridization-based or PCR-based methods). This approach to template preparation allows any region of any genome to be targeted for genotyping by direct sequence analysis.
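Whether a given polymorphism is reachable from a restriction site depends on its distance from that site relative to the sequencer read length. The sketch below illustrates one way such a check might be expressed. It makes two simplifying assumptions that are not from the specification: distance is measured from the edges of the recognition sequence rather than from the exact cleavage position, and the toy sequence and the GAATTC recognition sequence are chosen purely for illustration.

```python
from typing import List


def site_positions(sequence: str, recognition: str) -> List[int]:
    """Return the 0-based start of every occurrence of the recognition sequence."""
    positions = []
    start = sequence.find(recognition)
    while start != -1:
        positions.append(start)
        start = sequence.find(recognition, start + 1)
    return positions


def within_read_length(polymorphism: int, sites: List[int],
                       site_len: int, read_length: int) -> bool:
    """True if the polymorphic base lies within read_length bases of either edge of a site.

    Simplification: distances are counted from the recognition-sequence edges,
    not from the enzyme's actual cut position.
    """
    for site in sites:
        downstream = polymorphism - (site + site_len)  # 0 = first base after the site
        upstream = site - 1 - polymorphism             # 0 = last base before the site
        if 0 <= downstream < read_length or 0 <= upstream < read_length:
            return True
    return False


# Hypothetical toy sequence with two recognition sites 100 bp apart.
toy = "A" * 10 + "GAATTC" + "C" * 100 + "GAATTC" + "A" * 10
sites = site_positions(toy, "GAATTC")
print(sites)
print(within_read_length(30, sites, 6, 35))   # close to the first site
print(within_read_length(66, sites, 6, 35))   # midway between sites, short reads
print(within_read_length(66, sites, 6, 400))  # same position, longer reads
```

The last two calls illustrate the point made in the text: a polymorphism that is out of reach with a short read length may become accessible with a longer read length, or by adding a second enzyme that cuts closer to it.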
In the resulting mixture of genomic fragments, the approximately 17-400 bp stretch of DNA sequence located adjacent to a restriction enzyme recognition site or shearing site (i.e., restriction site or cleavage site) must be unique in the target genome and span a polymorphism for a fragment to be useful for genotyping. These restriction fragments or sheared fragments, once modified by the addition of suitable terminal sequences required for sequencing, are the "informative DNA templates." In some embodiments, in which the DNA fragments are prepared entirely by shearing, the predetermined sequence length is in the range of about 35-50 bp (e.g., for the ILLUMINA® SGAII™ and ABI SOLID™ sequencers) or about 400 bp (e.g., for the Roche 454 sequencer, which has a comparatively longer sequence read length). In the case of DNA fragments generated by random shearing, the informative DNA templates are those that can be used to generate sequences unique in the target genome, that span DNA polymorphisms, and that have adapters ligated sufficiently close to the DNA polymorphism so that sequences spanning the target polymorphic sequence are acquired.
Informative DNA templates will in most cases be larger than the sequences ultimately derived from the templates by sequencing. For example, templates generated specifically for use with the SGAII™ sequencer may be 50-200bp in length, even though the SGAII™ only sequences about 35bp from one end of the template. Likewise, templates generated specifically for use with the Roche 454 sequencer may be about 400-800bp in length, even though only 200-400bp of sequence are acquired from one end of the template. In other instances, larger templates may be generated for sequencing from both ends of the template (using the SGAII™ sequencer, for instance). It should be understood that the templates and hybridization complexes illustrated in FIG. 1 are not drawn to scale, and the relative lengths of the oligonucleotides and the templates may vary greatly. In some cases, the genomic DNA is fragmented in Stage I to a desired range of sequence lengths and complexity by digesting the DNA with one or more restriction enzymes (as schematically shown in Step 1 of FIG. 3). Complexity in this instance refers to the number of termini generated by digestion of a genome with a restriction enzyme and that correspond to informative sequences. This is accomplished, for example, with restriction enzymes having different recognition sequences (e.g., 4 bp, 6 bp, 7 bp, 8 bp) to vary the number, complexity, repetitiveness, and GC content of the DNA templates generated.
In some variations of the Step 1 fragmentation protocol, the DNA is digested by one or more restriction enzymes that are sensitive to DNA methylation, to prepare DNA templates enriched in templates derived from the gene rich regions of a genome. Enrichment for unique gene rich templates occurs because, in many organisms, repeat sequences are differentially methylated and therefore no template will be generated from these sites. In addition, by digesting DNA with a pair of restriction enzymes that recognize and digest the same DNA sequence (isoschizomers), where one of the pair of enzymes will cut methylated DNA and the second enzyme of the pair will not cut methylated DNA, DNA templates are provided for assaying differences in genome DNA methylation. More specifically, the DNA from two or more individuals is digested separately using each of the enzymes that recognizes and digests the same DNA sequence. One of the enzymes is sensitive to DNA methylation (i.e., will not cut these sites), and the other enzyme is not sensitive to DNA methylation (i.e., will cut all recognition sites, regardless of whether they are methylated or not). The individual DNAs are analyzed and tracked using different indexing sequences that can be used for identification. Differences in the complement of sequences derived from the two restriction enzymes correspond to differences in sites of methylation in the genome.
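The isoschizomer comparison described above amounts to a set difference: templates recovered with the methylation-insensitive enzyme but missing from the methylation-sensitive digest point to sites that were protected by methylation. The following is a minimal sketch under that reading; the function name and the toy template sequences are hypothetical, and real data would need tolerance for sequencing depth and errors.

```python
from typing import Set


def methylation_candidates(insensitive_templates: Set[str],
                           sensitive_templates: Set[str]) -> Set[str]:
    """Templates seen only with the methylation-insensitive enzyme.

    Assumption: both digests were sequenced deeply enough that absence of a
    template from the sensitive digest reflects a blocked (methylated) site.
    """
    return insensitive_templates - sensitive_templates


# Toy template sequences for illustration only.
insensitive = {"ACGTACGTAC", "TTGACCATGA", "GGCATTACGA"}
sensitive = {"ACGTACGTAC", "GGCATTACGA"}
print(methylation_candidates(insensitive, sensitive))
```

Running the same comparison separately for each indexed individual would then show which candidate methylated sites differ between individuals.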
A subset of the polymorphisms that distinguish accessions or parental genotypes reside in the recognition sites of the restriction enzymes used for template preparation. Polymorphisms in a genotype that change the recognition sequence of the restriction enzyme used for template generation will prevent DNA digestion and as a consequence, no templates will be generated adjacent to that site. DNA polymorphisms of this type are identified by the presence of the two template sequences derived adjacent to a digestion site in one genotype, while the same two template sequences are absent in genotypes that contain polymorphisms in that specific restriction enzyme site.
In some embodiments, in order to reduce sequencing background "noise" that might arise due to the undesirable inclusion of mitochondrial or bacterial DNA in the individual's genomic DNA sample, the selected restriction enzymes will have the characteristic of cleaving organelle or bacterial DNA infrequently or not at all. This feature is potentially helpful during the initial phase of discovering the unique polymorphic templates useful for genotyping applications. However, for genotyping assays involving enrichment of informative templates by hybridization (in Stage IIIa) or PCR (in Stage IIIb of FIG. 1), the presence of organelle or non-target genome sequences is not a significant impediment and collection of this information may in some cases be informative.
Stage II. Adapter ligation.
Following digestion of the DNA sample by one or more restriction enzymes, at least one adapter sequence is ligated to the fragments. A large excess of adapters is used relative to the number of restriction fragments (RFs) in the mixture, to ensure that most of the RFs will receive an adapter, as illustrated schematically in FIG. 1. In some embodiments, the RFs obtained in Stage I are of the desired sequence length(s) appropriate for serving as templates in the selected genotyping by sequencing system. Accordingly, a first adapter ("adapter A") is ligated to the fragments. In most cases, a second adapter ("adapter B") is ligated to the opposite end of each fragment (PCR-based enrichment does not require this).
Adapters A and B are different, unique double stranded oligonucleotide sequences about 17 bp or longer in length. There is a practical upper limit on the sequence length of about 60-90 bp, based on present costs; however, even longer lengths may be used in some instances. The adapters need to be long enough to encode a sequence complementary to the primer used to initiate sequencing, the indexing sequence, and in some cases, sequences allowing binding to complementary termini generated by digestion of the target genome. The relationship of the adapters, DNA fragment, index sequences and restriction termini, in some embodiments, is illustrated in FIG. 2. One or both adapters may contain an indexing sequence of 2-6 base pairs. The indexing sequence may be located in the adapter immediately adjacent to the nucleotide sequence (priming site) that will be used as the binding site for the sequencing primer that is used to initiate sequencing (in Stage V). The indexing sequence serves as an identification tag that allows the informative DNA templates to be sorted according to their respective sources following sequence analysis (Stage VI). In an embodiment illustrated in boxes 2 and 3 of FIG. 2, a representative adapter A sequence is underlined (SEQ ID NOs: 1-2), and a representative adapter B sequence is italicized and underlined (SEQ ID NOs: 3-4). In box 2, adapter A is ligated to the restriction site end of the DNA fragment. The "indexing sequence" comprised in adapter A is denoted "XXX". The "restriction site overhang" comprised in adapter A is denoted "YYY." The remaining portion of adapter A corresponds to a DNA sequencing primer ("primer A"), which is described below. As shown in boxes 2 and 3, in some embodiments, adapter B is ligated to the blunt end of the DNA fragment, after shearing and end repair. A primer sequence ("primer B") (e.g., ILLUMINA® primer A), ligated to the 5' end of adapter A, is an italicized sequence in boxes 3 and 4 of FIG. 2 (SEQ ID NOs: 5-6). This primer B sequence may be added to the adapter-modified template through the process of PCR. The sequence of primer A (e.g., ILLUMINA® primer B) is complementary to a "primer binding site" sequence within adapter A, exclusive of the indexing sequence and restriction overhang. Primers A and B are useful for bridge amplification during cluster generation in certain sequencing systems, such as the ILLUMINA® SGAII™ sequencer. In some embodiments, the DNA sequencing primer, as shown in box 4 of FIG. 2 by dashed lines, is complementary to the primer binding sequence in adapter A. Sets of uniquely indexed adapters A and/or adapters B, and the primer sequences (e.g., primer A and primer B) used to initiate sequencing are typically provided by the manufacturer of the selected sequencing platform for sequencing the informative DNA templates obtained in Stage IV. In some embodiments, the adapter sequences may also comprise specific sequences added to the ends to facilitate a primary amplification step and selection on magnetic beads, for example.
It should be understood that any other suitable adapter and primer sequences capable of functioning in a similar manner may be used instead of the examples that are shown in FIG. 2.
By way of either or both adapter A and adapter B, an indexing sequence may be included at the end(s) of every DNA fragment derived from the individual's genome. Including a unique indexing sequence in each adapter-ligated set of DNA templates allows the user to pool the resulting index-modified adapter-ligated DNA templates derived from a large number of different individuals following ligation of an index-modified adapter, as discussed in more detail below in the section titled "Multiplexing DNA Templates for Sequencing." These samples can then be processed in bulk, sequenced, and the resulting sequences assigned to the individuals from which they were derived.
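Because the indexing sequence sits immediately adjacent to the priming site, it would be expected to appear at the start of each read, so pooled reads can be sorted by their leading bases. The sketch below illustrates that sorting step under this assumption; the function name, the sample labels and the 3-base index sequences are hypothetical. An index of n bases allows at most 4^n distinct tags, i.e., 16 to 4,096 for the 2-6 base indexes mentioned above.

```python
from collections import defaultdict
from typing import Dict, List


def demultiplex(reads: List[str], index_to_sample: Dict[str, str],
                index_len: int) -> Dict[str, List[str]]:
    """Group reads by the sample whose index sequence they start with.

    Assumption: the index occupies the first index_len bases of every read.
    Reads with an unrecognized index are discarded.
    """
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        sample = index_to_sample.get(index)
        if sample is not None:
            by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical index assignments and pooled reads.
index_map = {"ACG": "plant_1", "TGA": "plant_2"}
pooled = ["ACGTTTGCA", "TGAGGCCTA", "ACGAAACCC", "GGGTTTTTT"]
print(demultiplex(pooled, index_map, 3))
```

In practice, index sets are usually chosen so that a single sequencing error cannot turn one index into another; that design choice is outside the scope of this sketch.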
First and Second Restriction Enzyme Treatments of Genomic DNA. In some cases, the fragmenting of genomic DNA is accomplished by digesting the DNA molecules with a first restriction enzyme treatment, followed by ligation of a first adapter ("adapter A") to the ends of the resulting fragments. In some embodiments, these fragments are further digested by a second restriction enzyme treatment, to generate a pool of adapter-linked DNA templates with a predetermined range of shorter sequence lengths (as illustrated schematically in Steps 1-4 of FIG. 3). As mentioned above, in some embodiments, adapter A is a nucleotide sequence that includes an indexing sequence and an adjacent nucleotide sequence (i.e., primer binding site) that can be used in a Stage IIIb PCR amplification step (shown in FIGs. 1 and 4) and/or to initiate sequencing in Stage V (Step 9, as shown in FIG. 4). Adapter B is a different nucleotide sequence, with or without the indexing sequence, that includes a primer binding sequence that can be used to amplify templates containing adapters A and B (optional PCR amplification step indicated by dashed box in FIG. 3), a step that, due to suppression PCR, enriches templates containing both adapter A and adapter B relative to templates containing only one type of adapter. The sequences of adapters A and B are unique and non-complementary to each other. However, in some instances (when used in conjunction with the ILLUMINA® SGAII™, for instance), the adapter sequences may contain sequences that are complementary to oligonucleotides that are covalently attached to solid surfaces, to facilitate bridge amplification.
Restriction Enzyme and Shear Treatment of Genomic DNA. Referring to FIG. 3, in some embodiments, the genome is fragmented to a predetermined sequence length by a combination of digestion of DNA with restriction enzymes, and subsequent shearing of the restriction fragments. The shearing step may be carried out by sonication or high pressure hydrodynamic shearing. As illustrated by Steps 1-4, fragmenting of the genomic DNA is accomplished by digesting the DNA molecules with a first restriction enzyme treatment, followed by ligation of a first adapter ("adapter A") to the ends of the resulting fragments, and then shearing the resulting restriction fragments to obtain a mixture or pool of informative and non-informative DNA templates. After end repair of the sheared fragments using standard procedures, adapter B is ligated to the blunt ends of the sheared fragments, opposite adapter A.
Digestion of genomic DNA with one restriction enzyme. In some embodiments, the genomic DNA is fragmented with a single restriction enzyme that generates template of an optimal size for a high throughput sequencer. For example, CspCI recognizes a specific 7bp sequence and digests DNA flanking that sequence at a distance such that DNA fragments of approximately 32bp are generated. In this instance, adapters A and B would be ligated to the DNA fragments at the same time, to generate DNA fragments containing adapter A or adapter B ligated to both termini of a given fragment, and DNA fragments that have adapter A ligated at one end and adapter B ligated to the other end of the fragment. PCR amplification of this mixture of adapter ligated fragments with primers complementary to sequences in adapter A and adapter B will preferentially amplify fragments containing both adapters due to suppression PCR. The optional step of limited amplification by suppression PCR is shown in the flow diagram of FIG. 3.
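When adapters A and B are ligated at the same time, each fragment can end up with one of four end combinations. The sketch below simply enumerates them. It assumes, purely for illustration, that each end receives adapter A or adapter B independently and with equal probability, which the specification does not state; the function name is hypothetical.

```python
from itertools import product
from typing import Tuple


def mixed_adapter_fraction(adapters: Tuple[str, ...] = ("A", "B")) -> float:
    """Fraction of end combinations that carry two different adapters.

    Assumption: both fragment ends are ligated independently and every
    adapter is equally likely at each end.
    """
    pairs = list(product(adapters, repeat=2))      # (A,A), (A,B), (B,A), (B,B)
    mixed = [pair for pair in pairs if pair[0] != pair[1]]
    return len(mixed) / len(pairs)


print(mixed_adapter_fraction())
```

Under that assumption about half of the ligated fragments would carry both adapters before amplification, which gives a sense of why the suppression PCR step is useful for enriching the A/B fragments over the A/A and B/B fragments.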
Shearing Genomic DNA. In some embodiments, the genomic DNA is fragmented to a predetermined sequence length by shearing instead of, or in addition to, digestion with restriction enzyme(s), as shown in Step 1 of FIG. 3. In many cases, shearing treatments may be adjusted to generate DNA fragment lengths within a predetermined range useful for downstream analysis on different DNA sequencers. Adapters A and B are ligated to the resulting DNA fragments (Step 2' of FIG. 3). Fragments containing adapter A ligated to one end of the fragment and adapter B ligated to the other end of the same fragment can be enriched relative to fragments having the same adapter ligated to both ends of the fragment, by suppression PCR amplification of the DNA templates using primers complementary to adapters A and B (as indicated in FIG. 3). In some embodiments, size selection (i.e., enrichment of adapter-modified DNA fragments of specified sequence length) is performed before or after the limited amplification step. In some instances, shearing of the genomic DNA is less desirable than restriction enzyme digestion because (1) shearing produces templates with a distribution of different termini across the informative regions of the genome targeted for sequence analysis, (2) templates with high complexity are generated, (3) it is more labor intensive and less reproducible to shear and process a large number of different DNA samples compared to restriction digestion, which can be done in 96-well or 384-well format, and (4) sheared DNA requires additional steps in template preparation compared to simple ligation of adapters to termini generated by restriction enzymes. Nevertheless, in some cases, a user may wish to employ shearing alone or in combination with a restriction enzyme digestion step, as outlined in FIG. 3, for ease of use or for specific applications.
In some embodiments, the fragmentation protocol includes an initial shearing step followed by further reduction of fragment lengths by digestion with one or more restriction enzymes, as shown in Steps 1-4 of FIG. 3.
Biotin/Streptavidin Purification of Adapter-linked Target DNAs. In some applications, adapter A is biotinylated. In other applications, DNA templates are amplified using biotinylated primers that are complementary to sequences in adapters. The biotinylated adapters are ligated to termini generated by a restriction enzyme, allowing purification of DNA templates on streptavidin beads. Following release of the single strand template DNA (sstDNA) from the streptavidin beads, the sstDNA may be further processed as described below. Purification using biotinylated adapters potentially eliminates or reduces the content of DNA fragments that are not linked to adapters.
Selecting DNA Templates of Specified Sequence Length. Enrichment of templates of a given size enhances the efficiency of the procedure with respect to the selected high throughput DNA sequencing system to be used for sequencing the final templates. In some embodiments, template preparation also includes enriching the mixture of adapter-linked DNA templates for a specific range of sequence lengths. This may be done, for example, by electrophoresis of the adapter-linked DNA template on an agarose gel followed by extraction of the DNA templates of a predetermined size range, using standard techniques. Selecting only the DNA templates of a predetermined range of sequence lengths potentially increases the efficiency of the overall method by reducing or eliminating DNA templates that are larger or smaller than a predetermined sequence length (e.g., longer than the sequence read length of a selected nucleotide sequencing system; e.g., 30-50 bp). Size selection may be carried out after adapter ligation, or after PCR of adapter-ligated templates as shown in FIG. 3, or after enrichment of informative templates, depending on the application.
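The effect of size selection can be modeled in software as a simple length filter, which may be useful when predicting which templates from a digest would survive gel extraction. The sketch below is an in silico illustration only; the function name is hypothetical, and the 50-250 bp window is borrowed from the bridge-amplification range mentioned earlier for the SGAII™ rather than being a prescribed setting.

```python
from typing import List


def select_by_size(templates: List[str], min_len: int, max_len: int) -> List[str]:
    """Keep only templates whose length falls within [min_len, max_len], inclusive."""
    return [t for t in templates if min_len <= len(t) <= max_len]


# Toy templates of 20, 75, 150 and 400 bases for illustration only.
toy_templates = ["A" * 20, "C" * 75, "G" * 150, "T" * 400]
kept = select_by_size(toy_templates, 50, 250)
print([len(t) for t in kept])
```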
Enriching for DNA Templates containing Adapter A and Adapter B. In some instances, an enrichment step may be utilized following ligation of adapters (Step II) and hybridization-based template selection (Step IIIa). Referring to FIG. 3, in some applications, the adapter-linked DNA templates are amplified prior to performing Step 5, as indicated by the dashed box describing limited amplification and enrichment that involves suppression PCR. The adapter A- and adapter B-linked DNA fragments from Steps 2' or 2-4 of FIG. 3 may be subjected to PCR amplification using primers that are complementary to the non-indexing portion of adapters A and B. Due to suppression PCR, this step increases the abundance of templates containing adapters A and B relative to templates containing only adapter A or adapter B. Optional PCR amplification steps are indicated in FIG. 1 by dashed boxes labeled "PCR."
Reduction of Contaminating DNA Sequences. In some applications, undesirable secondary sources of DNA such as chloroplast, mitochondrial and bacterial DNA may be present at relatively high copy number in a sample of nuclear DNA targeted for analysis. In such cases, it may be desirable to enhance the efficiency of template preparation and sequencing of the individual's DNA. Preliminary measures include purifying the target individual's nuclei to reduce organelle DNA content, utilizing tissues with relatively low organelle DNA copy number, purifying bacterial artificial chromosome (BAC) DNA in situations where BACs are being profiled, and selecting restriction enzymes that cleave the organelle or bacterial DNA infrequently or not at all (as described above). In addition to these measures, background "noise" arising from multiple copies of extraneous DNA sequences may be selectively reduced after fragmentation. The representation of those sequences during preparation of the individual's informative DNA templates may be reduced by including certain non-amplifiable primers in a PCR template preparation step (see FIG. 1) to suppress amplification of these amplicons. These non-amplifiable primers may extend into the adapter sequence and are complementary to the undesirable repeat sequences located adjacent to the restriction digestion site.
Stage III. Informative DNA Template Selection.
Prior to selecting informative DNA templates, a set of oligonucleotides is developed corresponding to the set of informative DNA templates in the target genome under investigation in a particular application. The complementary oligonucleotides used for enrichment may correspond to any unique genomic sequence in an informative DNA template. Therefore, in most instances, the adapter sequences and repetitive sequences are avoided, and, in some cases, polymorphic sequences per se are also avoided, thereby minimizing variable hybridization due to sequence mismatches. The oligonucleotides are attached to the desired solid matrix (e.g., magnetic beads, planar or curved surfaces, tubes). The chemistry and method of preparing the oligonucleotide-beads is carried out in accordance with the instructions of an oligonucleotide-bead manufacturer known to those in the art, or in accordance with other techniques for attaching oligonucleotides to solid substrates that are known and described in the literature. The resulting set of complementary oligonucleotides (or substrate-bound oligonucleotides) may then be assembled in any of a variety of combinations depending on the parental genotypes being assayed or the region of the genome being analyzed. Thus, in many cases, the technical burden and costs of primer design may be distributed across multiple experiments.
Stage IIIa: Hybridization-based Enrichment for Informative DNA Templates.
After the mixture of adapter-linked informative and non-informative DNA templates has been obtained (in Stage II), and, optionally, amplified and/or sized, as schematically shown in FIGs. 1 and 3, in some embodiments, the DNA templates are then enriched for informative DNA templates (i.e., the templates corresponding to a site that contains a polymorphism), as illustrated in Step IIIa of FIG. 1, and Step 5 of FIG. 3. In these embodiments, enrichment of the informative DNA templates includes hybridizing only the informative DNA templates to oligonucleotides having complementary sequences. The mixed DNA templates are denatured to form single-stranded DNAs, which are then hybridized to oligonucleotides that could range from about 17 to about 60 nucleotides in length. The oligonucleotides contain sequences complementary to genomic DNA sequences present in the informative DNA templates and are designed to enhance a uniform specificity of binding under a standard hybridization condition. In some embodiments of this protocol, the complementary oligonucleotides are attached to magnetic beads to facilitate the separation of hybridized informative DNA template from non-hybridized non-informative templates. In various embodiments, oligonucleotides may be attached to any other suitable solid surface or matrix. The oligonucleotide-substrate complexes must differ from the oligonucleotide-substrate complexes that are used for bridge amplification or are used in the DNA sequencing procedure. One reason for this requirement is that, if the oligonucleotide sequences used for template selection were the same as the sequences in adapters A or B (i.e., sequences used for bridge amplification, etc.), then they would bind to all templates containing adapter A and/or B, which would prevent template enrichment.
In certain embodiments, the hybridization-based selection procedure includes purifying the informative single-stranded DNA templates by hybridization to a collection of oligonucleotides that are covalently attached to a solid matrix or surface (e.g., magnetic beads). One way in which a representative collection of oligonucleotides is obtained is described in Examples 5-14, below. The mixture of adapter-linked DNA templates (from Stage II), optionally amplified and enriched for adapter-linked DNA templates of a specified size, is mixed with the oligonucleotide-beads under hybridization-promoting conditions. For some applications, the hybridization-based enrichment procedure is improved by using oligonucleotides that hybridize to their complements at similar melting temperatures (Tm). However, small differences in DNA template selection efficiency will usually not affect the results significantly, because the purpose of this step is to enrich a subset of the mixture or pool of informative templates, not to discriminate between polymorphic sequences. Moreover, a low level of off-target template selection will only introduce a low level of off-target sequencing rather than creating a source of error. In addition, it is the type of alleles at each site that is important (e.g., AA, AB or BB in a simple case), and redundant sequencing of DNA templates will be carried out. For genotyping studies, in certain embodiments, a polymorphic marker is assayed within a given interval of the genetic map; however, any marker from the given region will provide similar information, and given a sufficient density of markers, missing data will not affect results. As illustrated in Stage IIIa of FIG. 1, in some embodiments the polymorphic site on the informative DNA template will be outside of the oligonucleotide:DNA template annealed region of the hybridized complex.
In many embodiments this will be the preferred situation in order to avoid variation in selection of templates with perfect vs. non-perfect sequence complementarity. However, in some embodiments, the oligos used for selection will overlap a sequence containing a polymorphism. For example, this may be the case when approximately 35bp templates are generated with the restriction enzyme CspCI, as further described below.
The resulting hybridized complexes (e.g., each containing adapter A, adapter B, informative DNA template, oligonucleotide, and magnetic bead or other substrate) are isolated by centrifugation, magnetic bead capture, filtering, or other suitable technique. To remove the non-hybridized DNA templates, the hybridized complexes are washed with aqueous media containing buffer (pH 7-8) and salts, at a temperature and under conditions that will not disrupt the interaction between oligonucleotides and templates (Stage III of FIG. 1 and Steps 5-6 of FIG. 3). The exact hybridization and wash conditions and time (buffer, salts, temperature, etc.) will vary to some extent depending on the length and GC content of the oligonucleotides used for the selection step.
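The dependence of hybridization conditions on oligonucleotide length and GC content can be illustrated with a rough calculation. The following Python sketch uses a common textbook approximation for Tm, which is an assumption and not a formula given in this disclosure, together with a hypothetical helper that keeps only capture oligonucleotides whose estimated Tm falls within a chosen window. The function names, the tolerance, and the example sequences are all illustrative.

```python
def estimate_tm(oligo: str) -> float:
    """Rough Tm in deg C from length and G+C count.

    Textbook approximation for oligos longer than about 13 nt; it ignores
    salt, mismatches, and nearest-neighbor effects (assumption, not from
    the disclosure).
    """
    seq = oligo.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def select_similar_tm(oligos, target_tm, tolerance=3.0):
    """Keep oligos whose estimated Tm lies within `tolerance` of `target_tm`."""
    return [o for o in oligos if abs(estimate_tm(o) - target_tm) <= tolerance]


# Hypothetical 20-mers containing 8, 10, and 16 G/C bases respectively.
candidates = [
    "ATGCATGCATGCATGCATAT",
    "ATGCATGCATGCATGCATGC",
    "GCGCGCGCGCGCGCGCATAT",
]
selected = select_similar_tm(candidates, target_tm=50.0)
```

With these inputs the GC-rich third oligonucleotide is excluded, which mirrors the idea of assembling a capture set that behaves uniformly under one hybridization and wash condition.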
Enrichment of Informative DNA Templates Using Biotinylated Oligonucleotides and Avidin Beads. In some applications biotinylated oligonucleotides may be used to carry out selective enrichment of informative templates by hybridization (FIG. 1, Step IIIa). For some applications, the oligonucleotides complementary to informative templates are modified to contain biotin, allowing biotin/streptavidin bead-based capture to facilitate separating the hybridized informative DNA template from non-hybridized non-informative templates. More specifically, in some embodiments the oligonucleotide is modified so that oligonucleotide:informative template hybrids can be separated from non-hybridized template. For example, if the oligonucleotide is modified by incorporation of a terminal biotin, then oligonucleotide:template hybrids may be purified by binding to a biotin:streptavidin bead or similarly modified surface. In this procedure, single-stranded adapter A- and B-modified informative DNA templates are hybridized to complementary biotinylated oligonucleotides, followed by binding of the resulting complex to streptavidin beads, washing of beads to remove non-target template and other materials, and release of single-strand template DNA (sstDNA) from the streptavidin beads for downstream sequencing (FIG. 1, Step V).
Stage IIIb: Enrichment of Informative DNA Templates by Targeted PCR.
An alternative to the above-described hybridization-based oligonucleotide:solid substrate enrichment procedure or biotinylated oligo-avidin bead enrichment procedure is shown schematically in FIGs. 1 and 4. In some embodiments, the enrichment of informative DNA templates is accomplished by targeted PCR amplification using a primer complementary to adapter A, and a primer complementary to a unique genomic sequence within each informative template targeted for analysis, designed so that the resulting amplified template is of the desired sequence length (compatible or optimal for the selected sequencing system) and spans informative sequences (i.e., DNA polymorphisms). Embodiments of this alternative approach to selection of informative DNA templates are especially useful when the number of polymorphic sequences and informative templates targeted for analysis is less than about 2,000. The PCR-based enrichment route is shown along the lower portion of FIG. 1 (Stages I, IIb, IIIb, IVb, and V). Stage I, DNA fragment generation by digestion with restriction enzymes and/or shearing, is carried out in a similar manner for both PCR-based and hybridization-based enrichment pathways. For PCR-based enrichment, Stage IIb involves ligation of a single adapter (A or B depending on embodiment), after which templates from different individuals may be pooled. FIG. 1 shows the instance in which adapter A is ligated to one of the termini of a DNA fragment in Stage IIb. In some instances, adapter A may be ligated to both ends of a DNA fragment. For example, when DNA fragments are generated by digestion with one enzyme or when shearing is the only method used to create DNA fragments, ligation with a single adapter will result in DNA fragments with the same adapter ligated to both termini. PCR amplification techniques are well known in the art and have been described in the literature.
Briefly, PCR amplification includes selectively amplifying the single-stranded adapter A-modified informative templates using primers that hybridize to adapter A and a set of second, informative template-specific primers, each of which is complementary to a unique genomic DNA sequence flanking adapter A in an adapter A-modified informative DNA template. The second primers that are unique to each informative template targeted for analysis are configured so that the amplified DNA contains a DNA sequence polymorphism. In this way, only the informative DNA templates are amplified and the non-informative templates are not, resulting in an enrichment of informative DNA templates. In some instances, primers that bind to adapter A and 10 or more different primers complementary to different informative templates are pooled (multiplexed) to streamline template preparation. In this way about 1,000-2,000 informative templates can be selectively amplified using a 96-well plate format for PCR, using pools of 10 or 20 different primers specific to informative templates per well. In some embodiments, the PCR amplification primers used to target informative templates in Stage IIIb are designed to include two primary features: (1) unique "targeting" sequences, usually 17-30 bp in length, present at the 3'-end of each primer and complementary to the respective informative templates targeted for analysis, and (2) a "universal" amplification sequence, usually 17-30 bp in length, that is not present in the target genome (as shown in FIG. 4). After several rounds of PCR amplification of informative templates with primers having this design, all templates may then be amplified with a primer complementary to adapter A and a primer complementary to the "universal" sequence present in the mixture of informative templates. As a result of this primer design, the amplified templates are of a predetermined sequence length and contain informative sequences.
In some embodiments, the "universal" sequence (e.g., primer B in FIG. 4) is selected to be compatible with template preparation for high throughput sequencing on the ILLUMINA® SGAII™, ABI SOLID™, or the Roche 454. In some instances, one of the primers used to amplify the informative templates contains a biotin that allows purification of template prior to sequencing (Stage IVb).
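The two-part primer design and the multiplexed plate layout described for the targeted PCR route can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example sequences are hypothetical, the "usually 17-30 bp" range is enforced as a hard limit purely for simplicity, and no check is made that the universal sequence is absent from a real genome.

```python
def build_targeted_primer(universal: str, targeting: str) -> str:
    """Return 5'-universal + targeting-3', so the targeting part is at the 3' end."""
    for name, part in (("universal", universal), ("targeting", targeting)):
        if not 17 <= len(part) <= 30:
            raise ValueError(f"{name} sequence should be 17-30 nt, got {len(part)}")
    return universal + targeting


def pool_primers(primers, wells=96, per_well=10):
    """Split primers into per-well pools of at most `per_well` primers each."""
    if len(primers) > wells * per_well:
        raise ValueError("more primers than the plate layout can hold")
    return [primers[i:i + per_well] for i in range(0, len(primers), per_well)]


# Hypothetical sequences, for illustration only.
universal = "GATTACAGATTACAGATT"   # 18 nt
targeting = "ATGCATGCATGCATGCA"    # 17 nt
primer = build_targeted_primer(universal, targeting)

# 960 placeholder primers at 10 per well fill one 96-well plate;
# 20 per well would accommodate 1,920.
pools = pool_primers([f"primer_{i}" for i in range(960)], wells=96, per_well=10)
```

The capacity check reflects the figures in the text: 96 wells with pools of 10 or 20 template-specific primers cover roughly 1,000 to 2,000 informative templates.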
Size Selection. As illustrated in FIGs. 1 and 3, and described above following Stage II, size selection may be performed either before or after the first PCR step, or, in some cases, later in the procedure. Size selection is desirable in many applications for increasing overall efficiency of a template generation and sequencing process. It is optional in some instances, however, such as when Stage IIIb PCR-based enrichment is employed. In Stage IIIa, the DNA fragments have adapters A and B ligated to the fragments. In Stage IIIb (PCR-based enrichment route), however, only one adapter is ligated in Stage II (either A or B, depending on the application). In the PCR-enrichment route, the second priming site is incorporated during targeted amplification, as shown. The primers are indicated by arrows in Stage IIIb, and the adapter B linked to the second primer is designated by an open box. The PCR enrichment route (Stage IIIb) is also illustrated in FIG. 4. The asterisk (*) denotes the site of a DNA polymorphism in the genome (Stage I) and DNA templates (Stages II-IV) derived from the genome.
Stage IV. Recovery of Informative DNA Templates.
Following hybridization and separation from non-informative templates, the adapter-linked informative single-stranded DNA templates are separated from the bead-bound complementary oligonucleotides by heat denaturation or treatment with alkali to release the selected templates (Stage IV of FIG. 1 and Step 7 of FIG. 3). The enriched set of DNA templates is then ready for an optional further amplification, and/or direct sequencing and analysis. The resulting single-stranded DNA templates are then sequenced (Step 9 of FIG. 4).
Enrichment for Informative DNA Templates. In some applications, the informative DNA templates recovered in Stage IVa (FIG. 1 and Step 7 of FIG. 3) are further enriched by PCR or any other suitable amplification technique. A PCR amplification procedure includes denaturing the hybridized DNA templates (as in Step 7) and PCR amplification of the resulting mixture of informative DNA templates using a primer that is complementary to adapter A and a second primer that is complementary to adapter B.
One suitable PCR amplification technique is the well-known solid-phase bridge amplification used to create DNA clusters for DNA sequencing on the ILLUMINA® SGAII™, in which adapter-linked single-stranded DNA templates attach to a surface containing a multiplicity of regularly spaced single-stranded primers having sequences that are complementary to the primer sequences contained in adapters A and B. The polymerase enzyme incorporates nucleotides to build double-stranded "bridges" between the spaced-apart primers on the solid surface. After amplification, the resulting double-stranded DNA sequences may be represented as shown in box 3 of FIG. 2.
Multiplexing DNA Templates for Sequencing. Due to the inclusion of a unique indexing sequence ("index") in one or both of adapter A and adapter B, the mixed DNA templates from Stage II, or the informative DNA templates from Stage IV of one individual may be pooled with similarly prepared, but differently indexed, adapter-linked DNA templates derived from other individuals. The pool of differently indexed adapter-linked informative DNA templates may then be further processed and sequenced together. An indexing sequence (XXX) is shown in box 2 of FIG. 2, flanking a polymorphism-containing DNA fragment having restriction termini denoted by "YYYyy," at the 5' end of the fragment.
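Once pooled templates have been sequenced together, the index is what allows each read to be traced back to its source individual. The Python sketch below shows one minimal way this assignment could work. It assumes, purely for illustration, that the index occupies the first bases of each read and that all indexes have the same length; the function name, index sequences, and individual labels are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, index_to_individual):
    """Group reads by their leading index and strip the index from assigned reads.

    Reads whose leading bases match no known index are kept intact under
    the key "unassigned".
    """
    lengths = {len(ix) for ix in index_to_individual}
    if len(lengths) != 1:
        raise ValueError("all indexes must have the same length")
    k = lengths.pop()

    bins = defaultdict(list)
    for read in reads:
        individual = index_to_individual.get(read[:k])
        if individual is None:
            bins["unassigned"].append(read)
        else:
            bins[individual].append(read[k:])
    return dict(bins)


# Hypothetical 3-base indexes and reads.
indexes = {"ACG": "individual_1", "TGA": "individual_2"}
reads = ["ACGTTTT", "TGACCCC", "GGGAAAA"]
by_individual = demultiplex(reads, indexes)
```

Exact matching is the simplest policy; a practical pipeline would likely also tolerate sequencing errors within the index, which this sketch does not attempt.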
Stage V. High Throughput Sequencing of Informative DNA Templates.
The recovered, and optionally amplified, adapter-linked informative DNA templates may then be directly sequenced using a high throughput DNA sequencing platform in which the optimal template sequence length for sequencing is in the range of about 50 bp to about 600 bp, in accordance with the instructions provided by the manufacturer of the chosen high throughput sequencing platform. This high throughput sequencing step may also be referred to as "resequencing" of the informative DNA templates in instances and in situations in which a previously sequenced genome is the target genome under investigation. In some instances, a reference sequence will have been previously obtained for the same or a similar individual, and the polymorphisms in the reference sequence will have been previously identified. For example, a reference sequence may have been previously obtained as part of the process of preparing an above-mentioned set of complementary nucleotide sequences for use in the Stage III hybridization-based template selection process.
In some embodiments, the informative DNA templates are sequenced using an ILLUMINA® SGAII™ sequencing system, in which the optimal template sequence length is about 50-250bp. Optimal template size in this instance is related to a template size that will uniformly be amplified in the sequencer by bridge amplification. In some embodiments, the informative DNA templates are sequenced using an ABI DNA SOLID™ System, in which the optimal template sequence length is about 50 bp. In some embodiments, the informative DNA templates are sequenced using a Roche 454 GS-FLX™ Genome Analysis System, in which the optimal template sequence length is at least 400 bp in cases where the read length of the sequencer is 400 bp.
In these and similar sequencing systems, a number of "sequencing by synthesis" reactions are used to elucidate the identity of a plurality of bases at target positions within the target sequence. Sequencing by synthesis techniques are well known in the art, and have been described in the literature. All of these reactions rely on the use of a target sequence comprising at least two domains, including a first domain (i.e., an adapter) to which a sequencing primer will hybridize, and an adjacent second domain, for which sequence information is desired (i.e., an informative DNA template of initially undetermined sequence). Upon formation of the assay complex, enzymes are used to add dNTPs to the primer, and each addition of dNTP is "read" to determine the identity of the added dNTP. This may proceed for many cycles.
Sequencing primers specific to the adapters ligated to termini created by the "targeting" restriction enzyme are used to generate sequences adjacent to these sites. RSL-sequencing is a flexible technology that will allow the investigator to vary the number of sites sequenced and depth of sequencing depending on application. Another feature of sequence-based genotyping from RSL-templates is that sequencing always starts from the same sites in the genome, immediately adjacent to the targeting restriction enzyme recognition sequence. As a result, the location of a specific sequence polymorphism is always a specified number of bases away from the start site for DNA sequencing in each template selected for analysis. For many applications, this is a potentially valuable feature because the accuracy of sequencing decreases in a predictable way as a function of position in the sequence. Conversely, sequence accuracy is highest close to the sequencing primer. Therefore, in some cases a quality score can be assigned to sequence-based genotypes based in part on this information.
Varying the number of sites sequenced per genome. Depending on the application, a user may want to sequence a large number of different DNA sequences or a small number of sequences in a target genome. Accordingly, the depth and redundancy of RSL-sequencing may be modulated in any of several ways, which are briefly described as follows: (1) by selecting restriction enzymes that cleave the target genome with different frequencies depending on the recognition site; (2) by digesting the target genome with two or more restriction enzymes that recognize different 8-, 6-, or 4-base sequences, for example, to increase the number of different DNA segments targeted for sequencing; (3) by amplifying and pooling DNA from two or more different genotypes, each containing a specific sequence identification tag (indexing), to track the origin of the DNA sequences; and (4) by using restriction enzymes that are sensitive to the methylation state of DNA within their recognition sequences. All of these variations are further described elsewhere herein.
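How sequencing redundancy trades off against the number of templates and the number of pooled individuals can be estimated with simple arithmetic. The Python sketch below is an illustration under loud assumptions: the function and parameter names are hypothetical, reads are assumed to be spread evenly across templates and individuals, and the example figures (70 million templates per run, 31,000 distinct templates, 96 pooled individuals) are chosen only to show the calculation.

```python
def mean_depth(templates_per_run: int,
               distinct_templates: int,
               individuals: int = 1,
               on_target_fraction: float = 1.0) -> float:
    """Average number of reads per distinct template per individual.

    Assumes an even distribution of reads, which real libraries only
    approximate.
    """
    if distinct_templates <= 0 or individuals <= 0:
        raise ValueError("distinct_templates and individuals must be positive")
    usable_reads = templates_per_run * on_target_fraction
    return usable_reads / (distinct_templates * individuals)


# Illustrative numbers only.
depth_pooled = mean_depth(70_000_000, 31_000, individuals=96)
depth_single = mean_depth(70_000_000, 31_000, individuals=1)
```

Under these assumptions, pooling 96 indexed individuals still leaves roughly 23 reads per template per individual, while choosing an enzyme that yields fewer templates, or pooling fewer individuals, raises the redundancy proportionally.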
Stage VI. Identification of Polymorphisms and Genotyping.
The flexible methods disclosed herein allow targeted analysis of subsets of RSL-templates generated by any specific set of restriction enzymes, by permitting variation in the number of informative DNA templates analyzed per genome and their distribution across the genome, or based on their utility in larger population studies. A global analysis of the sequences of all RSL-templates generated by a particular restriction enzyme or combination of restriction enzymes from the parental genotypes of interest results in data that, in many embodiments, will identify: (1) the subset of informative RSL-templates that can be successfully sequenced at a reasonable frequency (template utility is affected by sequence length, repetitiveness, presence of polymorphisms, and other factors); (2) the subset of RSL-templates that contain unique sequences in cases where a reference genome sequence is available; and (3) the subset of unique sequences that contain polymorphisms that distinguish parental genotypes. In applications in which the user is working with a sequenced genome that is cross-referenced to a genetic map, the unique sequences may be mapped using bioinformatics. In applications in which the target genome has not been previously sequenced, the polymorphic sequences will need to be mapped through normal segregation analysis in a population. Polymorphic RSL-sequences may be genetically mapped by analyzing mapping progeny followed by linkage analysis.
Once the sequences of the recovered informative DNA templates have been ascertained, they may then be compared to those derived from one or more different individuals or to a reference set of sequences, to identify specific DNA polymorphisms (alleles) and to generate genotyping/haplotyping information. Suitable software for analyzing the sequence data, and for aligning the sequences (BLAST, BLAT, etc.) to a reference in resequencing applications is available from well-known commercial sources.
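Because sequencing of RSL-templates starts at a fixed site, the polymorphic base sits at a known offset in every read of a given template, and redundant reads can be tallied to assign a genotype such as AA, AB, or BB. The Python sketch below illustrates that idea. It is a simplified assumption rather than the method of this disclosure: the function names, the minimum depth, and the minor-allele threshold are hypothetical, and base qualities are ignored.

```python
from collections import Counter


def bases_at_offset(reads, offset):
    """Collect the base at a fixed 0-based offset from each sufficiently long read."""
    return [r[offset] for r in reads if len(r) > offset]


def call_genotype(bases, allele_a, allele_b, min_depth=4, min_minor_fraction=0.2):
    """Return "AA", "AB", "BB", or "NA" (too few informative reads)."""
    counts = Counter(b for b in bases if b in (allele_a, allele_b))
    a, b = counts[allele_a], counts[allele_b]
    depth = a + b
    if depth < min_depth:
        return "NA"
    if min(a, b) / depth >= min_minor_fraction:
        return "AB"
    return "AA" if a > b else "BB"


# Hypothetical reads from one template; the polymorphism is at offset 3.
reads = ["ACGGTT", "ACGTTT", "ACGGTT", "ACGTTT", "ACG"]
genotype = call_genotype(bases_at_offset(reads, 3), allele_a="G", allele_b="T")
```

Here two reads support each allele and the short read is skipped, so the site is called heterozygous. The thresholds would need tuning to the error rate and depth of the chosen sequencing platform.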
For instance, an approximately 1,000 Mbp "genome" of random sequence is predicted to contain about 15,500 digestion sites for a restriction enzyme that recognizes a specific 8-base sequence. The collection of about 100 bp sequences flanking this set of restriction enzyme digestion sites constitutes a specific approximately 3.1 Mbp sub-sample of the target "genome." RSL-sequencing, as described above, allows the sequences flanking this set of restriction sites to be obtained from genomic DNA or a library of large insert bacterial artificial chromosome (BAC) clones prepared from that genome. The RSL-sequences derived from large insert clones may potentially be used to build physical maps, as the overlapping clones will contain common RSL-sequences. RSL-sequences mapped onto a BAC-based physical map spanning a genome may potentially be used to locate gene sequences and whole genome sequence assemblies on the genome map. In addition, RSL-sequences from different genotypes may be compared to identify DNA polymorphisms useful for the design of DNA marker assays and for diversity and haplotype analyses. Therefore, RSL-sequences obtained using any suitable high throughput sequencing platform will potentially integrate genome map building, genome sequencing, and diversity analysis.
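The site-count estimate above follows from the expected frequency of an n-base recognition sequence in random DNA. The Python sketch below reproduces the arithmetic; it assumes equal base frequencies and a palindromic recognition site, and the function names are hypothetical. With the exact value 4^8 = 65,536 the result is slightly below the rounded figures quoted in the text, which are consistent with an average spacing of about 64,000 bp.

```python
def expected_sites(genome_bp: int, recognition_len: int) -> float:
    """Expected number of recognition sites in random sequence.

    Assumes equal frequencies of A, C, G, and T, so a specific n-base
    sequence occurs on average once every 4**n bp.
    """
    return genome_bp / 4 ** recognition_len


def flanking_subsample_bp(n_sites: float, flank_bp: int = 100) -> float:
    """Total sequence sampled when `flank_bp` is read on both sides of each site."""
    return n_sites * 2 * flank_bp


sites = expected_sites(1_000_000_000, 8)      # 8-base cutter, 1,000 Mbp genome
subsample = flanking_subsample_bp(sites)      # about 100 bp on each side
avg_fragment_4bp = 4 ** 4                     # 4-base cutter: 256 bp average spacing
```

The same function shows why a 4-base cutter reduces fragments to an average of about 256 bp, as used in the two-enzyme examples.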
Definitions
The term "informative DNA template" refers to a DNA template that has three properties: (1) the template is compatible with high throughput sequencing, (2) the template contains one or more sequences that can be mapped to a unique location in a genome, and (3) the sequence or sequences derived from the template are polymorphic in the target species (i.e., parental lines used for genetic mapping, lines being fingerprinted, individuals being analyzed as part of a diversity or haplotyping study). Thus, an informative DNA template contains a sequence in a location suitable for high throughput DNA sequencing, and contains a unique polymorphic site in the species genome sequence. Although an informative DNA template must span a sequence that is polymorphic in the target species, it should be understood that it may or may not be polymorphic in the particular individual analyzed.
DNA templates generated using at least one restriction enzyme are referred to herein as "restriction site localized templates" or "RSL-templates". For ease of reference, the DNA templates generated by shearing only, by digestion with restriction enzyme(s) only, or by a combination of shearing and restriction enzyme(s) digestion, are sometimes referred to herein as "RSL-templates." Accordingly, the term "RSL-templates" should be interpreted to include DNA templates generated by either restriction enzyme(s) digestion or shearing, or by a combination of those. A "nuclear genome" is all the DNA or genetic material in the chromosomes of a eukaryotic organism. Eukaryotic organisms such as plants will also contain organellar genomes in their mitochondria and chloroplasts.
An "allele" is a distinct DNA sequence or "spelling" of a chromosomal region. Customarily, the first identified allelic form is arbitrarily designated as the reference form and other allelic forms are designated as alternative or variant alleles. The allelic form occurring most frequently in a selected population is sometimes referred to as the wild type form. Diploid organisms may be homozygous or heterozygous for allelic forms. A diallelic polymorphism has two forms. A triallelic polymorphism has three forms. Most organisms have multiple alleles of gene sequences in their naturally occurring populations (germplasm in the case of plant species).
"Polymorphism" refers to the occurrence of two or more alternative sequences or alleles in a species or population. A "polymorphic site" is the locus or specific sequence location in a genome at which sequence divergence occurs (i.e., the site of variation between allelic sequences). A polymorphism may comprise one or more base changes, a nucleotide insertion or deletion (INDEL), a nucleotide inversion, or variation in the size of a simple sequence repeat (SSR), relative to a reference allele. A single base pair polymorphism, termed a "single nucleotide polymorphism" (SNP), occurs at a polymorphic site occupied by a single nucleotide. A single nucleotide polymorphism usually arises due to substitution of one nucleotide for another at the polymorphic site. A transition is the replacement of one purine by another purine or one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine or vice versa.
A "genotype" is a collection of all the polymorphisms or alleles of an individual's genome.
A "haplotype" is a combination of alleles or polymorphisms at multiple loci that tend to be transmitted together. For example, neighboring polymorphisms that are inherited together on the same chromosome.
A "monomorphic template" does not contain DNA polymorphisms. An "amplicon" refers to a fragment of DNA that can be amplified using specific priming sites located at each terminus, sequences usually added by ligation of adapters.
An "oligonucleotide" is a relatively short nucleic acid sequence, such as DNA or RNA, and may be single- or double-stranded. Oligonucleotides are typically prepared by synthetic means, however they may also be isolated from naturally occurring sources. For the purposes of this disclosure, oligonucleotides are usually in the range of about 17 to about 30 base pairs (bp) in length, and in some instances are about 30-90 bp long, for example.
The term "hybridization" refers to the non-covalent interaction or binding of two complementary single-stranded nucleic acid strands (i.e., DNA and/or RNA) into a single double-stranded molecule. Two perfectly complementary strands will bind to each other readily (i.e., anneal or "hybridize") because the nucleotides of the complementary strands bind to their complements under normal hybridization conditions. Hybridizations are usually performed under stringent conditions that are dependent on sequence length, GC content, temperature, salt, and other characteristics of the hybridizing media. For example, conditions of 5X SSPE (750 mM NaCl, 50 mM Na2PO4, 5 mM EDTA, pH 7.4) and a temperature of 25-30°C are suitable for SNP-specific oligonucleotide probe hybridizations. For stringent hybridization conditions, see, for example, Sambrook, Fritsch and Maniatis, "Molecular Cloning: A Laboratory Manual," 2nd Ed., Cold Spring Harbor Press (1989).
A "hybridization array" is an array comprising a solid support or matrix with attached oligonucleotide probes. Arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different locations. Substrates may be beads, planar or curved surfaces, fibers such as fiber optics, glass or any other suitable material or structure.
"High throughput sequencing" refers to instances where 400,000 to 70 million templates or more are sequenced in parallel in a single run generating up to 2 billion bases of sequence or more per run, in an automated nucleotide sequencing system.
References herein to "an individual" may apply to human beings, other mammals, plants, bacteria, or any other organism, as the context allows in this disclosure. As used herein, the term "individual" refers to a single member of any species. In most cases, the genotypes and templates derived from each individual of a species or of a different species will be different, but occasionally (e.g., twins or clones) the genotypes and corresponding templates will be the same. For example, different individuals may have different genotypes (e.g., Genbank accession numbers, ecotypes, germplasm accession numbers, etc.), with the exception of clones or genetically identical twins.
EXAMPLES
Example 1. DNA template preparation using one restriction enzyme.
Digest genomic DNA of a first individual with CspCI, a type II restriction enzyme, to generate DNA restriction fragments (RFs) approximately 34-36 bp in size (i.e., sequence length), which is optimal for sequencing on the ILLUMINA® SGAII™ sequencer. CspCI recognizes and cuts the sequence:
5'...10-11(N)CAA(N)5GTGG(N)12-13...3' (SEQ ID NO: 39)
3'...12-13(N)GTT(N)5CACC(N)10-11...5' (SEQ ID NO: 40)
This enzyme typically digests a random genome about every 8,192 bp and thereby generates a large number of approximately 35 bp fragments.
Ligate adapters A and B to the resulting RFs, using standard ligation techniques. Adapter A is a double-stranded nucleotide sequence containing short terminal sequences that allow binding and ligation to DNA fragments generated by one or more restriction enzymes, a selected 2-6 bp nucleotide sequence (index), and a unique primer binding sequence useful for amplification or to initiate DNA sequencing ("primer A"); adapter B contains a unique primer binding sequence. The sets of uniquely indexed adapters A and adapters B can be prepared by individual users of the technology but are typically provided by the manufacturer of the selected sequencing platform to be used for sequencing the informative DNA templates obtained in Steps 1-5. If adapter A contains an indexing sequence, similarly prepared RFs from different individuals can be pooled to increase the efficiency of downstream processing.
Amplify the adapter-linked RFs by PCR using primers specific to sequences in adapter A and adapter B to enrich for DNA templates containing both adapters
Enrich for DNA templates with sequence lengths optimal for the DNA sequencer selected for analysis (i.e., in the range of about 50-200 bp for the ILLUMINA® SGAII™, or about 400-600 bp for the Roche 454 sequencer).
Enrich for adapter-linked informative DNA templates (i.e., those containing sequences corresponding to a polymorphism).
Sequence the resulting informative DNA templates with an ILLUMINA® SGAII™, ABI SOLID™, or Roche 454 sequencer.
Compare the sequences obtained to those of a set of reference sequences, to identify DNA sequence polymorphisms and genotypes.
Example 2. DNA template preparation using two restriction enzymes.
Digest genomic DNA with the restriction enzyme FseI to generate RFs ranging from less than 100 bp to more than 100,000 bp in length, with an average fragment size of 64,000 bp for a genome of random sequence. Ligate adapter A to the resulting RFs. If adapter A contains an indexing sequence, similarly prepared RFs from different individuals can be pooled to increase the efficiency of downstream processing.
Digest the adapter A-linked RFs with a second restriction enzyme that has a 4-base recognition sequence to reduce their average sequence lengths to about 256 bp. Ligate adapter B to the resulting adapter A-linked DNA templates, opposite adapter A.
Enrich for adapter-linked DNA templates in the range of about 50-400 bp for sequencing on the ILLUMINA® SGAII™, or about 400-600 bp for sequencing on the Roche 454 sequencer. PCR amplify DNA templates that contain both adapter A and adapter B.
Enrich for informative DNA templates.
Sequence the informative DNA templates.
Compare the sequences obtained from each individual to each other and/or to a reference set of sequences to identify DNA polymorphisms and genotypes.
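The average fragment sizes quoted in Examples 1 and 2 follow from the number of specified bases in each recognition sequence, under a random-sequence model with equal base frequencies. The following is a minimal sketch of that arithmetic only; the function name `expected_site_spacing` and its `palindromic` flag are illustrative assumptions and are not part of the disclosure.

```python
def expected_site_spacing(specified_bases: int, palindromic: bool = True) -> float:
    """Expected distance (bp) between recognition sites in a random genome.

    Assumes equal base frequencies (25% each) and independent positions.
    A non-palindromic site can be matched on either strand, which halves
    the expected spacing.
    """
    spacing = 4 ** specified_bases
    return spacing if palindromic else spacing / 2


# FseI (GGCCGGCC): 8 specified bases, palindromic
print(expected_site_spacing(8))         # 65536, close to the 64,000 bp quoted above
# A 4-base cutter (for instance MseI, TTAA, which is used in Example 4)
print(expected_site_spacing(4))         # 256
# CspCI (CAA(N)5GTGG): 7 specified bases, non-palindromic
print(expected_site_spacing(7, False))  # 8192.0
```

Real genomes deviate from this model (base composition, repeats, methylation), so these values are rough planning figures rather than predictions for a particular species.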
Example 3. DNA template preparation using restriction enzymes and shearing.
Digest genomic DNA with the restriction enzyme FseI to generate RFs ranging from less than 100 bp to more than 100,000 bp in length, with an average length of 64,000 bp for a genome of random sequence.
Ligate adapter A to the resulting RFs. If adapter A contains an indexing sequence, similarly prepared RFs from different individuals can be pooled to increase the efficiency of downstream processing.
Fragment the adapter A-linked RFs by shearing to generate smaller fragments in the range of about 50 bp to about 200 bp sequence length for sequencing on the ILLUMINA® SGAII™, or 400-600 bp for sequencing on the Roche 454 sequencer. Ligate adapter B to the DNA fragments.
PCR amplify templates containing both adapter A and adapter B using primers specific to sequences contained in adapter A and adapter B. Enrich DNA templates of a predetermined size. Enrich for informative DNA templates. Sequence DNA templates.
Compare sequences to a reference set to identify DNA sequence polymorphisms.
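The final comparison step can be illustrated, for the simplest case of substitutions between two aligned reads of equal length, with a short Python sketch. The function name `find_mismatches` and the two 27-base reads are hypothetical and are not taken from the disclosure; detecting InDels would require a proper alignment tool and is not attempted here.

```python
def find_mismatches(reference: str, query: str) -> list:
    """Return (position, reference_base, query_base) for each substitution.

    Positions are 0-based. Only equal-length, already-aligned sequences
    are handled; InDels would need a real aligner.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be the same length")
    pairs = zip(reference.upper(), query.upper())
    return [(i, r, q) for i, (r, q) in enumerate(pairs) if r != q]


# Hypothetical 27-base reads, each beginning with the FseI half-site CCGGCC.
ref_tag = "CCGGCCATGCATTGACCGTTAGCATCA"
query_tag = "CCGGCCATGCATTGACTGTTAGCATCA"
print(find_mismatches(ref_tag, query_tag))  # [(16, 'C', 'T')]
```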
Example 4. Sorghum Polymorphism Discovery
The technical approaches described above for targeted sequencing and polymorphism discovery adjacent to restriction sites in large genomes were validated on the sorghum genome (800 Mbp). The number of sites across the sorghum genome analyzed for SNP discovery was varied in several ways, as follows: (1) using restriction enzymes with 8- or 6-base recognition sites (FseI and KasI, respectively, were tested), (2) using a restriction enzyme sensitive to DNA methylation (FseI) and a restriction enzyme that is not sensitive to DNA methylation (SphI), and (3) using a restriction enzyme with a 4 bp recognition site or shearing to generate the second end of each amplicon. From this study it was concluded that the use of methylation-sensitive restriction enzymes significantly reduced the number of repeat sequences obtained, increasing data yield per sequence. It was also found that the use of one targeting enzyme plus DNA shearing (vs. two restriction enzymes) generated a larger number of more uniformly sized templates, increasing data yield per sequence. FseI/MseI RSL-templates were prepared from BTx623 and IS3620C, and approximately 250,000 RSL-sequences were obtained from each genotype. Among the approximately 11,000 different sequences acquired in the experiment (excluding error-containing sequences), approximately 5,000 templates containing unique sequences (found only once per genome) were sequenced 5X or more from each genotype. Comparison of the sequences from the two genotypes identified 200-400 SNPs/InDels within 27 bp of the FseI restriction site.
In a follow-up experiment, RSL-templates were prepared by digestion of BTx623 and IS3620C genomic DNA with FseI followed by ligation of adapter A. The resulting RFs were sheared and adapter B was ligated to create DNA templates. After PCR using primers complementary to sequences in adapter A and adapter B, and enrichment of templates of an optimal size, the templates were sequenced using priming sites in adapter A on an ILLUMINA® SGAII™ sequencer. In this experiment, approximately 13,000 different unique 27 bp sequences were obtained from both genotypes, revealing approximately 1,500 polymorphic sequences. DNA templates generated using KasI/shearing allowed sequence analysis of approximately 50,000 different unique RSL-tags at 5X or greater depth through acquisition of about 3,000,000 sequences. It is estimated that this collection of sequences from IS3620C and BTx623 will reveal more than 5,000 SNPs/InDels adjacent to FseI sites when the data is fully analyzed.
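The "5X or greater depth" criterion used above amounts to counting how often each distinct read sequence occurs and keeping those seen at least five times. A minimal sketch, assuming reads are held as plain strings; the function name `tags_at_depth` and the toy reads are hypothetical, and the sketch does not check whether a tag is unique in the genome.

```python
from collections import Counter


def tags_at_depth(reads, min_depth=5):
    """Map each distinct read sequence seen at least `min_depth` times to its count."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_depth}


# Hypothetical toy data: three distinct tags at 6X, 5X and 2X.
reads = ["ACGTAC"] * 6 + ["TTGACA"] * 5 + ["GGATCC"] * 2
print(tags_at_depth(reads))  # {'ACGTAC': 6, 'TTGACA': 5}
```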
When the CspCI restriction enzyme is used to digest a random 1,000 Mbp genome sequence, it is predicted that a genome of this size will contain about 62,500 sites for CspCI and generate 125,000 DNA templates. The CspCI enzyme is not methylation sensitive, so nearly all sites would be available for digestion. The resulting small (approximately 34-38 bp) DNA fragments may be purified by size selection on agarose gels, blunt ended, and ligated to adapters A and B (one of which contains the sequencing primer binding site plus an indexing sequence). PCR amplification will differentially amplify RSL-tags flanked by two different adapters (due to suppression PCR), and these may be loaded directly onto the sequencer or further purified prior to sequencing as necessary. Because primers are ligated to the termini at random, both strands of each DNA template will be sequenced, thereby eliminating or reducing the extent of the increased sequencing error rates that tend to occur towards the 3'-end of each ILLUMINA® sequencing run. If CspCI digestion cuts a random sequence 1,000 Mbp genome at 62,500 sites creating 125,000 RSL-templates, and 50% of the 33 bp sequences derived from the resulting RSL-templates are unique, and if there is one polymorphism per 1,000 bp in a comparison of two genotypes, then analysis of this set of RSL-templates by sequencing would reveal approximately 2,062 polymorphic sequences. The subset of RSL-tags corresponding to unique sequences spanning polymorphic sites may then be mapped and used for genotyping, DNA fingerprinting, or haplotype analysis.
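The estimate of approximately 2,062 polymorphic sequences can be reproduced from the figures stated in the paragraph above. The sketch below only restates that arithmetic; the function and parameter names are illustrative assumptions, and the value of two templates per site is inferred from the stated 62,500 sites and 125,000 templates.

```python
def predicted_polymorphic_tags(sites, templates_per_site, unique_fraction,
                               read_length_bp, bp_per_polymorphism):
    """Return (templates, expected number of polymorphic sequences)."""
    templates = sites * templates_per_site
    # Total bases surveyed in unique sequence across all templates.
    unique_bases = templates * unique_fraction * read_length_bp
    return templates, unique_bases / bp_per_polymorphism


templates, polymorphic = predicted_polymorphic_tags(
    sites=62_500,              # CspCI sites in a random 1,000 Mbp genome (from the text)
    templates_per_site=2,      # inferred: 125,000 templates / 62,500 sites
    unique_fraction=0.5,       # 50% of the 33 bp sequences are unique
    read_length_bp=33,
    bp_per_polymorphism=1_000, # one polymorphism per 1,000 bp
)
print(templates, polymorphic)  # 125000 2062.5
```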
Example 5. Identification of RSL polymorphism and design of primers for amplification of the polymorphic region in sorghum.
Specific 30 bp RSL sequences from different genotypes are compared to identify single nucleotide polymorphisms (SNPs) or insertion/deletions (InDels). In this example, illustrated in FIG. 6, a 4 bp InDel polymorphism between BTx623 and IS3620c is identified. A reverse primer is designed downstream of the polymorphism. This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing. A) denotes the sequence identifier: (Coordinates on Sorghum pseudomolecule); (Genotype_Sequence ID_Number of sequences in the sequence contig). B) is the alignment and comparison of the BTx623 and IS3620c sequence contigs. The InDel is bolded and underlined. The first three bases in a sequence correspond to the Index Sequence for the particular genotype (BTx623=ATC; IS3620c=GTC). The next six bases (CCGGCC) correspond to the FseI half-site. The Index Sequences are contained within the RSL adapter ligated to the FseI site. In C), the results of a BLAST similarity search are shown, comparing the BTx623 sequence with the assembled Phytozome sorghum pseudomolecule database (available on the world wide web at phytozome.net). A 227 bp region, containing the 27 bp ILLUMINA® sequence and 200 bp downstream, is identified and downloaded. Using PrimerQuest software (available on the world wide web at idtdna.com), an optimal reverse PCR primer is designed within the 227 bp sequence that will produce a PCR product of approximately 75-200 bp when used with a forward primer specific for the FseI adapter. In D), the adapter-modified DNA fragment and representations of the forward and reverse FseI adapter-specific primers are shown. In C) and D), the FseI half-site is underlined. The SNP or InDel is bolded and underlined, and the reverse oligonucleotide primer-binding site is italicized and underlined.
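The read layout described in this example (a three-base index, then the six-base FseI half-site, then genomic sequence) lends itself to a simple parsing step that assigns each read to its genotype. The following is a minimal sketch under that layout; the function name `parse_rsl_read` and the example reads are hypothetical, and exact matching is assumed, with no tolerance for sequencing errors in the index or half-site.

```python
INDEX_TO_GENOTYPE = {"ATC": "BTx623", "GTC": "IS3620c"}  # index sequences stated in Example 5
FSEI_HALF_SITE = "CCGGCC"                                # half-site stated in Example 5


def parse_rsl_read(read):
    """Return (genotype, flanking_sequence), or None if the read does not fit the layout."""
    read = read.upper()
    index, half_site = read[:3], read[3:9]
    genotype = INDEX_TO_GENOTYPE.get(index)
    if genotype is None or half_site != FSEI_HALF_SITE:
        return None
    return genotype, read[9:]


# Hypothetical reads, not taken from the figures.
print(parse_rsl_read("ATCCCGGCCTTGACAGGT"))  # ('BTx623', 'TTGACAGGT')
print(parse_rsl_read("GTCCCGGCCAACGT"))      # ('IS3620c', 'AACGT')
print(parse_rsl_read("AAACCGGCCTTT"))        # None (unknown index)
```

Grouping the parsed flanking sequences by genotype gives the per-genotype contigs that are then aligned against each other, as in panel B) of the figures.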
Example 6. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 7, another 3 bp InDel polymorphism between BTx623 and IS3620c is identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 14) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C) and D), the FseI half-site is underlined. The InDel is bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined.
Example 7. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 8, two SNP polymorphisms between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 18) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 8. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 9, another 4 bp InDel polymorphism between BTx623 and IS3620c is identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 22) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The InDel is bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 9. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 10, two SNP polymorphisms between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 26) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 10. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 11, two SNP polymorphisms between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 30) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 11. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 12, three SNP polymorphisms between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 34) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 12. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 13, a three-SNP polymorphism between BTx623 and IS3620c is identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 38) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 13. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 14, a 1 bp InDel and one SNP polymorphism between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 42) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNP and InDel are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 14. Identification of RSL polymorphism and design of primer for amplification of the polymorphic region in sorghum.
As illustrated in FIG. 15, three SNPs and a 1 bp InDel polymorphism between BTx623 and IS3620c are identified. A) denotes the sequence identifier. B) shows the alignment and comparison of the BTx623 and IS3620c contigs for this sequence. A reverse primer (SEQ ID NO: 46) is designed downstream of the polymorphism, using the genomic sequence shown in C). This primer will be used with an FseI adapter-specific primer to produce a PCR product for sequencing, as described above in Example 5. In C), the FseI half-site is underlined. The SNPs and InDel are bolded and underlined, and the reverse oligonucleotide primer binding site is italicized and underlined. The adapter-modified DNA fragment, and the forward and reverse FseI adapter-specific primers, are similar to those shown in D) of FIG. 6, except that the 30 bp informative DNA sequence is derived from C) of the present example.
Example 15. Rice Polymorphism Discovery
DNA template generation using combinations of restriction enzymes and adapter ligation, as described in Example 1, above, has been tested in silico and in the laboratory based on the rice genome sequence containing about 400,000,000 bp. DNA templates generated by the described technique were sequenced by using the Roche 454 Genome Sequencer 20 System. The approximately 250,000 template sequences generated per run from several sequencing runs were analyzed and compared to results predicted in silico. The results confirmed the feasibility of using restriction enzymes/adapter ligation for the reproducible generation of DNA templates for high throughput targeted DNA sequencing and resequencing using the Roche 454 sequencer.
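An in silico test of this kind can be approximated by locating recognition sites in a genome sequence, splitting at those sites, and keeping fragments within the size range suited to the sequencer. The sketch below is a simplified illustration, not the procedure actually used for the rice study: the function names and the toy sequence are hypothetical, each cut is placed at the start of the recognition site, and overhangs, cut offsets, and methylation are ignored.

```python
import re


def in_silico_fragments(genome, site):
    """Split `genome` at every occurrence of `site` (cut placed at the site start)."""
    # A lookahead is used so that overlapping occurrences are also found.
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", genome)]
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_bp, max_bp):
    """Keep fragments whose length falls within [min_bp, max_bp]."""
    return [f for f in fragments if min_bp <= len(f) <= max_bp]


# Hypothetical toy sequence containing two TTAA sites.
toy = "AAAA" + "TTAA" + "CCCCCC" + "TTAA" + "GG"
frags = in_silico_fragments(toy, "TTAA")
print(frags)                      # ['AAAA', 'TTAACCCCCC', 'TTAAGG']
print(size_select(frags, 5, 10))  # ['TTAACCCCCC', 'TTAAGG']
```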
Example 16. High Throughput Multiplex Sequencing of Informative DNA Templates.
In one prospective haplotype scale application, 10,000 informative RSL-templates are targeted for selective amplification, capture and genotyping analysis across 100 accessions of a species germplasm. A 10X depth of sequence analysis of the amplified, enriched informative DNA templates will require the acquisition of 10,000,000 sequences on the ILLUMINA® sequencer. The ILLUMINA® SGAII™ is capable of sequencing approximately 50 million templates per run, or approximately 6.25 million per channel. The required 10,000,000 sequences may be distributed across several channels of the ILLUMINA® sequencer, with sequencing done in parallel with other samples utilizing unique indexing sequences to assign the sequences to their accession of origin. Genomic DNA from 100 different accessions may be digested with KasI, CspCI, or any other suitable targeting restriction enzyme, followed by ligation of adapters as described above (and illustrated in box 2 of FIG. 2). The resulting indexed DNA templates are then pooled prior to amplification and enrichment of the informative DNA templates. A potential advantage of the proposed approach, in addition to its lower cost, is the procedural flexibility and low barrier to entry. Many individual investigators will be able to obtain genotyping information at various depths depending on the requirements of the selected application. This approach to DNA template production, employing oligonucleotide-bead- or microarray-based capture and enrichment of informative templates, or targeted PCR amplification of informative templates, together with high throughput sequencing, is potentially applicable to any genome of any size, providing great flexibility in design. It is especially suitable for applications in which higher genotyping resolution is required (e.g., for haplotyping).
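The sequencing budget in this example is the product of the number of templates, accessions, and depth, and the channel count follows from the stated per-channel capacity. A minimal sketch of that calculation; the function name `sequencing_plan` is an illustrative assumption, and the estimate ignores failed reads and uneven representation between accessions.

```python
import math


def sequencing_plan(templates, accessions, depth, reads_per_channel):
    """Return (total reads needed, number of channels needed, rounded up)."""
    reads_needed = templates * accessions * depth
    channels = math.ceil(reads_needed / reads_per_channel)
    return reads_needed, channels


# 10,000 templates x 100 accessions x 10X depth, at about 6.25 million reads per channel
print(sequencing_plan(10_000, 100, 10, 6_250_000))  # (10000000, 2)
```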
Without further elaboration, it is believed that one skilled in the art can, using the description herein, utilize the present invention to its fullest extent. The embodiments described herein are to be construed as illustrative and not as constraining the remainder of the disclosure in any way whatsoever. While the preferred embodiments of the invention have been shown and described, many variations and modifications thereof can be made by one skilled in the art without departing from the spirit and teachings of the invention. For example, the order of execution of the steps of a disclosed process may be varied, in some embodiments, from the order given. Accordingly, the scope of protection is not limited by the description set out above, but is only limited by the claims, including all equivalents of the subject matter of the claims. The disclosures of all patents, patent applications and publications cited herein are hereby incorporated herein by reference, to the extent that they provide procedural or other details consistent with and supplementary to those set forth herein.
All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.

Claims

Claim 1. A method of generating informative DNA templates comprising:
(a) obtaining fragmented genomic DNA sequences from a first individual comprising a plurality of polymorphic sequences;
(b) ligating at least a first adapter to said fragmented genomic DNA sequences, to provide a plurality of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein a subset of said informative DNA templates comprises a polymorphic sequence; and
(c) selecting adapter-modified informative DNA templates to obtain an enriched mixture of adapter-modified informative DNA templates.
Claim 2. The method of claim 1, wherein selecting adapter-modified informative DNA templates of step (c) comprises hybridization-based selection of said adapter-modified informative DNA templates.
Claim 3. The method of claim 1, wherein selecting adapter-modified informative DNA templates of step (c) comprises targeted PCR amplification of said adapter-modified informative DNA templates.
Claim 4. The method of claim 2, wherein said hybridization-based selection comprises hybridizing said adapter-modified informative DNA templates to solid matrix-attached oligonucleotides to form hybridized complexes, wherein said hybridized complexes exclude said non-informative DNA templates.
Claim 5. The method of claim 4, wherein said solid matrix comprises a plurality of beads.
Claim 6. The method of claim 4, further comprising:
(d) removing at least a portion of the non-hybridized non-informative DNA templates from said hybridized complexes; and
(e) releasing the informative DNA templates from the hybridized complexes, to obtain an enriched mixture of adapter-modified informative DNA templates.
Claim 7. The method of claim 4, wherein forming said hybridized complexes in step (c) comprises hybridizing oligonucleotides to complementary sequences in the informative DNA templates.
Claim 8. The method of claim 7, wherein each said oligonucleotide is about 17-60 nucleotides in length.
Claim 9. The method of claim 7, wherein each said oligonucleotide is complementary to a known informative DNA template sequence corresponding to a polymorphic site in the genome.
Claim 10. The method of claim 6, wherein said oligonucleotides comprise a modification to allow oligonucleotide:informative template hybrids to be separated from non-hybridized template.
Claim 11. The method of claim 10, wherein said modification comprises incorporation of a terminal biotin.
Claim 12. The method of claim 3, wherein said targeted PCR amplification of informative templates comprises using a first primer complementary to said first adapter, and a plurality of second primers complementary to unique sequences in said informative DNA templates, wherein each of said plurality of second primers is designed such that the resulting amplified DNA templates comprise informative DNA sequences.
Claim 13. The method of claim 1, wherein said ligating of step (b) comprises ligating a second adapter to a terminus of each said fragmented genomic DNA sequences opposite said first adapter, to provide said plurality of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates.
Claim 14. The method of claim 13, further comprising selective amplification of said DNA templates containing said first adapter and second adapter.
Claim 15. The method of claim 1, further comprising selecting adapter-modified informative DNA templates based on the length of the DNA templates.
Claim 16. The method of claim 15, wherein adapter-modified informative DNA templates of about 50 to 200 base pairs are selected.
Claim 17. The method of claim 15, wherein adapter-modified informative DNA templates of about 400 to 600 base pairs are selected.
Claim 18. The method of claim 1, further comprising:
(d) subjecting the enriched mixture of adapter-modified informative DNA templates to DNA sequencing to obtain the sequences of said informative DNA templates.
Claim 19. The method of claim 18, wherein said DNA sequencing is high-throughput sequencing on an ILLUMINA® SGAII™, ABI DNA SOLID™ System or Roche 454 GS-FLX™ Genome Analysis System.
Claim 20. The method of claim 18, further comprising:
(e) comparing the sequences of the informative DNA templates to at least one set of reference genomic DNA sequences.
Claim 21. The method of claim 20, wherein at least one set of reference genomic DNA sequences is obtained from at least one other individual.
Claim 22. The method of claim 1, wherein at least said first adapter comprises an indexing sequence that can be correlated to said first individual.
Claim 23. The method of claim 1, further comprising, prior to (c):
(b1) preparing adapter-modified DNA templates from a plurality of individuals wherein each said adapter-modified DNA template from each respective individual contains a unique indexing sequence; and
(b2) combining said adapter-modified DNA templates obtained from all said individuals, to obtain a pool of adapter-modified DNA templates.
Claim 24. The method of claim 1, wherein obtaining fragmented genomic DNA sequences of step (a) comprises subjecting a genomic DNA sequence to digestion by at least a first restriction enzyme or to mechanical shearing.
Claim 25. The method of claim 24, wherein obtaining fragmented genomic DNA sequences of step (a) comprises digesting a genomic DNA sequence with at least a first restriction enzyme, and said ligating comprises ligating said first adapter and a second adapter to said fragmented genomic DNA sequences, to provide said adapter-modified DNA templates, wherein said first and second adapters are attached to the restriction termini of each said adapter-modified DNA fragments.
Claim 26. The method of claim 24, wherein said first restriction enzyme is sensitive to DNA methylation.
Claim 27. The method of claim 26, further comprising separately digesting a sample of said genomic DNA from said first individual with a second restriction enzyme that is not sensitive to DNA methylation.
Claim 28. The method of claim 1, wherein each said plurality of polymorphic sequences comprise a polymorphism selected from the group consisting of a single nucleotide polymorphism, a simple sequence repeat polymorphism, a sequence inversion polymorphism, a nucleotide insertion polymorphism and a nucleotide deletion polymorphism.
Claim 29. A reagent for selecting informative DNA templates, comprising a plurality of oligonucleotides attached to a solid matrix, wherein each of said oligonucleotides is in the range of about 17-60 nucleotides in length and is complementary to a genomic sequence located 600 nucleotides or less from a polymorphic sequence.
Claim 30. The reagent of claim 29, wherein each of said oligonucleotides is complementary to a genomic sequence located 50 to 200 or 400 to 600 nucleotides from a polymorphic sequence.
Claim 31. The reagent of claim 29, wherein said solid matrix comprises a plurality of beads.
Claim 32. A hybrid DNA complex comprising: the reagent of claim 29; and a plurality of adapter-modified informative DNA templates hybridized to said matrix-attached oligonucleotides, wherein each said informative DNA template comprises at least one polymorphic sequence.
Claim 33. The hybrid DNA complex of claim 32, wherein said adapter-modified informative DNA templates are from a single individual.
Claim 34. The hybrid DNA complex of claim 32, wherein said adapter-modified informative DNA templates are from a plurality of individuals, and said adapter includes a unique indexing sequence for matching each said adapter-modified informative DNA templates to a specific individual.
Claim 35. A method of marker-assisted selection comprising:
(a) obtaining fragmented genomic DNA sequences from a plurality of individual plants or plant cells, to provide a plurality of fragmented genomic DNA sequences comprising a plurality of polymorphic sequences at least one of which is linked to a trait of interest;
(b) ligating at least a first adapter to said fragmented DNA sequences, to provide a plurality of adapter-modified informative DNA templates and adapter-modified non-informative DNA templates, wherein each of said informative DNA templates comprises a polymorphic sequence and wherein said first adapter comprises an index sequence that can be correlated to genomic DNA of an individual;
(c) selecting adapter-modified informative DNA templates to obtain an enriched mixture of adapter-modified informative DNA templates;
(d) subjecting the enriched mixture of adapter-modified informative DNA templates to DNA sequencing to obtain the sequences of said informative DNA templates; and
(e) selecting an individual plant or plant cell based on the presence of at least one polymorphism in an informative DNA template wherein the polymorphism is linked to a trait of interest.
Claim 36. The method of claim 35, wherein selecting adapter-modified informative DNA templates of step (c) comprises hybridization-based selection of said adapter-modified informative DNA templates.
Claim 37. The method of claim 35, wherein selecting adapter-modified informative DNA templates of step (c) comprises targeted PCR amplification of said adapter-modified informative DNA templates.
Claim 38. The method of claim 35, wherein selecting an individual plant or plant cell comprises selecting a plant cell for regeneration of a plant.
Claim 39. The method of claim 35, wherein the plant or plant cell is a wheat, maize, rye, rice, oat, barley, turfgrass, sorghum, millet, sugarcane, tobacco, tomato, potato, soybean, cotton, canola, sunflower and alfalfa plant or plant cell.
Claim 40. The method of claim 35, wherein the trait is a trait of agronomic interest.
Claim 41. The method of claim 40, wherein the trait of agronomic interest is selected from the group consisting of drought tolerance, enhanced yield, cold tolerance, pest resistance, insect resistance, salt tolerance and herbicide tolerance.
PCT/US2009/059274 2008-10-02 2009-10-01 Method of generating informative dna templates for high-throughput sequencing applications WO2010039991A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10211808P 2008-10-02 2008-10-02
US61/102,118 2008-10-02

Publications (2)

Publication Number Publication Date
WO2010039991A2 WO2010039991A2 (en) 2010-04-08
WO2010039991A3 WO2010039991A3 (en) 2011-03-03

Family

ID=42074216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/059274 WO2010039991A2 (en) 2008-10-02 2009-10-01 Method of generating informative dna templates for high-throughput sequencing applications

Country Status (1)

Country Link
WO (1) WO2010039991A2 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102373288A (en) * 2011-11-30 2012-03-14 盛司潼 Method and kit for sequencing target areas
CN104232627A (en) * 2013-06-13 2014-12-24 深圳华大基因科技有限公司 2b-RAD pooling technology
CN104232626A (en) * 2013-06-13 2014-12-24 深圳华大基因科技有限公司 Barcode object in reduced-representation genome sequencing library and design method thereof
CN105483267A (en) * 2016-01-15 2016-04-13 古博 Plasma cfDNA (cell-free deoxyribonucleic acid) bi-molecular marker, method for marking and detecting plasma cfDNA and application of plasma cfDNA bi-molecular marker
GB2533882A (en) * 2012-01-26 2016-07-06 Nugen Tech Inc Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
US9745614B2 (en) 2014-02-28 2017-08-29 Nugen Technologies, Inc. Reduced representation bisulfite sequencing with diversity adaptors
US9822408B2 (en) 2013-03-15 2017-11-21 Nugen Technologies, Inc. Sequential sequencing
US9957549B2 (en) 2012-06-18 2018-05-01 Nugen Technologies, Inc. Compositions and methods for negative selection of non-desired nucleic acid sequences
US10102337B2 (en) 2014-08-06 2018-10-16 Nugen Technologies, Inc. Digital measurements from targeted sequencing
WO2018212318A1 (en) * 2017-05-19 2018-11-22 Toyota Jidosha Kabushiki Kaisha Set of random primers and method for preparing dna library using the same
US10570448B2 (en) 2013-11-13 2020-02-25 Tecan Genomics Compositions and methods for identification of a duplicate sequencing read
CN111635958A (en) * 2020-07-22 2020-09-08 中国农业科学院作物科学研究所 Molecular marker linked with rice cold-resistant gene qSF12 and application thereof
CN112458199A (en) * 2020-12-24 2021-03-09 华智生物技术有限公司 SNP molecular marker of rice salt-tolerant gene SKC1 and application thereof
US11028430B2 (en) 2012-07-09 2021-06-08 Nugen Technologies, Inc. Methods for creating directional bisulfite-converted nucleic acid libraries for next generation sequencing
US11099202B2 (en) 2017-10-20 2021-08-24 Tecan Genomics, Inc. Reagent delivery system
US11795451B2 (en) 2017-12-25 2023-10-24 Toyota Jidosha Kabushiki Kaisha Primer for next generation sequencer and a method for producing the same, a DNA library obtained through the use of a primer for next generation sequencer and a method for producing the same, and a DNA analyzing method using a DNA library

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000078945A1 (en) * 1999-06-23 2000-12-28 Stratagene Methods of enriching for and identifying polymorphisms
US20040132056A1 (en) * 2001-07-20 2004-07-08 Affymetrix, Inc. Method of target enrichment and amplification
WO2007106509A2 (en) * 2006-03-14 2007-09-20 Genizon Biosciences, Inc. Methods and means for nucleic acid sequencing
WO2008015975A1 (en) * 2006-07-31 2008-02-07 Kinki University Method for amplification of dna fragment

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102373288A (en) * 2011-11-30 2012-03-14 盛司潼 Method and kit for sequencing target areas
US10876108B2 (en) 2012-01-26 2020-12-29 Nugen Technologies, Inc. Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
GB2533882A (en) * 2012-01-26 2016-07-06 Nugen Tech Inc Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
GB2533882B (en) * 2012-01-26 2016-10-12 Nugen Tech Inc Method of enriching and sequencing nucleic acids of interest using massively parallel sequencing
GB2513793B (en) * 2012-01-26 2016-11-02 Nugen Tech Inc Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
US9650628B2 (en) 2012-01-26 2017-05-16 Nugen Technologies, Inc. Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library regeneration
US10036012B2 (en) 2012-01-26 2018-07-31 Nugen Technologies, Inc. Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
US9957549B2 (en) 2012-06-18 2018-05-01 Nugen Technologies, Inc. Compositions and methods for negative selection of non-desired nucleic acid sequences
US11697843B2 (en) 2012-07-09 2023-07-11 Tecan Genomics, Inc. Methods for creating directional bisulfite-converted nucleic acid libraries for next generation sequencing
US11028430B2 (en) 2012-07-09 2021-06-08 Nugen Technologies, Inc. Methods for creating directional bisulfite-converted nucleic acid libraries for next generation sequencing
US9822408B2 (en) 2013-03-15 2017-11-21 Nugen Technologies, Inc. Sequential sequencing
US10619206B2 (en) 2013-03-15 2020-04-14 Tecan Genomics Sequential sequencing
US10760123B2 (en) 2013-03-15 2020-09-01 Nugen Technologies, Inc. Sequential sequencing
CN104232627A (en) * 2013-06-13 2014-12-24 深圳华大基因科技有限公司 2b-RAD pooling technology
CN104232626A (en) * 2013-06-13 2014-12-24 深圳华大基因科技有限公司 Barcode object in reduced-representation genome sequencing library and design method thereof
US11725241B2 (en) 2013-11-13 2023-08-15 Tecan Genomics, Inc. Compositions and methods for identification of a duplicate sequencing read
US10570448B2 (en) 2013-11-13 2020-02-25 Tecan Genomics Compositions and methods for identification of a duplicate sequencing read
US11098357B2 (en) 2013-11-13 2021-08-24 Tecan Genomics, Inc. Compositions and methods for identification of a duplicate sequencing read
US9745614B2 (en) 2014-02-28 2017-08-29 Nugen Technologies, Inc. Reduced representation bisulfite sequencing with diversity adaptors
US10102337B2 (en) 2014-08-06 2018-10-16 Nugen Technologies, Inc. Digital measurements from targeted sequencing
CN105483267A (en) * 2016-01-15 2016-04-13 古博 Plasma cfDNA (cell-free deoxyribonucleic acid) bi-molecular marker, method for marking and detecting plasma cfDNA and application of plasma cfDNA bi-molecular marker
CN105483267B (en) * 2016-01-15 2018-12-04 古博 Plasma DNA bimolecular label, label and method of detection blood plasma cfDNA and application thereof
US20200071776A1 (en) * 2017-05-19 2020-03-05 Toyota Jidosha Kabushiki Kaisha Set of random primers and method for preparing dna library using the same
CN110651052B (en) * 2017-05-19 2022-10-28 丰田自动车株式会社 Random primer set and method for preparing DNA library using the same
WO2018212318A1 (en) * 2017-05-19 2018-11-22 Toyota Jidosha Kabushiki Kaisha Set of random primers and method for preparing dna library using the same
CN110651052A (en) * 2017-05-19 2020-01-03 丰田自动车株式会社 Random primer set and method for preparing DNA library using the same
US11099202B2 (en) 2017-10-20 2021-08-24 Tecan Genomics, Inc. Reagent delivery system
US11795451B2 (en) 2017-12-25 2023-10-24 Toyota Jidosha Kabushiki Kaisha Primer for next generation sequencer and a method for producing the same, a DNA library obtained through the use of a primer for next generation sequencer and a method for producing the same, and a DNA analyzing method using a DNA library
CN111635958A (en) * 2020-07-22 2020-09-08 中国农业科学院作物科学研究所 Molecular marker linked with rice cold-resistant gene qSF12 and application thereof
CN112458199A (en) * 2020-12-24 2021-03-09 华智生物技术有限公司 SNP molecular marker of rice salt-tolerant gene SKC1 and application thereof
CN112458199B (en) * 2020-12-24 2021-11-16 华智生物技术有限公司 SNP molecular marker of rice salt-tolerant gene SKC1 and application thereof

Also Published As

Publication number Publication date
WO2010039991A3 (en) 2011-03-03

Similar Documents

Publication Publication Date Title
US11649494B2 (en) High throughput screening of populations carrying naturally occurring mutations
WO2010039991A2 (en) Method of generating informative dna templates for high-throughput sequencing applications
DK2002017T3 (en) High-capacity detection of molecular markers based on restriction fragments
US8980551B2 (en) Use of class IIB restriction endonucleases in 2nd generation sequencing applications
JP2007509629A (en) Complex nucleic acid analysis by cleavage of double-stranded DNA
US20090208943A1 (en) Method for the High Throughput Screening of Transposon Tagging Populations and Massive Parallel Sequence Identification of Insertion Sites
JP2007530026A (en) Nucleic acid sequencing
US20200102612A1 (en) Method for identifying the source of an amplicon
Du et al. Comprehensive evaluation of SNP identification with the Restriction Enzyme-based Reduced Representation Library (RRL) method
EP2180065A1 (en) Method of reducing the molecular weight of at least one PCR product for its detection while maintaining its identity
WO2001032929A1 (en) Methods and compositions for analysis of snps and strs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 09818526; Country of ref document: EP; Kind code of ref document: A2
NENP Non-entry into the national phase in: Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 09818526; Country of ref document: EP; Kind code of ref document: A2