WO2013103759A2 - Haplotype based pipeline for snp discovery and/or classification - Google Patents

Haplotype based pipeline for snp discovery and/or classification

Info

Publication number
WO2013103759A2
Authority
WO
WIPO (PCT)
Prior art keywords
snp
module
snps
markers
computerized system
Prior art date
Application number
PCT/US2013/020211
Other languages
French (fr)
Other versions
WO2013103759A3 (en)
Inventor
Ramesh BUYYARAPU
Shunxue TANG
Kanika ARORA
Navin ELANGO
Siva P. Kumpatla
Pradeep MARRI
Jennifer Changhong TANG
Robert MCEWAN
Clive EVANS
Kelly PARLIAMENT
Original Assignee
Dow Agrosciences Llc
Priority date
Filing date
Publication date
Application filed by Dow Agrosciences Llc
Publication of WO2013103759A2
Publication of WO2013103759A3

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40Population genetics; Linkage disequilibrium
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • This invention is generally related to the field of bioinformatics, and more specifically the field of discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism.
  • SNP single nucleotide polymorphism
  • SNP markers have become markers of choice for marker assisted selection (MAS) in crop improvement programs because of their higher abundance, amenability for automation and availability of high throughput genotyping platforms.
  • MAS marker assisted selection
  • current methodology for identifying SNPs in plants has many limitations, including a very high rate of false positives. This problem is especially challenging for plants with complex genomes. Thus, there remains a need for methodology which can identify and/or classify SNPs efficiently and accurately.
  • This invention is related to systems and methods for discovery and/or
  • SNP single-nucleotide polymorphism
  • the candidate SNP markers identified and/or classified using the systems and methods disclosed can be useful for phenotype or trait association studies.
  • HAPSNP haplotype based pipeline for SNP discovery and/or classification
  • the disclosed systems and methods can be especially useful for polyploid and complex plant genomes.
  • a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism comprises:
  • the system further comprises at least one of alignment module, assembly/mapping module, haplotype calling module, and SNP sequence formatting module. In a further or alternative embodiment, the system comprises all of assembly/mapping module(s), haplotype calling module, and SNP sequence formatting module. In another embodiment, the system further comprises a SNP marker classification module. In one embodiment, the system comprises an automatic sequencer or DNA sequencing machine. In another embodiment, the input device is selected from the group consisting of automated sequencer, sequencing data input device, and sequencing data storage device. In another embodiment, the output interface comprises a list of candidate SNP markers.
  • the database described herein contains information selected from the group consisting of SNPs, contigs with at least one SNP, and haplotypes with at least two SNPs.
  • the filtration engine for removing unreliable SNPs comprises at least four SNP filters to remove unreliable SNPs and generate reliable SNPs.
  • the filtration engine for removing unreliable SNPs comprises at least five SNP filters to remove unreliable SNPs and generate reliable SNPs.
  • the filtration engine for removing unreliable SNPs comprises at least six SNP filters to remove unreliable SNPs and generate reliable SNPs.
  • the filtration engine for removing unreliable SNPs comprises at least five SNP filters to remove unreliable SNPs and generate reliable SNPs.
  • the assembly/mapping module converts raw sequence data into contig FASTA files and/or ACE files.
  • the haplotype calling module generates haplotype data by examining patterns of SNP loci across contigs.
  • the SNP sequence formatting module generates candidate SNP markers with flanking sequences.
  • the candidate SNP markers have a validation success rate of at least or greater than 60%.
  • the candidate SNP markers have a validation success rate of from 60% to 80%.
  • the candidate SNP markers have a validation success rate of about 75%.
  • the computerized system provided comprises a controller module for SNP calling, and further comprises an assembly/mapping module, a haplotype calling module, and a SNP sequence formatting module.
  • the computerized system provided comprises a controller module for Loci identification, and further comprises an alignment module and a SNP sequence formatting module.
  • a method for use in a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism comprises:
  • the method further comprises determining haplotype using at least one haplotype calling module. In a further or alternative embodiment, the method further comprises formatting candidate SNP markers using at least one SNP sequence formatting module.
  • the computerized system of the method comprises a system described herein.
  • the method provides candidate SNP markers having a validation success rate of at least or greater than 60%.
  • the candidate SNP markers have a validation success rate of from 60% to 80%.
  • the candidate SNP markers have a validation success rate of about 75%.
  • the method provides candidate SNP markers having a validation success rate at least twofold that of a publicly available program.
  • the method provides candidate SNP markers having a validation success rate at least one and a half fold (i.e., at least a 50% increase) that of a publicly available program.
  • the publicly available program is QualitySNP.
  • the QualitySNP program is disclosed in Tang et al., BMC Bioinformatics 7:438 (2006), the content of which is incorporated in its entirety.
  • the system or method disclosed provides that the candidate SNP markers are classified into at least two types using at least one SNP marker
  • the candidate SNP markers are classified into at least three types.
  • the system or method disclosed provides at least one type of candidate SNP markers with a validation success rate of at least or greater than 60%.
  • the candidate SNP markers have a validation success rate of from 60% to 80%.
  • the candidate SNP markers have a validation success rate of about 75%.
  • the system or method disclosed provides at least one type of candidate SNP markers with a validation success rate at least or greater than twofold that of a publicly available program.
  • the publicly available program is QualitySNP.
  • the organism for the systems or methods described herein comprises a polyploid genome.
  • the organism is a plant.
  • the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat.
  • the method provided further comprises classifying candidate SNP markers using a SNP marker classification module.
  • at least one type of the candidate SNP markers has a validation success rate of at least 60%.
  • at least two types of the candidate SNP markers have a validation success rate of at least 60%.
  • the method comprises:
  • the computerized system of the method comprises a system described herein.
  • the method provided further comprises identifying SNP loci using a Loci identification module.
  • the method provides candidate SNP markers having a validation success rate of at least or greater than 60%.
  • the candidate SNP markers have a validation success rate of from 60% to 80%.
  • the candidate SNP markers have a validation success rate of about 75%.
  • the method provides candidate SNP markers having a validation success rate at least twofold that of a publicly available program.
  • the method provides candidate SNP markers having a validation success rate at least one and a half fold (i.e., at least a 50% increase) that of a publicly available program.
  • the publicly available program is QualitySNP.
  • the QualitySNP program can be obtained from the world wide website bioinofmatics.nl/tools/snpweb as disclosed in Tang et al., BMC Bioinformatics 7:438 (2006), the content of which is incorporated in its entirety.
  • the system or method disclosed provides that the candidate SNP markers are classified into at least two types using at least one SNP marker
  • the candidate SNP markers are classified into at least three types.
  • the system or method disclosed provides at least one type of candidate SNP markers with a validation success rate of at least or greater than 60%.
  • the candidate SNP markers have a validation success rate of from 60% to 80%.
  • the candidate SNP markers have a validation success rate of about 75%.
  • the system or method disclosed provides at least one type of candidate SNP markers with a validation success rate at least or greater than twofold that of a publicly available program.
  • the publicly available program is STACKs.
  • the organism for the systems or methods described herein comprises a polyploid genome.
  • the organism is a plant.
  • the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat.
  • the plant is G. hirsutum, G. barbadense, or G. mustelinum.
  • Figure 1 shows an exemplary embodiment of the HAPSNP pipeline provided herein. Five modules of the system are illustrated: (1) Assembly/mapping; (2) SNP calling; (3) SNP filtration; (4) Haplotype calling; and (5) SNP sequence formatting.
  • Figure 2 shows an exemplary system provided herein.
  • Figure 3 shows exemplary input sequences from raw sequencing data.
  • Figure 4 shows an exemplary output screen shot after the assembly/mapping module.
  • Figure 5 shows an exemplary output screen for possible SNPs after the SNP calling module.
  • Figure 6 shows an exemplary output screen for homopolymer region SNPs after the SNP filtration module.
  • Figure 7 shows an exemplary output screen for filtered SNP after the SNP filtration module.
  • Figure 8 shows an exemplary output screen after the haplotype calling module.
  • Figure 9 shows an exemplary output screen after the SNP sequence formatting module.
  • Figure 10 shows an exemplary output screen after both the haplotype calling and the SNP sequence formatting module.
  • Figure 11 shows an example of Type I, II, and III SNPs in cotton identified using an exemplary system and method provided herein.
  • Figure 11A shows a typical distribution of Type I SNPs;
  • Figure 11B shows a typical distribution of Type II SNPs;
  • Figure 11C shows a typical distribution of Type III SNPs.
  • Figure 12 shows an exemplary embodiment of the HAPSNP pipeline provided to be combined with genotyping-by-sequencing (GBS).
  • GBS genotyping-by-sequencing
  • SNP marker development in polyploid crop species is very challenging due to the existence of multiple sub-genomes in the nucleus. Due to the presence of duplicated loci in the sub-genomes, it is very difficult to distinguish true SNPs, which arise from allelic variations in homologs, from false SNPs, which arise from non-allelic variations in paralogs.
  • the phrase “candidate SNP markers” refers to SNP sequences identified to be validated using biological and/or other assays as associated with traits or phenotypes of an organism, for example plants.
  • the term “plant” includes dicotyledonous plants and monocotyledonous plants.
  • Examples of dicotyledonous plants include tobacco, Arabidopsis, soybean, tomato, papaya, canola, sunflower, cotton, alfalfa, potato, grapevine, pigeon pea, pea, Brassica, chickpea, sugar beet, rapeseed, watermelon, melon, pepper, peanut, pumpkin, radish, spinach, squash, broccoli, cabbage, carrot, cauliflower, celery, Chinese cabbage, cucumber, eggplant, and lettuce.
  • Examples of monocotyledonous plants include corn, rice, wheat, sugarcane, barley, rye, sorghum, orchids, bamboo, banana, cattails, lilies, oat, onion, millet, and triticale.
  • linkage analysis refers to a method used to identify SNPs close or adjacent to one another in the same contig, chromosome, or a stretch of sequence defined otherwise. Methods for construction of contigs are well known in the art, for example, see the CAP3 program disclosed in Huang, X. and A. Madan "CAP3 : A DNA Sequence Assembly Program.” Genome Research 9(9): 868-877 (1999), the content of which is incorporated by reference in its entirety.
  • polymorphism refers to a difference of DNA bases in genomes/chromosomes of organisms.
  • the polymorphism may reside within the coding sequence of an open reading frame. Alternatively, it may reside within non-coding sequences.
  • all bases that vary between genomes/chromosomes of organisms can be considered polymorphisms, which are to be distinguished from errors introduced by human manipulation, such as sequencing errors or mutations introduced during amplification.
  • haplotype refers to a group of SNPs that are generally inherited together. Haplotypes can have stronger correlations with traits or phenotypic effects compared with individual SNPs, and therefore may provide increased diagnostic accuracy in some cases (see e.g., Stephens et al. (2001) Science 293: 489-493).
  • FASTA format was introduced by Bill Pearson and David Lipman in 1988 for representing either nucleotide or amino acid sequences (see Pearson and Lipman, "Improved tools for biological sequence comparison" (1988) Proc. Natl. Acad. Sci. USA 85:2444-2448).
  • a sequence in FASTA format is text-based, beginning with a single-line description containing a greater-than symbol (>) in the first column, followed by lines of sequence data.
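The FASTA layout described above can be read with a few lines of code. The following is a minimal sketch rather than the parser used by the pipeline; the function name `parse_fasta` and the example contig names are hypothetical, and the choice to take the first word after `>` as the identifier is an assumption.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a mapping of identifier -> sequence."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Description line: use the first word after '>' as the identifier.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record, normalised to upper case.
            records[name] += line.upper()
    return records


# Hypothetical two-contig example.
example = """>contig_1 hypothetical cotton contig
ACGTACGT
ttgaccaa
>contig_2
GGGCCC
"""
contigs = parse_fasta(example.splitlines())
```

Sequence lines are concatenated per record, so a contig wrapped over several lines comes back as one string.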
  • an ACE file is a commonly used data format for representing sequence assemblies.
  • the systems and methods disclosed herein provide a SNP detection pipeline which utilizes haplotype information to distinguish homologous loci from paralogous loci.
  • this Haplotype Based Pipeline for SNP Discovery and/or Classification uses high throughput sequence data assembly tools along with multiple custom scripts to decipher the contig assembly sequence by (i) identifying putative SNPs initially; (ii) generating haplotype information and allelic frequency of loci in respective genotypes; and (iii) enhancing the ability to identify high quality SNPs using the haplotype information and allelic frequency.
  • This exemplary pipeline functions well for SNP marker discovery using the sequence information from biparental resources, for example, in both cotton and canola. SNPs identified from this pipeline can be converted into genotyping assays and can be validated with a success rate (polymorphism rate) of 60-80% across various genotypes.
  • the HAPSNP pipeline provided herein is relatively efficient in that it (i) has a high assay validation rate (60-80%) compared to other SNP mining programs (<25%) for polyploid species; and (ii) is more robust in handling huge datasets for allele mining (>10 million sequences) compared to other SNP mining programs (<1 million sequences).
  • the utility of the exemplary HAPSNP provided herein can be extended to other complex diploids, polyploid crop species and targeted de-novo or re-sequencing projects to identify true SNPs and also to analyze multiple types of sequence data (for example, from 454 Life Science Corporation, Applied Biosystems (ABI), and/or Illumina Inc.) from multiple (>2) parental sources.
  • the HAPSNP pipeline provided herein can be implemented for single nucleotide variation detection in any organism including plants and it can also be used for formatting of the SNP sequence information to suit assay designing for multiple genotyping chemistries, for example, Illumina Inc.'s GoldenGate assay, Infinium® iSelect® beadchip, or
  • An exemplary embodiment of the HAPSNP pipeline is shown in Figure 1, where five modules of the system are illustrated: (1) Assembly/mapping; (2) SNP calling; (3) SNP filtration; (4) Haplotype calling; and (5) SNP sequence formatting.
  • the input can be raw sequencing data.
  • the raw sequencing data can be generated for either de novo or re-sequencing purposes on next generation sequencing (NGS) instruments and can be initially quality filtered according to the standard criteria set by the NGS instrument manufacturers.
  • NGS next generation sequencing
  • sequences from two or more sources for example, genotypes and/or parental lines
  • de novo assembly programs for example, Celera Assembler
  • mapped to a reference genome or sequence using other programs for example, Mosaik program.
  • the assembled or mapped data is converted to .ace file format for further processing.
  • module (1) can be either contig FASTA files or .ace files.
  • the input typically includes .ace files as illustrated in Figure 1.
  • all possible loci with single nucleotide variations are identified by a custom designed script in the contig regions.
  • the systems and methods provided herein allow the user to set, for SNP allele calling, the required sequencing depth at the SNP position for each allele and the required allelic frequency per genotype.
  • The major function of module (2) is to remove most of the false SNPs arising from sequencing errors, and this function is critical for distinguishing the allelic variants (homologous, true SNPs) from the non-allelic variants (homeologous or paralogous, false SNPs) and for haplotype calling (see description for module (4) below).
  • the output of module (2) may include all possible SNPs and contigs generated from SNPs.
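The depth and allele-frequency thresholds described for module (2) can be illustrated as follows. This is a minimal sketch under stated assumptions, not the custom script referred to in the text: the names `call_alleles` and `is_putative_snp`, the default thresholds, and the representation of a pileup column as a string of read bases per genotype are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def call_alleles(bases: str, min_depth: int = 2, min_freq: float = 0.2) -> Set[str]:
    """Return the alleles supported in one genotype at one position.

    An allele is kept only if it is seen in at least `min_depth` reads and
    makes up at least `min_freq` of the reads in this genotype
    (both thresholds are user-set, here with assumed defaults).
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b for b, n in counts.items() if n >= min_depth and n / total >= min_freq}


def is_putative_snp(column: Dict[str, str], **thresholds) -> bool:
    """A position is a putative SNP if two or more alleles are called overall."""
    called = set().union(*(call_alleles(b, **thresholds) for b in column.values()))
    return len(called) >= 2


# Hypothetical pileup columns: genotype name -> read bases at one position.
between_parents = {"parent_A": "AAAAAAAG", "parent_B": "GGGGGG"}
single_error = {"parent_A": "AAAAAAAG", "parent_B": "AAAAA"}
```

In the second column the lone `G` read fails the depth threshold, so it is treated as a likely sequencing error and the position is not called.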
  • For module (3), SNP filtration, the input typically includes all identified SNPs from module (2).
  • the major function of module (3) is to remove SNPs found in
  • the SNP filtration module provided differs from existing programs because the filtration used here does not depend on (i.e., is independent of) the number of SNPs, the frequency of duplication of SNPs, or the size of the population. Further, the HAPSNP pipeline provided herein allows users to choose and/or create a customized SNP filtration unit within the SNP filtration module for a specific purpose, for example, for particular crops including cotton, canola, corn, wheat, sunflower, or soybean.
  • the input typically includes all possible SNPs or contigs generated from SNPs from module (3).
  • the information from module (3) is used to generate the haplotype information for each contig.
  • each haplotype is defined as a unique combination of alleles in a contiguous series of SNP locations found in a contig. Haplotypes can be generated for each contig by examining the patterns of SNP loci across contigs. SNPs with more than two haplotypes in any of the genotypes (most common in polyploids) or with the same two haplotypes in all the genotypes are considered false SNPs, as they are potentially non-allelic variations between paralogs, and are eliminated from further validation.
  • the major function of module (4) is to greatly enhance the percentage of true SNPs for validation after haplotype generation.
  • the output of module (4) may include haplotypes generated from contigs/SNPs.
  • Module (4) can optionally include a haplotype filtration unit to filter out false or undesired haplotypes.
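The haplotype rule described for module (4) can be sketched in code. This is an illustration under assumptions, not the pipeline's implementation: each read is represented by the string of alleles it carries at the SNP positions of a contig, and the names `call_haplotypes`, `is_false_snp_contig` and the `min_reads` support threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def call_haplotypes(reads: List[str], min_reads: int = 2) -> FrozenSet[str]:
    """Distinct allele combinations seen in one genotype.

    `min_reads` is an assumed support threshold to ignore one-off patterns.
    """
    counts = Counter(reads)
    return frozenset(h for h, n in counts.items() if n >= min_reads)


def is_false_snp_contig(reads_by_genotype: Dict[str, List[str]], min_reads: int = 2) -> bool:
    """Apply the two rejection rules stated in the text.

    Rule 1: more than two haplotypes in any genotype.
    Rule 2: the same two haplotypes in all genotypes.
    """
    haps = [call_haplotypes(r, min_reads) for r in reads_by_genotype.values()]
    if any(len(h) > 2 for h in haps):
        return True
    return len(haps) > 1 and len(haps[0]) == 2 and all(h == haps[0] for h in haps)


# Hypothetical contigs with two SNP positions each.
true_locus = {"parent_A": ["AC", "AC", "AC"], "parent_B": ["GT", "GT"]}
shared_pair = {"parent_A": ["AC", "AC", "GT", "GT"], "parent_B": ["AC", "AC", "GT", "GT", "GT"]}
three_haps = {"parent_A": ["AC", "AC", "GT", "GT", "AT", "AT"], "parent_B": ["AC", "AC"]}
```

Only `true_locus` passes: each parent carries a single haplotype and the two parents differ, which is the pattern expected of an allelic variation.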
  • the input typically includes filtered SNPs, Contigs (FASTA files), and haplotypes generated from contigs/SNPs.
  • the contig sequences are used to get flanking sequence for each filtered SNP.
  • each of the filtered SNP loci is converted to [Allele1/Allele2] format and the flanking sequences are formatted to fit assay design and validation with, for example, KASPar®, Infinium®, or GoldenGate genotyping technology.
  • SNPs other than the one at the selected position that lie within 10 bases upstream or downstream of it can be converted into ambiguous bases or wobbles (R, Y, M, K, S, W, H, B, V, D, N) to avoid assay design in the flanking SNP region.
  • the SNPs that are farther away from the selected SNP position can be converted to the major allele. This process reduces the risk of failure during assay validation.
  • the output of module (5) typically includes selected SNPs with flanking sequences, for example, listed in an Excel spreadsheet.
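The formatting step of module (5) can be illustrated with a short sketch. It assumes one reading of the text: neighbouring SNPs within 10 bases of the selected SNP become IUPAC ambiguity codes, and SNPs farther away are replaced by their major allele. The function name `format_snp`, the `flank` and `window` parameters, 0-based positions, and the convention that the major allele is listed first are all hypothetical.

```python
from typing import Dict, Tuple

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("ACT"): "H", frozenset("CGT"): "B", frozenset("ACG"): "V",
    frozenset("AGT"): "D", frozenset("ACGT"): "N",
}


def format_snp(contig: str, snps: Dict[int, Tuple[str, ...]], target: int,
               flank: int = 20, window: int = 10) -> str:
    """Return the target SNP as [Allele1/Allele2] with formatted flanks.

    `snps` maps a 0-based contig position to its alleles, major allele first.
    """
    start = max(0, target - flank)
    end = min(len(contig), target + flank + 1)
    out = []
    for pos in range(start, end):
        alleles = snps.get(pos)
        if pos == target:
            out.append("[{}/{}]".format(snps[target][0], snps[target][1]))
        elif alleles is None:
            out.append(contig[pos])                 # ordinary base, copied as-is
        elif abs(pos - target) <= window:
            out.append(IUPAC[frozenset(alleles)])   # nearby SNP -> ambiguity code
        else:
            out.append(alleles[0])                  # distant SNP -> major allele
    return "".join(out)


# Hypothetical 30-base contig with a target SNP at position 15.
contig = "AAAAACCCCCGGGGGTTTTTAAAAACCCCC"
snps = {2: ("G", "A"), 12: ("G", "A"), 15: ("T", "C")}
formatted = format_snp(contig, snps, target=15, flank=15)
```

In this example the SNP at position 12 falls inside the 10-base window and is written as `R`, while the SNP at position 2 lies outside it and is written as its major allele `G`.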
  • the systems or methods disclosed herein further comprise at least one SNP marker classification module.
  • candidate SNP markers are classified into at least two types using the SNP marker classification module.
  • candidate SNP markers are classified into at least three types using the SNP marker classification module. The classification can be based on association with genotype or other criteria as demonstrated in examples herein.
  • Major advantages of the systems and methods provided herein include at least one of the following: (1) the HAPSNP pipeline disclosed can handle large sequencing data generated from NGS instruments; (2) the HAPSNP pipeline disclosed can use sequencing depth at SNP position and allele frequency to assure the quality of allele calling and distinguish the allelic variations from non-allelic variations between paralogs; (3) the HAPSNP pipeline disclosed can implement haplotype information to further enhance the percentage of true SNPs; and (4) the HAPSNP pipeline disclosed can format the SNP sequence information to suit assay design with multiple genotyping platforms.
  • the HAPSNP pipeline provided includes a data storage/database and retrieval system for SNPs/haplotypes integrated with operation system and analysis system as shown in Figure 2.
  • the input device may include raw sequencing data from genomic DNA, expressed sequence tags (ESTs), genome sequence tags (GSTs), and/or nucleic acid information from other sources such as FASTA files.
  • the HAPSNP pipeline provided herein allows users to input specific sequence data as desired.
  • the output device may include a unit for generating an Excel spreadsheet to be displayed on a computer screen, a database for SNPs/haplotypes of contigs/alignments (before and after filtration), and/or a user-friendly interface, for example, a web-based interface or e-mail notification system.
  • the HAPSNP pipeline disclosed can be particularly useful for SNP discovery in polyploid species including cotton, canola, and soybean. Further, the HAPSNP pipeline disclosed is powerful enough to identify SNPs from a set of two parents and also to generate haplotype information for "Genotyping by Sequencing" projects used in either quantitative trait locus (QTL) mapping or trait introgression programs, or even for hybrid crops.
  • QTL quantitative trait locus
  • the utility of the systems and methods disclosed can be extended to analyze the data from multiple sequencing technologies and also multiple parental sources to identify candidate SNP loci for assay validation in, for example, cotton and canola.
  • the systems and methods disclosed can be used to analyze the NGS data from targeted re-sequencing projects in, for example, soybean, corn, sunflower, and cotton.
  • Sequencing data from cotton can be imported directly into the assembly/mapping module of the HAPSNP pipeline provided as shown in Figure 3.
  • the output .ace (or ACE) file can be input into the SNP calling module (see Module 2 of Figure 1 and Figure 4).
  • the SNP calling module determines all possible SNPs based on sequence comparison among all input sequences and optionally a reference sequence is considered for sequence comparison.
  • Contig sequences and identifiers can be included in all SNPs as output after SNP calling as shown in Figure 5. These SNPs/contigs are then subject to SNP filtration (for example, see Module 3 of Figure 1).
  • the SNP filtration module can also determine whether SNPs are in a homopolymer region. If so, the homopolymer region SNPs can be displayed as shown in Figure 6. After SNP filtration, false positive SNPs have been removed and the remaining SNPs are input into a SNP sequence formatting module as shown in Figure 1.
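A homopolymer check of the kind mentioned above might look like the following. The text does not define the criterion, so this sketch assumes one: a SNP is flagged when a run of at least `min_run` identical bases lies immediately to its left or right in the contig sequence. The names `run_length` and `in_homopolymer` and the default run length of 4 are hypothetical.

```python
def run_length(seq: str, start: int, step: int) -> int:
    """Length of the run of identical bases starting at `start`, moving by `step`."""
    if not 0 <= start < len(seq):
        return 0
    base, n, i = seq[start], 0, start
    while 0 <= i < len(seq) and seq[i] == base:
        n += 1
        i += step
    return n


def in_homopolymer(seq: str, pos: int, min_run: int = 4) -> bool:
    """Flag a SNP at 0-based `pos` that directly borders a homopolymer run.

    `min_run` is an assumed threshold, not a value taken from the text.
    """
    left = run_length(seq, pos - 1, -1)
    right = run_length(seq, pos + 1, +1)
    return left >= min_run or right >= min_run


# Hypothetical contig: the A at position 8 follows a run of five T bases.
seq = "ACGTTTTTAGCATCG"
flag_next_to_run = in_homopolymer(seq, 8)
flag_elsewhere = in_homopolymer(seq, 11)
```

Flagged SNPs could then be reported separately from the filtered set, in the spirit of the separate homopolymer output the text describes.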
  • haplotype calling module for example, see Module 4 of Figure 1.
  • the haplotype calling module can optionally include a haplotype filtration unit which is independent from the SNP filtration module.
  • the haplotype information can be input into the SNP sequence formatting module to be considered for association with genotypes after combination with filtered SNPs (see Figure 9 for an example of a haplotype output).
  • the SNP sequence formatting module (for example, see Module 5 of Figure 1) compiles filtered SNPs with flanking sequences together with haplotype information (optionally filtered) to determine contigs containing "candidate SNP markers" (see Figure 10 as an example of the output of candidate SNP markers with contig identifiers).
  • the output of the HAPSNP pipeline provided herein can include (1) contig identifier information, (2) contig sequence information, (3) SNP sequence information, and (4) haplotype designation.
  • the HAPSNP pipeline of this example is compared to publicly available, unmodified programs including QualitySNP and Consortium in either cotton or canola. As shown in Table 1, the HAPSNP pipeline provided can increase the validation success of candidate SNP markers more than twofold, from about 27-33% to about 60-69%.
  • Type I SNPs are variations where alleles are in homologous condition in each genotype.
  • Type II SNPs have heterologous alleles in one genotype and homologous allele in other genotype.
  • Type III SNPs are typically derived from paralogous or homeologous sequences in the genome, and have heterologous alleles within each genotype. These SNPs can be further filtered and formatted with flanking sequence information to fit for multiple SNP genotyping assay formats including
  • Figures 11A-C show typical distributions of Type I, II, and III SNPs in cotton using the systems and methods provided herein.
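The three SNP types described above can be told apart by counting how many genotypes carry more than one allele at the locus. The sketch below assumes that reading of the definitions: Type I when every genotype shows a single allele, Type II when some but not all genotypes show two alleles, and Type III when every genotype shows two alleles. The function name `classify_snp` and the input layout are hypothetical.

```python
from typing import Dict, Set


def classify_snp(alleles_by_genotype: Dict[str, Set[str]]) -> str:
    """Classify a SNP from the set of alleles called in each genotype."""
    mixed = sum(1 for alleles in alleles_by_genotype.values() if len(alleles) > 1)
    if mixed == 0:
        return "Type I"      # one allele within each genotype
    if mixed == len(alleles_by_genotype):
        return "Type III"    # two alleles within every genotype
    return "Type II"         # two alleles in some genotypes, one in the others


# Hypothetical two-parent examples.
type_1 = classify_snp({"parent_A": {"A"}, "parent_B": {"G"}})
type_2 = classify_snp({"parent_A": {"A", "G"}, "parent_B": {"G"}})
type_3 = classify_snp({"parent_A": {"A", "G"}, "parent_B": {"A", "G"}})
```

The per-genotype allele sets would come from an allele-calling step with depth and frequency thresholds, so the classification is only as reliable as those calls.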
  • Single nucleotide polymorphism (SNP) markers have become markers of choice for marker assisted selection (MAS) in crop improvement programs because of their higher abundance, amenability for automation and availability of high throughput genotyping platforms.
  • Complexity reduction approaches combined with high throughput sequencing technologies have enabled rapid development of informative SNP markers.
  • Genotyping-by-Sequencing (GBS) methods offer high throughput approaches for SNP discovery and genotyping.
  • GBS Genotyping-by-Sequencing
  • the HAPSNP pipeline disclosed herein is combined with a GBS data/system to distinguish homologous loci from paralogous loci.
  • This particular embodiment of the HAPSNP pipeline can extract exact homology matches from high throughput sequencing data using the STACKs program and is designed with multiple custom scripts to decipher the homologous sequence tags to provide at least one of the following advantages: (i) identifying putative SNPs; (ii) generating haplotype information and allelic frequency of loci across multiple genotypes; (iii) enhancing the ability to identify high quality SNPs using the haplotype information and allelic frequency; (iv) facilitating a redundancy check within the SNP dataset; and (v) providing the SNP sequence in an assay convertible format. SNPs identified from this pipeline are converted into genotyping assays and are validated with a success rate (polymorphism rate) of up to 75% across various genotypes.
  • the efficiency of this pipeline is relatively high due to (i) its high assay validation rate (~75%) compared to other SNP mining programs (<25% for polyploid species); and (ii) its robustness in handling huge datasets for allele mining (>10 million sequences) compared to other SNP mining programs (<1 million sequences).
  • the utility of this pipeline can be extended to other complex diploids, polyploid crop species and targeted de-novo or re-sequencing projects to identify true SNPs.
  • HAPSNP pipeline provided in this example can be implemented for single nucleotide variation detection in any crop and it can also be used for formatting of the SNP sequence information to suit assay designing for multiple genotyping chemistries such as Illumina GoldenGate, Infinium, iSelect, TaqMan or KASPar assays.
  • Figure 12 represents the flowchart of this GBS-HAPSNP pipeline. The utility of this pipeline can also be extended for routine genotyping from GBS experiments in complex polyploids, including G. hirsutum, G. barbadense, or G. mustelinum.

Abstract

This invention is related to systems and methods for discovery and/or classification of single-nucleotide polymorphism (SNP) markers. The SNP sequences identified and/or classified using the systems and methods disclosed can be useful for phenotype or trait association studies. In particular, a haplotype based pipeline for SNP discovery and/or classification (HAPSNP) is provided and the disclosed systems and methods can be especially useful for polyploid and complex plant genomes.

Description

HAPLOTYPE BASED PIPELINE FOR SNP DISCOVERY AND/OR CLASSIFICATION
FIELD OF THE INVENTION
[0001] This invention is generally related to the field of bioinformatics, and more specifically the field of discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism.
BACKGROUND OF THE INVENTION
[0002] Single nucleotide polymorphism (SNP) markers have become markers of choice for marker assisted selection (MAS) in crop improvement programs because of their higher abundance, amenability for automation and availability of high throughput genotyping platforms. However, current methodology for identifying SNPs in plants has many limitations, including a very high rate of false positives. This problem is especially challenging for plants with complex genomes. Thus, there remains a need for methodology which can identify and/or classify SNPs efficiently and accurately.
SUMMARY OF THE INVENTION
[0003] This invention is related to systems and methods for discovery and/or classification of single-nucleotide polymorphism (SNP) markers. The candidate SNP markers identified and/or classified using the systems and methods disclosed can be useful for phenotype or trait association studies. In particular, a haplotype based pipeline for SNP discovery and/or classification (HAPSNP) is provided and the disclosed systems and methods can be especially useful for polyploid and complex plant genomes.
[0004] In one aspect, provided is a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism. The system comprises:
(a) an input device and an output device/interface;
(b) an analysis system interface coupled to memory of a computer;
(c) an operating system optionally comprising a database;
(d) a controller module for SNP calling or Loci identification; and
(e) a filtration engine for removing unreliable SNPs.
[0005] In one embodiment, the system further comprises at least one of alignment module, assembly/mapping module, haplotype calling module, and SNP sequence formatting module. In a further or alternative embodiment, the system comprises all of assembly/mapping module(s), haplotype calling module, and SNP sequence formatting module. In another embodiment, the system further comprises a SNP marker classification module. In one embodiment, the system comprises an automatic sequencer or DNA sequencing machine. In another embodiment, the input device is selected from the group consisting of automated sequencer, sequencing data input device, and sequencing data storage device. In another embodiment, the output interface comprises a list of candidate SNP markers.
[0006] In one embodiment, the database described herein contains information selected from the group consisting of SNPs, contigs with at least one SNP, and haplotypes with at least two SNPs. In another embodiment, the filtration engine for removing unreliable SNPs comprises at least four SNP filters to remove unreliable SNPs and generate reliable SNPs. In another embodiment, the filtration engine for removing unreliable SNPs comprises at least five SNP filters to remove unreliable SNPs and generate reliable SNPs. In another embodiment, the filtration engine for removing unreliable SNPs comprises at least six SNP filters to remove unreliable SNPs and generate reliable SNPs.
[0007] In one embodiment, the assembly/mapping module converts raw sequence data into contig FASTA files and/or ACE files. In another embodiment, the haplotype calling module generates haplotype data by examining patterns of SNP loci across contigs. In another embodiment, the SNP sequence formatting module generates candidate SNP markers with flanking sequences. In a further or alternative embodiment, the candidate SNP markers have a validation success rate of at least or greater than 60%. In a further embodiment, the candidate SNP markers have a validation success rate of from 60% to 80%. In a further embodiment, the candidate SNP markers have a validation success rate of about 75%.
[0008] In another embodiment, the computerized system provided comprises a controller module for SNP calling, and further comprises an assembly/mapping module, a haplotype calling module, and a SNP sequence formatting module. In another embodiment, the computerized system provided comprises a controller module for Loci identification, and further comprises an alignment module and a SNP sequence formatting module.
[0009] In another aspect, provided is a method for use in a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism. The method comprises:
(a) assembling/mapping sequence data using an assembly/mapping module;
(b) identifying all possible SNPs using a SNP calling module; and
(c) generating reliable SNPs using a SNP filtration module.
[0010] In one embodiment, the method further comprises determining haplotype using at least one haplotype calling module. In a further or alternative embodiment, the method further comprises formatting candidate SNP markers using at least one SNP sequence formatting module.
[0011] In one embodiment, the computerized system of the method comprises a system described herein. In another embodiment, the method provides candidate SNP markers having a validation success rate of at least or greater than 60%. In a further embodiment, the candidate SNP markers have a validation success rate of from 60% to 80%. In a further embodiment, the candidate SNP markers have a validation success rate of about 75%. In a further or alternative embodiment, the method provides candidate SNP markers having a validation success rate at least two-fold that of a publicly available program. In another embodiment, the method provides candidate SNP markers having a validation success rate at least one and a half-fold that of a publicly available program (i.e., at least a 50% increase). In a further embodiment, the publicly available program is QualitySNP. The QualitySNP program is disclosed in Tang et al., BMC Bioinformatics 7:438 (2006), the content of which is incorporated by reference in its entirety.
[0012] In one embodiment, the system or method disclosed provides that the candidate SNP markers are classified into at least two types using at least one SNP marker classification module. In a further embodiment, the candidate SNP markers are classified into at least three types. In another embodiment, the system or method disclosed provides that at least one type of the candidate SNP markers has a validation success rate of at least or greater than 60%. In a further embodiment, the candidate SNP markers have a validation success rate of from 60% to 80%. In a further embodiment, the candidate SNP markers have a validation success rate of about 75%. In another embodiment, the system or method disclosed provides that at least one type of the candidate SNP markers has a validation success rate at least two-fold that of a publicly available program. In a further embodiment, the publicly available program is QualitySNP.
[0013] In some embodiments, the organism for the systems or methods described herein comprises a polyploid genome. In some embodiments, the organism is a plant. In some embodiments, the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat. In another embodiment, the method provided further comprises classifying candidate SNP markers using a SNP marker classification module. In a further embodiment, at least one type of the candidate SNP markers has a validation success rate of at least 60%. In a further embodiment, at least two types of the candidate SNP markers have a validation success rate of at least 60%.
[0014] In another aspect, provided is a method for use in a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism in combination with a system for genotyping-by-sequencing (GBS). The method comprises:
(a) aligning genotyping-by-sequencing (GBS) data using an alignment module;
(b) generating reliable SNPs using a SNP filtration module; and
(c) formatting candidate SNP markers using a SNP sequence formatting module.
[0015] In one embodiment, the computerized system of the method comprises a system described herein. In another embodiment, the method provided further comprises identifying SNP loci using a Loci identification module. In another embodiment, the method provides candidate SNP markers having a validation success rate of at least or greater than 60%. In a further embodiment, the candidate SNP markers have a validation success rate of from 60% to 80%. In a further embodiment, the candidate SNP markers have a validation success rate of about 75%. In a further or alternative embodiment, the method provides candidate SNP markers having a validation success rate at least two-fold that of a publicly available program. In another embodiment, the method provides candidate SNP markers having a validation success rate at least one and a half-fold that of a publicly available program (i.e., at least a 50% increase). In a further embodiment, the publicly available program is QualitySNP. The QualitySNP program can be obtained from the world wide website bioinformatics.nl/tools/snpweb as disclosed in Tang et al., BMC Bioinformatics 7:438 (2006), the content of which is incorporated by reference in its entirety.
[0016] In one embodiment, the system or method disclosed provides that the candidate SNP markers are classified into at least two types using at least one SNP marker classification module. In a further embodiment, the candidate SNP markers are classified into at least three types. In another embodiment, the system or method disclosed provides that at least one type of the candidate SNP markers has a validation success rate of at least or greater than 60%. In a further embodiment, the candidate SNP markers have a validation success rate of from 60% to 80%. In a further embodiment, the candidate SNP markers have a validation success rate of about 75%. In another embodiment, the system or method disclosed provides that at least one type of the candidate SNP markers has a validation success rate at least two-fold that of a publicly available program. In a further embodiment, the publicly available program is STACKs.
[0017] In some embodiments, the organism for the systems or methods described herein comprises a polyploid genome. In some embodiments, the organism is a plant. In some embodiments, the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat. In some embodiments, the plant is G. hirsutum, G. barbadense, or G. mustelinum.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] Figure 1 shows an exemplary embodiment of the HAPSNP pipeline provided herein. Five modules of the system are illustrated: (1) Assembly/mapping; (2) SNP calling; (3) SNP filtration; (4) Haplotype calling; and (5) SNP sequence formatting.
[0019] Figure 2 shows an exemplary system provided herein.
[0020] Figure 3 shows exemplary input sequences from raw sequencing data.
[0021] Figure 4 shows an exemplary output screen shot after the assembly/mapping module.
[0022] Figure 5 shows an exemplary output screen for possible SNPs after the SNP calling module.
[0023] Figure 6 shows an exemplary output screen for homopolymer region SNPs after the SNP filtration module.
[0024] Figure 7 shows an exemplary output screen for filtered SNP after the SNP filtration module.
[0025] Figure 8 shows an exemplary output screen after the haplotype calling module.
[0026] Figure 9 shows an exemplary output screen after the SNP sequence formatting module.
[0027] Figure 10 shows an exemplary output screen after both the haplotype calling and the SNP sequence formatting module.
[0028] Figure 11 shows an example of Type I, II, and III SNPs in cotton identified using an exemplary system and method provided herein. Figure 11A shows a typical distribution of Type I SNPs; Figure 11B shows a typical distribution of Type II SNPs; Figure 11C shows a typical distribution of Type III SNPs.
[0029] Figure 12 shows an exemplary embodiment of the HAPSNP pipeline provided to be combined with genotyping-by-sequencing (GBS). Four modules of the system are illustrated: (1) Alignment; (2) Loci identification; (3) SNP filtration; and (4) SNP sequence formatting.
DETAILED DESCRIPTION OF THE INVENTION
[0030] SNP marker development in polyploid crop species, unlike in diploids, is very challenging due to the existence of multiple sub-genomes in the nucleus. Due to the presence of duplicated loci in the sub-genomes, it is very difficult to distinguish true SNPs (allelic variations in homologs) from false SNPs (non-allelic variations in paralogs).
[0031] Previously, transcriptome and genome complexity reduction techniques combined with high throughput sequencing technologies have been used to enable rapid development of informative SNP markers. SNP mining programs (for example, AutoSNP) have been developed to use allelic frequency as a measure of SNP confidence, but allelic frequency alone is not a good measure of SNP quality, especially in polyploid crops. High genomic complexity, narrow genetic base, polyploid nature, and lack of a reference genome are major factors hindering development of candidate SNP markers in cultivated cotton, canola and other species.
[0032] As used herein, the phrase "candidate SNP markers" refers to SNP sequences identified to be validated using biological and/or other assays as associated with traits or phenotypes of an organism, for example plants. As used herein, the term "plant" includes dicotyledonous plants and monocotyledonous plants. Examples of dicotyledonous plants include tobacco, Arabidopsis, soybean, tomato, papaya, canola, sunflower, cotton, alfalfa, potato, grapevine, pigeon pea, pea, Brassica, chickpea, sugar beet, rapeseed, watermelon, melon, pepper, peanut, pumpkin, radish, spinach, squash, broccoli, cabbage, carrot, cauliflower, celery, Chinese cabbage, cucumber, eggplant, and lettuce. Examples of monocotyledonous plants include corn, rice, wheat, sugarcane, barley, rye, sorghum, orchids, bamboo, banana, cattails, lilies, oat, onion, millet, and triticale.
[0033] As used herein, the phrase "linkage analysis" refers to a method used to identify SNPs close or adjacent to one another in the same contig, chromosome, or a stretch of sequence defined otherwise. Methods for construction of contigs are well known in the art, for example, see the CAP3 program disclosed in Huang, X. and A. Madan "CAP3: A DNA Sequence Assembly Program." Genome Research 9(9): 868-877 (1999), the content of which is incorporated by reference in its entirety.
[0034] As used herein, the phrase "polymorphism" refers to a difference of DNA bases in genomes/chromosomes of organisms. In some embodiments, the polymorphism may reside within coding sequence of an open reading frame. Alternatively, it may reside within non-coding sequences. As used herein, all bases that have variations from genomes/chromosomes of organisms can be considered as polymorphism, which will be distinguished from errors introduced by human manipulation such as sequencing error or mutation introduced during amplification.
[0035] As used herein, the phrase "haplotype" refers to a group of SNPs that are generally inherited together. Haplotypes can have stronger correlations with traits or phenotypic effects compared with individual SNPs, and therefore may provide increased diagnostic accuracy in some cases (see e.g., Stephens et al. (2001) Science 293: 489-493).
[0036] In the field of bioinformatics, FASTA format was introduced by Bill Pearson and David Lipman in 1988 for representing either nucleotide or amino acid sequences (see Pearson and Lipman, "Improved tools for biological sequence comparison" (1988) Proc. Natl. Acad. Sci. USA 85:2444-2448). Basically, a sequence in FASTA format is a text-based record beginning with a single-line description containing a greater-than symbol (>) in the first column, followed by lines of sequence data. In addition, an ACE file is a commonly used file format for representing a sequence assembly.
[0037] To increase the efficiency of SNP detection from homologous sequences, and reduce the risk of a high false positive rate due to the presence of homeologous sub-genomes in polyploid genomes like cotton and canola, the systems and methods disclosed herein provide a SNP detection pipeline which utilizes the haplotype information to distinguish homologous loci from paralogous loci.
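As an illustration of the FASTA layout described above, the following minimal Python sketch parses FASTA text into a mapping from description line to sequence. The function name `parse_fasta` and the sample records are hypothetical and are not part of the disclosed pipeline.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {description: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' in the first column starts a new record
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence data may span several lines; collect them all
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record example
example = """>contig_1 hypothetical example
ACGTAC
GGTA
>contig_2
TTGACA
"""
contigs = parse_fasta(example)
```

Multi-line sequences are concatenated, so a contig is returned as a single string regardless of how it was wrapped in the file.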
[0038] In one embodiment, this Haplotype Based Pipeline for SNP Discovery and/or Classification (HAPSNP) provided herein uses high throughput sequence data assembly tools along with multiple custom scripts to decipher the contig assembly sequence by (i) identifying putative SNPs initially; (ii) generating haplotype information and allelic frequency of loci in respective genotypes; and (iii) enhancing the ability to identify high quality SNPs using the haplotype information and allelic frequency. This exemplary pipeline functions well for SNP marker discovery using the sequence information from biparental resources, for example, in both cotton and canola. SNPs identified from this pipeline can be converted into genotyping assays and can be validated with a polymorphism success rate of 60-80% across various genotypes.
[0039] The efficiency of the HAPSNP pipeline provided herein is relatively high owing to (i) a high assay validation rate (60-80%) compared to other SNP mining programs (<25% for polyploid species); and (ii) greater robustness in handling huge datasets for allele mining (>10 million sequences) compared to other SNP mining programs (<1 million sequences). The utility of the exemplary HAPSNP pipeline provided herein can be extended to other complex diploids, polyploid crop species and targeted de-novo or re-sequencing projects to identify true SNPs and also to analyze multiple types of sequence data (for example, from 454 Life Science Corporation, Applied Biosystems (ABI), and/or Illumina Inc.) from multiple (>2) parental sources.
[0040] The HAPSNP pipeline provided herein can be implemented for single nucleotide variation detection in any organism including plants and it can also be used for formatting of the SNP sequence information to suit assay designing for multiple genotyping chemistries, for example, Illumina Inc.'s GoldenGate assay, Infinium® iSelect® beadchip, or KBioscience's KASPar® assay.
[0041] An exemplary embodiment of the HAPSNP pipeline is shown in Figure 1, where five modules of the system are illustrated: (1) Assembly/mapping; (2) SNP calling; (3) SNP filtration; (4) Haplotype calling; and (5) SNP sequence formatting.
[0042] For module (1) Assembly/mapping, the input can be raw sequencing data. The raw sequencing data can be generated for either de novo or re-sequencing purposes on next generation sequencing (NGS) instruments and can be initially quality filtered according to the standard criteria set by the NGS instrument manufacturers. In some embodiments, sequences from two or more sources (for example, genotypes and/or parental lines) can be assembled into contigs using de novo assembly programs (for example, Celera Assembler) or, when a reference genome or sequence exists, mapped to the reference genome or sequence using other programs (for example, the Mosaik program). In one embodiment, the assembled or mapped data is converted to .ace file format for further processing. Thus, the output of module (1) can be either contig FASTA files or .ace files.
[0043] For module (2) SNP calling, the input typically includes .ace files as illustrated in Figure 1. In one embodiment, all possible loci with single nucleotide variations in the contig regions are identified by a custom designed script. In a further or separate embodiment, the systems and methods provided herein allow the user to set the sequencing depth required at the SNP position for each allele and the allelic frequency per genotype required for SNP allele calling. The major function of module (2) is to remove most of the false SNPs arising from sequencing errors, and this function is critical for distinguishing the allelic variants (homologous, true SNPs) from the non-allelic variants (homeologous or paralogous, false SNPs) and for haplotype calling (see description for module (4) below). The output of module (2) may include all possible SNPs and contigs generated from SNPs.
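The user-set thresholds described for module (2) can be sketched as follows. This is a minimal Python illustration, not the custom script of the disclosure: the function names, the per-genotype pileup strings, and the default thresholds (`min_depth=2`, `min_freq=0.2`) are all assumptions chosen for the example.

```python
from collections import Counter


def call_alleles(bases, min_depth=2, min_freq=0.2):
    """Return the alleles supported in one genotype at one position.

    bases: string of read bases observed at the position (hypothetical input).
    An allele is called only if it meets both the per-allele depth threshold
    and the within-genotype allelic frequency threshold.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return set()
    return {b for b, n in counts.items()
            if n >= min_depth and n / total >= min_freq}


def is_candidate_snp(pileup_by_genotype, min_depth=2, min_freq=0.2):
    """A position is a candidate SNP if two or more alleles are called overall."""
    called = {g: call_alleles(b, min_depth, min_freq)
              for g, b in pileup_by_genotype.items()}
    alleles = set().union(*called.values())
    return len(alleles) >= 2, called


# Hypothetical pileup: the single 'T' in parent_A is treated as a sequencing error
pileup = {"parent_A": "AAAAAAAT", "parent_B": "GGGGGG"}
is_snp, called = is_candidate_snp(pileup)
```

In this example the lone `T` read fails the depth threshold and is discarded, while the `A`/`G` difference between the two genotypes is retained as a candidate SNP.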
[0044] For module (3) SNP filtration, the input typically includes all identified SNPs from module (2). The major function of module (3) is to remove SNPs found in homopolymer stretches, where sequencing technology is prone to errors (for example, especially technology from 454 Life Science Corporation). Other technology- or project-specific filters are contemplated to be applied in module (3) to further reduce false positives. The SNP filtration module provided differs from existing programs because the filtration used here does not depend on (i.e., is independent of) the number of SNPs, the frequency of duplication of SNPs, or the size of the population, as in existing programs. Further, the HAPSNP pipeline provided herein allows users to choose and/or create a customized SNP filtration unit within the SNP filtration module for a specific purpose, for example, for particular crops including cotton, canola, corn, wheat, sunflower, or soybean.
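One way a homopolymer filter of the kind described for module (3) could work is sketched below in Python. The run-length threshold (`min_run=4`), the function names, and the rule of flagging a SNP when a run of identical bases lies immediately to its left or right are assumptions for illustration; the disclosure does not specify these details.

```python
def flanking_run_lengths(seq, pos):
    """Lengths of identical-base runs immediately left and right of pos (0-based)."""
    left = 0
    i = pos - 1
    while i >= 0 and seq[i] == seq[pos - 1]:
        left += 1
        i -= 1
    right = 0
    j = pos + 1
    while j < len(seq) and seq[j] == seq[pos + 1]:
        right += 1
        j += 1
    return left, right


def in_homopolymer(seq, pos, min_run=4):
    """Flag a SNP position that borders a homopolymer run of at least min_run bases."""
    left, right = flanking_run_lengths(seq, pos)
    return left >= min_run or right >= min_run


def filter_homopolymer_snps(contig_seq, snp_positions, min_run=4):
    """Split SNP positions into those kept and those flagged as homopolymer-region SNPs."""
    kept = [p for p in snp_positions if not in_homopolymer(contig_seq, p, min_run)]
    flagged = [p for p in snp_positions if in_homopolymer(contig_seq, p, min_run)]
    return kept, flagged


# Hypothetical contig with an AAAAA run at positions 4-8
contig = "ACGTAAAAAGCTAGCTAGGCT"
kept, flagged = filter_homopolymer_snps(contig, [3, 9, 14])
```

Positions 3 and 9 border the `AAAAA` run and are flagged, while position 14 is kept. Returning the flagged positions separately mirrors the idea of displaying homopolymer-region SNPs apart from the filtered SNPs.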
[0045] For module (4) Haplotype calling, the input typically includes all possible SNPs or contigs generated from SNPs from module (3). The information from module (3) is used to generate the haplotype information for each contig. As provided herein, each haplotype is defined as a unique combination of alleles in a contiguous series of SNP locations found in a contig. Haplotypes can be generated for each contig by examining the patterns of SNP loci across contigs. SNPs with more than two haplotypes in any of the genotypes (most common in polyploids) or with the same two haplotypes in all the genotypes are considered false SNPs, as they are potentially non-allelic variations between paralogs, and are eliminated from further validation. Thus, the major function of module (4) is to greatly enhance the percentage of true SNPs for validation after haplotype generation. The output of module (4) may include haplotypes generated from contigs/SNPs. Module (4) can optionally include a haplotype filtration unit to filter out false or undesired haplotypes.
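The haplotype rule stated above can be sketched in Python as follows. The sketch assumes, for simplicity, that every read is already aligned to the contig and spans all SNP positions; the function names and the sample reads are hypothetical.

```python
def haplotypes_by_genotype(reads_by_genotype, snp_positions):
    """Collect the unique allele combinations at the SNP positions for each genotype.

    reads_by_genotype: {genotype: [aligned read strings spanning the contig]}
    """
    return {
        genotype: {"".join(read[p] for p in snp_positions) for read in reads}
        for genotype, reads in reads_by_genotype.items()
    }


def is_false_by_haplotype(haps):
    """Apply the two rejection rules from the text.

    Rule 1: more than two haplotypes in any genotype.
    Rule 2: the same two haplotypes in all genotypes.
    """
    sets = list(haps.values())
    if not sets:
        return False
    if any(len(s) > 2 for s in sets):
        return True
    return len(sets[0]) == 2 and all(s == sets[0] for s in sets)


snp_positions = [1, 3]
# Each parent carries a single, different haplotype: retained
true_locus = {"parent_A": ["ACGTA", "ACGTA"], "parent_B": ["ATGCA", "ATGCA"]}
# Both parents carry the same two haplotypes: likely paralogous, rejected
paralog_locus = {"parent_A": ["ACGTA", "ATGCA"], "parent_B": ["ACGTA", "ATGCA"]}

true_haps = haplotypes_by_genotype(true_locus, snp_positions)
paralog_haps = haplotypes_by_genotype(paralog_locus, snp_positions)
```

Calling `is_false_by_haplotype` on `true_haps` retains the locus, whereas `paralog_haps` is rejected because both genotypes show the same pair of haplotypes.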
[0046] For module (5) SNP sequence formatting, the input typically includes filtered SNPs, contigs (FASTA files), and haplotypes generated from contigs/SNPs. In some embodiments, the contig sequences are used to get the flanking sequence for each filtered SNP. In some embodiments, each of the filtered SNP loci is converted to [Allele1/Allele2] format and the flanking sequences are formatted to fit assay design and validation with, for example, KASPar®, Infinium®, or GoldenGate genotyping technology. In some embodiments, if there are multiple SNP loci in the same contig, SNPs other than the selected SNP that lie near the selected position (10 bases upstream and downstream) can be converted into ambiguous bases or wobbles (R, Y, K, M, S, W, H, B, V, D, N) to avoid assay design in the flanking SNP region. The SNPs that are farther away from the selected SNP position can be converted to the major allele. This process reduces the risk of failure during assay validation. The output of module (5) typically includes selected SNPs with flanking sequences, for example, listed in an Excel spreadsheet.
[0047] In some embodiments, the systems or methods disclosed herein further comprise at least one SNP marker classification module. In some embodiments, candidate SNP markers are classified into at least two types using the SNP marker classification module. In other embodiments, candidate SNP markers are classified into at least three types using the SNP marker classification module. The classification can be based on association with genotype or other criteria as demonstrated in examples herein.
[0048] Major advantages of the systems and methods provided herein include at least one of the following: (1) the HAPSNP pipeline disclosed can handle large sequencing data generated from NGS instruments; (2) the HAPSNP pipeline disclosed can use sequencing depth at SNP position and allele frequency to assure the quality of allele calling and distinguish the allelic variations from non-allelic variations between paralogs; (3) the HAPSNP pipeline disclosed can implement haplotype information to further enhance the percentage of true SNPs; and (4) the HAPSNP pipeline disclosed can format the SNP sequence information to suit assay design with multiple genotyping platforms.
[0049] The HAPSNP pipeline provided includes a data storage/database and retrieval system for SNPs/haplotypes integrated with the operating system and analysis system as shown in Figure 2. The input device may include raw sequencing data from genomic DNA, expressed sequence tags (ESTs), genome sequence tags (GSTs), and/or nucleic acid information from other sources such as FASTA files. In some embodiments, the HAPSNP pipeline provided herein allows users to input specific sequence data as desired. The output device may include a unit for generating an Excel spreadsheet to be displayed on a computer screen, a database for SNP/haplotype of contig/alignments (before and after filtration), and/or a user-friendly interface, for example, a web-based interface or e-mail notification system.
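The formatting step of module (5) can be sketched in Python as below. The sketch assumes biallelic SNPs, reads the 10-base rule as "within 10 bases of the selected position", and uses a hypothetical flank length of 15; the function name, the `snps` mapping of position to (major, minor) allele, and the sample contig are illustrative only. The two-base codes are the standard IUPAC nucleotide ambiguity codes.

```python
# Standard IUPAC ambiguity codes for two-base combinations
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def format_snp(contig, target, snps, window=10, flank=15):
    """Format one SNP as left_flank[Allele1/Allele2]right_flank.

    snps: {0-based position: (major_allele, minor_allele)} for the contig.
    Other SNPs within `window` bases of the target become ambiguity codes;
    SNPs farther away are set to their major allele.
    """
    bases = list(contig)
    for pos, (major, minor) in snps.items():
        if pos == target:
            continue
        if abs(pos - target) <= window:
            bases[pos] = IUPAC[frozenset((major, minor))]
        else:
            bases[pos] = major
    left = "".join(bases[max(0, target - flank):target])
    right = "".join(bases[target + 1:target + 1 + flank])
    allele1, allele2 = snps[target]
    return f"{left}[{allele1}/{allele2}]{right}"


# Hypothetical 30-base contig with three SNPs; position 15 is the selected SNP
contig = ("ACGT" * 8)[:30]
snps = {15: ("T", "C"), 12: ("A", "G"), 28: ("T", "A")}
formatted = format_snp(contig, 15, snps)
```

Here the SNP at position 12 falls inside the window and is written as `R` (A/G), while the SNP at position 28 lies outside it and is replaced by its major allele `T`.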
[0050] The HAPSNP pipeline disclosed can be particularly useful for SNP discovery in polyploid species including cotton, canola, and soybean. Further, the HAPSNP pipeline disclosed is powerful enough to identify SNPs from a set of two parents and also to generate haplotype information for "Genotyping by Sequencing" projects used in either quantitative trait locus (QTL) mapping or trait introgression programs, or even for hybrid crops. The utility of the systems and methods disclosed can be extended to analyze the data from multiple sequencing technologies and also multiple parental sources to identify candidate SNP loci for assay validation in, for example, cotton and canola. In addition, the systems and methods disclosed can be used to analyze the NGS data from targeted re-sequencing projects in, for example, soybean, corn, sunflower, and cotton.
EXAMPLES
Example 1
Comparison of the Improved HAPSNP pipeline to Existing Programs
[0051] Sequencing data from cotton can be imported directly into the assembly/mapping module of the HAPSNP pipeline provided as shown in Figure 3. After assembly/mapping (for example, see Module 1 of Figure 1), the output .ace (or ACE) file can be input into the SNP calling module (see Module 2 of Figure 1 and Figure 4). The SNP calling module determines all possible SNPs based on sequence comparison among all input sequences, and optionally a reference sequence is considered for sequence comparison. Contig sequences and identifiers can be included in all SNPs as output after SNP calling as shown in Figure 5. These SNPs/contigs are then subject to SNP filtration (for example, see Module 3 of Figure 1). The SNP filtration module can also determine whether SNPs are in a homopolymer region. If yes, the homopolymer region SNPs can be displayed as shown in Figure 6. After SNP filtration, false positive SNPs are removed and the remaining filtered SNPs are input into the SNP sequence formatting module as shown in Figure 1.
[0052] Separately, all possible SNPs are subject to the haplotype calling module (for example, see Module 4 of Figure 1). The haplotype calling module can optionally include a haplotype filtration unit which is independent from the SNP filtration module. The haplotype information can be input into the SNP sequence formatting module to be considered for association with genotypes after combination with filtered SNPs (see Figure 9 for an example of a haplotype output).
[0053] Finally, the SNP sequence formatting module (for example, see Module 5 of Figure 1) compiles filtered SNPs with flanking sequences together with haplotype information (optionally filtered) to determine contigs containing "candidate SNP markers" (see Figure 10 as an example of the output of candidate SNP markers with contig identifiers).
[0054] As shown in Figure 10, the output of the HAPSNP pipeline provided herein can include (1) contig identifier information, (2) contig sequence information, (3) SNP sequence information, and (4) haplotype designation.
[0055] The HAPSNP pipeline of this example is compared to unmodified publicly available programs, including QualitySNP and Consortium, in either cotton or canola. As shown in Table 1, the HAPSNP pipeline provided can increase the validation success of candidate SNP markers more than two-fold, from about 27-33% to about 60-69%.
[0056] For the canola project, 1,499 SNP markers are validated out of 4,568 candidates from the Canada SNP Discovery Consortium, giving a validation success of 32.8%. Using the HAPSNP pipeline of this example, a validation success of 60.5% is calculated from a combination of two studies [(1,374 + 5,285)/(2,171 + 8,830) = 60.5%]: in the first set (for example, Type I), 1,374 SNP markers are validated out of 2,171 candidate SNP markers (a 63% validation success by itself), and in the second set (for example, Type II), 5,285 SNP markers are validated out of 8,830 candidate SNP markers (a 60% validation success by itself). Thus, the HAPSNP pipeline provided is superior to existing programs for SNP discovery and/or validation and is able to achieve a validation success rate of more than 60%.
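The validation percentages quoted above can be reproduced with a few lines of Python. The counts are taken from the text; the function and variable names are illustrative.

```python
def validation_rate(validated, candidates):
    """Validation success as a percentage of candidate SNP markers."""
    return 100.0 * validated / candidates


# Counts as reported for the canola project
consortium = validation_rate(1499, 4568)                 # Consortium candidates
type_i = validation_rate(1374, 2171)                     # first HAPSNP set
type_ii = validation_rate(5285, 8830)                    # second HAPSNP set
combined = validation_rate(1374 + 5285, 2171 + 8830)     # both sets pooled

summary = {
    "consortium": round(consortium, 1),
    "type_i": round(type_i),
    "type_ii": round(type_ii),
    "combined": round(combined, 1),
}
```

The combined figure pools validated and candidate counts before dividing, so it is weighted by set size rather than being a simple average of the two per-set percentages.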
[0057] Potential SNPs with a high confidence score can be classified into Type I, II and III SNPs using the allelic information for that locus. Type I SNPs are variations where alleles are in homologous condition in each genotype. Type II SNPs have heterologous alleles in one genotype and a homologous allele in the other genotype. Type III SNPs are typically derived from paralogous or homeologous sequences in the genome, and have heterologous alleles within each genotype. These SNPs can be further filtered and formatted with flanking sequence information to fit multiple SNP genotyping assay formats including GoldenGate®, KASPar®, Infinium® etc. Figures 11A-C show typical distributions of Type I, II, and III SNPs in cotton using the systems and methods provided herein.
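The three SNP types can be illustrated with a small Python classifier. This sketch rests on an interpretive assumption: a "homologous" condition is taken to mean that a single allele is observed in a genotype, and "heterologous" that more than one allele is observed. The function name, the labels for uncalled and monomorphic loci, and the sample inputs are hypothetical.

```python
def classify_snp(alleles_by_genotype):
    """Classify a locus from the set of alleles called in each genotype.

    alleles_by_genotype: {genotype: set of alleles called at the locus}
    """
    sizes = sorted(len(a) for a in alleles_by_genotype.values())
    if not sizes or sizes[0] == 0:
        return "uncalled"  # at least one genotype has no called allele
    if all(n == 1 for n in sizes):
        # One allele per genotype: a SNP only if the genotypes differ
        distinct = set().union(*alleles_by_genotype.values())
        return "Type I" if len(distinct) > 1 else "monomorphic"
    if sizes[0] == 1:
        # Some genotypes carry one allele, others carry more than one
        return "Type II"
    # Every genotype carries more than one allele
    return "Type III"


type_1 = classify_snp({"parent_A": {"A"}, "parent_B": {"G"}})
type_2 = classify_snp({"parent_A": {"A", "G"}, "parent_B": {"G"}})
type_3 = classify_snp({"parent_A": {"A", "G"}, "parent_B": {"A", "G"}})
```

Under this reading, a locus at which both genotypes show the same two alleles is assigned to Type III, consistent with its likely paralogous or homeologous origin.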
Example 2
HAPSNP pipeline in combination with GBS
[0058] Single nucleotide polymorphism (SNP) markers have become markers of choice for marker assisted selection (MAS) in crop improvement programs because of their higher abundance, amenability for automation and availability of high throughput genotyping platforms. Complexity reduction approaches combined with high throughput sequencing technologies have enabled rapid development of informative SNP markers. Genotyping-by- Sequencing (GBS) methods offer high throughput approaches for SNP discovery and genotyping. However, high genomic complexity, narrow genetic base, polyploid nature and lack of reference genome hinder development of candidate SNP markers in cultivated cotton and other non-model plant species. To increase the efficiency of SNP detection from homologous sequences, and reduce the risk of high false positive rate due to the presence of homeologous sub-genomes in polyploid genomes like cotton and canola, the HAPSNP pipeline disclosed herein is used to combine GBS data/system to distinguish homologous loci from paralogous loci. This particular embodiment of HAPSNP pipeline can extract exact homology matches from high throughput sequencing data using STACKs program and is designed with multiple custom scripts to decipher the homologous sequence tags to provide at least one of the following advantages: (i) identifying putative SNPs; (ii) generating haplotype information and allelic frequency of loci across multiple genotypes; (iii) enhancing the ability to identify high quality SNPs using the haplotype information and allelic frequency; (iv) facilitating redundancy check within the SNP dataset; and (v) providing SNP sequence in assay convertible format. SNPs identified from this pipeline are converted into genotyping assays and are validated with a success rate of up to 75% polymorphism rate across various genotypes. 
The efficiency of this pipeline is relatively high due to (i) its high assay validation rate (~75%) compared to other SNP mining programs (<25% for polyploid species); and (ii) its robustness in handling huge datasets for allele mining (>10 million sequences) compared to other SNP mining programs (<1 million sequences). The utility of this pipeline can be extended to other complex diploids, polyploid crop species, and targeted de novo or re-sequencing projects to identify true SNPs.
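As an illustration of how per-genotype allele frequency can help separate allelic SNPs from paralogous (or homeologous) variation, the following is a minimal sketch only and not the custom scripts of the pipeline. It assumes inbred genotypes, so that a true SNP appears fixed within each genotype while paralog-derived variation shows both bases within a single genotype. The function name `classify_site` and the thresholds `min_depth` and `major_frac` are hypothetical.

```python
from collections import Counter


def classify_site(reads_by_genotype, min_depth=4, major_frac=0.9):
    """Classify one variable site from base calls grouped by genotype.

    reads_by_genotype: dict mapping genotype name -> list of base calls.
    min_depth and major_frac are illustrative thresholds (assumptions).
    """
    fixed_alleles = set()
    for genotype, bases in reads_by_genotype.items():
        if len(bases) < min_depth:
            continue  # too little coverage to judge this genotype
        allele, count = Counter(bases).most_common(1)[0]
        if count / len(bases) < major_frac:
            # Both bases occur within one (assumed inbred) genotype:
            # more consistent with paralogous than allelic variation.
            return "paralog-like"
        fixed_alleles.add(allele)
    if not fixed_alleles:
        return "uninformative"
    if len(fixed_alleles) >= 2:
        return "candidate SNP"  # genotypes are fixed for different bases
    return "monomorphic"


if __name__ == "__main__":
    site = {"G1": list("AAAAAA"), "G2": list("GGGGG")}
    print(classify_site(site))
```

A site that passes such a check would still need the redundancy check and assay validation described above before being treated as a reliable marker.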
[0059] Unlike in diploids, SNP marker development in polyploid crop species is very challenging due to the existence of multiple sub-genomes in the nucleus. Due to the presence of duplicated loci in the sub-genomes, it is very difficult to distinguish true SNPs, which arise from allelic variation as in homologs, from false SNPs, which arise from non-allelic variation as in paralogs. The HAPSNP pipeline provided in this example can be implemented for single nucleotide variation detection in any crop, and it can also be used to format the SNP sequence information to suit assay design for multiple genotyping chemistries such as Illumina GoldenGate, Infinium, iSelect, TaqMan or KASPar assays. Figure 12 represents the flowchart of this GBS-HAPSNP pipeline. The utility of this pipeline can also be extended to routine genotyping from GBS experiments in complex polyploids, including G. hirsutum, G. barbadense, or G. mustelinum.
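The formatting step described above can be sketched as follows. This is a hedged example, not the pipeline's own formatter: it assumes the bracketed `LEFT[REF/ALT]RIGHT` notation that is commonly used when submitting SNP sequences for assay design, and the function name `format_snp` and the default flank length are hypothetical.

```python
def format_snp(sequence, pos, ref, alt, flank=50):
    """Return a SNP with flanking sequence as LEFT[REF/ALT]RIGHT.

    sequence: contig or tag sequence containing the SNP.
    pos: 0-based position of the SNP within sequence.
    flank: maximum number of bases kept on each side (assumed default).
    """
    if not 0 <= pos < len(sequence):
        raise ValueError("SNP position is outside the sequence")
    if sequence[pos] != ref:
        raise ValueError("reference base does not match the sequence")
    left = sequence[max(0, pos - flank):pos]
    right = sequence[pos + 1:pos + 1 + flank]
    return f"{left}[{ref}/{alt}]{right}"


if __name__ == "__main__":
    print(format_snp("ACGTACGTAC", 4, "A", "G", flank=3))
```

Flanks are truncated when the SNP lies close to either end of the sequence; a real assay design would typically reject such markers if the remaining flank is too short for primer or probe placement.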

Claims

1. A computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism, comprising,
(a) an input device and an output device/interface;
(b) an analysis system interface coupled to memory of a computer;
(c) an operating system comprising a database;
(d) a controller module for SNP calling or Loci identification; and
(e) a filtration engine for removing unreliable SNPs.
2. The computerized system of claim 1, further comprising at least one of alignment module, assembly/mapping module, haplotype calling module, and SNP sequence formatting module.
3. The computerized system of claim 1, wherein the input device is selected from the group consisting of automated sequencer, sequencing data input device, and sequencing data storage device.
4. The computerized system of claim 1, wherein the output interface comprises a list of candidate SNP markers.
5. The computerized system of claim 1, wherein the database contains information selected from the group consisting of SNPs, contigs with at least one SNP, and haplotypes with at least two SNPs.
6. The computerized system of claim 1, wherein the filtration engine for removing unreliable SNPs comprises at least four SNP filters to remove unreliable SNPs and generate reliable SNPs.
7. The computerized system of claim 2, wherein the SNP sequence formatting module generates candidate SNP markers with flanking sequences.
8. The computerized system of claim 4, further comprising a SNP marker classification module.
9. The computerized system of claim 8, wherein the candidate SNP markers are classified into at least three types using the SNP marker classification module.
10. The computerized system of claim 9, wherein at least one type of the candidate SNP markers has a validation success rate of at least 60%.
11. The computerized system of claim 1, comprising a controller module for SNP calling, and further comprising an assembly/mapping module, a haplotype calling module, and a SNP sequence formatting module.
12. The computerized system of claim 1, comprising a controller module for Loci identification, and further comprising an alignment module and a SNP sequence formatting module.
13. A method for use in a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism, comprising,
(a) assembling/mapping sequence data using an assembly/mapping module;
(b) identifying all possible SNPs using a SNP calling module; and
(c) generating reliable SNPs using a SNP filtration module.
14. The method of claim 13, wherein the computerized system comprises a system of claim 1.
15. The method of claim 13, further comprising determining haplotype using a haplotype calling module.
16. The method of claim 13, further comprising formatting candidate SNP markers using a SNP sequence formatting module.
17. The method of claim 13, wherein the computerized system comprises a system of claim 11.
18. The method of claim 13, wherein the method provides candidate SNP markers having a validation success rate of at least 60%.
19. The method of claim 13, wherein the method provides candidate SNP markers having a validation success rate at least two-fold that of a publicly available program.
20. The method of claim 19, wherein the publicly available program is QualitySNP.
21. The method of claim 13, wherein the organism comprises a polyploid genome.
22. The method of claim 21, wherein the organism is a plant.
23. The method of claim 22, wherein the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat.
24. The method of claim 13, further comprising classifying candidate SNP markers using a SNP marker classification module.
25. The method of claim 24, wherein at least one type of the candidate SNP markers has a validation success rate of at least 60%.
26. The method of claim 25, wherein at least two types of the candidate SNP markers have a validation success rate of at least 60%.
27. A method for use in a computerized system for discovery and/or classification of single nucleotide polymorphism (SNP) markers in an organism in combination with a system for genotyping-by-sequencing (GBS), comprising,
(a) aligning genotyping-by-sequencing (GBS) data using an alignment module;
(b) generating reliable SNPs using a SNP filtration module; and
(c) formatting candidate SNP markers using a SNP sequence formatting module.
28. The method of claim 27, wherein the computerized system comprises a system of claim 1.
29. The method of claim 27, further comprising identifying SNP loci using a Loci identification module.
30. The method of claim 27, wherein the computerized system comprises a system of claim 12.
31. The method of claim 27, wherein the method provides candidate SNP markers having a validation success rate of at least 60%.
32. The method of claim 27, wherein the method provides candidate SNP markers having a validation success rate at least two-fold that of a publicly available program.
33. The method of claim 32, wherein the publicly available program is STACKs.
34. The method of claim 27, wherein the organism comprises a polyploid genome.
35. The method of claim 34, wherein the organism is a plant.
36. The method of claim 35, wherein the plant is selected from the group consisting of cotton, canola, corn, soybean, sunflower, and wheat.
37. The method of claim 36, wherein the plant is G. hirsutum, G. barbadense, or G. mustelinum.
PCT/US2013/020211 2012-01-04 2013-01-04 Haplotype based pipeline for snp discovery and/or classification WO2013103759A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261582861P 2012-01-04 2012-01-04
US61/582,861 2012-01-04

Publications (2)

Publication Number Publication Date
WO2013103759A2 (en) 2013-07-11
WO2013103759A3 (en) 2013-09-06

Family

ID=47522976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/020211 WO2013103759A2 (en) 2012-01-04 2013-01-04 Haplotype based pipeline for snp discovery and/or classification

Country Status (1)

Country Link
WO (1) WO2013103759A2 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
J. M. CATCHEN ET AL: "Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences", G3: GENES|GENOMES|GENETICS, vol. 1, no. 3, 1 August 2011 (2011-08-01), pages 171-182, XP055071071, DOI: 10.1534/g3.111.000240 *
Jan Van Oeveren ET AL: "Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis", Single Nucleotide Polymorphisms, Methods in Molecular Biology 578, 1 January 2009 (2009-01-01), pages 73-91, XP055071076, DOI: 10.1007/978-1-60327-411-1_4, Retrieved from the Internet: URL:http://download.bioon.com.cn/view/upload/month_1004/20100419_ee17b59a19517c3eb17cIBjUuh9eoYMF.attach.pdf [retrieved on 2013-07-12] *
TANG JIFENG ET AL: "HaploSNPer: a web-based allele and SNP detection tool", BMC GENETICS, BIOMED CENTRAL, GB, vol. 9, no. 1, 28 February 2008 (2008-02-28), page 23, XP021032629, ISSN: 1471-2156 *
TANG JIFENG ET AL: "QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species", BMC BIOINFORMATICS, BIOMED CENTRAL, LONDON, GB, vol. 7, no. 1, 9 October 2006 (2006-10-09) , page 438, XP021021578, ISSN: 1471-2105, DOI: 10.1186/1471-2105-7-438 cited in the application *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3117008A4 (en) * 2014-03-14 2017-09-27 Dow AgroSciences LLC Markers linked to reniform nematode resistance
WO2016055971A3 (en) * 2014-10-10 2016-06-02 Invitae Corporation Methods, systems and processes of de novo assembly of sequencing reads
CN106795568A (en) * 2014-10-10 2017-05-31 因维蒂公司 Methods, systems and processes of de novo assembly of sequencing reads
CN106868131A (en) * 2017-02-22 2017-06-20 中国农业科学院棉花研究所 No. 6 chromosomes of upland cotton SNP marker related to fibre strength
CN106868131B (en) * 2017-02-22 2020-12-29 中国农业科学院棉花研究所 SNP molecular marker of upland cotton No. 6 chromosome related to fiber strength
US10176296B2 (en) 2017-05-17 2019-01-08 International Business Machines Corporation Algebraic phasing of polyploids
US10607718B2 (en) 2017-05-17 2020-03-31 International Business Machines Corporation Algebraic phasing of polyploids
CN110106272A (en) * 2019-04-29 2019-08-09 四川农业大学 A kind of Tetraploid Elytrigia 3E chromosome molecular labeling and its application
CN110106272B (en) * 2019-04-29 2022-08-02 四川农业大学 Tetraploid elytrigia elongata 3E chromosome molecular marker and application thereof
CN116779035A (en) * 2023-05-26 2023-09-19 成都基因汇科技有限公司 Polyploid transcriptome subgenomic typing method and computer readable storage medium
CN116779035B (en) * 2023-05-26 2024-03-15 成都基因汇科技有限公司 Polyploid transcriptome subgenomic typing method and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13700121; Country of ref document: EP; Kind code of ref document: A2)
122 Ep: pct application non-entry in european phase (Ref document number: 13700121; Country of ref document: EP; Kind code of ref document: A2)