WO2019138244A1 - Method for identifying genes associated with a particular phenotype - Google Patents

Method for identifying genes associated with a particular phenotype Download PDF

Info

Publication number
WO2019138244A1
WO2019138244A1 PCT/GB2019/050077 GB2019050077W WO2019138244A1 WO 2019138244 A1 WO2019138244 A1 WO 2019138244A1 GB 2019050077 W GB2019050077 W GB 2019050077W WO 2019138244 A1 WO2019138244 A1 WO 2019138244A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
enriched
nucleic acids
mers
phenotype
Prior art date
Application number
PCT/GB2019/050077
Other languages
French (fr)
Inventor
Brande Bruce Hertel WULFF
Sanu ARORA
Burkhard STEUERNAGEL
Kumar Gaurav
Original Assignee
John Innes Centre
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by John Innes Centre filed Critical John Innes Centre
Publication of WO2019138244A1 publication Critical patent/WO2019138244A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the invention relates to the field of genetic association and gene identification, and in particular to methods for identifying genes associated with a particular phenotype.
  • Plant disease resistance is often mediated by dominant or semi-dominant resistance (R) genes which encode receptors that perceive pathogen infection either by direct recognition of pathogen effector molecules, or indirectly by recognition of effector modified host targets.
  • R genes encode intracellular nucleotide binding/leucine-rich repeat (NLR) immune receptor proteins.
  • NLR genes are amongst the most evolutionarily dynamic gene families in plants with hundreds of these genes usually present in plant genomes. Crop plants and their wild progenitors typically possess extensive copy number and sequence variation at these NLR loci. Domestication and intensive breeding have reduced R gene diversity in crops rendering them more vulnerable to disease epidemics.
  • New R genes derived from closely related species have been introduced into crops to improve disease resistance by either relatively laborious intensive breeding or more recently as transgenes.
  • R gene cloning is complicated by extensive pan-genome variation at NLR loci making it difficult to establish orthogonal relationships between a reference genome sequence and other accessions.
  • the structural variation at NLR loci of different accessions can also reduce recombination at the locus further complicating positional cloning approaches.
  • the requirement for recombination in gene cloning can be overcome by mutational genomics. The generation of multiple independent mutants by mutagenesis followed by genome complexity reduction and sequencing (e.g.
  • NLR exome capture (“MutRenSeq”, described in Steuernagel et at., 2016, Nat Biotechnol 34: 652-655; Steuernagel et at., 2017, Methods Mol Biol 1659: 215-229; and WO2015/127185] or chromosome flow sorting and sequencing [“MutChromSeq”, described in Sanchez-Martin et at., 2016, Genome Biol 17: 221 and Steuernagel et al., 2017, Methods Mol Biol 1659: 231-243]), enables the unambiguous identification of mutant alleles and the functional NLR gene (see Table 1).
  • positional cloning and mutational genomics share certain limitations. Notably, they require the R gene to exist as a single gene in an otherwise susceptible genetic background to the pathogen isolate of interest which can take numerous generations to achieve. They also require that the plant target species is amenable for the generation and screening of large recombinant or mutant populations numbering 1000s of individuals (see Table 1). However, many wild relatives of crop plants contain multiple functional resistance genes and are experimentally intractable due to pre-domestication traits such as long generation times, seed shattering, seed dormancy, as well as having seed and plant morphologies not amenable to existing mechanization technologies.
  • GWAS genome-wide association studies
  • GWAS enables trait correlation in a genetically diverse population structure by exploiting pre-existing recombination events accumulated in natural populations.
  • Members within a population are screened with a genome-wide set of markers and linkage disequilibrium sought between markers and traits of interest.
  • Markers linked to traits of interest are then readily assigned to a genomic region in a reference genome.
  • the reliance on a reference genome excludes the identification of sequences that have significantly diverged from the reference, as is often the case for R gene loci.
  • a method for identifying one or more genes associated with a selected phenotype in an organism comprising:
  • the one or more genes may for example be R genes.
  • the database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.
  • Fig. 1 is a flow diagram showing an embodiment of the invention.
  • Fig. 2 is a diagram depicting combining association genetics and resistance gene enrichment sequencing (“AgRenSeq”) for rapid R gene discovery and cloning (a, Crop wild relative diversity panel)
  • a genetically diverse panel of accessions (Ac) is (b, Phenotype) phenotyped with different pathogen races, and (c, RenSeq) subjected to resistance gene enrichment sequencing (“RenSeq”) followed by assembly of the NLR repertoire and extraction of NLR k- mers for each accession, i) Library, ii) Baits, iii) Hybridize, iv) NLR capture, v) Sequence, vi) k- mers, vii) Assembly (d, K-mer association matrix) Each k-mer in each accession was given a phenotype score between +2 and -2 depending on the level of resistance or susceptibility.
  • Fig. 3 shows distribution of 317 KASP markers across the seven Aegilops tauschii chromosome arms. Centromeres are indicated by the letter“C”. Other horizontal bars indicate the chromosomal position of the markers. These markers were used to study the genetic diversity of the Ae. tauschii accessions (Fig. 4B).
  • Fig. 4 depicts genetic architecture of stem rust resistance in Ae. Wilmingtonii.
  • Each integer (dot-column) on the x-axis represents an NLR contig from the RenSeq assembly of a single accession containing the respective Sr gene.
  • Each dot on the y-axis represents one or more RenSeq k-mers associated with resistance across the diversity panel to the respective PGT race.
  • Dot size is proportional to the number of k-mers associated with resistance: (c) RKQQC vs. BW_01 189, Sr33 (d) TPMKC vs. BVV 01191 , Sr46 (e) QTHJC vs. BW_01072, SrTA 1662 and (f) TTKSK vs. BW_01083, Sr45.
  • Fig. 5 is an unweighted pair group method with arithmetic mean (UPGMA) tree of 195 Ae. tauschii accessions based on 317 SNP markers.
  • the tree is divided into two branches representing ssp. strangulata (darker branches) and ssp. tauschii (lighter branches) with an intermediate accession between the two subgroups.
  • the final set of 151 non-redundant Ae. tauschii ssp. strangulata accessions used in AgRenSeq analysis are marked with filled black circles.
  • Fig. 6 is a graph showing principal component analysis of the 195 Ae. tauschii lines.
  • Using 317 SNP markers identified two different subgroups, namely ssp. strangulata (lower dots; Lineage 2) and ssp. tauschii (upper dots; Lineage 1) with an intermediate accession lying between the two subspecies.
  • the PCA distribution of accessions into subgroups corresponded closely with the subgroups depicted in the UPGMA tree.
  • L1 Lineage 1 ;
  • Fig. 7 shows photographs of a phenotypic diversity in the Ae. tauschii panel on the second leaf in response to various PGT races.
  • the infection types were scored using the Stakman infection type scale and the scores ranged from: resistant (;0, ;1-, 1), moderately resistant (1+, 2), moderately susceptible (2+, 3), to susceptible (3+, 4).
  • the accession number and PGT race of the infected leaves from left to right are: BW_01020 (TTKSK), BW_01114 (TPMKC); BW_01020 (TTTTF); BW_01084 (TRTTF); BW_01081 (TTKSK); BW_01039 (TRTTF); BW_01065 (TTKSK); BW_01184 (TPMKC); BW_01125 (TTKSK); BW_01012 (TPMKC).
  • Fig. 8 is a series of histograms showing distribution of seedling stem rust reactions across the 151 non-redundant Ae. tauschii spp. strangulata accessions to PGT races (a) QTHJC (b) RKQQC (c) TTKSK (d) TRTTF (e) TPMKC (f) TTTTF and (g) UK_01.
  • the scale at the bottom of each histogram indicates the variation in reaction types ranging from resistant, moderately resistant, moderately susceptible to susceptible.
  • Fig. 9 shows genetic mapping of SrTA 1662 in an Ae. Wilmingtonii TA1662 x T. aestivum KS05HW14 BC2F1 population (a) PCR amplification of a 750 bp fragment from SrTA1662 using primers R1.F3 and R1.R3. From left to right are genotypes of a resistant BC2F1 individual (U6714-B-047 ( SrTA 1662/srTA1662 )), a susceptible BC2F1 individual (U6714- B-048 (srTA 1662/srTA 1662)) , Ae.
  • a resistant BC2F1 individual U6714-B-047 ( SrTA 1662/srTA1662 )
  • a susceptible BC2F1 individual U6714- B-048 (srTA 1662/srTA 1662)
  • Fig. 10 shows mapping of Sr46 in diploid and hexaploid families resulting in the identification of a sequence tagged site marker psr649 (a, Genetic map, Chr 2DS) located 0.1 cM from the resistance locus.
  • Sr46 candidate NLR gene was validated by transgenic studies (g) where the wild type and sib lines without the transgene showed high infection type of 3+4 (g1), while seven independent transgenic events showed an intermediate infection type (g2) and nine independent transgenic events showed a low infection type (g3, g4) characteristic of a resistant phenotype.
  • I Xgwm1099
  • II Xpsr649
  • III Xbarc297
  • Mut Mutant.
  • Fig. 11 is a series of photographs showing transgenic Sr45 expression provides stem rust resistance in wheat cultivar Fielder.
  • Primary transgenic seedlings show a typical Sr45 infection phenotype when challenged with the Australian stem rust race 98-1 ,2,3,5 and 6.
  • PC1 10-1 to 12 carry an Sr45 construct with the native promoter and terminator sequences, while
  • PC147-1 carries Sr45 regulated by the Sr33 promoter and terminator sequences.
  • Nine out of ten independent PC147 lines conferred seedling resistance typical of Sr45.
  • Fig. 12 Identification of stem rust resistance genes in Ae. tauschii by k-mer— based regression analysis for isolate RKQQC.
  • (a, BW_01005 contigs) Linear regression analysis identifies Sr33 (dot column highlighted in black) and an additional candidate gene (labelled“false positive”)
  • (b) Grouping all candidate contigs identified across the panel followed by multivariate lasso regression for RKQQC indicates that the additional candidate identified by linear regression in (a) is a false positive (FP), while a new candidate gene is identified, which has a marginally greater effect size than Sr33.
  • a method for identifying one or more genes associated with a selected phenotype in an organism comprising:
  • Step (f) may comprise recording the presence or absence of k- mers in each enriched sample sequence and ascribing a score to each k-mer according to the association of each k-mer with the selected phenotype.
  • Step (g) may comprise mapping the identified k- mers onto each of the enriched sample sequences to identify high scoring regions in the enriched sample sequences.
  • Step (d) may comprise enriching the samples for nucleic acids associated with the selected phenotype by hybrid-based capture enrichment.
  • the method may comprise designing a target sequence capture library for enriching the nucleic acids.
  • Step (d) may comprise depleting from the samples nucleic acids not associated with the selected phenotype.
  • the samples may be enriched for nucleic acids for plant resistance (R) proteins.
  • the samples may be enriched for nucleic acids for nucleotide-binding and leucine-rich repeat (NLR) proteins.
  • the samples may be enriched for nucleic acids for membrane-bound receptor proteins.
  • the nucleic acids of the database may comprise non-gene-related sequences, for example structural nucleic acids that may be associated with the selected phenotype.
  • Step (f) may comprise identifying k- mers of two or more fixed lengths k.
  • the organism may for example be a plant.
  • the organism may be a mammal such as a human, a non-human primate, a rat or a mouse.
  • Step (e) may comprise alignment-free assembly of the enriched sample sequences.
  • the sequencing step (e) may comprise long mate-pair scaffolding of the enriched samples (see“CapRenSeq” procedure described below and in Example 2).
  • the sequence information may be obtained using high-throughput sequencing.
  • the genetic samples may comprise multiple nucleic acids associated with similar phenotypes.
  • the multiple nucleic acids within each genetic sample may be associated with the same phenotype.
  • Step (e) may comprise obtaining sequence information from a pool of enriched samples having defined levels of association with the selected phenotype or an opposite phenotype.
  • the database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.
  • the database may comprise genetic samples from a plantation crop plant.
  • the database of the method may comprise or consist of genetic samples from the following non-limiting list of plants: wheat (for example Triticum urartu, T. aestivum, Aegilops sharonensis or Ae. speltoides), rye (for example Secale cereale), millet, onion, palm, tomato, pepper (for example Capsicum annuum), tobacco, canola, cotton (for example Gossypium ssp.), peanut, alfalfa, sunflower, safflower, Hordeum vulgare, oat (for example Avena sativa), Oryza spp., maize (for example Zea mays), sorghum (for example Sorghum bicolor ), sugarcane (for example Saccharum spp.), banana (for example Musa spp.), potato (for example Solarium spp.), Ipomoea batatas, soya bean (for example Glycine max or G.
  • wheat for example Tri
  • Cicer arietinum Cisum sativum, cassava (for example Manihot esculenta), Dioscorea spp., sugar beet (for example Beta vulgaris), legumes (for example Favaceae spp.), Cannabis sativa, Arabidopsis thaliana or Theobroma cacao.
  • Fig. 1 illustrates steps involved in a method according to an aspect of the invention.
  • Reduced representation sequencing for example but not limited to resistance gene enrichment sequencing (“RenSeq”), is conducted on a database such as a diversity panel of samples or a pool of samples.
  • the k- mers in each sample of the diversity panel or the pool of samples are counted and a k- er matrix is generated for all k- mers.
  • the complexity of the data set may be reduced by removing k- mers below or above a certain percentage threshold (minimal and maximal k-mer frequency).
  • the diversity panel samples or pool of samples are phenotyped and the phenotype scores may be converted.
  • the phenotype scores for each k-mer are then summed up and projected onto the assembly of an accession with the desired phenotype.
  • the invention may also be described as comprising the following steps:
  • Step (i) may comprise plotting the k-mers onto a sequenced and assembled reference accession, in which case k-mer scores will be plotted per nucleotide or per gene, or in bins of a defined nucleotide or genetic length.
  • the“database of genetic samples” may be derived from a“diversity panel” or“diversity collection”.
  • the samples or accessions in the database may have been phenotyped for one or more selected traits of interest prior to performing steps of the invention, or the method of the invention may include a step of phenotyping the samples accessions in the database.
  • A“diversity panel” or“diversity collection” is a collection of genetic samples from a diverse population, typically species-specific, and may include nucleic acid fragments prepared from organismal nucleic acids (e.g. genomic DNA) by a method that can reveal polymorphisms or mutations (e.g. sequence differences) between samples.
  • Diversity panels may be published and publically available and/or custom designed according to the species and trait of interest. Individual uniquely identifiable samples within a diversity panel are termed accessions. The accessions within a diversity panel may be genetically unique or redundant.
  • Enrichment is a method by which particular genomic regions of interest are preferentially amplified or selected. Target enrichment can be performed before sequencing, to reduce the amount of sequencing and data processing required. Target enrichment methods include uniplex and multiplexed polymerase chain reaction (PCR), molecular inversion probe (MlP)-based approaches or amplicon based target enrichment and hybrid capture-based approaches.
  • PCR polymerase chain reaction
  • MlP molecular inversion probe
  • hybridized target-specific probes are hybridized to cDNA libraries.
  • the probes may be attached to a microarray or in solution.
  • Hybridized probes may be removed from solution by bead capture to isolate the target DNA from the background DNA.
  • Probes for use in hybrid capture-based target enrichment may be provided as a generic or custom designed base capture library.
  • the target enrichment of the present invention may be performed for example according to the methods as described by Jupe et al. (2013, Plant J. 76:530-544).
  • Next generation sequencing produces large amounts of short reads from fragmented DNA.
  • the reads In order for the sequence to be analyzed the reads must be assembled to produce a reconstruction of the complete original DNA sequence, or assembly. This is done by aligning the reads, for example by finding overlaps between them. Assembly may be performed by mapping to a reference genome or de novo. De novo assembly has the advantage of enabling of a DNA sequence without reference to external data that either may be unavailable (for example when a reference genome is not available) or bias the results, for example in the case of diverse species or diverse genomic regions.
  • RonSeq resistance gene enrichment sequencing
  • Capturing high molecular weight DNA for long-read or linked read RenSeq e.g. PacBio, Nanopore and 10x is technically challenging due to the inherent instability of long single-stranded DNA molecules.
  • RenSeq may therefore also be performed on LMP or 10x libraries and the captured reads used to improve a standard RenSeq assembly.
  • the elements of this“combined assembly procedure” (Cap) for the gene content in a whole genome can be applied in the present invention to the fraction of genes captured by RenSeq, for example to obtain NLR repertoires with a higher proportion of full-length genes (a procedure referred to herein as“CapRenSeq”).
  • Alignment free assembly of sequencing data is a process by which de novo genome assembly is performed. Advantages of alignment free assembly over alignment and reference sequence-based methods of sequence analysis include the increased speed of alignment free assembly. Alignment free methods for sequence analysis include methods based on k-mer analysis and information theory.
  • k-mer frequency may also be referred to as k-mer counting.
  • Methods based on k-mer frequency consider for a fixed or variable integer k, the relative frequencies of all possible k-mers for each of the input sequences and define a distance between the reads based on these frequency vectors.
  • each sequence position may be considered in terms of overlapping k- mers. Contiguous or spaced, e.g. pattern deterministic, k- mers may be used.
  • the method of the present invention comprises k- mer-based association genetics.
  • the method may comprise performing trait associations on subsequences or k- mers generated from raw sequence reads.
  • the method of the present invention may comprise extracting the k- mers from sequence and assembled accessions from a diversity panel.
  • the method of the present invention may comprise computing the genetic identity of the accessions following sequencing and assembly of the sequence reads, based on known SNP markers. Redundant accessions can then be removed before further analysis. This step has the advantage of decreasing the data load of the subsequent analysis steps.
  • the sequencing used in the method of the present invention may be lllumina short-read sequencing by synthesis technology.
  • the term “gene” refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA.
  • the term gene may include one or more exons or part thereof, one or more introns or part thereof, all or a portion of the 5’ untranslated region, and all or a portion of the 3’ untranslated region.
  • the method of invention may also be applied to identifying non-gene nucleic acids associated with a selected phenotype, for example non-coding nucleic acids such as nucleic acids involved in chromosome structure.
  • a comprehensive search across all the accessions may also be performed by grouping the contigs of all the accessions based on sequence similarity for significant candidate contigs, and then jointly testing to retain the most important ones.
  • this allows false positives to be filtered out and aids identification of rare resistance genes.
  • the method of the invention may also be performed by including an additional step of grouping the enriched sample sequences based on the presence or absence of a contig, and testing the association of the contig with the presence of the k- mer.
  • phenotype refers to a detectable appearance or characteristic of an organism. Phenotype is understood to result from the interaction of the organism’s genotype with the environment. Phenotype includes individual phenotypic traits that are detectable and describable after visual inspection, measurement, and/or analysis by chemical, biochemical and/or molecular techniques.
  • R genes Disease Resistance (R) genes from wild relatives are a valuable resource for breeding resistant crops.
  • introgression of R genes into crops is often associated with co-integration of deleterious linked genes and pathogens can rapidly evolve to overcome R genes when deployed singly.
  • Introducing multiple cloned R genes into crops as a stack would avoid linkage drag and delay emergence of resistance breaking pathogen races.
  • current R gene cloning methods require segregating or mutant progenies, which are difficult to generate for many wild relatives due to poor agronomic traits.
  • Ae. tauschii is the wild progenitor species of the Thticum aestivum (bread wheat) D genome and a number of stem rust resistance (Sr) genes that have been introgressed into bread wheat from this species.
  • Sr stem rust resistance
  • strangulata because of the availability of genetically diverse accessions (see Jones et ai., 2013, Theor Appl Genet 126: 1793-1808) and two cloned Sr genes from this species ( Sr33 and Sr45) to serve as positive controls.
  • Sr33 and Sr45 two cloned Sr genes from this species
  • strangulata namely Sr46 (Yu et ai, 2015, Theor Appl Genet 128: 431-443) and SrTA 1662 (Olson et al., 2013, Theor Appl Genet 126: 1179-1188).
  • the SrTA 1662 candidate gene was mapped in a recombinant inbred line population to a 3.8 cM interval shown to encode SrTA 166-resistance, thus supporting our AgRenSeq data (Fig. 9).
  • the SrTA 1662 candidate encodes a coiled-coil NLR protein with 85% amino acid identity to Sr33.
  • Sr46 candidate gene Independently of this AgRenSeq approach, we also identified an Sr46 candidate gene by conventional fine mapping in segregating diploid progenitor and wheat populations coupled with the sequencing of candidate genes in this region in three ethyl methanesulphonate-derived mutants that had lost Sr46 resistance. In two mutants, the same candidate gene contained non-synonymous substitutions, while the third mutant had a deletion of the chromosomal segment encoding this gene (Fig. 10a-f). Comparison of the Sr46 candidate gene identified by AgRenSeq and map-based/mutagenesis cloning showed that they were 100% identical. This gene, hereafter referred to as Sr46, encodes a coiled coil, NLR protein that conferred stem rust resistance when expressed as a transgene in the susceptible wheat cv. Fielder (Fig. 10g; Table 4).
  • the AgRenSeq procedure as described in this example is reference genome-independent and directly identifies the NLR that confers resistance rather than identifying a genomic region encoding multiple paralogs for subsequent candidate gene confirmation.
  • the k-mer strategy could also be adapted for associative transcriptomics which allows trait correlation based on variation in transcript sequence and abundance (Harper etal., 2012, Nat Biotechnol 30: 798-802).
  • the low basal level of NLR gene expression may necessitate costly and computationally challenging ultra-deep transcript sequencing unless cDNAs were enriched by RenSeq (Andolfo et al., 2014, BMC Plant Biol 14: 120).
  • AgRenSeq is also not restricted to the limited genetic variation and recombination present in bi-parental populations, but rather interrogates pan genome sequence variation in diverse germplasm to isolate uncharacterised R genes (or other genes). Thus, importantly, no additional crossing or mutagenesis is required for R gene cloning using this approach. This makes it possible to clone R genes from accessions of a wild species lacking any advanced agronomic traits and requires only the phenotyping of an enrichment-sequenced diversity panel.
  • a set of 195 Ae. tauschii accessions were assembled from the Wheat Genetics Resource Center (Kansas State University, USA), The Leibniz Institute of Plant Genetics and Crop Plant Research (Gatersleben, Germany), National Small Grains Collection (Beltsville, Maryland, USA), International Center for Agricultural Research in the Dry Areas (Aleppo, Iran), The Commonwealth Scientific and Industrial Research Organisation (Canberra, Australia), The N. I. Vavilov Institute of Plant Industry (St. Russia), University of California-Davis (Davis, California, USA), and Agriculture and Agri-Food Canada (Winnipeg, Canada). The majority of these accessions were originally collected from diverse collection sites around the Caspian Sea (Fig. 4a).
  • NLRs We used several available sequence resources from Ae. tauschii and related genomes to generate a set of baits with high potential to capture the NLR repertoire across the Ae. tauschii species complex.
  • Transcriptome raw data from publicly available datasets from Ae. tauschii (ERR420230, ERR420231), Triticum urartu (PRJNA191053; Krasileva et al., 2013, Genome Biol 14: R66), T. turgidum (PRJNA191054; Krasileva et al., 2013, supra), Ae.
  • KASP markers A total of 2,1 10 SNP markers were used in the design of the bait library. These markers were composed of: 331 KASP markers selected from an existing screen of six Ae. Wilmingtonii accessions (Ent-230, Ent-088, Ent-323, Ent-336, Ent-392, Ent-414) (available from www.cerealsdb.net; Wilkinson et al. 2012, BMC Bioinformatics 13: 219); 682 KASP markers designed using the probe sequences from the 820K Axiom array (Affymetrix Inc.; Winfield et al. 2016, Plant Biotechnol J 14: 1195-1206) based on polymorphism among fourteen Ae.
  • leyii accessions data available from www.cerealsdb.net and in Winfield et al., 2016 [supra]
  • 363 KASP markers selected from screening of the six Ae. tauschii accessions (as above) on the 90K iSelect array (Wang et al., 2014, Plant Biotechnol J 12: 787-796); 662 probes from the 90K D-genome markers were shortlisted based on D-genome map position and conversion into KASP markers on the NIAB MAGIC population (Gardner et al., 2016, Plant Biotechnol J 14: 1406-1417; map available from www.niab.com/magic); 38 KASP markers from a previous characterization of Ae. tauschii diversity (Jones et al., 2013, Theor Appl Genet 126: 1793-1808); and 34 KASP markers used for the development of standardized marker-assisted selection assays
  • lllumina libraries with an average insert size of 700 bp were enriched in bulks of four by MYcroarray, Michigan, USA, as previously described (Steuernagel etal., 2017, Methods Mol Biol 1659: 215-229), and sequenced on an lllumina HiSeq with 250 bp PE reads at Novogene, China.
  • Bait sequences were aligned to AL8/78 genome sequence using BLASTn (Zhang et al., 2000, J Comput Biol 7: 203-214). Local alignments were filtered for matches with at least length of 100 and 80% identity. Regions from the genome sequence with matches to at least two baits were extracted including 1000 bp flanking sequence. Reads from RenSeq of AL8/78 accession were aligned to extracted regions using BWA (Li & Durbin, 2009, Bioinformatics 25: 1754-1760).
  • the output of the CLC Assembler was a FASTA file containing all the contig sequences. Construction of phylogenetic tree and selection of non-redundant accessions
  • the KASP marker sequences were extracted from the RenSeq assemblies of each accession using BLASTn and the sequences were multiple-aligned to score the SNP variation across the accessions.
  • the marker sequences with minor allele frequency (MAF) less than 5% in the panel and missing data greater than 60% were excluded from the analysis.
  • the resulting genotyping matrix consisting of 316 KASP markers was used to build a UPGMA (unweighted pair group method with arithmetic mean) based tree using the majority consensus rule for drawing the 100 bootstraps using the Bio.Phylo module from the Biopython [http://biopython.org] package (Talevich et at., 2012, BMC Bioinformatics 13: 209).
  • a python script was used to generate an iTOL [https://itol.embl.de/] compatible tree for rendering and annotation.
  • an identity- by-state matrix was computed for all possible pairs of accessions to group those accessions with 399% genetic identity.
  • the set of 151 non-redundant accessions were selected for association mapping based on genetic identity, variation in phenotype data of accessions within identity group, significance of any accession reported in the literature and the seed availability in our stock.
  • Phenotype scores were converted from Stakman infection types (Stakman et al., 1962, United States Department of Agriculture, Agricultural Research Service E-617) to numeric values between -2 and 2 associating (0, ;, 1-, 1 , 1 +, 2-, 2, 2+, 3-, 3, 3+, 4) with (2, 1.67, 1.33, 1 , 0.67, 0.33, 0, -0.33, -0.67, -1 , -1.33, -2). Values from replicates were combined into a mean. For a resistant accession, i.e.
  • AgRenSeq plots were generated using R where each integer on the x-axis corresponds to one contig. Dots on plot are according to scores of k- mers within each contig. Dotsize was plotted according to number of k- mers with the specific score.
  • Fig. 4b The phylogenetic tree with PGT phenotypes and Sr gene genotypes displayed in concentric circles (Fig. 4b) was constructed with iTOL v3 (Letunic & Bork, 2016, Nucleic Acids Res 44: W242-245).
  • sequences were identified with (i) a truncated NLR that carried only the nucleotide binding domain, and (ii) an inosine-5'-monophosphate dehydrogenase (previously annotated as ‘Putative brown planthopper-induced resistance protein 1' and which possesses a CBS domain, a small domain originally identified in cystathionine beta-synthase), that co segregated with Sr46 (Fig. 10b).
  • Amplified fragments generated from consensus regions of these wheat NLRs were used to screen an AUS18913 BAC library (Moullet et at., 1999, Theor Appl Genet 99: 305-313).
  • a specific amplicon, primer S46ConsFA (5’- GCT G AGATT GTGCTT CTT CT AG-3’ , SEQ ID NO: 1) + primer S46PREVR (5’- CATTGTACGCTCTGTCCAATA-3’, SEQ ID NO: 2) derived from re-sequencing AUS18913 with the consensus NLR cosegregated with Sr46 and the corresponding BAC clone (2A/B_60P19) was sequenced (Fig. 10e).
  • BAC2A/B_60P19 A single full-length gene with NB-ARC and LRR domains was identified in BAC2A/B_60P19 and further investigated as a candidate gene for Sr46 (Fig. 10f).
  • Ethyl methane sulphonate (EMS) mutagenesis was performed on line R9.3 according to procedures previously described (Mago et ai, 2017, Methods Mol Biol 1659: 199-205).
  • M 1 BS, M2S, M3B Three M 2 mutants were identified (M 1 BS, M2S, M3B) and progeny tested at M 3 and M 4 generations.
  • Mutant M1 BS carried a large deletion of chromosome 2DS, whereas point mutations occurred in M2S and M3B (Fig. 10f).
  • NLR gene was previously identified as the candidate gene for Sr45 based on map position and screening of six independent loss of function mutants derived from EMS treatment (Steuernagel et at., 2016, supra). However, no transgenic wheats with the candidate gene sequence were generated and tested with rust to confirm its Sr45- mediated stem rust resistance function.
  • a transformation construct containing a native Sr45 expression cassette (PC1 10; Table 4) was assembled by amplification of a 6,481 bp fragment of genomic DNA including the Sr45 coding sequence, 885 bp of 5' regulatory sequence and 1508 bp of 3' regulatory sequence with primers S45F1 (5’- AGT ACT GT AAT AATT G ATTCCGTCG-3’ , SEQ ID NO: 3) and S45R5 (5’- GAAATTCCTGCTGCATTGC-3’, SEQ ID NO: 4).
  • the amplified fragment was cloned into the Notl site of the binary vector pVecBARII, a derivative of pWBvec8 in which the selectable marker cassette conferring resistance to hygromycin was replaced with sequences that confer resistance to bialaphos (Wang et ai, 1998, Acta Horticulturae 461 : 401-408) to generate construct PC1 10.
  • a transformation construct containing an Sr45 expression cassette was constructed using the Golden Gate Modular Cloning Toolbox for Plants (Engler et al., 2014, ACS Synth Biol 3: 839-843).
  • the Sr33 promoter and Sr33 terminator were used to regulate expression of the Sr45 coding sequence. Synthetic DNA fragments of these three sequences were obtained from a commercial provider each flanked by a pair of divergent Bsa ⁇ recognition sites for Golden Gate Cloning and four base pair standard fusion sites for TypellS assembly defined in the Plant Common Genetic Syntax (Patron et ai, 2015, New Phytol 208: 13-19).
  • Nine out of the 10 tested lines showed the typical Sr45 resistance with an infection type of ; 1 (Fig. 11 b).
  • SrTA 1662 primer design Sequence generated from 12-plex genotyping-by-sequencing (GBS) using Psfl and Msp ⁇ enzymes according to Tru etai (2012; PLoS One 7: e32253) were available for the Ae. tauschii accession TA1662 and the hexaploid wheat genotype KS05HW14.
  • a custom Python script was used to parse KS05HW14 and TA1662 sequences by barcode and trim barcode sequences.
  • a BLAST database was created from the KS05HW14 and TA1662 sequences. The SrTA 1662 sequence was queried onto each database.
  • AGT AAACT CCCGGCT GAGAT A-3’ (SEQ ID NO: 5) and R1.R3 5’- GCCG AG ACT GAACT CAACT C-3’ (SEQ ID NO: 6), respectively, amplify a 750 bp fragment in genotypes carrying SrTA 1662 (SrTA 1662/SrTA 1662; SrT A 1662/srT A 1662) , while no fragment is amplified in individuals lacking a SrTA 1662 allele
  • reaction conditions for amplification of the SrTA 1662 fragment were as follows: 13.6 mL ddH20, 2.4 mL 10X reaction buffer, 1.9 mL 2.5 mM dNTPs, 2.0 mL 10 pM of forward primer, 2.0 mL 10 pM of reverse primer, 0.1 mL (0.5 U) Taq polymerase and 2 mL of template DNA (30 ng/mL). Cycling conditions include an initial denaturation of 95 °C (5 min) followed by 35 cycles of 95 °C (45 s), 63 °C (45s), and 72 °C (90s), and a final extension at 72 °C (10 min).
  • the R1 F3 + R1 R3 PCR fragment was mapped to a 3.8 cM interval in 83 individuals from the U6714 BC2F1 mapping population (Olson et at., 2013, Theor Appl Genet 126: 1179-1 188) using six SSR markers and one EST-derived marker (Fig. 9b). Marker order and distances were determined using R/qtl (Broman et ai, 2003, Bioinformatics 19: 889-890).
  • the 750 bp R1.F3+R1.R3 PCR fragment co segregated with resistance to PGT race QTHJC conferred by SrTA 1662 on 1 DS (i.e. zero recombinants were obtained among 83 BC2F1 plants).
  • CapRenSeq as described above is applied to an Ae. tauschii line in which we had identified candidate genes for stem rust, but where we require additional sequence for functional analysis.
  • Ae. tauschii accession BW011 11 also known as TOWWC1 12, available from the Germplasm Resources Centre; www.seedstor.ac.uk
  • Triticeae NLR library V3 contains -200- thousand baits from NLR exons and repeat-masked introns from Triticum aestivum, T. diccocoides, T. diccocoides, T. urartu, Ae. tauschii, Ae. speltoides, Ae. sharonensis, Ae. peregrina, Ae.
  • the contig_1689_1 increased in length from 6,260 to 8,731 bp, and was also joined to two other contigs of lengths 1 ,205 bp and 1 ,332 with gaps of 1 ,630 and 1 ,611 Ns, respectively.
  • RenSeq assemblies can be improved by scaffolding with LMP libraries. Further improvements could be obtained by scaffolding with an additional LMP library with a 10 kb insert size, and/or by scaffolding with linked read sequences (e.g. obtained from a 10x Genomics library).
  • the gaps (Ns) in the scaffolds could be obtained by scaffolding with an additional LMP library with a 10 kb insert size, and/or by scaffolding with linked read sequences (e.g. obtained from a 10x Genomics library).
  • the regression model for a particular k-mer can be described by: where pheno represents the vector of the phenotype scores, K represents the presence/absence vector of the k-mer, and P 1 , P 2 , - , P n represent the n most significant PCA dimensions. Fitting the above regression model means finding the coefficients such that the Euclidean distance between the vectors on the left and the right hand sides of the above expression is minimized. A likelihood ratio test for nested models (Wilks 1938, Ann Math Stat 9: 60-62) was then used to obtain a p-value for each k-mer.
  • This procedure reduces background (false positives) and allows the identification of rare candidate gene variants.
  • BBSRC Biological Sciences Research Council
  • Table 1 Comparison of gene (such as R gene) cloning strategies.
  • Table 3 Stem rust resistance genes identified by AgRenSeq in Aegilops tauschii.
  • Table 4 List of binary constructs carrying Sr genes.
  • Table 5 Diseases of bread wheat for which genetic variation in Aegilops Wilmingtonii has been observed.
  • Table 6 Examples of crop wild relative germplasm panels.

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention is directed to a method for identifying one or more genes associated with a selected phenotype in an organism. The method includes the steps of screening a database for genetic samples or nucleic acids associated with a selected phenotype, producing enriched sample sequences, identifying sub-sequences of a fixed length k (k- mers) associated with the selected phenotype within the enriched sample sequences, and mapping the identified k-mers. The method may be used for example to identify genes encoding plant resistance proteins, such as nucleotide-binding and leucine-rich repeat (NLR) proteins.

Description

METHOD FOR IDENTIFYING GENES ASSOCIATED WITH A PARTICULAR PHENOTYPE
FIELD OF THE INVENTION
The invention relates to the field of genetic association and gene identification, and in particular to methods for identifying genes associated with a particular phenotype.
BACKGROUND OF THE INVENTION
Plant disease resistance is often mediated by dominant or semi-dominant resistance (R) genes which encode receptors that perceive pathogen infection either by direct recognition of pathogen effector molecules, or indirectly by recognition of effector modified host targets. Most R genes encode intracellular nucleotide binding/leucine-rich repeat (NLR) immune receptor proteins. NLR genes are amongst the most evolutionarily dynamic gene families in plants with hundreds of these genes usually present in plant genomes. Crop plants and their wild progenitors typically possess extensive copy number and sequence variation at these NLR loci. Domestication and intensive breeding have reduced R gene diversity in crops rendering them more vulnerable to disease epidemics. New R genes derived from closely related species have been introduced into crops to improve disease resistance by either relatively laborious intensive breeding or more recently as transgenes. R gene cloning, however, is complicated by extensive pan-genome variation at NLR loci making it difficult to establish orthogonal relationships between a reference genome sequence and other accessions. The structural variation at NLR loci of different accessions can also reduce recombination at the locus further complicating positional cloning approaches. The requirement for recombination in gene cloning can be overcome by mutational genomics. The generation of multiple independent mutants by mutagenesis followed by genome complexity reduction and sequencing (e.g. NLR exome capture [“MutRenSeq”, described in Steuernagel et at., 2016, Nat Biotechnol 34: 652-655; Steuernagel et at., 2017, Methods Mol Biol 1659: 215-229; and WO2015/127185] or chromosome flow sorting and sequencing [“MutChromSeq”, described in Sanchez-Martin et at., 2016, Genome Biol 17: 221 and Steuernagel et al., 2017, Methods Mol Biol 1659: 231-243]), enables the unambiguous identification of mutant alleles and the functional NLR gene (see Table 1).
However, positional cloning and mutational genomics share certain limitations. Notably, they require the R gene to exist as a single gene in an otherwise susceptible genetic background to the pathogen isolate of interest which can take numerous generations to achieve. They also require that the plant target species is amenable for the generation and screening of large recombinant or mutant populations numbering 1000s of individuals (see Table 1). However, many wild relatives of crop plants contain multiple functional resistance genes and are experimentally intractable due to pre-domestication traits such as long generation times, seed shattering, seed dormancy, as well as having seed and plant morphologies not amenable to existing mechanization technologies.
In contrast, genome-wide association studies (GWAS) do not require monogenic resistant lines or the development of mapping or mutagenesis populations. GWAS enables trait correlation in a genetically diverse population structure by exploiting pre-existing recombination events accumulated in natural populations. Members within a population are screened with a genome-wide set of markers and linkage disequilibrium sought between markers and traits of interest. Markers linked to traits of interest are then readily assigned to a genomic region in a reference genome. However, the reliance on a reference genome excludes the identification of sequences that have significantly diverged from the reference, as is often the case for R gene loci.
There remains a need to provide improved methods suitable for identification of genes associated with a particular phenotype, such as R genes, particularly in organisms where whole genome sequencing is currently not feasible.
SUMMARY OF THE INVENTION
According to an aspect of the present invention, there is provided a method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:
(a) providing a database of genetic samples from a population of the organism, wherein the genetic samples comprise nucleic acids;
(b) optionally screening the database for genetic samples or nucleic acids associated with the selected phenotype;
(c) obtaining nucleic acid samples from the database;
(d) producing enriched samples by enriching the nucleic acid samples for nucleic acids associated with the selected phenotype;
(e) obtaining sequence information from the enriched samples to provide enriched sample sequences; (f) identifying sub-sequences of a fixed length k (k-mers) associated with the selected phenotype within the enriched sample sequences; and
(g) mapping the identified k- mers onto each of the enriched sample sequences to identify one or more genes associated with the selected phenotype. The one or more genes may for example be R genes.
The database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a flow diagram showing an embodiment of the invention.
Fig. 2 is a diagram depicting combining association genetics and resistance gene enrichment sequencing (“AgRenSeq”) for rapid R gene discovery and cloning (a, Crop wild relative diversity panel) A genetically diverse panel of accessions (Ac) is (b, Phenotype) phenotyped with different pathogen races, and (c, RenSeq) subjected to resistance gene enrichment sequencing (“RenSeq”) followed by assembly of the NLR repertoire and extraction of NLR k- mers for each accession, i) Library, ii) Baits, iii) Hybridize, iv) NLR capture, v) Sequence, vi) k- mers, vii) Assembly (d, K-mer association matrix) Each k-mer in each accession was given a phenotype score between +2 and -2 depending on the level of resistance or susceptibility. The sum of scores for each k-mer is calculated for each pathogen race. -mers are then plotted in an association matrix according to their sequence identity to NLRs from a given accession (x-axis) and their phenotype score (y-axis) (see Fig. 4). Ac = Accession, Res. = Resistant, Sus. = Susceptible, SR = Score race, a S(phenotype * k- er value).
Fig. 3 shows distribution of 317 KASP markers across the seven Aegilops tauschii chromosome arms. Centromeres are indicated by the letter“C”. Other horizontal bars indicate the chromosomal position of the markers. These markers were used to study the genetic diversity of the Ae. tauschii accessions (Fig. 4B). Fig. 4 depicts genetic architecture of stem rust resistance in Ae. tauschii. (a) Geographic distribution of Ae. tauschii ssp. strangulata (lighter dots) and ssp. tauschii (darker dots) used in this study. Two accessions from China and two from Pakistan, which fall outside of the map, are not shown. ARM, Armenia; AZE, Azerbaijan; GEO, Georgia; IRN, Iran; KAZ, Kazakhstan; SYR, Syria; TJK, Tajikistan; TUR, Turkey; TRK, Turkmenistan; UZB, Uzbekistan (b) Phylogenetic tree displaying Ae. tauschii ssp. strangulata (151 accessions) and ssp. tauschii (21 accessions) with an intermediate accession. Stem rust phenotypes and Sr gene genotypes are displayed by concentric circles around the tree. A = Sr45, B = Sr46, C = Sr33, D = SrTA 1662, E = TTKSK, F = TPMKC, G = RKQQC, H = QTHJC. (c-f) Identification of (c) Sr33, (d) Sr46, (e) SrTA 1662 and (f) Sr45 by AgRenSeq. Each integer (dot-column) on the x-axis represents an NLR contig from the RenSeq assembly of a single accession containing the respective Sr gene. Each dot on the y-axis represents one or more RenSeq k-mers associated with resistance across the diversity panel to the respective PGT race. Dot size is proportional to the number of k-mers associated with resistance: (c) RKQQC vs. BW_01 189, Sr33 (d) TPMKC vs. BVV 01191 , Sr46 (e) QTHJC vs. BW_01072, SrTA 1662 and (f) TTKSK vs. BW_01083, Sr45.
Fig. 5 is an unweighted pair group method with arithmetic mean (UPGMA) tree of 195 Ae. tauschii accessions based on 317 SNP markers. The tree is divided into two branches representing ssp. strangulata (darker branches) and ssp. tauschii (lighter branches) with an intermediate accession between the two subgroups. The final set of 151 non-redundant Ae. tauschii ssp. strangulata accessions used in AgRenSeq analysis are marked with filled black circles.
Fig. 6 is a graph showing principal component analysis of the 195 Ae. tauschii lines. Using 317 SNP markers identified two different subgroups, namely ssp. strangulata (lower dots; Lineage 2) and ssp. tauschii (upper dots; Lineage 1) with an intermediate accession lying between the two subspecies. The PCA distribution of accessions into subgroups corresponded closely with the subgroups depicted in the UPGMA tree. L1 = Lineage 1 ; L2 = Lineage 2; I = Intermediate. Fig. 7 shows photographs of a phenotypic diversity in the Ae. tauschii panel on the second leaf in response to various PGT races. The infection types were scored using the Stakman infection type scale and the scores ranged from: resistant (;0, ;1-, 1), moderately resistant (1+, 2), moderately susceptible (2+, 3), to susceptible (3+, 4). The accession number and PGT race of the infected leaves from left to right are: BW_01020 (TTKSK), BW_01114 (TPMKC); BW_01020 (TTTTF); BW_01084 (TRTTF); BW_01081 (TTKSK); BW_01039 (TRTTF); BW_01065 (TTKSK); BW_01184 (TPMKC); BW_01125 (TTKSK); BW_01012 (TPMKC).
Fig. 8 is a series of histograms showing distribution of seedling stem rust reactions across the 151 non-redundant Ae. tauschii spp. strangulata accessions to PGT races (a) QTHJC (b) RKQQC (c) TTKSK (d) TRTTF (e) TPMKC (f) TTTTF and (g) UK_01. The scale at the bottom of each histogram indicates the variation in reaction types ranging from resistant, moderately resistant, moderately susceptible to susceptible. R = Resistant, S = Susceptible, Freq. = Frequency.
Fig. 9 shows genetic mapping of SrTA 1662 in an Ae. tauschii TA1662 x T. aestivum KS05HW14 BC2F1 population (a) PCR amplification of a 750 bp fragment from SrTA1662 using primers R1.F3 and R1.R3. From left to right are genotypes of a resistant BC2F1 individual (U6714-B-047 ( SrTA 1662/srTA1662 )), a susceptible BC2F1 individual (U6714- B-048 (srTA 1662/srTA 1662)) , Ae. tauschii TA1662 (SrTA1662/SrTA 1662) and the susceptible recurrent parent KS05HW14 ( srTA1662/srTA 1662 ). (b) The 750 bp PCR fragment co-segregated with SrTA 1662-mediated resistance in 83 BC2F1 individuals. CentiMorgan distances are shown on left and markers are shown on right of chromosome 1 D. I = Xgdm33, II = Xwmc147, III = Xcfd15, IV = Xwmc432, V = SrTA1662, VI = XF3R3, VII = Xwmc222, VIII = Xcfa2158, IX = XBE489313.
Fig. 10 shows mapping of Sr46 in diploid and hexaploid families resulting in the identification of a sequence tagged site marker psr649 (a, Genetic map, Chr 2DS) located 0.1 cM from the resistance locus. Physical mapping using BAC contigs of Ae. tauschii AL8/78 of the locus (b, Ae. tauschii AL8/78 contig 5195) combined with comparative genomics of the corresponding region from Brachypodium distachum led to the identification of candidate genes (c, Brachypodium 5g01180, 01167, 01160; d, Chinese spring 2DS) in the Sr46 donor BAC library (e, Ae. tauschii AUS18913 BAC 2A/B_60P19 Sr46 NLR candidate gene; f) of which unique mutations were found in the NLR gene. The Sr46 candidate NLR gene was validated by transgenic studies (g) where the wild type and sib lines without the transgene showed high infection type of 3+4 (g1), while seven independent transgenic events showed an intermediate infection type (g2) and nine independent transgenic events showed a low infection type (g3, g4) characteristic of a resistant phenotype. I = Xgwm1099, II = Xpsr649, III = Xbarc297, Mut = Mutant.
Fig. 11 is a series of photographs showing transgenic Sr45 expression provides stem rust resistance in wheat cultivar Fielder. Primary transgenic seedlings show a typical Sr45 infection phenotype when challenged with the Australian stem rust race 98-1 ,2,3,5 and 6. (a) PC1 10-1 to 12 carry an Sr45 construct with the native promoter and terminator sequences, while (b) PC147-1 carries Sr45 regulated by the Sr33 promoter and terminator sequences. Nine out of ten independent PC147 lines conferred seedling resistance typical of Sr45. i = Fielder, ii = PC1 10-1 , iii = PC110-4, iv = PC1 10-4, v = PC1 10-5, vi = PC110- 7, vii = PC1 10-10, viii = PC1 10-12, ix = PC110-3, x = PC1 10-6, xi = PC110-9, xii = PC1 10- 11 , xiii = PC100-8, xiv = PC147-1 , xv = Fielder.
Fig. 12. Identification of stem rust resistance genes in Ae. tauschii by k-mer— based regression analysis for isolate RKQQC. (a, BW_01005 contigs) Linear regression analysis identifies Sr33 (dot column highlighted in black) and an additional candidate gene (labelled“false positive”) (b) Grouping all candidate contigs identified across the panel followed by multivariate lasso regression for RKQQC indicates that the additional candidate identified by linear regression in (a) is a false positive (FP), while a new candidate gene is identified, which has a marginally greater effect size than Sr33. 1 = Sr33, 2 = New Candidate, 3 = False Positive. DETAILED DESCRIPTION OF THE INVENTION
According to one aspect of the present invention, there is provided a method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:
(a) providing a database of genetic samples from a population of the organism, wherein the genetic samples comprise nucleic acids;
(b) optionally screening the database for genetic samples or nucleic acids associated with the selected phenotype;
(c) obtaining nucleic acid samples from the database;
(d) producing enriched samples by enriching the nucleic acid samples for nucleic acids associated with the selected phenotype;
(e) obtaining sequence information from the enriched samples to provide enriched sample sequences;
(f) identifying sub-sequences of a fixed length k (k- mers) associated with the selected phenotype within the enriched sample sequences; and
(g) mapping the identified k- mers onto each of the enriched sample sequences to identify one or more genes associated with the selected phenotype.
Certain steps may be carried out in a different order to that depicted above.
The present invention, in contrast to most published association studies, allows for reference genome-independent gene discovery and cloning. On provision of a database of genetic samples from a population (for example, a database such as a“diversity panel” or“diversity collection”), no additional crossing or mutagenesis is required for this gene identification. Thus, for example, cloning of genes from accessions of a wild plant species lacking advanced agronomic traits is possible, allowing for plant breeders to discover genes in their breeding germplasm and develop gene-specific markers for marker assisted selection. Step (f) may comprise recording the presence or absence of k- mers in each enriched sample sequence and ascribing a score to each k-mer according to the association of each k-mer with the selected phenotype.
Step (g) may comprise mapping the identified k- mers onto each of the enriched sample sequences to identify high scoring regions in the enriched sample sequences.
Step (d) may comprise enriching the samples for nucleic acids associated with the selected phenotype by hybrid-based capture enrichment. Here, the method may comprise designing a target sequence capture library for enriching the nucleic acids.
Step (d) may comprise depleting from the samples nucleic acids not associated with the selected phenotype.
According to the method, the samples may be enriched for nucleic acids for plant resistance (R) proteins. For example, the samples may be enriched for nucleic acids for nucleotide-binding and leucine-rich repeat (NLR) proteins. The samples may be enriched for nucleic acids for membrane-bound receptor proteins.
The nucleic acids of the database may comprise non-gene-related sequences, for example structural nucleic acids that may be associated with the selected phenotype.
Step (f) may comprise identifying k- mers of two or more fixed lengths k.
The organism may for example be a plant. Alternatively, the organism may be a mammal such as a human, a non-human primate, a rat or a mouse.
Step (e) may comprise alignment-free assembly of the enriched sample sequences.
The sequencing step (e) may comprise long mate-pair scaffolding of the enriched samples (see“CapRenSeq” procedure described below and in Example 2).
The sequence information may be obtained using high-throughput sequencing.
The genetic samples may comprise multiple nucleic acids associated with similar phenotypes. The multiple nucleic acids within each genetic sample may be associated with the same phenotype. Step (e) may comprise obtaining sequence information from a pool of enriched samples having defined levels of association with the selected phenotype or an opposite phenotype.
The database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant. For example, the database may comprise genetic samples from a plantation crop plant.
The database of the method may comprise or consist of genetic samples from the following non-limiting list of plants: wheat (for example Triticum urartu, T. aestivum, Aegilops sharonensis or Ae. speltoides), rye (for example Secale cereale), millet, onion, palm, tomato, pepper (for example Capsicum annuum), tobacco, canola, cotton (for example Gossypium ssp.), peanut, alfalfa, sunflower, safflower, Hordeum vulgare, oat (for example Avena sativa), Oryza spp., maize (for example Zea mays), sorghum (for example Sorghum bicolor ), sugarcane (for example Saccharum spp.), banana (for example Musa spp.), potato (for example Solarium spp.), Ipomoea batatas, soya bean (for example Glycine max or G. soja), Cicer arietinum, Pisum sativum, cassava (for example Manihot esculenta), Dioscorea spp., sugar beet (for example Beta vulgaris), legumes (for example Favaceae spp.), Cannabis sativa, Arabidopsis thaliana or Theobroma cacao.
Fig. 1 illustrates steps involved in a method according to an aspect of the invention.
Reduced representation sequencing, for example but not limited to resistance gene enrichment sequencing (“RenSeq”), is conducted on a database such as a diversity panel of samples or a pool of samples. The k- mers in each sample of the diversity panel or the pool of samples (individual data sets) are counted and a k- er matrix is generated for all k- mers. The complexity of the data set may be reduced by removing k- mers below or above a certain percentage threshold (minimal and maximal k-mer frequency). The diversity panel samples or pool of samples are phenotyped and the phenotype scores may be converted. The phenotype scores for each k-mer are then summed up and projected onto the assembly of an accession with the desired phenotype.
The invention may also be described as comprising the following steps:
(a) Select or provide a diversity panel;
(b) Design a targeted sequence capture library; (c) Enrich genetic samples (or“accessions”) in the diversity panel using the sequence capture library in (b), then sequence individual accessions in the diversity panel. (It is also possible to bulk samples with similar traits before sequencing. This would significantly reduce the cost of discovering and cloning individual resistance genes form a diversity panel. It also allows performing a cost-effective assessment of a diversity panel to see if the population structure in the panel would likely facilitate gene discovery and cloning by the method of the invention, before performing the method of the invention on each individual accession. The advantage of performing the method of the present invention on each individual accession is that subsequently one can manipulate the population structure by removing individual accessions. This may reveal new genes that were previously masked, as exemplified for example in Example 1 below with Sr45);
(d) Record presence or absence of k-mers for each accession and create a matrix: Accessions by k-mers. For example, entries may be 1 for“present” and 0 for“absent”;
(e) Phenotype the diversity panel for any phenotypic trait of interest that is likely to be associated with genes captured in step (b). Accessions are scored positive (e.g. +2) for presence of a phenotypic trait of interest and negative (e.g. -2) for an undesired phenotypic trait. Intermediate phenotypic traits may be scored at (or close to) zero;
(f) Associate phenotypes (e) with matrix (d) to derive a score for each k-mer;
(g) Project scores onto a de novo assembly of an accession having the desired phenotype. Projection means that each contig of the assembly accumulates all scores of k-mers that are present in that contig; and
(h) Plot scores per contig. Data points representing several k-mers with the same score may be highlighted.
(i) Step (g) may comprise plotting the k-mers onto a sequenced and assembled reference accession, in which case k-mer scores will be plotted per nucleotide or per gene, or in bins of a defined nucleotide or genetic length.
As used herein, the“database of genetic samples” may be derived from a“diversity panel” or“diversity collection”. The samples or accessions in the database may have been phenotyped for one or more selected traits of interest prior to performing steps of the invention, or the method of the invention may include a step of phenotyping the samples accessions in the database. A“diversity panel” or“diversity collection” is a collection of genetic samples from a diverse population, typically species-specific, and may include nucleic acid fragments prepared from organismal nucleic acids (e.g. genomic DNA) by a method that can reveal polymorphisms or mutations (e.g. sequence differences) between samples.
Diversity panels may be published and publically available and/or custom designed according to the species and trait of interest. Individual uniquely identifiable samples within a diversity panel are termed accessions. The accessions within a diversity panel may be genetically unique or redundant.
Enrichment (also termed “target enrichment” herein) is a method by which particular genomic regions of interest are preferentially amplified or selected. Target enrichment can be performed before sequencing, to reduce the amount of sequencing and data processing required. Target enrichment methods include uniplex and multiplexed polymerase chain reaction (PCR), molecular inversion probe (MlP)-based approaches or amplicon based target enrichment and hybrid capture-based approaches.
In hybrid capture-based target enrichment, hybridized target-specific probes are hybridized to cDNA libraries. The probes may be attached to a microarray or in solution. Hybridized probes may be removed from solution by bead capture to isolate the target DNA from the background DNA. Probes for use in hybrid capture-based target enrichment may be provided as a generic or custom designed base capture library.
The target enrichment of the present invention may be performed for example according to the methods as described by Jupe et al. (2013, Plant J. 76:530-544).
Next generation sequencing produces large amounts of short reads from fragmented DNA. In order for the sequence to be analyzed the reads must be assembled to produce a reconstruction of the complete original DNA sequence, or assembly. This is done by aligning the reads, for example by finding overlaps between them. Assembly may be performed by mapping to a reference genome or de novo. De novo assembly has the advantage of enabling of a DNA sequence without reference to external data that either may be unavailable (for example when a reference genome is not available) or bias the results, for example in the case of diverse species or diverse genomic regions. A rate limiting step for some resistance gene enrichment sequencing (“RenSeq”) assemblies arises due to their fragmentation. For example, from the -900 and -3400 complete NLR loci in the Ae. tauschii (diploid) and bread wheat (hexaploid) genomes, respectively, only -25% assemble into full-length NLRs. Capturing high molecular weight DNA for long-read or linked read RenSeq (e.g. PacBio, Nanopore and 10x) is technically challenging due to the inherent instability of long single-stranded DNA molecules.
Moreover, homopolymer-rich introns in the NLRs appear to impede the PacBio chemistry contributing to uneven coverage and fragmented NLR assemblies. As a result of fragmented RenSeq assemblies, some NLRs may be missed in correlation studies due to a decreased signal-to-noise ratio.
Recent-established protocols now allow for the generation of high-quality assemblies of complex plant genomes such as polyploid wheat (see, for example, Avni et al. [2017] Science 357: 93-97). Requirements for the generation of those high-quality assemblies may include: (i) a 450 bp insert library sequenced to high coverage with 250 bp paired end reads to generate overlapping fragments for an initial assembly, followed by (ii) scaffolding with a series of long mate pair (“LMP”) libraries, and/or (iii) scaffolding with “Linked-Reads” (e.g. from 10x Genomics Inc.), which provides long-range information from short-read sequencing data. RenSeq may therefore also be performed on LMP or 10x libraries and the captured reads used to improve a standard RenSeq assembly. The elements of this“combined assembly procedure” (Cap) for the gene content in a whole genome can be applied in the present invention to the fraction of genes captured by RenSeq, for example to obtain NLR repertoires with a higher proportion of full-length genes (a procedure referred to herein as“CapRenSeq”).
Alignment free assembly of sequencing data is a process by which de novo genome assembly is performed. Advantages of alignment free assembly over alignment and reference sequence-based methods of sequence analysis include the increased speed of alignment free assembly. Alignment free methods for sequence analysis include methods based on k-mer analysis and information theory.
As used herein, k-mer frequency may also be referred to as k-mer counting.
Methods based on k-mer frequency consider for a fixed or variable integer k, the relative frequencies of all possible k-mers for each of the input sequences and define a distance between the reads based on these frequency vectors. In other k- mer based methods each sequence position may be considered in terms of overlapping k- mers. Contiguous or spaced, e.g. pattern deterministic, k- mers may be used.
The method of the present invention comprises k- mer-based association genetics. The method may comprise performing trait associations on subsequences or k- mers generated from raw sequence reads. The method of the present invention may comprise extracting the k- mers from sequence and assembled accessions from a diversity panel.
The method of the present invention may comprise computing the genetic identity of the accessions following sequencing and assembly of the sequence reads, based on known SNP markers. Redundant accessions can then be removed before further analysis. This step has the advantage of decreasing the data load of the subsequent analysis steps.
The sequencing used in the method of the present invention may be lllumina short-read sequencing by synthesis technology.
As used herein, the term “gene” refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA. The term gene may include one or more exons or part thereof, one or more introns or part thereof, all or a portion of the 5’ untranslated region, and all or a portion of the 3’ untranslated region.
The method of invention may also be applied to identifying non-gene nucleic acids associated with a selected phenotype, for example non-coding nucleic acids such as nucleic acids involved in chromosome structure.
As exemplified by way of example in Example 3 below, a comprehensive search across all the accessions may also be performed by grouping the contigs of all the accessions based on sequence similarity for significant candidate contigs, and then jointly testing to retain the most important ones. Advantageously, this allows false positives to be filtered out and aids identification of rare resistance genes. In particular, the method of the invention may also be performed by including an additional step of grouping the enriched sample sequences based on the presence or absence of a contig, and testing the association of the contig with the presence of the k- mer.
As used herein, the term“phenotype” refers to a detectable appearance or characteristic of an organism. Phenotype is understood to result from the interaction of the organism’s genotype with the environment. Phenotype includes individual phenotypic traits that are detectable and describable after visual inspection, measurement, and/or analysis by chemical, biochemical and/or molecular techniques.
Particular non-limiting embodiments of the present invention will now be described in detail.
EXAMPLES
Example 1 :“AqRenSeq”
Summary
Genetic resistance is the most economic and environmentally sustainable approach for crop disease protection. Disease Resistance (R) genes from wild relatives are a valuable resource for breeding resistant crops. However, introgression of R genes into crops is often associated with co-integration of deleterious linked genes and pathogens can rapidly evolve to overcome R genes when deployed singly. Introducing multiple cloned R genes into crops as a stack would avoid linkage drag and delay emergence of resistance breaking pathogen races. However, current R gene cloning methods require segregating or mutant progenies, which are difficult to generate for many wild relatives due to poor agronomic traits. We have exploited natural pan-genome variation in a wild diploid wheat by combining association genetics with R gene enrichment sequencing (which we call “AgRenSeq”) to clone four stem rust resistance genes in <6 months. AgRenSeq applied to diversity panels is therefore a major advance in isolating R genes for engineering broad- spectrum resistance in crops.
Introduction
In the present example, we show that -mer-based association genetics combined with resistance gene enrichment sequencing (an embodiment of the invention also referred to herein as“AgRenSeq”) enables the discovery and cloning of R genes from a plant diversity panel. A summary of the process used in this example is shown in Fig. 2.
Results and Discussion
Here, we describe the AgRenSeq analysis of a panel of Aegilops tauschii accessions that were phenotyped with races of the wheat stem rust pathogen Puccinia graminis f. sp. tritici (PGT). Ae. tauschii is the wild progenitor species of the Thticum aestivum (bread wheat) D genome and a number of stem rust resistance (Sr) genes that have been introgressed into bread wheat from this species. To test AgRenSeq, we chose Ae. tauschii ssp. strangulata because of the availability of genetically diverse accessions (see Jones et ai., 2013, Theor Appl Genet 126: 1793-1808) and two cloned Sr genes from this species ( Sr33 and Sr45) to serve as positive controls. We obtained 174 Ae. tauschii ssp. strangulata accessions that span its habitat range and also included 21 Ae. tauschii ssp. tauschii accessions as an outgroup.
We designed and tested a sequence capture bait library optimised for Ae. tauschii NLRs plus genomic regions encoding 317 single nucleotide polymorphism (SNP) markers distributed across all seven chromosomes (Fig. 3). RenSeq analysis was undertaken on the diversity panel using this capture library and lllumina short-read sequencing, with 1.66 ± 0.59 Gb of 250 bp paired-end sequence obtained per accession. De novo sequence assemblies were analysed with NLR-parser, a programme that identifies NLRs based on number and order of short conserved NLR sequence motifs24. We obtained 1 ,437 ± 81 NLR contigs per accession, 299 ± 14 of which encoded full length NLRs and 1 ,137 ± 87 encoding partial NLRs. Captured SNP marker sequences were compared among accessions and a set of 151 genetically distinct Ae. tauschii ssp. strangulata accessions were selected (Figs 4a and 5-6). These accessions were phenotyped for their reaction to seven PGT races with different virulence profiles and geographic origins (Table 2; Fig. 7). From 10% to 67% of the accessions showed resistance depending on the PGT race used (Figs 4b and Fig. 8; Table 3).
To identify stem rust resistance (Sr) genes in the phenotyped panel, we performed a k- mer-based AgRenSeq analysis. The k-mer sequences were given a numerical value based on the level of resistance or susceptibility observed for the accession in which they occurred, and these values were then summed for each k-mer. In previous k-mer based GWAS, associated k-mers were used to build a local assembly (Rahman et al., 2017, “Association mapping from sequencing reads using K-mers”, BioRxiv) or reconstruct diverged haplotypes through a direct base-by-base k-mer frequency-guided extension process (Audano et ai., 2017,“Mapping-free variant calling using haplotype reconstruction from k-mer frequencies”, BioRxiv). However, the high degree of sequence identity between paralogous NLR genes, particularly in the more conserved NB-ARC (nucleotide-binding adaptor shared by APAF-1 , R proteins, and CED-4) domain, makes correct phasing and assembly difficult for these genes. We therefore mapped k-mers directly onto NLR assemblies of resistant accessions.
A proof-of-concept experiment was undertaken to identify the Sr33 gene amongst these accessions using the North American PGT race RKQQC which is avirulent on Sr33, but virulent on most other known Sr genes (Rouse et al. , 201 1 , Crop Science 51 : 2074-2078). We mapped k-mers associated with RKQQC resistance onto the RenSeq assembly of accession BW_01189 which is the original source of the Sr33 gene (Periyannan et al., 2013, Science 341 : 786-788). A single discrete association peak was identified in a 5.6 kb RenSeq contig that had 100% identity to Sr33 (Fig. 4c). The same method was then used to map k-mers associated with resistance to all six remaining PGT races in the 151 accessions. Two non-redundant, high-confidence candidate Sr genes other than Sr33 were identified (Fig. 4d-e, Table 3) and these candidate Sr genes were aligned to the reference genome assembly of wheat cv. Chinese Spring RefSeql .O (www.wheatgenome.org). The genomic locations of these candidate Sr genes coincided with the genetic positions of two Sr genes introgressed into wheat from Ae. tauschii ssp. strangulata, namely Sr46 (Yu et ai, 2015, Theor Appl Genet 128: 431-443) and SrTA 1662 (Olson et al., 2013, Theor Appl Genet 126: 1179-1188). The SrTA 1662 candidate gene was mapped in a recombinant inbred line population to a 3.8 cM interval shown to encode SrTA 166-resistance, thus supporting our AgRenSeq data (Fig. 9). The SrTA 1662 candidate encodes a coiled-coil NLR protein with 85% amino acid identity to Sr33.
Independently of this AgRenSeq approach, we also identified an Sr46 candidate gene by conventional fine mapping in segregating diploid progenitor and wheat populations coupled with the sequencing of candidate genes in this region in three ethyl methanesulphonate-derived mutants that had lost Sr46 resistance. In two mutants, the same candidate gene contained non-synonymous substitutions, while the third mutant had a deletion of the chromosomal segment encoding this gene (Fig. 10a-f). Comparison of the Sr46 candidate gene identified by AgRenSeq and map-based/mutagenesis cloning showed that they were 100% identical. This gene, hereafter referred to as Sr46, encodes a coiled coil, NLR protein that conferred stem rust resistance when expressed as a transgene in the susceptible wheat cv. Fielder (Fig. 10g; Table 4).
While identification of Sr33, Sr46 and SrTA 1662 was successful we did not identify Sr45. We hypothesized this was due to the prevalence of Sr46 amongst the panel accessions masking Sr45 resistance as none of the seven PGT races utilized were virulent on Sr46 and avirulent on Sr45. To overcome this problem, we removed all 64 accessions containing Sr46 from the panel. The -mers associated with resistance to PGT race TTKSK in the remaining 87 accessions were then mapped onto the TTKSK-resistant Ae. tauschii accession BW_01083, the original donor of Sr45. Previously we had identified a candidate Sr45 gene by MutRenSeq from a wheat- Ae. tauschii BW_01083 introgression line. Using AgRenSeq, a single discrete association peak was identified that corresponded to this previously identified Sr45 candidate gene (Fig. 4f). This gene conferred resistance when transformed into cv Fielder confirming both our previous MutRenSeq identification (see Steuernagel etai., 2016, supra) and current AgRenSeq data (Fig. 11, Table 4). From AgRenSeq analysis and sequence alignments among these four cloned Sr genes and the diversity panel, Sr46 and SrTA 1662 were the most prevalent being found in 42% of the accessions, while Sr33 and Sr45 were found in 5% and 7% of accessions, respectively (Fig. 4b).
Our results demonstrate that AgRenSeq is a robust protocol for discovering and cloning functional R genes from a diversity panel. This approach successfully identified four Sr genes that have been introgressed into wheat from Ae. tauschii ssp. strangulata over the last 40 years. In addition to stem rust resistance, Ae. tauschii is a rich source of genetic variation for resistance to other diseases and pests relevant to bread wheat cultivation including leaf rust, stripe rust, wheat blast, powdery mildew, Hessian fly, and others (Table 5). Our RenSeq-configured diversity panel is essentially an R gene genotyped resource which can now be screened against other pathogens/pests to clone additional functional R genes.
In contrast to most published association studies, the AgRenSeq procedure as described in this example is reference genome-independent and directly identifies the NLR that confers resistance rather than identifying a genomic region encoding multiple paralogs for subsequent candidate gene confirmation. The k-mer strategy could also be adapted for associative transcriptomics which allows trait correlation based on variation in transcript sequence and abundance (Harper etal., 2012, Nat Biotechnol 30: 798-802). The low basal level of NLR gene expression may necessitate costly and computationally challenging ultra-deep transcript sequencing unless cDNAs were enriched by RenSeq (Andolfo et al., 2014, BMC Plant Biol 14: 120). AgRenSeq is also not restricted to the limited genetic variation and recombination present in bi-parental populations, but rather interrogates pan genome sequence variation in diverse germplasm to isolate uncharacterised R genes (or other genes). Thus, importantly, no additional crossing or mutagenesis is required for R gene cloning using this approach. This makes it possible to clone R genes from accessions of a wild species lacking any advanced agronomic traits and requires only the phenotyping of an enrichment-sequenced diversity panel.
Diversity collections are available for many crops and their wild relatives including soybean, pea, cotton, maize, potato, wheat, barley, rice, banana and cocoa (Table 6). The ability to rapidly clone agriculturally valuable R genes or other genes by AgRenSeq further increases the value of these germplasm collections. AgRenSeq may also allow breeders to discover genes in their breeding germplasm and develop gene-specific markers for marker assisted selection. Thus, this method has enormous utility for genetic improvement of most crops and will also provide new fundamental insights into the structure and evolution of species-wide functional R gene or other gene architectures.
Materials and Methods
Ae. tauschii diversity panel and DNA extraction
A set of 195 Ae. tauschii accessions were assembled from the Wheat Genetics Resource Center (Kansas State University, USA), The Leibniz Institute of Plant Genetics and Crop Plant Research (Gatersleben, Germany), National Small Grains Collection (Beltsville, Maryland, USA), International Center for Agricultural Research in the Dry Areas (Aleppo, Syria), The Commonwealth Scientific and Industrial Research Organisation (Canberra, Australia), The N. I. Vavilov Institute of Plant Industry (St. Petersburg, Russia), University of California-Davis (Davis, California, USA), and Agriculture and Agri-Food Canada (Winnipeg, Canada). The majority of these accessions were originally collected from diverse collection sites around the Caspian Sea (Fig. 4a). For each accession, we used seeds obtained by self-pollination of a single plant in a glasshouse at the John Innes Centre in the year 2016. High molecular weight genomic DNA was extracted from leaf tissue using a modified CTAB method (see Yu et al., 2017, Theor Appl Genet 130: 1207- 1222). DNA quantification was performed using a NanoDrop spectrophotometer (Thermo Scientific) and the Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies). Ae. tauschii accessions with the GRU accession numbers are available from the Germplasm Resources Unit, John Innes Centre, Norwich, UK (https://www.jic.ac.uk/germplasm/).
Design of Ae. tauschii bait library
NLRs: We used several available sequence resources from Ae. tauschii and related genomes to generate a set of baits with high potential to capture the NLR repertoire across the Ae. tauschii species complex. Transcriptome raw data from publicly available datasets from Ae. tauschii (ERR420230, ERR420231), Triticum urartu (PRJNA191053; Krasileva et al., 2013, Genome Biol 14: R66), T. turgidum (PRJNA191054; Krasileva et al., 2013, supra), Ae. sharonensis (PRJEB5340; Yu et al., 2017, supra), and hexaploid wheat (ERX1053979 [Steuernagel et al., 2016, supra], PRJEB23474) were assembled using Trinity (Grabherr et al., 201 1 , Nat Biotechnol 29: 644-652) with default parameters. In addition, we added NLR gene models from T. aestivum (Grabherr et al., 2011 , supra ; Choulet et al., 2014, Science 345: 1249721), Hordeum vuigare (Consortium et al., 2012, Nature 491 : 71 1-716) and Brachypodium distachyon (Initiative, 2012, Nature 463: 763- 768). All resources were screened for NLR association using NLR-Parser (Steuernagel et al., 2015, Bioinformatics 31 : 1665-1667).
KASP markers: A total of 2,1 10 SNP markers were used in the design of the bait library. These markers were composed of: 331 KASP markers selected from an existing screen of six Ae. tauschii accessions (Ent-230, Ent-088, Ent-323, Ent-336, Ent-392, Ent-414) (available from www.cerealsdb.net; Wilkinson et al. 2012, BMC Bioinformatics 13: 219); 682 KASP markers designed using the probe sequences from the 820K Axiom array (Affymetrix Inc.; Winfield et al. 2016, Plant Biotechnol J 14: 1195-1206) based on polymorphism among fourteen Ae. tauschii accessions (data available from www.cerealsdb.net and in Winfield et al., 2016 [supra]) 363 KASP markers selected from screening of the six Ae. tauschii accessions (as above) on the 90K iSelect array (Wang et al., 2014, Plant Biotechnol J 12: 787-796); 662 probes from the 90K D-genome markers were shortlisted based on D-genome map position and conversion into KASP markers on the NIAB MAGIC population (Gardner et al., 2016, Plant Biotechnol J 14: 1406-1417; map available from www.niab.com/magic); 38 KASP markers from a previous characterization of Ae. tauschii diversity (Jones et al., 2013, Theor Appl Genet 126: 1793-1808); and 34 KASP markers used for the development of standardized marker-assisted selection assays
(http://www.cerealsdb.uk.net/cerealgenomics/CerealsDBNEW/download_mas.php?URL =/cerealgenomics/CerealsDBNEW/FORM_SNPs_vs_Brachy.php). Only 800 marker sequences could be aligned to the AL8/78 reference sequence (NCBI GenBank assembly accession: GCA_000347335.1). The best hit for these markers was elongated to 240 bp. The IUPAC base for the SNP was replaced by the allele found in AL8/78.
The resulting resource of NLR genes and Ae. tauschii SNP markers were masked for repeats using RepeatMasker [http://repeatmasker.org] and the Triticeae Repeat Database (Wicker & Keller, 2002, Trends in Plant Sciences 7: 561-562). A set of 58,943 baits were generated from the unmasked sequences.
Library preparation and sequencing
lllumina libraries with an average insert size of 700 bp were enriched in bulks of four by MYcroarray, Michigan, USA, as previously described (Steuernagel etal., 2017, Methods Mol Biol 1659: 215-229), and sequenced on an lllumina HiSeq with 250 bp PE reads at Novogene, China.
Testing bait library
Bait sequences were aligned to AL8/78 genome sequence using BLASTn (Zhang et al., 2000, J Comput Biol 7: 203-214). Local alignments were filtered for matches with at least length of 100 and 80% identity. Regions from the genome sequence with matches to at least two baits were extracted including 1000 bp flanking sequence. Reads from RenSeq of AL8/78 accession were aligned to extracted regions using BWA (Li & Durbin, 2009, Bioinformatics 25: 1754-1760). Mapping was converted to mpileup format using samtools (Li et al., 2009, Bioinformatics 25: 2078-2079) and percentage of bases in those regions that were covered by RenSeq reads was extracted from mpileup file using a custom script. Of a total of 1 ,950,922 positions in regions, 1 ,952,834 (99.9%) were covered.
NLR assembly parameters
The primary sequence data was trimmed using Trimmomatic v0.2 (Bolger et al., 2014, Bioinformatics 30: 21 14-2120) and de novo assembled with the CLC Assembly Cell (http://www.clcbio.com/products/clc-assembly-cell/) using word size (-w = 64) with standard parameters. The output of the CLC Assembler was a FASTA file containing all the contig sequences. Construction of phylogenetic tree and selection of non-redundant accessions
The KASP marker sequences were extracted from the RenSeq assemblies of each accession using BLASTn and the sequences were multiple-aligned to score the SNP variation across the accessions. The marker sequences with minor allele frequency (MAF) less than 5% in the panel and missing data greater than 60% were excluded from the analysis. The resulting genotyping matrix consisting of 316 KASP markers was used to build a UPGMA (unweighted pair group method with arithmetic mean) based tree using the majority consensus rule for drawing the 100 bootstraps using the Bio.Phylo module from the Biopython [http://biopython.org] package (Talevich et at., 2012, BMC Bioinformatics 13: 209). Further, a python script was used to generate an iTOL [https://itol.embl.de/] compatible tree for rendering and annotation. In addition, an identity- by-state matrix was computed for all possible pairs of accessions to group those accessions with ³99% genetic identity. The set of 151 non-redundant accessions were selected for association mapping based on genetic identity, variation in phenotype data of accessions within identity group, significance of any accession reported in the literature and the seed availability in our stock.
Ae. tauschii stem rust phenotyping for AgRenSeq
Two-week old seedlings were inoculated with Puccinia graminis f. sp. tritici (PGT) spores and infection types were scored two weeks later as previously described (Yu et al., 2017, supra ; see Fig. 7).
K-mer presence/absence matrix
-mers (k = 71) were counted in trimmed raw data per accession using Jellyfish (Marcais & Kingsford, 201 1 , Bioinformatics 27: 764-770). -mers with a count of less than 10 in an accession were discarded immediately. -mer counts from all accessions were integrated to create a presence/absence matrix with one row per k-mer and one column per accession. The entries were reduced to 1 (presence) and 0 (absence). Subsequently, the matrix was cleaned for -mers that occur in less than five accessions or occur in all but four or less accessions. Programs to process the data were implemented in Java.
K-mer-phenotype associations
Phenotype scores were converted from Stakman infection types (Stakman et al., 1962, United States Department of Agriculture, Agricultural Research Service E-617) to numeric values between -2 and 2 associating (0, ;, 1-, 1 , 1 +, 2-, 2, 2+, 3-, 3, 3+, 4) with (2, 1.67, 1.33, 1 , 0.67, 0.33, 0, -0.33, -0.67, -1 , -1.33, -2). Values from replicates were combined into a mean. For a resistant accession, i.e. with the expectation of carrying a resistance gene, contigs associated with NLR genes were annotated using NLR-Parser v2 [see github/steuernb/MutRenSeq] Only k- mers occurring in one of the NLR-associated contigs were considered for association analysis. For each of those k- ers, if also present in the pre-calculated presence/absence matrix, an association score was calculated by the sum of phenotype scores from accessions carrying that k- mer. Programs for association were implemented in Java.
Generating AgRenSeq plots
AgRenSeq plots were generated using R where each integer on the x-axis corresponds to one contig. Dots on plot are according to scores of k- mers within each contig. Dotsize was plotted according to number of k- mers with the specific score.
Phylogenetic tree graphics
The phylogenetic tree with PGT phenotypes and Sr gene genotypes displayed in concentric circles (Fig. 4b) was constructed with iTOL v3 (Letunic & Bork, 2016, Nucleic Acids Res 44: W242-245).
Positional cloning of Sr46
Mapping families were developed at both the diploid (Ae. tauschii) and hexaploid (T. aestivum) levels. An F3 family from AUS18913 (diploid donor of Sr46) x CP I110856 (stem rust susceptible diploid) was phenotyped with the Australian PGT race 34-1 ,2,3,4,5,6 and 7 (culture 74-L1) at the Plant Breeding Institute, University of Sydney, Australia. At the hexaploid level, selections of resistant lines (R9.3, R1 1.4), derived from a cross between the synthetic hexaploid, Langdon/AUS1913 (Lagudah et at., 1987, Journal of Cereal Science 5: 129-138) and cv. Meering (stem rust susceptible), that carried only Sr46, were backcrossed to Meering or crossed directly with cv. Westonia. The simple sequence repeat markers gwm1099 and barc297, located on the distal end of wheat chromosome 2DS, that flank the Sr46 locus, were used to select 30 recombinant lines out of 960 F2 seed from R9.3 x Meering and R11.4 x Westonia (Fig. 10a). These recombinants were also phenotyped with PGT culture 74-L1. We used a marker, psr649, closely linked (0.1 cM) to Sr46 from the diploid family to identify a BAC contig, ctg5195, from the wheat D genome Assembly 1.1 (http://aegilops.wheat.ucdavis.edu/ATGSP/). Within this contig, sequences were identified with (i) a truncated NLR that carried only the nucleotide binding domain, and (ii) an inosine-5'-monophosphate dehydrogenase (previously annotated as ‘Putative brown planthopper-induced resistance protein 1' and which possesses a CBS domain, a small domain originally identified in cystathionine beta-synthase), that co segregated with Sr46 (Fig. 10b). A comparative genomics approach was adopted whereby BLAST searches identified an orthologous inosine-5'-monophosphate dehydrogenase gene in Brachypodium chromosome 5 (Bradi_5g01180) with adjacent NB-ARC domain LRR (Bradi_5g01 167) and a protein annotated with a LIM domain (Bradi_5g01 160) (Fig. 10c). Sequence similarity searches with the Bradi_5g01167 NLR against the wheat chromosome 2DS survey sequence identified a small family of related NLRs on three separate contigs, one of which contained adjacent sequences with a LIM domain (IWGSC 2DS survey sequences) (Fig. 10d). Amplified fragments generated from consensus regions of these wheat NLRs were used to screen an AUS18913 BAC library (Moullet et at., 1999, Theor Appl Genet 99: 305-313). A specific amplicon, primer S46ConsFA (5’- GCT G AGATT GTGCTT CTT CT AG-3’ , SEQ ID NO: 1) + primer S46PREVR (5’- CATTGTACGCTCTGTCCAATA-3’, SEQ ID NO: 2) derived from re-sequencing AUS18913 with the consensus NLR cosegregated with Sr46 and the corresponding BAC clone (2A/B_60P19) was sequenced (Fig. 10e). A single full-length gene with NB-ARC and LRR domains was identified in BAC2A/B_60P19 and further investigated as a candidate gene for Sr46 (Fig. 10f). Ethyl methane sulphonate (EMS) mutagenesis was performed on line R9.3 according to procedures previously described (Mago et ai, 2017, Methods Mol Biol 1659: 199-205). Three M2 mutants were identified (M 1 BS, M2S, M3B) and progeny tested at M3 and M4 generations. Mutant M1 BS carried a large deletion of chromosome 2DS, whereas point mutations occurred in M2S and M3B (Fig. 10f).
Validation of Sr46 and Sr45 candidate genes by transformation
Validation of the candidate Sr46 gene, S46NLR- BAC2A/B_60P19 in wheat transformation (cv. Fielder) was conducted according to previous procedures (Periyannan et at., 2013, Science 341 : 786-788) using a gene construct with the genomic clone which included 2 kb of the native promoter region and 1.8 kb of downstream sequences (PC92; Table 4). T ransgenic plants and sib lines were tested with the Australian PGT race 98-1 , 2, 3, 5, and 6 (virulent on Fielder) on To, Ti and T2 progenies (Fig. 10g).
An NLR gene was previously identified as the candidate gene for Sr45 based on map position and screening of six independent loss of function mutants derived from EMS treatment (Steuernagel et at., 2016, supra). However, no transgenic wheats with the candidate gene sequence were generated and tested with rust to confirm its Sr45- mediated stem rust resistance function. A transformation construct containing a native Sr45 expression cassette (PC1 10; Table 4) was assembled by amplification of a 6,481 bp fragment of genomic DNA including the Sr45 coding sequence, 885 bp of 5' regulatory sequence and 1508 bp of 3' regulatory sequence with primers S45F1 (5’- AGT ACT GT AAT AATT G ATTCCGTCG-3’ , SEQ ID NO: 3) and S45R5 (5’- GAAATTCCTGCTGCATTGC-3’, SEQ ID NO: 4). The amplified fragment was cloned into the Notl site of the binary vector pVecBARII, a derivative of pWBvec8 in which the selectable marker cassette conferring resistance to hygromycin was replaced with sequences that confer resistance to bialaphos (Wang et ai, 1998, Acta Horticulturae 461 : 401-408) to generate construct PC1 10. In addition, a transformation construct containing an Sr45 expression cassette was constructed using the Golden Gate Modular Cloning Toolbox for Plants (Engler et al., 2014, ACS Synth Biol 3: 839-843). The Sr33 promoter and Sr33 terminator (Periyannan et ai, 2013, supra) were used to regulate expression of the Sr45 coding sequence. Synthetic DNA fragments of these three sequences were obtained from a commercial provider each flanked by a pair of divergent Bsa\ recognition sites for Golden Gate Cloning and four base pair standard fusion sites for TypellS assembly defined in the Plant Common Genetic Syntax (Patron et ai, 2015, New Phytol 208: 13-19). These parts were assembled into the plCH47732 Level one acceptor (Weber et ai, 2011 , PLoS One 6: e16765) and then the gene cassette was cloned into the A/ofl site of the pVec8 binary vector to produce PC147 (Table 4).
The two Sr45 constructs, with native and non-native regulatory elements, were transformed into the rust susceptible wheat cultivar Fielder. A total of 12 primary transgenic plants were recovered containing construct PC110 and inoculated with PGT race 98-1 , 2, 3, 5, and 6. Seven lines (PC110-1 , -2, -4, -5, -7, -10, -12) had an infection type ;1-, four lines (PC1 10-3, -6, -9, -1 1) had an infection type 1 while the transgenic line PC110-8 showed a susceptible infection type 3+ (similar to non-transgenic Fielder) as it carried a truncated Sr45 gene sequence (Fig. 11 a). In addition, a total of 10 primary transgenic plants carrying PC147 were also scored against PGT race 98-1 , 2, 3, 5, and 6. Nine out of the 10 tested lines showed the typical Sr45 resistance with an infection type of ; 1 (Fig. 11 b).
Mapping of the SrTA1662 candidate NLR gene
SrTA 1662 primer design: Sequence generated from 12-plex genotyping-by-sequencing (GBS) using Psfl and Msp\ enzymes according to Poland etai (2012; PLoS One 7: e32253) were available for the Ae. tauschii accession TA1662 and the hexaploid wheat genotype KS05HW14. A custom Python script was used to parse KS05HW14 and TA1662 sequences by barcode and trim barcode sequences. A BLAST database was created from the KS05HW14 and TA1662 sequences. The SrTA 1662 sequence was queried onto each database. Three GBS tags from KS05HW14 and one tag from TA1662 with Psfl cut sites were aligned to the SrTA 1662 sequence using MUSCLE (https://www.ebi.ac.uk/Tools/msa/muscle/) and targeted for primer design.
PCR conditions: The forward and reverse primer pairs, R1.F3 5’-
AGT AAACT CCCGGCT GAGAT A-3’ (SEQ ID NO: 5) and R1.R3 5’- GCCG AG ACT GAACT CAACT C-3’ (SEQ ID NO: 6), respectively, amplify a 750 bp fragment in genotypes carrying SrTA 1662 (SrTA 1662/SrTA 1662; SrT A 1662/srT A 1662) , while no fragment is amplified in individuals lacking a SrTA 1662 allele
(srTA 1662/srT A 1662) (Fig. 9a). Reaction conditions for amplification of the SrTA 1662 fragment were as follows: 13.6 mL ddH20, 2.4 mL 10X reaction buffer, 1.9 mL 2.5 mM dNTPs, 2.0 mL 10 pM of forward primer, 2.0 mL 10 pM of reverse primer, 0.1 mL (0.5 U) Taq polymerase and 2 mL of template DNA (30 ng/mL). Cycling conditions include an initial denaturation of 95 °C (5 min) followed by 35 cycles of 95 °C (45 s), 63 °C (45s), and 72 °C (90s), and a final extension at 72 °C (10 min).
Mapping of the SrTA 1662 fragment: The R1 F3 + R1 R3 PCR fragment was mapped to a 3.8 cM interval in 83 individuals from the U6714 BC2F1 mapping population (Olson et at., 2013, Theor Appl Genet 126: 1179-1 188) using six SSR markers and one EST-derived marker (Fig. 9b). Marker order and distances were determined using R/qtl (Broman et ai, 2003, Bioinformatics 19: 889-890). The 750 bp R1.F3+R1.R3 PCR fragment co segregated with resistance to PGT race QTHJC conferred by SrTA 1662 on 1 DS (i.e. zero recombinants were obtained among 83 BC2F1 plants).
Example 2:“CapRenSeq”
Introduction
CapRenSeq as described above is applied to an Ae. tauschii line in which we had identified candidate genes for stem rust, but where we require additional sequence for functional analysis. Results and Discussion
Ae. tauschii accession BW011 11 (also known as TOWWC1 12, available from the Germplasm Resources Centre; www.seedstor.ac.uk) was previously enriched with an NLR library optimised for Ae. tauschii to generate 2.43 Gb of raw resistance gene enrichment sequence (RenSeq) data (Arora et ai., 2018, Resistance gene discovery and cloning by sequence capture and association genetics. www.biorxiv.org/content/earlv/2018/01/15/248146). The sequences were trimmed using Trimmomatic v0.2 (Bolger et ai., 2014, supra) and de novo assembled with the CLC Assembly Cell (http://www.clcbio.com/products/clc-assemblv-cell/) using word size (-w = 64) with standard parameters. This resulted in a RenSeq assembly containing 298 complete and 1 ,171 partial NLRs (Table 7).
Subsequently, two lllumina Nextera long mate-pair (LMP) libraries with insert sizes of 3 kb and 6 kb (LMP3 and LMP6, respectively) were converted into lllumina TruSeq libraries and enriched with Triticeae NLR library V3. This generic Triticeae NLR library contains -200- thousand baits from NLR exons and repeat-masked introns from Triticum aestivum, T. diccocoides, T. diccocoides, T. urartu, Ae. tauschii, Ae. speltoides, Ae. sharonensis, Ae. peregrina, Ae. umbellulata, Secale cereale and Hordeum vuigare. The original BW01 11 1 RenSeq assembly was scaffolded with SOAPdenovo2 (Luo et al., 2012, GigaScience 18, doi: 10.1 186/2047-217X-1-18) using LMP3 and LMP6 and filtered by NLR-Parser (Steuernagel et ai., 2015, supra). This resulted in 399 complete and 744 partial NLRs (Table 7). Moreover, the total size of the assembly increased 1.75-fold from 4.84 Mb to 8.47 Mb and the average scaffold size increased 2.25-fold from 3,296 bp to 7,413 bp (Table 7).
To investigate the reason for the increased scaffold size in RenSeq+LMP3+LMP6, we looked at the sequences of three known stem rust (Sr) resistance genes, namely Sr45 (Steuernagel et al., 2016, supra), Sr46 (NCBI accession number MG851023) and SrTA 1662 (NCBI accession number MG76391 1). These three genes were previously found to be present in accession BW_01 11 1 (Arora et ai., 2018, supra). We also looked at contig_1689_1 (6,364 bp).
For SrTA 1662 we found that the original contig had increased from 7,557 to 13,093 bp without any gap. For Sr45, the original contig (5,172 bp) had been joined to another contig (862 bp) with a gap spanned by 4, 103 N’s. Similarly, the Sr46 contig (-4,400 bp) had been joined to another contig (-400 bp) with a gap of -5,000 Ns. There was no increase in the length of the original Sr45 and Sr46 contigs. The contig_1689_1 increased in length from 6,260 to 8,731 bp, and was also joined to two other contigs of lengths 1 ,205 bp and 1 ,332 with gaps of 1 ,630 and 1 ,611 Ns, respectively.
These results show that RenSeq assemblies can be improved by scaffolding with LMP libraries. Further improvements could be obtained by scaffolding with an additional LMP library with a 10 kb insert size, and/or by scaffolding with linked read sequences (e.g. obtained from a 10x Genomics library). The gaps (Ns) in the scaffolds could
subsequently be filled in by (i) aligning the LMP reads to the scaffolds and performing local assemblies, and/or (ii) PCR and Sanger sequencing.
Example 3
For association analysis, we prefiltered the k-mers by calculating the correlation of NLR- associated k-mers with the phenotype, and only retained the positively correlated k- mers. For each filtered k-mer, a linear regression model (Lees, et ai 2016, Nat Corns 7: 12797) was fitted to predict the phenotype of an accession based on whether it contains the k-mer, while using a number of significant PCA dimensions obtained from the SNP- marker matrix as covariates to control for the population structure. The regression model for a particular k-mer can be described by:
Figure imgf000029_0001
where pheno represents the vector of the phenotype scores, K represents the presence/absence vector of the k-mer, and P1, P2, - , Pn represent the n most significant PCA dimensions. Fitting the above regression model means finding the coefficients
Figure imgf000029_0002
such that the Euclidean distance between the vectors on the left and the right hand sides of the above expression is minimized. A likelihood ratio test for nested models (Wilks 1938, Ann Math Stat 9: 60-62) was then used to obtain a p-value for each k-mer.
The presence/absence of each k-mer in the Ae. tauschii panel was checked for association with stem rust race RKQQC. using a method based on univariate linear regression, and the regression scores of k-mers were then mapped to the assembly of the resistant accession, BW_01005 (see Arora et ai, 2018, supra). This resulted in the identification of the positive control, Sr33, along with another candidate, as shown in Fig. 12a.
Mapping onto other resistant accessions resulted in more candidates. To determine which of these candidates merit functional testing, the presence/absence of all the candidate contigs (having a regression score above a threshold of 20) in the Ae. tauschii panel was inferred using BLAST+ 2.2.30 (Camacho et al., 2009, BMC Bioinformatics 10: 421), while grouping them into sets containing elements with a high sequence similarity. This grouping resulted in five sets.
A representative from each set, namely contig_1208_1 of BW_01084, contig_1208_1 of BW_01086, conti g_622_1 of BW_01084, contig_51_1 of BW_01010, contig_2059_1 of BW_01115, was taken and their presence/absence was jointly tested for association with RKQQC using multivariate lasso regression (Friedman et al., 2008, J. Statistical Software 33: 1-22). The contig_1208_1 of BW_01084 corresponds to Sr33, while the contig_622_1 of BW_01084 shares high sequence similarity with the second candidate in Fig. 12b. The first two candidates, including the positive control Sr33, were found to have a much higher effect size relative to others, the fourth had a moderate effect size, while the rest two had a negligible effect size (Fig. 12b).
This procedure reduces background (false positives) and allows the identification of rare candidate gene variants.
ACKNOWLEDGEMENTS
This research was supported by the NBI Computing Infrastructure for Science (CiS) group and financed by the Two Blades Foundation, USA, the Biotechnology and
Biological Sciences Research Council (BBSRC), the BBSRC Designing Future Wheat Cross-Institute Strategic Programme, the Norwich Research Park Translational Fund, the Lieberman-Okinow Endowment at the University of Minnesota; the Grains Research and Development Corporation; and in-kind support by MYcroarray. 3 o
Table 1 : Comparison of gene (such as R gene) cloning strategies.
n
H
O
w
Figure imgf000031_0001
Figure imgf000032_0001
Figure imgf000033_0001
Figure imgf000034_0001
**This Example.
Table 3: Stem rust resistance genes identified by AgRenSeq in Aegilops tauschii.
Figure imgf000035_0001
Table 4: List of binary constructs carrying Sr genes.
Figure imgf000035_0002
Table 5: Diseases of bread wheat for which genetic variation in Aegilops tauschii has been observed.
Figure imgf000036_0001
Figure imgf000037_0001
Table 6: Examples of crop wild relative germplasm panels.
Figure imgf000038_0001
Figure imgf000039_0001
Figure imgf000040_0001
Figure imgf000041_0001
Figure imgf000042_0001
Table 7. CapRenSeq of Ae. tauschii accession BW01111 with an LMP3 and LMP6 library.
Figure imgf000043_0001
Although the present invention has been described with reference to preferred or exemplary embodiments, those skilled in the art will recognize that various modifications and variations to the same can be accomplished without departing from the spirit and scope of the present invention and that such modifications are clearly contemplated herein. No limitation with respect to the specific embodiments disclosed herein and set forth in the appended claims is intended nor should any be inferred. All documents cited herein are incorporated by reference in their entirety.

Claims

1. A method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:
(a) providing a database of genetic samples from a population of the organism, wherein the genetic samples comprise nucleic acids;
(b) optionally screening the database for genetic samples or nucleic acids associated with the selected phenotype;
(c) obtaining nucleic acid samples from the database;
(d) producing enriched samples by enriching the nucleic acid samples for nucleic acids associated with the selected phenotype;
(e) obtaining sequence information from the enriched samples to provide enriched sample sequences;
(f) identifying sub-sequences of a fixed length k (k-mers) associated with the selected phenotype within the enriched sample sequences; and
(g) mapping the identified k- mers onto each of the enriched sample sequences to identify one or more genes associated with the selected phenotype.
2. The method according to claim 1 , wherein step (f) comprises recording the presence or absence of k- mers in each enriched sample sequence and ascribing a score to each k-mer according to the association of each k-mer with the selected phenotype.
3. The method according to either of claim 1 or claim 2, wherein step (g) comprises mapping the identified k- mers onto each of the enriched sample sequences to identify high scoring regions in the enriched sample sequences.
4. The method according to any preceding claim, wherein step (d) comprises enriching the samples for nucleic acids associated with the selected phenotype by hybrid- based capture enrichment.
5. The method according to claim 4, comprising designing a target sequence capture library for enriching the nucleic acids.
6. The method according to any preceding claim, wherein step (d) comprises depleting from the nucleic acid samples nucleic acids not associated with the selected phenotype.
7. The method according to any preceding claim, wherein the nucleic acid samples are enriched for nucleic acids for plant resistance (R) proteins.
8. The method according to claim 7, wherein the nucleic acid samples are enriched for nucleic acids for nucleotide-binding and leucine-rich repeat (NLR) proteins.
9. The method according to either of claim 7 or claim 8, wherein the nucleic acid samples are enriched for nucleic acids for membrane-bound receptor proteins.
10. The method according to any preceding claim, wherein step (f) comprises identifying k- mers of two or more fixed lengths k.
11. The method according to any preceding claim, wherein the organism is a plant.
12. The method according to any preceding claim, wherein step (e) comprises long mate-pair scaffolding of the enriched samples.
13. The method according to any preceding claim, wherein step (e) comprises alignment-free assembly of the enriched sample sequences.
14. The method according to any preceding claim, wherein the sequence information is obtained using high-throughput sequencing.
15. The method according to any preceding claim, wherein the sequence information is obtained using Linked-Reads sequencing.
16. The method according to any preceding claim, wherein the genetic samples comprise multiple nucleic acids associated with similar phenotypes.
17. The method according to claim 16, wherein the multiple nucleic acids within each genetic sample are associated with the same phenotype.
18. The method according to any preceding claim, wherein step (e) comprises obtaining sequence information from a pool of enriched samples having defined levels of association with the selected phenotype or an opposite phenotype.
19. The method according to any preceding claim, wherein the database comprises genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.
20. The method according to claim 19, wherein the database comprises genetic samples from a plantation crop plant.
21. The method of any preceding claim, wherein step (g) comprises an additional step of grouping the enriched sample sequences based on the presence or absence of a contig, and testing association of the contig with the presence of the k- mer.
22. The method according to any preceding claim, wherein the database comprises or consists of genetic samples from wheat (for example Triticum urartu, Triticum aestivum, Aegilops sharonensis or Ae. speltoides), rye (for example Secale cereale), millet, onion, palm, tomato, pepper (for example Capsicum annuum), tobacco, canola, cotton (for example Gossypium ssp.), peanut, alfalfa, sunflower, safflower, Hordeum vulgare, oat (for example Avena sativa ), Oryza spp., maize (for example Zea mays), sorghum (for example Sorghum bico/oή, sugarcane (for example Saccharum spp.), banana (for example Musa spp.), potato (for example Solarium spp.), Ipomoea batatas, soya bean (for example Glycine max or G. soja), Cicer arietinum, Pisum sativum, cassava (for example Manihot escuienta), Dioscorea spp., sugar beet (for example Beta vulgaris), legumes (for example Favaceae spp.), Cannabis sativa, Arabidopsis thaliana or Theobroma cacao.
PCT/GB2019/050077 2018-01-12 2019-01-11 Method for identifying genes associated with a particular phenotype WO2019138244A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1800558.7A GB201800558D0 (en) 2018-01-12 2018-01-12 Method
GB1800558.7 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019138244A1 true WO2019138244A1 (en) 2019-07-18

Family

ID=61256265

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2019/050077 WO2019138244A1 (en) 2018-01-12 2019-01-11 Method for identifying genes associated with a particular phenotype

Country Status (2)

Country Link
GB (1) GB201800558D0 (en)
WO (1) WO2019138244A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289384A (en) * 2020-10-15 2021-01-29 华中农业大学 Construction method and application of whole citrus genome KASP marker library
CN113628685A (en) * 2021-07-27 2021-11-09 广东省农业科学院水稻研究所 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data
WO2022008496A1 (en) 2020-07-07 2022-01-13 John Innes Centre Association mapping method
CN117746979A (en) * 2024-02-21 2024-03-22 中国科学院遗传与发育生物学研究所 Animal variety identification method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Alignment-free sequence analysis - Wikipedia", 26 November 2017 (2017-11-26), XP055572619, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Alignment-free_sequence_analysis&oldid=812187681#Methods_based_on_k-mer/word_frequency> [retrieved on 20190321] *
ATIF RAHMAN ET AL: "Association Mapping From Sequencing Reads Using K-mers", BIORXIV, 23 May 2017 (2017-05-23), XP055572986, Retrieved from the Internet <URL:https://www.biorxiv.org/content/biorxiv/early/2017/07/19/141267.full.pdf> [retrieved on 20190322], DOI: 10.1101/141267 *
BURKHARD STEUERNAGEL ET AL: "Rapid cloning of disease-resistance genes in plants using mutagenesis and sequence capture", NATURE BIOTECHNOLOGY, vol. 34, no. 6, 25 April 2016 (2016-04-25), New York, pages 652 - 655, XP055572159, ISSN: 1087-0156, DOI: 10.1038/nbt.3543 *
GUOTAI YU ET AL: "Discovery and characterization of two new stem rust resistance genes in Aegilops sharonensis", THEORETICAL AND APPLIED GENETICS ; INTERNATIONAL JOURNAL OF PLANT BREEDING RESEARCH, vol. 130, no. 6, 8 March 2017 (2017-03-08), Berlin, DE, pages 1207 - 1222, XP055572666, ISSN: 0040-5752, DOI: 10.1007/s00122-017-2882-8 *
KARL J V NORDSTRÖM ET AL: "Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers", NATURE BIOTECHNOLOGY, vol. 31, no. 4, 10 March 2013 (2013-03-10), New York, pages 325 - 330, XP055572994, ISSN: 1087-0156, DOI: 10.1038/nbt.2515 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022008496A1 (en) 2020-07-07 2022-01-13 John Innes Centre Association mapping method
CN112289384A (en) * 2020-10-15 2021-01-29 华中农业大学 Construction method and application of whole citrus genome KASP marker library
CN112289384B (en) * 2020-10-15 2024-02-20 华中农业大学 Construction method and application of citrus whole genome KASP marker library
CN113628685A (en) * 2021-07-27 2021-11-09 广东省农业科学院水稻研究所 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data
CN113628685B (en) * 2021-07-27 2022-03-15 广东省农业科学院水稻研究所 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data
CN117746979A (en) * 2024-02-21 2024-03-22 中国科学院遗传与发育生物学研究所 Animal variety identification method

Also Published As

Publication number Publication date
GB201800558D0 (en) 2018-02-28

Similar Documents

Publication Publication Date Title
Arora et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics
Le Nguyen et al. Next-generation sequencing accelerates crop gene discovery
Nadeem et al. DNA molecular markers in plant breeding: current status and recent advancements in genomic selection and genome editing
Deng et al. A telomere-to-telomere gap-free reference genome of watermelon and its mutation library provide important resources for gene discovery and breeding
Singh et al. Next‐generation sequencing for identification of candidate genes for Fusarium wilt and sterility mosaic disease in pigeonpea (C ajanus cajan)
Jiang Molecular markers and marker-assisted breeding in plants
Yu et al. Aegilops sharonensis genome-assisted identification of stem rust resistance gene Sr62
Delseny et al. High throughput DNA sequencing: the new sequencing revolution
Stanton-Geddes et al. Candidate genes and genetic architecture of symbiotic and agronomic traits revealed by whole-genome, sequence-based association genetics in Medicago truncatula
Mascher et al. Mapping-by-sequencing accelerates forward genetics in barley
Yu et al. Identification of genome-wide variants and discovery of variants associated with Brassica rapa clubroot resistance gene Rcr1 through bulked segregant RNA sequencing
Chagné et al. Development of a set of SNP markers present in expressed genes of the apple
Ben-Ari et al. Marker-assisted selection in plant breeding
Jiang Molecular marker-assisted breeding: a plant breeder’s review
WO2019138244A1 (en) Method for identifying genes associated with a particular phenotype
Yang et al. Rapid development of molecular markers by next-generation sequencing linked to a gene conferring phomopsis stem blight disease resistance for marker-assisted selection in lupin (Lupinus angustifolius L.) breeding
Wang et al. From genome to gene: a new epoch for wheat research?
Barbey et al. Disease resistance genetics and genomics in octoploid strawberry
Fletcher et al. Development of a next-generation NIL library in Arabidopsis thaliana for dissecting complex traits
Nguyen et al. Tools for Chrysanthemum genetic research and breeding: Is genotyping-by-sequencing (GBS) the best approach?
BR112017026015B1 (en) METHODS FOR SELECTING A CORN PLANT WITH RESISTANCE TO STALK ROT CAUSED BY ANTHRACNOSIS AND METHOD FOR INTROGRESSING A QTL ALLELE ASSOCIATED WITH RESISTANCE TO STALK ROT CAUSED BY ANTHRACNOSIS INTO A CORN PLANT
Shimizu et al. Markers, maps, and marker-assisted selection
Rajendran et al. Genotyping by sequencing advancements in barley
US20070048768A1 (en) Methods for screening for gene specific hybridization polymorphisms (GSHPs) and their use in genetic mapping and marker development
US20070192909A1 (en) Methods for screening for gene specific hybridization polymorphisms (GSHPs) and their use in genetic mapping ane marker development

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19701008

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19701008

Country of ref document: EP

Kind code of ref document: A1