WO2022008496A1

WO2022008496A1 - Association mapping method

Info

Publication number: WO2022008496A1
Application number: PCT/EP2021/068625
Authority: WO
Inventors: Brande WULFF; Kumar GAURUV
Original assignee: John Innes Centre
Priority date: 2020-07-07
Filing date: 2021-07-06
Publication date: 2022-01-13
Also published as: GB202010388D0

Abstract

The invention is directed to a method for identifying one or more genes associated with a selected phenotype in an organism. The method includes the steps of providing sequence information derived from a database of genetic samples from a population of the organism, wherein the genetic samples comprise nucleic acids, and wherein the sequence information is comprised of a plurality of contigs, identifying sub-sequences of a fixed length k (k-mers) associated with the selected phenotype within the sequence information, ordering one or more k-mers using a reference genome assembly, and mapping the k-mers onto one or more of the ordered contigs to identify one or more genes associated with the selected phenotype.

Description

Association Mappinq Method

FIELD OF THE INVENTION

The invention relates to the field of genetic association and gene identification, and in particular to methods for identifying genes associated with a particular phenotype.

BACKGROUND OF THE INVENTION

Genome-Wide Association Studies (GW AS) enable trait correlation in a genetically diverse population structure by exploiting pre-existing recombination events that have accumulated in natural populations. Members within a population are screened with a genome-wide set of markers and linkage disequilibrium sought between markers and traits of interest. Markers linked to traits of interest are then readily assigned to a genomic region in a reference genome. However, GW AS relies on a reference genome and this makes identification of sequences that have significantly diverged from the reference much more challenging. Indeed, a single reference genome typically only represents a fraction of the genetic variation present in the whole species complex (i.e. the ‘pan-genome’). This is particularly relevant for genes that are subject to strong diversifying selection, which may result in haplotypes with pronounced copy number and sequence variation. An example of such a gene class are the plant disease resistance {R) genes, most of which encode intracellular nucleotide binding/leucine-rich repeat (NLR) immune receptor proteins. NLR genes are amongst the most evolutionarily dynamic gene families in plants with hundreds of these genes usually present in plant genomes. In GWAS, if the target haplotype is sufficiently diverged or altogether absent from the reference genome, this can lead to no significant signal being detected.

Identifying specific resistance genes in Aegilops tauschii(Ae. tauschii) by combining /c-mer based reference-free association genetics with resistance gene enrichment sequencing (AgRenSeq) has been previously described in WO2019138244A1 , which is incorporated by reference.

These previous techniques make use of enrichment sequencing to reduce the cost of sequencing and to improve the representation of NLR genes in the library. This allows NLR genes to be more readily identified from the genetic background but has the limitations of being restricted to discovering only R genes encoded by NLRs. If a given R gene is not encoded by an NLR or not included within the bait library, it will be excluded from later consideration. With the continuing decrease in the cost of sequencing, techniques like Whole Genome Shotgun (WGS) become increasingly cost-effective while avoiding the disadvantages of enrichment sequencing such as the need for bait library design and desirable sequence information falling outside the enriched sequences.

However, WGS techniques can produce sequence information that is highly disordered and highly repetitive, and is therefore difficult to interpret. This makes it very difficult to localise the presence of genetic markers and to extract useful information from GW AS. There therefore remains a need to improve GW AS techniques for large data sets such as those produced by WGS. The present invention addresses this need.

SUMMARY OF THE INVENTION

According to an aspect of the present invention, there is provided a method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:

(a) providing sequence information derived from a database of genetic samples from a population of the organism, wherein the genetic samples comprise nucleic acids, and wherein the sequence information is comprised of a plurality of contigs;

(b) identifying sub-sequences of a fixed length k f/c-mers) associated with the selected phenotype within the sequence information

(c) ordering one or more contigs using a reference genome assembly; and

(d) mapping the /c-mers onto one or more of the ordered contigs to identify one or more genes associated with the selected phenotype.

The one or more genes may for example be R genes.

The database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.

BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 is a flow diagram showing an embodiment of the invention. Fig. 2 is a diagram depicting combining association genetics and whole genome shotgun (WGS) sequencing for rapid gene discovery and cloning (a) A genetically diverse panel of accessions (Ac) is (b) phenotyped, e.g. with different pathogen races or another trait of interest, and (c) (i, ii) subjected to whole genome shotgun sequencing followed by (iii) extraction of k- mers for each accession, (iv) assembly of an accession containing the target gene/trait, (v) aligning the assembled contigs to a reference genome, and (vi) determination of k- mer — trait associations and projection of the associated k- mers to the aligned contigs.

Fig. 3 shows photographs of a phenotypic diversity in the Ae. tauschii panel on the second leaf in response to various PGT races. The infection types were scored using the Stakman infection type scale and the scores ranged from: resistant (;0, ;1 -, 1 ), moderately resistant (1 +, 2), moderately susceptible (2+, 3), to susceptible (3+, 4). The accession number and PGT race of the infected leaves from left to right are: BW 01020 (TTKSK), BW 01114 (TPMKC); BW 01020 (TTTTF); BW_01084 (TRTTF); BW_01081 (TTKSK); BW_01039 (TRTTF); BW 01065 (TTKSK); BW_01184 (TPMKC); BW_01125 (TTKSK); BW_01012 (TPMKC).

Fig. 4. Re-identification of Sr45 and Sr46, providing resistance to Puccinia graminis t. sp. tritici race TTKSK, using association mapping with WGS sequencing. Black/grey dot on the y-axis represents one or more k- mers significantly associated with resistance/susceptibility, respectively, to TTKSK across the diversity panel. Dot size is proportional to the number of k- mers having the specific value on the y-axis. (a) Significantly associated k- mers mapped to the Ae. tauschii reference accession, AL8/78, which is susceptible to TTKSK. Each dot-column on the x-axis corresponds to a 10 kb genomic block starting from that position. The peaks marked Sr45 and Sr46 contain the non-functional (not providing resistance to TTKSK) alleles of Sr45 and Sr46. The prominent peaks in the unanchored region are in linkage disequilibrium with Sr46. (b) Significantly associated /c-mers mapped to the de novo assembly of TOWWC-113, an Ae. tauschii accession resistant to TTKSK. Each dot-column on the x-axis represents an unordered contig from the de novo assembly. The random ordering on the x-axis and the small contig size makes it impossible to identify Sr45 and Sr46 from the background noise arising mostly due to linkage disequilibrium (c) Significantly associated /c-mers mapped to the de novo assembly of TOWWC-113, where each contig has been ordered by anchoring to the reference genome of AL8/78. Most of the contigs significantly associated with resistance have concentrated around the genomic loci of Sr45 and Sr46. (d) Significantly associated /c-mers mapped to the ordered contigs of TOWWC-112, another resistant accession with a much improved de novo assembly, leading to further improvement in signal-to-noise ratio.

DETAILED DESCRIPTION OF THE INVENTION According to one aspect of the present invention, there is provided a method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:

(b) identifying sub-sequences of a fixed length k (k- mers) associated with the selected phenotype within the sequence information;

(c) ordering one or more contigs using a reference genome assembly; and

(d) mapping the k- mers onto one or more of the ordered contigs to identify one or more genes associated with the selected phenotype.

Ordering the contigs using a reference genome assembly allows the k- mers to be mapped to a particular location within the reference assembly. This prevents k- mers from becoming distributed across the measured sequence information, which generates a lot of noise. Instead ordering the contigs to which the k- mers are mapped, localizes the k- mers to the same linkage disequilibrium block. This strategy also allows positively associated /c-mers to be mapped to regions of the genome that are significantly diverged between the reference genome and the target genome which in turn allows identification of genes that may be absent in the reference genome.

Certain steps may be carried out in a different order to that depicted above. The present invention, in contrast to most published association studies, allows for reference genome-independent gene discovery and cloning. On provision of a database of genetic samples from a population (for example, a database such as a “diversity panel” or “diversity collection”), significantly reducing the amount of additional crossing or mutagenesis required for gene identification. Thus, for example, cloning of genes from accessions of a wild plant species lacking advanced agronomic traits is possible, allowing plant breeders to discover genes in their breeding or pre-breeding germplasm and develop gene-specific markers for marker assisted selection.

Step (a) may comprise obtaining sequence information

Step (b) may comprise recording the presence or absence of one or more /c-mers associated with the selected phenotype and ascribing a score to the one or more /c-mers according to the association of the one or more /c-mers with the selected phenotype.

Step (b) may further comprise selecting /c-mers with a score with a value greater than a first threshold value and/or lower than a second threshold value, for mapping in step (d).

Step (d) may comprise mapping the identified /c-mers onto each of the ordered contigs to identify high scoring or low scoring regions in the sequence information.

The reference genome assembly may be a chromosome-level reference genome assembly. Having a reference assembly which is ordered to a level where sequence information is ascribed to individual chromosomes allows for a much greater level of relative ordering when ordering contigs with respect to the reference assembly. This ordering brings contigs that are in the same linkage disequilibrium block together, allowing /c-mers to be mapped to a specific linkage disequilibrium block. This reduces the signal- to-noise ratio as the noise caused by /c-mers not being specifically mapped is reduced.

Preferably, the reference genome assembly is derived from a genetic sample of a related organism, preferably an organism with a highly similar genome to the tested organism. Wherein, a genome is considered highly similar to another genome if they have 95%, 96%, 97%, 98%, 99% or greater sequence identity when comparing similarity over 1 , 2, 3, 4, 5, 6, or more chromosomes or genomic regions. Preferably, the reference genome assembly would be derived from a different individual in the same species or the same genome group as the tested organism. For example, Ae. tauschii is a D genome progenitor of hexaploid ABD bread wheat and so Ae. tauschii contigs can be readily aligned to the D genome of hexaploid wheat.

Step (a) may further comprise producing enriched samples by enriching the nucleic acid samples associated with selected phenotype. However, this is not necessarily preferred due to the potential bias introduced by enriching the nucleic acid samples.

Optionally, the enrichment may comprise enriching the samples for nucleic acids associated with the selected phenotype by hybrid-based capture enrichment. Here, the method may comprise designing a target sequence capture library for enriching the nucleic acids.

Optionally, the enrichment may comprise depleting from the samples nucleic acids not associated with the selected phenotype.

According to the method, the samples may be enriched for nucleic acids for plant resistance ( R ) proteins. For example, the samples may be enriched for nucleic acids for nucleotide-binding and leucine-rich repeat (NLR) proteins. The samples may be enriched for nucleic acids for membrane-bound receptor proteins.

The nucleic acids of the database may comprise non-gene-related sequences, for example structural nucleic acids that may be associated with the selected phenotype.

Step (b) may comprise identifying k- mers of one or more fixed lengths k.

The organism may for example be a plant. Alternatively, the organism may be a mammal such as a human, a non-human primate, a rat or a mouse.

Step (a) may comprise alignment-free assembly of the enriched sample sequences.

The sequencing step (a) may comprise long mate-pair scaffolding of the enriched samples (see “CapRenSeq” procedure described in Reference Example 2).

The sequence information may be obtained using high-throughput sequencing or Linked- Reads sequencing.

The sequence information may be whole genome shotgun (WGS) sequence data. WGS data can be highly disordered, particularly if it has low coverage or if the genome is highly repetitive such as in plant genomes. Using the technique of the invention allows better mapping of identified /c-mers to within the same linkage disequilibrium block within an organism’s genome and allows genes associated with a particular phenotype (e.g. resistance genes) to be more readily identified.

The ordering step in step (c) may comprise identifying the longest sequence alignment for each contig to a corresponding sequence in the reference genome assembly. It is preferred to order the contigs such that all contigs from a given linkage disequilibrium block are brought together. A significant reduction in noise from disordered /c-mers is achieved when contigs are matched to their longest alignment in the reference assembly. The genetic samples may comprise multiple nucleic acids associated with similar phenotypes. The multiple nucleic acids within each genetic sample may be associated with the same phenotype.

The database may comprise genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant. For example, the database may comprise genetic samples from a plantation crop plant.

The database of the method may comprise or consist of genetic samples from the following non-limiting list of plants: wheat (for example Triticum urartu, T. aestivum, Aegilops sharonensis or Ae. speltoides), rye (for example Secale cereale), millet, onion, palm, tomato, pepper (for example Capsicum annuum), tobacco, canola, cotton (for example Gossypium ssp.), peanut, alfalfa, sunflower, safflower, Hordeum vulgare, oat (for example Avena sativa ), Oryza spp., maize (for example Zea mays), sorghum (for example Sorghum ίooΐoή, sugarcane (for example Saccharum spp.), banana (for example Musa spp.), potato (for example Solarium spp.), Ipomoea batatas, soya bean (for example Glycine max or G. soja), Cicer arietinum, Pisum sativum, cassava (for example Manihot escuienta), Dioscorea spp., sugar beet (for example Beta vulgaris), legumes (for example Favaceae spp.), brassicas (for example Brassica napus), Cannabis sativa, Arabidopsis thaliana or Theobroma cacao.

Fig. 1 illustrates steps involved in a method according to an aspect of the invention.

Whole genome sequencing, for example Whole Genome Shotgun sequencing, is conducted on a database such as a diversity panel of samples or a pool of samples. The k- mers in each sample of the diversity panel or the pool of samples (individual data sets) are counted and a k- mer matrix is generated for all k- mers. The complexity of the data set may be reduced by removing k- mers below or above a certain percentage threshold (minimal and maximal k- mer frequency). The diversity panel samples or pool of samples are phenotyped and the phenotype scores may be converted. The phenotype scores for each k- mer are then summed up and projected onto the assembly of an accession with the desired phenotype.

The invention may also be described as comprising the following steps:

(a) Select or provide a diversity panel;

(b) Perform whole genome shotgun sequencing of the diversity panel. (c) Record presence or absence of k- mers for each accession and create a matrix listing which accessions include which k- mers. For example, entries in the matrix may be 1 for “present” and 0 for “absent”.

(d) Phenotype the diversity panel for any phenotypic trait of interest that is likely to be associated with genes captured in step (b). Accessions are scored positive (e.g. +2) for presence of a phenotypic trait of interest and negative (e.g. -2) for an undesired phenotypic trait. Intermediate phenotypic traits may be scored at (or close to) zero;

(e) Associate phenotypes (d) with matrix (c) to derive a score for each -mer;

(f) Order contigs using a well-ordered chromosome-level reference genome assembly. Contigs are aligned to the reference assembly (preferring longest alignments);

(g) Project scores onto each contig of the newly ordered assembly. This accumulates all scores of -mers that are present in each contig; and

(h) Plot scores per contig. Data points representing several -mers with the same score may be highlighted. (i) Step (g) may comprise plotting the -mers onto a sequenced and assembled reference accession, in which case k-mer scores will be plotted per nucleotide or per gene, or in bins of a defined nucleotide or genetic length.

Step (e) may further comprise pre-filtering the /c-mers that have scores above a first desired threshold value and/or have scores below a second desired threshold value, as described in Reference Example 3 and in Example 4.

As used herein, the “database of genetic samples” may be derived from a “diversity panel” or “diversity collection”. The samples or accessions in the database may have been phenotyped for one or more selected traits of interest prior to performing steps of the invention, or the method of the invention may include a step of phenotyping the samples accessions in the database.

A “diversity panel” or “diversity collection” is a collection of genetic samples from a diverse population, typically species-specific, and may include nucleic acid fragments prepared from organismal nucleic acids (e.g. genomic DNA) by a method that can reveal polymorphisms or mutations (e.g. sequence differences) between samples. Diversity panels may be published and publicly available and/or custom designed according to the species and trait of interest. Individual uniquely identifiable samples within a diversity panel are termed accessions. The accessions within a diversity panel may be genetically unique or redundant.

Enrichment (also termed “target enrichment” herein) is a method by which particular genomic regions of interest are preferentially amplified or selected. Target enrichment can be performed before sequencing, to reduce the amount of sequencing and data processing required. Target enrichment methods include uniplex and multiplexed polymerase chain reaction (PCR), molecular inversion probe (MlP)-based approaches or amplicon based target enrichment and hybrid capture-based approaches.

In hybrid capture-based target enrichment, hybridized target-specific probes are hybridized to cDNA libraries. The probes may be attached to a microarray or in solution. Hybridized probes may be removed from solution by bead capture to isolate the target DNA from the background DNA. Probes for use in hybrid capture-based target enrichment may be provided as a generic or custom designed base capture library.

The target enrichment of the present invention may be performed for example according to the methods as described by Jupe et al. (2013, Plant J. 76:530-544).

Next generation sequencing produces large amounts of short reads from fragmented DNA. In order for the sequence to be analyzed the reads must be assembled to produce a reconstruction of the complete original DNA sequence, or assembly. This is done by aligning the reads, for example by finding overlaps between them. Assembly may be genuinely de novo without recourse to a reference, or it may be aided by consulting a reference genome. De novo assembly has the advantage of enabling of a DNA sequence without reference to external data that either may be unavailable (for example when a reference genome is not available) or bias the results, for example in the case of diverse species or diverse genomic regions.

Alignment free assembly of sequencing data is a process by which de novo genome assembly is performed. Advantages of alignment free assembly over alignment and reference sequence-based methods of sequence analysis include the increased speed of alignment free assembly. Alignment free methods for sequence analysis include methods based on k- mer analysis and information theory. As used herein, k- mer frequency may also be referred to as k- mer counting.

Methods based on k- mer frequency consider for a fixed or variable integer k, the relative frequencies of all possible k- mers for each of the input sequences and define a distance between the reads based on these frequency vectors. In other k- mer based methods each sequence position may be considered in terms of overlapping k- mers. Contiguous or spaced, e.g. pattern deterministic, k- mers may be used.

The method of the present invention comprises k- mer-based association genetics. The method may comprise performing trait associations on subsequences or k- mers generated from raw sequence reads. The method of the present invention may comprise extracting the k- mers from sequenced and assembled accessions from a diversity panel.

The method of the present invention may comprise computing the genetic identity of the accessions following sequencing and assembly of the sequence reads, based on known SNP markers or /c-mers. Redundant accessions can then be removed before further analysis. This step has the advantage of decreasing the data load of the subsequent analysis steps.

The sequencing used in the method of the present invention may be lllumina short-read sequencing by synthesis technology.

As used herein, the term “gene” refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA. The term gene may include one or more exons or part thereof, one or more introns or part thereof, all or a portion of the 5’ untranslated region, and all or a portion of the 3’ untranslated region.

The method of the invention may also be applied to identifying non-gene nucleic acids associated with a selected phenotype, for example non-coding nucleic acids such as nucleic acids involved in chromosome structure.

As exemplified by way of example in Reference Example 3 below, a comprehensive search across all the accessions may also be performed by grouping the contigs of all the accessions based on sequence similarity for significant candidate contigs, and then jointly testing to retain the most important ones. Advantageously, this allows false positives to be filtered out and aids identification of rare resistance genes. As used herein, the term “phenotype” refers to a detectable appearance or characteristic of an organism. Phenotype is understood to result from the interaction of the organism’s genotype with the environment. Phenotype includes individual phenotypic traits that are detectable and describable after visual inspection, measurement, and/or analysis by chemical, biochemical and/or molecular techniques.

Particular non-limiting embodiments of the present invention will now be described in detail.

EXAMPLES

Reference Examples 1 -3 are described in full in_WO2019138244A1 and Arora et al. 2019 (Nature Biotechnology, 37:139-143), which are incorporated by reference. The material in the reference examples below are to given concepts and discussion that are of use in understanding the present invention.

Reference Example 1 : “AgRenSeq”

Results and Discussion

Here, we describe the AgRenSeq analysis of a panel of Aegilops tauschii accessions that were phenotyped with races of the wheat stem rust pathogen Puccinia graminist sp. tritici (PGT). Ae. tauschii is the wild progenitor species of the Triticum aestivum (bread wheat) D genome and a number of stem rust resistance (Sr) genes that have been introgressed into bread wheat from this species. To test AgRenSeq, we chose Ae. tauschii ssp. strangulata because of the availability of genetically diverse accessions (see Jones et al., 2013, Theor Appl Genet 126: 1793-1808) and two cloned Srgenes from this species ( Sr33 and Sr45) to serve as positive controls. We obtained 174 Ae. tauschii ssp. strangulata accessions that span its habitat range and also included 21 Ae. tauschii ssp. tauschii accessions as an outgroup.

We designed and tested a sequence capture bait library optimised for Ae. tauschii NLRs plus genomic regions encoding 317 single nucleotide polymorphism (SNP) markers distributed across all seven chromosomes RenSeq analysis was undertaken on the diversity panel using this capture library and lllumina short-read sequencing, with 1 .66 ± 0.59 Gb of 250 bp paired-end sequence obtained per accession. De novo sequence assemblies were analysed with NLR-parser, a programme that identifies NLRs based on number and order of short conserved NLR sequence motifs. We obtained 1 ,437 ± 81 NLR contigs per accession, 299 ± 14 of which encoded full length NLRs and 1 ,137 ± 87 encoding partial NLRs. Captured SNP marker sequences were compared among accessions and a set of 151 genetically distinct Ae. tauschii ssp. strangulata accessions were selected. These accessions were phenotyped for their reaction to seven PGT races with different virulence profiles and geographic origins (Fig. 3). From 10% to 67% of the accessions showed resistance depending on the PGT race used.

To identify stem rust resistance (Sr) genes in the phenotyped panel, we performed a k- mer-based AgRenSeq analysis. The /c-mer sequences were given a numerical value based on the level of resistance or susceptibility observed for the accession in which they occurred, and these values were then summed for each k- mer. In previous k- mer based GW AS, associated k- mers were used to build a local assembly (Rahman et al., 2017, “Association mapping from sequencing reads using K-mers”, BioRxiv) or reconstruct diverged haplotypes through a direct base-by-base k- mer frequency-guided extension process (Audano etai, 2017, “Mapping-free variant calling using haplotype reconstruction from k- mer frequencies”, BioRxiv). However, the high degree of sequence identity between paralogous NLR genes, particularly in the more conserved NB-ARC (nucleotide-binding adaptor shared by APAF-1 , R proteins, and CED-4) domain, makes correct phasing and assembly difficult for these genes. We therefore mapped k-mers directly onto NLR assemblies of resistant accessions.

A proof-of-concept experiment was undertaken to identify the Sr33 gene amongst these accessions using the North American PGT race RKQQC which is avirulent on Sr33, but virulent on most other known Sr genes (Rouse et al., 2011 , Crop Science 51 : 2074-2078). We mapped k-mers associated with RKQQC resistance onto the RenSeq assembly of accession BW 01189 which is the original source of the Sr33 gene (Periyannan et al., 2013, Science 341 : 786-788). A single discrete association peak was identified in a 5.6 kb RenSeq contig that had 100% identity to Sr33. The same method was then used to map k-mers associated with resistance to all six remaining PGT races in the 151 accessions. Two non-redundant, high-confidence candidate Sr genes other than Sr33 were identified and these candidate Sr genes were aligned to the reference genome assembly of wheat cv. Chinese Spring RefSeql .O (www.wheatgenome.org). The genomic locations of these candidate Sr genes coincided with the genetic positions of two Sr genes introgressed into wheat from Ae. tauschii ssp. strangulata, namely Sr46 (Yu etal., 2015, Theor Appl Genet 128: 431 -443) and SrTA 1662 (Olson etal., 2013, Theor Appl Genet 126: 1179-1188). The SrTA 1662 candidate gene was mapped in a recombinant inbred line population to a 3.8 cM interval shown to encode SrTA 766-resistance, thus supporting our AgRenSeq data. The SrTA 1662 candidate encodes a coiled-coil NLR protein with 85% amino acid identity to Sr33.

Independently of this AgRenSeq approach, we also identified an Sr46 candidate gene by conventional fine mapping in segregating diploid progenitor and wheat populations coupled with the sequencing of candidate genes in this region in three ethyl methanesulphonate-derived mutants that had lost Sr46 resistance. In two mutants, the same candidate gene contained non-synonymous substitutions, while the third mutant had a deletion of the chromosomal segment encoding this gene. Comparison of the Sr46 candidate gene identified by AgRenSeq and map-based/mutagenesis cloning showed that they were 100% identical. This gene, hereafter referred to as Sr46, encodes a coiled coil, NLR protein that conferred stem rust resistance when expressed as a transgene in the susceptible wheat cv.

While identification of Sr33, Sr46 and SrTA1662 was successful we did not identify Sr45. We hypothesized this was due to the prevalence of Sr46 amongst the panel accessions masking Sr45 resistance as none of the seven PGT races utilized were virulent on Sr46 and avirulent on Sr45. To overcome this problem, we removed all 64 accessions containing Sr46 from the panel. The /c-mers associated with resistance to PGT race TTKSK in the remaining 87 accessions were then mapped onto the TTKSK-resistant Ae. tauschii accession BW 01083, the original donor of Sr45. Previously we had identified a candidate Sr45 gene by MutRenSeq from a wheat-Ae. tauschii BW 01083 introgression line. Using AgRenSeq, a single discrete association peak was identified that corresponded to this previously identified Sr45 candidate gene. This gene conferred resistance when transformed into cv Fielder confirming both our previous MutRenSeq identification (see Steuernagel et al., 2016, supra) and current AgRenSeq data. From AgRenSeq analysis and sequence alignments among these four cloned Sr genes and the diversity panel, Sr46 and SrTA1662 were the most prevalent being found in 42% of the accessions, while Sr33 and Sr45 were found in 5% and 7% of accessions, respectively.

Our results demonstrate that AgRenSeq is a robust protocol for discovering and cloning functional R genes from a diversity panel. This approach successfully identified four Sr genes that have been introgressed into wheat from Ae. tauschii ssp. strangulata over the last 40 years. In addition to stem rust resistance, Ae. tauschii is a rich source of genetic variation for resistance to other diseases and pests relevant to bread wheat cultivation including leaf rust, stripe rust, wheat blast, powdery mildew, Hessian fly, and others. Our RenSeq-configured diversity panel is essentially an flgene genotyped resource which can now be screened against other pathogens/pests to clone additional functional R genes.

In contrast to most published association studies, the AgRenSeq procedure as described in this example is reference genome-independent and directly identifies the NLR that confers resistance rather than identifying a genomic region encoding multiple paralogs for subsequent candidate gene confirmation. The /c-mer strategy could also be adapted for associative transcriptomics which allows trait correlation based on variation in transcript sequence and abundance (Harper etal., 2012, Nat Biotechnol 30: 798-802). The low basal level of NLR gene expression may necessitate costly and computationally challenging ultra-deep transcript sequencing unless cDNAs were enriched by RenSeq (Andolfo etal., 2014, BMC Plant Biol 14: 120). AgRenSeq is also not restricted to the limited genetic variation and recombination present in bi-parental populations, but rather interrogates pan genome sequence variation in diverse germplasm to isolate uncharacterised R genes (or other genes). Thus, importantly, no additional crossing or mutagenesis is required for R gene cloning using this approach. This makes it possible to clone R genes from accessions of a wild species lacking any advanced agronomic traits and requires only the phenotyping of an enrichment-sequenced diversity panel.

Diversity collections are available for many crops and their wild relatives including soybean, pea, cotton, maize, potato, wheat, barley, rice, banana and cocoa. The ability to rapidly clone agriculturally valuable R genes or other genes by AgRenSeq further increases the value of these germplasm collections. AgRenSeq may also allow breeders to discover genes in their breeding germplasm and develop gene-specific markers for marker assisted selection. Thus, this method has enormous utility for genetic improvement of most crops and will also provide fundamental insights into the structure and evolution of species-wide functional fl gene or other gene architectures.

Materials and Methods

The process for performing AgRenSeq is described in full in WO2019138244A1 and Arora et al. 2019 (ibid), which is incorporated by reference.

Reference Example 2: “CapRenSeq”

Introduction

Recent-established protocols now allow for the generation of high-quality assemblies of complex plant genomes such as polyploid wheat (see, for example, Avni et al. [2017] Science 357: 93-97). Requirements for the generation of those high-quality assemblies may include: (i) a 450 bp insert library sequenced to high coverage with 250 bp paired end reads to generate overlapping fragments for an initial assembly, followed by (ii) scaffolding with a series of long mate pair (“LMP”) libraries, and/or (iii) scaffolding with “Linked-Reads” (e.g. from 10x Genomics Inc.), which provides long-range information from short-read sequencing data. RenSeq may therefore also be performed on LMP or 10x libraries and the captured reads used to improve a standard RenSeq assembly. The elements of this “combined assembly procedure” (Cap) for the gene content in a whole genome can be applied in the present invention to the fraction of genes captured by RenSeq, for example to obtain NLR repertoires with a higher proportion of full-length genes (a procedure referred to herein as “CapRenSeq”). CapRenSeq as described above was applied to an Ae. tauschii line in which we had identified candidate genes for stem rust, but where we require additional sequence for functional analysis.

Results and Discussion

Ae. tauschii accession BW01111 (also known as TOWWC112, available from the Germplasm Resources Centre; www.seedstor.ac.uk) was previously enriched with an NLR library optimized for Ae. tauschii to generate 2.43 Gb of raw resistance gene enrichment sequence (RenSeq) data (Arora et al., 2018, Resistance gene discovery and cloning by sequence capture and association genetics (http://www.biorxiv.org /content/early/2018/01/15/248146). The sequences were trimmed using T rimmomatic v0.2 (Bolger et al., 2014, supra) and de novo assembled with the CLC Assembly Cell (http://www.clcbio.com/products/clc-assembly-cell/)using word size (-w = 64) with standard parameters. This resulted in a RenSeq assembly containing 298 complete and 1 ,171 partial NLRs (Table 1).

Table 1. CapRenSeq of Ae. tauschii accession BW01111 with an LMP3 and LMP6 library.

^aAs determined by NLR-Parser (Steuernagel etal., 2015, supra). ^bValues refer to contig/scaffold size in bp followed by number (n) of contigs/scaffolds.

Subsequently, two lllumina Nextera long mate-pair (LMP) libraries with insert sizes of 3 kb and 6 kb (LMP3 and LMP6, respectively) were converted into lllumina T ruSeq libraries and enriched with Triticeae NLR library V3. This generic Triticeae NLR library contains -200- thousand baits from NLR exons and repeat-masked introns from Triticum aestivum, T. diccocoides, T. diccocoides, T. urartu, Ae. tauschii, Ae. speltoides, Ae. sharonensis, Ae. peregrina, Ae. umbellulata, Secale cereale and Hordeum vulgare. The original BW01111 RenSeq assembly was scaffolded with SOAPdenovo2 (Luo et al., 2012, GigaScience 18, doi:10.1186/2047-217X-1 -18) using LMP3 and LMP6 and filtered by NLR-Parser (Steuernagel et al., 2015, supra). This resulted in 399 complete and 744 partial NLRs (Table 1). Moreover, the total size of the assembly increased 1.75-fold from 4.84 Mb to 8.47 Mb and the average scaffold size increased 2.25-fold from 3,296 bp to 7,413 bp (Table 1).

To investigate the reason for the increased scaffold size in RenSeq+LMP3+LMP6, we looked at the sequences of three known stem rust (Sr) resistance genes, namely Sr45 (Steuernagel et al., 2016, supra), Sr46 (NCBI accession number MG851023) and SrTA1662 (NCBI accession number MG763911 ). These three genes were previously found to be present in accession BW 01111 (Arora et al., 2018, supra). We also looked at contig_1689_1 (6,364 bp).

For SrTA1662 we found that the original contig had increased from 7,557 to 13,093 bp without any gap. For Sr45, the original contig (5,172 bp) had been joined to another contig (862 bp) with a gap spanned by 4,103 N’s. Similarly, the Sr46 contig (-4,400 bp) had been joined to another contig (-400 bp) with a gap of -5,000 Ns. There was no increase in the length of the original Sr45 and Sr46 contigs. The contig_1689_1 increased in length from 6,260 to 8,731 bp, and was also joined to two other contigs of lengths 1 ,205 bp and 1 ,332 with gaps of 1 ,630 and 1 ,611 Ns, respectively. These results show that RenSeq assemblies can be improved by scaffolding with LMP libraries. Further improvements could be obtained by scaffolding with an additional LMP library with a 10 kb insert size, and/or by scaffolding with linked read sequences (e.g. obtained from a 10x Genomics library). The gaps (Ns) in the scaffolds could subsequently be filled in by (i) aligning the LMP reads to the scaffolds and performing local assemblies, and/or (ii) PCR and Sanger sequencing.

Reference Example 3 Results and Discussion

For association analysis, we prefiltered the /c-mers by calculating the Pearson’s correlation coefficient of NLR-associated /c-mers with the phenotype, and only retained the positively correlated /c-mers.

For each filtered /c-mer, a linear regression model (Lees, etal. 2016, Nat Corns 7: 12797) was fitted to predict the phenotype of an accession based on whether it contains the k- mer, while using a number of significant PCA dimensions obtained from the SNP-marker matrix as covariates to control for the population structure. The regression model for a particular /c-mer can be described by:

where pheno represents the vector of the phenotype scores, K represents the presence/absence vector of the /c-mer, and Pi,P₂- _>P_n represent the n most significant PCA dimensions. Fitting the above regression model means finding the coefficients ^a _>b_ί>b_{2> >}b_h such that the Euclidean distance between the vectors on the left and the right hand sides of the above expression is minimized. A likelihood ratio test for nested models (Wilks 1938, Ann Math Stat 9: 60-62) was then used to obtain a p-value for each /c-mer. We then calculated a p-value score based on the negative log of the p-value such that higher p-value scores represent greater statistical significance.

The presence/absence of each /c-mer in the Ae. tauschii panel was checked for association with stem rust race RKQQC. using a method based on univariate linear regression, and the regression scores of /c-mers were then mapped to the assembly of the resistant accession, BW 01005 (see Arora etal., 2018, supra). This resulted in the identification of the positive control, Sr33, along with another candidate. Mapping onto other resistant accessions resulted in more candidates. To determine which of these candidates merit functional testing, the presence/absence of all the candidate contigs (having a regression score above a threshold of 20) in the Ae. tauschii panel was inferred using BLAST+ 2.2.30 (Camacho etal., 2009, BMC Bioinformatics 10: 421 ), while grouping them into sets containing elements with a high sequence similarity. This grouping resulted in five sets.

A representative from each set, namely contig_1208_1 of BW 01084, contig_1208_1 of BW 01086, contig_622_1 of BW_01084, contig_51_1 of BW_01010, contig_2059_1 of BW 01115, was taken and their presence/absence was jointly tested for association with RKQQC using multivariate lasso regression (Friedman etal., 2008, J. Statistical Software 33: 1 -22). The contig_1208_1 of BW 01084 corresponds to Sr33, while the contig_622_1 of BW 01084 shares high sequence similarity with the second candidate. The first two candidates, including the positive control Sr33, were found to have a much higher effect size relative to others, the fourth had a moderate effect size, while the rest two had a negligible effect size.

This procedure reduces background (false positives) and allows the identification of rare candidate gene variants.

Example 4

The principle behind the construction of the k-mer presence/absence matrix described in Reference Examples 1 -3 can be applied to whole genome shotgun (WGS) sequence data.

WGS data generally constitutes a far larger data set than that produced by enrichment sequencing. As discussed above, using WGS data can be a disadvantage in that processing WGS data is much more computationally expensive. Therefore, high performance computer clusters and computationally efficient data processing workflows are required in order to deal with the data in a reasonable timeframe. However, as discussed above, using WGS data has the significant advantage of avoiding bias introduced through the process of enrichment sequencing. Therefore, some changes are required to the operations in order to produce a useful output.

In the pre-filtering step described in Reference Example 3, we additionally retained negatively correlated /c-mers (those below the threshold of -0.2) rather than just positively correlated k- mers (above the threshold of 0.2). This meant that only non -correlated k- mers were removed.

In the linear regression step described in Example 3, k-mers with a p-value score (negative log p-value) exceeding the threshold of 15 are stored for mapping to different assemblies.

These significantly associated k- mers were mapped, via exact sequence matching, to multiple assemblies in order to query the species pan-genome for associated genes.

For a chromosome-level reference assembly, each dot-column on the x-axis of the graphs shown in Fig. 4a-d corresponds to a 10 kb genomic block. Dots on the plot represent the p-value scores of the significantly associated k-mers within each block. Dot size is proportional to the number of k-mers with the specific p-value score.

For an assembly with an N50 significantly smaller than the average linkage disequilibrium (LD) block size, the contigs were ordered by mapping them to the chromosome-level reference assembly using minimap2 (Li FI. Bioinformatics, 2018, 34(18):3094-3100), and assigning them the genomic coordinates of their longest hits. At the starting genomic coordinate thus assigned to each contig, the p-value scores of all the significantly associated k-mers within that contig are plotted as previously described. The ordering based on a high-quality reference genome brings most of the contigs in LD to a single locus, thus mitigating the noise and revealing the loci harboring the causal genes.

The aforementioned steps were used for WGS sequence data of 142 Ae. tauschii L2 accessions to re-identify the stem rust resistance (Sr) genes Sr45 and Sr46 (previously identified with AgRenSeq; Arora et al. 2019, Nature Biotechnology, 37:139-143) using the stem rust phenotype data for Puccinia graminis f. sp. tritici isolate 04KEN156/04, race TTKSK. k-mers deemed to be significantly associated to the TTKSK phenotype by our association mapping algorithm were first mapped to the Ae. tauschii reference genome AL8/78. The Ae. tauschii AL8/78 reference accession contains the non-functional (not providing resistance to TTKSK) alleles of Sr45 and Sr46. Therefore, the peaks obtained on chromosomes 1 D and 2D when mapping the associated k-mers to AL8/78 were derived from negatively correlated k-mers (depicted as grey, as opposed to black, dots in Fig. 4a) in the regions which are in linkage disequilibrium with the non -functional Sr45 and Sr46 alleles.

To identify the true Sr45 and Sr46 haplotypes we de novo assembled accession TOWWC- 113 (which carries Sr45 and Sr46) from 30x WGS data using MEGAHIT (Li D. et al., Bioinformatics, 2015, 31 (10):1674-1676), and mapped the associated /c-mers to the unordered contigs of this assembly (as previously described in AgRenSeq; Arora et al. 2019, ibid). However, there was too much noise to discern the positive signals for Sr45 and Sr46, likely due to the random distribution of short contigs (N50 1 .1 kb) along the x- axis (Fig. 4b). The ordering of the contigs using minimap2 and the AL8/78 reference genome, significantly improved the plot (Fig. 4c).

An improved assembly (N50 196.3 kb) of TOWWC-112 (which also carries Sr45 and Sr46) was generated using two long mate pair libraries, and the associated /c-mers were mapped to the ordered scaffolds of this assembly. The larger average scaffold size improved the accuracy of scaffold ordering using the reference, thus further reducing the noise in the association plot and producing clear, positive signals for Sr45 and Sr46 (black peaks, Fig. 4d).

Although the present invention has been described with reference to preferred or exemplary embodiments, those skilled in the art will recognize that various modifications and variations to the same can be accomplished without departing from the spirit and scope of the present invention and that such modifications are clearly contemplated herein. No limitation with respect to the specific embodiments disclosed herein and set forth in the appended claims is intended nor should any be inferred.

All documents cited herein are incorporated by reference in their entirety.

ACKNOWLEDGEMENTS

This research was supported by the NBI Computing Infrastructure for Science (CiS) group and financed by the Two Blades Foundation, USA, the Biotechnology and Biological Sciences Research Council (BBSRC), the BBSRC Designing Future Wheat Cross- Institute Strategic Programme, the Norwich Research Park Translational Fund, the Lieberman-Okinow Endowment at the University of Minnesota; the Grains Research and Development Corporation; and in-kind support by MYcroarray.

Claims

1 . A method for identifying one or more genes associated with a selected phenotype in an organism, the method comprising:

(c) ordering one or more contigs using a reference genome assembly; and

2. The method according to claim 1 , wherein step (a) further comprises screening the database for genetic samples or nucleic acids associated with the selected phenotype.

3. The method according to claim 1 or 2, wherein step (b) comprises recording the presence or absence of one or more k- mers associated with the selected phenotype and ascribing a score to the one or more k- mers according to the association of the one or more k- mers with the selected phenotype.

4. The method according to claim 3, wherein step (b) further comprises selecting k- mers with a score with a value greater than a first threshold value and/or lower than a second threshold value, for mapping in step (d).

5. The method according to either of claim 3 or claim 4, wherein step (d) comprises mapping the identified k- mers onto each of the ordered contigs to identify high scoring or low scoring regions in the sequence information.

6. The method according to any preceding claim, wherein the reference genome assembly is a chromosome-level reference genome assembly.

7. The method according to any preceding claim, wherein the reference genome assembly is derived from a genetic sample of a related organism, preferably with a highly similar genome to the tested organism.

8. The method according to any preceding claim, wherein step (b) comprises identifying k- mers of one or more fixed lengths k.

9. The method according to any preceding claim, wherein the organism is a plant.

10. The method according to any preceding claim, wherein the identified genes encode plant disease resistance (R) proteins.

11 . The method according to any preceding claim, wherein the sequence information is obtained using high-throughput sequencing or Linked-Reads sequencing.

12. The method according to any preceding claim, wherein the sequence information is whole genome shotgun (WGS) sequence data.

13. The method according to any preceding claim, wherein the ordering in step (c) comprises identifying the longest sequence alignment for each contig to a corresponding sequence in the reference genome assembly.

14. The method according to any preceding claim, wherein the genetic samples comprise multiple nucleic acids associated with similar phenotypes.

15. The method according to claim 14, wherein the multiple nucleic acids within each genetic sample are associated with the same phenotype.

16. The method according to any preceding claim, wherein the database comprises genetic samples from a crop plant, or a non-domesticated plant in the same species, genus or family as at least one crop plant.

17. The method according to claim 16, wherein the database comprises genetic samples from a plantation crop plant.

18. The method according to any preceding claim, wherein the database comprises or consists of genetic samples from wheat (for example Triticum urartu, Triticum aestivum, Aegilops sharonensis or Ae. speltoides), rye (for example Secale cereale), millet, onion, palm, tomato, pepper (for example Capsicum annuum), tobacco, canola, cotton (for example Gossypium ssp.), peanut, alfalfa, sunflower, safflower, Hordeum vulgare, oat (for example Avena sativa ), Oryza spp., maize (for example Zea mays), sorghum (for example Sorghum ίooΐoή, sugarcane (for example Saccharum spp.), banana (for example Musa spp.), potato (for example Solarium spp.), Ipomoea batatas, soya bean (for example Glycine max or G. soja), Cicer arietinum, Pisum sativum, cassava (for example Manihot escuienta), Dioscorea spp., sugar beet (for example Beta vulgaris), legumes (for example Favaceae spp.), brassicas (for example Brassica napus), Cannabis sativa, Arabidopsis thaliana or Theobroma cacao.