Title: Methods and Compositions for Cannabis Characterization
Field
[0001] The present disclosure provides methods, compositions and kits for characterizing cannabis samples. The present disclosure also provides methods, compositions and kits for distinguishing Cannabis sativa from Cannabis indica, and marijuana from hemp, as well as for measuring the contribution of Cannabis sativa and Cannabis indica in marijuana.
Background
[0002] Cannabis is one of humanity's oldest crops, with records of use dating to 6000 years before present. It is used as a source of high-quality bast fibre, nutritious and oil-rich seeds and for the production of cannabinoid compounds including delta-9 tetrahydrocannabinol (THC) and cannabidiol (CBD). The evolutionary history and taxonomy of Cannabis remain poorly understood. Hillig (2005) proposed that the genus Cannabis consists of three species (C. sativa, C. indica, and C. ruderalis) [1], whereas an alternative viewpoint is that Cannabis is monotypic and that observable subpopulations represent subspecies of C. sativa: C. sativa subspecies sativa, C. sativa subspecies indica and C. sativa subspecies ruderalis [2]. The putative ruderalis type may represent feral populations of the other types or those adapted to northern regions. The classification of Cannabis populations is confounded by many cultural factors, and tracing the history of a plant that has seen wide geographic dispersal and artificial selection by humans over thousands of years has proven difficult. Many hemp types have varietal names while marijuana types lack an organized horticultural registration system and are referred to as strains. The draft genome and transcriptome of C. sativa were published in 2011 [3]. As both public opinion and legislation in many countries shift towards recognizing Cannabis as a plant of medical and agricultural value [4], the genetic characterization of marijuana and hemp becomes increasingly important for both clinical research and crop improvement efforts.
[0003] Differences between Cannabis sativa and Cannabis indica have been reported.
[0004] Although the taxonomy of the genus Cannabis remains unclear, many breeders, growers and users (patients) consuming cannabis for its psychoactive and/or medicinal properties differentiate Sativa-type from Indica-type plants.
[0005] Hillig & Mahlberg (2004) [20] have reported that mean THC levels and the frequency of the THCA synthase gene (BT allele) were significantly higher in C. indica than C. sativa. Plants with relatively high levels of tetrahydrocannabivarin (THCV) and/or cannabidivarin (CBDV) were common only in C. indica.
[0006] Hazekamp & Fischedick (2011) [10] summarized differences between typical Sativa and Indica effects upon smoking. As a result of limited understanding and support from the medical community, they indicate that medicinal users of cannabis generally adopt the terminology derived from recreational users to describe the therapeutic effects they experience.
[0007] They report that the psychoactive effects (the "high") from Sativa-type plants are often characterized as uplifting and energetic. The effects are mostly cerebral (head-high), and are also described as spacey or hallucinogenic. Sativa is considered as providing pain relief for certain symptoms. In contrast, the high from Indica-type plants is most often described as a pleasant body buzz (body-high or body stone). Indicas are primarily enjoyed for relaxation, stress relief, and for an overall sense of calm and serenity and are supposedly effective for overall body pain relief and in the treatment of insomnia.
[0008] They reported that the most common way currently used to classify cannabis cultivars is through plant morphology (phenotype), with Indica-type plants being shorter with broader leaves and Sativa-type plants being taller with long, narrow leaves. Indica-type plants typically mature faster than Sativa-type plants under similar conditions, and the types tend to have a different smell, perhaps reflecting a different profile of terpenoids.
[0009] There remains a need for more accurate classification of cannabis for medicinal and other commercial purposes.
Summary
Using 14,031 single-nucleotide polymorphisms (SNPs) genotyped in 81 marijuana and 43 hemp samples, marijuana and hemp are found to be significantly differentiated at a genome-wide level, demonstrating that the distinction between these populations is not limited to genes underlying THC production.
[00013] In addition, using additional SNPs including a second set of 9123 SNPs genotyped in 37 reported Cannabis indica and 63 reported Cannabis sativa samples, ancestry determinations could be made which can be used for example for selecting breeding partners.
[00013] Other features and advantages of the present disclosure will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples while indicating preferred embodiments of the disclosure are given by way of illustration only, since various changes and modifications within the spirit and scope of the disclosure will become apparent to those skilled in the art from this detailed description.
Brief description of the drawings
[00013] An embodiment of the present disclosure will now be described in relation to the drawings in which:
[0001] Figure 1. Genetic structure of marijuana and hemp. (a) Principal Components Analysis (PCA) plot of 42 hemp and 80 marijuana samples using 14,031 SNPs. Hemp samples are closed circles and marijuana samples are open circles. The proportion of the variance explained by each Principal Component (PC) is shown in parentheses along each axis. The two samples labeled with their IDs are discussed in the text. (b) Boxplots showing significantly lower heterozygosity in marijuana than in hemp. (c) Population structure of hemp and marijuana estimated using the fastSTRUCTURE admixture model at K = 2. Each sample is represented by a thin vertical line, which is partitioned into two colored segments that represent the sample's estimated membership in each of the two inferred clusters. Hemp and marijuana samples are labeled below the plot.
[0002] Figure 2. Genetic structure of marijuana. (a) PCA plot of 81 marijuana samples using 9,776 SNPs. Samples are shaded according to their reported C. sativa ancestry. The proportion of the variance explained by each PC is shown in parentheses along each axis. (b) Population structure of marijuana calculated using the fastSTRUCTURE admixture model at K = 2. Each sample is represented by a horizontal bar, which is partitioned into two segments that represent the sample's estimated membership in each of the two inferred clusters. Adjacent to each bar is the sample's name and reported % C. sativa ancestry. (c) The correlation between the principal axis of genetic structure (PC1) in marijuana and reported C. sativa ancestry.
[0003] Figure 3. Distribution of FST between marijuana and hemp samples across 14,031 SNPs. (a) FST distribution for all SNPs genotyped. (b) Distribution of SNPs with FST greater than 0.5. Average FST is weighted by allele frequency and was calculated according to equation 10 in Weir and Cockerham (1984) [19].
[0004] Figure 4. Mean pairwise Identity by State (IBS) between each marijuana sample and all hemp samples versus reported C. sativa ancestry.
[0005] Figure 5. Example PCA of 81 marijuana strains using 9776 SNPs.
[0006] Figure 6. Example distribution of per-SNP FST values between 9 presumed C. indica and 9 presumed C. sativa strains.
[0007] Figure 7. Example evaluation of panels of ancestry informative markers (AIMs). Accuracy is defined here as the correlation between the positions of non-ancestral samples along PC1 calculated using 9766 SNPs and the positions calculated using a given subset of AIMs.
[0008] Figure 8. Example PCA of 100 marijuana strains using 9123 SNPs.
Detailed description of the Disclosure
[0009] The term "cannabis reference" as used herein means a cannabis strain, species (e.g. sativa or indica) (also referred to as subspecies (e.g. sativa or indica)) or type (marijuana or hemp) with at least some known genotype profile information which is used as a reference comparison to a test sample, optionally wherein the genotype and/or allele frequency of at least 10 SNPs in Table 4, 5 and/or 8 are known, optionally all of the SNPs in any one of Tables 4, 5 and/or 8. The cannabis
reference can be a Cannabis sativa reference, Cannabis indica reference, marijuana reference or hemp reference or a reference profile of any of the foregoing.
[0010] The terms "Cannabis sativa reference" and "Cannabis indica reference" as used herein mean respectively, a selected Cannabis sativa strain or Cannabis indica strain which is used as a reference for comparison and/or genotype information of such a strain or genotype information associated with the particular Cannabis sativa or Cannabis indica reference strain e.g. a reference profile for a particular strain or a reference profile associated with the species. The reference profile comprises at least 10 known SNPs (e.g. genotype and optionally frequency) in Table 4 and/or Table 8, optionally all of the SNPs in Table 4 and/or 8 found in the particular Cannabis sativa or Cannabis indica strain respectively or a composite of strains of the particular species. The Cannabis sativa reference or Cannabis indica reference can include, in addition to the predominant allele in the species or a particular strain of the species, the frequency of the SNP allele in the population.
[0011] The term "cannabis reference profile" as used herein means genotype information of one (e.g. a particular strain) or a plurality of cannabis strains and/or species, including Cannabis sativa and/or Cannabis indica strains or marijuana and/or hemp strains, and includes the genotype of at least 10 SNPs in Table 4, 5 and/or 8, optionally all of the SNPs in Table 4, 5 and/or 8. A Cannabis sativa reference profile as used herein means genotype information of a plurality of cannabis strains and includes genotype sequence (and optionally including frequency information) associated with Cannabis sativa strains and a Cannabis indica reference profile as used herein means genotype information of a plurality of cannabis strains and includes genotype sequence (and optionally including frequency information) associated with Cannabis indica strains.
[0012] The term "marijuana" as used herein denotes cannabis plants and plant parts that are cultivated and consumed as a drug or medicine. Marijuana often contains high amounts of psychoactive cannabinoids such as tetrahydrocannabinolic acid (THCA) and delta-9 tetrahydrocannabinol (THC) but it may also contain cannabidiolic acid (CBDA) and cannabidiol (CBD). For example, marijuana can be defined as cannabis plants and plant parts wherein the leaves and flowering heads contain more than 0.3% w/w, 0.4% w/w or 0.5% w/w of delta-9-tetrahydrocannabinol (THC) (dry weight). The term "hemp" as used herein denotes cannabis plants that are cultivated and used for the production of fibre or seeds rather than as a drug or medicine. Hemp plants often contain high amounts of CBDA and CBD, and low amounts of THCA and THC. For example, hemp can be defined as cannabis plants and plant parts wherein the leaves and flowering heads do not contain more than 0.3% w/w, 0.4% w/w or 0.5% w/w of delta-9-tetrahydrocannabinol (THC) (dry weight).
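By way of non-limiting illustration only, the THC-based definitions above can be sketched as a simple threshold test. The function name `classify_by_thc` and the choice of 0.3% w/w as the default are assumptions made for this example; 0.4% w/w or 0.5% w/w could equally be passed as the threshold.

```python
def classify_by_thc(thc_percent_dry_weight: float, threshold: float = 0.3) -> str:
    """Label a sample from the THC content (% w/w, dry weight) of its
    leaves and flowering heads. Hypothetical helper for illustration."""
    if thc_percent_dry_weight < 0:
        raise ValueError("THC content cannot be negative")
    # Marijuana: more than the threshold; hemp: not more than the threshold.
    return "marijuana" if thc_percent_dry_weight > threshold else "hemp"
```

A sample measured at exactly the threshold is labeled hemp, matching the "do not contain more than" wording of the hemp definition.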
[0013] The term "polynucleotide", "nucleic acid", "nucleic acid molecule" and/or "oligonucleotide" as used herein refers to a sequence of nucleotide or nucleoside monomers consisting of naturally occurring and/or modified bases, sugars, and intersugar (backbone) linkages, and is intended to include DNA and RNA which can be either double stranded or single stranded, representing the sense or antisense strand.
[0014] As used herein, the term "isolated nucleic acid molecule" refers to a nucleic acid substantially free of cellular material or culture medium when produced by recombinant DNA techniques, or chemical precursors, or other chemicals when chemically synthesized. The term "nucleic acid" is intended to include DNA and RNA and can be either double stranded or single stranded.
[0015] The term "primer" as used herein refers to a nucleic acid molecule, whether occurring naturally as in a purified restriction digest or produced synthetically, which is capable of acting as a point of synthesis when placed under conditions in which synthesis of a primer extension product, which is complementary to a nucleic acid strand, is induced (e.g. in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer must be sufficiently long to prime the synthesis of the desired extension product in the presence of the inducing agent. The exact length of the primer will depend upon factors including temperature, sequences of the primer and the methods used. A primer typically contains 15-25 or more nucleotides, although it can contain fewer, for example 10 nucleotides. The factors involved in determining the appropriate length of primer are readily known to one of ordinary skill in the art.
[0016] The term "upstream primer" as used herein refers to a primer that can hybridize to a DNA sequence and act as a point of synthesis upstream, or at a 5' end, of a target polynucleotide sequence e.g. SNP, to produce a polynucleotide complementary to the target polynucleotide anti-sense strand. The term "downstream primer" as used herein refers to a primer that can hybridize to a polynucleotide sequence and act as a point of synthesis downstream, or at a 3' end, of a target polynucleotide sequence, to produce a polynucleotide complementary to the target polynucleotide sense strand.
[0017] The term "probe" as used herein refers to a polynucleotide (interchangeably used with nucleic acid) that comprises a sequence of nucleotides that will hybridize specifically to a target nucleic acid sequence. For example the probe comprises at least 18 or more bases or nucleotides that are complementary and hybridize to contiguous bases and/or nucleotides in the target nucleic acid sequence. The length of probe depends on the hybridization conditions and the sequences of the probe and nucleic acid target sequence and can for example be 10-20, 21-70, 71-100 or more bases or nucleotides in length. The probes can optionally be fixed to a solid support such as an array chip or a microarray chip. For example, the PCR product produced with the primers could be used as a probe. The PCR product can for example be subcloned into a vector and optionally digested and used as a probe.
[0051] The term "reverse complement" or "reverse complementary", when referring to a polynucleotide, as used herein refers to a polynucleotide comprising a sequence that is complementary to a DNA in terms of base-pairing and which is reversed so as to be oriented in the 5' to 3' direction.
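As a minimal, non-limiting sketch of this definition for DNA, each base can be swapped for its pairing partner (A with T, G with C) and the result reversed so that it again reads 5' to 3'. The function name `reverse_complement` is an assumption for this example; characters other than A, C, G and T (for example N) are passed through unchanged.

```python
# Translation table for the DNA base-pairing rules (A<->T, G<->C).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]
```

For instance, `reverse_complement("ATGC")` evaluates to `"GCAT"`.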
[0018] As used herein, the term "kit" refers to a collection of products that are used to perform a reaction, procedure, or synthesis, such as, for example, a genotyping assay etc., which are typically shipped together, usually within a common packaging, to an end user.
[0019] The term "target allele" as used herein means an allele for a SNP listed in Table 4, 5 or 8.
[0020] The term "major allele" as used herein is the allele most commonly present in a population. The major allele listed in Tables 4, 5 and 8 is the allele most commonly present in Cannabis sativa and Cannabis indica strains (Tables 4 and 8) and marijuana and hemp strains (Table 5) respectively.
[0021] The term "minor allele" as used herein is the allele least commonly present in a population (e.g. C. sativa and C. indica or marijuana and hemp). The minor allele listed in Tables 4, 5 and 8 is present in the frequency indicated therein.
[0022] A single-stranded nucleic acid molecule is "complementary" to another single-stranded nucleic acid molecule when it can base-pair (hybridize) with all or a portion of the other nucleic acid molecule to form a double helix (double-stranded nucleic acid molecule), based on the ability of guanine (G) to base pair with cytosine (C) and adenine (A) to base pair with thymine (T) or uridine (U).
[0023] The term "hybridize" as used herein refers to the sequence specific non-covalent binding interaction with a complementary nucleic acid.
[0024] The term "selectively hybridize" as used herein refers to hybridization under moderately stringent or highly stringent physiological conditions, which can distinguish related nucleotide sequences from unrelated nucleotide sequences. In nucleic acid hybridization reactions, the conditions used to achieve a particular level of stringency are known to vary, depending on the nature of the nucleic acids being hybridized, including, for example, the length, degree of complementarity, nucleotide sequence composition (e.g., relative GC:AT content), and nucleic acid type, i.e., whether the oligonucleotide or the target nucleic acid sequence is DNA or RNA. An additional consideration is whether one of the nucleic acids is immobilized, for example, on a filter, bead, chip, or other solid matrix. Appropriate stringency conditions which promote hybridization are known to those skilled in the art, or can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6 and/or Current Protocols in Nucleic Acid Chemistry available at http://onlinelibrary.wiley.com/browse/publications?type=labprotocols.
[0025] As used in this application, the words "comprising" (and any form of comprising, such as "comprise" and "comprises"), "having" (and any form of having, such as "have" and "has"), "including" (and any form of including, such as "include" and "includes") or "containing" (and any form of containing, such as "contain" and
"contains"), are inclusive or open-ended and do not exclude additional, unrecited elements or process steps.
[0026] As used in this application and claim(s), the word "consisting" and its derivatives, are intended to be close ended terms that specify the presence of stated features, elements, components, groups, integers, and/or steps, and also exclude the presence of other unstated features, elements, components, groups, integers and/or steps.
[0027] The term "consisting essentially of", as used herein, is intended to specify the presence of the stated features, elements, components, groups, integers, and/or steps as well as those that do not materially affect the basic and novel characteristic(s) of these features, elements, components, groups, integers, and/or steps.
[0028] The terms "about", "substantially" and "approximately" as used herein mean a reasonable amount of deviation of the modified term such that the end result is not significantly changed. These terms of degree should be construed as including a deviation of at least ±5% of the modified term if this deviation would not negate the meaning of the word it modifies.
[0029] As used in this application, the singular forms "a", "an" and "the" include plural references unless the content clearly dictates otherwise. For example, an embodiment including "a compound" should be understood to present certain aspects with one compound or two or more additional compounds.
[0030] Further, the definitions and embodiments described in particular sections are intended to be applicable to other embodiments herein described for which they are suitable as would be understood by a person skilled in the art. For example, in the following passages, different aspects of the invention are defined in more detail. Each aspect so defined may be combined with any other aspect or aspects unless clearly indicated to the contrary. In particular, any feature indicated as being preferred or advantageous may be combined with any other feature or features indicated as being preferred or advantageous.
III. Methods and Products
[0031] The present disclosure identifies for example a plurality of single nucleotide polymorphism (SNP) ancestry informative markers (AIMs) that can be used to characterize cannabis samples. Cannabis samples can be characterized for example according to their ancestral relatedness and/or whether the sample is likely marijuana or hemp. Accordingly, the present disclosure provides methods, nucleic acids, primers and kits useful for detecting whether a sample is Cannabis sativa dominant or Cannabis indica dominant, for assessing the relatedness of a test sample to Cannabis sativa and/or Cannabis indica reference samples as well as methods, nucleic acids, primers and kits for distinguishing marijuana from hemp. Also provided are a computer implemented method, a computer program embodied on a computer readable medium, a system, apparatus and/or processor for carrying out a method or part thereof described herein.
[0032] Embodiments of the methods and systems described herein may be implemented in hardware or software, or a combination of both. These embodiments may be implemented in computer programs executing on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface. For example, and without limitation, the various programmable computers may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, mobile telephone, smartphone or any other computing device capable of being configured to carry out the methods described herein.
[0033] The data storage system may comprise a database, such as on a data storage element, in order to provide a database of Cannabis reference strains, and/or reference profiles. Furthermore, computer instructions may be stored for configuring the processor to execute any of the steps and algorithms described herein as a computer program.
[0034] Each program may be implemented in a high level procedural or object oriented programming or scripting language, or both, to communicate with a computer system. However, alternatively the programs may be implemented in assembly or machine language, if desired. The language may be a compiled or interpreted language. Each such computer program may be stored on a non-transitory computer readable storage medium (e.g. read-only memory, magnetic disk, optical disc). The storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
[0035] An aspect includes a method for detecting the presence or absence of each of a set of target alleles in a cannabis sample, the method comprising:
I) obtaining a test sample comprising genomic DNA, and
II) either
i) genotyping the test sample for a set of single nucleotide polymorphisms (SNPs), the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96, 100 or any number between and including 10-200 of the SNPs in Table 4 and/or 8, wherein each SNP comprises a major allele and a minor allele as provided in Table 4 and 8; and
ii) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample;
or
a) genotyping the test sample for a set of SNPs, the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96 or 100 of the SNPs in Table 5, wherein each SNP comprises a major allele and a minor allele as provided in Table 5; and
b) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample.
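By way of non-limiting illustration, the detecting step can be sketched as a lookup of each genotype call against the major and minor alleles of a SNP panel. The data layout is an assumption made for this example: `panel` maps a hypothetical SNP name to a (major, minor) pair as would be read from Table 4, 5 or 8, and `genotypes` maps each SNP name to a two-letter call such as "AG", or to None where the reaction failed.

```python
from typing import Dict, Optional, Tuple

def detect_alleles(
    panel: Dict[str, Tuple[str, str]],
    genotypes: Dict[str, Optional[str]],
) -> Dict[str, Dict[str, bool]]:
    """Report, per SNP, whether the major and/or minor allele is present."""
    results: Dict[str, Dict[str, bool]] = {}
    for snp, (major, minor) in panel.items():
        call = genotypes.get(snp)
        if not call:
            # Missing or failed genotype: no presence/absence is reported.
            continue
        observed = set(call.upper())
        results[snp] = {"major": major in observed, "minor": minor in observed}
    return results
```

A heterozygous call reports both alleles as present, and a homozygous call reports only one. A practical implementation would also check that the panel holds at least the minimum number of SNPs recited above.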
[0036] In an embodiment, the SNPs in Table 4 and/or 8 can be used to determine the ancestral contribution of Cannabis sativa and/or Cannabis indica in a marijuana strain.
[0037] The step of obtaining a test sample comprising genomic DNA can be accomplished, for example by taking the cannabis sample or an aliquot thereof for example if the cannabis sample is isolated genomic DNA, or can comprise preparing an isolated genomic DNA from the cannabis sample or a portion thereof.
[0038] The cannabis sample or the test sample (e.g. comprising at least a portion of the cannabis sample) is any cannabis sample comprising genomic DNA. The sample can be isolated genomic DNA or a portion of a plant and/or seed comprising genomic DNA and optionally from which genomic DNA can be isolated. For example, the test sample can be a plant sample, a seed sample, a leaf sample, a flower sample, a trichome sample, a pollen sample, a sample of dried plant material including leaf, flower, pollen and/or trichomes, or a sample produced through in vitro tissue or cell culture. Genomic DNA can be isolated using a number of techniques such as NaOH extraction, phenol/chloroform extraction, or DNA extraction systems such as Qiagen Direct PCR DNA Extraction System (Cedarlane, Burlington ON). In some embodiments, genomic DNA is not purified prior to genotyping. For example, with the Phire Plant Direct PCR Kit the DNA target can be used to detect SNP alleles without prior DNA extraction (Life Technologies, Burlington ON).
[0039] In an embodiment, the set of target alleles which are detected are a plurality of SNPs in Tables 4, 5 and/or 8. Tables 4 and 8 each list 100 SNPs, including a major allele and a minor allele and the minor allele frequency in Cannabis sativa strains and Cannabis indica strains. Table 5 lists 100 SNPs including a major allele and a minor allele and the minor allele frequency in marijuana and hemp. Also described in these Tables is the SNP position in the canSat3 C. sativa reference genome assembly which is described in van Bakel et al. [3], identified as the SNP name. The genome build assembly is identified by the number 3 for SNPs defined by SEQ ID NOs: 1-400 (CanSat3) and the number 5 for SNPs defined by SEQ ID NOs: 401-600 (CanSat5). Tables 6, 7 and 9 also identify the upstream+SNP and downstream sequences associated with each SNP. A person skilled in the art would understand that genomic DNA is double stranded and that the complementary nucleotide on the reverse strand can also be detected based on the complementary base pairing rules.
[0040] Genotyping the cannabis sample at the loci listed for example in Tables 4, 5 and 8 can be accomplished by various methods and platforms.
[0041] In an embodiment, the step of genotyping comprises sequencing genomic DNA for example using a genotyping by sequencing (GBS) method. GBS is typically a multiplexed approach involving tagging randomly sheared DNA from different samples with DNA barcodes and pooling the samples in a sequencing reaction. Target enrichment and/or reduction of genome complexity can be performed, for example using restriction enzymes.
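Because barcoded samples are pooled in one sequencing reaction, reads are typically assigned back to their samples before genotyping. The following is a minimal sketch of that assignment, offered as an illustration only: the function name `demultiplex` is hypothetical, barcodes are assumed to sit at the start of each read, to match exactly, and to be chosen so that no barcode is a prefix of another.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def demultiplex(reads: Iterable[str], barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Group reads by sample using an exact-match leading barcode.

    `barcodes` maps a barcode sequence to a sample name. Reads whose
    prefix matches no barcode are discarded.
    """
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode and keep the remaining insert sequence.
                by_sample[sample].append(read[len(barcode):])
                break
    return dict(by_sample)
```

Real pipelines usually tolerate sequencing errors in the barcode; exact matching is used here only to keep the sketch short.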
[0042] In another embodiment, the step of genotyping comprises sequencing pooled amplicons, including captured amplicons. In an embodiment, the amplicons are produced using primers flanking the SNPs, for example within 100 nucleotides upstream and/or within 100 nucleotides downstream of the SNP location, and amplifying the targeted region. The resulting amplification products are then sequenced. Forward primers and reverse primers that amplify for example 25 or more nucleotides surrounding and including the SNP can be used in such genotyping methods.
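As a non-limiting illustration of the flanking region referred to above, the sketch below extracts the window of sequence within a given number of nucleotides on either side of a SNP, which is the region in which flanking primers would be placed. The function name `flanking_window`, the 0-based `snp_index` and the returned offset are assumptions made for this example; no primer design rules (melting temperature, GC content and the like) are applied.

```python
from typing import Tuple

def flanking_window(sequence: str, snp_index: int, flank: int = 100) -> Tuple[str, int]:
    """Return (window, offset): the sequence within `flank` nucleotides
    upstream and downstream of the SNP, and the SNP's 0-based position
    inside that window. The window is clipped at the sequence ends."""
    if not 0 <= snp_index < len(sequence):
        raise IndexError("snp_index lies outside the sequence")
    start = max(0, snp_index - flank)
    end = min(len(sequence), snp_index + flank + 1)
    return sequence[start:end], snp_index - start
```

With the default `flank` of 100 the window is at most 201 nucleotides long, that is, 100 upstream, the SNP itself and 100 downstream.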
[0043] A variety of sequencing methods can be employed including electrophoresis-based sequencing technology (e.g. chain termination methods, dye-terminator sequencing), sequencing by hybridization, mass spectrometry based sequencing, sequence-specific detection of single-stranded DNA using engineered nanopores and sequencing by ligation. For example, amplified fragments can be purified and sequenced directly or after gel electrophoresis and extraction from the gel.
[0044] Other PCR based genotyping methods can also be used optionally comprising DNA amplification using forward and reverse primers and/or primer extension.
[0045] For example the iPLEX Gold Assay by Sequenom® provides a SNP genotyping assay where PCR primers are designed in a region of approximately 100 base pairs around the SNP of interest and an extension primer is designed adjacent to the SNP. The method involves PCR amplification followed by the addition of shrimp alkaline phosphatase (SAP) to inactivate remaining nucleotides in the reaction. The primer extension mixture is then added and the mixture is deposited on a chip for data analysis by a MALDI-TOF mass spectrometer (Protocol Guide 2008).
[0046] In another embodiment, the genotyping method comprises using an allele specific primer. An example is the KASP™ genotyping system, a fluorescent genotyping technology which uses two different allele specific competing forward primers with unique tail sequences and one reverse primer. Each unique tail binds a unique fluorescent labelled oligo generating a signal upon PCR amplification of the unique tail.
[0047] In an embodiment, allele specific probes are utilized. For example, an allele specific probe includes the complementary residue for the target allele of interest and under specified conditions preferentially binds the target allele. The probe can comprise a DNA or RNA polynucleotide and the genotyping step can comprise contacting the test sample with a plurality of probes, each of the probes being specific for a SNP allele of the set of SNPs, under conditions suitable for detecting for example the minor SNP alleles.
[0048] In an embodiment, the genotyping method comprises using an array. The array can be a fixed or flexible array comprising for example allele specific probes. The array can be a bead array, for example the Infinium HD Assay by Illumina. In an embodiment, the array comprises primers and/or probes using sequences or parts thereof described in SEQ ID NOs: 1-600. The array format can comprise primers or probes for genotyping for example at least 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96 or 100 or more SNPs, for example any number between and including 1 and 300, optionally 10 and 300 or 10 and 200 or 10 and 100. In an embodiment, the array format comprises one or more primers or probes for each SNP. In an embodiment, the array comprises 96 reactions.
[0049] Upstream sequence, the SNP as well as downstream sequence for the SNPs in Tables 4, 5 and 8 are provided in Tables 6, 7 and 9.
[0050] As demonstrated in Figure 7, a level of accuracy can be achieved using the 10 SNPs with the highest Fst values. Accordingly in one embodiment, the set of SNPs comprises the first listed 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96 or 100 SNPs or any number or combination of SNPs between and including 10 and 300, optionally 10 and 100 in Table 4, 5 or 8, optionally any combination of SNPs in Tables 4 and/or 8. In an embodiment, the set of SNPs comprises a plurality or all of the SNPs in Table 4 and/or 8 with a Fst of greater than 0.712 or 0.6277. In another embodiment, the set of SNPs comprises a plurality or all of the SNPs in Table 5 with a Fst of greater than 0.679. In an embodiment, the set of SNPs includes at least 2 wherein the allele frequency is 0.
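The selection of a panel by Fst described above can be sketched, by way of illustration only, as ranking SNPs by their Fst value and keeping the highest-ranked ones, optionally subject to a minimum Fst. The function name `select_top_snps` and the dictionary input are assumptions for this example; the Fst values themselves would come from the Tables and are not computed here.

```python
from typing import Dict, List, Optional

def select_top_snps(
    fst_by_snp: Dict[str, float],
    n: int = 10,
    min_fst: Optional[float] = None,
) -> List[str]:
    """Return up to `n` SNP names ordered from highest to lowest Fst.

    If `min_fst` is given, only SNPs with Fst strictly greater than
    that value are eligible.
    """
    ranked = sorted(fst_by_snp.items(), key=lambda item: item[1], reverse=True)
    if min_fst is not None:
        ranked = [(snp, fst) for snp, fst in ranked if fst > min_fst]
    return [snp for snp, _ in ranked[:n]]
```

For example, passing `min_fst=0.679` keeps only SNPs above the Table 5 cutoff named in this paragraph.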
[0051] In an embodiment, any number of SNPs listed in Tables 4 and/or 8, or Table 5 is genotyped.
[0052] In an embodiment, a plurality of SNPs listed in Tables 4 and/or 8 and 5 are detected. In such methods, both ancestry contribution and marijuana versus hemp assessments can be conducted in one assay.
[0053] In an embodiment, the step of detecting the SNP comprises receiving, reviewing and/or extracting from a file, document, reaction, array or database, the genotype for each of the SNPs of the set.
[0054] In certain embodiments, the method further comprises displaying and/or providing a document displaying one or more features of the major and/or minor alleles. For example, the one or more features can comprise the position of the SNP, the nucleotide identity of the SNP or the nucleotide identity if a minor allele is detected, the number of reads or reactions, the number of minor alleles, confidence intervals etc. The document can be an electronic document that is provided to a third party. In an embodiment, the one or more features displayed is selected from the allele nucleotide identity and the number of minor alleles in common with Cannabis sativa, Cannabis indica, marijuana or hemp.
[0055] As demonstrated herein, the SNP allele information can be used to characterize the cannabis sample. Accordingly, in an embodiment, the method further comprises determining ancestry contribution of the test sample. [0056] The ancestry contribution is optionally an ancestry contribution estimate or identification of ancestry dominance. For example, the ancestry dominance of the test sample can be Cannabis sativa dominant or Cannabis indica dominant according to the set of target alleles detected in step II) ii). If the target alleles in combination when compared to a database of cannabis reference strains and/or the reference profiles provided in Table 4 and 8 are most similar to alleles more commonly found in Cannabis sativa, for example if greater than 50% of the cannabis sample's SNPs are alleles more commonly present in Cannabis sativa, the cannabis sample is identified as Cannabis sativa dominant. Conversely, if the target alleles in combination are most similar to alleles more commonly found in Cannabis indica, for example if greater than 50% of the cannabis sample's SNPs are alleles more
commonly present in Cannabis indica, the cannabis sample is identified as Cannabis indica dominant.
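By way of non-limiting illustration, the greater-than-50% dominance rule described above can be sketched in Python as follows. The function name, the dictionary layouts and the returned labels are hypothetical and are not part of the disclosure; the sketch assumes each genotype is given as a string of detected alleles (e.g. "AG") and simply ignores failed reactions and nucleotides not listed in the reference profile.

```python
def ancestry_dominance(sample_genotypes, reference):
    """Assign dominance by counting allele copies typical of each gene pool.

    sample_genotypes: dict of SNP id -> string of detected alleles ("AA", "AG"),
                      or None when the reaction failed (assumed layout).
    reference: dict of SNP id -> {"sativa": allele more common in C. sativa,
                                  "indica": allele more common in C. indica}
               (assumed layout, e.g. derived from a reference profile).
    """
    counts = {"sativa": 0, "indica": 0}
    for snp, genotype in sample_genotypes.items():
        ref = reference.get(snp)
        if ref is None or genotype is None:
            continue  # SNP not in the panel, or failed genotyping reaction
        for allele in genotype:
            if allele == ref["sativa"]:
                counts["sativa"] += 1
            elif allele == ref["indica"]:
                counts["indica"] += 1
            # any other nucleotide is not considered
    scored = counts["sativa"] + counts["indica"]
    if scored == 0:
        return "undetermined", counts
    if counts["sativa"] / scored > 0.5:
        return "Cannabis sativa dominant", counts
    if counts["indica"] / scored > 0.5:
        return "Cannabis indica dominant", counts
    return "undetermined", counts  # exact tie
```

A sample with an exact 50/50 split is reported as undetermined in this sketch; how ties are handled in practice is a design choice not specified here.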
[0057] An ancestry contribution estimate is calculated in one embodiment, according to a method described in the Examples. Other calculations for determining admixture can also be applied as further described herein. [0058] Other nucleotides may be detected at the SNP positions described or a particular reaction may fail. In an embodiment, if an allele other than an allele reported in Tables 4, 5 and 8 is detected or if the nucleotide at the position is unknown, the allele is not considered in the methods described.
[0059] An ancestry contribution estimate can identify a population structure that is associated or is most likely given the nucleotide occurrences of the SNPs in the cannabis sample.
[0060] In an embodiment, the method further comprises identifying the test sample as marijuana or hemp, according to the set of target alleles detected in step II) b). For example, if the target alleles in combination, when compared to a database of cannabis reference strains and/or the reference profiles provided in Table 5, are most similar to alleles more commonly found in hemp, the cannabis sample is identified as hemp. Conversely, if the target alleles in combination are most similar to alleles more commonly found in marijuana, the cannabis sample is identified as marijuana.
[0061] An aspect accordingly includes a method of determining ancestry contribution of a cannabis sample, optionally to determine if a sample comprises Cannabis sativa and/or Cannabis indica, the method comprising:
I) obtaining a test sample comprising genomic DNA,
II) i) genotyping the test sample for a set of single nucleotide polymorphisms (SNPs), the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80,
90, 96 or 100 or more of the SNPs in Table 4 and/or 8, wherein each SNP comprises a major allele and a minor allele as provided in Table 4 and 8; and ii) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample; and
III) determining ancestry contribution of the test sample according to the set of target alleles detected in step II) ii) and providing an estimate of the ancestry contribution or identifying the test sample as Cannabis sativa dominant or Cannabis indica dominant.
[0062] As mentioned above, dominance is assigned as Cannabis sativa dominant or Cannabis indica dominant according to the similarity of the detected alleles. If the set of detected alleles, when compared to a database of cannabis reference strains and/or the reference profiles provided in Table 4 and/or 8, is most similar to alleles more commonly found in Cannabis sativa as indicated in Table 4 and 8, the cannabis sample is assigned as Cannabis sativa dominant. Similarly, if the set of detected alleles is most similar to alleles more commonly found in Cannabis indica as indicated in Table 4 and 8, the cannabis sample is assigned as Cannabis indica dominant.
[0063] In an embodiment, the method further comprises selecting a breeding partner. [0064] The ancestry estimates can be used for example to identify Sativa- or Indica-type breeding individuals when classification is unknown or unsure. As an example, the SNPs described herein can be used to breed an offspring with a desired or defined contribution, for example about equal contribution, of Cannabis indica and Cannabis sativa genetic material. The SNPs in Table 4, 5 and 8 can be used to select for marijuana and hemp, or Indica- and Sativa-type strains with the desired ancestry contribution for use as parents.
[0065] For example, these markers can be used in marker-assisted selection (MAS) to breed cannabis plants that contain defined levels of Indica-type or Sativa-type ancestry.
[0066] As another example, SNPs as described herein can be used in ancestry selection breeding and used to speed the recovery of the cultivated genetic background (as described in [22]). For example in a cross between a cultivated line and a wild line, the F1 offspring generated from such a cross necessarily derive 50% of their ancestry from each parent. On backcrossing to the cultivated line, each offspring will differ in the proportion of its ancestry from the wild and cultivated sources. Genetic markers distributed across the genome can be used to provide an estimate of the ancestry proportions, and the breeder can then select the offspring with the highest proportion of cultivated ancestry. Such methods can for example be performed with marker assisted selection (which uses trait associated markers), to select a small number of offspring in each generation that carry both the desired trait from the wild and the most cultivated ancestry.
[0067] In an embodiment, the method is for assessing if the cannabis sample is marijuana. For example, the marijuana can be for medical use.
[0068] Also provided is a set of SNPs that can be used to determine if a sample comprises hemp or marijuana. Accordingly another aspect includes a method for determining if a sample likely comprises hemp and/or marijuana, the method comprising:
I) obtaining a test sample comprising genomic DNA,
II) a) genotyping the test sample for a set of single nucleotide polymorphisms (SNPs), the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96 or 100 of the SNPs in Table 5, wherein each SNP comprises a major allele and a minor allele as provided in Table 5; and b) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample; and
III) identifying whether the sample likely comprises hemp or marijuana according to the set of target alleles detected in step II) b).
[0069] In an embodiment, the method is for differentiating medicinal/drug/pharmaceutical and non-medicinal/non-drug/non-pharmaceutical cannabis.
[0070] The identifying step comprises for example comparing to a database of reference alleles and/or comparing to the reference profiles in Table 5. The comparing step is further described below.
[0071] A further aspect includes a method for measuring genetic relatedness of a cannabis sample to a Cannabis sativa reference and/or a Cannabis indica reference, the method comprising:
I) obtaining a test sample comprising genomic DNA,
II) i) genotyping the test sample for a set of single nucleotide polymorphisms (SNPs), the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80, 90, 96, 100 or any number between 10 and 200 of the SNPs in Table 4 and/or 8, wherein each SNP comprises a major allele and a minor allele as provided in Table 4 and 8; and ii) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample;
III) comparing the test sample SNP to the Cannabis sativa reference and/or Cannabis indica reference according to the set of target alleles detected in step II) ii); and
IV) displaying and/or providing a document displaying the calculated genetic relatedness of the test sample.
[0072] In an embodiment, the detecting, identifying and/or comparing step comprises calculating the genetic relatedness of the test sample to the cannabis reference, optionally a Cannabis sativa reference and/or Cannabis indica reference according to the set of target alleles detected in step II). The comparing step in an embodiment is carried out using a computer, for example a computer comprising a database for storing reference profiles for one or more strains or for the particular Cannabis sativa reference and/or Cannabis indica reference. [0073] In an embodiment, the Cannabis sativa reference and/or the Cannabis indica reference is a reference profile or plurality of reference profiles stored in a database. The reference profile can for example include the SNP allele identities (e.g. minor allele) in Table 4 and/or 8 and its frequency for the species (e.g. a master reference profile) or the SNP allele identities of a particular strain. [0074] In some embodiments, the reference is a reference sample and the method can comprise genotyping one or more reference samples and the test sample and comparing the detected alleles to identify the number of matches.
[0075] A further aspect includes a method for measuring a genetic relatedness of a Cannabis sativa sample to a reference marijuana or reference hemp sample, the method comprising:
I) obtaining a test sample comprising genomic DNA,
II) a) genotyping the test sample for a set of single nucleotide polymorphisms (SNPs), the set comprising at least 10, 20, 30, 40, 48, 50, 60, 70, 80,
90, 96 or 100 of the SNPs in Table 5, wherein each SNP comprises a major allele and a minor allele as provided in Table 5; and b) detecting for each SNP of the set the presence or absence of the major allele and/or the minor allele in the test sample;
III) calculating the genetic relatedness of the test sample to the marijuana reference and/or the hemp reference according to the set of target alleles detected in step II) b); and
IV) displaying and/or providing a document displaying the calculated genetic relatedness of the test sample.
[0076] The method of determining ancestry contribution and/or the comparison for identifying the sample can involve use of a specifically programmed computer using, for example, an algorithm to 1) compare the identity of the allele, e.g. whether the major and/or minor allele is detected, for each of the set of SNPs genotyped in the test sample to one or more cannabis references, optionally compared to a database comprising a cannabis reference profile such as a master cannabis profile or a plurality of reference profiles, wherein each cannabis reference profile comprises genotype information for the set of SNPs detected; and 2) assign or calculate the ancestry contribution of the cannabis sample. Any algorithm for admixture analyses can be used. Computer implemented clustering and assignment protocols can also be used. The comparing step can also comprise comparing the relative frequency differences.
[0077] For example, as demonstrated herein, the algorithm can direct a principal components analysis or a fastSTRUCTURE analysis. For example, principal component axes can be established using a plurality of cannabis reference strains and/or reference profiles. A cannabis sample genotype can be projected onto the two PCs. The ancestry contribution of Cannabis sativa for example can then be calculated using the formula:
% Cannabis sativa = b/(a+b),
wherein a and b are the chord distances along the first principal component from the centroids of the Cannabis sativa strains and the Cannabis indica strains respectively.
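As a hedged illustration of the centroid-distance calculation, the following Python sketch computes the Cannabis sativa contribution from PC1 coordinates, where a is the distance of the sample from the Cannabis sativa centroid and b is its distance from the Cannabis indica centroid. The function and argument names are hypothetical, the result is expressed as a percentage (b/(a+b) multiplied by 100), and the PC1 coordinates are assumed to have been computed beforehand.

```python
import numpy as np

def percent_sativa(pc1_sample, pc1_sativa_refs, pc1_indica_refs):
    """Estimate % C. sativa from a sample's position along PC1.

    pc1_sample: PC1 coordinate of the projected test sample.
    pc1_sativa_refs / pc1_indica_refs: PC1 coordinates of the reference
    strains of each type (assumed inputs for this sketch).
    """
    sativa_centroid = float(np.mean(pc1_sativa_refs))
    indica_centroid = float(np.mean(pc1_indica_refs))
    a = abs(pc1_sample - sativa_centroid)  # distance from the sativa centroid
    b = abs(pc1_sample - indica_centroid)  # distance from the indica centroid
    if a + b == 0:
        raise ValueError("centroids coincide with the sample; PC1 is uninformative")
    return 100.0 * b / (a + b)
```

A sample lying exactly on the Cannabis sativa centroid scores 100% and one on the Cannabis indica centroid scores 0%. Samples falling outside the interval between the two centroids yield values short of these extremes under this formula; whether to clip such values is left open here.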
[0078] In an embodiment, the algorithm is an algorithm described in the Examples.
[0079] Both the major allele and the minor allele can be detected in a test sample, which can be used in determining the ancestry and/or assessing marijuana and/or hemp relatedness.
[0080] Also described herein are isolated nucleic acids, for example as primers or probes to detect the SNPs described herein. Accordingly another aspect includes an isolated nucleic acid comprising at least 9, 12, 15 or at least 18 contiguous nucleotides of any one of SEQ ID Nos 1-600 or the complement thereof.
[0081] In an embodiment, the isolated nucleic acid is a probe and comprises at least 12 or at least 18 nucleotides of contiguous sequence including the minor or major allele nucleotide; optionally including upstream sequence and/or downstream sequence contiguous with the minor or major allele.
[0082] In an embodiment, the nucleic acid is a primer comprising an isolated nucleic acid described herein.
[0083] In an embodiment, the primer is a forward PCR primer that hybridizes with a contiguous set of residues within 1-100 of any one of odd numbered SEQ ID Nos 1-600 or the complement or reverse complement of residues 1-100 of any one of odd numbered SEQ ID Nos 1-600. In another embodiment, the primer is a reverse PCR primer (downstream primer) that hybridizes with residues 1 to 100 of any one of even numbered SEQ ID Nos 1-600 or the complement or the reverse complement of residues 1 to 100 of any one of even numbered SEQ ID Nos 1-600.
[0084] In another embodiment, the primer is an allele specific primer for a major allele and/or a minor allele in Table 4, 5 or 8 and binds to residue 101 of any
one of odd numbered SEQ ID Nos 1-600. The odd numbered SEQ ID NOs comprise upstream sequence (for example 10 or more nucleotides) and the SNP allele at position 101 (e.g. 90-101). The even numbered SEQ ID NOs provide downstream sequence as indicated in Tables 6, 7 and 9. For example SEQ ID NO:1 provides upstream sequence for SNP scaffold14566:24841 at nucleotides 1-100 and the SNP at nucleotide 101. SEQ ID NO:2 provides downstream sequence for this SNP.
[0085] In another embodiment, the primer is a primer extension primer and binds to residue 101 of any one of odd numbered SEQ ID Nos 1-600.
[0086] Another aspect includes a plurality of primers for detecting a SNP allele in Table 4, 5 and/or 8, wherein the plurality comprises at least 2 different primers selected from primers described herein.
[0087] In an embodiment, the plurality is a plurality of primer pairs.
[0088] A further aspect is a probe that is specific for an allele.
[0089] In yet another embodiment, the primer or probe further comprises a covalently bound tag, optionally a sequence specific nucleotide tail or label. The primer or probe nucleotide sequence tag can comprise or can be coupled to a fluorescent, radioactive, metal or other detectable label.
[0090] The primer or probe can also comprise a linker.
[0091] Yet a further aspect includes an array, optionally a species specific array comprising a plurality of nucleic acid probes attached to a support surface, each isolated nucleic acid probe comprising a sequence of about 9 to about 100 nucleotides, for example about 9 to about 50 nucleotides or about 18 to about 30 nucleotides, wherein the sequence is at least 9, 12, 15 or at least 18 contiguous nucleotides of any one of SEQ ID NOs: 1 -600.
[0092] The probe can comprise a sequence that is just upstream of the SNP nucleotide, for example nucleotides 83-100 of any odd numbered SEQ ID NO: 1 - 600. In an embodiment, the array comprises allele specific probes (nucleic acids optionally labeled), for example wherein the probe comprises upstream sequence and the SNP.
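For illustration only, the residue numbering used above (upstream sequence at nucleotides 1-100 and the SNP at nucleotide 101 of an odd numbered sequence) can be turned into probe sequences with the following Python sketch. The function names are hypothetical, positions are 1-based as in the text, and the example makes no claim about hybridization conditions or strand orientation.

```python
def upstream_probe(odd_seq, length=18):
    """Return the `length` nucleotides immediately upstream of the SNP.

    With length=18 this is nucleotides 83-100 (1-based) of the sequence.
    """
    if len(odd_seq) < 101:
        raise ValueError("expected 100 upstream nucleotides plus the SNP at position 101")
    return odd_seq[100 - length:100]

def allele_specific_probe(odd_seq, allele, length=18):
    """Return a probe of `length` nucleotides ending on the given SNP allele.

    The last nucleotide is the allele of interest (position 101); the rest
    is the contiguous upstream sequence.
    """
    if len(odd_seq) < 101:
        raise ValueError("expected 100 upstream nucleotides plus the SNP at position 101")
    return odd_seq[101 - length:100] + allele
```

Calling `allele_specific_probe` once with the major allele and once with the minor allele gives a probe pair that differs only at its final nucleotide.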
[0093] In an embodiment, the array further comprises one or more negative control probes and/or one or more positive control probes.
[0094] A further aspect includes a kit comprising an isolated nucleic acid, primer, or plurality of primers and/or array described herein.
[0095] The kit can comprise various other reagents for amplifying DNA and/or using an array to detect a SNP such as dNTPs, polymerase, reaction buffer, wash buffers and the like. Accordingly in an embodiment, the kit comprises at least one reagent for an amplifying DNA reaction.
[0096] In an embodiment, the kit further comprises at least one reagent for a primer extension reaction.
[0097] In an embodiment, the set for any of the methods, sets, pluralities, kits, nucleic acids or arrays comprises at least 10, 20, 30, 40 of the SNPs in Table 4, 5 and/or 8.
[0098] The above disclosure generally describes the present application. A more complete understanding can be obtained by reference to the following specific examples. These examples are described solely for the purpose of illustration and are not intended to limit the scope of the application. Changes in form and substitution of equivalents are contemplated as circumstances might suggest or render expedient. Although specific terms have been employed herein, such terms are intended in a descriptive sense and not for purposes of limitation. [0099] The following non-limiting examples are illustrative of the present disclosure:
Examples
Example 1
[00100] Despite its cultivation as a source of food, fibre and medicine, and its global status as the most used illicit drug, the genus Cannabis has an inconclusive taxonomic organization and evolutionary history. Drug types of Cannabis (marijuana), which contain high amounts of the psychoactive cannabinoid delta-9 tetrahydrocannabinol (THC), are used for medicinal purposes and as a recreational drug. Hemp types are grown for the production of seed and fibre, and contain low
amounts of THC. Two species or gene pools (C. sativa and C. indica) are widely used in describing the pedigree or appearance of cultivated cannabis plants. Using 14,031 single-nucleotide polymorphisms (SNPs) genotyped in 81 marijuana and 43 hemp samples, marijuana and hemp are found to be significantly differentiated at a genome-wide level, demonstrating that the distinction between these populations is not limited to genes underlying THC production. There is a moderate correlation between the genetic structure of marijuana strains and their reported C. sativa and C. indica ancestry.
[00101] To evaluate the genetic structure of commonly cultivated Cannabis, 81 marijuana and 43 hemp samples were genotyped using genotyping-by-sequencing (GBS) [5]. The marijuana samples represent a broad cross section of modern commercial strains and landraces, while the hemp samples include diverse European and Asian accessions and modern varieties. In total, 14,031 SNPs were identified after applying quality and missingness filters. Principal components analysis (PCA) of both marijuana and hemp (Fig. 1a) revealed clear genetic structure separating marijuana and hemp along the first principal component (PC1). This distinction was further supported using the fastSTRUCTURE algorithm [6] assuming K = 2 ancestral populations (Fig. 1c). PCA and fastSTRUCTURE produced highly similar results: a sample's position along PC1 was strongly correlated with its group membership according to fastSTRUCTURE at K = 2 (r2 = 0.964; p-value = 3.55 x 10-eo).
[00102] A putative C. indica marijuana strain from Pakistan that is genetically more similar to hemp than it is to other marijuana strains was identified (Fig. 1a). Similarly, hemp sample CAN 37/97 clusters more closely with marijuana strains (Fig. 1a). These outliers may be due to sample mix-up or their classification as hemp or marijuana may be incorrect.
[00103] These results significantly expand our understanding of the evolution of marijuana and hemp lineages in Cannabis. Previous analyses have shown that marijuana and hemp differ in their capacity for cannabinoid biosynthesis, with marijuana possessing the BT allele coding for tetrahydrocannabinolic acid synthase and hemp typically possessing the BD allele for cannabidiolic acid synthase [7]. As well, transcriptome analysis of female flowers showed that cannabinoid pathway
genes are significantly upregulated in marijuana compared to hemp, as expected from the very high THC levels in the former compared to the latter [3]. The present results indicate that the genetic differences between the two are distributed across the genome and are not restricted to loci involved in cannabinoid production. In addition, levels of heterozygosity are higher in hemp than in marijuana (Fig. 1 b; Mann-Whitney U-test, p-value = 8.64 x 10-14), which suggests that hemp cultivars are derived from a broader genetic base than that of marijuana strains and/or that breeding among close relatives is more common in marijuana than in hemp.
[00104] The difference between marijuana and hemp plants has considerable legal implications in many countries, and to date forensic applications have largely focused on determining whether a plant should be classified as drug or non-drug [8]. EU and Canadian regulations only permit hemp cultivars containing less than 0.3% THC to be grown. While hemp and marijuana appear relatively well separated along PC1 (Fig. 1a), no SNPs with fixed differences were found between these two groups: the highest FST value between hemp and marijuana among all 14,031 SNPs was 0.87 for a SNP with an allele frequency of 0.82 in hemp and 0 in marijuana (Table 1).
[00105] The average FST between hemp and marijuana is 0.156 (Fig. 3), which is similar to the degree of genetic differentiation in humans between Europeans and East Asians [9]. Thus, while cannabis breeding has resulted in a clear genetic differentiation according to use, hemp and marijuana still largely share a common pool of genetic variation.
[00106] Although the taxonomic separation of the putative taxa C. sativa and C. indica remains controversial, a vernacular taxonomy that distinguishes between "Sativa" and "Indica" strains is widespread in the marijuana community. Sativa-type plants, tall with narrow leaves, are widely believed to produce marijuana with a stimulating, cerebral psychoactive effect while Indica-type plants, short with wide leaves, are reported to produce marijuana that is sedative and relaxing. The genetic structure of marijuana is in partial agreement with strain-specific ancestry estimates obtained from various online sources (Fig. 2, Table 2). A moderate correlation between the positions of marijuana strains along the first principal component (PC1) of Fig. 2a and reported estimates of C. sativa ancestry (Fig. 2c) (r2 = 0.22; p-value = 9 x 10-6) was observed. This relationship is also observed for the second principal component (PC2) of Fig. 1a (r2 = 0.23; p-value = 6.71 x 10-6). This observation suggests that C. sativa and C. indica may represent distinguishable pools of genetic diversity [1] but that breeding has resulted in considerable admixture between the two. While there appears to be a genetic basis for the reported ancestry of many marijuana strains, in some cases the assignment of ancestry strongly disagrees with our genotype data. For example Jamaican Lambs Bread (100% reported C. sativa) was nearly identical (IBS = 0.98) to a reported 100% C. indica strain from Afghanistan. Sample mix-up cannot be excluded as a potential reason for these discrepancies, but a similar level of misclassification was found in strains obtained from Dutch coffee shops based on chemical composition [10]. The inaccuracy of reported ancestry in marijuana likely stems from the predominantly clandestine nature of Cannabis growing and breeding over the past century. Recognizing this, marijuana strains sold for medical use are often referred to as Sativa or Indica "dominant" to describe their morphological characteristics and therapeutic effects
[10]. The results suggest that the reported ancestry of some of the most common marijuana strains only partially captures their true ancestry.
Materials and Methods
[00107] Genetic material and genotyping. The marijuana strains genotyped were grown by Health Canada authorized producers and represent germplasm grown and used for breeding in the medical and recreational marijuana industries (Table 2). Hemp strains were obtained from a Health Canada hemp cultivation licensee, and represent modern seed and fibre cultivars grown in Canada as well as diverse European and Asian germplasm (Table 3). DNA was extracted from leaf tissue using standard protocols, and library preparation and sequencing were performed using the GBS protocol published by Sonah et al [15]. SNPs were called using the GBS pipeline developed by Gardner et al. [16], aligning to the canSat3 C. sativa reference genome assembly [3]. Quality filtering of genetic markers was performed in PLINK [17] by removing SNPs with (i) greater than 20% missingness by locus (ii) a minor allele frequency less than 1 % and (iii) excess heterozygosity (a
Hardy-Weinberg equilibrium p-value less than 0.0001 ). After filtering, 14,031 SNPs remained for analysis.
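The three marker filters can be illustrated with the following Python sketch, which decides whether a single SNP is retained. It is not the PLINK implementation: the function name and genotype coding (0, 1 or 2 copies of one allele, NaN for missing) are assumptions, and a one-degree-of-freedom chi-square test is swapped in for the Hardy-Weinberg test purely to keep the example short.

```python
import math
import numpy as np

def snp_passes_qc(genotypes, max_missing=0.20, min_maf=0.01, hwe_alpha=1e-4):
    """Return True if a SNP survives missingness, MAF and excess-heterozygosity filters.

    genotypes: allele counts (0/1/2) for one SNP across samples, NaN if missing.
    """
    g = np.asarray(genotypes, dtype=float)
    called = g[~np.isnan(g)]
    n = called.size
    if n == 0 or (g.size - n) / g.size > max_missing:
        return False  # more than 20% missingness by locus
    q = called.sum() / (2 * n)  # frequency of the counted allele
    if min(q, 1 - q) < min_maf:
        return False  # minor allele frequency below 1%
    observed = np.array([(called == 0).sum(), (called == 1).sum(), (called == 2).sum()], dtype=float)
    expected = n * np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    excess_het = observed[1] > expected[1]
    return not (excess_het and p_value < hwe_alpha)
```

Only heterozygote excess triggers removal in this sketch, mirroring the "excess heterozygosity" wording; a deficit of heterozygotes is left in place.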
[00108] Collection of reported marijuana ancestry. Reported ancestry proportions (% C. sativa and % C. indica) were manually obtained from online strain databases, cannabis seed retailers, and licensed producers of medical marijuana (Table 2). Ancestry estimates for 26 strains for which no online information was available were assigned.
[00109] Analysis of population structure and heterozygosity. Principal components analysis (PCA) was performed using the adegenet v1.4-2 package [18] in R v3.1.1 using default parameters. fastSTRUCTURE [6] was run at K = 2 and K = 3 using default parameters for hemp and marijuana samples combined (14,031 SNPs) (Fig. 1a, c), and marijuana samples alone (10,651 SNPs) (Fig. 2a, b). Heterozygosity by individual was calculated in R by dividing the number of heterozygous sites by the number of non-missing genotypes for each sample.
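The per-individual heterozygosity calculation (heterozygous sites divided by non-missing genotypes) was carried out in R; an equivalent Python sketch is given below for illustration. The genotype coding (0/1/2 allele counts, NaN for missing) and the function name are assumptions of this example.

```python
import numpy as np

def heterozygosity_by_individual(genotypes):
    """Fraction of heterozygous calls per sample.

    genotypes: 2-D array-like, rows = samples, columns = SNPs, coded as
    0/1/2 copies of one allele with NaN for missing genotypes.
    Samples with no called genotypes get NaN.
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=1)   # non-missing genotypes per sample
    n_het = (g == 1).sum(axis=1)            # heterozygous sites per sample
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(n_called > 0, n_het / n_called, np.nan)
```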
[00110] Identity by state (IBS) Analysis. Pairwise proportion IBS between all pairs of samples was calculated using PLINK. One outlier was excluded from this analysis, C. indica (Pakistan), because of its significantly higher IBS to hemp than all other marijuana strains (Labeled marijuana sample in Fig. 1a).
[00111] To determine if the hemp population shared greater allelic similarity to C. sativa or C. indica marijuana, the mean pairwise IBS was calculated between each marijuana strain and all hemp strains. This analysis was performed at various minor allele frequency thresholds and the result remained unchanged.
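As a rough illustration of the comparison described above, the following Python sketch computes a pairwise proportion IBS and its mean between one marijuana strain and a group of hemp strains. It is written in the spirit of the proportion IBS reported by PLINK (alleles shared divided by alleles compared) but is not PLINK's code; the function names and the 0/1/2 genotype coding with NaN for missing data are assumptions.

```python
import numpy as np

def proportion_ibs(g1, g2):
    """Proportion of alleles identical by state between two samples.

    Each genotype vector holds 0/1/2 allele counts, NaN where missing.
    Only SNPs called in both samples are compared.
    """
    a = np.asarray(g1, dtype=float)
    b = np.asarray(g2, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return float("nan")
    # Each SNP contributes 2, 1 or 0 shared alleles out of 2.
    return float(1.0 - np.abs(a[both] - b[both]).sum() / (2.0 * both.sum()))

def mean_ibs_to_group(sample, group):
    """Mean pairwise IBS between one sample and every member of a group."""
    return float(np.nanmean([proportion_ibs(sample, other) for other in group]))
```

Comparing `mean_ibs_to_group` values for Sativa-type and Indica-type strains against the same hemp group is one way to ask which marijuana type shares more alleles with hemp.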
Example 2
Selection of Cannabis informative markers
[00112] Nine reported C. indica and 9 reported C. sativa individuals were selected to form ancestral populations for the selection of genetic markers that are able to differentiate the two groups. Individuals were selected manually on the basis of both their position along the first principal component in Figure 5 (actual genetic structure observed using 9776 SNPs), as well as their reported C. sativa or C. indica ancestry.
Selection of ancestry informative markers (AIMs)
[00113] The top 100 highest FST SNPs were extracted and evaluated for their use in estimating genetic structure, which in the present case is being used as a proxy for C. sativa/C. indica ancestry given the unavailability of true pure C. sativa and C. indica populations. The same was performed between hemp cultivars and marijuana strains of Cannabis (Fig. 1a).
Example evaluation of AIMs for estimating population structure
[00114] Assuming the first principal component of Figure 5 is representative of population structure between C. indica and C. sativa type marijuana strains, a strain's position along the X axis (PC1) represents genetic similarity to each population. In the case of admixed individuals, the position could be representative of genomic contribution from the C. indica and C. sativa gene pools. For the purposes of this analysis, an individual's position along PC1 using 9776 SNPs (Figure 5) is considered to be an individual's true ancestry. By projecting samples onto principal components computed using only the ancestral populations, additional samples can be added to the analysis without changing the relative positions of the ancestral strains in PC space, and the centroids of the clusters can be used as anchors along PC1 for estimating ancestry. Because not every SNP will contribute equally to an individual's position along PC1, a subset of markers that will capture nearly all the variance accounted for by that component is selected.
[00115] First, the 2 highest FST SNPs are selected and used to perform PCA using only the ancestral C. sativa and C. indica populations. The rest of the samples (n = 63) are then projected onto those components, and their positions along PC1 stored. To determine the accuracy of this 2 marker panel, the Pearson's product moment correlation coefficient (Y axis, Figure 6) was calculated between these positions and the positions calculated using the full set of 9776 SNPs. The next highest FST SNP was added to the panel and this process was repeated for all 100 highest FST markers. Accuracy is not improved within this dataset for marker panels of more than approximately 40 of the highest FST SNPs within this population (Fig. 7). Additional ancestry informative SNPs may provide greater accuracy in novel samples and can provide redundancy in the event of failed genotyping reactions.
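The incremental panel evaluation can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the code used to generate the figures: genotypes are assumed to be complete 0/1/2 matrices, PCA is performed by singular value decomposition of the mean-centred ancestral genotypes, and the function and argument names are hypothetical. Because the sign of a principal component is arbitrary, the absolute value of the Pearson correlation is reported.

```python
import numpy as np

def pc1_projection(ancestral, others):
    """Fit PCA on the ancestral samples only and project other samples onto PC1."""
    ancestral = np.asarray(ancestral, dtype=float)
    others = np.asarray(others, dtype=float)
    mean = ancestral.mean(axis=0)
    # Rows of vt are the principal axes of the centred ancestral genotypes.
    _, _, vt = np.linalg.svd(ancestral - mean, full_matrices=False)
    return (others - mean) @ vt[0]

def panel_accuracy(ancestral, others, ranked_snps, true_pc1, max_size=100):
    """Correlate PC1 positions from growing marker panels with reference positions.

    ranked_snps: column indices ordered from highest to lowest FST.
    true_pc1: PC1 positions of `others` computed from the full SNP set.
    Returns a list of (panel size, |Pearson r|), starting with the top 2 SNPs.
    """
    ancestral = np.asarray(ancestral, dtype=float)
    others = np.asarray(others, dtype=float)
    results = []
    for size in range(2, min(max_size, len(ranked_snps)) + 1):
        cols = list(ranked_snps[:size])
        pc1 = pc1_projection(ancestral[:, cols], others[:, cols])
        r = np.corrcoef(pc1, true_pc1)[0, 1]
        results.append((size, abs(float(r))))
    return results
```

Plotting the second element of each tuple against panel size gives a curve of the kind described, from which the point of diminishing returns can be read off.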
Example 3
Weighting of SNPs:
[00116] To rank SNPs according to their ancestry informativeness, the fixation index (FST) according to Weir and Cockerham (1984) was calculated for each marker. This estimate ranges between 0 and 1, where a SNP with FST = 1 has an allele found at 100% frequency in one population, and 0% frequency in another.
[00117] Willing, Dreyer, and van Oosterhout (2012) [21] describe the calculation as follows:
[00118]
where
[00119] Here, s is the observed variance of allele frequencies, n is the number of individuals per population, p is the mean allele frequency over all populations, r is the number of sampled populations and h is the mean observed heterozygosity.
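For orientation, the single-locus estimator of Weir and Cockerham (1984) for a biallelic marker can be sketched in Python as below, using the same quantities named above (variance of allele frequencies, sample sizes, mean allele frequency, number of populations and mean observed heterozygosity). This is the standard variance-component form, theta = a/(a+b+c), and is offered as an assumption about what the calculation involves rather than a transcription of the equation referred to; the function and argument names are hypothetical.

```python
def weir_cockerham_fst(sample_sizes, allele_freqs, het_freqs):
    """Single-locus Weir and Cockerham (1984) FST for a biallelic SNP.

    sample_sizes: individuals sampled per population.
    allele_freqs: frequency of one allele in each population.
    het_freqs: observed heterozygote frequency in each population.
    """
    r = len(sample_sizes)
    n_total = sum(sample_sizes)
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n in sample_sizes) / n_total) / (r - 1)
    p_bar = sum(n * p for n, p in zip(sample_sizes, allele_freqs)) / n_total
    s2 = sum(n * (p - p_bar) ** 2 for n, p in zip(sample_sizes, allele_freqs)) / ((r - 1) * n_bar)
    h_bar = sum(n * h for n, h in zip(sample_sizes, het_freqs)) / n_total

    pq = p_bar * (1 - p_bar)
    # Variance components: among populations (a), among individuals within
    # populations (b) and within individuals (c).
    a = (n_bar / n_c) * (s2 - (pq - (r - 1) / r * s2 - h_bar / 4) / (n_bar - 1))
    b = (n_bar / (n_bar - 1)) * (pq - (r - 1) / r * s2 - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2
    total = a + b + c
    return a / total if total != 0 else float("nan")
```

With two populations fixed for alternative alleles the estimate is 1; when allele frequencies are identical it falls to zero or slightly below, since the estimator corrects for sampling and can therefore be negative.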
Example 4
Population Assignment
[00120] Population assignment can be performed if the novel sample has been genotyped for ancestry informative markers for which the alleles and allele frequencies are already known in the ancestral populations.
[00121] A test sample of a cannabis sample to be characterized is obtained. The test sample is genomic DNA and the genomic DNA is subjected to genotyping of at
least 10 of the markers in Table 4 or Table 5 depending on whether it is desired that the ancestral contribution be determined or the sample be identified as marijuana or hemp.
[00122] An assignment test developed by Paetkau et al. (23) and described in Hansen, Kenchington and Nielsen (2001) (24) can be used.
[00123] For each cannabis sample being assigned, the log-likelihood of it being derived from a specific population is calculated as:
Equation 1
\[ \log L = \sum_{l=1}^{n} \log x_l, \qquad x_l = \begin{cases} p_i^2 & \text{if } i = j \\ 2\,p_i\,p_j & \text{if } i \neq j \end{cases} \]
where n denotes the number of loci, i and j denote the two alleles at the lth locus, and p_i and p_j denote the frequency of the ith and jth allele of the lth locus in the population being considered.
[00124] Calculations are made for each population using the loci and frequencies provided in Table 4 or 5, and the cannabis sample is assigned to the population in which it has the highest likelihood of belonging.
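A minimal Python sketch of such an assignment test is shown below. It assumes Hardy-Weinberg genotype probabilities at each locus (the squared allele frequency for a homozygote, twice the product of the two allele frequencies for a heterozygote) summed as logarithms across loci. The function names and data layouts are hypothetical, and the small floor applied to zero allele frequencies is an assumption of this sketch, added only so that the logarithm is defined.

```python
import math

def genotype_log_likelihood(genotype, freqs, floor=0.01):
    """Log-likelihood of a multilocus genotype in one population.

    genotype: list with one (allele_i, allele_j) tuple per locus, or None
              where genotyping failed.
    freqs: list with one dict per locus mapping allele -> frequency in the
           population being considered.
    floor: minimum frequency used in place of zero (assumption of this sketch).
    """
    total = 0.0
    for locus_genotype, locus_freqs in zip(genotype, freqs):
        if locus_genotype is None:
            continue  # failed reactions are not considered
        i, j = locus_genotype
        p_i = max(locus_freqs.get(i, 0.0), floor)
        p_j = max(locus_freqs.get(j, 0.0), floor)
        total += math.log(p_i * p_i if i == j else 2.0 * p_i * p_j)
    return total

def assign_population(genotype, populations):
    """Return the population with the highest log-likelihood, plus all scores."""
    scores = {name: genotype_log_likelihood(genotype, f) for name, f in populations.items()}
    return max(scores, key=scores.get), scores
```

Natural logarithms are used here; the base does not affect which population has the highest likelihood.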
Example 5
Ancestry estimation
[00125] Calculation of a novel sample's hybridization index (e.g. ancestry contribution) can be performed if the novel sample has been genotyped for ancestry informative markers for which the alleles and allele frequencies are already known in the ancestral populations.
[00126] Ancestry analysis can determine if the cannabis sample is a 'pure' descendant of a reference sample or reference profile or if it is the result of interbreeding between individuals from two different populations, i.e. an admixed individual or 'intraspecific hybrid'.
[00127] Campton and Utter (1985) developed a "hybrid index" (25). The hybrid index can be regarded as a way of visualizing the relative assignment probabilities in an assignment test involving two parental populations. The hybrid index, IH, requires three samples (or a sample and two reference profiles), i.e. a sample or reference profile of each of the two possible parental populations and a sample of the group of suspected 'hybrids'.
Equation 2
where p_x denotes the likelihood of the multilocus genotype of an individual in population x and p_y similarly denotes the likelihood in population y, calculated as in Equation 1.
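The hybrid index of Campton and Utter is commonly written as I_H = 1 - log(p_x) / (log(p_x) + log(p_y)). The sketch below assumes that form for Equation 2 and takes the two log-likelihoods as inputs; the function name and the example values are hypothetical.

```python
def hybrid_index(log_px, log_py):
    """Hybrid index under the assumed form 1 - log(p_x) / (log(p_x) + log(p_y)).

    log_px, log_py: log-likelihoods of the same multilocus genotype in
    populations x and y (both are normally negative).
    """
    total = log_px + log_py
    if total == 0:
        raise ValueError("log-likelihoods sum to zero; index is undefined")
    return 1.0 - log_px / total


# Genotype far more likely in x than in y, and one equally likely in both.
print(hybrid_index(-2.0, -10.0))
print(hybrid_index(-4.0, -4.0))
```

Under this form, values near 1 indicate a genotype that is much more likely in population x, values near 0 one that is much more likely in population y, and values near 0.5 an individual of intermediate (admixed) ancestry.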
Example 6
SNP discovery
[00129] Genetic material and genotyping. The marijuana strains genotyped were grown by Health Canada authorized producers and represent germplasm grown and used for breeding in the medical and recreational marijuana industries (Table 10). DNA was extracted from leaf tissue using standard protocols, and library preparation and sequencing were performed using the GBS protocol published by Poland et al. [26]. SNPs were called using the GBS pipeline developed by Melo et al. [27], aligning to the canSat5 C. sativa reference genome assembly (unpublished). Quality filtering of genetic markers was performed in PLINK [17] by removing SNPs with (i) greater than 20% missingness by locus, (ii) a minor allele frequency less than 1%, and (iii) excess heterozygosity (a Hardy-Weinberg equilibrium p-value less than 0.0001). After filtering, 9,123 SNPs remained for analysis.
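The first two filtering criteria can be illustrated with a short Python sketch. This is not the PLINK implementation; it only shows the thresholds for missingness (more than 20% removed) and minor allele frequency (less than 1% removed) applied to one SNP at a time. The Hardy-Weinberg filter is deliberately omitted, and the function name, the genotype encoding (0, 1 or 2 copies of one allele, `None` for a missing call) and the example data are assumptions made for illustration.

```python
def passes_filters(calls, max_missing=0.20, min_maf=0.01):
    """Return True if a SNP passes the missingness and MAF thresholds.

    calls: list with one entry per sample, giving the number of copies of
           one allele (0, 1 or 2) or None where the genotype is missing.
    """
    n_samples = len(calls)
    observed = [c for c in calls if c is not None]
    if n_samples == 0 or not observed:
        return False
    # Criterion (i): remove SNPs with more than 20% missing genotypes.
    missing_rate = 1.0 - len(observed) / n_samples
    if missing_rate > max_missing:
        return False
    # Criterion (ii): remove SNPs with minor allele frequency below 1%.
    allele_freq = sum(observed) / (2 * len(observed))
    maf = min(allele_freq, 1.0 - allele_freq)
    return maf >= min_maf


# Hypothetical genotype calls for three SNPs across ten samples.
snps = {
    "keep": [0, 1, 2, 0, 1, None, 0, 0, 1, 2],
    "too_missing": [0, None, None, None, 1, 0, 0, 0, 0, 0],
    "monomorphic": [0] * 10,
}
retained = [name for name, calls in snps.items() if passes_filters(calls)]
print(retained)
```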
[00130] Table 8 identifies the major and minor alleles identified in Cannabis sativa and Cannabis indica, and Table 9 provides upstream and downstream sequence for each SNP. Table 10 provides reference information on the reported ancestry. Figure 8 shows a principal component analysis (PCA) with strains grouped by whether they are reported as C. indica or C. sativa.
Table 1
Positions and allele frequencies of the top 50 SNPs by FST between marijuana and hemp calculated according to equation 10 in Weir and Cockerham (1984) [19].
Table 2
Table 3
Sample names of genotyped hemp varieties.
Table 4
Positions and allele frequencies of the top 100 SNPs by FST between Cannabis sativa and Cannabis indica calculated according to equation 10 in Weir and Cockerham (1984) [19].
Table 5
Positions and allele frequencies of the top 100 SNPs by FST between marijuana and hemp calculated according to equation 10 in Weir and Cockerham (1984) [19]
Table 6
Upstream, Allele and Downstream sequences for SNPs from Table 4
Table 7
Upstream, Allele and Downstream sequences for SNPs from Table 5
Table 8
[00131] While the present application has been described with reference to what are presently considered to be the preferred examples, it is to be understood that the application is not limited to the disclosed examples. To the contrary, the application is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
[00132] All publications, patents and patent applications are herein incorporated by reference in their entirety to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety. Specifically, the sequences associated with each accession number provided herein, including for example accession numbers and/or biomarker sequences (e.g. protein and/or nucleic acid) provided in the Tables or elsewhere, are incorporated by reference in their entirety.
CITATIONS FOR REFERENCES REFERRED TO IN THE SPECIFICATION
1. Hillig K. Genetic evidence for speciation in Cannabis (Cannabaceae). Genet. Resour. Crop Evol. 2005;52(2):161-80.
2. de Meijer EPM. The Chemical Phenotypes (Chemotypes) of Cannabis. In: Pertwee RG, editor. Handbook of Cannabis. Handbooks in Psychopharmacology: Oxford University Press; 2014. p. 89-110.
3. van Bakel H, Stout J, Cote A, Tallon C, Sharpe A, Hughes T, et al. The draft genome and transcriptome of Cannabis sativa. Genome Biol. 2011;12(10):R102.
4. Bostwick JM. Blurred Boundaries: The Therapeutics and Politics of Medical Marijuana. Mayo Clin. Proc. 2012;87(2):172-86.
5. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS ONE. 2011;6(5):e19379.
6. Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Datasets. Genetics. 2014.
7. de Meijer EPM, Bagatta M, Carboni A, Crucitti P, Moliterni VMC, Ranalli P, et al. The Inheritance of Chemical Phenotype in Cannabis sativa L. Genetics. 2003;163(1):335-46.
8. Piluzza G, Delogu G, Cabras A, Marceddu S, Bullitta S. Differentiation between fiber and drug types of hemp (Cannabis sativa L.) from a collection of wild and domesticated accessions. Genet. Resour. Crop Evol. 2013;60(8):2331-42.
9. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, et al. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science. 2005;307(5712):1072-9.
10. Hazekamp A, Fischedick JT. Cannabis - from cultivar to chemovar. Drug Test Anal. 2012;4(7-8):660-7.
11. Small E, Cronquist A. A Practical and Natural Taxonomy for Cannabis. Taxon. 1976;25(4):405-35.
12. Salentijn EMJ, Zhang Q, Amaducci S, Yang M, Trindade LM. New developments in fiber hemp (Cannabis sativa L.) breeding. Ind Crops Prod. 2014.
13. Franz-Warkentin P. Hemp production sees steady growth in Canada 2013 [cited 2014]. Available from: http://www.agcanada.com/daily/hemp-production-sees-steady-growth-in-canada.
14. Agricultural Act of 2014, Pub. L. No. 113-17 Stat. 128 (Feb. 7, 2014, 2014).
15. Sonah H, Bastien M, Iquira E, Tardivel A, Legare G, Boyle B, et al. An Improved Genotyping by Sequencing (GBS) Approach Offering Increased Versatility and Efficiency of SNP Discovery and Genotyping. PLoS ONE. 2013;8(1):e54603.
16. Gardner KM, Brown P, Cooke TF, Cann S, Costa F, Bustamante C, et al. Fast and Cost-Effective Genetic Mapping in Apple Using Next-Generation Sequencing. G3 (Bethesda). 2014;4(9):1681-7.
17. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet. 2007;81(3):559-75.
18. Jombart T, Ahmed I. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011;27(21):3070-1.
19. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution. 1984;38(6):1358-70.
20. Hillig KW, Mahlberg PG. A Chemotaxonomic Analysis of Cannabinoid Variation in Cannabis (Cannabaceae). American Journal of Botany. 2004;91(6):966-75.
21. Willing E-M, Dreyer C, van Oosterhout C. Estimates of Genetic Differentiation Measured by FST Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers. PLoS ONE. 2012;7(8):e42649.
22. McClure KA, Sawler J, Gardner KM, Money D, Myles S. Genomics: A Potential Panacea for the Perennial Problem. Am J Botany. 2014;101:1780-90.
23. Paetkau D, Calvert W, Stirling I, Strobeck C. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology. 1995;4:347-54.
24. Hansen MM, Kenchington E, Nielsen EE. Assigning individual fish to populations using microsatellite DNA markers: Methods and applications. Fish and Fisheries. 2001;2:93-112.
25. Campton DE, Utter FM. Natural hybridization between steelhead trout (Salmo gairdneri) and coastal cutthroat trout (Salmo clarki clarki) in two Puget Sound streams. Can J Fish Aquat Sci. 1985;42:110-9.
26. Poland JA, Brown PJ, Sorrells ME, Jannink JL. Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. PLoS ONE. 2012;7(2):e32253.
27. Melo AT, Bartaula R, Hale I. GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data. BMC Bioinformatics. 2016;17:29.