WO2011117388A1 - Method of molecular marker selection - Google Patents

Method of molecular marker selection

Info

Publication number
WO2011117388A1
Authority
WO
WIPO (PCT)
Prior art keywords
seq
primer
pair
reverse
reverse primer
Prior art date
Application number
PCT/EP2011/054624
Other languages
French (fr)
Inventor
Florian Martin
Paolo Donini
Gregor Nicolas Bindler
Ferruccio Gadani
Original Assignee
Philip Morris Products S.A.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Philip Morris Products S.A.
Publication of WO2011117388A1

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40Population genetics; Linkage disequilibrium
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/156Polymorphic or mutational markers
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the invention relates to a method for selecting a minimal set of molecular markers from a plurality of markers for identifying an unknown sample or differentiating between groups of unknown samples, or a combination thereof.
  • the invention further relates to genotyping kits comprising a minimal set of markers determined by a method of the invention and to the use of said kits for performing a given discrimination task.
  • RNA ribonucleic acid
  • DNA deoxyribonucleic acid
  • SSR simple sequence repeat markers
  • SNPs single nucleotide polymorphisms
  • the SSRs of interest for marker development include di-nucleotide and higher order repeats (e.g. (AG)n, (TAT)n, etc.).
  • the number of repeats usually ranges between just a few units to several dozens of units.
  • a polymorphism between individuals of a population may exist, for example, at a site within the genome containing a microsatellite comprising a different number of repeat units. As a consequence, the DNA of a given individual may be slightly longer or shorter than the DNA of another individual. These differences are detected by site-specific amplification of the DNA using PCR (polymerase chain reaction), followed by size-based separation of the DNA fragments.
  • Polymorphisms (literally multiple forms) at a specific site in the genome are also referred to as alleles.
  • the site where the allele appears in the genome is referred to as locus.
  • an individual can have one or more alleles at a specific locus.
  • the set of alleles that has been collected for a given individual (often representing a single sample in the study) is referred to as the genotype of that individual.
  • SSR markers hundreds or more
  • the invention provides a molecular marker selection method to identify a minimal set of markers from a plurality of markers and to establish a genotyping kit comprising said minimal marker set.
  • the marker selection is combined with computing a prototype kernel to finally build a probabilistic prediction model for unknown sample identification.
  • some or all the steps of the molecular marker selection method are computer-implemented.
  • the method of molecular marker selection from a plurality of markers for identifying an unknown sample or for differentiating between or classifying a group of unknown samples preferably comprises the steps of: (a) computing a score for each marker, which score represents the ability of the given marker to discriminate between groups of samples; (b) computing a redundancy score among the markers; and (c) selecting a minimal set of k non-redundant markers on the basis of the computed discrimination score and redundancy score; and optionally further comprises (d) providing a genotyping kit comprising said minimal set of k non-redundant markers selected in step (c).
  • the invention relates to a computer-implemented method of molecular marker selection from a plurality of markers for identifying an unknown sample or for differentiating between or classifying a group of unknown samples, which method preferably comprises the steps of: (a) computing a score for each marker, which score represents the ability of the given marker to discriminate between groups of samples; (b) computing a redundancy score among the markers; and (c) selecting a minimal set of k non-redundant markers on the basis of the computed discrimination score and redundancy score; and optionally further comprises (d) providing a genotyping kit comprising said minimal set of k non-redundant markers selected in step (c).
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of k non-redundant markers, which is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
  • step (a) of the method according to the invention and as described herein, which step, in a specific embodiment of the invention, is a computer-implemented step, preferably provides a discriminating power p_1, ..., p_m for each of the markers, with the subscript "m" representing the number of markers.
  • step (b) of the method according to the invention and as described herein, relating to computing a redundancy score among the markers, which step, in a specific embodiment of the invention, is a computer-implemented step, further comprises the steps of (b1) providing a maximum size k for the kit to be selected; (b2) providing thresholds on the discriminating power; (b3) ordering the discriminating powers p_i in descending order; and (b4) computing the redundancy and associated redundancy thresholds for each pair of markers.
  • step (c) of the method according to the invention and as described herein comprises selecting the k markers forming the kit depending on the computed discrimination power, the computed redundancy and on the associated thresholds.
  • the selected minimal set of k markers is used to compute a prototype kernel.
  • the computing of the prototype kernel preferably comprises the steps of calculating a genetic similarity measure for each pair of samples. This provides a similarity matrix.
  • the computing of the prototype kernel further comprises the step of choosing templates for each group C, as a fixed proportion p of the observations of that group, by using K-medoids clustering for each group.
  • the kernel is preferably defined by [equation not recoverable from the source text].
  • a method of the invention and as described herein is provided, wherein in a next step a Kernel-Linear Discriminant Analysis or a Kernel- Principal Component Analysis followed by Linear Discriminant Analysis is applied on the computed prototype kernel for obtaining a probabilistic model for prediction and identification.
  • the molecular markers to be used within the scope of the present invention are particularly simple sequence repeat markers. However, it is to be noted in this context that the invention is applicable to any problem where the predictors are nominal variables and where a similarity measure between samples based on those variables can be defined.
  • SNPs single nucleotide polymorphisms
  • indels (i.e., insertions/deletions)
  • SSRs simple sequence repeats
  • RFLPs restriction fragment length polymorphisms
  • RAPDs random amplified polymorphic DNAs
  • CAPS cleaved amplified polymorphic sequences
  • DArT Diversity Arrays Technology
  • AFLPs amplified fragment length polymorphisms
  • the method and genotyping kits disclosed herein can potentially be used for any area of application where molecular markers are utilized including plant and animal breeding, crop and animal production, quality control, parentage testing, variety or type discrimination, traceability and authentication testing, but also medical diagnosis such as, for example, diagnosis of animals or humans based on molecular markers or based on certain patient characteristics, for example age range, medical history, etc.
  • the selected set of markers is applied to at least one unknown sample to be identified.
  • This step, which again may be a computer-implemented step, comprises calculating genetic similarity for the unknown sample to be identified and the known sample used for marker selection.
  • the method further comprises computing a prototype kernel for the unknown sample on the basis of the k selected markers, and the chosen templates. The computation of the prototype kernel is preferably followed by applying a Kernel-Linear Discriminant Analysis or a Kernel-Principal Component Analysis followed by Linear Discriminant Analysis on the computed prototype kernel for obtaining a probability output for the unknown sample to be identified.
  • a method and a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, particularly SSR-based markers, which are representative of the tobacco genome and may be used to perform a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
  • the invention relates to a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, which minimal set of markers is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, represented by at least one of the markers, particularly by at least one of the non-redundant markers, represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, represented by at least two, by at least 3, by at least 4, by at least 5, by at least 6, by at least 7, by at least 8, by at least 9, by at least 10, particularly by at least 20, by at least 25, by at least 30, by at least 35, by at least 40, by at least 45, by at least 50 of the markers, particularly by all of the markers represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
  • the invention also relates to the use of a genotyping kit, including a kit for performing a specific discrimination or classification task according to the invention and as described herein before comprising a set of markers, particularly a minimal set of markers,
  • the task may include, but is not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
  • Fig. 1 shows, by way of example, the genetic dissimilarities using all the SSR markers.
  • the matrix is ordered by tobacco type, showing the contrast between the within-group dissimilarities and the between-group dissimilarities;
  • Fig. 2 shows the genetic dissimilarities between the samples using the selected kit only, showing the within group dissimilarity-between group dissimilarity ratio
  • Fig. 3 shows the ordered discrimination power of the SSR markers of the example and the ones selected by the described approach.
  • SEQ ID NO: 1 shows the nucleotide sequence of forward primer PT1014
  • SEQ ID NO: 2 shows the nucleotide sequence of reverse primer PT1014
  • SEQ ID NO: 3 shows the nucleotide sequence of forward primer PT1043n
  • SEQ ID NO: 4 shows the nucleotide sequence of reverse primer PT1043n
  • SEQ ID NO: 5 shows the nucleotide sequence of forward primer PT1069n
  • SEQ ID NO: 6 shows the nucleotide sequence of reverse primer PT1069n
  • SEQ ID NO: 7 shows the nucleotide sequence of forward primer PT1073
  • SEQ ID NO: 8 shows the nucleotide sequence of reverse primer PT1073
  • SEQ ID NO: 9 shows the nucleotide sequence of forward primer PT1078
  • SEQ ID NO: 10 shows the nucleotide sequence of reverse primer PT1078
  • SEQ ID NO: 11 shows the nucleotide sequence of forward primer PT1085n
  • SEQ ID NO: 12 shows the nucleotide sequence of reverse primer PT1085n
  • SEQ ID NO: 13 shows the nucleotide sequence of forward primer PT1089n
  • SEQ ID NO: 14 shows the nucleotide sequence of reverse primer PT1089n
  • SEQ ID NO: 15 shows the nucleotide sequence of forward primer PT1104n
  • SEQ ID NO: 16 shows the nucleotide sequence of reverse primer PT1104n
  • SEQ ID NO: 17 shows the nucleotide sequence of forward primer PT1140
  • SEQ ID NO: 18 shows the nucleotide sequence of reverse primer PT1140
  • SEQ ID NO: 19 shows the nucleotide sequence of forward primer PT1169n
  • SEQ ID NO: 20 shows the nucleotide sequence of reverse primer PT1169n
  • SEQ ID NO: 21 shows the nucleotide sequence of forward primer PT1176n
  • SEQ ID NO: 22 shows the nucleotide sequence of reverse primer PT1176n
  • SEQ ID NO: 23 shows the nucleotide sequence of forward primer PT1194
  • SEQ ID NO: 24 shows the nucleotide sequence of reverse primer PT1194
  • SEQ ID NO: 25 shows the nucleotide sequence of forward primer PT1199n
  • SEQ ID NO: 26 shows the nucleotide sequence of reverse primer PT1199n
  • SEQ ID NO: 27 shows the nucleotide sequence of forward primer PT1201n
  • SEQ ID NO: 28 shows the nucleotide sequence of reverse primer PT1201n
  • SEQ ID NO: 29 shows the nucleotide sequence of forward primer PT1242
  • SEQ ID NO: 30 shows the nucleotide sequence of reverse primer PT1242
  • SEQ ID NO: 31 shows the nucleotide sequence of forward primer PT1245
  • SEQ ID NO: 32 shows the nucleotide sequence of reverse primer PT1245
  • 33 shows the nucleo
  • SEQ ID NO: 92 shows the nucleotide sequence of reverse primer PT54871
  • SEQ ID NO: 93 shows the nucleotide sequence of forward primer PT54893
  • SEQ ID NO: 94 shows the nucleotide sequence of reverse primer PT54893
  • SEQ ID NO: 95 shows the nucleotide sequence of forward primer PT50392
  • SEQ ID NO: 96 shows the nucleotide sequence of reverse primer PT50392
  • SEQ ID NO: 97 shows the nucleotide sequence of forward primer PT52856
  • SEQ ID NO: 98 shows the nucleotide sequence of reverse primer PT52856
  • SEQ ID NO: 99 shows the nucleotide sequence of forward primer PT54452
  • SEQ ID NO: 100 shows the nucleotide sequence of reverse primer PT54452
  • SEQ ID NO: 101 shows the nucleotide sequence of forward primer PT54693
  • SEQ ID NO: 102 shows the nucleotide sequence of reverse primer PT54693
  • SEQ ID NO: 103 shows the nucleotide sequence of forward primer PT54789
  • SEQ ID NO: 104 shows the nucleotide sequence of reverse primer PT54789
  • SSR marker simple sequence repeat markers
  • The primers associated with an SSR marker are used to amplify a DNA sample by PCR, leading to several amplicon sizes, the "alleles".
  • the set of alleles for a given DNA preparation is called the genotype of the sample.
  • the results of such amplification on one sample are of the form g_i = a_1/a_2, where each a_j is an integer depending on the number of microsatellite repeats between the two flanking primers.
  • the number of alleles depends on the uniqueness of the locus associated with the primers and on the ploidy type (diploid, tetraploid, amphidiploid) of the organism from which the DNA is extracted. From a data analysis point of view, these data are neither continuous, nor nominal (even if they could be considered as such), nor ordinal.
  • the set of genotypes is partially ordered by set inclusion:
  • the SSR data are then used to estimate the degree of polymorphism between two samples by computing a "genetic distance" between them. For example, the Jaccard distance or the Nei-Li distance is used. Given two samples S1, S2 on which m SSR markers are amplified, leading to m genotypes for the first sample and m genotypes for the second sample, where each genotype is seen as a set of amplicons, the following quantities can be computed:
  • in these quantities, A Δ B denotes the symmetric difference of two sets A and B, and #A the cardinality of a set A.
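  • The distance formulas themselves are not legible in this text, so the following is only a minimal sketch using the textbook set-based forms of the Jaccard and Nei-Li (Dice) distances. Pooling the counts over all m markers before dividing is an assumption, as are the function names and the representation of a genotype as a frozenset of amplicon sizes.

```python
from typing import FrozenSet, Sequence

Genotype = FrozenSet[int]  # amplicon sizes observed for one marker (assumed encoding)


def _check(s1: Sequence[Genotype], s2: Sequence[Genotype]) -> None:
    if len(s1) != len(s2):
        raise ValueError("both samples must be typed on the same m markers")


def jaccard_distance(s1: Sequence[Genotype], s2: Sequence[Genotype]) -> float:
    """Pooled #(A symmetric-difference B) / #(A union B) over all markers."""
    _check(s1, s2)
    sym_diff = sum(len(a ^ b) for a, b in zip(s1, s2))
    union = sum(len(a | b) for a, b in zip(s1, s2))
    return sym_diff / union if union else 0.0


def nei_li_distance(s1: Sequence[Genotype], s2: Sequence[Genotype]) -> float:
    """Pooled #(A symmetric-difference B) / (#A + #B) over all markers."""
    _check(s1, s2)
    sym_diff = sum(len(a ^ b) for a, b in zip(s1, s2))
    total = sum(len(a) + len(b) for a, b in zip(s1, s2))
    return sym_diff / total if total else 0.0


# Two hypothetical samples typed on two markers.
sample_1 = [frozenset({100, 104}), frozenset({200})]
sample_2 = [frozenset({100, 108}), frozenset({200})]
print(jaccard_distance(sample_1, sample_2))
print(nei_li_distance(sample_1, sample_2))
```

Both functions return 0 for identical genotypes and 1 when the two samples share no amplicon at any marker.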
  • D dissimilarity matrix
  • the basic concept of the kernel method is to model a classifier in a feature space (which will be a Hilbert space) based only on a similarity matrix, as long as this similarity is positive definite. Indeed, if the measure of similarity between the samples is a positive definite kernel, then classifiers can be learned in the reproducing kernel Hilbert space associated with it. However, the Jaccard or Nei-Li similarities (one minus the dissimilarity previously defined) are not positive definite.
  • the prototype kernel is defined by choosing templates for each class C as a fixed proportion p of the class observations. Those will be the prototypes resulting from the K-medoids algorithm applied to each class, with the number of groups being equal to the integer part of #C · p.
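  • The template selection described above might be sketched as follows. The K-medoids routine is a simple alternating scheme rather than any specific published variant, and it assumes that distinct samples have non-zero dissimilarity. The kernel formula is not legible in this text, so `prototype_kernel` uses one plausible construction (inner products of the similarities to the templates, which is positive semi-definite by construction); that construction, the lower bound of one template per class, and all names are assumptions.

```python
import random


def k_medoids(D, k, n_iter=100, seed=0):
    """Return the indices of k medoids for a square dissimilarity matrix D."""
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(len(D)), k))
    for _ in range(n_iter):
        # Assign every sample to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(D)):
            clusters[min(medoids, key=lambda m: D[i][m])].append(i)
        # Re-centre each cluster on the member with the smallest total dissimilarity.
        new_medoids = sorted(
            min(members, key=lambda c: sum(D[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def select_templates(D, classes, p):
    """Pick int(#C * p) templates per class (at least one, by assumption)."""
    templates = []
    for c in sorted(set(classes)):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        k = max(1, int(len(idx) * p))
        sub = [[D[i][j] for j in idx] for i in idx]
        templates.extend(idx[m] for m in k_medoids(sub, k))
    return templates


def prototype_kernel(S, templates):
    """Assumed form: K[x][y] = sum over templates t of S[x][t] * S[y][t]."""
    phi = [[row[t] for t in templates] for row in S]
    return [[sum(a * b for a, b in zip(u, v)) for v in phi] for u in phi]


# Toy data: six samples on a line, two classes of three.
x = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
D = [[abs(a - b) for b in x] for a in x]
S = [[1.0 - d for d in row] for row in D]
templates = select_templates(D, [0, 0, 0, 1, 1, 1], p=0.5)
K = prototype_kernel(S, templates)
print(templates)
```

Because the kernel is built from a feature map, it stays positive semi-definite even when the underlying similarity matrix is not.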
  • further processing methods are preferably applied. For example, Kernel-LDA (Linear Discriminant Analysis in the feature space) or Kernel-PCA-LDA (Kernel-Principal Component Analysis followed by Linear Discriminant Analysis) is preferably applied on the computed prototype kernel for obtaining a probabilistic model for sample prediction and identification.
  • a subset of markers for example containing 5 to 20 markers only
  • selection of a marker subset is performed as follows.
  • the set of markers that show the biggest polymorphism between the groups to discriminate and the lowest polymorphism within the groups are chosen.
  • a score will then be computed for each marker, which score represents the ability of a given marker to discriminate between the groups.
  • a redundancy score is computed, in order to assess if the polymorphism contained in a marker A is similar to the polymorphism of a second marker B. If this was the case, one marker is preferably dropped in favor of another one exhibiting a different polymorphism.
  • the measure or extent of association between the marker and the group that is used is the Asymmetric Uncertainty Coefficient which reflects the dependency of the marker and the group to be discriminated.
  • the redundancy between two markers will be quantified by the Uncertainty Coefficient.
  • the symmetric uncertainty coefficient is defined by
  • the asymmetric uncertainty coefficient is defined by:
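  • The defining formulas are missing from this text. As a sketch, the standard entropy-based definitions (Theil's U) can be used: U(X|Y) = (H(X) + H(Y) - H(X,Y)) / H(X) for the asymmetric coefficient and U(X,Y) = 2 (H(X) + H(Y) - H(X,Y)) / (H(X) + H(Y)) for the symmetric one. Whether the patent uses exactly these forms, and in which direction the asymmetric coefficient is taken for the marker-group association, is an assumption; the function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (natural log) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def asymmetric_uncertainty(x, y):
    """U(X|Y): fraction of the entropy of x that is explained by y."""
    hx = entropy(x)
    return mutual_information(x, y) / hx if hx > 0 else 0.0


def symmetric_uncertainty(x, y):
    """U(X,Y): mutual information normalised by the mean of both entropies."""
    denom = entropy(x) + entropy(y)
    return 2.0 * mutual_information(x, y) / denom if denom > 0 else 0.0


# Hypothetical genotypes of one marker on four samples, and their group labels.
marker = ["a", "b", "c", "c"]
group = [0, 0, 1, 1]
print(asymmetric_uncertainty(group, marker))   # how well the marker predicts the group
print(symmetric_uncertainty(marker, group))
```

Genotypes only need to be hashable here (for example tuples or frozensets of amplicon sizes), which matches the nominal treatment of the marker data.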
  • the threshold can be chosen according to the asymptotic variance of the estimated uncertainty coefficients.
  • the threshold is set to be a constant c_1 times the asymptotic standard deviation.
  • c_1 will be 10.
  • the uncertainty coefficient threshold between two markers i, j is set to a constant times Std(U(marker_i, marker_j)), where Std denotes the asymptotic standard deviation of the quantity.
  • a forward selection (or, alternatively, a backward selection or an exhaustive search) can preferably be applied in order to select a smaller kit (if needed).
  • the cheapest approach to decrease the kit size is of course to choose the first k markers proposed by the method of the invention.
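  • The selection logic described above (rank markers by discriminating power, discard weak markers, skip markers that are redundant with one already chosen, stop at k) could be sketched as a greedy pass. This is an illustrative reading of the text, not the claimed procedure: the single scalar redundancy threshold simplifies the per-pair thresholds mentioned above, and the marker names, scores and function name are hypothetical.

```python
def select_kit(power, redundancy, k, power_threshold, redundancy_threshold):
    """Greedy selection of at most k non-redundant markers.

    power: dict mapping marker name -> discriminating power
    redundancy: dict mapping frozenset({marker_a, marker_b}) -> redundancy score
    """
    kit = []
    # Visit markers in descending order of discriminating power.
    for marker in sorted(power, key=power.get, reverse=True):
        if len(kit) == k or power[marker] < power_threshold:
            break  # kit is full, or all remaining markers are too weak
        # Keep the marker only if it is not redundant with any marker already kept.
        if all(redundancy[frozenset((marker, chosen))] <= redundancy_threshold
               for chosen in kit):
            kit.append(marker)
    return kit


# Hypothetical scores for four markers.
power = {"M1": 0.90, "M2": 0.85, "M3": 0.50, "M4": 0.05}
redundancy = {
    frozenset(("M1", "M2")): 0.95,  # M2 carries nearly the same information as M1
    frozenset(("M1", "M3")): 0.10,
    frozenset(("M2", "M3")): 0.10,
}
print(select_kit(power, redundancy, k=2, power_threshold=0.1, redundancy_threshold=0.5))
```

Truncating the returned list to its first entries corresponds to the cheap way of shrinking the kit mentioned above, since the list is already ordered by discriminating power.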
  • the method for selecting a kit according to the invention is advantageous as it makes it possible to build an economical and efficient prediction model for discrimination.
  • the method is independent of the supervised method chosen in the modeling process.
  • polymorphism between the (groups of) samples is very well encoded in the prototype kernel and, when combined with Kernel-LDA or Kernel-PCA-LDA, leads to satisfactory prediction models.
  • the selection method of the invention performs well and, when preferably combined with Kernel-LDA or Kernel-PCA-LDA, leads to correct classification rates, as shown below by means of examples. Both alternatives lead to good classification error rates, with a slight advantage for Kernel-PCA-LDA in the example.
  • Another advantage is the benefit of a fast algorithm (as exhaustive searches are generally infeasible or very time-consuming), which is useful for arriving at a few markers that are feasible for a given task.
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, particularly SSR-based markers, which are representative of the tobacco genome and may be used to perform a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
  • a set of non-redundant molecular markers which has a minimal number of markers, and which have been selected as being suitable for a variety of uses as mentioned above and which are maximally required for accomplishing defined tasks including, but not limited to tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting). Accordingly, not all the markers comprised in the minimal set of markers may be needed for accomplishing a certain task. Depending on the specific genotyping problem to be addressed, one of skill in the art can by further experimentation further reduce the number of markers that are used in the genotyping test.
  • the minimal number of markers used for authentication of a known type of tobacco material may be lower than the set of markers used for distinguishing several varieties. Accordingly, it is contemplated a subset of markers may be selected for use in genotyping from the minimal set of non-redundant markers as identified by the pairs of primers (SEQ ID NO: 1-104) provided herein. In a less preferred embodiment, it is also contemplated other molecular markers can be used in conjunction with one or more non-redundant molecular markers as identified by the pairs of primers (SEQ ID NO: 1-104) provided herein.
  • the invention relates to a genotyping kit comprising, consisting essentially of or consisting of a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pair 1 comprising a forward primer of SEQ ID NO: 1 and a reverse primer of SEQ ID NO: 2; primer pair 2 comprising a forward primer of SEQ ID NO: 3 and a reverse primer of SEQ ID NO: 4; primer pair 3 comprising a forward primer of SEQ ID NO: 5 and a reverse primer of SEQ ID NO: 6; primer pair 4 comprising a forward primer of SEQ ID NO: 7 and a reverse primer of SEQ ID NO: 8; primer pair 5 comprising a forward primer of SEQ ID NO:
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by at least one of the polynucleotide fragments disclosed herein before, which fragments are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 as shown in Table 1.
  • a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by at least one, particularly at least two, particularly at least 3, particularly at least 4, particularly at least 5, particularly at least 10, particularly at least 20, particularly at least 25, particularly at least 30, particularly at least 40 of the non-redundant markers, particularly all of the polynucleotide fragments recited herein before, which fragments are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 as shown in Table 1.
  • the invention also relates to the use of a genotyping kit according to the invention and as described herein before comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, for performing a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
  • the number and composition of the genotyping kit may change. Based on the disclosure provided herein, the person skilled in the art is now in a position to identify the set of non-redundant markers most suitable for performing the respective genotyping task.
  • Nicotiana tabacum, a functional diploid, was used.
  • the dataset used here contained the measurements from the 52 described SSR markers on 234 varieties (or landraces) without replicates, leading to 234 observations.
  • the objective was to discriminate the tobacco types Burley, Flue Cured and Oriental.
  • the selection method of the invention tends to maximize the between types polymorphism and minimize the within types polymorphism.
  • Fig. 3 shows the ordered discrimination power of the SSR for this particular discrimination task.
  • the algorithm proposed a kit of size 4.
  • 10,000 random selections were performed (with the proportion of prototypes being equal to 1/5).
  • the 5-fold cross-validation results from these simulations are summarized below. They show that, among the 10,000 4-tuples, the one chosen by the algorithm belongs to the 0.9%-best 4-tuples for KPCA-LDA and the 1.5%-best 4-tuples for KLDA: Summary of cross-validation error rates (in %) from simulations:
  • the method of the invention is very efficient in selecting SSR markers that perform well.
  • the prototype kernel seems to encode the desired polymorphism information and lead to good prediction models.

Abstract

The invention relates to a computer-implemented method of molecular marker selection from a plurality of markers for identifying an unknown sample or differentiating between groups of unknown samples, or a combination thereof. The invention further relates to genotyping kits comprising a minimal set of markers determined by a method of the invention and to the use of said kits for performing a given discrimination task.

Description

Method of molecular marker selection
The invention relates to a method for selecting a minimal set of molecular markers from a plurality of markers for identifying an unknown sample or differentiating between groups of unknown samples, or a combination thereof. The invention further relates to genotyping kits comprising a minimal set of markers determined by a method of the invention and to the use of said kits for performing a given discrimination task.
In crop and animal production systems, but also in medical diagnosis, genetic markers are increasingly being used. In crop and animal production, markers are used to distinguish individuals in a larger population based on their genetic make-up. Depending on the specific discrimination task, supervised approaches can be applied to build a prediction model. For economic reasons, it is desirable to find a minimal set of molecular markers that have optimal ability to discriminate, for example, between groups of varieties or between varieties.
Genetic markers are target sites in the genome that differ between individuals of a population. These differences can occur in ribo- or deoxyribonucleic acid (RNA; DNA) coding for specific genes, or in the usually vast areas of intergenic DNA. These differences in the make-up of the genetic code at a specific site in the genome are often referred to as polymorphisms. These polymorphisms are detected with a range of different technologies, of which simple sequence repeat markers (SSRs) and single nucleotide polymorphisms (SNPs) are currently the most commonly used types. SSR markers consist of numerous repeats of short sequences of DNA bases, which are found at loci throughout the plant's genome and have a likelihood of being highly polymorphic. The SSRs of interest for marker development include di-nucleotide and higher order repeats (e.g. (AG)n, (TAT)n, etc.). The number of repeats usually ranges from just a few units to several dozen units. A polymorphism between individuals of a population may exist, for example, at a site within the genome containing a microsatellite comprising a different number of repeat units. As a consequence, the DNA of a given individual may be a little bit longer or shorter than the DNA of another individual. These differences are detected by site-specific amplification of the DNA using PCR (polymerase chain reaction), followed by size-based separation of the DNA fragments. Polymorphisms (literally multiple forms) at a specific site in the genome are also referred to as alleles. The site where the allele appears in the genome is referred to as the locus. Depending on the ploidy level of the organism being studied (haploid, diploid, tetraploid, etc.), an individual can have one or more alleles at a specific locus. The set of alleles that has been collected for a given individual (often representing a single sample in the study) is referred to as the genotype of that individual.
When large numbers of samples and SSR markers (hundreds or more) are involved, the genotyping process can be costly in terms of laboratory consumables, labor and time. As a consequence, it is generally desirable to select a minimal set of markers and to establish a genotyping kit which can be used to perform a given discrimination task.
According to a first aspect, the invention provides a molecular marker selection method to identify a minimal set of markers from a plurality of markers and to establish a genotyping kit comprising said minimal marker set. Preferably, the marker selection is combined with computing a prototype kernel to finally build a probabilistic prediction model for unknown sample identification. In a specific embodiment, some or all the steps of the molecular marker selection method are computer-implemented.
In one embodiment of the invention, the method of molecular marker selection from a plurality of markers for identifying an unknown sample or for differentiating between or classifying a group of unknown samples preferably comprises the steps of: (a) computing a score for each marker, which score represents the ability of the given marker to discriminate between groups of samples; (b) computing a redundancy score among the markers; and (c) selecting a minimal set of k non-redundant markers on the basis of the computed discrimination score and redundancy score; and optionally further comprises (d) providing a genotyping kit comprising said minimal set of k non-redundant markers selected in step (c). In a specific embodiment of the invention, at least one, at least two, at least three or all of the above steps (a) to (d) are computer-implemented. In one embodiment, the invention relates to a computer-implemented method of molecular marker selection from a plurality of markers for identifying an unknown sample or for differentiating between or classifying a group of unknown samples, preferably comprising the steps of: (a) computing a score for each marker, which score represents the ability of the given marker to discriminate between groups of samples; (b) computing a redundancy score among the markers; and (c) selecting a minimal set of k non-redundant markers on the basis of the computed discrimination score and redundancy score; and optionally further comprising (d) providing a genotyping kit comprising said minimal set of k non-redundant markers selected in step (c).
In a specific embodiment of the invention, a genotyping kit is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of k non-redundant markers, which is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
In another specific embodiment, step (a) of the method according to the invention and as described herein, which step, in a specific embodiment of the invention, is a computer-implemented step, preferably provides a discriminating power p_1, ..., p_m for each of the markers, with the suffix "m" representing the number of markers.
In another specific embodiment of the invention, step (b) of the method according to the invention and as described herein relating to computing a redundancy score among the markers, which step, in a specific embodiment of the invention, is a computer-implemented step, further comprises the steps of (b1) providing a maximum size k for the kit to be selected; (b2) providing thresholds on the discriminating power; (b3) ordering the discriminating powers p_i in descending order; and (b4) computing the redundancy and associated redundancy thresholds for each pair of markers.
In still another specific embodiment, step (c) of the method according to the invention and as described herein, which step, in a specific embodiment of the invention, is a computer-implemented step, comprises selecting the k markers forming the kit depending on the computed discrimination power, the computed redundancy and on the associated thresholds.
According to a second aspect of the invention, the selected minimal set of k markers is used to compute a prototype kernel. The computing of the prototype kernel preferably comprises the step of calculating a genetic similarity measure for each pair of samples. This provides a similarity matrix. The computing of the prototype kernel further comprises the step of choosing templates for each group C_i as a fixed proportion p of the observations of that group, by using K-medoids clustering for each group. The kernel is preferably defined by
Figure imgf000004_0001
wherein τ_1, τ_2, ... ∈ X are the chosen templates and s denotes the similarity measure. In a specific embodiment, a method of the invention and as described herein is provided, wherein in a next step a Kernel-Linear Discriminant Analysis or a Kernel-Principal Component Analysis followed by Linear Discriminant Analysis is applied on the computed prototype kernel for obtaining a probabilistic model for prediction and identification.
The molecular markers to be used within the scope of the present invention are particularly simple sequence repeat markers. However, it is to be noted in this context that the invention is applicable to any problem where the predictors are nominal variables and where a similarity measure between samples based on those variables can be defined. Alternative molecular markers may therefore be used within the present invention, including, for example, single nucleotide polymorphisms (SNPs), indels (i.e., insertions/deletions), simple sequence repeats (SSRs), restriction fragment length polymorphisms (RFLPs), random amplified polymorphic DNAs (RAPDs), cleaved amplified polymorphic sequence (CAPS) markers, Diversity Arrays Technology (DArT) markers, and amplified fragment length polymorphisms (AFLPs), among many other examples.
Thus, the method and genotyping kits disclosed herein can potentially be used for any area of application where molecular markers are utilized, including plant and animal breeding, crop and animal production, quality control, parentage testing, variety or type discrimination, traceability and authentication testing, but also medical diagnosis such as, for example, diagnosis of animals or humans based on molecular markers or based on certain patient characteristics, for example age range, medical history, etc.
According to a third aspect of the invention, the selected set of markers is applied to at least one unknown sample to be identified. This step, which again may be a computer-implemented step, comprises calculating genetic similarity for the unknown sample to be identified and the known sample used for marker selection. In a specific embodiment, the method further comprises computing a prototype kernel for the unknown sample on the basis of the k selected markers and the chosen templates. The computation of the prototype kernel is preferably followed by applying a Kernel-Linear Discriminant Analysis or a Kernel-Principal Component Analysis followed by Linear Discriminant Analysis on the computed prototype kernel for obtaining a probability output for the unknown sample to be identified.
In one embodiment of the invention, a method and a genotyping kit are provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, particularly SSR-based markers, which are representative of the tobacco genome and may be used to perform a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
In particular, the invention relates to a genotyping kit comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, which minimal set of markers is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
In one embodiment, a genotyping kit according to the invention is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, represented by at least one of the markers, particularly by at least one of the non-redundant markers, represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1.
In one embodiment, a genotyping kit according to the invention is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, represented by at least two, by at least 3, by at least 4, by at least 5, by at least 6, by at least 7, by at least 8, by at least 9, by at least 10, particularly by at least 20, by at least 25, by at least 30, by at least 35, by at least 40, by at least 45, by at least 50 of the markers, particularly by all of the markers represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 comprising a forward primer and a reverse primer of SEQ ID NOs: 1-104 and as shown in Table 1. The invention also relates to the use of a genotyping kit, including a kit for performing a specific discrimination or classification task according to the invention and as described herein before comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers. The task may include, but is not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting). Depending on the task to be performed, the number and composition of the marker sets will vary.
Brief description of the drawings and sequences
Fig. 1 shows, by way of example, the genetic dissimilarities using all the SSR markers.
The matrix is ordered by tobacco type, showing the contrast between the within-group dissimilarities and the between-group dissimilarities;
Fig. 2 shows the genetic dissimilarities between the samples using the selected kit only, showing the within group dissimilarity-between group dissimilarity ratio; and
Fig. 3 shows the ordered discrimination power of the SSR markers of the example and the ones selected by the described approach.
SEQ ID NO: 1 shows the nucleotide sequence of forward primer PT1014
SEQ ID NO: 2 shows the nucleotide sequence of reverse primer PT1014
SEQ ID NO: 3 shows the nucleotide sequence of forward primer PT1043n
SEQ ID NO: 4 shows the nucleotide sequence of reverse primer PT1043n
SEQ ID NO: 5 shows the nucleotide sequence of forward primer PT1069n
SEQ ID NO: 6 shows the nucleotide sequence of reverse primer PT1069n
SEQ ID NO: 7 shows the nucleotide sequence of forward primer PT1073
SEQ ID NO: 8 shows the nucleotide sequence of reverse primer PT1073
SEQ ID NO: 9 shows the nucleotide sequence of forward primer PT1078
SEQ ID NO: 10 shows the nucleotide sequence of reverse primer PT1078
SEQ ID NO: 11 shows the nucleotide sequence of forward primer PT1085n
SEQ ID NO: 12 shows the nucleotide sequence of reverse primer PT1085n
SEQ ID NO: 13 shows the nucleotide sequence of forward primer PT1089n
SEQ ID NO: 14 shows the nucleotide sequence of reverse primer PT1089n
SEQ ID NO: 15 shows the nucleotide sequence of forward primer PT1104n
SEQ ID NO: 16 shows the nucleotide sequence of reverse primer PT1104n
SEQ ID NO: 17 shows the nucleotide sequence of forward primer PT1140
SEQ ID NO: 18 shows the nucleotide sequence of reverse primer PT1140
SEQ ID NO: 19 shows the nucleotide sequence of forward primer PT1169n
SEQ ID NO: 20 shows the nucleotide sequence of reverse primer PT1169n
SEQ ID NO: 21 shows the nucleotide sequence of forward primer PT1176n
SEQ ID NO: 22 shows the nucleotide sequence of reverse primer PT1176n
SEQ ID NO: 23 shows the nucleotide sequence of forward primer PT1194
SEQ ID NO: 24 shows the nucleotide sequence of reverse primer PT1194
SEQ ID NO: 25 shows the nucleotide sequence of forward primer PT1199n
SEQ ID NO: 26 shows the nucleotide sequence of reverse primer PT1199n
SEQ ID NO: 27 shows the nucleotide sequence of forward primer PT1201n
SEQ ID NO: 28 shows the nucleotide sequence of reverse primer PT1201n
SEQ ID NO: 29 shows the nucleotide sequence of forward primer PT1242
SEQ ID NO: 30 shows the nucleotide sequence of reverse primer PT1242
SEQ ID NO: 31 shows the nucleotide sequence of forward primer PT1245
SEQ ID NO: 32 shows the nucleotide sequence of reverse primer PT1245
SEQ ID NO: 33 shows the nucleotide sequence of forward primer PT1289
SEQ ID NO: 34 shows the nucleotide sequence of reverse primer PT1289
SEQ ID NO: 35 shows the nucleotide sequence of forward primer PT1348
SEQ ID NO: 36 shows the nucleotide sequence of reverse primer PT1348
SEQ ID NO: 37 shows the nucleotide sequence of forward primer PT1426n
SEQ ID NO: 38 shows the nucleotide sequence of reverse primer PT1426n
SEQ ID NO: 39 shows the nucleotide sequence of forward primer PT1436
SEQ ID NO: 40 shows the nucleotide sequence of reverse primer PT1436
SEQ ID NO: 41 shows the nucleotide sequence of forward primer PT1445
SEQ ID NO: 42 shows the nucleotide sequence of reverse primer PT1445
SEQ ID NO: 43 shows the nucleotide sequence of forward primer PT1454
SEQ ID NO: 44 shows the nucleotide sequence of reverse primer PT1454
SEQ ID NO: 45 shows the nucleotide sequence of forward primer PT20335
SEQ ID NO: 46 shows the nucleotide sequence of reverse primer PT20335
SEQ ID NO: 47 shows the nucleotide sequence of forward primer PT50182
SEQ ID NO: 48 shows the nucleotide sequence of reverse primer PT50182
SEQ ID NO: 49 shows the nucleotide sequence of forward primer PT50529
SEQ ID NO: 50 shows the nucleotide sequence of reverse primer PT50529
SEQ ID NO: 51 shows the nucleotide sequence of forward primer PT50539
SEQ ID NO: 52 shows the nucleotide sequence of reverse primer PT50539
SEQ ID NO: 53 shows the nucleotide sequence of forward primer PT50936
SEQ ID NO: 54 shows the nucleotide sequence of reverse primer PT50936
SEQ ID NO: 55 shows the nucleotide sequence of forward primer PT50943
SEQ ID NO: 56 shows the nucleotide sequence of reverse primer PT50943
SEQ ID NO: 57 shows the nucleotide sequence of forward primer PT51050
SEQ ID NO: 58 shows the nucleotide sequence of reverse primer PT51050
SEQ ID NO: 59 shows the nucleotide sequence of forward primer PT51063
SEQ ID NO: 60 shows the nucleotide sequence of reverse primer PT51063
SEQ ID NO: 61 shows the nucleotide sequence of forward primer PT51191
SEQ ID NO: 62 shows the nucleotide sequence of reverse primer PT51191
SEQ ID NO: 63 shows the nucleotide sequence of forward primer PT51214
SEQ ID NO: 64 shows the nucleotide sequence of reverse primer PT51214
SEQ ID NO: 65 shows the nucleotide sequence of forward primer PT51878
SEQ ID NO: 66 shows the nucleotide sequence of reverse primer PT51878
SEQ ID NO: 67 shows the nucleotide sequence of forward primer PT52002
SEQ ID NO: 68 shows the nucleotide sequence of reverse primer PT52002
SEQ ID NO: 69 shows the nucleotide sequence of forward primer PT52378
SEQ ID NO: 70 shows the nucleotide sequence of reverse primer PT52378
SEQ ID NO: 71 shows the nucleotide sequence of forward primer PT52641
SEQ ID NO: 72 shows the nucleotide sequence of reverse primer PT52641
SEQ ID NO: 73 shows the nucleotide sequence of forward primer PT52718
SEQ ID NO: 74 shows the nucleotide sequence of reverse primer PT52718
SEQ ID NO: 75 shows the nucleotide sequence of forward primer PT52722
SEQ ID NO: 76 shows the nucleotide sequence of reverse primer PT52722
SEQ ID NO: 77 shows the nucleotide sequence of forward primer PT52927
SEQ ID NO: 78 shows the nucleotide sequence of reverse primer PT52927
SEQ ID NO: 79 shows the nucleotide sequence of forward primer PT52997
SEQ ID NO: 80 shows the nucleotide sequence of reverse primer PT52997
SEQ ID NO: 81 shows the nucleotide sequence of forward primer PT53005
SEQ ID NO: 82 shows the nucleotide sequence of reverse primer PT53005
SEQ ID NO: 83 shows the nucleotide sequence of forward primer PT54346
SEQ ID NO: 84 shows the nucleotide sequence of reverse primer PT54346
SEQ ID NO: 85 shows the nucleotide sequence of forward primer PT54551
SEQ ID NO: 86 shows the nucleotide sequence of reverse primer PT54551
SEQ ID NO: 87 shows the nucleotide sequence of forward primer PT54630
SEQ ID NO: 88 shows the nucleotide sequence of reverse primer PT54630
SEQ ID NO: 89 shows the nucleotide sequence of forward primer PT54746
SEQ ID NO: 90 shows the nucleotide sequence of reverse primer PT54746
SEQ ID NO: 91 shows the nucleotide sequence of forward primer PT54871
SEQ ID NO: 92 shows the nucleotide sequence of reverse primer PT54871
SEQ ID NO: 93 shows the nucleotide sequence of forward primer PT54893
SEQ ID NO: 94 shows the nucleotide sequence of reverse primer PT54893
SEQ ID NO: 95 shows the nucleotide sequence of forward primer PT50392
SEQ ID NO: 96 shows the nucleotide sequence of reverse primer PT50392
SEQ ID NO: 97 shows the nucleotide sequence of forward primer PT52856
SEQ ID NO: 98 shows the nucleotide sequence of reverse primer PT52856
SEQ ID NO: 99 shows the nucleotide sequence of forward primer PT54452
SEQ ID NO: 100 shows the nucleotide sequence of reverse primer PT54452
SEQ ID NO: 101 shows the nucleotide sequence of forward primer PT54693
SEQ ID NO: 102 shows the nucleotide sequence of reverse primer PT54693
SEQ ID NO: 103 shows the nucleotide sequence of forward primer PT54789
SEQ ID NO: 104 shows the nucleotide sequence of reverse primer PT54789
The invention will now be described in more detail. In this more detailed description, simple sequence repeat markers (SSR markers) are used as an example. However, this is to illustrate the invention and is in no way intended to be limiting.
Primers associated with an SSR marker are used to amplify a DNA sample by PCR, which leads to several amplicon sizes, the "alleles". The set of alleles for a given DNA preparation is called the genotype of the sample. For example, the results of such amplification on one sample are of the form g_i = a_1/a_2, where each a_j is an integer depending on the number of microsatellite repeats between the two flanking primers. The number of alleles depends on the uniqueness of the locus associated with the primers and on the ploidy type (diploid, tetraploid, amphidiploid) of the organism from which the DNA is extracted. From a data analysis point of view, these data are neither continuous, nor nominal (even if they could be considered as such), nor ordinal. The set of genotypes is partially ordered by set inclusion: [Figure imgf000010_0001] is contained in [Figure imgf000010_0002]. Two genotypes are associated if the intersection [Figure imgf000010_0003] is non-empty (that is, they share a common allele).
The SSR data are then used to estimate the degree of polymorphism between two samples by computing a "genetic distance" between them. For example, the Jaccard distance or the Nei-Li distance is used. Given two samples S1, S2 on which m SSR markers are amplified, leading to m genotypes for the first sample [Figure imgf000011_0005] and m genotypes for the second sample [Figure imgf000011_0003], where [Figure imgf000011_0004] is seen as the amplicon set, the following quantities can be computed:
1) The Jaccard coefficient dissimilarity between S1 and S2 is defined as
Figure imgf000011_0001
where Δ denotes the symmetric difference of the two sets.
2) The Nei-Li genetic distance between S1 and S2 is defined as
Figure imgf000011_0002
where Δ denotes the symmetric difference of the two sets and # the set cardinality.
Therefore, a data set of samples on which m SSR markers are amplified leads to a dissimilarity matrix (dissimilarity = 1 - similarity) whose entries are the estimated genetic distances between pairs of samples; this matrix will be denoted by D in the following. The purpose here is not to estimate the evolutionary distances between the varieties precisely but to exploit the polymorphism encoded in the SSR data.
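The construction of D can be sketched as follows. This is a minimal illustration, not the formulas of the original figures: it assumes each genotype is stored as a set of allele sizes, uses the textbook per-marker forms #(A Δ B) / #(A ∪ B) for Jaccard and #(A Δ B) / (#A + #B) for Nei-Li, and assumes that the per-marker values are simply averaged over the m markers. All function names and the toy allele sizes are hypothetical.

```python
from itertools import combinations


def jaccard_dissimilarity(g1, g2):
    """Per-marker Jaccard dissimilarity #(A Δ B) / #(A ∪ B) of two allele sets."""
    union = g1 | g2
    if not union:
        return 0.0  # assumption: two empty genotypes are treated as identical
    return len(g1 ^ g2) / len(union)


def nei_li_dissimilarity(g1, g2):
    """Per-marker Nei-Li dissimilarity #(A Δ B) / (#A + #B) of two allele sets."""
    total = len(g1) + len(g2)
    if total == 0:
        return 0.0  # assumption: two empty genotypes are treated as identical
    return len(g1 ^ g2) / total


def sample_dissimilarity(s1, s2, per_marker=jaccard_dissimilarity):
    """Average the per-marker dissimilarity over the m markers (an assumption)."""
    return sum(per_marker(a, b) for a, b in zip(s1, s2)) / len(s1)


def dissimilarity_matrix(samples, per_marker=jaccard_dissimilarity):
    """Symmetric matrix D with D[i][j] = dissimilarity of samples i and j."""
    n = len(samples)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = sample_dissimilarity(samples[i], samples[j], per_marker)
    return D


# Three toy samples genotyped on m = 2 markers (allele sizes are made up).
samples = [
    [frozenset({120, 124}), frozenset({200})],
    [frozenset({120}), frozenset({200})],
    [frozenset({130}), frozenset({204, 208})],
]
D = dissimilarity_matrix(samples)
```

The first two samples share alleles at both markers and end up close to each other, while the third shares none and sits at the maximum dissimilarity from both.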
The basic concept of the kernel method is to model a classifier in a feature space (which will be a Hilbert space) based only on a similarity matrix, as long as this similarity is positive definite. Indeed, if the measure of similarity between the samples is a positive definite kernel, then classifiers can be learned in the reproducing kernel Hilbert space associated with it. However, the Jaccard and Nei-Li similarities (one minus the dissimilarity previously defined) are not positive definite.
One principled way of converting a similarity measure to a valid kernel is called the empirical kernel map. If s: X × X → R denotes the similarity, then it consists in choosing objects τ_1, τ_2, ... ∈ X called templates and then defining a kernel by:
Figure imgf000012_0001
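For orientation, the empirical kernel map is commonly written as below. This is the textbook form and is given here as an assumption; the formula in the original figure may differ, for example in normalization. The symbol q for the number of templates and the feature map Φ are notation introduced for this sketch.

```latex
\Phi(x) = \bigl(s(x,\tau_1), \dots, s(x,\tau_q)\bigr) \in \mathbb{R}^{q},
\qquad
k(x,y) = \langle \Phi(x), \Phi(y) \rangle
       = \sum_{i=1}^{q} s(x,\tau_i)\, s(y,\tau_i).
```

Because k is an inner product of explicit feature vectors, it is positive semi-definite whatever the similarity s is, which is what makes the construction usable with the Jaccard or Nei-Li similarity.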
Thus, according to the invention, the prototype kernel is defined by choosing templates for each class C as a fixed proportion p of the class observations. Those will be the prototypes resulting from the K-medoids algorithm applied to each class, with the number of groups being equal to the integer part of #C · p. Once a valid kernel is defined, further processing methods are preferably applied. For example, Kernel-LDA (Linear Discriminant Analysis in the feature space) or Kernel-PCA-LDA (Kernel-Principal Component Analysis followed by Linear Discriminant Analysis) is preferably applied on the computed prototype kernel for obtaining a probabilistic model for sample prediction and identification.
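The template choice and the resulting kernel can be sketched as follows. This is an illustrative sketch under stated assumptions: the K-medoids routine is a simple alternating variant with a deterministic start (the first k members of the class), the kernel uses the standard empirical-kernel-map sum over templates, and the names `k_medoids`, `choose_templates` and `prototype_kernel` as well as the toy matrix are hypothetical.

```python
def k_medoids(D, members, k, max_iter=100):
    """Alternating k-medoids on a precomputed dissimilarity matrix D,
    restricted to the sample indices in `members`."""
    medoids = list(members[:k])  # deterministic start (assumption)
    for _ in range(max_iter):
        # Assignment step: attach every member to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for x in members:
            nearest = min(medoids, key=lambda m: D[x][m])
            clusters[nearest].append(x)
        # Update step: the new medoid minimizes the summed dissimilarity
        # within its cluster; an empty cluster keeps its old medoid.
        new_medoids = [
            min(pts, key=lambda c: sum(D[c][x] for x in pts)) if pts else m
            for m, pts in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


def choose_templates(D, labels, p):
    """Per class C, keep int(#C * p) medoids as templates."""
    templates = []
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        k = int(len(members) * p)  # integer part of #C * p
        if k > 0:
            templates.extend(k_medoids(D, members, k))
    return templates


def prototype_kernel(S, templates):
    """K[i][j] = sum over templates t of S[i][t] * S[j][t]."""
    n = len(S)
    return [
        [sum(S[i][t] * S[j][t] for t in templates) for j in range(n)]
        for i in range(n)
    ]


# Toy dissimilarities for 3 "Burley" and 2 "Oriental" samples (made-up values).
D = [
    [0.0, 0.2, 0.4, 0.8, 0.8],
    [0.2, 0.0, 0.2, 0.8, 0.8],
    [0.4, 0.2, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.8, 0.0, 0.1],
    [0.8, 0.8, 0.8, 0.1, 0.0],
]
labels = ["Burley"] * 3 + ["Oriental"] * 2
S = [[1.0 - d for d in row] for row in D]  # similarity = 1 - dissimilarity
templates = choose_templates(D, labels, p=0.5)
K = prototype_kernel(S, templates)
```

With p = 0.5, each class contributes one template here: the central Burley sample (index 1) and the first Oriental sample (index 3). The matrix K can then be passed to any kernel method that accepts a precomputed kernel.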
In general, in order to reduce cost, it is desirable to select a subset of markers (for example containing 5 to 20 markers only) from the hundreds of markers available. According to the invention, selection of a marker subset (a marker kit) is performed as follows.
First, the set of markers that show the biggest polymorphism between the groups to discriminate and the lowest polymorphism within the groups is chosen. A score is then computed for each marker, which score represents the ability of a given marker to discriminate between the groups.
In the next step, a redundancy score is computed, in order to assess if the polymorphism contained in a marker A is similar to the polymorphism of a second marker B. If this was the case, one marker is preferably dropped in favor of another one exhibiting a different polymorphism.
The measure or extent of association between the marker and the group that is used is the Asymmetric Uncertainty Coefficient which reflects the dependency of the marker and the group to be discriminated. The redundancy between two markers will be quantified by the Uncertainty Coefficient.
If X is a finite variable then its entropy is given by
Figure imgf000012_0002
Figure imgf000012_0003
The entropy of the conjoint distribution is
Figure imgf000012_0004
The symmetric uncertainty coefficient is defined by
Figure imgf000013_0001
The asymptotic variance is
Figure imgf000013_0002
U(X,Y)=0 characterizes independence. Moreover, in the case of complete dependence and only in this case U(X,Y)=1.
The asymmetric uncertainty coefficient is defined by:
Figure imgf000013_0003
Its asymptotic variance is given by:
Figure imgf000013_0004
To estimate these coefficients, the estimates of all allele frequencies p_{i.}, p_{.j} and p_{ij} given by
Figure imgf000013_0005
will be used.
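For reference, the standard textbook definitions of these quantities are given below. They are stated as an assumption about what the original figures show, written for two nominal variables X and Y with joint frequencies p_{ij} and marginals p_{i.}, p_{.j}; the count notation n_{ij} and n is introduced here. The asymptotic variances are not reconstructed.

```latex
H(X) = -\sum_{i} p_{i\cdot} \log p_{i\cdot}, \qquad
H(Y) = -\sum_{j} p_{\cdot j} \log p_{\cdot j}, \qquad
H(X,Y) = -\sum_{i,j} p_{ij} \log p_{ij}

U(X,Y) = \frac{2\,\bigl[H(X) + H(Y) - H(X,Y)\bigr]}{H(X) + H(Y)}, \qquad
U(X \mid Y) = \frac{H(X) + H(Y) - H(X,Y)}{H(X)}

\hat{p}_{ij} = \frac{n_{ij}}{n}, \qquad
\hat{p}_{i\cdot} = \sum_{j} \hat{p}_{ij}, \qquad
\hat{p}_{\cdot j} = \sum_{i} \hat{p}_{ij}
```

Under these definitions U(X,Y) = 0 exactly when X and Y are independent and U(X,Y) = 1 exactly under complete dependence, in agreement with the properties stated above.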
The value U(Group | marker_i) = p_i for a given marker is called the discrimination power of the marker for the given group classification, and U(marker_i, marker_j) = δ_{i,j} will be referred to as the redundancy between the markers i and j.
On the basis of these notions, the kit selection is preferably performed as follows. First, discrimination powers p_1, ..., p_m of the markers, a maximum size k of the desired marker kit, and thresholds on the discriminating power (t_i), i = 1, ..., m, are provided. The set of discrimination powers is then ordered: p_(1) ≥ ... ≥ p_(m).
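A minimal sketch of how the discrimination power and the redundancy could be estimated from data is shown below. It assumes the standard entropy-based definitions of the uncertainty coefficients, treats each genotype as a nominal label (here a string), and uses plug-in frequency estimates; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def entropy(values):
    """Plug-in entropy of a nominal variable from observed frequencies."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """H(X) + H(Y) - H(X, Y), with the joint built from paired observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def asymmetric_uncertainty(x, y):
    """U(X|Y): share of the entropy of X that is explained by Y."""
    hx = entropy(x)
    return mutual_information(x, y) / hx if hx > 0 else 0.0


def symmetric_uncertainty(x, y):
    """U(X,Y) = 2 [H(X) + H(Y) - H(X,Y)] / [H(X) + H(Y)]."""
    denom = entropy(x) + entropy(y)
    return 2 * mutual_information(x, y) / denom if denom > 0 else 0.0


# Four toy samples: marker_a separates the two types, marker_b does not.
groups = ["Burley", "Burley", "Oriental", "Oriental"]
marker_a = ["120/124", "120/124", "130", "130"]
marker_b = ["200", "204", "200", "204"]

power_a = asymmetric_uncertainty(groups, marker_a)      # discrimination power
power_b = asymmetric_uncertainty(groups, marker_b)
redundancy_ab = symmetric_uncertainty(marker_a, marker_b)
```

In this toy data `marker_a` determines the group completely, so its discrimination power is at the maximum, while `marker_b` is independent of both the group and `marker_a`.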
Then, the following algorithm is preferably run:
Figure imgf000014_0001
The thresholds can be chosen according to the asymptotic variance of the estimated uncertainty coefficients. For the asymmetric uncertainty, the threshold is set to a constant c1 times the asymptotic standard deviation. Usually c1 will be 10. The uncertainty coefficient threshold between two markers i, j is set to c2 · Std(U(marker_i, marker_j)), where Std denotes the asymptotic standard deviation of the quantity. The choice of the constants c1, c2 depends on user requirements. The bigger c2 is, the smaller the kit size will be in general. Setting c2 higher (which means that markers are discarded more easily) can improve the performance of the classifier. In the example shown below, c1 = 10 and c2 = 12.
If the number of selected markers leads to a very good classifier, a forward selection (respectively backward selection, respectively exhaustive search) can preferably be applied in order to select a smaller kit (if needed). The cheapest approach to decrease the kit size is of course to choose the first k markers proposed by the method of the invention.
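Since the algorithm itself is only available as a figure, the following is a hedged sketch of one plausible greedy reading of the description, not the algorithm of the invention. It assumes that markers are visited in descending order of discrimination power, that a marker is skipped when its power falls below its threshold or when its redundancy with an already selected marker exceeds the pairwise threshold, and that selection stops at k markers. How the thresholds are derived from c1 and c2 is left to the caller; the function name and all values are hypothetical.

```python
def select_kit(powers, redundancy, k, power_thresholds, redundancy_thresholds):
    """Greedy selection of at most k markers (assumed rule, see lead-in)."""
    # Visit markers from the most to the least discriminating.
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    kit = []
    for i in order:
        if len(kit) == k:
            break
        # Skip markers whose discrimination power is below their threshold.
        if powers[i] < power_thresholds[i]:
            continue
        # Skip markers that are too redundant with one already in the kit.
        if any(redundancy[i][j] > redundancy_thresholds[i][j] for j in kit):
            continue
        kit.append(i)
    return kit


# Toy input for four markers (all numbers are made up).
powers = [0.90, 0.85, 0.40, 0.05]
redundancy = [
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
power_thresholds = [0.1] * 4
redundancy_thresholds = [[0.5] * 4 for _ in range(4)]

kit = select_kit(powers, redundancy, 3, power_thresholds, redundancy_thresholds)
```

In this toy run marker 1 is dropped because it largely duplicates marker 0, and marker 3 is dropped for lack of discrimination power, so the kit contains fewer than the allowed k = 3 markers.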
The method for selecting a kit according to the invention is advantageous as it allows an economical and efficient prediction model for discrimination to be built. The method is independent of the supervised method chosen in the modeling process. Moreover, polymorphism between the (groups of) samples is very well encoded in the prototype kernel and, when combined with Kernel-LDA or Kernel-PCA-LDA, leads to satisfactory prediction models.
The selection method of the invention performs well and, when preferably combined with Kernel-LDA or Kernel-PCA-LDA, leads to correct classification rates, as shown below by means of examples. Both alternatives lead to good classification error rates, with a slight advantage for Kernel-PCA-LDA in the example. Another advantage is that the algorithm is fast (exhaustive searches are generally unfeasible or very time consuming), which is useful for coming up with a few markers that are feasible for a given task.
In one embodiment of the invention, a genotyping kit is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, particularly SSR-based markers, which are representative of the tobacco genome and may be used to perform a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting).
By applying the methods of the invention, a set of non-redundant molecular markers is provided, which has a minimal number of markers, and which have been selected as being suitable for a variety of uses as mentioned above and which are maximally required for accomplishing defined tasks including, but not limited to tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting). Accordingly, not all the markers comprised in the minimal set of markers may be needed for accomplishing a certain task. Depending on the specific genotyping problem to be addressed, one of skill in the art can by further experimentation further reduce the number of markers that are used in the genotyping test. For example, the minimal number of markers used for authentication of a known type of tobacco material may be lower than the set of markers used for distinguishing several varieties. Accordingly, it is contemplated a subset of markers may be selected for use in genotyping from the minimal set of non-redundant markers as identified by the pairs of primers (SEQ ID NO: 1-104) provided herein. In a less preferred embodiment, it is also contemplated other molecular markers can be used in conjunction with one or more non-redundant molecular markers as identified by the pairs of primers (SEQ ID NO: 1-104) provided herein.
In particular, the invention relates to a genotyping kit comprising, consisting essentially of or consisting of a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pair 1 comprising a forward primer of SEQ ID NO: 1 and a reverse primer of SEQ ID NO: 2; primer pair 2 comprising a forward primer of SEQ ID NO: 3 and a reverse primer of SEQ ID NO: 4; primer pair 3 comprising a forward primer of SEQ ID NO: 5 and a reverse primer of SEQ ID NO: 6; primer pair 4 comprising a forward primer of SEQ ID NO: 7 and a reverse primer of SEQ ID NO: 8; primer pair 5 comprising a forward primer of SEQ ID NO: 9 and a reverse primer of SEQ ID NO: 10; primer pair 6 comprising a forward primer of SEQ ID NO: 11 and a reverse primer of SEQ ID NO: 12; primer pair 7 comprising a forward primer of SEQ ID NO: 13 and a reverse primer of SEQ ID NO: 14; primer pair 8 comprising a forward primer of SEQ ID NO: 15 and a reverse primer of SEQ ID NO: 16; primer pair 9 comprising a forward primer of SEQ ID NO: 17 and a reverse primer of SEQ ID NO: 18; primer pair 10 comprising a forward primer of SEQ ID NO: 19 and a reverse primer of SEQ ID NO: 20; primer pair 11 comprising a forward primer of SEQ ID NO: 21 and a reverse primer of SEQ ID NO: 22; primer pair 12 comprising a forward primer of SEQ ID NO: 23 and a reverse primer of SEQ ID NO: 24; primer pair 13 comprising a forward primer of SEQ ID NO: 25 and a reverse primer of SEQ ID NO: 26; primer pair 14 comprising a forward primer of SEQ ID NO: 27 and a reverse primer of SEQ ID NO: 28; primer pair 15 comprising a forward primer of SEQ ID NO: 29 and a reverse primer of
SEQ ID NO: 30; primer pair 16 comprising a forward primer of SEQ ID NO: 31 and a reverse primer of SEQ ID NO: 32; primer pair 17 comprising a forward primer of SEQ ID NO: 33 and a reverse primer of SEQ ID NO: 34; primer pair 18 comprising a forward primer of SEQ ID NO: 35 and a reverse primer of SEQ ID NO: 36; primer pair 19 comprising a forward primer of SEQ ID NO: 37 and a reverse primer of SEQ ID NO: 38; primer pair 20 comprising a forward primer of SEQ ID NO: 39 and a reverse primer of SEQ ID NO: 40; primer pair 21 comprising a forward primer of SEQ ID NO: 41 and a reverse primer of SEQ ID NO: 42; primer pair 22 comprising a forward primer of SEQ ID NO: 43 and a reverse primer of SEQ ID NO: 44; primer pair 23 comprising a forward primer of SEQ ID NO: 45 and a reverse primer of SEQ ID NO: 46; primer pair 24 comprising a forward primer of SEQ ID NO: 47 and a reverse primer of SEQ ID NO: 48; primer pair 25 comprising a forward primer of SEQ ID NO: 49 and a reverse primer of SEQ ID NO: 50; primer pair 26 comprising a forward primer of SEQ ID NO: 51 and a reverse primer of SEQ ID NO: 52; primer pair 27 comprising a forward primer of SEQ ID NO: 53 and a reverse primer of SEQ ID NO: 54; primer pair 28 comprising a forward primer of SEQ ID NO: 55 and a reverse primer of SEQ ID NO: 56; primer pair 29 comprising a forward primer of SEQ ID NO: 57 and a reverse primer of SEQ ID NO: 58; primer pair 30 comprising a forward primer of SEQ ID NO: 59 and a reverse primer of SEQ ID NO: 60; primer pair 31 comprising a forward primer of SEQ ID NO: 61 and a reverse primer of SEQ ID NO: 62; primer pair 32 comprising a forward primer of SEQ ID NO: 63 and a reverse primer of SEQ ID NO: 64; primer pair 33 comprising a forward primer of SEQ ID NO: 65 and a reverse primer of SEQ ID NO: 66; primer pair 34 comprising a forward primer of SEQ ID NO: 67 and a reverse primer of SEQ ID NO: 68; primer pair 35 comprising a forward primer of SEQ ID NO: 69 and a reverse primer of SEQ ID NO: 70; 
primer pair 36 comprising a forward primer of SEQ ID NO: 71 and a reverse primer of SEQ ID NO: 72; primer pair 37 comprising a forward primer of SEQ ID NO: 73 and a reverse primer of SEQ ID NO: 74, primer pair 38 comprising a forward primer of SEQ ID NO: 75 and a reverse primer of SEQ ID NO: 76; primer pair 39 comprising a forward primer of SEQ ID NO: 77 and a reverse primer of SEQ ID NO: 78; primer pair 40 comprising a forward primer of SEQ ID NO: 79 and a reverse primer of SEQ ID NO: 80; primer pair 41 comprising a forward primer of SEQ ID NO: 81 and a reverse primer of SEQ ID NO: 82; primer pair 42 comprising a forward primer of SEQ ID NO: 83 and a reverse primer of SEQ ID NO: 84; primer pair 43 comprising a forward primer of SEQ ID NO: 85 and a reverse primer of SEQ ID NO: 86; primer pair 44 comprising a forward primer of SEQ ID NO: 87 and a reverse primer of SEQ ID NO: 88; primer pair 45 comprising a forward primer of SEQ ID NO: 89 and a reverse primer of SEQ ID NO: 90; primer pair 46 comprising a forward primer of SEQ ID NO: 91 and a reverse primer of SEQ ID NO: 92; primer pair 47 comprising a forward primer of SEQ ID NO: 93 and a reverse primer of SEQ ID NO: 94; primer pair 48 comprising a forward primer of SEQ ID NO: 95 and a reverse primer of SEQ ID NO: 96; primer pair 49 comprising a forward primer of SEQ ID NO: 97 and a reverse primer of SEQ ID NO: 98; primer pair 50 comprising a forward primer of SEQ ID NO: 99 and a reverse primer of SEQ ID NO: 100, primer pair 51 comprising a forward primer of SEQ ID NO: 101 and a reverse primer of SEQ ID NO: 102; and primer pair 52 comprising a forward primer of SEQ ID NO: 103 and a reverse primer of SEQ ID NO: 104, including forward and reverse primers that exhibit a nucleotide sequence that has at least 90%, particularly at least 95%, particularly at least 96%, particularly at least 97%, particularly at least 98%, particularly at least 99%, sequence identity to the nucleotide sequences of the primers shown in SEQ ID 
NOs: 1-104.
In one embodiment, a genotyping kit according to the invention is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by at least one of the polynucleotide fragments disclosed herein before, which fragments are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 as shown in Table 1.
In one embodiment, a genotyping kit according to the invention is provided comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, which set or minimal set of non-redundant markers is represented by at least one, particularly at least two, particularly at least 3, particularly at least 4, particularly at least 5, particularly at least 10, particularly at least 20, particularly at least 25, particularly at least 30, particularly at least 40 of the non-redundant markers, particularly all of the polynucleotide fragments recited herein before, which fragments are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52 as shown in Table 1.
The invention also relates to the use of a genotyping kit according to the invention and as described herein before comprising a set of markers, particularly a minimal set of markers, particularly a set or minimal set of non-redundant markers, determined by using the method according to the invention and as described and claimed herein, for performing a given discrimination or classification task including, but not limited to, tobacco variety parentage testing, variety discrimination or classification, type discrimination or classification, quality assurance of tobacco variety material, tracing the source, storage, or route of transportation of a tobacco variety material, and authentication of tobacco variety material (variety fingerprinting). Depending on the genotyping task to be performed, the number and composition of the genotyping kit may change. Based on the disclosure provided herein, the person skilled in the art is now in a position to identify the set of non-redundant markers most suitable for performing the respective genotyping task.

EXAMPLES
In the following example it is explained in detail how the invention can be applied to discriminate tobacco types. In this example, Nicotiana tabacum, a functional diploid, was used.
The dataset used here contained the measurements from the 52 described SSR markers on 234 varieties (or landraces) without replicates, leading to 234 observations. In this example, the objective was to discriminate the tobacco types Burley, Flue Cured and Oriental.
With reference to Figures 1 and 2, it can be seen that the selection method of the invention tends to maximize the between-types polymorphism and minimize the within-types polymorphism. The method of the invention selected 4 SSR markers (the condition on the constant c2 was stringent as only a few markers were wanted; c1 = 10, c2 = 12). Fig. 3 shows the ordered discrimination power of the SSR markers for this particular discrimination task.
In this example, the prototype kernel approach combined with Kernel-PCA-LDA and Kernel-LDA led to excellent results. It should be noted that the proportion of prototypes chosen can slightly affect the error rates. The final model should be chosen keeping in mind that fewer prototypes would certainly lead to more robust models. All the error rates presented are estimated by 5-fold cross-validation.
Figure imgf000019_0001
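The error rates above are stated to be estimated by 5-fold cross-validation. The following Python sketch shows one generic way such an estimate can be obtained; it is not taken from the example itself. The `fit` and `predict` callables stand in for any supervised method (for instance a kernel-based classifier), and all function names, the shuffling scheme and the seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(samples, labels, fit, predict, k=5, seed=0):
    """Return the k-fold cross-validation error rate in percent.

    fit(train_samples, train_labels) must return a model;
    predict(model, sample) must return a predicted label.
    Both are supplied by the caller (hypothetical interface).
    """
    n_errors = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(samples)) if i not in held_out_set]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        # Each observation is predicted exactly once, by a model that never saw it.
        n_errors += sum(1 for i in held_out if predict(model, samples[i]) != labels[i])
    return 100.0 * n_errors / len(samples)
```

With 234 observations and k = 5, this split yields four folds of 47 observations and one of 46.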
The algorithm proposed a kit of size 4. In order to evaluate how the selected 4-tuple of markers performs versus a randomly chosen 4-tuple, 10,000 random selections were performed (with the proportion of prototypes being equal to 1/5). The 5-fold cross-validation results from these simulations are summarized below. They show that, among the 10,000 4-tuples, the one chosen by the algorithm belongs to the 0.9%-best 4-tuples for KPCA-LDA and the 1.5%-best 4-tuples for KLDA.
Summary of cross-validation error rates (in %) from simulations:
Figure imgf000020_0001
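The comparison against random 4-tuples can be sketched as follows. This is a hedged illustration, not the procedure actually used to produce the figures above: the function names are hypothetical, and the error rate of each random kit is assumed to be computed separately (for example by cross-validation) and passed in as a list.

```python
import random


def sample_random_kits(n_markers, kit_size, n_draws, seed=0):
    """Draw n_draws random kits, each a tuple of kit_size distinct marker indices."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(n_markers), kit_size)))
            for _ in range(n_draws)]


def percent_at_least_as_good(selected_error, random_errors):
    """Percentage of random kits whose error rate is no higher than the selected kit's.

    A small value means the selected kit ranks among the best of the random kits.
    """
    n_as_good = sum(1 for error in random_errors if error <= selected_error)
    return 100.0 * n_as_good / len(random_errors)
```

For the setting of the example, the kits would be drawn as `sample_random_kits(52, 4, 10000)`, and a return value of 0.9 from `percent_at_least_as_good` would correspond to the selected kit belonging to the 0.9%-best 4-tuples.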
Therefore, the method of the invention is very efficient in selecting SSR markers that perform well. The prototype kernel seems to encode the desired polymorphism information and to lead to good prediction models.
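The simulations above use a prototype proportion of 1/5, that is, a fixed share of the observations of each group serves as templates for the prototype kernel. A minimal Python sketch of such a per-group draw is given here. It assumes that templates are picked at random within each group and that the count is rounded up with at least one template per group; neither detail is stated in this passage, and the function name is hypothetical.

```python
import math
import random
from collections import defaultdict


def choose_templates(labels, proportion, seed=0):
    """Pick a fixed proportion of observation indices from each group as templates.

    Assumptions (not from the source): random choice within a group,
    count rounded up, and at least one template per group.
    """
    rng = random.Random(seed)
    members_by_group = defaultdict(list)
    for index, label in enumerate(labels):
        members_by_group[label].append(index)
    templates = {}
    for label, members in members_by_group.items():
        n_templates = max(1, math.ceil(proportion * len(members)))
        templates[label] = sorted(rng.sample(members, n_templates))
    return templates
```

Drawing the same proportion from every group keeps the template set balanced relative to group sizes, so a large group such as one tobacco type does not crowd out a smaller one.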
Finally, skipping the marker with the third highest discrimination power (see Fig. 3) improves the prediction (by up to 1.1%), which supports the use of the redundancy between markers.
Figure imgf000021_0001
Figure imgf000022_0001

Claims

1. A method of molecular marker selection to identify a minimal set of markers from a plurality of markers for identifying an unknown sample or for differentiating between or classifying a group of unknown samples, or a combination thereof, comprising the steps of:
a) computing a discrimination score for each marker, which score represents the ability of the given marker to discriminate between groups of samples;
b) computing a redundancy score among the markers; and
c) selecting a minimal set of non-redundant markers on the basis of the computed discrimination score and redundancy score;
wherein at least one, at least two or all of steps a), b) and c) are computer-implemented.
2. The method of claim 1 comprising the additional step (d), comprising providing a genotyping kit comprising said minimal set of non-redundant markers selected in step (c).
3. The method of claim 1 or 2, wherein step a) provides a discriminating power p1, ..., pm for each of the markers.
4. The method of claim 3, wherein step b) further comprises the steps of
b1) providing a maximum size K for the kit to be selected;
b2) providing thresholds on the discriminating power;
b3) ordering the discriminating powers pi in descending order;
b4) computing the redundancy and associated redundancy thresholds for each pair of markers.
5. The method of claim 4, wherein
i) in step b2) one threshold is provided for each discriminating power value; and/or
ii) in step b4) one threshold is provided for each redundancy value between pairs of markers; and/or
iii) step c) comprises selecting the k markers forming the kit depending on the computed discrimination powers, the computed redundancy and on the associated thresholds for the discrimination powers and the redundancies.
6. The method of claim 5, further comprising the step of
i) computing a prototype kernel; and/or
ii) calculating a genetic similarity measure for each pair of samples; and/or
iii) choosing templates for each group Ci as a fixed proportion p of the observations of that group.
7. The method of claim 6, wherein the kernel is defined by
Figure imgf000024_0001
wherein τ1, ..., τn ∈ X are the chosen templates and s denotes the similarity measure.
8. The method of claim 7, further comprising applying a Kernel-Linear Discriminant Analysis or a Kernel-Principal Component Analysis followed by Linear Discriminant Analysis on the computed prototype kernel for obtaining a probabilistic model for prediction and identification.
9. The method of claim 8, comprising the further step of
i) applying the selected set of markers and obtained model on at least one unknown sample; and/or
ii) calculating genetic similarity for the unknown sample to be identified to the known sample used for marker selection; and/or
iii) computing a prototype kernel for the unknown sample on the basis of the k selected markers, and the chosen templates.
10. The method of claim 9, further comprising applying a Kernel-Linear Discriminant Analysis or a Kernel-Principal Component Analysis followed by Linear Discriminant Analysis on the computed prototype kernel for obtaining a probability output for the unknown sample to be identified.
11. The method of claim 2, wherein said genotyping kit comprises a minimal set of non-redundant markers selected from a plurality of markers which non-redundant markers are represented by a plurality of polynucleotide fragments that are obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pair 1 comprising a forward primer of SEQ ID NO: 1 and a reverse primer of SEQ ID NO: 2; primer pair 2 comprising a forward primer of SEQ ID NO: 3 and a reverse primer of SEQ ID NO: 4; primer pair 3 comprising a forward primer of SEQ ID NO: 5 and a reverse primer of SEQ ID NO: 6; primer pair 4 comprising a forward primer of SEQ ID NO: 7 and a reverse primer of SEQ ID NO: 8; primer pair 5 comprising a forward primer of SEQ ID NO: 9 and a reverse primer of SEQ ID NO: 10; primer pair 6 comprising a forward primer of SEQ ID NO: 11 and a reverse primer of SEQ ID NO: 12; primer pair 7 comprising a forward primer of SEQ ID NO: 13 and a reverse primer of SEQ ID NO: 14; primer pair 8 comprising a forward primer of SEQ ID NO: 15 and a reverse primer of SEQ ID NO: 16; primer pair 9 comprising a forward primer of SEQ ID NO: 17 and a reverse primer of SEQ ID NO: 18; primer pair 10 comprising a forward primer of SEQ ID NO: 19 and a reverse primer of SEQ ID NO: 20; primer pair 11 comprising a forward primer of SEQ ID NO: 21 and a reverse primer of SEQ ID NO: 22; primer pair 12 comprising a forward primer of SEQ ID NO: 23 and a reverse primer of SEQ ID NO: 24, primer pair 13 comprising a forward primer of SEQ ID NO: 25 and a reverse primer of SEQ
ID NO: 26; primer pair 14 comprising a forward primer of SEQ ID NO: 27 and a reverse primer of SEQ ID NO: 28; primer pair 15 comprising a forward primer of SEQ ID NO: 29 and a reverse primer of SEQ ID NO: 30; primer pair 16 comprising a forward primer of SEQ ID NO: 31 and a reverse primer of SEQ ID NO: 32; primer pair 17 comprising a forward primer of SEQ ID NO: 33 and a reverse primer of SEQ ID NO: 34; primer pair 18 comprising a forward primer of SEQ ID NO: 35 and a reverse primer of SEQ ID NO: 36; primer pair 19 comprising a forward primer of SEQ ID NO: 37 and a reverse primer of SEQ ID NO: 38; primer pair 20 comprising a forward primer of SEQ ID NO: 39 and a reverse primer of SEQ ID NO: 40; primer pair 21 comprising a forward primer of SEQ ID NO: 41 and a reverse primer of SEQ ID NO: 42; primer pair 22 comprising a forward primer of SEQ ID NO: 43 and a reverse primer of SEQ ID NO: 44; primer pair 23 comprising a forward primer of SEQ ID NO: 45 and a reverse primer of SEQ ID NO: 46; primer pair 24 comprising a forward primer of SEQ ID NO: 47 and a reverse primer of SEQ ID NO: 48; primer pair 25 comprising a forward primer of SEQ ID NO: 49 and a reverse primer of SEQ ID NO: 50, primer pair 26 comprising a forward primer of SEQ ID NO: 51 and a reverse primer of SEQ ID NO: 52; primer pair 27 comprising a forward primer of SEQ ID NO: 53 and a reverse primer of SEQ ID NO: 54; primer pair 28 comprising a forward primer of SEQ ID NO: 55 and a reverse primer of SEQ ID NO: 56; primer pair 29 comprising a forward primer of SEQ ID NO: 57 and a reverse primer of SEQ ID NO: 58; primer pair 30 comprising a forward primer of SEQ ID NO: 59 and a reverse primer of SEQ ID NO: 60; primer pair 31 comprising a forward primer of SEQ ID NO: 61 and a reverse primer of SEQ ID NO: 62; primer pair 32 comprising a forward primer of SEQ ID NO: 63 and a reverse primer of SEQ ID NO: 64; primer pair 33 comprising a forward primer of SEQ ID NO: 65 and a reverse primer of SEQ ID NO: 66; primer 
pair 34 comprising a forward primer of SEQ ID NO: 67 and a reverse primer of SEQ ID NO: 68; primer pair 35 comprising a forward primer of SEQ ID NO: 69 and a reverse primer of SEQ ID NO: 70; primer pair 36 comprising a forward primer of SEQ ID NO: 71 and a reverse primer of SEQ ID NO: 72; primer pair 37 comprising a forward primer of SEQ ID NO: 73 and a reverse primer of SEQ ID NO: 74, primer pair 38 comprising a forward primer of SEQ ID NO: 75 and a reverse primer of SEQ ID NO: 76; primer pair 39 comprising a forward primer of SEQ ID NO: 77 and a reverse primer of SEQ ID NO: 78; primer pair 40 comprising a forward primer of SEQ ID NO: 79 and a reverse primer of SEQ ID NO: 80; primer pair 41 comprising a forward primer of SEQ ID NO: 81 and a reverse primer of SEQ ID NO: 82; primer pair 42 comprising a forward primer of SEQ ID NO: 83 and a reverse primer of SEQ ID NO: 84; primer pair 43 comprising a forward primer of SEQ ID NO: 85 and a reverse primer of SEQ ID NO: 86; primer pair 44 comprising a forward primer of SEQ ID NO: 87 and a reverse primer of SEQ ID NO: 88; primer pair 45 comprising a forward primer of SEQ ID NO: 89 and a reverse primer of SEQ ID NO: 90; primer pair 46 comprising a forward primer of SEQ ID NO: 91 and a reverse primer of SEQ ID NO: 92; primer pair 47 comprising a forward primer of SEQ ID NO: 93 and a reverse primer of SEQ ID NO: 94; primer pair 48 comprising a forward primer of SEQ ID NO: 95 and a reverse primer of SEQ ID NO: 96; primer pair 49 comprising a forward primer of SEQ ID NO: 97 and a reverse primer of SEQ ID NO: 98; primer pair 50 comprising a forward primer of SEQ ID NO: 99 and a reverse primer of SEQ ID NO: 100, primer pair 51 comprising a forward primer of SEQ ID NO: 101 and a reverse primer of SEQ ID NO: 102; and primer pair 52 comprising a forward primer of SEQ ID NO: 103 and a reverse primer of SEQ ID NO: 104, including forward and reverse primers that exhibit a nucleotide sequence that has at least 90%, particularly at 
least 95%, particularly at least 96%, particularly at least 97%, particularly at least 98%, particularly at least 99%, sequence identity to the nucleotide sequences of the primers shown in SEQ ID NOs: 1-104.
12. A genotyping kit comprising at least 10, particularly at least 20, particularly at least 25, particularly at least 30, particularly at least 40 of the k non-redundant markers recited in claim 11.
13. A kit according to claim 11 comprising all of the markers recited in claim 11 represented by at least one of the polynucleotide fragments obtainable in a PCR amplification reaction using a pair of primers selected from the group consisting of primer pairs 1 to 52.
14. Use of a kit according to any one of claims 11 to 13 for tobacco variety parentage testing, variety discrimination, type discrimination, quality assurance of tobacco variety material, traceability and authentication testing of tobacco variety material.
15. Use according to claim 14, wherein the tobacco variety material is processed tobacco material.
PCT/EP2011/054624 2010-03-26 2011-03-25 Method of molecular marker selection WO2011117388A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP10158022 2010-03-26
EP10158022.3 2010-03-26

Publications (1)

Publication Number Publication Date
WO2011117388A1 true WO2011117388A1 (en) 2011-09-29

Family

ID=42712657

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/054624 WO2011117388A1 (en) 2010-03-26 2011-03-25 Method of molecular marker selection

Country Status (1)

Country Link
WO (1) WO2011117388A1 (en)

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BLAZADONAKIS M E ET AL: "Polynomial and RBF Kernels as Marker Selection Tools-A Breast Cancer Case Study", MACHINE LEARNING AND APPLICATIONS, 2007. ICMLA 2007. SIXTH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 13 December 2007 (2007-12-13), pages 488 - 493, XP031233758, ISBN: 978-0-7695-3069-7 *
LENG ET AL: "Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data", COMPUTATIONAL BIOLOGY AND CHEMISTRY, ELSEVIER LNKD- DOI:10.1016/J.COMPBIOLCHEM.2008.07.015, vol. 32, no. 6, 1 December 2008 (2008-12-01), pages 417 - 425, XP025583868, ISSN: 1476-9271, [retrieved on 20080716] *
SAEYS YVAN ET AL: "A review of feature selection techniques in bioinformatics", BIOINFORMATICS (OXFORD), vol. 23, no. 19, October 2007 (2007-10-01), pages 2507 - 2517, XP002601323, ISSN: 1367-4803 *
VERSTRAELEN SANDRA ET AL: "Gene profiles of THP-1 macrophages after in vitro exposure to respiratory (non-)sensitizing chemicals: identification of discriminating genetic markers and pathway analysis.", TOXICOLOGY IN VITRO : AN INTERNATIONAL JOURNAL PUBLISHED IN ASSOCIATION WITH BIBRA SEP 2009 LNKD- PUBMED:19527780, vol. 23, no. 6, September 2009 (2009-09-01), pages 1151 - 1162, XP002601325, ISSN: 1879-3177 *
WANG LEI ET AL: "A kernel-induced space selection approach to model selection in KLDA.", IEEE TRANSACTIONS ON NEURAL NETWORKS / A PUBLICATION OF THE IEEE NEURAL NETWORKS COUNCIL DEC 2008 LNKD- PUBMED:19054735, vol. 19, no. 12, December 2008 (2008-12-01), pages 2116 - 2131, XP002601324, ISSN: 1941-0093 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2607494A1 (en) * 2011-12-23 2013-06-26 Philip Morris Products S.A. Biomarkers for lung cancer risk assessment
CN103294896A (en) * 2013-05-09 2013-09-11 国家电网公司 Method for selecting benchmarking photovoltaic components of photovoltaic power station on basis of principal component analysis
CN103294896B (en) * 2013-05-09 2016-08-17 国家电网公司 A kind of photovoltaic plant benchmark photovoltaic component system of selection based on principal component analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11710501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11710501

Country of ref document: EP

Kind code of ref document: A1