EP2449510B1 - Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules - Google Patents

Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules Download PDF

Info

Publication number
EP2449510B1
Authority
EP
European Patent Office
Prior art keywords
algorithm
features
algorithms
feature
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP10728031.5A
Other languages
German (de)
French (fr)
Other versions
EP2449510B2 (en)
EP2449510A1 (en)
Inventor
Daniel Caraviello
Rinkal Patel
Reetal Pai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Corteva Agriscience LLC
Original Assignee
Dow AgroSciences LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed: https://patents.darts-ip.com/?family=42685709&utm_source=google_patent&utm_medium=platform_link&utm_campaign=public_patent_search&patent=EP2449510(B1) ("Global patent litigation dataset" by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.)
Application filed by Dow AgroSciences LLC
Publication of EP2449510A1
Application granted
Publication of EP2449510B1
Publication of EP2449510B2
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • A HUMAN NECESSITIES
    • A01 AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING
    • A01H NEW PLANTS OR NON-TRANSGENIC PROCESSES FOR OBTAINING THEM; PLANT REPRODUCTION BY TISSUE CULTURE TECHNIQUES
    • A01H1/00 Processes for modifying genotypes; Plants characterised by associated natural traits
    • A01H1/04 Processes of selection involving genotypic or phenotypic markers; Methods of using phenotypic markers for selection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the disclosure relates to the use of one or more association rule mining algorithms to mine data sets containing features created from at least one plant or animal-based molecular genetic marker, find association rules and utilize features created from these association rules for classification or prediction.
  • One of the main objectives of plant and animal improvement is to obtain new cultivars that are superior in terms of desirable target features such as yield, grain oil content, disease resistance, and resistance to abiotic stresses.
  • a traditional approach to plant and animal improvement is to select individual plants or animals on the basis of their phenotypes, or the phenotypes of their offspring. The selected individuals can then, for example, be subjected to further testing or become parents of future generations. It is beneficial for some breeding programs to have predictions of performance before phenotypes are generated for a certain individual or when only a few phenotypic records have been obtained for that individual.
  • Some key limitations of methods for plant and animal improvement that rely only on phenotypic selection are the cost and speed of generating such data, and that there is a strong impact of the environment (e.g., temperature, management, soil conditions, day light, irrigation conditions) on the expression of the target features.
  • Some important considerations for a data analysis method for this type of dataset are the ability to mine historical data, robustness to multicollinearity, and the ability to account for interactions between the features included in these datasets (e.g. epistatic effects and genotype-by-environment interactions).
  • the ability to mine historical data avoids the requirement of highly structured data for data analyses.
  • Methods that require highly structured data, from planned experiments, are usually resource intensive in terms of human resources, money, and time.
  • the strong environmental effect on the expression of many of the most important traits in economically important plants and animals requires that such experiments be large, carefully designed, and carefully controlled.
  • the multicollinearity limitation refers to a situation in which two or more features (or feature subsets) are linearly correlated with one another. Multicollinearity may lead to less precise estimation of the impact of a feature (or feature subset) on a target feature and, consequently, to biased predictions.
  • a framework based on mining association rules and using features created from these rules to improve prediction or classification is suitable to address the three considerations mentioned above.
  • Preferred methods for classification or prediction are machine learning methods. Association rules can therefore be used for classification or prediction for one or more target features.
  • the approach described in the present disclosure relies on implementing one or more machine learning-based association rule mining algorithms to mine datasets containing at least one plant or animal molecular genetic marker, create features based on the association rules found, and use these features for classification or prediction of target features.
  • a method for prediction or classification of one or more target features in plants comprising:
  • Embodiments relate only to claimed combinations of features.
  • the term "embodiment” relates to unclaimed combinations of features, said term has to be understood as referring to examples of the present invention.
  • Methods to mine data sets containing features created from at least one plant-based molecular genetic marker to find at least one association rule and to then use features created from these association rules for classification or prediction are disclosed. Some of these methods are suitable for classification or prediction with datasets containing plant and animal features.
  • Steps to mine a data set with at least one feature created from at least one plant-based molecular genetic marker, to find at least one association rule, and utilizing features created from these association rules for classification or prediction for one or more target features include:
  • Described herein is a method of mining a data set with one or more features, wherein the method includes using at least one plant-based molecular marker to find at least one association rule and utilizing features created from these association rules for classification or prediction, the method comprising the steps of: (a) detecting association rules, (b) creating new features based on the findings of step (a) and adding these features to the data set; (c) selecting a subset of features from features in the data set.
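  • The following is a minimal, illustrative sketch of steps (a)-(c), assuming the open-source mlxtend and scikit-learn libraries as stand-ins for the machine learning workbench used in the examples; the column names, thresholds, and choice of classifier are hypothetical and not part of the disclosure.

```python
# Illustrative only: mine association rules from one-hot marker data, turn each
# rule into a new 0/1 feature, and evaluate a classifier on the augmented table.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# (a) detect association rules on a boolean one-hot table
#     (rows = individuals, columns = marker/allele indicators such as "SNP17_A")
def mine_rules(onehot: pd.DataFrame, min_support=0.2, min_conf=0.8) -> pd.DataFrame:
    itemsets = apriori(onehot, min_support=min_support, use_colnames=True)
    return association_rules(itemsets, metric="confidence", min_threshold=min_conf)

# (b) create one new binary feature per rule: 1 when an individual satisfies
#     both the antecedent and the consequent of that rule
def rule_features(onehot: pd.DataFrame, rules: pd.DataFrame) -> pd.DataFrame:
    feats = {}
    for i, row in rules.iterrows():
        items = list(row["antecedents"] | row["consequents"])
        feats[f"rule_{i}"] = onehot[items].all(axis=1).astype(int)
    return pd.DataFrame(feats, index=onehot.index)

# (c) model the target feature on the augmented data set (binary target assumed)
def evaluate(onehot: pd.DataFrame, target: pd.Series) -> float:
    augmented = pd.concat([onehot, rule_features(onehot, mine_rules(onehot))], axis=1)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, augmented, target, cv=5, scoring="roc_auc").mean()
```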
  • association rule mining algorithms are utilized for classification or prediction with one or more machine learning algorithms selected from: feature evaluation algorithms, feature subset selection algorithms, Bayesian networks (see Cheng and Greiner (1999), Comparing Bayesian network classifiers, Proceedings UAI, pp. 101-107), instance-based algorithms, support vector machines (see e.g., Shevade et al. (1999), Improvements to SMO Algorithm for SVM Regression, Technical Report CD-99-16, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore; Smola et al. (1998), A tutorial on Support Vector Regression, NeuroCOLT2 Technical Report Series NC2-TR-1998-030; Scholkopf (1998)), vote algorithm, cost-sensitive classifier, stacking algorithm, classification rules, and decision tree algorithms.
  • association rule mining algorithms include, but are not limited to, the APriori algorithm (see Witten and Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Second Edition), the FP-growth algorithm, association rule mining algorithms that can handle large numbers of features, colossal pattern mining algorithms, direct discriminative pattern mining algorithms, decision trees, rough sets (see Zdzislaw Pawlak (1992), Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Print on Demand), and the Self-Organizing Map (SOM) algorithm.
  • suitable association rule mining algorithms for handling large numbers of features include, but are not limited to, CLOSET+ (see Wang et al. (2003), CLOSET+: Searching for the best strategies for mining frequent closed itemsets, ACM SIGKDD 2003, pp. 236-245), CHARM (see Zaki et al. (2002), CHARM: An efficient algorithm for closed itemset mining, SIAM 2002, pp. 457-473), CARPENTER (see Pan et al. (2003), CARPENTER: Finding closed patterns in long biological datasets, ACM SIGKDD 2003, pp. 637-642), and COBBLER (see Pan et al. (2004), COBBLER: Combining column and row enumeration for closed pattern discovery, SSDBM 2004, pp. 21).
  • suitable algorithms for finding direct discriminative patterns include, but are not limited to, DDPM (see Cheng et al. (2008), Direct Discriminative Pattern Mining for Effective Classification, ICDE 2008, pp. 169-178), HARMONY (see Jiyong et al. (2005), HARMONY: Efficiently Mining the Best Rules for Classification, SIAM 2005, pp. 205-216), RCBT (see Cong et al. (2005), Mining top-K covering rule groups for gene expression data, ACM SIGMOD 2005), CAR (see Kianmehr et al. (2008), CARSVM: A class association rule-based classification framework and its application in gene expression data, Artificial Intelligence in Medicine 2008, pp. 7-25), and PATCLASS (see Cheng et al. (2007), Discriminative Frequent Pattern Analysis for Effective Classification, ICDE 2007, pp. 716-725).
  • suitable algorithms for finding colossal patterns include, but are not limited to, the Pattern Fusion algorithm (see Zhu et al. (2007), Mining Colossal Frequent Patterns by Core Pattern Fusion, ICDE 2007, pp. 706-715).
  • a suitable feature evaluation algorithm is selected from the group of information gain algorithm, Relief algorithm (see e.g., Robnik-Sikonja and Kononenko (2003), Theoretical and empirical analysis of Relief and ReliefF. Machine learning, 53:23-69 ; and Kononenko (1995). On biases in estimating multi-valued attributes. In IJCAI95, pages 1034-1040 ), ReliefF algorithm (see e.g., Kononenko, (1994), Estimating attributes: analysis and extensions of Relief. In: L. De Raedt and F. Bergadano (eds.): Machine learning: ECML-94. 171-182, Springer Verlag .), RReliefF algorithm, symmetrical uncertainty algorithm, gain ratio algorithm, and ranker algorithm.
  • a suitable machine learning algorithm is a feature subset selection algorithm selected from the group of the correlation-based feature selection (CFS) algorithm (see Hall, M. A. (1999), Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, Department of Computer Science, The University of Waikato, New Zealand), and the wrapper algorithm in association with any other machine learning algorithm.
  • These feature subset selection algorithms may be associated with a search method selected from the group of greedy stepwise search algorithm, best first search algorithm, exhaustive search algorithm, race search algorithm, and rank search algorithm.
  • a suitable machine learning algorithm is a Bayesian network algorithm including the naive Bayes algorithm.
  • a suitable machine learning algorithm is an instance-based algorithm selected from the group of instance-based 1 (IB1) algorithm, instance-based k-nearest neighbor (IBK) algorithm, KStar, lazy Bayesian rules (LBR) algorithm, and locally weighted learning (LWL) algorithm.
  • a suitable machine learning algorithm for classification or prediction is a support vector machine algorithm.
  • a suitable machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization (SMO) algorithm.
  • the machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization for regression (SMOReg) algorithm (see e.g., Shevade et al. (1999), Improvements to SMO Algorithm for SVM Regression, Technical Report CD-99-16, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore; Smola & Scholkopf (1998), A tutorial on Support Vector Regression, NeuroCOLT2 Technical Report Series NC2-TR-1998-030).
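  • For illustration, the numeric-prediction step could look like the sketch below; scikit-learn's SVR (libsvm-based) is used here only as a stand-in for the SMO/SMOReg implementations named above, and the marker matrix and trait values are synthetic.

```python
# Support vector regression on synthetic marker data (stand-in for SMOReg).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)          # 50 markers coded 0/1/2
y = 1.5 * X[:, 3] - X[:, 7] + rng.normal(size=200)            # toy continuous trait

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
print(cross_val_score(svr, X, y, cv=5, scoring="r2").mean())  # cross-validated R^2
```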
  • a suitable machine learning algorithm is a self-organizing map (Self-organizing maps, Teuvo Kohonen, Springer).
  • a suitable machine learning algorithm is a decision tree algorithm selected from the group of logistic model tree (LMT) algorithm, alternating decision tree (ADTree) algorithm (see Freund and Mason (1999), The alternating decision tree learning algorithm. Proc. Sixteenth International Conference on machine learning, Bled, Slovenia, pp. 124-133 ), M5P algorithm (see Quinlan (1992), Learning with continuous classes, in Proceedings AI'92, Adams & Sterling (Eds.), World Scientific, pp. 343-348 ; Wang and Witten (1997), Inducing Model Trees for Continuous Classes. 9th European Conference on machine learning, pp.128-137 ), and REPTree algorithm (Witten and Frank, 2005).
  • a target feature is selected from the group of a continuous target feature and a discrete target feature.
  • a discrete target feature may be a binary target feature.
  • At least one plant-based molecular genetic marker is from a plant population and the plant population may be an unstructured plant population.
  • the plant population may include inbred plants or hybrid plants or a combination thereof.
  • a suitable plant population is selected from the group of maize, soybean, sorghum, wheat, sunflower, rice, canola, cotton, and millet.
  • the plant population may include between about 2 and about 100,000 members.
  • the number of molecular genetic markers may range from about 1 to about 1,000,000 markers.
  • the features may include molecular genetic marker data that includes, but is not limited to, one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof.
  • the features may also include one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof, in conjunction with one or more phenotypic measurements, microarray data of expression levels of RNAs including mRNA, micro RNA (miRNA), non-coding RNA (ncRNA), analytical measurements, biochemical measurements, or environmental measurements or a combination thereof as features.
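  • As a small illustration of how such heterogeneous features could be prepared for rule mining, the sketch below one-hot encodes hypothetical marker genotypes and an environmental indicator into boolean items; the column names and allele codings are assumptions.

```python
# Hedged sketch: turn raw marker/genotype and environment columns into the
# boolean "items" that association rule miners expect.
import pandas as pd

raw = pd.DataFrame({
    "SNP17":   ["AA", "AG", "GG", "AG"],        # hypothetical SNP genotypes
    "SSR4":    [112, 118, 112, 124],            # hypothetical SSR allele length (bp)
    "drought": [True, False, True, False],      # environmental indicator
})

# one 0/1 indicator column per observed category, e.g. "SNP17_AA" or "SSR4_112",
# so that each indicator can act as an item in an association rule
onehot = pd.get_dummies(raw.astype(str)).astype(bool)
print(onehot.columns.tolist())
```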
  • a suitable target feature in a plant population includes one or more numerically representable and/or quantifiable phenotypic traits including disease resistance, yield, grain yield, yarn strength, protein composition, protein content, insect resistance, grain moisture content, grain oil content, grain oil quality, drought resistance, root lodging resistance, plant height, ear height, grain protein content, grain amino acid content, grain color, and stalk lodging resistance.
  • a genotype of the sample plant population for one or more molecular genetic markers is experimentally determined by direct DNA sequencing.
  • a method to select inbred lines, select hybrids, rank hybrids, rank hybrids for a certain geography, select the parents of new inbred populations, find segments for introgression into elite inbred lines, or any combination thereof is completed using any combination of the steps (a) - (e) above.
  • detecting association rules may include detecting spatial and temporal associations using self-organizing maps.
  • At least one feature of a model for prediction or classification is the subset of features selected earlier using a feature evaluation algorithm.
  • cross-validation is used to compare algorithms and sets of parameter values.
  • receiver operating characteristic (ROC) curves are used to compare algorithms and sets of parameter values.
  • one or more features are derived mathematically or computationally from other features.
  • the invention relates to a method of mining a data set that includes at least one plant-based molecular genetic marker to find at least one association rule, and utilizing features created from these association rules for classification or prediction for one or more target features, wherein the method includes the steps of:
  • a data set with at least one plant-based molecular genetic marker is used to find at least one association rule and features created from these association rules are used for classification or prediction and selecting at least one plant from the plant population for one or more target features of interest.
  • prior knowledge comprised of preliminary research, quantitative studies of plant genetics, gene networks, sequence analyses, or any combination thereof, is considered.
  • Figure 1: Area under the ROC curve, before and after adding the new features from step (b).
  • Association rule mining algorithms provide the framework and the scalability needed to find relevant interactions on very large datasets.
  • Methods disclosed herein are useful for identifying multi-locus interactions affecting phenotypes. Methods disclosed herein are useful for identifying interactions between molecular genetic markers, haplotypes and environmental factors. New features created based on these interactions are useful for classification or prediction.
  • WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato, New Zealand.
  • This machine learning software workbench facilitates the implementation of machine learning algorithms and supports algorithm development or adaptation of data mining and computational methods.
  • WEKA also provides tools to appropriately test the performance of each algorithm and sets of parameter values through methods such as cross-validation and ROC (Receiver Operating Characteristic) curves.
  • WEKA was used to implement machine learning algorithms for modeling. However, one of ordinary skill in the art would appreciate that other machine learning software may be used to practice the present invention.
  • data mining using the approaches described herein provides a flexible, scalable framework for modeling with datasets that include features based on molecular genetic markers.
  • This framework is flexible because it includes tests (i.e. cross-validation and ROC curves) to determine which algorithm and specific parameter settings should be used for the analysis of a data set.
  • This framework is scalable because it is suitable for very large datasets.
  • the method of the invention as described herein is used to mine data sets containing features created from at least one plant-based molecular genetic marker to find at least one association rule and to then use features created from these association rules for classification or prediction.
  • the method is suitable for classification or prediction with datasets containing plant and animal features.
  • steps to mine a data set with at least one feature created from at least one plant-based molecular genetic marker, to find at least one association rule, and utilizing features created from these association rules for classification or prediction for one or more target features include:
  • association rule mining algorithms are utilized for classification or prediction with one or more machine learning algorithms selected from: feature evaluation algorithms, feature subset selection algorithms, Bayesian networks, instance-based algorithms, support vector machines, vote algorithm, cost-sensitive classifier, stacking algorithm, classification rules, and decision tree algorithms.
  • Suitable association rule mining algorithms include, but are not limited to, the APriori algorithm, the FP-growth algorithm, association rule mining algorithms that can handle large numbers of features, colossal pattern mining algorithms, direct discriminative pattern mining algorithms, decision trees, rough sets, and the Self-Organizing Map (SOM) algorithm.
  • suitable association rule mining algorithms for handling large numbers of features include, but are not limited to, CLOSET+, CHARM, CARPENTER, and COBBLER.
  • suitable algorithms for finding direct discriminative patterns include, but are not limited to, DDPM, HARMONY, RCBT, CAR, and PATCLASS.
  • suitable algorithms for finding colossal patterns include, but are not limited to, the Pattern Fusion algorithm.
  • a suitable machine learning algorithm is a feature subset selection algorithm selected from the group of correlation-based feature selection (CFS) algorithm, and the wrapper algorithm in association with any other machine learning algorithm.
  • These feature subset selection algorithms may be associated with a search method selected from the group of greedy stepwise search algorithm, best first search algorithm, exhaustive search algorithm, race search algorithm, and rank search algorithm.
  • a suitable machine learning algorithm is a Bayesian network algorithm including the naive Bayes algorithm.
  • a suitable machine learning algorithm is an instance-based algorithm selected from the group of instance-based 1 (IB1) algorithm, instance-based k-nearest neighbor (IBK) algorithm, KStar, lazy Bayesian rules (LBR) algorithm, and locally weighted learning (LWL) algorithm.
  • a suitable machine learning algorithm for classification or prediction is a support vector machine algorithm.
  • a suitable machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization (SMO) algorithm.
  • the machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization for regression (SMOReg) algorithm.
  • a suitable machine learning algorithm is a self-organizing map.
  • a suitable machine learning algorithm is a decision tree algorithm selected from the group of logistic model tree (LMT) algorithm, alternating decision tree (ADTree) algorithm, M5P algorithm, and REPTree algorithm.
  • a target feature is selected from the group of a continuous target feature and a discrete target feature.
  • a discrete target feature may be a binary target feature.
  • At least one plant-based molecular genetic marker is from a plant population and the plant population may be an unstructured plant population.
  • the plant population may include inbred plants or hybrid plants or a combination thereof.
  • a suitable plant population is selected from the group of maize, soybean, sorghum, wheat, sunflower, rice, canola, cotton, and millet.
  • the plant population may include between about 2 and about 100,000 members.
  • the number of molecular genetic markers may range from about 1 to about 1,000,000 markers.
  • the features may include molecular genetic marker data that includes, but is not limited to, one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof.
  • the features may also include one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof, in conjunction with one or more phenotypic measurements, microarray data, analytical measurements, biochemical measurements, or environmental measurements or a combination thereof as features.
  • a suitable target feature in a plant population includes one or more numerically representable phenotypic traits including disease resistance, yield, grain yield, yarn strength, protein composition, protein content, insect resistance, grain moisture content, grain oil content, grain oil quality, drought resistance, root lodging resistance, plant height, ear height, grain protein content, grain amino acid content, grain color, and stalk lodging resistance.
  • a genotype of the sample plant population for the one or more molecular genetic markers is experimentally determined by direct DNA sequencing.
  • a method to select inbred lines, select hybrids, rank hybrids, rank hybrids for a certain geography, select the parents of new inbred populations, find segments for introgression into elite inbred lines, or any combination thereof is completed using any combination of the steps (a) - (e) above.
  • detecting association rules may include detecting spatial and temporal associations using self-organizing maps.
  • At least one feature of a model for prediction or classification is the subset of features selected earlier using a feature evaluation algorithm.
  • cross-validation is used to compare algorithms and sets of parameter values.
  • receiver operating characteristic (ROC) curves are used to compare algorithms and sets of parameter values.
  • one or more features are derived mathematically or computationally from other features.
  • a data set with at least one plant-based molecular genetic marker is used to find at least one association rule and features created from these association rules are used for classification or prediction and selecting at least one plant from the plant population for one or more target features of interest.
  • prior knowledge comprised of preliminary research, quantitative studies of plant genetics, gene networks, sequence analyses, or any combination thereof, is considered.
  • feature evaluation algorithms, such as information gain, symmetrical uncertainty, and the Relief family of algorithms, may be used.
  • these algorithms are capable of evaluating all features together, instead of one feature at a time. Some of these algorithms are robust to biases, missing values, and collinearity problems.
  • the Relief family of algorithms provides tools capable of accounting for deep-level interactions, but requires reduced collinearity between features in the dataset.
  • subset selection techniques are applied through algorithms such as the CFS subset evaluator.
  • Subset selection techniques may be used for complexity reduction by eliminating redundant, distracting features and retaining a subset capable of properly explaining the target feature. The elimination of these distracting features generally increases the performance of modeling algorithms when evaluated using methods such as cross-validation and ROC curves.
  • Certain classes of algorithms, such as instance-based algorithms, are known to be very sensitive to distracting features, while others, such as support vector machines, are moderately affected by them. Reducing complexity by generating new features based on existing features also often leads to increased predictive performance of machine learning algorithms.
  • filter and wrapper algorithms can be used for feature subset selection.
  • these approaches combine an efficient search method (e.g. greedy stepwise, best first, or race search) with a merit formula (e.g. the CFS subset evaluator).
  • the CFS subset evaluator appropriately accounts for the level of redundancy within the subset while not overlooking locally predictive features.
  • machine learning-based subset selection techniques may also be used to select a subset of features that appropriately explain the target feature while having low level of redundancy between the features included in the subset.
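  • A minimal sketch of this idea follows, assuming the standard CFS merit formula (merit = k*r_cf / sqrt(k + k*(k-1)*r_ff), Hall 1999) and a greedy stepwise search; the data and the stopping rule are simplifications, not the WEKA implementation.

```python
# Correlation-based feature selection (CFS) with greedy stepwise search (sketch).
import numpy as np

def cfs_merit(X, y, subset):
    # average |feature-target| correlation is rewarded; average |feature-feature|
    # correlation (redundancy within the subset) is penalized
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    remaining, chosen, best = list(range(X.shape[1])), [], 0.0
    while remaining:
        merit, j = max((cfs_merit(X, y, chosen + [j]), j) for j in remaining)
        if merit <= best:                       # stop when merit no longer improves
            return chosen, best
        chosen.append(j)
        remaining.remove(j)
        best = merit
    return chosen, best

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20)).astype(float)           # synthetic markers
y = X[:, 2] - X[:, 5] + rng.normal(scale=0.5, size=200)        # synthetic trait
print(greedy_cfs(X, y))                                        # expect features 2 and 5
```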
  • one of the purposes of subset selection approaches is to reduce waste during future data collection, manipulation, and storage efforts by focusing only on the subset of features found to properly explain the target feature.
  • the machine learning techniques used for complexity reduction described herein can be compared using cross-validation and ROC curves, for example.
  • the feature subset selection algorithm with the best performance may then be selected for the final analysis. This comparison is generally performed through cross-validation and ROC curves, applied to different combinations of subset selection algorithms and modeling algorithms.
  • To run the cross-validation during the subset selection and modeling steps, multiple computers running a parallelized version of machine learning software (e.g. WEKA) may be used.
  • the techniques described herein for feature subset selection use efficient search methods for finding the best subset of features, since an exhaustive search is not always feasible.
  • An aspect of the modeling methods disclosed herein is that because a single algorithm may not always be the best option for modeling every data set, the framework presented herein uses cross-validation techniques, ROC curves and precision and recall to choose the best algorithm for each data set from various options within the field of machine learning.
  • several algorithms and parameter settings may be compared using cross-validation, ROC curves and precision and recall, during model development.
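  • A sketch of this comparison step follows, assuming scikit-learn as a stand-in for the WEKA workbench mentioned above; the candidate algorithms, parameter values, and data are illustrative only.

```python
# Compare several candidate algorithms by cross-validated ROC AUC.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 40))                               # binarized marker features
y = (X[:, 0] + X[:, 1] + rng.integers(0, 2, 300) >= 2).astype(int)   # toy binary target

candidates = {
    "naive Bayes": GaussianNB(),
    "k-nearest neighbours (IBK-like)": KNeighborsClassifier(n_neighbors=5),
    "support vector machine (SMO-like)": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```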
  • Several machine learning algorithms are robust to multicollinearity problems (allowing modeling with large number of features), robust to missing values, and able to account for deep level interactions between features without over-fitting the data.
  • machine learning algorithms suitable for modeling include support vector machines (such as SMOReg) and decision trees (such as M5P, REPTree, and ADTree), in addition to Bayesian networks and instance-based algorithms. Trees generated by the M5P, REPTree, and ADTree algorithms grow by focusing on reducing the variance of the target feature in the subset of samples assigned to each newly created node.
  • the M5P is usually used to handle continuous target features
  • the ADTree is usually used to handle binary (or binarized) target features
  • the REPTree may be used to handle both continuous and discrete target features.
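  • The variance-reduction criterion mentioned above can be illustrated with the following single-split sketch; it is not the M5P/REPTree/ADTree code, just the underlying idea applied to synthetic data.

```python
# Score candidate splits by how much they reduce the variance of the target.
import numpy as np

def variance_reduction(y, left_mask):
    y_left, y_right = y[left_mask], y[~left_mask]
    if len(y_left) == 0 or len(y_right) == 0:
        return 0.0
    weighted = (len(y_left) * y_left.var() + len(y_right) * y_right.var()) / len(y)
    return y.var() - weighted

def best_split(X, y):
    best = (0.0, None, None)                      # (reduction, feature index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            red = variance_reduction(y, X[:, j] <= t)
            if red > best[0]:
                best = (red, j, t)
    return best

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(150, 10)).astype(float)  # synthetic markers coded 0/1/2
y = 2.0 * X[:, 4] + rng.normal(size=150)              # trait driven by feature 4
print(best_split(X, y))                               # expected to split on feature 4
```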
  • An aspect of the machine learning methods disclosed herein is that the algorithms used herein may not require highly structured data sets, unlike some methods based strictly on statistical techniques, which often rely on highly structured data sets. Structured experiments are often resource intensive in terms of manpower, costs, and time because the strong environmental effect in the expression of many of the most important quantitatively inherited traits in economically important plants and animals requires that such experiments be large, carefully designed, and carefully controlled. Data mining using machine learning algorithms, however, may effectively utilize existing data that was not specifically generated for this data mining purpose.
  • the methods disclosed herein may be used for prediction of a target feature value in one or more members of a second, target plant population based on their genotype for the one or more molecular genetic markers or haplotypes associated with the trait.
  • the values may be predicted in advance of or instead of experimentally being determined.
  • the methods disclosed herein have a number of applications in applied breeding programs in plants (e.g., hybrid crop plants) in association or not with other statistical methods, such as BLUP (Best Linear Unbiased Prediction).
  • the methods can be used to predict the phenotypic performance of hybrid progeny, e.g., a single cross hybrid produced (either actually or in a hypothetical situation) by crossing a given pair of inbred lines of known molecular genetic marker genotype.
  • the methods are also useful in selecting plants (e.g., inbred plants, hybrid plants, etc.) for use as parents in one or more crosses; the methods permit selection of parental plants whose offspring have the highest probability of possessing the desired phenotype.
  • associations between at least one feature and the target feature are learned.
  • the associations may be evaluated in a sample plant population (e.g., a breeding population).
  • the associations are evaluated in a first plant population by training a machine learning algorithm using a data set with features that incorporate genotypes for at least one molecular genetic marker and values for the target feature in at least one member of the plant population.
  • the values of a target feature may then be predicted on a second population using the trained machine learning algorithm and the values for at least one feature. The values may be predicted in advance of or instead of experimentally being determined.
  • the target feature may be a quantitative trait, e.g., for which a quantitative value is provided.
  • the target feature may be a qualitative trait, e.g., for which a qualitative value is provided.
  • the phenotypic traits that may be included in some features may be determined by a single gene or a plurality of genes.
  • the methods may also include selecting at least one of the members of the target plant population having a desired predicted value of a target feature, and include breeding at least one selected member of the target plant population with at least one other plant (or selfing the at least one selected member, e.g., to create an inbred line).
  • the sample plant population may include a plurality of inbreds, single cross F1 hybrids, or a combination thereof.
  • the inbreds may be from inbred lines that are related and/or unrelated to each other, and the single cross F1 hybrids may be produced from single crosses of the inbred lines and/or one or more additional inbred lines.
  • the members of the sample plant population include members from an existing, established breeding population (e.g., a commercial breeding population).
  • the members of an established breeding population are usually descendants of a relatively small number of founders and are generally inter-related.
  • the breeding population may cover a large number of generations and breeding cycles. For example, an established breeding population may span three, four, five, six, seven, eight, nine or more breeding cycles.
  • the sample plant population need not be a breeding population.
  • the sample population may be a sub-population of any existing plant population for which genotypic and phenotypic data are available either completely or partially.
  • the sample plant population may include any number of members.
  • the sample plant population includes between about 2 and about 100,000 members.
  • the sample plant population may comprise at least about 50, 100, 200, 500, 1000, 2000, 3000, 4000, 5000, or even 6000 or 10,000 or more members.
  • the sample plant population usually exhibits variability for the target feature of interest (e.g., quantitative variability for a quantitative target feature).
  • the sample plant population may be extracted from one or more plant cell cultures.
  • the value of the target feature in the sample plant population is obtained by evaluating the target feature among the members of the sample plant population (e.g., quantifying a quantitative target feature among the members of the population).
  • the phenotype may be evaluated in the members (e.g., the inbreds and/or single cross F1 hybrids) comprising the first plant population.
  • the target feature may include any quantitative or qualitative target feature, e.g., one of agronomic or economic importance.
  • the target feature may be selected from yield, grain moisture content, grain oil content, yarn strength, plant height, ear height, disease resistance, insect resistance, drought resistance, grain protein content, test weight, visual or aesthetic appearance, and cob color.
  • the genotype of the sample or test plant population for the set of molecular genetic markers can be determined experimentally, predicted, or a combination thereof.
  • the genotype of each inbred present in the plant population is experimentally determined and the genotype of each single cross F1 hybrid present in the first plant population is predicted (e.g., from the experimentally determined genotypes of the two inbred parents of each single cross hybrid).
  • Plant genotypes can be experimentally determined by any suitable technique.
  • a plurality of DNA segments from each inbred is sequenced to experimentally determine the genotype of each inbred.
  • pedigree trees and a probabilistic approach can be used to calculate genotype probabilities at different marker loci for the two inbred parents of single cross hybrids.
  • the methods disclosed herein may be used to select plants for a selected genotype including at least one molecular genetic marker associated with the target feature.
  • an "allele” or “allelic variant” refers to an alternative form of a genetic locus.
  • a single allele for each locus is inherited separately from each parent.
  • a diploid individual is homozygous if the same allele is present twice (i.e., once on each homologous chromosome), or heterozygous if two different alleles are present.
  • animal is meant to encompass non-human organisms other than plants, including, but not limited to, companion animals (i.e. pets), food animals, work animals, or zoo animals.
  • Preferred animals include, but are not limited to, fish, cats, dogs, horses, ferrets and other Mustelids, cattle, sheep, and swine. More preferred animals include cats, dogs, horses and other companion animals, with cats, dogs and horses being even more preferred.
  • the term "companion animal” refers to any animal which a human regards as a pet.
  • a cat refers to any member of the cat family (i.e., Felidae), including domestic cats, wild cats and zoo cats.
  • cats include, but are not limited to, domestic cats, lions, tigers, leopards, panthers, cougars, bobcats, lynx, jaguars, cheetahs, and servals.
  • a preferred cat is a domestic cat.
  • a dog refers to any member of the family Canidae, including, but not limited to, domestic dogs, wild dogs, foxes, wolves, jackals, and coyotes and other members of the family Canidae.
  • a preferred dog is a domestic dog.
  • a horse refers to any member of the family Equidae.
  • An equid is a hoofed mammal and includes, but is not limited to, domestic horses and wild horses, such as, horses, asses, donkeys, and zebras. Preferred horses include domestic horses, including race horses.
  • association, in the context of machine learning, refers to any interrelation among features, not just ones that predict a particular class or numeric value. Association includes, but is not limited to, finding association rules, finding patterns, performing feature evaluation, performing feature subset selection, developing predictive models, and understanding interactions between features.
  • association rules refers to elements that co-occur frequently within the data set. It includes, but is not limited to association patterns, discriminative patterns, frequent patterns, closed patterns, and colossal patterns.
  • binary in the context of machine learning, refers to a continuous or categorical feature that has been transformed to a binary feature.
  • breeding population refers generally to a collection of plants used as parents in a breeding program. Usually, the individual plants in the breeding population are characterized both genotypically and phenotypically.
  • data mining refers to the identification or extraction of relationships and patterns from data using computational algorithms to reduce, model, understand, or analyze data.
  • decision trees refers to any type of tree-based learning algorithms, including, but not limited to, model trees, classification trees, and regression trees.
  • feature in the context of machine learning refers to one or more raw input variables, to one or more processed variables, or to one or more mathematical combinations of other variables, including raw variables and processed variables.
  • Features may be continuous or discrete.
  • Features may be generated through processing by any filter algorithm or any statistical method.
  • Features may include, but are not restricted to, DNA marker data, haplotype data, phenotypic data, biochemical data, microarray data, environmental data, proteomic data, and metabolic data.
  • feature evaluation in the context of this invention, refers to the ranking of features or to the ranking followed by the selection of features based on their impact on the target feature.
  • feature subset refers to a group of one or more features.
  • a “genotype” refers to the genetic makeup of a cell or the individual plant or organism with regard to one or more molecular genetic markers or alleles.
  • haplotype refers to a set of alleles that an individual inherited from one parent.
  • the term haplotype may also refer to physically linked and/or unlinked molecular genetic markers (for example polymorphic sequences) associated with a target feature.
  • haplotype may also refer to a group of two or more molecular genetic markers that are physically linked on a chromosome.
  • the term "instance”, in the context of machine learning, refers to an example from a data set.
  • interaction refers to the association between features and target features by way of dependency of one feature on another feature.
  • learning in the context of machine learning refers to the identification and training of suitable algorithms to accomplish tasks of interest.
  • learning includes, but is not restricted to, association learning, classification learning, clustering, and numeric prediction.
  • machine learning refers to the field of the computer sciences that studies the design of computer programs able to induce patterns, regularities, or rules from past experiences to develop an appropriate response to future data, or describe the data in some meaningful way.
  • by machine learning algorithms, in the context of this invention, it is meant association rule algorithms (e.g. Apriori, discriminative pattern mining, frequent pattern mining, closed pattern mining, colossal pattern mining, and self-organizing maps), feature evaluation algorithms (e.g. information gain, Relief, ReliefF, RReliefF, symmetrical uncertainty, gain ratio, and ranker), subset selection algorithms (e.g. wrapper, consistency, classifier, correlation-based feature selection (CFS)), support vector machines, Bayesian networks, classification rules, decision trees, neural networks, instance-based algorithms, other algorithms that use the herein listed algorithms (e.g. vote, stacking, cost-sensitive classifier), and any other algorithm in the field of the computer sciences that relates to inducing patterns, regularities, or rules from past experiences to develop an appropriate response to future data, or to describing the data in some meaningful way.
  • model development refers to a process of building one or more models for data mining.
  • molecular genetic marker refers to any one of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular marker derived from DNA, RNA, protein, or metabolite, and a combination thereof.
  • Molecular genetic markers also refer to polynucleotide sequences used as
  • phenotypic trait or "phenotype" refers to an observable physical or biochemical characteristic of an organism, as determined by both genetic makeup and environmental influences. Phenotype refers to the observable expression of a particular genotype.
  • plant includes the class of higher and lower plants including angiosperms (monocotyledonous and dicotyledonous plants), gymnosperms, ferns, and multicellular algae. It includes plants of a variety of ploidy levels, including aneuploid, polyploid, diploid, haploid and hemizygous.
  • plant-based molecular genetic marker refers to any one of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular marker derived from plant DNA, RNA, protein, or metabolite, and a combination thereof.
  • Molecular genetic markers also refer to polynucleotide
  • prior knowledge in the context of this invention, refers to any form of information that can be used to modify the performance of a machine learning algorithm.
  • a relationship matrix indicating the degree of relatedness between individuals, is an example of prior knowledge.
  • a “qualitative trait” generally refers to a feature that is controlled by one or a few genes and is discrete in nature. Examples of qualitative traits include flower color, cob color, and disease resistance.
  • a “quantitative trait” generally refers to a feature that can be quantified.
  • a quantitative trait typically exhibits continuous variation between individuals of a population.
  • a quantitative trait is often the result of a genetic locus interacting with the environment or of multiple genetic loci interacting with each other and/or with the environment. Examples of quantitative traits include grain yield, protein content, and yarn strength.
  • ranking in relation to the features refers to an orderly arrangement of the features, e.g., molecular genetic markers may be ranked by their predictive ability in relation to a trait.
  • self-organizing map refers to an unsupervised learning technique often used for visualization and analysis of high-dimensional data.
  • supervised in the context of machine learning, refers to methods that operate under supervision by being provided with the actual outcome for each of the training instances.
  • support vector machine in the context of machine learning includes, but is not limited to, support vector classifier, used for classification purposes, and support vector regression, used for numeric prediction.
  • Other algorithms, e.g. the sequential minimal optimization (SMO) algorithm, may be used to train support vector machines.
  • target feature in the context of this invention, refers, but is not limited to, a feature which is of interest to predict, or explain, or with which it is of interest to develop associations.
  • a data mining effort may include one target feature or more than one target feature and the term “target feature” may refer to one or more than one feature.
  • Target features may include, but are not restricted to, DNA marker data, phenotypic data, biochemical data, microarray data, environmental data, proteomic data, and metabolic data. In the field of machine learning, when the "target feature" is discrete, it is often called “class”. Grain yield is an example of a target feature.
  • unsupervised in the context of machine learning, refers to methods that operate without supervision by not being provided with the actual outcome for each of the training instances.
  • Association rule mining is a technique for extracting meaningful association patterns among features.
  • One of the machine learning algorithms suitable for learning association rules is the APriori algorithm.
  • a usual first step of ARM algorithms is to find the sets of items or features that occur most frequently among all the observations; these are known as frequent itemsets. Their frequency is also known as support (the user may specify a minimum support threshold for an itemset to be considered frequent). Once the frequent itemsets are obtained, rules are extracted from them (with a user-specified minimum confidence measure, for example). The latter part is not as computationally intensive as the former; hence, the effort of ARM algorithms is focused on finding frequent itemsets.
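  • A toy illustration of the support and confidence computations just described, on a handful of hypothetical marker/trait items; real miners such as Apriori or FP-growth prune this search rather than enumerating every candidate.

```python
# Enumerate frequent itemsets by support, then extract high-confidence rules.
from itertools import combinations

transactions = [                                  # hypothetical item sets per individual
    {"SNP17_A", "SNP42_G", "high_yield"},
    {"SNP17_A", "SNP42_G", "high_yield"},
    {"SNP17_A", "SNP99_T"},
    {"SNP42_G", "high_yield"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset(c) for size in (1, 2, 3) for c in combinations(items, size)
            if support(frozenset(c)) >= 0.5]      # minimum support threshold

for itemset in frequent:                          # rules: antecedent => consequent
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= 0.8:                       # minimum confidence threshold
                print(set(antecedent), "=>", set(consequent), f"confidence={conf:.2f}")
```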
  • a frequent closed pattern is a pattern that meets the minimal support requirement specified by the user and does not have the same support as its immediate supersets.
  • a frequent pattern is not closed if at least one of its immediate supersets has the same support count as it does. Finding frequent closed patterns allows us to find a subset of relevant interactions among the features.
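  • A small follow-up sketch of the closedness test: a frequent pattern is kept as closed only when no immediate superset has the same support (the transactions are again hypothetical).

```python
# Keep only the frequent patterns whose support drops when any item is added.
from itertools import combinations

transactions = [{"SNP17_A", "SNP42_G"}, {"SNP17_A", "SNP42_G"}, {"SNP17_A"}]

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset(c) for n in (1, 2) for c in combinations(items, n)
            if support(frozenset(c)) >= 2]        # minimum support count of 2

closed = [p for p in frequent
          if not any(support(p | {extra}) == support(p) for extra in items if extra not in p)]
# {SNP17_A} is closed (support 3); {SNP42_G} is not, because its superset
# {SNP17_A, SNP42_G} has the same support (2); the pair itself is closed.
print(closed)
```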
  • the Apriori algorithm works iteratively by combining frequent itemsets with n-1 features to form candidate frequent itemsets with n features. Its execution time grows exponentially with the number of features; hence, extracting frequent itemsets with the Apriori algorithm becomes computationally intensive for datasets with a very large number of features.
  • CARPENTER is a depth-first row enumeration algorithm; however, it does not scale well as the number of samples increases.
  • COBBLER is a column and row enumeration algorithm that scales well as the numbers of features and samples increase.
  • DDPMine (Direct Discriminative Pattern Mining) does not follow the two-step approach described above. Instead of deriving frequent patterns, it generates a compressed FP-tree representation of the data. This procedure not only reduces the problem size but also speeds up the mining process. It uses information gain as the measure to mine discriminative patterns.
  • HARMONY is an instance-centric rule-based classifier. It directly mines a final set of classification rules.
  • the RCBT classifier works by first identifying the top-k covering rule groups for each row and then using them in the classification framework.
  • PatClass takes a two-step approach, first mining a set of frequent itemsets and then applying a feature selection step.
  • the Self-Organizing Map (SOM), also known as the Kohonen network, is an unsupervised learning technique often used for visualization and analysis of high-dimensional data. Typical applications focus on the visualization of the central dependencies within the data on the map. Some areas where SOMs have been used include automatic speech recognition, clinical voice analysis, classification of satellite images, analyses of electrical signals from the brain, and organization and retrieval of large document collections.
  • the map generated by SOMs has been used to speed up the identification of association rules by methods like Apriori, by utilizing the SOM clusters (visual clusters identified during SOM training).
  • the SOM map consists of a grid of processing units, "neurons". Each neuron is associated with a feature vector (observation).
  • the map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time the models become ordered on the grid so that similar models are close to each other and dissimilar models far from each other. This procedure enables the identification as well as the visualization of dependencies or associations between the features in the data.
  • SOM algorithms have also been frequently used to explore the spatial and temporal relationships between entities. Relationships and associations between observations are derived based on the spatial clustering of these observations on the map. If the neurons represent various time states then the map visualizes the temporal patterns between observations.
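  • A minimal sketch of SOM training is shown below, assuming numeric feature vectors; the grid size, decay schedule, and function names are illustrative only. Each iteration finds the best-matching neuron for a random observation and pulls that neuron, and its grid neighbors, towards the observation, which is what produces the ordered map described above.

```python
import numpy as np

def train_som(data, grid=(5, 5), n_iters=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a grid of neurons so that nearby neurons respond to similar observations."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # map coordinates, used to compute neighborhood distances on the grid
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iters):
        x = data[rng.integers(len(data))]
        # best-matching unit: neuron whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # learning rate and neighborhood radius decay over time
        lr = lr0 * np.exp(-t / n_iters)
        sigma = sigma0 * np.exp(-t / n_iters)
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)
    return weights

som_weights = train_som(np.random.rand(100, 10))   # 100 toy observations, 10 features
```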
  • One of the main purposes of feature evaluation algorithms is to understand the underlying process that generates the data. These methods are also frequently applied to reduce the number of "distracting" features with the aim of improving the performance of classification algorithms (see Guyon and Elisseeff (2003), An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3, 1157-1182).
  • the term “variable” is sometimes used instead of the broader terms “feature” or “attribute”.
  • Feature (or attribute) selection typically refers to the selection of variables constructed from the raw inputs (for example, through kernel methods), but the term is sometimes also used to refer to the selection of the raw input variables themselves.
  • the desired output of these feature evaluation algorithms is usually the ranking of features based on their impact on the target feature or the ranking followed by selection of features. This impact may be measured in different ways.
  • Information gain is one of the machine learning methods suitable for feature evaluation.
  • the definition of information gain requires the definition of entropy, which is a measure of impurity in a collection of training instances.
  • the reduction in entropy of the target feature that occurs by knowing the values of a certain feature is called information gain.
  • Information gain may be used as a parameter to determine the effectiveness of a feature in explaining the target feature.
  • Symmetrical uncertainty used by the Correlation based Feature Selection (CFS) algorithm described herein, compensates for information gain's bias towards features with more values by normalizing features to a [0,1] range. Symmetrical uncertainty always lies between 0 and 1. It is one way to measure the correlation between two nominal features.
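  • The quantities described above can be computed directly from the data. The sketch below, intended only as an illustration with hypothetical marker and class labels, computes entropy, information gain, and symmetrical uncertainty for nominal features.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a collection of nominal values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, target):
    """Reduction in entropy of the target obtained by knowing the feature."""
    n = len(target)
    conditional = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(target) - conditional

def symmetrical_uncertainty(feature, target):
    """Information gain normalized to [0, 1] by the entropies of both variables."""
    h_f, h_t = entropy(feature), entropy(target)
    return 0.0 if h_f + h_t == 0 else 2.0 * information_gain(feature, target) / (h_f + h_t)

# toy usage: marker genotypes versus a binary disease-resistance class
marker = ["AA", "AA", "Aa", "aa", "aa", "Aa"]
resistance = [1, 1, 1, 0, 0, 0]
print(information_gain(marker, resistance), symmetrical_uncertainty(marker, resistance))
```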
  • the Ranker algorithm may also be used to rank the features by their individual evaluations at each fold of cross-validation and output the average merit and rank for each feature.
  • Relief is a class of attribute evaluator algorithms that may be used for the feature evaluation step disclosed herein. This class contains algorithms that are capable of dealing with categorical or continuous target features. This broad range makes them useful for several data mining applications.
  • the original Relief algorithm has several versions and extensions.
  • ReliefF, an extension of the original Relief algorithm, is not limited to two-class problems and can handle incomplete data sets. ReliefF is also more robust than Relief and can deal with noisy data.
  • the estimated importance of a feature is determined by a sum of scores assigned to it for each one of the instances. Each score depends on how important the feature is in determining the class of an instance. The feature gets maximum value if it is decisive in determining the class. When a significant number of uninformative features are added to the analysis, many instances are necessary for these algorithms to converge to the correct estimates of the worth of each feature. When dealing with several neighboring misses, the important features are those for which a minimal change in their value leads to a change in the class of the instance being evaluated. In ReliefF, when the number of instances is enormous, the near hits play a minimal role and the near misses play a huge role, but with problems of practical size near hits play a bigger role.
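  • The hit/miss scoring described above can be illustrated with a short sketch of the original two-class Relief algorithm (the distance measure, sample size, and marker encoding are assumptions made only for this example): each feature is rewarded when it separates an instance from its nearest miss and penalized when it separates it from its nearest hit.

```python
import numpy as np

def relief(X, y, n_samples=None, seed=0):
    """Minimal two-class Relief: returns one weight per feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = n_samples or n
    weights = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        x, c = X[i], y[i]
        dists = np.abs(X - x).sum(axis=1)   # Manhattan distance to every instance
        dists[i] = np.inf                   # exclude the instance itself
        same, diff = (y == c), (y != c)
        hit = X[np.where(same)[0][np.argmin(dists[same])]]    # nearest same-class instance
        miss = X[np.where(diff)[0][np.argmin(dists[diff])]]   # nearest other-class instance
        weights += (np.abs(x - miss) - np.abs(x - hit)) / m
    return weights

# toy usage: two informative binary marker columns and one noise column
X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]], dtype=float)
y = np.array([0, 0, 1, 1])
print(relief(X, y))
```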
  • RReliefF is an extension of ReliefF that deals with continuous target features.
  • the positive updates form the probabilities that the feature discriminates between the instances with different class values.
  • the negative updates form the probabilities that the feature discriminates between the instances with the same class values.
  • RReliefF algorithms reward features for not separating similar prediction values and punish features for not separating different prediction values.
  • RReliefF, differently from Relief and ReliefF, does not use signs, so the concepts of hit and miss do not apply.
  • RReliefF considers good features to be the ones that separate instances with different prediction values and do not separate instances with close prediction values.
  • the estimations generated by algorithms from the class of Relief algorithms are dependent on the number of neighbors used. If one does not use a restriction on the number of neighbors, each feature will suffer the impact of all of the samples in the data set.
  • the restriction on the number of samples used provides estimates by Relief algorithms that are averages over local estimates in smaller parts of the instance space. These local predictions allow Relief algorithms to take into account other features when updating the weight of each feature, as the nearest-neighbors are determined by a distance measure that considers all of the features. Therefore, Relief algorithms are sensitive to the number and usefulness of the features included in the data set. Other features are considered through their conditional dependencies to the feature being updated given the predicted values, which can be detected in the context of locality.
  • the distance between instances is determined by the sum of the differences in the values of the "relevant” and “irrelevant” features.
  • these algorithms are not robust to irrelevant features. Therefore, in the presence of many irrelevant features, it is recommended to use a large value of k (i.e. to increase the number of nearest-neighbors). Doing so provides better conditions for the relevant features to "impose" the "correct" update for each feature.
  • Relief algorithms can lose functionality when the number of nearest-neighbors used in the weight formula is too big, often confounding informative features.
  • the RReliefF algorithm may tend to underestimate important numerical features in comparison to nominal features when calculating the Euclidean or Manhattan distance between instances to determine nearest-neighbors. RReliefF also overestimates random (non-important) numerical features, potentially reducing the separability of two groups of features.
  • the ramp function (see Hong (1994), Use of contextual information for feature ranking and discretization, Technical Report RC19664, IBM; and Hong (1997), IEEE Transactions on Knowledge and Data Engineering, 9(5):718-730) can be used to overcome this problem of RReliefF.
  • ReliefF and RReliefF are context sensitive and are therefore more sensitive to the number of random (non-important) features in the analysis than myopic measures (e.g. gain ratio and MSE).
  • Relief algorithms estimate each feature in the context of other features and better features get higher scores.
  • Relief algorithms tend to underestimate less important features when there are hundreds of important features in the data set. In addition, duplicated or highly redundant features will share the credit and may seem more important than they actually are. This can occur because additional copies of a feature change the problem space in which the nearest-neighbors are searched. Using nearest-neighbors, updates only occur when there are differences between the feature values of two neighboring instances. Therefore, no update for a given feature at a given set of neighbors will occur if the difference between the two neighbors is zero.
  • Subset selection algorithms rely on a combination of an evaluation method (e.g. symmetrical uncertainty or information gain) and a search method (e.g. ranker, exhaustive search, best first, or greedy hill-climbing).
  • Subset selection algorithms, similarly to feature evaluation algorithms, rank subsets of features. In contrast to feature evaluation algorithms, however, subset selection algorithms aim at selecting the subset of features with the highest impact on the target feature, while accounting for the degree of redundancy between the features included in the subset. Subset selection algorithms are designed to be robust to multicollinearity and missing values and thus allow for selection from an initial pool of hundreds or even thousands of features. The benefits of feature subset selection include facilitating data visualization and understanding, reducing measurement and storage requirements, reducing training and utilization times, and eliminating distracting features to improve classification.
  • results from subset selection methods are useful for plant and animal geneticists because they can be used to pre-select the molecular genetic markers to be analyzed during a marker assisted selection project with a phenotypic trait as the target feature. This can significantly reduce the number of molecular genetic markers that must be assayed and thus reduce the costs associated with the effort.
  • Subset selection algorithms can be applied to a wide range of data sets.
  • An important consideration in the selection of a suitable search algorithm is the number of features in the data set. As the number of features increases, the number of possible subsets of features increases exponentially. For this reason, the exhaustive search algorithm is only suitable when the number of features is relatively small. With adequate computational power, however, it is possible to use exhaustive search to determine the most relevant subset of features.
  • Two basic approaches to subset selection algorithms are the process of adding features to a working subset (forward selection) and deleting from the current subset of features (backward elimination).
  • forward selection is done differently than the statistical procedure with the same name.
  • the feature to be added to the current subset is found by evaluating the performance of the current subset augmented by one new feature using cross-validation.
  • forward selection subsets are built up by adding each remaining feature in turn to the current subset while evaluating the expected performance of each new subset using cross-validation.
  • the feature that leads to the best performance when added to the current subset is retained and the process continues.
  • the search ends when none of the remaining available features improves the predictive ability of the current subset. This process finds a local (i.e. not necessarily global) optimum set of features.
  • Backward elimination is implemented in a similar fashion. With backward elimination, the search ends when further reduction in the feature set does not improve the predictive ability of the subset. To introduce bias towards smaller subsets one may require the predictive ability to improve by a certain amount for a feature to be added (during forward selection) or deleted (during backward elimination).
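  • As an illustration of the cross-validated forward selection described above, the sketch below (a simplification; the estimator, fold count, and stopping threshold are arbitrary choices for the example) adds, at each round, the feature whose inclusion gives the largest cross-validated gain and stops when no remaining feature improves the score enough.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, estimator=None, cv=10, min_gain=1e-4):
    """Greedy forward selection driven by cross-validated accuracy."""
    estimator = estimator or GaussianNB()
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # score the current subset augmented by each remaining feature
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] - best_score < min_gain:
            break                      # no feature improves the subset enough
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

# toy usage: only features 0 and 3 carry signal
X = np.random.rand(60, 6)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)
print(forward_selection(X, y, cv=5))
```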
  • the best first algorithm can search forward, backward or in both directions (by considering all possible single feature additions and deletions at a given point) through the application of greedy hill-climbing augmented with a backtracking facility (see Pearl, J. (1984), Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, p. 48 ; and Russell, S.J., & Norvig, P. Artificial Intelligence: A Modern Approach. 2nd edition. Pearson Education, Inc., 2003, pp. 94 and 95 ).
  • This method keeps a list of all of the subsets previously visited and revisits them whenever the predictive ability stops improving for the current subset. Given enough time, it will search the entire feature subset space, so a stopping criterion is usually imposed in practice.
  • the beam search method works similarly to best first but truncates the list of feature subsets at each stage, so it is restricted to a fixed number called the beam width.
  • the genetic algorithm is a search method that uses random perturbations of a current list of candidate subsets to generate new good subsets (see Schmitt, Lothar M (2001), Theory of Genetic Algorithms, Theoretical Computer Science (259), pp. 1-61 ). They are adaptive and use search techniques based on the principles of natural selection in biology. Competing solutions are set up and evolve over time searching the solution space in parallel (which helps with avoiding local maxima). Crossover and mutations are applied to the members of the current generation to create the next generation. The random addition or deletion of features from a subset is conceptually analogous to the role of mutation in natural systems. Similarly, crossovers combine features from a pair of subsets to form a new subset. The concept of fitness comes into play in that the fittest (best) subset at a given generation has a greater chance of being selected to form a new subset through crossover and mutation. Therefore, good subsets evolve over time.
  • the scheme-specific (wrapper) approach (Kohavi and John (1997), Wrappers for feature selection. Artificial Intelligence, 97(1-2):273-324) is a suitable search method.
  • the idea here is to select the subset of features that will have the best classification performance when used for building a model with a specific algorithm. Accuracy is evaluated through cross-validation, a holdout set, or a bootstrap estimator. A model and a set of cross-validation folds must be evaluated for each subset of features being considered. For example, forward selection or backward elimination with k features and 10-fold cross-validation will take approximately k^2 × 10 learning procedures. Exhaustive search algorithms will take on the order of 2^k × 10 learning procedures.
  • race search that uses a t-test to determine the probability of a subset being better than the current best subset by at least a small user-specified threshold is suitable. If during the leave-one-out cross-validation process, the probability becomes small, a subset can be discarded because it is very unlikely that adding or deleting features to this subset will lead to an improvement over the current best subset. In forward selection, for example, all of the feature additions to a subset are evaluated simultaneously and the ones that don't perform well enough are dropped. Therefore, not all the instances are used (on leave-one-out cross-validation) to evaluate all the subsets.
  • the race search algorithm also blocks all of the nearly identical feature subsets and uses Bayesian statistics to maintain a probability distribution on the estimate of the mean leave-one-out cross-validation error for each competing subset. Forward selection is used but, instead of sequentially trying all the possible changes to the best subset, these changes are raced and the race finishes when cross-validation finishes or a single subset is left.
  • schemata search is a more complicated method designed for racing, running an iterative series of races that each determines if a feature should be included or not (see Moore, A. W., and Lee, M. S. (1994). Efficient algorithms for minimizing cross-validation error. In Cohen, W. W., and Hirsh, H., eds.,Machine learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann ).
  • a search begins with all features marked as unknown rather than an empty or full set of features. All combinations of unknown features are used with equal probability.
  • a feature is chosen and subsets with and without the chosen feature are raced. The other features that compose the subset are included or excluded randomly at each point in the evaluation.
  • rank race search orders the features based on their information gain, for example, and then races using subsets that are based on the rankings of the features.
  • the race starts with no features, continues with the top-ranked feature, the top two features, the top three features, and so on.
  • Cross-validation may be used to determine the best search method for a specific data set.
  • selective naive Bayes uses a search algorithm such as forward selection to avoid including redundant features and features that are dependent on each other (see e.g., Domingos, Pedro & Michael Pazzani (1997), "On the optimality of the simple Bayesian classifier under zero-one loss". Machine Learning, 29:103-137).
  • the best subset is found by simply testing the performance of the subsets using the training set.
  • Filter methods operate independently of any learning algorithm, while wrapper methods rely on a specific learning algorithm and use methods such as cross-validation to estimate the accuracy of feature subsets. Wrappers often perform better than filters, but are much slower, and must be re-run whenever a different learning algorithm is used or even when a different set of parameter settings is used. The performance of wrapper methods depends on the learning algorithm used, the procedure used to estimate the off-sample accuracy of the learning algorithm, and the organization of the search.
  • Filters are much faster than wrappers for subset selection (due to the reasons pointed out above), so filters can be used with larger data sets. Filters can also improve the accuracy of a certain algorithm by providing a starting feature subset for the wrapper algorithms. This process would therefore speed up the wrapper analysis.
  • CFS assumes that the features are independent given the target feature. If strong feature dependency exists, CFS' performance may suffer and it might fail to select all of the relevant features. CFS is effective at eliminating redundant and irrelevant features and will detect all of the relevant features in the absence of strong dependency between features. CFS will accept features capable of predicting the response variable in areas of the instance space not already predicted by other features.
  • the numerator of the evaluation function indicates how predictive of the target feature the subset is, and the denominator indicates how redundant the features in the subset are.
  • the target feature is first made discrete using the method of Fayyad and Irani ( Fayyad, U. M. and Irani, K. B.. 1993. Multi-interval discretisation of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Join Conference on Artificial Intelligence. Morgan Kaufmann, 1993 .).
  • the algorithm calculates all feature-target feature correlations (that will be used in the numerator of the evaluation function) and all feature-feature correlations (that will be used in the denominator of the evaluation function).
  • the algorithm searches the feature subset space (using any user-determined search method) looking for the best subset.
  • symmetrical uncertainty is used to calculate correlations.
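  • The subset merit used by CFS can be written down compactly. The sketch below is an illustration of the published merit formula (Hall, 1999), with hypothetical correlation values; in practice the correlations would be symmetrical uncertainties computed from the data as described above.

```python
import math

def cfs_merit(feature_target_corr, feature_feature_corr):
    """CFS merit of a feature subset: high average feature-target correlation is
    rewarded (numerator), high feature-feature redundancy is penalized (denominator).

    feature_target_corr: correlation of each subset feature with the target
    feature_feature_corr: pairwise correlations between the subset features
    """
    k = len(feature_target_corr)
    r_ct = sum(feature_target_corr) / k
    r_ff = sum(feature_feature_corr) / len(feature_feature_corr) if feature_feature_corr else 0.0
    return (k * r_ct) / math.sqrt(k + k * (k - 1) * r_ff)

# toy usage: three features moderately related to the target and weakly redundant
print(cfs_merit([0.40, 0.35, 0.30], [0.10, 0.05, 0.20]))
```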
  • CFS may fail to detect relevant features when strong dependencies between features exist.
  • CFS is expected to perform well under moderate levels of interaction.
  • CFS tends to penalize noisy features.
  • CFS is heavily biased towards small feature subsets, leading to reduced accuracy in some cases.
  • CFS is not heavily dependent on the search method used.
  • CFS may be set to place more value on locally predictive features, even if these features don't show outstanding global predictive ability. If not set to account for locally predictive features, the bias of CFS towards small subsets may exclude these features.
  • CFS tends to do better than wrappers in small data sets, in part because it does not need to save part of the data set for testing.
  • Wrappers perform better than CFS when interactions are present.
  • a wrapper with forward selection can be used to detect pair-wise interactions, but backward elimination is needed to detect higher level interactions.
  • Backward searches make wrappers even slower.
  • Bi-directional search can be used for wrappers, starting from the subset chosen by the CFS algorithm. This smart approach can significantly reduce the amount of time needed by the wrapper to complete the search.
  • Bayesian network methods provide a useful and flexible probabilistic approach to inference.
  • the Bayes optimal classifier algorithm does more than apply the maximum a posteriori hypothesis to a new record in order to predict the probability of its classification (Friedman et al. (1997), Bayesian network classifiers. Machine Learning, 29:131-163). It also considers the probabilities from each of the other hypotheses obtained from the training set (not just the maximum a posteriori hypothesis) and uses these probabilities as weighting factors for future predictions. Therefore, future predictions are carried out using all of the hypotheses (i.e., all of the possible models) weighted by their posterior probabilities.
  • the naive Bayes classifier assigns the most probable classification to a record given its feature values. Calculating the full joint probability of the features would require a large data set and would be computationally intensive; the naive Bayes classifier avoids this by assuming that the features are independent given the target feature.
  • the naive Bayes classifier is part of a larger class of algorithms called Bayesian networks. Some of these Bayesian networks can relax the strong assumption made by the naive Bayes algorithm of independence between features.
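  • A minimal illustration of a naive Bayes classifier on categorical marker data is given below; the genotype encoding (0/1/2), the marker values, and the resistance labels are invented for the example, and scikit-learn's CategoricalNB is used simply as one readily available implementation of the algorithm described above.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# marker genotypes encoded as 0/1/2 (e.g. aa/Aa/AA); the class is a binary resistance score
X = np.array([[0, 2, 1], [0, 2, 2], [2, 0, 1], [2, 1, 0], [1, 2, 1], [2, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# CategoricalNB assumes the markers are conditionally independent given the class
model = CategoricalNB().fit(X, y)
print(model.predict(np.array([[0, 2, 0]])))        # most probable class
print(model.predict_proba(np.array([[0, 2, 0]])))  # posterior class probabilities
```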
  • a Bayesian network is a directed acyclic graph (DAG) with a conditional probability distribution for each node. It relies on the assumption that features are conditionally independent given the target feature (naive Bayes) or given their parents, which may include the target feature (Bayesian augmented network) or not (general Bayesian network).
  • the assumption of conditional independence is restricted to subsets of the features, and this leads to a set of conditional independence assumptions, together with a set of conditional probabilities.
  • the output reflects a description of the joint probability distribution of the features.
  • different search algorithms can be implemented using the package WEKA in each of these areas, and probability tables may be calculated by the simple estimator or by Bayesian model averaging (BMA).
  • one option is to use the global score metric-based algorithms. These algorithms rely on cross-validation performed with leave-one-out, k-fold, or cumulative cross-validation.
  • the leave-one-out method isolates one record, trains on the rest of the data set, and evaluates that isolated record (repeatedly, for each of the records).
  • the k-fold method splits the data into k parts, isolates one of these parts, trains with the rest of the data set, and evaluates the isolated set of records.
  • the cumulative cross-validation algorithm starts with an empty data set and adds record by record, updating the state of the network after each additional record, and evaluating the next record to be added according to the current state of the network.
  • an appropriate network structure found by one of these processes is considered as the structure that best fits the data, as determined by a global or a local score. It can also be considered as a structure that best encodes the conditional independencies between features; these independencies can be measured by Chi-squared tests or mutual information tests. Conditional independencies between the features are used to build the network. When the computational complexity is high, the classification may be performed by a subset of the features, determined by any subset selection method.
  • the target feature may be used as any other node (general Bayesian network) when finding dependencies; after that, it is isolated from the other features via its Markov blanket.
  • the Markov blanket isolates a node from being affected by any node outside its boundary, which is composed of the node's parents, its children, and the parents of its children.
  • the Markov blanket of the target feature is often sufficient to perform classification without a loss of accuracy and all of the other nodes may be deleted.
  • This method selects the features (i.e. the ones included in the Markov blanket) that should be used in the classification and reduces the risk of over-fitting the data by deleting all nodes that are outside the Markov blanket of the target feature.
  • instance-based algorithms are also suitable for model development.
  • Instance-based algorithms, also referred to as "lazy" algorithms, are characterized by generating a new model for each instance, instead of basing predictions on trees or networks generated (once) from a training set. In other words, they do not provide a general function that explains the target feature.
  • These algorithms store the entire training set in memory and build a model from a set of records similar to those being tested. This similarity is evaluated through nearest-neighbor or locally weighted methods, using Euclidean distances. Once a set of records is selected, the final model may be built using several different algorithms, such as naive Bayes. The resulting model is generally not designed to perform well when applied to other records. Because the training observations are stored explicitly, rather than in the form of a tree or network, information is never wasted when training instance-based algorithms.
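  • The nearest-neighbor flavor of this idea is illustrated below with scikit-learn's k-nearest-neighbors classifier as a stand-in for the instance-based algorithms discussed here; the marker encoding and labels are toy values. No global model is built: each prediction is derived from the k stored training records closest to the query.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy binary marker data and a binary class label
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]])
y = np.array([0, 0, 1, 1])

# predictions come from the 3 nearest stored records (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(np.array([[0, 1, 0]])))
```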
  • instance-based algorithms are useful for complex, multidimensional problems for which the computational demands of trees and networks exceed the available memory.
  • This approach avoids the problem of attempting to perform complexity reduction via selection of features to fit the demands of trees or networks.
  • this process may perform poorly when classifying a new instance, because all of the computations take place at the classification time. This is generally not a problem during applications in which one or a few instances are to be classified at a time.
  • these algorithms give similar importance to all of the features, without placing more weight on those that better explain the target feature. This may lead to selection of instances that are not actually closest to the instance being evaluated in terms of their relationship to the target feature.
  • Instance-based algorithms are robust to noise in data collection because instances get the most common assignment among their neighbors or an average (continuous case) of these neighbors, and these algorithms usually perform well with very large training sets.
  • support vector machines are used to model data sets for data mining purposes.
  • Support vector machines are an outgrowth of Statistical Learning Theory and were first described in 1992.
  • An important aspect of SVMs is that once the support vectors have been identified, the remaining observations can be removed from the calculations, thus greatly reducing the computational complexity of the problem.
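  • A brief illustration of this point, using scikit-learn's SVC on invented two-dimensional data, is shown below: after fitting, the decision boundary is defined entirely by the support vectors, which can be inspected directly.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
# only the support vectors are needed to define the separating hyperplane
print(svm.support_vectors_)
print(svm.predict(np.array([[0.1, 0.2], [0.8, 0.9]])))
```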
  • decision tree learning algorithms are suitable machine learning methods for modeling. These decision tree algorithms include ID3, Assistant, and C4.5. These algorithms have the advantage of searching through a large hypothesis space without many restrictions. They are often biased towards building small trees, a property that is sometimes desirable.
  • the resulting trees can usually be represented by a set of "if-then" rules; this property, which does not apply to other classes of algorithms such as instance-based algorithms, can improve human readability.
  • the classification of an instance occurs by scanning the tree from top to bottom and evaluating some feature at each node of the tree.
  • Different decision tree learning algorithms vary in terms of their capabilities and requirements; some work only with discrete features. Most decision tree algorithms also require the target feature to be binary while others can handle continuous target features. These algorithms are usually robust to errors in the determination of classes (coding) for each feature. Another relevant feature is that some of these algorithms can effectively handle missing values.
  • the Iterative Dichotomiser 3 (ID3) algorithm is a suitable decision tree algorithm. This algorithm uses "information gain" to decide which feature best explains the target by itself, and it places this feature at the top of the tree (i.e., at the root node). Next, a descendant is assigned for each class of the root node by sorting the training records according to the classes of the root node and finding the feature with the greatest information gain within each of these classes. This cycle is repeated for each newly added feature, and so on. This algorithm cannot "back-track" to reconsider its previous decisions, and this may lead to convergence to a local maximum. There are several extensions of the ID3 algorithm that perform "post-pruning" of the decision tree, which is a form of back-tracking.
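  • ID3 itself is not provided by scikit-learn, but the sketch below uses a decision tree with the entropy criterion as a rough stand-in (the marker encoding and labels are invented): each split maximizes information gain, so the most informative feature ends up at the root, in the spirit of the procedure described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0, 2, 1], [0, 2, 2], [2, 0, 1], [2, 1, 0], [1, 2, 1], [2, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# criterion="entropy" selects splits by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["M1", "M2", "M3"]))  # readable if-then rules
```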
  • the ID3 algorithm performs a "hill-climbing search" through the space of decision trees, starting from a simple hypothesis and progressing to more elaborate hypotheses. Because its hypothesis space contains all possible decision trees, and is therefore complete, it avoids the problem of choosing a hypothesis space that does not contain the target concept.
  • the ID3 algorithm outputs just one tree, not all reasonable trees.
  • Inductive bias can occur with the ID3 algorithm because it is a top-down, breadth-first algorithm. In other words, it considers all possible trees at a certain depth, chooses the best one, and then moves to the next depth. It prefers short trees over long trees, and by selecting the shortest tree at a certain depth it places features with highest information gain closest to the root.
  • a variation of the ID3 algorithm is the logistic model tree (LMT) (Landwehr et al. (2003), Logistic Model Trees. Proceedings of the 14th European Conference on Machine Learning, Cavtat-Dubrovnik, Croatia. Springer-Verlag).
  • This classifier implements logistic regression functions at the leaves. This algorithm deals with discrete target features, and can handle missing values.
  • C4.5 is a decision tree-generating algorithm based on the ID3 algorithm (Quinlan (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers). Some of its improvements include, for example, choosing an appropriate feature evaluation measure; handling training data with missing feature values; handling features with differing costs; and handling continuous features.
  • a Receiver Operating Characteristic (ROC) curve is built over several classification thresholds, which helps determine the best threshold for a given problem (the one that gives the best balance between the true positive rate and the false positive rate). Lower thresholds lead to higher false positive rates because of the increased ratio of false positives to true negatives (more of the negative records are assigned to the positive class).
  • the area under the ROC curve is a measure of the overall performance of a classifier, but the choice of the best classifier may be based on specific sections of that curve.
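  • The sketch below illustrates these quantities with scikit-learn's ROC utilities on invented labels and scores: each threshold yields one (false positive rate, true positive rate) point, and the area under the resulting curve summarizes overall performance.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# true classes and the classifier's scores (e.g. predicted class probabilities)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("area under the ROC curve:", auc(fpr, tpr))
for f, t, th in zip(fpr, tpr, thresholds):
    # lower thresholds move down the list: TPR rises, but so does FPR
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```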
  • Cross-validation techniques are methods by which a particular algorithm, or a particular set of algorithms, is chosen to provide optimal performance for a given data set.
  • Cross-validation techniques are used herein to select a particular machine learning algorithm during model development, for example. When several algorithms are available for implementation, it is usually desirable to choose the one that is expected to have the best performance in the future. Cross-validation is usually the methodology of choice for this task.
  • Cross-validation is based on first separating part of the training data, then training with the rest of the data, and finally evaluating the performance of the algorithm on the separated data set.
  • Cross-validation techniques are preferred over residual evaluation because residual evaluation is not informative as to how an algorithm will perform when applied to a new data set.
  • one variant of cross-validation, the holdout method, is based on splitting the data in two, training on the first subset, and testing on the second subset. It takes about the same amount of time to compute as the residual method, and it is preferred when the data set is large enough. The performance of this method may vary depending on how the data set is split into subsets.
  • a k-fold cross-validation method is an improvement over the holdout method.
  • the data set is divided into k subsets, and the holdout method is repeated k times.
  • the average error across the k trials is then computed.
  • Each record is part of the testing set once, and is part of the training set k-1 times. This method is less sensitive to the way in which the data set is divided, but the computational cost is k times greater than with the holdout method.
  • the leave-one-out cross-validation method is similar to k-fold cross-validation.
  • the training is performed using N-1 records (where N is the total number of records), and the testing is performed using only one record at a time.
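  • Both procedures are illustrated below with scikit-learn's cross-validation utilities on synthetic data (the estimator and data are placeholders): k-fold uses each record for testing exactly once and for training k-1 times, while leave-one-out fits N models, each tested on a single held-out record.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(50, 5)
y = np.random.randint(0, 2, size=50)
model = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", kfold_scores.mean())

# leave-one-out cross-validation (N models, one held-out record each)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out mean accuracy:", loo_scores.mean())
```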
  • Locally weighted learners reduce the running time of these algorithms to levels similar to that of residual evaluation.
  • the random sample technique is another option for testing, in which a reasonably sized sample from the data set (e.g., more than 30), is used for testing, with the rest of the data set being used for training.
  • the advantage of using random samples for testing is that sampling can be repeated any number of times, which may result in a reduction of the confidence interval of the predictions.
  • Cross-validation techniques have the advantage that records in the testing sets are independent across testing sets.
  • the M5P algorithm is a model tree algorithm suitable for continuous and discrete target features. It builds decision trees with regression functions instead of terminal class values. Continuous features may be handled directly, without transformation to discrete features. It uses a conditional class probability function to deal with discrete classes. The class whose model tree generates the greatest approximate probability value is chosen as the predicted class.
  • the M5P algorithm can represent any piecewise linear approximation of an unknown function. M5P examines all possible tests and chooses the one that maximizes the expected error reduction. M5P then prunes this tree back by replacing sub-trees with linear regression models wherever the latter have lower estimated error. The estimated error is the average absolute difference between predicted and actual values for all the instances at a node.
  • In an embodiment of modeling with decision tree algorithms, alternating decision trees (ADTrees) are used herein.
  • the ADTree learning algorithm is based on the AdaBoost boosting technique.
  • the alternating decision tree algorithm tends to build smaller trees with simpler rules, and is therefore more readily interpretable. It also associates real values with each of the nodes, which allows each node to be evaluated independently of the other nodes.
  • the multiple paths followed by a record after a prediction node make this algorithm more robust to missing values because all of the alternative paths are followed in spite of the one ignored path.
  • this algorithm provides a measure of confidence in each classification, called "classification margin", which in some applications is as important as the classification itself. As with other decision trees, this algorithm is also very robust with respect to multicollinearity among features.
  • Plants and animals are often propagated on the basis of certain desirable features such as grain yield, percent body fat, oil profile, and resistance to diseases.
  • One of the objectives of a plant or animal improvement program is to identify individuals for propagation such that the desirable features are expressed more frequently or more prominently in successive generations. Learning involves, but is not restricted to, changing the practices, activities, or behaviors involved in identifying individuals for propagation such that the extent of the increase in the expression of the desirable feature is greater or the cost of identifying the individuals to propagate is lower.
  • data can be obtained for one or more additional features that may or may not have an obvious relationship to the desirable features.
  • Elite maize lines containing high and low levels of resistance to a pathogen were identified through field and greenhouse screening.
  • a line that demonstrates high levels of resistance to this pathogen was used as a donor and crossed to a susceptible elite line.
  • the offspring were then backcrossed to the same susceptible elite line.
  • the resulting population was crossed to the haploid inducer stock and chromosome doubling technology was used to develop 191 fixed inbred lines.
  • the level of resistance to the pathogen was evaluated for each line in two replications using field screening methodologies. Forty-four replications of the susceptible elite line were also evaluated using field screening methodologies. Genotype data was generated for all 191 double haploid lines, the susceptible elite line, and the resistant donor using 93 polymorphic SSR markers.
  • the final dataset contained 426 samples that were divided into two groups based on the field screening results. Plants with field screening scores ranging from 1 to 4 comprised the susceptible group, while plants with field screening scores ranging from 5 to 9 comprised the resistant group. For our analyses, the susceptible group was labeled with "0" and the resistant group was labeled with "1".
  • the data set was analyzed using a three step process consisting of: (a) Detecting association rules; (b) Creating new features based on the findings of step (a) and adding these features to the data set; (c) Developing a classification model for a target feature without the features from step (b) and another model with the features from step (b). A description of the application of each of these steps to this data set follows.
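  • A compact sketch of this three-step workflow is given below. It is only an illustration: the marker names, the example rule, and the genotype table are invented, and scikit-learn's DecisionTreeClassifier is used as a stand-in for the REPTree implementation referenced in this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# (a) a rule found by an ARM algorithm, e.g. "M17=A and M52=B -> resistant";
#     the antecedent is written here as a dict of marker -> required genotype
rule = {"M17": "A", "M52": "B"}

# toy genotype table with a binary target (0 = susceptible, 1 = resistant)
data = pd.DataFrame({
    "M17":   ["A", "A", "B", "B", "A", "B"],
    "M52":   ["B", "B", "A", "B", "B", "A"],
    "label": [1, 1, 0, 0, 1, 0],
})

# (b) create a new binary feature: 1 when a sample satisfies the rule antecedent
data["rule_1"] = (data[list(rule)] == pd.Series(rule)).all(axis=1).astype(int)

# (c) build a classifier without and with the rule-derived feature and compare
y = data["label"]
X_orig = pd.get_dummies(data[["M17", "M52"]]).astype(int)
X_aug = pd.concat([X_orig, data[["rule_1"]]], axis=1)

model = DecisionTreeClassifier(random_state=0)
print("without rule feature:", cross_val_score(model, X_orig, y, cv=3).mean())
print("with rule feature:   ", cross_val_score(model, X_aug, y, cv=3).mean())
```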
  • the association rule detected by the DDPM algorithm included the following features:
  • the five association rules detected by the CARPENTER algorithm included the following features:
  • the REPTree algorithm was applied to the data set.
  • Table 2 shows that after adding the new features to the data set, the mean absolute error decreased (i.e. the new features improved the classification accuracy).
  • Table 3 shows the confusion matrix resulting from a REPTree model using the original data set without the new features from step (b).
  • Table 4 shows the confusion matrix resulting from a REPTree model using the original data set and the new features from step (b). The addition of the new features from step (b) led to an increase in the number of correctly classified records for both classes of the target feature.


Description

  • This application claims priority based on provisional application 61/221,804, which was filed in the U.S. Patent and Trademark Office on June 30, 2009.
  • FIELD
  • The disclosure relates to the use of one or more association rule mining algorithms to mine data sets containing features created from at least one plant or animal-based molecular genetic marker, find association rules and utilize features created from these association rules for classification or prediction.
  • BACKGROUND
  • One of the main objectives of plant and animal improvement is to obtain new cultivars that are superior in terms of desirable target features such as yield, grain oil content, disease resistance, and resistance to abiotic stresses.
  • A traditional approach to plant and animal improvement is to select individual plants or animals on the basis of their phenotypes, or the phenotypes of their offspring. The selected individuals can then, for example, be subjected to further testing or become parents of future generations. It is beneficial for some breeding programs to have predictions of performance before phenotypes are generated for a certain individual or when only a few phenotypic records have been obtained for that individual.
  • Some key limitations of methods for plant and animal improvement that rely only on phenotypic selection are the cost and speed of generating such data, and that there is a strong impact of the environment (e.g., temperature, management, soil conditions, day light, irrigation conditions) on the expression of the target features.
  • Recently, the development of molecular genetic markers has opened the possibility of using DNA-based features of plants or animals in addition to their phenotypes, environmental information, and other types of features to accomplish many tasks, including the tasks described above.
  • Some important considerations for a data analysis method for this type of dataset are the ability to mine historical data, to be robust to multicollinearity, and to account for interactions between the features included in these datasets (e.g. epistatic effects and genotype by environment interactions). The ability to mine historical data avoids the requirement of highly structured data for data analysis. Methods that require highly structured data from planned experiments are usually resource intensive in terms of human resources, money, and time. The strong environmental effect on the expression of many of the most important traits in economically important plants and animals requires that such experiments be large, carefully designed, and carefully controlled. The multicollinearity limitation refers to a situation in which two or more features (or feature subsets) are linearly correlated to one another. Multicollinearity may lead to a less precise estimation of the impact of a feature (or feature subset) on a target feature and, consequently, to biased predictions.
  • A framework based on mining association rules and using features created from these rules to improve prediction or classification is suitable to address the three considerations mentioned above. Preferred methods for classification or prediction are machine learning methods. Association rules can therefore be used for classification or prediction for one or more target features.
  • The approach described in the present disclosure relies on implementing one or more machine learning-based association rule mining algorithms to mine datasets containing at least one plant or animal molecular genetic marker, create features based on the association rules found, and use these features for classification or prediction of target features.
  • SUMMARY
  • In a first aspect, there is provided a method for prediction or classification of one or more target features in plants, the method comprising:
    • determining the genotype of plants in a plant population for at least one plant-based molecular genetic marker;
    • providing a data set comprising a set of features, wherein at least one of the features in the data set comprises the at least one molecular genetic marker;
    • determining at least one association rule from the data set utilizing a computer and one or more association rule mining algorithms;
    • utilizing the association rule to create one or more new features;
    • adding the new feature to the data set;
    • utilizing one or more machine learning algorithms for developing a plurality of models for prediction or classification of a desired target feature using at least one new feature;
    • utilizing cross-validation to compare the algorithms and sets of parameter values and selecting an algorithm for accurate prediction or classification of desired target features for the data set;
    • utilizing the selected algorithm for prediction of a value of a target feature in one or more members of the plant population; and
    • selecting at least one member of the plant population having a desired predicted value of the target feature.
  • Embodiments relate only to claimed combinations of features. In the following, when the term "embodiment" relates to unclaimed combinations of features, said term has to be understood as referring to examples of the present invention. Methods to mine data sets containing features created from at least one plant-based molecular genetic marker to find at least one association rule and to then use features created from these association rules for classification or prediction are disclosed. Some of these methods are suitable for classification or prediction with datasets containing plant and animal features.
  • Steps to mine a data set with at least one feature created from at least one plant-based molecular genetic marker, to find at least one association rule, and utilizing features created from these association rules for classification or prediction for one or more target features include:
    1. (a) detecting association rules;
    2. (b) creating new features based on the findings of step (a) and adding these features to the data set;
    3. (c) model development for one or more target features with at least one feature created using the features created on step (b);
    4. (d) selecting a subset of features from features in the data set; and
    5. (e) detecting association rules from spatial and temporal associations using self-organizing maps (see Teuvo Kohonen (2000), Self-Organizing Map, Springer, 3rd edition.)
  • Described herein is a method of mining a data set with one or more features, wherein the method includes using at least one plant-based molecular marker to find at least one association rule and utilizing features created from these association rules for classification or prediction, the method comprising the steps of: (a) detecting association rules, (b) creating new features based on the findings of step (a) and adding these features to the data set; (c) selecting a subset of features from features in the data set.
  • In an embodiment, association rule mining algorithms are utilized for classification or prediction with one or more machine learning algorithms selected from: feature evaluation algorithms, feature subset selection algorithms, Bayesian networks (see Cheng and Greiner (1999), Comparing Bayesian network classifiers. Proceedings UAI, pp. 101-107.), instance-based algorithms, support vector machines (see e.g., Shevade et al., (1999), Improvements to SMO Algorithm for SVM Regression. Technical Report CD-99-16, Control Division Dept of Mechanical and Production Engineering, National University of Singapore; Smola et al., (1998). A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series - NC2-TR-1998-030; Scholkopf, (1998). SVMs - a practical consequence of learning theory. IEEE Intelligent Systems. IEEE Intelligent Systems 13.4: 18-21; Boser et al., (1992), A Training Algorithm for Optimal Margin Classifiers V 144-52; and Burges (1998), A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (1998): 121-67), vote algorithm, cost-sensitive classifier, stacking algorithm, classification rules, and decision tree algorithms (see Witten and Frank (2005), Data Mining: Practical machine learning Tools and Techniques. Morgan Kaufmann, San Francisco, Second Edition.).
  • Suitable association rule mining algorithms include, but are not limited to, the Apriori algorithm (see Witten and Frank (2005), Data Mining: Practical machine learning Tools and Techniques. Morgan Kaufmann, San Francisco, Second Edition), the FP-growth algorithm, association rule mining algorithms that can handle a large number of features, colossal pattern mining algorithms, direct discriminative pattern mining algorithms, decision trees, rough sets (see Zdzislaw Pawlak (1992), Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Print on Demand), and the Self-Organizing Map (SOM) algorithm.
  • In an embodiment, suitable association rule mining algorithms for handling large numbers of features include, but are not limited to, CLOSET+ (see Wang et. al (2003), CLOSET+: Searching for best strategies for mining frequent closed itemsets, ACM SIGKDD 2003, pp. 236-245), CHARM (see Zaki et. al (2002), CHARM: An efficient algorithm for closed itemset mining, SIAM 2002, pp. 457-473), CARPENTER (see Pan et. al (2003), CARPENTER: Finding Closed Patterns in Long Biological Datasets, ACM SIGKDD 2003, pp. 637-642), and COBBLER (see Pan et al (2004), COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM 2004, pp. 21).
  • In an embodiment, suitable algorithms for finding direct discriminative patterns include, but are not limited to, DDPM (see Cheng et. al (2008), Direct Discriminative Pattern Mining for Effective Classification, ICDE 2008, pp. 169-178), HARMONY (see Jiyong et. al (2005), HARMONY: Efficiently Mining the Best Rules for Classification, SIAM 2005, pp. 205-216), RCBT (see Cong et. al (2005), Mining top-K covering rule groups for gene expression data, ACM SIGMOD 2005, pp. 670-681), CAR (see Kianmehr et al (2008), CARSVM: A class association rule-based classification framework and its application in gene expression data, Artificial Intelligence in Medicine 2008, pp. 7-25), and PATCLASS (see Cheng et. al (2007), Discriminative Frequent Pattern Analysis for Effective Classification, ICDE 2007, pp. 716-725).
  • In an embodiment, suitable algorithms for finding colossal patterns include, but are not limited to, the Pattern Fusion algorithm (see Zhu et. al (2007), Mining Colossal Frequent Patterns by Core Pattern Fusion, ICDE 2007, pp. 706-715).
  • In an embodiment, a suitable feature evaluation algorithm is selected from the group of information gain algorithm, Relief algorithm (see e.g., Robnik-Sikonja and Kononenko (2003), Theoretical and empirical analysis of Relief and ReliefF. Machine learning, 53:23-69; and Kononenko (1995). On biases in estimating multi-valued attributes. In IJCAI95, pages 1034-1040), ReliefF algorithm (see e.g., Kononenko, (1994), Estimating attributes: analysis and extensions of Relief. In: L. De Raedt and F. Bergadano (eds.): Machine learning: ECML-94. 171-182, Springer Verlag.), RReliefF algorithm, symmetrical uncertainty algorithm, gain ratio algorithm, and ranker algorithm.
  • In an embodiment, a suitable machine learning algorithm is a feature subset selection algorithm selected from the group of correlation-based feature selection (CFS) algorithm (see Hall, M. A.. 1999. Correlation-based feature selection for Machine Learning. Ph.D. thesis. Department of Computer Science - The University of Waikato, New Zealand.), and the wrapper algorithm in association with any other machine learning algorithm. These feature subset selection algorithms may be associated with a search method selected from the group of greedy stepwise search algorithm, best first search algorithm, exhaustive search algorithm, race search algorithm, and rank search algorithm.
  • In an embodiment, a suitable machine learning algorithm is a Bayesian network algorithm including the naive Bayes algorithm.
  • In an embodiment, a suitable machine learning algorithm is an instance-based algorithm selected from the group of instance-based 1 (IB1) algorithm, instance-based k-nearest neighbor (IBK) algorithm, KStar, lazy Bayesian rules (LBR) algorithm, and locally weighted learning (LWL) algorithm.
  • In an embodiment, a suitable machine learning algorithm for classification or prediction is a support vector machine algorithm. In a preferred embodiment, a suitable machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization (SMO) algorithm. In a preferred embodiment, the machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization for regression (SMOReg) algorithm (see e.g., Shevade et al., (1999), Improvements to SMO Algorithm for SVM Regression. Technical Report CD-99-16, Control Division Dept of Mechanical and Production Engineering, National University of Singapore; Smola & Scholkopf (1998), A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series - NC2-TR-1998-030).
  • In an embodiment, a suitable machine learning algorithm is a self-organizing map (Self-organizing maps, Teuvo Kohonen, Springer).
  • In an embodiment, a suitable machine learning algorithm is a decision tree algorithm selected from the group of logistic model tree (LMT) algorithm, alternating decision tree (ADTree) algorithm (see Freund and Mason (1999), The alternating decision tree learning algorithm. Proc. Sixteenth International Conference on machine learning, Bled, Slovenia, pp. 124-133), M5P algorithm (see Quinlan (1992), Learning with continuous classes, in Proceedings AI'92, Adams & Sterling (Eds.), World Scientific, pp. 343-348; Wang and Witten (1997), Inducing Model Trees for Continuous Classes. 9th European Conference on machine learning, pp.128-137), and REPTree algorithm (Witten and Frank, 2005).
  • In an embodiment, a target feature is selected from the group of a continuous target feature and a discrete target feature. A discrete target feature may be a binary target feature.
  • In an embodiment, at least one plant-based molecular genetic marker is from a plant population and the plant population may be an unstructured plant population. The plant population may include inbred plants or hybrid plants or a combination thereof. In an embodiment, a suitable plant population is selected from the group of maize, soybean, sorghum, wheat, sunflower, rice, canola, cotton, and millet. In an embodiment, the plant population may include between about 2 and about 100,000 members.
  • In an embodiment, the number of molecular genetic markers may range from about 1 to about 1,000,000 markers. The features may include molecular genetic marker data that includes, but is not limited to, one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an amplified fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof.
  • In an embodiment, the features may also include one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an amplified fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof, in conjunction with one or more phenotypic measurements, microarray data of expression levels of RNAs including mRNA, micro RNA (miRNA), non-coding RNA (ncRNA), analytical measurements, biochemical measurements, or environmental measurements, or a combination thereof as features.
  • A suitable target feature in a plant population includes one or more numerically representable and/or quantifiable phenotypic traits including disease resistance, yield, grain yield, yarn strength, protein composition, protein content, insect resistance, grain moisture content, grain oil content, grain oil quality, drought resistance, root lodging resistance, plant height, ear height, grain protein content, grain amino acid content, grain color, and stalk lodging resistance.
  • In an embodiment, a genotype of the sample plant population for one or more molecular genetic markers is experimentally determined by direct DNA sequencing.
  • In an embodiment, a method to select inbred lines, select hybrids, rank hybrids, rank hybrids for a certain geography, select the parents of new inbred populations, find segments for introgression into elite inbred lines, or any combination thereof is completed using any combination of the steps (a) - (e) above.
  • In an embodiment, detecting association rules includes detecting spatial and temporal associations using self-organizing maps.
  • In an embodiment, at least one feature of a model for prediction or classification is the subset of features selected earlier using a feature evaluation algorithm.
  • In an embodiment, cross-validation is used to compare algorithms and sets of parameter values. In an embodiment, receiver operating characteristic (ROC) curves are used to compare algorithms and sets of parameter values.
  • In an embodiment, one or more features are derived mathematically or computationally from other features.
  • The invention relates to a method of mining a data set that includes at least one plant-based molecular genetic marker to find at least one association rule, and utilizing features from these association rules for classification or prediction for one or more target features, wherein the method includes the steps of:
    1. (a) detecting association rules;
      1. (i) wherein association rules, including spatial and temporal associations, are detected using self-organizing maps.
    2. (b) creating new features based on the findings of step (a) and adding these features to the data set;
    3. (c) developing a model for prediction or classification for one or more target features with at least one feature created at step (b);
    wherein the steps (a), (b), and (c) may be preceded by the step of selecting a subset of features from features in the data set. The invention is defined in the first aspect.
  • In an embodiment the results of the method are applied to:
    1. (a) predict hybrid performance,
    2. (b) predict hybrid performance across various geographical locations;
    3. (c) select inbred lines;
    4. (d) select hybrids;
    5. (e) rank hybrids for certain geographies;
    6. (f) select the parents of new inbred populations;
    7. (g) find DNA segments for introgression into elite inbred lines;
    8. (h) or any combination of (a) - (g).
  • In an embodiment, a data set with at least one plant-based molecular genetic marker is used to find at least one association rule and features created from these association rules are used for classification or prediction and selecting at least one plant from the plant population for one or more target features of interest.
  • In an embodiment, prior knowledge, comprising preliminary research, quantitative studies of plant genetics, gene networks, sequence analyses, or any combination thereof, is considered.
  • In an embodiment, the methods described above are modified to include the following steps:
    1. (a) reducing dimensionality by replacing the original features with a combination of one or more of the features included in one or more of the association rules;
    2. (b) mining discriminative and essential frequent patterns via a model-based search tree.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • Figure 1: Area under the ROC curve, before and after adding the new features from step (b).
  • DETAILED DESCRIPTION
  • Association rule mining algorithms provide the framework and the scalability needed to find relevant interactions on very large datasets.
  • Methods disclosed herein are useful for identifying multi-locus interactions affecting phenotypes. Methods disclosed herein are useful for identifying interactions between molecular genetic markers, haplotypes and environmental factors. New features created based on these interactions are useful for classification or prediction.
  • The robustness of some of these methods with respect to multicollinearity problems and missing values for features, as well as the capacity of these methods to describe intricate dependencies between features, makes such methods suitable for analysis of large, complex datasets that include features based on molecular genetic markers.
  • WEKA (Waikato Environment for Knowledge Analysis, developed at the University of Waikato, New Zealand) is a suite of machine learning software written in the Java programming language that implements numerous machine learning algorithms from various learning paradigms. This machine learning workbench facilitates the implementation of machine learning algorithms and supports algorithm development or adaptation of data mining and computational methods. WEKA also provides tools to appropriately test the performance of each algorithm and set of parameter values through methods such as cross-validation and ROC (Receiver Operating Characteristic) curves. WEKA was used to implement machine learning algorithms for modeling; however, one of ordinary skill in the art would appreciate that other machine learning software may be used to practice the present invention.
  • Moreover, data mining using the approaches described herein provides a flexible, scalable framework for modeling with datasets that include features based on molecular genetic markers. This framework is flexible because it includes tests (i.e. cross-validation and ROC curves) to determine which algorithm and specific parameter settings should be used for the analysis of a data set. This framework is scalable because it is suitable for very large datasets.
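  • As an illustration of this comparison step, the following Python sketch (not the WEKA workbench itself) scores two candidate learners by 10-fold cross-validated area under the ROC curve; the simulated marker matrix, the target feature, and the scikit-learn learners are hypothetical stand-ins for whichever algorithms are being compared:
```python
# Hypothetical comparison of two candidate learners by 10-fold cross-validated
# ROC AUC; the genotype matrix X and binary target y are simulated placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)  # 200 lines x 50 marker scores (0/1/2)
y = (X[:, 0] + X[:, 7] > 2).astype(int)                # toy binary target feature

candidates = {
    "support vector machine": SVC(kernel="rbf", random_state=0),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f} (sd = {auc.std():.3f})")
```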
  • The method of the invention as described herein is used to mine data sets containing features created from at least one plant-based molecular genetic marker to find at least one association rule and to then use features created from these association rules for classification or prediction. The method is suitable for classification or prediction with datasets containing plant and animal features.
  • In an embodiment, steps to mine a data set with at least one feature created from at least one plant-based molecular genetic marker, to find at least one association rule, and to utilize features created from these association rules for classification or prediction for one or more target features include the following (an illustrative sketch of steps (a)-(c) follows this list):
    1. (a) detecting association rules;
    2. (b) creating new features based on the findings of step (a) and adding these features to the data set;
    3. (c) developing a model for prediction or classification for one or more target features with at least one feature created at step (b);
    4. (d) selecting a subset of features from features in the data set; and
    5. (e) detecting association rules from spatial and temporal associations using self-organizing maps.
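  • A minimal Python sketch of steps (a)-(c) is given below; the mined rule, the simulated marker matrix, and the logistic-regression model are hypothetical placeholders, and the rule-mining step itself is assumed to have already been run:
```python
# Minimal sketch of steps (a)-(c): turn a mined association rule into a new
# binary feature, add it to the marker matrix, and develop a model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 40))                   # genotype scores for 40 markers
y = ((X[:, 3] == 2) & (X[:, 11] == 2)).astype(int)       # toy target feature

# (a) a mined rule, e.g. {marker_3 == 2, marker_11 == 2}; assumed given here
rule = {3: 2, 11: 2}

# (b) create a new feature: 1 if an instance satisfies every condition of the rule
rule_feature = np.all([X[:, m] == allele for m, allele in rule.items()], axis=0)
X_augmented = np.column_stack([X, rule_feature.astype(int)])

# (c) develop a classification model and compare cross-validated performance
#     before and after adding the rule-derived feature
model = LogisticRegression(max_iter=1000)
print("AUC before:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
print("AUC after :", cross_val_score(model, X_augmented, y, cv=5, scoring="roc_auc").mean())
```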
  • In an embodiment, association rule mining algorithms are utilized for classification or prediction with one or more machine learning algorithms selected from: feature evaluation algorithms, feature subset selection algorithms, Bayesian networks, instance-based algorithms, support vector machines, vote algorithm, cost-sensitive classifier, stacking algorithm, classification rules, and decision tree algorithms.
  • Suitable association rule mining algorithms include, but are not limited to, the Apriori algorithm, the FP-growth algorithm, association rule mining algorithms that can handle a large number of features, colossal pattern mining algorithms, direct discriminative pattern mining algorithms, decision trees, rough sets, and the Self-Organizing Map (SOM) algorithm.
  • In an embodiment, suitable association rule mining algorithms for handling large numbers of features include, but are not limited to, CLOSET+, CHARM, CARPENTER, and COBBLER.
  • In an embodiment, suitable algorithms for finding direct discriminative patterns include, but are not limited to, DDPM, HARMONY, RCBT, CAR, and PATCLASS.
  • In an embodiment, a suitable algorithm for finding colossal patterns includes, but is not limited to, the Pattern Fusion algorithm.
  • In an embodiment, a suitable machine learning algorithm is a feature subset selection algorithm selected from the group of correlation-based feature selection (CFS) algorithm, and the wrapper algorithm in association with any other machine learning algorithm. These feature subset selection algorithms may be associated with a search method selected from the group of greedy stepwise search algorithm, best first search algorithm, exhaustive search algorithm, race search algorithm, and rank search algorithm.
  • In an embodiment, a suitable machine learning algorithm is a Bayesian network algorithm including the naive Bayes algorithm.
  • In an embodiment, a suitable machine learning algorithm is an instance-based algorithm selected from the group of instance-based 1 (IB1) algorithm, instance-based k-nearest neighbor (IBK) algorithm, KStar, lazy Bayesian rules (LBR) algorithm, and locally weighted learning (LWL) algorithm.
  • In an embodiment, a suitable machine learning algorithm for classification or prediction is a support vector machine algorithm. In a preferred embodiment, a suitable machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization (SMO) algorithm. In a preferred embodiment, the machine learning algorithm is a support vector machine algorithm that uses the sequential minimal optimization for regression (SMOReg) algorithm.
  • In an embodiment, a suitable machine learning algorithm is a self-organizing map.
  • In an embodiment, a suitable machine learning algorithm is a decision tree algorithm selected from the group of logistic model tree (LMT) algorithm, alternating decision tree (ADTree) algorithm, M5P algorithm, and REPTree algorithm.
  • In an embodiment, a target feature is selected from the group of a continuous target feature and a discrete target feature. A discrete target feature may be a binary target feature.
  • In an embodiment, at least one plant-based molecular genetic marker is from a plant population and the plant population may be an unstructured plant population. The plant population may include inbred plants or hybrid plants or a combination thereof. In an embodiment, a suitable plant population is selected from the group of maize, soybean, sorghum, wheat, sunflower, rice, canola, cotton, and millet. In an embodiment, the plant population may include between about 2 and about 100,000 members.
  • In an embodiment, the number of molecular genetic markers may range from about 1 to about 1,000,000 markers. The features may include molecular genetic marker data that includes, but is not limited to, one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof.
  • In an embodiment, the features may also include one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof, in conjunction with one or more phenotypic measurements, microarray data, analytical measurements, biochemical measurements, or environmental measurements or a combination thereof as features.
  • A suitable target feature in a plant population includes one or more numerically representable phenotypic traits including disease resistance, yield, grain yield, yarn strength, protein composition, protein content, insect resistance, grain moisture content, grain oil content, grain oil quality, drought resistance, root lodging resistance, plant height, ear height, grain protein content, grain amino acid content, grain color, and stalk lodging resistance.
  • In an embodiment, a genotype of the sample plant population for the one or more molecular genetic markers is experimentally determined by direct DNA sequencing.
  • In an embodiment, a method to select inbred lines, select hybrids, rank hybrids, rank hybrids for a certain geography, select the parents of new inbred populations, find segments for introgression into elite inbred lines, or any combination thereof is completed using any combination of the steps (a) - (e) above.
  • In an embodiment, detecting association rules includes detecting spatial and temporal associations using self-organizing maps.
  • In an embodiment, at least one feature of a model for prediction or classification is the subset of features selected earlier using a feature evaluation algorithm.
  • In an embodiment, cross-validation is used to compare algorithms and sets of parameter values. In an embodiment, receiver operating characteristic (ROC) curves are used to compare algorithms and sets of parameter values.
  • In an embodiment, one or more features are derived mathematically or computationally from other features.
  • In an embodiment the results of the methods are applied to:
    1. (a) predict hybrid performance,
    2. (b) predict hybrid performance across various geographical locations;
    3. (c) select inbred lines;
    4. (d) select hybrids;
    5. (e) rank hybrids for certain geographies;
    6. (f) select the parents of new inbred populations;
    7. (g) find DNA segments for introgression into elite inbred lines;
    8. (h) or any combination of (a) - (g).
  • In an embodiment a data set with at least one plant-based molecular genetic marker is used to find at least one association rule and features created from these association rules are used for classification or prediction and selecting at least one plant from the plant population for one or more target features of interest.
  • In an embodiment, prior knowledge, comprising preliminary research, quantitative studies of plant genetics, gene networks, sequence analyses, or any combination thereof, is considered.
  • In an embodiment, the methods described above are modified to include the following steps:
    1. (a) reducing dimensionality by replacing the original features with a combination of one or more of the features included in one or more of the association rules;
    2. (b) mining discriminative and essential frequent patterns via a model-based search tree.
  • In an embodiment, feature evaluation algorithms, such as information gain, symmetrical uncertainty, and the Relief family of algorithms, are suitable algorithms. These algorithms are capable of evaluating all features together, instead of one feature at a time. Some of these algorithms are robust to biases, missing values, and collinearity problems. The Relief family of algorithms provides tools capable of accounting for deep-level interactions, but requires reduced collinearity between features in the dataset.
  • In an embodiment, subset selection techniques are applied through algorithms such as the CFS subset evaluator. Subset selection techniques may be used for complexity reduction by eliminating redundant, distracting features and retaining a subset capable of properly explaining the target feature. The elimination of these distracting features generally increases the performance of modeling algorithms when evaluated using methods such as cross-validation and ROC curves. Certain classes of algorithms, such as the instance-based algorithms, are known to be very sensitive to distracting features, and others such as the support vector machines are moderately affected by distracting features. Reducing complexity by generating new features based on existing features also often leads to increased predictive performance of machine learning algorithms.
  • In an embodiment, filter and wrapper algorithms can be used for feature subset selection. To perform feature subset selection using filters, it is usual to associate an efficient search method (e.g. greedy stepwise, best first, or race search) for finding the best subset of features (exhaustive search may not always be computationally feasible) with a merit formula (e.g. the CFS subset evaluator). The CFS subset evaluator accounts for the level of redundancy within the subset while not overlooking locally predictive features. Besides complexity reduction to support modeling, machine learning-based subset selection techniques may also be used to select a subset of features that appropriately explains the target feature while having a low level of redundancy between the features included in the subset. One purpose of subset selection approaches is to reduce waste during future data collection, manipulation, and storage efforts by focusing only on the subset found to properly explain the target feature. The machine learning techniques used for complexity reduction described herein can be compared using cross-validation and ROC curves, for example, applied to different combinations of subset selection algorithms and modeling algorithms; the feature subset selection algorithm with the best performance may then be selected for the final analysis. To run the cross-validation during the subset selection and modeling steps, multiple computers running a parallelized version of a machine learning software package (e.g. WEKA) may be used.
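  • A rough Python sketch of the merit heuristic behind a CFS-style subset evaluator is given below; CFS proper uses symmetrical uncertainty on discretized features, so the Pearson correlations used here are only a stand-in, and the data are hypothetical:
```python
# CFS-style merit heuristic: reward subsets whose features correlate with the
# target while penalizing redundancy among the features themselves.
import numpy as np
from itertools import combinations

def cfs_merit(X_subset, y):
    k = X_subset.shape[1]
    # mean absolute feature-target correlation
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1]) for j in range(k)])
    # mean absolute feature-feature correlation (redundancy term)
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
                        for i, j in combinations(range(k), 2)])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# usage on a hypothetical numeric encoding of three markers and a yield target
X = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1], [2, 2, 2], [0, 0, 1]], dtype=float)
y = np.array([5.0, 6.1, 7.2, 8.3, 5.5])
print(cfs_merit(X[:, [0, 2]], y))
```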
  • An aspect of the modeling methods disclosed herein is that, because a single algorithm may not always be the best option for modeling every data set, the framework presented herein uses cross-validation techniques, ROC curves, and precision and recall to choose the best algorithm for each data set from various options within the field of machine learning. In an embodiment, several algorithms and parameter settings may be compared using cross-validation, ROC curves, and precision and recall during model development. Several machine learning algorithms are robust to multicollinearity problems (allowing modeling with a large number of features), robust to missing values, and able to account for deep-level interactions between features without over-fitting the data.
  • In an embodiment, suitable machine learning algorithms for modeling include support vector machines, such as SMOReg; decision trees, such as M5P, REPTree, and ADTree; Bayesian networks; and instance-based algorithms. Trees generated by the M5P, REPTree, and ADTree algorithms grow by focusing on reducing the variance of the target feature in the subset of samples assigned to each newly created node. The M5P algorithm is usually used to handle continuous target features, the ADTree algorithm is usually used to handle binary (or binarized) target features, and the REPTree algorithm may be used to handle both continuous and discrete target features.
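  • The following Python sketch illustrates the variance-reduction criterion such regression-tree learners use to score a candidate split; the toy yield values and the split itself are hypothetical:
```python
# Variance-reduction split criterion: the drop in target-feature variance
# achieved by partitioning the samples at a node into two child nodes.
import numpy as np

def variance_reduction(y_node, left_mask):
    y_left, y_right = y_node[left_mask], y_node[~left_mask]
    n = len(y_node)
    weighted_child_var = (len(y_left) / n) * np.var(y_left) + \
                         (len(y_right) / n) * np.var(y_right)
    return np.var(y_node) - weighted_child_var

# e.g. score a split of grain yield on a hypothetical "marker_5 == 2" condition
y = np.array([6.1, 7.3, 5.9, 8.0, 6.7, 7.9])
split = np.array([True, False, True, False, True, False])
print(variance_reduction(y, split))
```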
  • An aspect of the machine learning methods disclosed herein is that the algorithms used herein may not require highly structured data sets, unlike some methods based strictly on statistical techniques, which often rely on highly structured data sets. Structured experiments are often resource intensive in terms of manpower, costs, and time because the strong environmental effect in the expression of many of the most important quantitatively inherited traits in economically important plants and animals requires that such experiments be large, carefully designed, and carefully controlled. Data mining using machine learning algorithms, however, may effectively utilize existing data that was not specifically generated for this data mining purpose.
  • In an embodiment, the methods disclosed herein may be used for prediction of a target feature value in one or more members of a second, target plant population based on their genotype for the one or more molecular genetic markers or haplotypes associated with the trait. The values may be predicted in advance of or instead of experimentally being determined.
  • In an embodiment, the methods disclosed herein have a number of applications in applied breeding programs in plants (e.g., hybrid crop plants) in association or not with other statistical methods, such as BLUP (Best Linear Unbiased Prediction). For example, the methods can be used to predict the phenotypic performance of hybrid progeny, e.g., a single cross hybrid produced (either actually or in a hypothetical situation) by crossing a given pair of inbred lines of known molecular genetic marker genotype. The methods are also useful in selecting plants (e.g., inbred plants, hybrid plants, etc.) for use as parents in one or more crosses; the methods permit selection of parental plants whose offspring have the highest probability of possessing the desired phenotype.
  • In an embodiment, associations between at least one feature and the target feature are learned. The associations may be evaluated in a sample plant population (e.g., a breeding population). The associations are evaluated in a first plant population by training a machine learning algorithm using a data set with features that incorporate genotypes for at least one molecular genetic marker and values for the target feature in at least one member of the plant population. The values of a target feature may then be predicted on a second population using the trained machine learning algorithm and the values for at least one feature. The values may be predicted in advance of or instead of experimentally being determined.
  • In an embodiment, the target feature may be a quantitative trait, e.g., for which a quantitative value is provided. In another embodiment, the target feature may be a qualitative trait, e.g., for which a qualitative value is provided. The phenotypic traits that may be included in some features may be determined by a single gene or a plurality of genes.
  • In an embodiment, the methods may also include selecting at least one of the members of the target plant population having a desired predicted value of a target feature, and include breeding at least one selected member of the target plant population with at least one other plant (or selfing the at least one selected member, e.g., to create an inbred line).
  • In an embodiment, the sample plant population may include a plurality of inbreds, single cross F1 hybrids, or a combination thereof. The inbreds may be from inbred lines that are related and/or unrelated to each other, and the single cross F1 hybrids may be produced from single crosses of the inbred lines and/or one or more additional inbred lines.
  • In an embodiment, the members of the sample plant population include members from an existing, established breeding population (e.g., a commercial breeding population). The members of an established breeding population are usually descendants of a relatively small number of founders and are generally inter-related. The breeding population may cover a large number of generations and breeding cycles. For example, an established breeding population may span three, four, five, six, seven, eight, nine or more breeding cycles.
  • In an embodiment, the sample plant population need not be a breeding population. The sample population may be a sub-population of any existing plant population for which genotypic and phenotypic data are available either completely or partially. The sample plant population may include any number of members. For example, the sample plant population includes between about 2 and about 100,000 members. The sample plant population may comprise at least about 50, 100, 200, 500, 1000, 2000, 3000, 4000, 5000, or even 6000 or 10,000 or more members. The sample plant population usually exhibits variability for the target feature of interest (e.g., quantitative variability for a quantitative target feature). The sample plant population may be extracted from one or more plant cell cultures.
  • In an embodiment, the value of the target feature in the sample plant population is obtained by evaluating the target feature among the members of the sample plant population (e.g., quantifying a quantitative target feature among the members of the population). The phenotype may be evaluated in the members (e.g., the inbreds and/or single cross F1 hybrids) comprising the first plant population. The target feature may include any quantitative or qualitative target feature, e.g., one of agronomic or economic importance. For example, the target feature may be selected from yield, grain moisture content, grain oil content, yarn strength, plant height, ear height, disease resistance, insect resistance, drought resistance, grain protein content, test weight, visual or aesthetic appearance, and cob color. These traits, and techniques for evaluating (e.g., quantifying) them, are well known in the art.
  • In an embodiment, the genotype of the sample or test plant population for the set of molecular genetic markers can be determined experimentally, predicted, or a combination thereof. For example, in one class of embodiments, the genotype of each inbred present in the plant population is experimentally determined and the genotype of each single cross F1 hybrid present in the first plant population is predicted (e.g., from the experimentally determined genotypes of the two inbred parents of each single cross hybrid). Plant genotypes can be experimentally determined by any suitable technique. In an embodiment, a plurality of DNA segments from each inbred is sequenced to experimentally determine the genotype of each inbred. In an embodiment, pedigree trees and a probabilistic approach can be used to calculate genotype probabilities at different marker loci for the two inbred parents of single cross hybrids.
  • In an embodiment, the methods disclosed herein may be used to select plants for a selected genotype including at least one molecular genetic marker associated with the target feature.
  • An "allele" or "allelic variant" refers to an alternative form of a genetic locus. A single allele for each locus is inherited separately from each parent. A diploid individual is homozygous if the same allele is present twice (i.e., once on each homologous chromosome), or heterozygous if two different alleles are present.
  • As used herein, the term "animal" is meant to encompass non-human organisms other than plants, including, but not limited to, companion animals (i.e. pets), food animals, work animals, or zoo animals. Preferred animals include, but are not limited to, fish, cats, dogs, horses, ferrets and other Mustelids, cattle, sheep, and swine. More preferred animals include cats, dogs, horses and other companion animals, with cats, dogs and horses being even more preferred. As used herein, the term "companion animal" refers to any animal which a human regards as a pet. As used herein, a cat refers to any member of the cat family (i.e., Felidae), including domestic cats, wild cats and zoo cats. Examples of cats include, but are not limited to, domestic cats, lions, tigers, leopards, panthers, cougars, bobcats, lynx, jaguars, cheetahs, and servals. A preferred cat is a domestic cat. As used herein, a dog refers to any member of the family Canidae, including, but not limited to, domestic dogs, wild dogs, foxes, wolves, jackals, and coyotes and other members of the family Canidae. A preferred dog is a domestic dog. As used herein, a horse refers to any member of the family Equidae. An equid is a hoofed mammal and includes, but is not limited to, domestic horses and wild horses, such as, horses, asses, donkeys, and zebras. Preferred horses include domestic horses, including race horses.
  • The term "association", in the context of machine learning, refers to any interrelation among features, not just ones that predict a particular class or numeric value. Association includes, but it is not limited to, finding association rules, finding patterns, performing feature evaluation, performing feature subset selection, developing predictive models, and understanding interactions between features.
  • The term "association rules", in the context of this invention, refers to elements that co-occur frequently within the data set. It includes, but is not limited to association patterns, discriminative patterns, frequent patterns, closed patterns, and colossal patterns.
  • The term "binarized", in the context of machine learning, refers to a continuous or categorical feature that has been transformed to a binary feature.
  • A "breeding population" refers generally to a collection of plants used as parents in a breeding program. Usually, the individual plants in the breeding population are characterized both genotypically and phenotypically.
  • The term "data mining" refers to the identification or extraction of relationships and patterns from data using computational algorithms to reduce, model, understand, or analyze data.
  • The term "decision trees" refers to any type of tree-based learning algorithms, including, but not limited to, model trees, classification trees, and regression trees.
  • The term "feature" or "attribute" in the context of machine learning refers to one or more raw input variables, to one or more processed variables, or to one or more mathematical combinations of other variables, including raw variables and processed variables. Features may be continuous or discrete. Features may be generated through processing by any filter algorithm or any statistical method. Features may include, but are not restricted to, DNA marker data, haplotype data, phenotypic data, biochemical data, microarray data, environmental data, proteomic data, and metabolic data.
  • The term "feature evaluation", in the context of this invention, refers to the ranking of features or to the ranking followed by the selection of features based on their impact on the target feature.
  • The phrase "feature subset" refers to a group of one or more features.
  • A "genotype" refers to the genetic makeup of a cell or the individual plant or organism with regard to one or more molecular genetic markers or alleles.
  • A "haplotype" refers to a set of alleles that an individual inherited from one parent. The term haplotype may also refer to physically linked and/or unlinked molecular genetic markers (for example polymorphic sequences) associated with a target feature. A haplotype may also refer to a group of two or more molecular genetic markers that are physically linked on a chromosome.
  • The term "instance", in the context of machine learning, refers to an example from a data set.
  • The term "interaction" within the context of this invention, refers to the association between features and target features by way of dependency of one feature on another feature.
  • The term "learning" in the context of machine learning refers to the identification and training of suitable algorithms to accomplish tasks of interest. The term "learning" includes, but is not restricted to, association learning, classification learning, clustering, and numeric prediction.
  • The term "machine learning" refers to the field of the computer sciences that studies the design of computer programs able to induce patterns, regularities, or rules from past experiences to develop an appropriate response to future data, or describe the data in some meaningful way. By "machine learning" algorithms, in the context of this invention, it is meant association rule algorithms (e.g. Apriori, discriminative pattern mining, frequent pattern mining, closed pattern mining, colossal pattern mining, and self-organizing maps), feature evaluation algorithms (e.g. information gain, Relief, ReliefF, RReliefF, symmetrical uncertainty, gain ratio, and ranker), subset selection algorithms (e.g. wrapper, consistency, classifier, correlation-based feature selection (CFS)), support vector machines, Bayesian networks, classification rules, decision trees, neural networks, instance-based algorithms, other algorithms that use the herein listed algorithms (e.g. vote, stacking, cost-sensitive classifier) and any other algorithm in the field of the computer sciences that relates to inducing patterns, regularities, or rules from past experiences to develop an appropriate response to future data, or describing the data in some meaningful way.
  • The term "model development" refers to a process of building one or more models for data mining.
  • The term "molecular genetic marker" refers to any one of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular marker derived from DNA, RNA, protein, or metabolite, and a combination thereof. Molecular genetic markers also refer to polynucleotide sequences used as probes.
  • The term "phenotypic trait" or "phenotype" refers to an observable physical or biochemical characteristics of an organism, as determined by both genetic makeup and environmental influences. Phenotype refers to the observable expression of a particular genotype.
  • The term "plant" includes the class of higher and lower plants including angiosperms (monocotyledonous and dicotyledonous plants), gymnosperms, ferns, and multicellular algae. It includes plants of a variety of ploidy levels, including aneuploid, polyploid, diploid, haploid and hemizygous.
  • The term "plant-based molecular genetic marker" refers to any one of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular marker derived from plant DNA, RNA, protein, or metabolite, and a combination thereof. Molecular genetic markers also refer to polynucleotide sequences used as probes.
  • The term "prior knowledge", in the context of this invention, refers to any form of information that can be used to modify the performance of a machine learning algorithm. A relationship matrix, indicating the degree of relatedness between individuals, is an example of prior knowledge.
  • A "qualitative trait" generally refers to a feature that is controlled by one or a few genes and is discrete in nature. Examples of qualitative traits include flower color, cob color, and disease resistance.
  • A "quantitative trait" generally refers to a feature that can be quantified. A quantitative trait typically exhibits continuous variation between individuals of a population. A quantitative trait is often the result of a genetic locus interacting with the environment or of multiple genetic loci interacting with each other and/or with the environment. Examples of quantitative traits include grain yield, protein content, and yarn strength.
  • The term "ranking" in relation to the features refers to an orderly arrangement of the features, e.g., molecular genetic markers may be ranked by their predictive ability in relation to a trait.
  • The term "self-organizing map" refers to an unsupervised learning technique often used for visualization and analysis of high-dimensional data.
  • The term "supervised", in the context of machine learning, refers to methods that operate under supervision by being provided with the actual outcome for each of the training instances.
  • The term "support vector machine", in the context of machine learning includes, but is not limited to, support vector classifier, used for classification purposes, and support vector regression, used for numeric prediction. Other algorithms (e.g. sequential minimal optimization (SMO)), may be implemented for training a support vector machine.
  • The term "target feature" in the context of this invention, refers, but is not limited to, a feature which is of interest to predict, or explain, or with which it is of interest to develop associations. A data mining effort may include one target feature or more than one target feature and the term "target feature" may refer to one or more than one feature. "Target features" may include, but are not restricted to, DNA marker data, phenotypic data, biochemical data, microarray data, environmental data, proteomic data, and metabolic data. In the field of machine learning, when the "target feature" is discrete, it is often called "class". Grain yield is an example of a target feature.
  • The term "unsupervised," in the context of machine learning, refers to methods that operate without supervision by not being provided with the actual outcome for each of the training instances.
  • Overview of theoretical and practical aspects of some relevant methods
  • Association Rule Mining:
  • Association rule mining (ARM) is a technique for extracting meaningful association patterns among features. One of the machine learning algorithms suitable for learning association rules is the Apriori algorithm.
  • A usual primary step of ARM algorithms is to find the sets of items or features that are most frequent among all the observations. These are known as frequent itemsets. Their frequency is also known as support (the user may specify a minimum support threshold for an itemset to be considered frequent). Once the frequent itemsets are obtained, rules are extracted from them (with a user-specified minimum confidence measure, for example). The latter part is not as computationally intensive as the former. Hence, ARM algorithms focus primarily on finding frequent itemsets.
  • It is not always certain that the frequent itemsets are the core (most relevant) information patterns of the dataset, as there often is a lot of redundancy among patterns. As a result, many applications rely on obtaining frequent closed patterns. A frequent closed pattern is a pattern that meets the minimal support requirement specified by the user and does not have the same support as its immediate supersets. A frequent pattern is not closed if at least one of its immediate supersets has the same support count as it does. Finding frequent closed patterns allows us to find a subset of relevant interactions among the features.
  • The Apriori algorithm works iteratively by combining frequent itemsets with n-1 features to form a frequent itemset with n features. The execution time of this procedure grows exponentially with the number of features. Hence, extracting frequent itemsets with the Apriori algorithm becomes computationally intensive for datasets with a very large number of features.
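  • The following Python sketch illustrates this level-wise (Apriori-style) procedure on a toy set of transactions; the marker-allele items and the support threshold are hypothetical:
```python
# Level-wise Apriori idea: frequent itemsets of size k are built only from
# frequent (k-1)-itemsets, and candidates with an infrequent subset are pruned.
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n
    # level 1: frequent single items
    current = {frozenset([i]) for t in transactions for i in t}
    current = {s for s in current if support(s) >= min_support}
    frequent = set(current)
    k = 1
    while current:
        k += 1
        # join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                   and support(c) >= min_support}
        frequent |= current
    return frequent

# toy transactions: each set holds marker-allele "items" observed in one line
transactions = [{"m1=A", "m2=B", "m3=A"}, {"m1=A", "m2=B"},
                {"m1=A", "m3=A"}, {"m2=B", "m3=A"}]
print(sorted(map(sorted, apriori(transactions, min_support=0.5))))
```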
  • The scalability problem for finding frequent closed itemsets can be handled by some existing algorithms. CARPENTER, a depth-first row enumeration algorithm, is capable of finding frequent closed patterns in large biological datasets with a large number of features. CARPENTER does not scale well with an increase in the number of samples.
  • Other frequent pattern mining algorithms are CHARM and CLOSET; both are efficient depth-first column enumeration algorithms.
  • COBBLER is a combined column and row enumeration algorithm that scales well with increases in the number of features and samples.
  • For many purposes, finding discriminative frequent patterns is even more useful than finding frequent closed association patterns. Several algorithms effectively mine only discriminative patterns from the dataset. Most of the existing algorithms take a two-step approach to finding discriminative patterns: (a) find frequent patterns; (b) from the frequent patterns, obtain discriminative patterns. Step (a) is very time consuming and results in many redundant frequent patterns.
  • DDPMine (Direct Discriminative Pattern Mining), a discriminative pattern mining algorithm, does not follow the two-step approach described above. Instead of deriving frequent patterns, it generates a compact (shrunken) FP-tree representation of the data. This procedure not only reduces the problem size but also speeds up the mining process. DDPMine uses information gain as the measure for mining discriminative patterns.
  • Other discriminative pattern mining algorithms are HARMONY, RCBT, and PatClass. HARMONY is an instance-centric rule-based classifier; it directly mines a final set of classification rules. The RCBT classifier works by first identifying the top-k covering rule groups for each row and then using them in the classification framework. PatClass takes a two-step approach, first mining a set of frequent itemsets and then applying a feature selection step.
  • Most existing association rule mining algorithms return small frequent or closed patterns. As the number of features increases, the number of large frequent or closed patterns also increases. It is computationally prohibitive, if not impossible, to derive all frequent patterns of all lengths for data sets with a large number of features. The Pattern Fusion algorithm addresses this problem by combining small frequent patterns into colossal patterns through leaps in the pattern search space.
  • Self-organizing maps:
  • The Self-Organizing Map (SOM), also known as a Kohonen map, is a topology-preserving, unsupervised learning technique often used for visualization and analysis of high-dimensional data. Typical applications focus on the visualization of the central dependencies within the data on the map. Some areas where SOMs have been used include automatic speech recognition, clinical voice analysis, classification of satellite images, analysis of electrical signals from the brain, and organization and retrieval from large document collections.
  • The map generated by SOMs has been used to speed up the identification of association rules by methods like Apriori, by utilizing the SOM clusters (visual clusters identified during SOM training).
  • The SOM consists of a grid of processing units, or "neurons". Each neuron is associated with a model vector of the same dimension as the observed feature vectors. The map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time, the models become ordered on the grid so that similar models are close to each other and dissimilar models are far from each other. This procedure enables the identification as well as the visualization of dependencies or associations between the features in the data.
  • During the training phase of the SOM, a competitive learning algorithm is used to fit the model vectors to the grid of neurons. It is a sequential regression process, where t = 1, 2, ... is the step index. For each sample x(t), the winner index c (best matching neuron) is first identified by the condition

    ‖x(t) − m_c(t)‖ ≤ ‖x(t) − m_i(t)‖  for all i

  • After that, all model vectors, or a subset of them that belong to nodes centered around node c = c(x), are updated as

    m_i(t + 1) = m_i(t) + h_{c(x),i} [x(t) − m_i(t)]

    where:
    • m_c is the model (weight) vector of the cth (i.e. winner) node;
    • m_i is the model (weight) vector of the ith node;
    • h_{c(x),i} is the "neighborhood function", a decreasing function of the distance between the ith and cth nodes on the map grid;
    • m_i(t + 1) is the updated weight vector after the tth step.
  • This regression is usually reiterated over the available observations.
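  • A minimal Python sketch of this sequential training loop is given below; the grid size, learning rate, neighborhood width, and decay schedule are hypothetical choices:
```python
# Sequential SOM training: for each observation x(t), find the winner node and
# pull the model vectors of the winner and its grid neighbors toward x(t).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))                        # 500 observations, 10 features
grid = [(r, c) for r in range(6) for c in range(6)]   # 6 x 6 map of neurons
M = rng.normal(size=(len(grid), X.shape[1]))          # model vector m_i for each node

alpha, sigma = 0.5, 2.0                               # learning rate, neighborhood width
for x in X:
    winner = int(np.argmin(np.linalg.norm(M - x, axis=1)))   # best matching neuron c
    for i, (row, col) in enumerate(grid):
        # neighborhood function h_{c(x),i}: decays with grid distance to the winner
        d2 = (row - grid[winner][0]) ** 2 + (col - grid[winner][1]) ** 2
        h = alpha * np.exp(-d2 / (2 * sigma ** 2))
        M[i] += h * (x - M[i])                        # m_i(t+1) = m_i(t) + h * (x(t) - m_i(t))
    alpha *= 0.995                                    # shrink learning rate over time
    sigma = max(0.5, sigma * 0.995)                   # shrink neighborhood over time
```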
  • SOM algorithms have also been frequently used to explore the spatial and temporal relationships between entities. Relationships and associations between observations are derived based on the spatial clustering of these observations on the map. If the neurons represent various time states then the map visualizes the temporal patterns between observations.
  • Feature evaluation:
  • One of the main purposes of feature evaluation algorithms is to understand the underlying process that generates the data. These methods are also frequently applied to reduce the number of "distracting" features with the aim of improving the performance of classification algorithms (see Guyon and Elisseeff (2003). An Introduction to Variable and Feature Selection. Journal of Machine learning Research 3, 1157-1182). The term "variable" is sometimes used instead of the broader terms "feature" or "attribute". Feature (or attribute) selection refers to the selection of variables processed through methods such as kernel methods, but is sometimes used to refer to the selection of raw input variables. The desired output of these feature evaluation algorithms is usually the ranking of features based on their impact on the target feature or the ranking followed by selection of features. This impact may be measured in different ways.
  • Information gain is one of the machine learning methods suitable for feature evaluation. The definition of information gain requires the definition of entropy, which is a measure of impurity in a collection of training instances. The reduction in entropy of the target feature that occurs by knowing the values of a certain feature is called information gain. Information gain may be used as a parameter to determine the effectiveness of a feature in explaining the target feature.
  • Symmetrical uncertainty, used by the Correlation based Feature Selection (CFS) algorithm described herein, compensates for information gain's bias towards features with more values by normalizing features to a [0,1] range. Symmetrical uncertainty always lies between 0 and 1. It is one way to measure the correlation between two nominal features.
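  • A small Python sketch of these quantities is given below, computing entropy, information gain, and symmetrical uncertainty for a hypothetical marker and a binarized yield class:
```python
# Entropy, information gain, and symmetrical uncertainty for discrete features.
# Symmetrical uncertainty rescales information gain via 2*IG / (H(feature) + H(target)).
import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    h_target, n, h_cond = entropy(target), len(target), 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        h_cond += (len(subset) / n) * entropy(subset)   # conditional entropy H(target | feature)
    return h_target - h_cond

def symmetrical_uncertainty(feature, target):
    denom = entropy(feature) + entropy(target)
    return 2 * information_gain(feature, target) / denom if denom > 0 else 0.0

# hypothetical marker genotypes and a binarized yield class
marker = ["AA", "AG", "GG", "AA", "AG", "GG", "AA", "GG"]
yield_class = ["high", "high", "low", "high", "low", "low", "high", "low"]
print(information_gain(marker, yield_class), symmetrical_uncertainty(marker, yield_class))
```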
  • The Ranker algorithm may also be used to rank the features by their individual evaluations at each fold of cross-validation and output the average merit and rank for each feature.
  • Relief is a class of attribute evaluator algorithms that may be used for the feature evaluation step disclosed herein. This class contains algorithms that are capable of dealing with categorical or continuous target features. This broad range makes them useful for several data mining applications.
  • The original Relief algorithm has several versions and extensions. For example, ReliefF, an extension of the original Relief algorithm, is not limited to two-class problems and can handle incomplete data sets. ReliefF is also more robust than Relief and can deal with noisy data.
  • Usually, in Relief and ReliefF, the estimated importance of a feature is determined by a sum of scores assigned to it for each one of the instances. Each score depends on how important the feature is in determining the class of an instance. The feature gets maximum value if it is decisive in determining the class. When a significant number of uninformative features are added to the analysis, many instances are necessary for these algorithms to converge to the correct estimates of the worth of each feature. When dealing with several neighboring misses, the important features are those for which a minimal change in their value leads to a change in the class of the instance being evaluated. In ReliefF, when the number of instances is enormous, the near hits play a minimal role and the near misses play a huge role, but with problems of practical size near hits play a bigger role.
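  • The following Python sketch illustrates this per-instance scoring for the basic (binary-class) Relief algorithm; the simulated data and the number of sampled instances are hypothetical:
```python
# Basic Relief weight update for a binary class: decrease a feature's weight by
# its difference to the nearest hit (same class) and increase it by its
# difference to the nearest miss (other class). Features assumed scaled to [0, 1].
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)                   # Manhattan distance to instance i
        dist[i] = np.inf                                      # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(y != y[i], dist, np.inf))   # nearest other-class instance
        # reward features that separate the miss, penalize those that separate the hit
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
    return w

rng = np.random.default_rng(1)
X = rng.random((150, 8))
y = (X[:, 2] > 0.5).astype(int)        # only feature 2 is informative in this toy data
print(np.round(relief(X, y), 3))       # feature 2 should receive the largest weight
```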
  • RReliefF is an extension of ReliefF that deals with continuous target features. The positive updates form the probability that the feature discriminates between instances with different class values; the negative updates form the probability that the feature discriminates between instances with the same class values. In regression problems it is often difficult to determine whether two instances pertain to the same class or not, so the algorithm introduces a probability value that predicts whether the values of two instances are different. RReliefF therefore rewards features for not separating similar prediction values and punishes features for not separating different prediction values. Unlike Relief and ReliefF, RReliefF does not use signs, so the concept of hit and miss does not apply. RReliefF considers good features to be the ones that separate instances with different prediction values and do not separate instances with close prediction values.
  • The estimations generated by algorithms from the Relief family depend on the number of neighbors used. If no restriction is placed on the number of neighbors, each feature will be affected by all of the samples in the data set. Restricting the number of samples used means that Relief algorithms provide estimates that are averages over local estimates in smaller parts of the instance space. These local predictions allow Relief algorithms to take other features into account when updating the weight of each feature, because the nearest-neighbors are determined by a distance measure that considers all of the features. Therefore, Relief algorithms are sensitive to the number and usefulness of the features included in the data set. Other features are considered through their conditional dependencies on the feature being updated given the predicted values, which can be detected in the context of locality. The distance between instances is determined by the sum of the differences in the values of the "relevant" and "irrelevant" features. Like other k-nearest-neighbor algorithms, these algorithms are not robust to irrelevant features. Therefore, in the presence of many irrelevant features, it is recommended to use a large value of k (i.e. to increase the number of nearest-neighbors). Doing so provides better conditions for the relevant features to "impose" the "correct" update for each feature. However, Relief algorithms can lose functionality when the number of nearest-neighbors used in the weight formula is too large, often confounding informative features. This is especially true when all of the samples are considered, as there will be only a small asymmetry between hits and misses; this asymmetry is much more prominent when only a few nearest-neighbors are considered. The power of Relief algorithms comes from their ability to use the context of locality while providing a global view.
  • RReliefF algorithm may tend to underestimate important numerical features in comparison to nominal features when calculating Euclidian or Manhattan distance between instances to determine nearest-neighbors. RReliefF also overestimates random (non-important) numerical features, potentially reducing the separability of two groups of features. The ramp function (see Hong (1994) Use of contextual information for feature ranking and discretization. Technical Report RC19664, IBM; and Hong (1997) IEEE transactions on knowledge and data engineering, 9(5) 718-730) can be used to overcome this problem of RReliefF.
  • When evaluating the weight that should be assigned to each feature in a given feature set, it is standard practice to emphasize closer instances in comparison to more distant instances. It is often dangerous, however, to use too small a number of neighbors with noisy and complex target features since this can lead to a loss of robustness. Using a larger number of nearest-neighbors avoids reducing the importance of some features for which the top 10 (for example) nearest-neighbors are temporally similar. Such features lose importance as the number of neighbors decreases. If the influence of all neighbors is treated as equal (disregarding their distance to the query point), then the proposed value for the number of nearest-neighbors is usually 10. If distance is taken into account, the proposed value is usually 70 nearest-neighbors with exponentially decreasing influence.
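  • A small Python sketch of such exponentially decreasing neighbor influence is given below; the number of neighbors and the decay constant are hypothetical choices:
```python
# Exponentially decreasing neighbor influence: with ~70 nearest-neighbors, the
# weight given to each neighbor decays with its distance rank, so closer
# neighbors dominate the update.
import numpy as np

k = 70
ranks = np.arange(1, k + 1)                 # 1 = nearest neighbor
weights = np.exp(-(ranks / 20.0) ** 2)      # exponentially decreasing influence
weights /= weights.sum()                    # normalize so the influences sum to 1
print(weights[:5], weights[-5:])
```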
  • ReliefF and RReliefF are context sensitive and therefore more sensitive to the number of random (non-important) features in the analysis than myopic measures (e.g. gain ratio and MSE). Relief algorithms estimate each feature in the context of other features, and better features get higher scores. Relief algorithms tend to underestimate less important features when there are hundreds of important features in the data set, yet duplicated or highly redundant features will share the credit and seem to be more important than they actually are. This can occur because additional copies of the feature change the problem space in which the nearest-neighbors are searched. Using nearest-neighbors, updates only occur when there are differences between feature values for two neighboring instances. Therefore, no update for a given feature at a given set of neighbors will occur if the difference between the two neighbors is zero. Highly redundant features will have these differences always equal to zero, reducing the opportunity for updating across all neighboring instances and features. Myopic estimators such as gain ratio and MSE are not sensitive to duplicated features. However, Relief algorithms will perform better than myopic algorithms if there are interactions between features.
  • Subset Selection
  • Subset selection algorithms rely on a combination of an evaluation method (e.g. symmetrical uncertainty, and information gain) and a search method (e.g. ranker, exhaustive search, best first, and greedy hill-climbing).
  • Subset selection algorithms, similarly to feature evaluation algorithms, rank subsets of features. In contrast to feature evaluation algorithms, however, subset selection algorithms aim at selecting the subset of features with the highest impact on the target feature, while accounting for the degree of redundancy between the features included in the subset. Subset selection algorithms are designed to be robust to multicollinearity and missing values and thus allow for selection from an initial pool of hundreds or even thousands of features. The benefits from feature subset selection include facilitating data visualization and understanding, reducing measurement and storage requirements, reducing training and utilization times, and eliminating distracting features to improve classification. For example, the results from subset selection methods are useful for plant and animal geneticists because they can be used to pre-select the molecular genetic markers to be analyzed during a marker assisted selection project with a phenotypic trait as the target feature. This can significantly reduce the number of molecular genetic markers that must be assayed and thus reduce the costs associated with the effort.
  • Subset selection algorithms can be applied to a wide range of data sets. An important consideration in the selection of a suitable search algorithm is the number of features in the data set. As the number of features increases, the number of possible subsets of features increases exponentially. For this reason, the exhaustive search algorithm is only suitable when the number of features is relatively small. With adequate computational power, however, it is possible to use exhaustive search to determine the most relevant subset of features.
  • There are several algorithms suitable for data sets with a feature set that is too large (or the computational power available is not large enough) for exhaustive search. Two basic approaches to subset selection algorithms are the process of adding features to a working subset (forward selection) and deleting from the current subset of features (backward elimination). In machine learning, forward selection is done differently than the statistical procedure with the same name. Here, the feature to be added to the current subset is found by evaluating the performance of the current subset augmented by one new feature using cross-validation. In forward selection, subsets are built up by adding each remaining feature in turn to the current subset while evaluating the expected performance of each new subset using cross-validation. The feature that leads to the best performance when added to the current subset is retained and the process continues. The search ends when none of the remaining available features improves the predictive ability of the current subset. This process finds a local (i.e. not necessarily global) optimum set of features.
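  • A minimal Python sketch of this forward-selection loop is given below; the naive Bayes scorer and the simulated data are hypothetical stand-ins for whichever modeling algorithm and data set are being used:
```python
# Greedy forward selection: at each step, every remaining feature is tentatively
# added to the current subset, the augmented subset is scored by cross-validation,
# and the best addition is kept until no addition improves the score.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, cv=5):
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:               # stop when no feature improves the subset
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best
    return selected, best_score

rng = np.random.default_rng(4)
X = rng.random((120, 15))
y = (X[:, 2] + X[:, 9] > 1.0).astype(int)      # only features 2 and 9 matter here
print(forward_selection(X, y))                 # expected to pick features 2 and 9 first
```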
  • Backward elimination is implemented in a similar fashion. With backward elimination, the search ends when further reduction in the feature set does not improve the predictive ability of the subset. To introduce bias towards smaller subsets one may require the predictive ability to improve by a certain amount for a feature to be added (during forward selection) or deleted (during backward elimination).
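  • The following is a minimal sketch of cross-validated forward selection as described above, assuming scikit-learn is available; the Gaussian naive Bayes evaluator, the synthetic data, and the stopping rule (stop as soon as no candidate feature improves the score) are illustrative choices only. Backward elimination would follow the same pattern, starting from the full feature set and removing one feature at a time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, estimator, cv=10):
    """Greedy forward selection: add the feature that most improves
    cross-validated accuracy; stop when no addition helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, best_f = max(scores)
        if score <= best_score:        # local optimum reached
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           n_redundant=3, random_state=1)
subset, score = forward_selection(X, y, GaussianNB(), cv=10)
print("selected features:", subset, "cv accuracy: %.3f" % score)
```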
  • In an embodiment, the best first algorithm can search forward, backward, or in both directions (by considering all possible single-feature additions and deletions at a given point) through the application of greedy hill-climbing augmented with a backtracking facility (see Pearl, J. (1984), Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, p. 48; and Russell, S.J., & Norvig, P., Artificial Intelligence: A Modern Approach. 2nd edition. Pearson Education, Inc., 2003, pp. 94-95). This method keeps a list of all the subsets previously visited and revisits them whenever the predictive ability stops improving for the current subset. Given enough time and no stopping criterion, it will search the entire space (i.e. exhaustive search), and it is much less likely to terminate at a local maximum than forward selection or backward elimination. Best first results are, as expected, very similar to the results obtained with exhaustive search. In an embodiment, the beam search method works similarly to best first but truncates the list of feature subsets at each stage, restricting it to a fixed number called the beam width.
  • In an embodiment, a genetic algorithm is a search method that uses random perturbations of a current list of candidate subsets to generate new good subsets (see Schmitt, Lothar M. (2001), Theory of Genetic Algorithms, Theoretical Computer Science (259), pp. 1-61). Genetic algorithms are adaptive and use search techniques based on the principles of natural selection in biology. Competing solutions are set up and evolve over time, searching the solution space in parallel (which helps avoid local maxima). Crossover and mutation are applied to the members of the current generation to create the next generation. The random addition or deletion of features from a subset is conceptually analogous to the role of mutation in natural systems. Similarly, crossovers combine features from a pair of subsets to form a new subset. The concept of fitness comes into play in that the fittest (best) subsets at a given generation have a greater chance of being selected to form new subsets through crossover and mutation. Therefore, good subsets evolve over time.
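  • A simplified sketch of such a genetic search over feature subsets, encoded as 0/1 masks, is shown below; the population size, mutation rate, number of generations, fitness function (cross-validated naive Bayes accuracy), and synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of the subset encoded by a 0/1 mask."""
    if mask.sum() == 0:
        return 0.0
    cols = np.flatnonzero(mask)
    return cross_val_score(GaussianNB(), X[:, cols], y, cv=5).mean()

def genetic_search(X, y, pop_size=20, generations=15, p_mut=0.05):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))       # random initial subsets
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        # Fitness-proportional selection of parents.
        probs = scores / scores.sum()
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        # One-point crossover between consecutive parents.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n)
            children[i, cut:], children[i + 1, cut:] = (
                parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # Mutation: randomly add or delete features.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return np.flatnonzero(pop[scores.argmax()]), scores.max()

X, y = make_classification(n_samples=150, n_features=12, n_informative=4,
                           random_state=2)
subset, score = genetic_search(X, y)
print("best subset:", subset, "cv accuracy: %.3f" % score)
```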
  • In an embodiment, the scheme-specific (wrapper) approach (Kohavi and John (1997), Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, December 1997) is a suitable search method. The idea here is to select the subset of features that will have the best classification performance when used for building a model with a specific algorithm. Accuracy is evaluated through cross-validation, a holdout set, or a bootstrap estimator. A model must be built, and its cross-validation folds evaluated, for each subset of features being considered. For example, forward selection or backward elimination with k features and 10-fold cross-validation takes on the order of k^2 x 10 learning procedures, whereas exhaustive search takes on the order of 2^k x 10 learning procedures. Good results have been shown for scheme-specific search, with backward elimination leading to more accurate models (and larger subsets) than forward selection. More sophisticated techniques are not usually justified but can lead to much better results in some cases. Statistical significance tests can be used to decide when to stop searching, based on the chance that a subset being evaluated will improve on the current best subset.
  • In an embodiment, race search, which uses a t-test to determine the probability that a subset is better than the current best subset by at least a small user-specified threshold, is suitable. If, during the leave-one-out cross-validation process, this probability becomes small, a subset can be discarded because it is very unlikely that adding features to or deleting features from it will lead to an improvement over the current best subset. In forward selection, for example, all of the candidate feature additions to a subset are evaluated simultaneously and the ones that do not perform well enough are dropped. Therefore, not all of the instances are used (in leave-one-out cross-validation) to evaluate all of the subsets. The race search algorithm also blocks nearly identical feature subsets and uses Bayesian statistics to maintain a probability distribution on the estimate of the mean leave-one-out cross-validation error for each competing subset. Forward selection is used but, instead of sequentially trying all the possible changes to the best subset, these changes are raced, and the race finishes when cross-validation finishes or a single subset is left.
  • In an embodiment, schemata search is a more elaborate method designed for racing, running an iterative series of races, each of which determines whether a feature should be included or not (see Moore, A. W., and Lee, M. S. (1994). Efficient algorithms for minimizing cross-validation error. In Cohen, W. W., and Hirsh, H., eds., Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann). A search begins with all features marked as unknown rather than with an empty or full set of features. All combinations of unknown features are used with equal probability. In each round, a feature is chosen and subsets with and without the chosen feature are raced. The other features that compose the subset are included or excluded randomly at each point in the evaluation. The winner of a race is used as the starting point for the next iteration of races. Given the probabilistic framework, a good feature will be included in the final subset even if it depends on another feature. Schemata search takes interacting features into account while speeding up the search process and has been shown to be more effective and much faster than race search (which uses forward or backward selection).
  • In an embodiment, rank race search orders the features based on their information gain, for example, and then races using subsets that are based on the rankings of the features. The race starts with no features, continues with the top-ranked feature, the top two features, the top three features, and so on. Cross-validation may be used to determine the best search method for a specific data set.
  • In an embodiment, selective naive Bayes uses a search algorithm such as forward selection to avoid including redundant features and features that are dependent on each other (see, e.g., Domingos, Pedro & Michael Pazzani (1997), "On the optimality of the simple Bayesian classifier under zero-one loss". Machine Learning, 29:103-137). The best subset is found by simply testing the performance of the candidate subsets on the training set.
  • Filter methods operate independently of any learning algorithm, while wrapper methods rely on a specific learning algorithm and use methods such as cross-validation to estimate the accuracy of feature subsets. Wrappers often perform better than filters, but are much slower, and must be re-run whenever a different learning algorithm is used or even when a different set of parameter settings is used. The performance of wrapper methods depends on the learning algorithm used, the procedure used to estimate the out-of-sample accuracy of the learning algorithm, and the organization of the search.
  • Filters (e.g. the CFS algorithm) are much faster than wrappers for subset selection (due to the reasons pointed out above), so filters can be used with larger data sets. Filters can also improve the accuracy of a certain algorithm by providing a starting feature subset for the wrapper algorithms. This process would therefore speed up the wrapper analysis.
  • The original version of the CFS algorithm measured only the correlation between discrete features, so it would first discretize all of the continuous features. More recent versions handle continuous features without the need for discretization.
  • CFS assumes that the features are independent given the target feature. If strong feature dependency exists, CFS' performance may suffer and it might fail to select all of the relevant features. CFS is effective at eliminating redundant and irrelevant features and will detect all of the relevant features in the absence of strong dependency between features. CFS will accept features capable of predicting the response variable in areas of the instance space not already predicted by other features.
  • There are variations of CFS capable of improving the detection of locally predictive features, which is very important in cases where strong globally predictive features overshadow locally predictive ones. CFS has been shown to outperform wrappers much of the time (Hall, M. A. 1999. Correlation-based feature selection for machine learning. Ph.D. thesis. Department of Computer Science, The University of Waikato, New Zealand), especially with small data sets and in cases where there are small feature dependencies.
  • In the case of the CFS algorithm, the numerator of the evaluation function indicates how predictive of the target feature the subset is, and the denominator indicates how redundant the features in the subset are. In the original CFS algorithm, the target feature is first made discrete using the method of Fayyad and Irani (Fayyad, U. M. and Irani, K. B. 1993. Multi-interval discretisation of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1993). The algorithm then calculates all feature-target correlations (used in the numerator of the evaluation function) and all feature-feature correlations (used in the denominator of the evaluation function). After that, the algorithm searches the feature subset space (using any user-determined search method) looking for the best subset. In a modification of the CFS algorithm, symmetrical uncertainty is used to calculate the correlations.
  • The central assumption of CFS is that the features are independent given the target feature (i.e. that there are no interactions). Therefore, if strong interactions are present, CFS may fail to detect relevant features. CFS is expected to perform well under moderate levels of interaction. CFS tends to penalize noisy features. CFS is heavily biased towards small feature subsets, which can reduce accuracy in some cases. CFS is not heavily dependent on the search method used. CFS may be set to place more value on locally predictive features, even if these features do not show outstanding global predictive ability; if not set to account for locally predictive features, the bias of CFS towards small subsets may exclude them. CFS also tends to do better than wrappers on small data sets because it does not need to save part of the data set for testing. Wrappers perform better than CFS when interactions are present. A wrapper with forward selection can be used to detect pair-wise interactions, but backward elimination is needed to detect higher-order interactions. Backward searches, however, make wrappers even slower. Bi-directional search can be used for wrappers, starting from the subset chosen by the CFS algorithm; this approach can significantly reduce the amount of time the wrapper needs to complete the search.
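  • The CFS merit heuristic described above can be sketched as follows; for simplicity, absolute Pearson correlation is used in place of the symmetrical uncertainty of the modified CFS algorithm, and the function name and toy data are hypothetical.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS-style merit of a feature subset:
    (k * mean feature-target correlation) /
    sqrt(k + k*(k-1) * mean feature-feature correlation).
    Absolute Pearson correlation stands in for symmetrical uncertainty."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Toy data: features 0 and 1 are predictive; feature 2 duplicates feature 0.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = np.column_stack([y + rng.normal(0, 0.5, 200),
                     y + rng.normal(0, 0.8, 200),
                     y + rng.normal(0, 0.5, 200)])
X[:, 2] = X[:, 0]                                  # fully redundant copy
print("merit of {0, 1}: %.3f" % cfs_merit(X, y, [0, 1]))
print("merit of {0, 2}: %.3f" % cfs_merit(X, y, [0, 2]))  # redundancy penalized
```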
  • Model Development
  • For modeling of large data sets, several algorithms may be used, depending on the nature of the data. In an aspect, for example, Bayesian network methods provide a useful and flexible probabilistic approach to inference.
  • In an embodiment, the Bayes optimal classifier algorithm does more than apply the maximum a posteriori hypothesis to a new record in order to predict the probability of its classification (Friedman et al., (1997), Bayesian network classifiers. Machine learning, 29:131-163). It also considers the probabilities from each of the other hypotheses obtained from the training set (not just the maximum a posteriori hypothesis) and uses these probabilities as weighting factors for future predictions. Therefore, future predictions are carried out using all of the hypotheses (i.e., all of the possible models) weighted by their posterior probabilities.
  • In an embodiment, the naive Bayes classifier assigns the most probable classification to a record, given the joint probability of the features. Calculating the full joint probability requires a large data set and is computationally intensive. The naive Bayes classifier is part of a larger class of algorithms called Bayesian networks. Some of these Bayesian networks can relax the strong assumption made by the naive Bayes algorithm of independence between features. A Bayesian network is a directed acyclic graph (DAG) with a conditional probability distribution for each node. It relies on the assumption that features are conditionally independent given the target feature (naive Bayes) or given their parents, which may include the target feature (Bayesian augmented network) or not (general Bayesian network). The assumption of conditional independence is restricted to subsets of the features, and this leads to a set of conditional independence assumptions together with a set of conditional probabilities. The output reflects a description of the joint probability for a set of features.
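  • A minimal sketch of a naive Bayes classifier for categorical, marker-like features is shown below, assuming only NumPy; the class name, the Laplace smoothing parameter, and the toy genotype data are illustrative assumptions rather than a specific implementation referenced in this disclosure.

```python
import numpy as np
from collections import defaultdict

class DiscreteNaiveBayes:
    """Naive Bayes for categorical features with Laplace smoothing."""
    def fit(self, X, y, alpha=1.0):
        self.classes_ = np.unique(y)
        self.alpha = alpha
        self.class_log_prior_ = {}
        self.cond_ = defaultdict(dict)   # (feature index, class) -> value counts
        self.values_ = [np.unique(X[:, j]) for j in range(X.shape[1])]
        for c in self.classes_:
            Xc = X[y == c]
            self.class_log_prior_[c] = np.log(len(Xc) / len(X))
            for j in range(X.shape[1]):
                vals, counts = np.unique(Xc[:, j], return_counts=True)
                self.cond_[(j, c)] = dict(zip(vals, counts))
        return self

    def predict(self, X):
        preds = []
        for row in X:
            scores = {}
            for c in self.classes_:
                s = self.class_log_prior_[c]
                for j, v in enumerate(row):
                    counts = self.cond_[(j, c)]
                    num = counts.get(v, 0) + self.alpha
                    den = sum(counts.values()) + self.alpha * len(self.values_[j])
                    s += np.log(num / den)   # conditional independence assumption
                scores[c] = s
            preds.append(max(scores, key=scores.get))
        return np.array(preds)

# Toy marker-like data: two categorical features, binary response.
X = np.array([["A", "a"], ["A", "a"], ["B", "b"], ["B", "b"], ["A", "b"], ["B", "a"]])
y = np.array([1, 1, 0, 0, 1, 0])
model = DiscreteNaiveBayes().fit(X, y)
print(model.predict(np.array([["A", "a"], ["B", "b"]])))   # expected: [1 0]
```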
  • In an embodiment, different search algorithms can be implemented using the package WEKA in each of these areas, and probability tables may be calculated by the simple estimator or by Bayesian model averaging (BMA).
  • Regarding methods to search for the best network structure, one option is to use the global score metric-based algorithms. These algorithms rely on cross-validation performed with leave-one-out, k-fold, or cumulative cross-validation. The leave-one-out method isolates one record, trains on the rest of the data set, and evaluates that isolated record (repeatedly, for each of the records). The k-fold method splits the data into k parts, isolates one of these parts, trains with the rest of the data set, and evaluates the isolated set of records. The cumulative cross-validation algorithm starts with an empty data set and adds record by record, updating the state of the network after each additional record, and evaluating the next record to be added according to the current state of the network.
  • In an embodiment, an appropriate network structure found by one of these processes is considered as the structure that best fits the data, as determined by a global or a local score. It can also be considered as a structure that best encodes the conditional independencies between features; these independencies can be measured by Chi-squared tests or mutual information tests. Conditional independencies between the features are used to build the network. When the computational complexity is high, the classification may be performed by a subset of the features, determined by any subset selection method.
  • In an alternative approach to building the network, the target feature may be treated like any other node (general Bayesian network) when finding dependencies; afterwards, it is isolated from the other features via its Markov blanket. The Markov blanket isolates a node from being affected by any node outside its boundary, which is composed of the node's parents, its children, and the parents of its children. When applied, the Markov blanket of the target feature is often sufficient to perform classification without a loss of accuracy, and all of the other nodes may be deleted. This method selects the features that should be used in the classification (i.e. the ones included in the Markov blanket) and reduces the risk of over-fitting the data by deleting all nodes that are outside the Markov blanket of the target feature.
  • In an embodiment, instance-based algorithms are also suitable for model development. Instance-based algorithms, also referred to as "lazy" algorithms, are characterized by generating a new model for each instance, instead of basing predictions on trees or networks generated (once) from a training set. In other words, they do not provide a general function that explains the target feature. These algorithms store the entire training set in memory and build a model from a set of records similar to those being tested. This similarity is evaluated through nearest-neighbor or locally weighted methods, using Euclidean distances. Once a set of records is selected, the final model may be built using several different algorithms, such as naive Bayes. The resulting model is generally not designed to perform well when applied to other records. Because the training observations are stored explicitly, not in the form of a tree or network, information is never wasted when training instance-based algorithms.
  • In an embodiment, instance-based algorithms are useful for complex, multidimensional problems for which the computational demands of trees and networks exceed the available memory. This approach avoids the problem of attempting to perform complexity reduction via selection of features to fit the demands of trees or networks. However, this process may perform poorly when classifying a new instance, because all of the computations take place at the classification time. This is generally not a problem during applications in which one or a few instances are to be classified at a time. Usually, these algorithms give similar importance to all of the features, without placing more weight on those that better explain the target feature. This may lead to selection of instances that are not actually closest to the instance being evaluated in terms of their relationship to the target feature. Instance-based algorithms are robust to noise in data collection because instances get the most common assignment among their neighbors or an average (continuous case) of these neighbors, and these algorithms usually perform well with very large training sets.
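  • A minimal sketch of a lazy, instance-based (k-nearest-neighbor) classifier follows; the choice of Euclidean distance, majority voting, k = 3, and the toy data are illustrative assumptions. Note that all of the computation happens at query time, as described above.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Lazy, instance-based classification: all work happens at query time."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # k closest stored instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two clusters of training instances.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```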
  • In an embodiment, support vector machines (SVMs) are used to model data sets for data mining purposes. Support vector machines are an outgrowth of Statistical Learning Theory and were first described in 1992. An important aspect of SVMs is that once the support vectors have been identified, the remaining observations can be removed from the calculations, thus greatly reducing the computational complexity of the problem.
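  • The following hedged sketch, assuming scikit-learn is available, illustrates the point that only the support vectors are needed to define an SVM decision function; the linear kernel, the regularization parameter, and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors are needed to express the decision function;
# the remaining training points could be discarded without changing it.
print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))
print("accuracy on training data: %.3f" % clf.score(X, y))
```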
  • In an embodiment, decision tree learning algorithms are suitable machine learning methods for modeling. These decision tree algorithms include ID3, Assistant, and C4.5. These algorithms have the advantage of searching through a large hypothesis space without many restrictions. They are often biased towards building small trees, a property that is sometimes desirable.
  • The resulting trees can usually be represented by a set of "if-then" rules; this property, which does not apply to other classes of algorithms such as instance-based algorithms, can improve human readability. The classification of an instance occurs by scanning the tree from top to bottom and evaluating some feature at each node of the tree. Different decision tree learning algorithms vary in terms of their capabilities and requirements; some work only with discrete features. Many decision tree algorithms also require the target feature to be discrete (often binary), while others can handle continuous target features. These algorithms are usually robust to errors in the determination of classes (coding) for each feature. Another relevant property is that some of these algorithms can effectively handle missing values.
  • In an embodiment, the Iterative Dichotomiser 3 (ID3) algorithm is a suitable decision tree algorithm. This algorithm uses "information gain" to decide which feature best explains the target by itself, and it places this feature at the top of the tree (i.e., at the root node). Next, a descendant is assigned for each class of the root node by sorting the training records according to the classes of the root node and finding the feature with the greatest information gain within each of these classes. This cycle is repeated for each newly added feature, and so on. The algorithm cannot "back-track" to reconsider its previous decisions, and this may lead to convergence to a local maximum. There are several extensions of the ID3 algorithm that perform "post-pruning" of the decision tree, which is a form of back-tracking.
  • The ID3 algorithm performs a "hill-climbing search" through the space of decision trees, starting from a simple hypothesis and progressing through more elaborate hypotheses. Because its hypothesis space contains all possible decision trees, it avoids the problem of choosing a hypothesis space that does not contain the target function. The ID3 algorithm outputs just one tree, not all reasonable trees.
  • Inductive bias arises with the ID3 algorithm because it behaves approximately like a top-down, breadth-first search: it considers the possible trees at a certain depth, chooses the best one, and then moves to the next depth. It prefers short trees over long trees, and by selecting the shortest adequate tree it places the features with the highest information gain closest to the root.
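  • The entropy and information gain calculations that drive ID3's choice of the root feature can be sketched as follows; the function names and toy data are hypothetical.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a discrete target."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    """Reduction in entropy of y obtained by splitting on feature x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])
    return gain

# Toy data: feature 0 explains the target, feature 1 is noise.
X = np.array([["A", "a"], ["A", "b"], ["B", "a"], ["B", "b"],
              ["A", "a"], ["B", "b"]])
y = np.array([1, 1, 0, 0, 1, 0])
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
print("information gain per feature:", gains)   # feature 0 wins -> root node
```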
  • In an aspect of decision trees, a variation of ID3 algorithm is the logistic model tree (LMT) (Landwehr et al., (2003), Logistic Model Trees. Proceedings of the 14th European Conference on machine learning. Cavtat-Dubrovnik, Croatia. Springer-Verlag.). This classifier implements logistic regression functions at the leaves. This algorithm deals with discrete target features, and can handle missing values.
  • C4.5 is a decision tree algorithm based on the ID3 algorithm (Quinlan (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers). Its improvements include, for example, choosing an appropriate feature evaluation measure; handling training data with missing feature values; handling features with differing costs; and handling continuous features.
  • A useful tool for evaluating the performance of binary classifiers is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a graphical plot of the sensitivity vs. (1 - specificity) of a binary classifier as its discrimination threshold is varied (T. Fawcett (2003). ROC graphs: Notes and practical considerations for data mining researchers. Tech report HPL-2003-4. HP Laboratories, Palo Alto, CA, USA). ROC curves are, therefore, constructed by plotting the sensitivity against (1 - specificity) for different thresholds. These thresholds determine whether a record is classified as positive or negative and therefore influence both the sensitivity and the (1 - specificity). As an example, consider an analysis in which a series of plant varieties are being evaluated for their response to a pathogen and it is desirable to establish a threshold above which a variety will be considered susceptible. The ROC curve is built over several such thresholds, which helps determine the best threshold for a given problem (the one that gives the best balance between the true positive rate and the false positive rate). Lower thresholds lead to higher false positive rates because more of the negative records are classified as positive. The area under the ROC curve is a measure of the overall performance of a classifier, but the choice of the best classifier may be based on specific sections of that curve.
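  • A minimal sketch of the threshold sweep behind an ROC curve is given below; the classifier scores, class labels, and thresholds are hypothetical values used only to illustrate how sensitivity and (1 - specificity) change as the threshold is lowered.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Sensitivity and (1 - specificity) for each classification threshold."""
    points = []
    for t in thresholds:
        pred = scores >= t                      # classified as positive (susceptible)
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tn = np.sum(~pred & (labels == 0))
        sensitivity = tp / (tp + fn)
        fpr = fp / (fp + tn)                    # 1 - specificity
        points.append((fpr, sensitivity))
    return points

# Hypothetical classifier scores for ten plant varieties (1 = susceptible).
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,    1,   1,    0,   0,   1,   0,   0])
for fpr, tpr in roc_points(scores, labels, thresholds=[0.25, 0.5, 0.75]):
    print("1 - specificity = %.2f, sensitivity = %.2f" % (fpr, tpr))
```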
  • Cross-validation techniques are methods by which a particular algorithm, or a particular set of algorithms, is chosen to provide optimal performance for a given data set. Cross-validation techniques are used herein to select a particular machine learning algorithm during model development, for example. When several algorithms are available for implementation, it is usually desirable to choose the one that is expected to have the best performance in the future. Cross-validation is usually the methodology of choice for this task.
  • Cross-validation is based on first separating part of the training data, then training with the rest of the data, and finally evaluating the performance of the algorithm on the separated data set. Cross-validation techniques are preferred over residual evaluation because residual evaluation is not informative as to how an algorithm will perform when applied to a new data set.
  • In an embodiment, one variant of cross-validation, the holdout method, is based on splitting the data in two, training on the first subset, and testing on the second subset. It takes about the same amount of time to compute as the residual method, and it is preferred when the data set is large enough. The performance of this method may vary depending on how the data set is split into subsets.
  • In an embodiment of cross-validation, a k-fold cross-validation method is an improvement over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. The average error across the k trials is then computed. Each record is part of the testing set once, and is part of the training set k-1 times. This method is less sensitive to the way in which the data set is divided, but the computational cost is k times greater than with the holdout method.
  • In another embodiment of cross-validation, the leave-one-out cross-validation method is similar to k-fold cross-validation. The training is performed using N-1 records (where N is the total number of records), and the testing is performed using only one record at a time. Locally weighted learners reduce the running time of these algorithms to levels similar to that of residual evaluation.
  • In an aspect of cross-validation, the random sample technique is another option for testing, in which a reasonably sized sample from the data set (e.g., more than 30), is used for testing, with the rest of the data set being used for training. The advantage of using random samples for testing is that sampling can be repeated any number of times, which may result in a reduction of the confidence interval of the predictions. Cross-validation techniques, however, have the advantage that records in the testing sets are independent across testing sets.
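  • The following sketch contrasts the holdout, k-fold, and leave-one-out strategies described above, assuming scikit-learn is available; the decision tree classifier, the synthetic data, and the split sizes are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout = clf.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: every record is tested exactly once and used for training k-1 times.
kfold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True,
                                            random_state=0)).mean()

# Leave-one-out: N models, each tested on a single held-out record.
loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print("holdout %.3f | 10-fold %.3f | leave-one-out %.3f" % (holdout, kfold, loo))
```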
  • Some of the association rule algorithms described herein may be used to detect interactions among the features in a data set, and may also be used for model development. The M5P algorithm is a model tree algorithm suitable for continuous and discrete target features. It builds decision trees with regression functions at the leaves instead of terminal class values. Continuous features may be handled directly, without transformation to discrete features. It uses a conditional class probability function to deal with discrete classes; the class whose model tree generates the greatest approximate probability value is chosen as the predicted class. The M5P algorithm can represent any piecewise linear approximation to an unknown function. M5P examines all possible tests and chooses the one that maximizes the expected error reduction. M5P then prunes this tree back by replacing sub-trees with linear regression models wherever the latter have lower estimated error. The estimated error is the average absolute difference between predicted and actual values for all the instances at a node.
  • During pruning, the underestimation of the error for unseen cases is compensated for by multiplying the error by (n + v)/(n - v), where n is the number of instances reaching the node and v is the number of parameters in the linear model for that node (see Witten and Frank, 2005). The features involved in each regression are the features that are tested in the sub-trees below the node (see Wang and Witten, 1997). A smoothing process is then used to avoid sharp discontinuities between neighboring linear models at the leaves when predicting continuous class values. During smoothing, the prediction from the leaf model is made first and is then smoothed by combining it with the predicted values from the linear models at each intermediate node on the path back to the root.
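  • As a small illustration of the pruning heuristic just described, the sketch below applies the (n + v)/(n - v) compensation factor to a hypothetical node; the numbers are invented for illustration.

```python
def compensated_error(avg_abs_error, n, v):
    """M5-style pruning heuristic: inflate the training error at a node by
    (n + v) / (n - v), where n = instances reaching the node and
    v = parameters in the node's linear model."""
    return avg_abs_error * (n + v) / (n - v)

# Hypothetical node: 40 instances, 3-parameter linear model, training error 0.21.
print(round(compensated_error(0.21, n=40, v=3), 4))   # 0.21 * 43/37 = 0.2441
```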
  • In an embodiment of modeling with decision tree algorithms, alternating decision trees (ADTrees) are used herein. This algorithm is a generalization of decision trees that relies on a boosting technique called AdaBoost (see Freund and Schapire (1996), Experiments with a new boosting algorithm. In L. Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 148-156, San Mateo, CA, Morgan Kaufmann) to improve performance.
  • When compared to other decision tree algorithms, the alternating decision tree algorithm tends to build smaller trees with simpler rules, and is therefore more readily interpretable. It also associates real values with each of the nodes, which allows each node to be evaluated independently of the other nodes. The smaller size of the resulting trees, and the corresponding reduction in memory requirements, makes the alternating decision tree algorithm one of the few options for handling very large and complex data sets. The multiple paths followed by a record after a prediction node make this algorithm more robust to missing values, because all of the alternative paths are still followed even when the path corresponding to the missing value must be ignored. Finally, this algorithm provides a measure of confidence in each classification, called the "classification margin", which in some applications is as important as the classification itself. As with other decision trees, this algorithm is also very robust with respect to multicollinearity among features.
  • Plants and animals are often propagated on the basis of certain desirable features such as grain yield, percent body fat, oil profile, and resistance to diseases. One of the objectives of a plant or animal improvement program is to identify individuals for propagation such that the desirable features are expressed more frequently or more prominently in successive generations. Learning involves, but is not restricted to, changing the practices, activities, or behaviors involved in identifying individuals for propagation such that the extent of the increase in the expression of the desirable feature is greater or the cost of identifying the individuals to propagate is lower. By accomplishing the steps listed herein, it is possible to develop a model to more effectively select individuals for propagation than by other methods and to more accurately classify or predict the performance of hypothetical individuals based on a combination of feature values.
  • In addition to the desirable features, data can be obtained for one or more additional features that may or may not have an obvious relationship to the desirable features.
  • EXAMPLE
  • The following example is for illustrative purposes only and is not intended to limit the scope of this disclosure.
  • Elite maize lines containing high and low levels of resistance to a pathogen were identified through field and greenhouse screening. A line that demonstrates high levels of resistance to this pathogen was used as a donor and crossed to a susceptible elite line. The offspring were then backcrossed to the same susceptible elite line. The resulting population was crossed to a haploid inducer stock, and chromosome doubling technology was used to develop 191 fixed inbred lines. The level of resistance to the pathogen was evaluated for each line in two replications using field screening methodologies. Forty-four replications of the susceptible elite line were also evaluated using field screening methodologies. Genotype data were generated for all 191 doubled haploid lines, the susceptible elite line, and the resistant donor using 93 polymorphic SSR markers.
  • The final data set contained 426 samples that were divided into two groups based on the field screening results. Plants with field screening scores ranging from 1 to 4 comprised the susceptible group, while plants with field screening scores ranging from 5 to 9 comprised the resistant group. For our analyses, the susceptible group was labeled "0" and the resistant group was labeled "1".
  • The data set was analyzed using a three step process consisting of: (a) Detecting association rules; (b) Creating new features based on the findings of step (a) and adding these features to the data set; (c) Developing a classification model for a target feature without the features from step (b) and another model with the features from step (b). A description of the application of each of these steps to this data set follows.
  • Step (a): Detecting association rules: In this example, the 426 samples were evaluated using DDPM (a direct discriminative pattern mining algorithm) and CARPENTER (a frequent pattern mining algorithm). All 94 features (target feature included) were used for evaluation.
  • The association rule detected by the DDPM algorithm included the following features:
    1. Feature 48 = 5_103.776_umc2013, Feature 59 = 7_12.353_lgi2132 and Feature 89 = 10_43.909_phi050
  • This discriminative pattern has the best information gain (0.068) from all the patterns with support (occurrences out of 426 samples) >= 120.
  • The association rules detected by the CARPENTER algorithm included the following features:
    1. Feature 59 = 7_12.353_lgi2132, Feature 62 = 7_47.585_umc1036 and Response = 1
    2. Feature 59 = 7_12.353_lgi2132, Feature 92 = 10_48.493_umc1648 and Response = 1
    3. Feature 35 = 4_58.965_umc1964, Feature 59 = 7_12.353_lgi2132 and Response = 1
    4. Feature 19 = 2_41.213_lgi2277, Feature 20 = 2_72.142_umc1285 and Response = 0
    5. Feature 19 = 2_41.213_lgi2277, Feature 78 = 8_95.351_umc1384 and Response = 0
    6. Feature 88 = 10_18.018_umc1576, Feature 89 = 10_43.909_phi050 and Response = 0
  • The association rules with Response=1 have a support of 180 and rules with Response=0 have a support of 140.
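  • As an illustration of the support and information gain statistics reported for these patterns, the following sketch computes both quantities for a hypothetical pattern indicator over 426 samples; the indicator and response values are randomly generated placeholders, not the actual marker data of this example.

```python
import numpy as np

def entropy(p):
    """Binary entropy; p is the proportion of positive responses."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def pattern_support_and_gain(pattern_present, response):
    """Support of an itemset (count of samples containing it) and the
    information gain of its indicator with respect to a binary response."""
    support = int(pattern_present.sum())
    n = len(response)
    base = entropy(response.mean())
    p_in = response[pattern_present].mean()
    p_out = response[~pattern_present].mean()
    gain = (base
            - (support / n) * entropy(p_in)
            - ((n - support) / n) * entropy(p_out))
    return support, gain

# Hypothetical indicator: which of 426 samples contain a candidate pattern.
rng = np.random.default_rng(0)
response = rng.integers(0, 2, size=426)
pattern_present = rng.random(426) < 0.3
print(pattern_support_and_gain(pattern_present, response))
```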
  • Step (b): Creating new features based on the findings of step (a) and adding these features to the data set: Using the original features included in the association rules detected during step (a), new features were created. These new features were created by concatenating the original features, as shown in Table 1 (an illustrative sketch of this operation follows the table). Table 1: Representation of the possible values of a new feature created from two other features.
    Feature 1   Feature 2   New Feature
    A           a           aa
    B           a           ba
    A           b           ab
    B           b           bb
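  • A minimal sketch of the concatenation operation represented in Table 1, assuming pandas is available, is shown below; the column names and genotype codes are hypothetical placeholders for the marker features appearing in the association rules. In practice the same column-wise concatenation would be applied to each pair (or larger set) of markers appearing in a detected rule.

```python
import pandas as pd

# Hypothetical genotype codes for two markers appearing in one association rule.
df = pd.DataFrame({
    "marker_59": ["A", "B", "A", "B"],
    "marker_62": ["a", "a", "b", "b"],
})

# New feature: concatenation of the two original genotype codes (cf. Table 1).
df["marker_59_x_62"] = df["marker_59"] + df["marker_62"]
print(df)
```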
  • Step (c): Developing a classification model for a target feature before adding the features from step (b) and another model after adding the features from step (b): For model development, the REPTree algorithm was applied to the data set. Table 2 shows that after adding the new features to the data set, the mean absolute error decreased (i.e. the new features improved the classification accuracy). Table 3 shows the confusion matrix resulting from a REPTree model using the original data set without the new features from step (b). Table 4 shows the confusion matrix resulting from a REPTree model using the original data set and the new features from step (b). The addition of the new features from step (b) led to an increase in the number of correctly classified records for both classes of the target feature: for class "0", the number of correctly classified records increased from 91 to 97, and for class "1", from 166 to 175. Figure 1 shows the increase in the area under the ROC curve obtained with the addition of the new features from step (b), indicating that adding the new features from step (b) leads to an improved model. These results were obtained using 10-fold cross-validation. An illustrative sketch of this before/after comparison follows Table 4. Table 2: Mean absolute errors obtained from a REPTree model applied to a data set consisting of 426 maize plants using 93 features created from SSR molecular genetic markers, with and without the new features from step (b), and the target feature.
    Algorithm                                                    Mean Absolute Error
    REPTree (Original data)                                      0.4438
    REPTree (Original data plus new features from step (b))     0.436
    Table 3: Confusion matrix resulting from a REPTree model using the original data set without the new features from step (b).
                Predicted
    Original    Class-0   Class-1
    Class-0     91        91
    Class-1     78        166
    Table 4: Confusion matrix resulting from a REPTree model using the original data set and the new features from step (b).
                Predicted
    Original    Class-0   Class-1
    Class-0     97        86
    Class-1     70        175
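  • A hedged sketch of the before/after comparison performed in step (c) is shown below, assuming scikit-learn is available; a depth-limited decision tree stands in for the REPTree algorithm, and the synthetic two-marker data (in which only the combination of the markers is predictive) is invented to illustrate the procedure, not to reproduce the results in Tables 2-4.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 426

# Synthetic stand-in for the marker data: two markers whose combination
# (but not either one alone) is associated with the binary response.
m1 = rng.choice(["A", "B"], size=n)
m2 = rng.choice(["a", "b"], size=n)
response = ((m1 == "A") == (m2 == "a")).astype(int)
response ^= (rng.random(n) < 0.15)                  # add label noise

base = pd.DataFrame({"m1": m1, "m2": m2})
augmented = base.assign(m1_x_m2=base["m1"] + base["m2"])   # step (b) feature

clf = DecisionTreeClassifier(max_depth=1, random_state=0)  # stand-in for REPTree
for name, frame in [("original", base), ("plus new feature", augmented)]:
    X = OrdinalEncoder().fit_transform(frame)
    acc = cross_val_score(clf, X, response, cv=10).mean()
    print("%s: 10-fold CV accuracy %.3f" % (name, acc))
```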

Claims (20)

  1. A method for prediction or classification of one or more target features in plants, the method comprising:
    determining the genotype of plants in a plant population for at least one plant-based molecular genetic marker;
    providing a data set comprising a set of features, wherein at least one of the features in the data set comprises the at least one molecular genetic marker;
    determining at least one association rule from the data set utilizing a computer and one or more association rule mining algorithms;
    utilizing the association rule to create one or more new features;
    adding the new feature to the data set;
    utilizing one or more machine learning algorithms for developing a plurality of models for prediction or classification of a desired target feature using at least one new feature;
    utilizing cross-validation to compare the algorithms and sets of parameter values and selecting an algorithm for accurate prediction or classification of desired target features for the data set;
    utilizing the selected algorithm for prediction of a value of a target feature in one or more members of the plant population; and
    selecting at least one member of the plant population having a desired predicted value of the target feature.
  2. The method of claim 1 wherein the association rules include spatial and temporal association rules that are determined with self-organizing maps.
  3. The method of claim 1, wherein the data set is selected from the group consisting of environmental data, phenotypic data, DNA sequence data, microarray data, biochemical data, metabolic data, or a combination thereof.
  4. The method of claim 1, wherein the one or more association rules determined by one or more association rule mining algorithms are utilized for classification or prediction with one or more machine learning algorithms selected from the group consisting of feature evaluation algorithms, feature subset selection algorithms, Bayesian networks, instance-based algorithms, support vector machines, vote algorithm, cost-sensitive classifier, stacking algorithm, classification rules, and decision trees.
  5. The method of claim 4 wherein the one or more association rule mining algorithms are selected from the group consisting of APriori algorithm, FP-growth algorithm, association rule mining algorithms that can handle large number of features, colossal pattern mining algorithms, direct discriminative pattern mining algorithm, decision trees, rough sets, and the Self-organizing map (SOM) algorithm.
  6. The method of claim 5, wherein, independently,
    i) the association rule mining algorithms that can handle large number of features include, but are not limited to, CLOSET+, CHARM, CARPENTER and COBBLER;
    ii) the algorithms that can find direct discriminative patterns include, but are not limited to, DDPM, HARMONY, RCBT, CAR, and PATCLASS;
    iii) the algorithms that can find colossal patterns include, but are not limited to, Pattern fusion algorithm.
  7. The method of claim 4, wherein an algorithm may be independently defined, as follows:
    i) wherein the feature evaluation algorithm is selected from the group consisting of information gain algorithm, Relief algorithm, ReliefF algorithm, RReliefF algorithm, symmetrical uncertainty algorithm, gain ratio algorithm, and ranker algorithm;
    ii) the feature subset selection algorithm is selected from the group consisting of correlation-based feature selection (CFS) algorithm, and the wrapper algorithm in association with any other machine learning algorithm;
    iii) the machine learning algorithm is a Bayesian network algorithm including naive Bayes algorithm;
    iv) the instance-based algorithm is selected from the group consisting of instance-based 1 (IB1) algorithm, instance-based k-nearest neighbor (IBK) algorithm, KStar, lazy Bayesian rules (LBR) algorithm, and locally weighted learning (LWL) algorithm;
    v) wherein the decision tree is selected from the group consisting of the logistic model tree (LMT) algorithm, alternating decision tree (ADTree) algorithm, M5P algorithm, and REPTree algorithm.
  8. The method of claim 4, wherein the machine learning algorithm is a support vector machine algorithm, preferably a support vector regression (SVR) algorithm.
  9. The method of claim 8, wherein the support vector machine algorithm uses the sequential minimal optimization (SMO) algorithm, or the sequential minimal optimization for regression (SMOReg) algorithm.
  10. The method of claim 1, wherein the one or more target features are selected from the group consisting of a continuous target feature and a discrete target feature, wherein preferably the discrete target feature is a binary target feature.
  11. The method of claim 1, wherein the at least one plant-based molecular genetic marker is from a plant population, preferably selected from the group consisting of maize, soybean, sugarcane, sorghum, wheat, sunflower, rice, canola, cotton and millet.
  12. The method of claim 11, wherein the plant population is a structured or an unstructured plant population.
  13. The method of claim 11, wherein the plant population comprises inbred plants or hybrid plants.
  14. The method of claim 1, wherein the features comprise one or more of a simple sequence repeat (SSR), cleaved amplified polymorphic sequences (CAPS), a simple sequence length polymorphism (SSLP), a restriction fragment length polymorphism (RFLP), a random amplified polymorphic DNA (RAPD) marker, a single nucleotide polymorphism (SNP), an arbitrary fragment length polymorphism (AFLP), an insertion, a deletion, any other type of molecular genetic marker derived from DNA, RNA, protein, or metabolite, a haplotype created from two or more of the above described molecular genetic markers derived from DNA, and a combination thereof, preferably in conjunction with one or more phenotypic measurements, microarray data, analytical measurements, biochemical measurements, or environmental measurements such as climate and soil characteristics of the field where plants are cultivated as features.
  15. The method of claim 1, wherein the one or more target features are numerically representable phenotypic traits including disease resistance, yield, grain yield, yarn strength, protein composition, protein content, insect resistance, grain moisture content, grain oil content, grain oil quality, drought resistance, root lodging resistance, plant height, ear height, grain protein content, grain amino acid content, grain color, and stalk lodging resistance, preferably adjusted utilizing statistical methods, machine learning methods, or any combination thereof.
  16. The method of claim 1, further comprising using receiver operating characteristic (ROC) curves to compare algorithms and sets of parameter values.
  17. The method of claim 1, wherein one or more features are derived mathematically or computationally from other features.
  18. The method of claim 1, wherein the results are applied to detect one or more quantitative trait loci, assign significance to one or more quantitative trait loci, position one or more quantitative trait loci, or any combination thereof.
  19. The method of claim 1, wherein the prior knowledge is comprised of preliminary research, quantitative studies of plant genetics, gene networks, sequence analyses, or any combination thereof.
  20. The method of claim 1, further comprising the steps of:
    (a) reducing dimensionality by replacing the original features with a combination of one or more of the features included in one or more of the association rules;
    (b) mining discriminative and essential frequent patterns via model-based search tree.
EP10728031.5A 2009-06-30 2010-06-03 Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules Active EP2449510B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US22180409P 2009-06-30 2009-06-30
PCT/US2010/037211 WO2011008361A1 (en) 2009-06-30 2010-06-03 Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules

Publications (3)

Publication Number Publication Date
EP2449510A1 EP2449510A1 (en) 2012-05-09
EP2449510B1 true EP2449510B1 (en) 2019-07-31
EP2449510B2 EP2449510B2 (en) 2022-12-21

Family

ID=42685709

Family Applications (1)

Application Number Title Priority Date Filing Date
EP10728031.5A Active EP2449510B2 (en) 2009-06-30 2010-06-03 Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules

Country Status (15)

Country Link
US (1) US10102476B2 (en)
EP (1) EP2449510B2 (en)
CN (1) CN102473247B (en)
AR (2) AR077103A1 (en)
AU (2) AU2010274044B2 (en)
BR (1) BRPI1015129A2 (en)
CA (1) CA2766914C (en)
CL (1) CL2011003328A1 (en)
CO (1) CO6430492A2 (en)
MX (1) MX2011014020A (en)
NZ (1) NZ596478A (en)
PH (1) PH12016501806A1 (en)
RU (1) RU2607999C2 (en)
WO (1) WO2011008361A1 (en)
ZA (1) ZA201108579B (en)

Families Citing this family (88)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8190647B1 (en) * 2009-09-15 2012-05-29 Symantec Corporation Decision tree induction that is sensitive to attribute computational complexity
US8593277B2 (en) 2011-03-17 2013-11-26 Kaarya, LLC. System and method for proximity detection
US8819065B2 (en) * 2011-07-08 2014-08-26 International Business Machines Corporation Mining generalized spatial association rule
US8217945B1 (en) 2011-09-02 2012-07-10 Metric Insights, Inc. Social annotation of a single evolving visual representation of a changing dataset
US9275334B2 (en) * 2012-04-06 2016-03-01 Applied Materials, Inc. Increasing signal to noise ratio for creation of generalized and robust prediction models
US9563669B2 (en) 2012-06-12 2017-02-07 International Business Machines Corporation Closed itemset mining using difference update
US9373087B2 (en) 2012-10-25 2016-06-21 Microsoft Technology Licensing, Llc Decision tree training in machine learning
US9754015B2 (en) * 2012-11-26 2017-09-05 Excalibur Ip, Llc Feature rich view of an entity subgraph
CN103884806B (en) * 2012-12-21 2016-01-27 中国科学院大连化学物理研究所 In conjunction with the Leaf proteins Label-free Protein Quantification Methods of second order ms and machine learning algorithm
US9471881B2 (en) 2013-01-21 2016-10-18 International Business Machines Corporation Transductive feature selection with maximum-relevancy and minimum-redundancy criteria
US20140207799A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Hill-climbing feature selection with max-relevancy and minimum redundancy criteria
US10102333B2 (en) 2013-01-21 2018-10-16 International Business Machines Corporation Feature selection for efficient epistasis modeling for phenotype prediction
RU2543315C2 (en) * 2013-03-22 2015-02-27 Федеральное государственное автономное образовательное учреждение высшего профессионального образования "Национальный исследовательский университет "Высшая школа экономики" Method of selecting effective versions in search and recommendation systems (versions)
AU2014287234A1 (en) * 2013-07-10 2016-02-25 Daniel M. Rice Consistent ordinal reduced error logistic regression machine
US9524510B2 (en) * 2013-10-02 2016-12-20 Turn Inc. Adaptive fuzzy fallback stratified sampling for fast reporting and forecasting
WO2015143393A1 (en) * 2014-03-20 2015-09-24 The Regents Of The University Of California Unsupervised high-dimensional behavioral data classifier
US9940343B2 (en) 2014-05-07 2018-04-10 Sas Institute Inc. Data structure supporting contingency table generation
CN104765810B (en) * 2015-04-02 2018-03-06 西安电子科技大学 Diagnosis and treatment rule digging method based on Boolean matrix
US10019542B2 (en) 2015-04-14 2018-07-10 Ptc Inc. Scoring a population of examples using a model
US10037361B2 (en) 2015-07-07 2018-07-31 Sap Se Frequent item-set mining based on item absence
CN105160087B (en) * 2015-08-26 2018-03-13 中国人民解放军军事医学科学院放射与辐射医学研究所 A kind of construction method of correlation rule optimal curve model
US11972336B2 (en) 2015-12-18 2024-04-30 Cognoa, Inc. Machine learning platform and system for data analysis
CN105827603A (en) * 2016-03-14 2016-08-03 中国人民解放军信息工程大学 Inexplicit protocol feature library establishment method and device and inexplicit message classification method and device
CN107516022A (en) * 2016-06-17 2017-12-26 北京光大隆泰科技有限责任公司 The data processing method and system of phenotype genes type based on discrete interrelated decision tree
CN106202883A (en) * 2016-06-28 2016-12-07 成都中医药大学 A kind of method setting up disease cloud atlas based on big data analysis
US11030673B2 (en) * 2016-07-28 2021-06-08 International Business Machines Corporation Using learned application flow to assist users in network business transaction based apps
US11222270B2 (en) 2016-07-28 2022-01-11 International Business Machiness Corporation Using learned application flow to predict outcomes and identify trouble spots in network business transactions
RU2649792C2 (en) * 2016-09-09 2018-04-04 Общество С Ограниченной Ответственностью "Яндекс" Method and learning system for machine learning algorithm
US10210283B2 (en) 2016-09-28 2019-02-19 International Business Machines Corporation Accessibility detection and resolution
CN106407711A (en) * 2016-10-10 2017-02-15 重庆科技学院 Recommendation method and recommendation system of pet feeding based on cloud data
CN106472332B (en) * 2016-10-10 2019-05-10 重庆科技学院 Pet feeding method and system based on dynamic intelligent algorithm
CN110050092B (en) * 2016-12-08 2023-01-03 中国种子集团有限公司 Rice whole genome breeding chip and application thereof
EP3340130A1 (en) * 2016-12-23 2018-06-27 Hexagon Technology Center GmbH Method for prediction of soil and/or plant condition
CN106709998A (en) * 2016-12-24 2017-05-24 郑州大学 Self-organization mapping modeling method for digital reconstruction of crop foliage attributes
WO2018129414A1 (en) * 2017-01-08 2018-07-12 The Henry M. Jackson Foundation For The Advancement Of Military Medicine, Inc. Systems and methods for using supervised learning to predict subject-specific pneumonia outcomes
CA3049582A1 (en) * 2017-01-08 2018-07-12 The Henry M. Jackson Foundation For The Advancement Of Military Medicine, Inc. Systems and methods for using supervised learning to predict subject-specific bacteremia outcomes
CN106803209B (en) * 2017-01-13 2020-09-18 浙江求是人工环境有限公司 Crop cultivation mode analysis optimization method of real-time database and advanced control algorithm
CN106886792B (en) * 2017-01-22 2020-01-17 北京工业大学 Electroencephalogram emotion recognition method for constructing multi-classifier fusion model based on layering mechanism
US20180239866A1 (en) * 2017-02-21 2018-08-23 International Business Machines Corporation Prediction of genetic trait expression using data analytics
CN108733966A (en) * 2017-04-14 2018-11-02 国网重庆市电力公司 A kind of multidimensional electric energy meter field thermodynamic state verification method based on decision woodlot
RU2672394C1 (en) * 2017-07-26 2018-11-14 Общество С Ограниченной Ответственностью "Яндекс" Methods and systems for evaluation of training objects through a machine training algorithm
US11263707B2 (en) 2017-08-08 2022-03-01 Indigo Ag, Inc. Machine learning in agricultural planting, growing, and harvesting contexts
WO2019040866A2 (en) 2017-08-25 2019-02-28 The Board Of Trustees Of The University Of Illinois Apparatus and method for agricultural data collection and agricultural operations
CN107679368A (en) * 2017-09-11 2018-02-09 宁夏医科大学 PET/CT high dimensional feature level systems of selection based on genetic algorithm and varied precision rough set
CN107844602B (en) * 2017-11-24 2021-03-16 重庆邮电大学 Prediction method based on spatio-temporal attribute association rule
KR20200092989A (en) * 2017-12-01 2020-08-04 지머젠 인코포레이티드 Production organism identification using unsupervised parameter learning for outlier detection
CN108280289B (en) * 2018-01-22 2021-10-08 辽宁工程技术大学 Rock burst danger level prediction method based on local weighted C4.5 algorithm
CN108307231B (en) * 2018-02-14 2021-01-08 南京邮电大学 Network video stream feature selection and classification method based on genetic algorithm
US11367093B2 (en) 2018-04-24 2022-06-21 Indigo Ag, Inc. Satellite-based agricultural modeling
US11138677B2 (en) 2018-04-24 2021-10-05 Indigo Ag, Inc. Machine learning in an online agricultural system
US11531934B2 (en) * 2018-05-31 2022-12-20 Kyndryl, Inc. Machine learning (ML) modeling by DNA computing
CN109308936B (en) * 2018-08-24 2020-09-01 黑龙江省稻无疆农业科技有限责任公司 Grain crop production area identification method, grain crop production area identification device and terminal identification equipment
CN109300502A (en) * 2018-10-10 2019-02-01 汕头大学医学院 A kind of system and method for the analyzing and associating changing pattern from multiple groups data
CN109726228A (en) * 2018-11-01 2019-05-07 北京理工大学 A kind of Cutting data integrated application method under big data background
CN109918708B (en) * 2019-01-21 2022-07-26 昆明理工大学 Material performance prediction model construction method based on heterogeneous ensemble learning
CN113412333A (en) 2019-03-11 2021-09-17 先锋国际良种公司 Method for clonal plant production
US11174522B2 (en) * 2019-03-11 2021-11-16 Pioneer Hi-Bred International, Inc. Methods and compositions for imputing or predicting genotype or phenotype
TWI759586B (en) * 2019-03-18 2022-04-01 崑山科技大學 Farmland irrigation recommendation method
WO2020214699A1 (en) * 2019-04-15 2020-10-22 Sports Data Labs, Inc. Animal data prediction system
CN110119551B (en) * 2019-04-29 2022-12-06 西安电子科技大学 Shield machine cutter abrasion degradation correlation characteristic analysis method based on machine learning
US10515715B1 (en) 2019-06-25 2019-12-24 Colgate-Palmolive Company Systems and methods for evaluating compositions
EP3997546A4 (en) * 2019-07-08 2023-07-12 Indigo AG, Inc. Crop yield forecasting models
CN110334133B (en) * 2019-07-11 2020-11-20 北京京东智能城市大数据研究院 Rule mining method and device, electronic equipment and computer-readable storage medium
CN110777214B (en) * 2019-07-11 2022-05-27 东北农业大学 SSR (simple sequence repeat) marker closely linked with corn seed storage resistance and application thereof in molecular marker-assisted breeding
US20210110298A1 (en) * 2019-10-15 2021-04-15 Kinaxis Inc. Interactive machine learning
US11526899B2 (en) 2019-10-11 2022-12-13 Kinaxis Inc. Systems and methods for dynamic demand sensing
US11886514B2 (en) 2019-10-11 2024-01-30 Kinaxis Inc. Machine learning segmentation methods and systems
CN111103157A (en) * 2019-11-26 2020-05-05 通鼎互联信息股份有限公司 Industrial equipment state monitoring method based on biological heuristic frequent item set mining
US11544593B2 (en) * 2020-01-07 2023-01-03 International Business Machines Corporation Data analysis and rule generation for providing a recommendation
US11176924B2 (en) 2020-01-09 2021-11-16 International Business Machines Corporation Reduced miss rate in sound to text conversion using banach spaces
US11669794B2 (en) * 2020-04-06 2023-06-06 Johnson Controls Tyco IP Holdings LLP Building risk analysis system with geographic risk scoring
CN111540408B (en) * 2020-05-12 2023-06-02 西藏自治区农牧科学院水产科学研究所 Screening method of genome-wide polymorphism SSR molecular markers
CN111738138B (en) * 2020-06-19 2024-02-02 安徽大学 Remote sensing monitoring method for severity of wheat strip embroidery disease based on coupling meteorological characteristic region scale
CN111784071B (en) * 2020-07-14 2024-05-07 北京月新时代科技股份有限公司 License occupation and prediction method and system based on Stacking integration
CN111984646A (en) * 2020-08-12 2020-11-24 中国科学院昆明植物研究所 Binary tree-based plant data storage and identification method and system
CN112182497B (en) * 2020-09-25 2021-04-27 齐鲁工业大学 Biological sequence-based negative sequence pattern similarity analysis method, implementation system and medium
CN112446509B (en) * 2020-11-10 2023-05-26 中国电子科技集团公司第三十八研究所 Predictive maintenance method for complex electronic equipment
WO2022157872A1 (en) * 2021-01-21 2022-07-28 日本電気株式会社 Information processing apparatus, feature quantity selection method, teacher data generation method, estimation model generation method, stress level estimation method, and program
CN113053459A (en) * 2021-03-17 2021-06-29 扬州大学 Hybrid prediction method integrating parental phenotypes based on a Bayesian model
CN113381973B (en) * 2021-04-26 2023-02-28 深圳市任子行科技开发有限公司 Method, system and computer-readable storage medium for identifying SSR traffic
US11748384B2 (en) 2021-05-28 2023-09-05 International Business Machines Corporation Determining an association rule
CN113535694A (en) * 2021-06-18 2021-10-22 北方民族大学 Feature selection method based on a stacking framework
WO2023034118A1 (en) 2021-08-30 2023-03-09 Indigo Ag, Inc. Systems for management of location-aware market data
CA3230474A1 (en) 2021-08-31 2023-03-09 Eleanor Elizabeth Campbell Systems and methods for ecosystem credit recommendations
CN114780599A (en) * 2022-04-06 2022-07-22 四川农业大学 Comprehensive analysis system based on wheat variety comparison trial data
CN116189907B (en) * 2022-12-05 2023-09-05 广州盛安医学检验有限公司 Intelligent genetic metabolic disease screening system suitable for newborns
CN117461500B (en) * 2023-12-27 2024-04-02 北京市农林科学院智能装备技术研究中心 Plant factory system, method, device, equipment and medium for accelerating crop breeding
CN118230826B (en) * 2024-05-24 2024-08-23 安徽中医药大学 Diffusion model-based depression gene expression profile identification method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2547678A1 (en) 1996-07-03 1998-01-04 Cargill, Incorporated Canola oil having increased oleic acid and decreased linolenic acid content
WO2003040949A1 (en) 2001-11-07 2003-05-15 Biowulf Technologies, Llc Pre-processed feature ranking for a support vector machine
US20030130991A1 (en) * 2001-03-28 2003-07-10 Fidel Reijerse Knowledge discovery from data sets
RU2215406C2 (en) 2002-01-03 2003-11-10 Государстенное научное учреждение РАСХН - Всероссийский научно-исследовательский институт масличных культур им. В.С. Пустовойта Method for creating soya forms with altered fatty-acid content of oil
EP1626621A4 (en) 2003-05-28 2009-10-21 Pioneer Hi Bred Int Plant breeding method
EP1866818A1 (en) * 2005-03-31 2007-12-19 Koninklijke Philips Electronics N.V. System and method for collecting evidence pertaining to relationships between biomolecules and diseases
US7836004B2 (en) * 2006-12-11 2010-11-16 International Business Machines Corporation Using data mining algorithms including association rules and tree classifications to discover data rules
RU2010101623A (en) 2007-06-20 2011-07-27 International Flower Developments Proprietary Limited (AU) Rose containing flavone and delphinidin and method for producing it

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
BAEZ-MONROY V.O. ET AL.: "Principles of Employing a Self- Organizing Map as a Frequent Itemset Miner", 15TH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS: BIOLOGICAL INSPIRATIONS, September 2005 (2005-09-01), pages 363 - 370, XP019018161
DIMOV R.: "WEKA: Practical Machine Learning Tools and Techniques in Java", SEMINAR A.I. TOOLS, WS 2006/2007, XP055738406
GUYON I. ET AL.: "An introduction to variable and feature selection", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 3, March 2003 (2003-03-01), pages 1157 - 1182, XP058112314
LONG N. ET AL.: "Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers", J. ANIM. BREED. GENETICS, vol. 124, no. 6, 7 December 2007 (2007-12-07), pages 377 - 389, XP55856541
MALOVINI A. ET AL.: "Phenotype forecasting with SNPs data through gene -based Bayesian networks", BMC BIOINFORMATICS, vol. 10, no. 2, February 2009 (2009-02-01), pages 1 - 9, XP021047454
P SERMSWATRSI, SRISA-AN: "A Neural-Networks Associative Classification for Association Rule Mining", WIT TRANSACTIONS ON INFORMATION AND COMMUNICATION TECHNOLOGIES, vol. 37, 1 January 2006 (2006-01-01), pages 93 - 102, XP055738417, ISSN: 1743-3517
SENTHIL K PALANISAMY: "Association Rule Based Classification", THESIS WORCESTER POLYTECHNIC INSTITUTE, 1 May 2006 (2006-05-01), pages 1 - 52, XP055738409
TAYLOR J. ET AL.: "Application of metabolomics to plant genotype discrimination using statistics and machine learning", BIOINFORMATICS, vol. 18, no. Suppl. 2, 1 October 2002 (2002-10-01), pages S241 - S248, XP055856540
WITTEN I.H. ET AL.: "Data Mining - Practical Machine Learning Tools and Techniques", THE WEKA MACHINE LEARNING WORKBENCH, 2005
YU J. ET AL.: "A unified mixed-model method for association mapping that accounts for multiple levels of relatedness", NATURE GENETICS, vol. 38, no. 2, February 2006 (2006-02-01), pages 203 - 208, XP002588875, DOI: 10.1038/NG1702
ZHONG S. ET AL.: "Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: A barley case study", GENETICS, vol. 182, no. 1, April 2009 (2009-04-01), pages 355 - 364, XP055131917, DOI: 10.1534/genetics.108.098277

Also Published As

Publication number Publication date
RU2607999C2 (en) 2017-01-11
CA2766914C (en) 2019-02-26
BRPI1015129A2 (en) 2016-07-12
CN102473247A (en) 2012-05-23
CO6430492A2 (en) 2012-04-30
CL2011003328A1 (en) 2012-08-31
AU2015243031B2 (en) 2016-12-22
CA2766914A1 (en) 2011-01-20
AR107503A2 (en) 2018-05-09
US20100332430A1 (en) 2010-12-30
EP2449510B2 (en) 2022-12-21
RU2012103024A (en) 2013-08-10
MX2011014020A (en) 2012-02-28
AU2015243031A1 (en) 2015-11-05
CN102473247B (en) 2017-02-08
AR077103A1 (en) 2011-08-03
PH12016501806A1 (en) 2018-06-11
US10102476B2 (en) 2018-10-16
ZA201108579B (en) 2013-01-30
NZ596478A (en) 2014-04-30
EP2449510A1 (en) 2012-05-09
AU2010274044A1 (en) 2012-01-19
WO2011008361A1 (en) 2011-01-20
AU2010274044B2 (en) 2015-08-13

Similar Documents

Publication Publication Date Title
AU2015243031B2 (en) Application of Machine Learning Methods for Mining Association Rules in Plant and Animal Data Sets Containing Molecular Genetic Markers, Followed by Classification or Prediction Utilizing Features Created from these Association Rules
Rostami et al. Review of swarm intelligence-based feature selection methods
Rostami et al. Integration of multi-objective PSO based feature selection and node centrality for medical datasets
Akinola et al. Multiclass feature selection with metaheuristic optimization algorithms: a review
Dadaneh et al. Unsupervised probabilistic feature selection using ant colony optimization
Squillero et al. Divergence of character and premature convergence: A survey of methodologies for promoting diversity in evolutionary optimization
US20220301658A1 (en) Machine learning driven gene discovery and gene editing in plants
CN116168766A (en) Variety identification method, system and terminal based on ensemble learning
Larrañaga et al. Estimation of distribution algorithms in machine learning: a survey
Bartusiak et al. Predicting dog phenotypes from genotypes
CN117476252A (en) Etiology and pathology prediction method based on knowledge graph
El Rahman et al. Machine learning model for breast cancer prediction
Shirzadifar et al. A machine learning approach to predict the most and the least feed-efficient groups in beef cattle
Carballido et al. Preclas: an evolutionary tool for unsupervised feature selection
Segall et al. Applications of neural network and genetic algorithm data mining techniques in bioinformatics knowledge discovery–A preliminary study
Segera An excited cuckoo search-grey wolf adaptive kernel SVM for effective pattern recognition in DNA microarray cancer chips
Logeswari et al. A Study on Attribute Selection Methodologies in Microarray Data to Classify the Cancer Type
Priya et al. Deep learning-based breast cancer disease prediction framework for medical industries
Alavi et al. scQuery: a web server for comparative analysis of single-cell RNA-seq data
Nour Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets
Ghevariya Faster markov blanket with tabu search for efficient feature selection of microarray cancer datasets
Kihlman et al. Sub-sampling graph neural networks for genomic prediction of quantitative phenotypes
Salehi et al. Estimation of Distribution Algorithms in Gene Expression Data Analysis
De Paz et al. An adaptive algorithm for feature selection in pattern recognition
Sihombing et al. Application of Genetic Algorithm to Determine A Document Similarity Level in IRS

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20120127

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20170215

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20190204

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1161705

Country of ref document: AT

Kind code of ref document: T

Effective date: 20190815

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602010060274

Country of ref document: DE

REG Reference to a national code

Ref country code: CH

Ref legal event code: NV

Representative=s name: FIAMMENGHI-FIAMMENGHI, CH

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1161705

Country of ref document: AT

Kind code of ref document: T

Effective date: 20190731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191031

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191031

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191202

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191101

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191130

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

REG Reference to a national code

Ref country code: DE

Ref legal event code: R026

Ref document number: 602010060274

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

PLBI Opposition filed

Free format text: ORIGINAL CODE: 0009260

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200224

26 Opposition filed

Opponent name: KWS SAAT SE & CO. KGAA

Effective date: 20200430

PLAX Notice of opposition and request to file observation + time limit sent

Free format text: ORIGINAL CODE: EPIDOSNOBS2

PG2D Information on lapse in contracting state deleted

Ref country code: IS

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191030

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

PLBB Reply of patent proprietor to notice(s) of opposition received

Free format text: ORIGINAL CODE: EPIDOSNOBS3

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200603

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200603

APBM Appeal reference recorded

Free format text: ORIGINAL CODE: EPIDOSNREFNO

APBP Date of receipt of notice of appeal recorded

Free format text: ORIGINAL CODE: EPIDOSNNOA2O

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

APAH Appeal reference modified

Free format text: ORIGINAL CODE: EPIDOSCREFNO

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190731

APBU Appeal procedure closed

Free format text: ORIGINAL CODE: EPIDOSNNOA9O

REG Reference to a national code

Ref country code: NL

Ref legal event code: HC

Owner name: CORTEVA AGRISCIENCE LLC; US

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), CHANGE OF OWNER(S) NAME; FORMER OWNER NAME: DOW AGROSCIENCES LLC

Effective date: 20220902

PUAH Patent maintained in amended form

Free format text: ORIGINAL CODE: 0009272

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: PATENT MAINTAINED AS AMENDED

27A Patent maintained in amended form

Effective date: 20221221

AK Designated contracting states

Kind code of ref document: B2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

REG Reference to a national code

Ref country code: DE

Ref legal event code: R102

Ref document number: 602010060274

Country of ref document: DE

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230516

REG Reference to a national code

Ref country code: DE

Ref legal event code: R081

Ref document number: 602010060274

Country of ref document: DE

Owner name: CORTEVA AGRISCIENCE LLC (N.D.GES.DES STAATES D, US

Free format text: FORMER OWNER: DOW AGROSCIENCES LLC, INDIANAPOLIS, IND., US

REG Reference to a national code

Ref country code: BE

Ref legal event code: HC

Owner name: CORTEVA AGRISCIENCE LLC; US

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), CHANGE OF OWNER(S) NAME; FORMER OWNER NAME: DOW AGROSCIENCES LLC

Effective date: 20230619

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: CH

Payment date: 20230702

Year of fee payment: 14

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20240624

Year of fee payment: 15

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20240618

Year of fee payment: 15

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 20240618

Year of fee payment: 15

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20240627

Year of fee payment: 15

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: BE

Payment date: 20240621

Year of fee payment: 15