CN116580773A - Breeding cross-generation phenotype prediction method and system based on ensemble learning, and electronic equipment
- Publication number
- CN116580773A (application number CN202310373424.1A)
- Authority
- CN
- China
- Prior art keywords
- generation
- data
- crop
- crops
- genotype
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Abstract
The application discloses a breeding cross-generation phenotype prediction method and system based on ensemble learning, and electronic equipment. The method comprises the following steps: acquiring genotype data of a high-generation crop and the corresponding later-generation crop, and acquiring target phenotype data of the high-generation crop; calculating an evaluation function based on a genetic algorithm, and screening from the genotype data, according to the evaluation function, a subset of the high-generation crop genotype data that is genetically related to the corresponding later-generation crop; training a number of different machine learning models on the subset; calculating an evaluation index for each machine learning model, ranking the evaluation indexes, and selecting the first K machine learning models as basic learners; stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner; and inputting genotype data of the later-generation crop into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.
Description
Technical Field
The application relates mainly to the field of computational crop breeding, and in particular to a breeding cross-generation phenotype prediction method and system, electronic equipment and a storage medium.
Background
Crop breeding is the artificial optimization of crop development, a process controlled by complex genes. In early crop breeding, breeders relied on visual observation of phenotypic variation and on long-accumulated experience to select crops with high-quality phenotypes as the dominant varieties to be preserved, and purposefully obtained offspring through hybridization and other means in order to obtain and cultivate good varieties. Later, thanks to the development of modern molecular biology, breeders could breed crops more efficiently and accurately by using molecular marker or genome sequencing techniques to analyze the relationship between phenotypic variation and molecular markers or genotypes. In recent years, crop breeding and genetic data have grown rapidly and accumulated in large quantities, laying a foundation for new breeding modes. Computational breeding is the intersection of crop breeding and computer science; it refers to researching and guiding crop variety breeding by means of computational methods such as big data analysis and artificial intelligence.
Studying the biological association between crop genotypes and phenotypes is an important goal in crop breeding. Ridge regression best linear unbiased prediction (rrBLUP) is one of the most commonly used models for genotype-phenotype association prediction in crop breeding; it is a linear mixed model that obtains individual breeding values by predicting random effects. In addition, a variety of machine learning and deep learning models have been used to associate crop genotypes with phenotypes and to predict the corresponding phenotype from genotypes or markers. However, the prediction accuracy of these methods and models differs greatly across data sets, and no single method or model is optimal across different environments, different populations and different crop species. Moreover, the methods disclosed so far are only suitable for the case where the training set and the test set belong to the same generation population, which limits their practical value: the genotypes and phenotypes of crop populations of different generations differ greatly, and it is difficult for general artificial intelligence methods to establish correlations between genotype and phenotype across generations.
Disclosure of Invention
The application aims to overcome the defects of the prior art and provides a breeding cross-generation phenotype prediction method and system, electronic equipment and a storage medium.
In order to achieve the above purpose, the present application provides the following specific technical solutions:
According to a first aspect of the embodiments of the present application, there is provided a breeding cross-generation phenotype prediction method based on ensemble learning, the method including:
acquiring genotype data of a high-generation crop and the corresponding later-generation crop, and acquiring target phenotype data of the high-generation crop;
calculating an evaluation function based on a genetic algorithm, and screening from the genotype data, according to the evaluation function, a subset of the high-generation crop genotype data that is genetically related to the corresponding later-generation crop;
training a number of different machine learning models on the subset;
calculating an evaluation index for each machine learning model according to the data type of the target phenotype data of the high-generation crop, ranking the evaluation indexes, and selecting the first K machine learning models as basic learners;
stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner;
and inputting genotype data of the later-generation crop into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.
According to a second aspect of the embodiments of the present application, there is provided a breeding cross-generation phenotype prediction system based on ensemble learning, the system comprising:
a data acquisition module, which acquires genotype data of the high-generation crop and the corresponding later-generation crop and acquires target phenotype data of the high-generation crop;
a genotype data subset screening module, which calculates an evaluation function based on a genetic algorithm and screens from the genotype data, according to the evaluation function, a subset of the high-generation crop genotype data that is genetically related to the corresponding later-generation crop;
a machine learning model training module, which trains a number of different machine learning models on the subset;
a basic learner selection module, which calculates an evaluation index for each machine learning model according to the data type of the target phenotype data of the high-generation crop, ranks the evaluation indexes, and selects the first K machine learning models as basic learners;
a meta learner training module, which stacks the K basic learners based on an ensemble learning method and trains to obtain a meta learner;
and a target phenotype data prediction module, which inputs genotype data of the later-generation crop into the basic learners to obtain metadata, and inputs the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.
According to a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory and a processor, the memory being coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the above breeding cross-generation phenotype prediction method based on ensemble learning.
According to a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above breeding cross-generation phenotype prediction method based on ensemble learning.
Compared with the prior art, the application has the beneficial effects that:
(1) The application calculates an evaluation function based on a genetic algorithm and, according to the evaluation function, screens from the genotype data a subset of the high-generation crop that is genetically related to the corresponding later-generation crop. By measuring the relation between the genotype data of the high-generation crop and that of the corresponding later-generation crop, and removing samples of the original high-generation crop that are not genetically related to the later-generation crop, it addresses the problem of low data quality when machine learning is applied to breeding phenotype prediction.
(2) The application trains a number of different machine learning models on the subset, screens them according to the evaluation indexes to obtain basic learners, stacks the basic learners based on an ensemble learning method, and trains to obtain a meta learner. It thereby combines the advantages of multiple machine learning models, which learn hidden features of different types, into one meta learner, and addresses the low prediction accuracy and narrow application range of a single machine learning model.
(3) The application uses the genotype and target phenotype data of the high-generation crop as the training set to build the ensemble learning model and uses the later-generation crop as the test set, so that the target phenotype can be predicted at the immature stage when only genotype data of the later-generation crop are available. Later-generation crops with a poor target phenotype can thus be screened out in advance, which reduces breeding cost, optimizes seed selection and cultivation, and improves breeding efficiency.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the breeding cross-generation phenotype prediction method provided by the application;
FIG. 2 is a schematic diagram of the training set optimization process provided by the application;
FIG. 3 is a schematic diagram of the ensemble learning process provided by the application;
FIG. 4 is a block diagram of the breeding cross-generation phenotype prediction system provided by the application;
FIG. 5 is a schematic diagram of the electronic device provided by the application.
Detailed Description
The embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the application.
The features of the following examples and embodiments may be combined with each other without any conflict.
As shown in FIG. 1, the present application provides a breeding cross-generation phenotype prediction method, which comprises the following steps:
Step S1, genotype data of the high-generation crop and the corresponding later-generation crop are acquired, and target phenotype data of the high-generation crop are collected.
The high-generation crop and the corresponding later-generation crop include grain crops such as soybean, rice, wheat and corn. Here, the later-generation crop refers to the next generation, or a generation further removed, produced from the high-generation crop by crossing or selfing.
Genotype data is a collection of genotype markers or single nucleotide polymorphisms (SNPs) obtained using a gene chip. The target phenotype data comprise continuous numerical variables, including yield, plant height, hundred-grain weight, protein content and oil content; discrete numerical variables, including main stem node number, pod number, grain number per spike and maturity; and categorical variables, including color, disease resistance and cold resistance.
Step S1 further includes preprocessing the collected genotype data and target phenotype data, where the preprocessing includes methods such as imputation and normalization.
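The preprocessing in step S1 can be illustrated with a minimal sketch. The source names only the two operations (filling in missing values and normalization), so the details here are assumptions: missing genotype calls are stored as `None` and filled with the rounded per-SNP mean, and numerical phenotypes are z-score normalized. The function names `impute_genotypes` and `zscore` are hypothetical.

```python
from statistics import mean, pstdev

def impute_genotypes(samples):
    """Replace None entries with the rounded per-SNP mean of the observed values.

    Assumption: each sample is a list of genotype codes, None marks a missing call.
    """
    n_snps = len(samples[0])
    filled = [list(row) for row in samples]
    for k in range(n_snps):
        observed = [row[k] for row in samples if row[k] is not None]
        fill = round(mean(observed)) if observed else 0
        for row in filled:
            if row[k] is None:
                row[k] = fill
    return filled

def zscore(values):
    """Standardize a numerical phenotype to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Other imputation rules (for example, the most frequent genotype per SNP) would fit the same interface.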
Step S2, an evaluation function is calculated based on a genetic algorithm to measure the relation between the genotype data of the high-generation crop and that of the corresponding later-generation crop, and a subset of the high-generation crop that is genetically related to the corresponding later-generation crop is screened, according to the evaluation function, from the genotype data obtained in step S1.
Further, as shown in FIG. 2, step S2 specifically includes:
Step S201, based on the genetic algorithm, a high-generation crop genotype dataset of sample size $N_{TRN}$ and a later-generation crop genotype dataset of sample size $N_{TST}$ are given. The number of iterations of the genetic algorithm is set to $N$, and the sample size of the subset is set to $N_{OPT}$ ($N_{OPT} < N_{TRN}$).
Step S202, a subset of sample size $N_{OPT}$ is randomly initialized from the high-generation crop genotype dataset, and the evaluation function is calculated from the genotype data. The evaluation function is selected from average genetic distance, Euclidean distance, Hamming distance and cosine similarity.
Step S203, it is judged whether the value of the evaluation function meets the evaluation condition;
when the value of the evaluation function does not meet the evaluation condition, selection, crossover and mutation are applied to the subset to obtain the subset for the next iteration of the genetic algorithm;
when the value of the evaluation function meets the evaluation condition, or tends to be stable, or the number of iterations reaches $N$, iteration stops and the final optimized subset is obtained.
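The loop of steps S201 to S203 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the source does not specify the genetic operators, so the sketch assumes that fitness is the sum of per-sample average distances (lower is better), that selection keeps the better half of the population, that crossover draws a child from the union of two parents, and that mutation swaps in one sample from outside the child. It stops only after a fixed number of iterations and omits the stability check. All names and default parameters are hypothetical.

```python
import random

def ga_select_subset(avg_dist, n_opt, n_iter=50, pop_size=20, mut_rate=0.2, seed=0):
    """Search for n_opt training-sample indices with a small total average distance.

    avg_dist[i] is the average distance of high-generation sample i to all
    later-generation samples (smaller is assumed to mean more closely related).
    """
    rng = random.Random(seed)
    all_idx = list(range(len(avg_dist)))

    def fitness(subset):
        return sum(avg_dist[i] for i in subset)

    # S202: random initial population of candidate subsets
    population = [rng.sample(all_idx, n_opt) for _ in range(pop_size)]
    for _ in range(n_iter):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]      # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(set(a) | set(b))           # crossover: mix both parents
            child = rng.sample(pool, n_opt)
            if rng.random() < mut_rate:            # mutation: swap in an outside sample
                outside = [i for i in all_idx if i not in child]
                if outside:
                    child[rng.randrange(n_opt)] = rng.choice(outside)
            children.append(child)
        population = parents + children
    return sorted(min(population, key=fitness))
```

Because the better half is carried over unchanged, the best subset found so far is never lost between iterations.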
In this example, the evaluation function is preferably the average genetic distance. The average genetic distance between a high-generation crop genotype data sample and the later-generation crop genotype data samples is determined from the genetic relationship matrix:
$$\bar{A}_i = \frac{1}{N_{TST}} \sum_{j=1}^{N_{TST}} A_{ij}$$
where $\bar{A}_i$ is the average genetic distance from the $i$-th high-generation crop genotype data sample to all later-generation crop genotype data samples, $N_{TST}$ is the number of later-generation samples, and $A_{ij}$ is the genetic distance between the $i$-th high-generation sample and the $j$-th later-generation sample in the genetic relationship matrix $A$. $A_{ij}$ is computed from the following quantities: $M$ is the number of SNP loci; $x_{ki}$ is the genotype value of the $k$-th SNP locus of the $i$-th high-generation sample; $x_{kj}$ is the genotype value of the $k$-th SNP locus of the $j$-th later-generation sample; and $p_k$ is the frequency of the minor allele at the $k$-th SNP locus over all populations, where the minor allele is the mutated base at the SNP locus.
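The exact expression for $A_{ij}$ is not recoverable from the text. As a hedged illustration only, the sketch below assumes a VanRaden-style genomic relationship coefficient, with genotypes coded -1/0/1 and centered by $2p_k - 1$; this form, the use of the frequency of the allele coded as +1, and the function names are all assumptions. In this assumed form a larger value means closer kinship, so it would have to be negated or otherwise transformed before being used as a distance to minimize.

```python
def allele_freq(genotypes):
    """Frequency p_k of the allele coded as +1 at each SNP, for -1/0/1 codes."""
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[k] + 1 for row in genotypes) / (2 * n) for k in range(m)]

def relationship(x_i, x_j, p):
    """VanRaden-style coefficient A_ij between two samples (assumed form).

    Undefined when every SNP is monomorphic (all p_k equal to 0 or 1).
    """
    num = sum((a - (2 * pk - 1)) * (b - (2 * pk - 1)) for a, b, pk in zip(x_i, x_j, p))
    den = 2 * sum(pk * (1 - pk) for pk in p)
    return num / den

def average_relationship(train, test):
    """Mean of A_ij over all later-generation samples j, for each training sample i."""
    p = allele_freq(train + test)
    return [sum(relationship(xi, xj, p) for xj in test) / len(test) for xi in train]
```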
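Alternatively, the evaluation function may be the Euclidean distance, calculated with $x_{ki}$, $x_{kj}$ and $M$ as defined above:
$$d^{E}_{ij} = \sqrt{\sum_{k=1}^{M} \left(x_{ki} - x_{kj}\right)^2}$$
Alternatively, the evaluation function may be the Hamming distance, calculated as:
$$d^{H}_{ij} = \sum_{k=1}^{M} x_{ki} \oplus x_{kj}$$
where $\oplus$ denotes exclusive or.
Alternatively, the evaluation function may be the cosine similarity, calculated as:
$$\cos_{ij} = \frac{\sum_{k=1}^{M} x_{ki}\, x_{kj}}{\sqrt{\sum_{k=1}^{M} x_{ki}^2}\ \sqrt{\sum_{k=1}^{M} x_{kj}^2}}$$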
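The three alternative evaluation functions can be written compactly for two genotype vectors. This is a minimal sketch using the standard textbook definitions; treating the exclusive or as "the two genotype values differ" and returning 0.0 when a vector has zero norm (possible with -1/0/1 coding if a sample is heterozygous at every locus) are assumptions. Note that cosine similarity grows with relatedness, whereas the two distances shrink.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of SNP sites at which the two genotype vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two genotype vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```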
In step S2, the relation between the genotype data of the high-generation crop and that of the corresponding later-generation crop is measured, and samples of the original high-generation crop that are not genetically related to the corresponding later-generation crop are removed. This addresses the problem of low data quality when machine learning is applied to breeding phenotype prediction.
Step S3, training a plurality of different machine learning models through the subset obtained in the step S2.
The machine learning models are selected from a ridge regression best linear unbiased prediction model, a linear model, a support vector machine, a random forest, gradient boosting, a deep neural network, a convolutional neural network and a locally connected neural network.
Step S4, an evaluation index is calculated for each machine learning model according to the data type of the target phenotype data of the high-generation crop, the evaluation indexes are ranked, and the first K machine learning models are selected as basic learners.
The evaluation index of the machine learning model is determined by the data type of the target phenotype: when the data type of the target phenotype is a numerical variable, the evaluation index is a correlation coefficient; when the data type of the target phenotype is a categorical variable, the evaluation index is the accuracy.
Selecting the first K machine learning models as basic learners comprises: in this example, the ridge regression best linear unbiased prediction model is used as a reference model, and the reference model together with the machine learning models whose evaluation index is higher than that of the reference model are selected as basic learners; when the number of machine learning models whose evaluation index is higher than that of the reference model is smaller than a threshold, the evaluation indexes are arranged in descending order and the first K machine learning models are selected as basic learners.
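The selection rule can be sketched as a small function. The source does not give values for K or for the threshold, so the defaults `k=3` and `min_count=2`, like the function name, are hypothetical; `scores` is assumed to map each model name to its evaluation index, with higher being better.

```python
def select_base_learners(scores, reference="rrBLUP", k=3, min_count=2):
    """Return the names of the models kept as basic learners.

    Keeps the reference model plus every model that beats it; if fewer than
    min_count models beat the reference, falls back to the top k overall.
    """
    ref_score = scores[reference]
    better = [name for name, s in scores.items() if name != reference and s > ref_score]
    if len(better) >= min_count:
        return [reference] + sorted(better, key=scores.get, reverse=True)
    # Fallback: too few models beat the reference, so take the top k overall.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

For instance, with scores of 0.60 for rrBLUP, 0.65 for SVR, 0.70 for RF and 0.55 for MLP, the rule keeps rrBLUP, RF and SVR.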
Step S5, the K basic learners are stacked based on an ensemble learning method, and a meta learner is obtained by training.
It should be noted that the meta learner is selected from a linear model, a decision tree and a shallow neural network.
In step S5, the advantages of multiple machine learning models, which learn hidden features of different types, are combined and integrated into the meta learner, which addresses the low prediction accuracy and narrow application range of a single machine learning model.
Step S6, genotype data of the later-generation crop are input into the basic learners to obtain metadata, and the metadata are input into the meta learner to obtain predicted target phenotype data of the later-generation crop.
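The two-stage prediction of step S6 can be sketched as follows, assuming each trained basic learner and the trained meta learner are available as plain Python callables. The function name and the callable interface are illustrative assumptions.

```python
def stacked_predict(base_learners, meta_learner, genotypes):
    """Two-stage prediction: basic learners produce metadata, the meta learner the phenotype.

    base_learners: list of K callables, each mapping a genotype vector to a prediction.
    meta_learner:  callable mapping a K-dimensional metadata vector to the final prediction.
    """
    predictions = []
    for x in genotypes:
        metadata = [learner(x) for learner in base_learners]  # K-dimensional vector
        predictions.append(meta_learner(metadata))
    return predictions
```

With two toy basic learners (`sum` and `max` of the genotype vector) and a meta learner that averages them, the samples `[1, 0, -1]` and `[1, 1, 0]` give predictions 0.5 and 1.5.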
Example 1:
As shown in FIG. 1, this embodiment takes soybean protein content prediction as an example to describe in further detail the breeding cross-generation phenotype prediction method based on ensemble learning provided by the application. The method specifically includes the following steps:
S1, genotype data of the high-generation crop and the corresponding later-generation crop are acquired, and target phenotype data of the high-generation crop are acquired.
Specifically, the high-generation crop and the corresponding later-generation crop are high-generation soybean and later-generation soybean, respectively, and samples of both are first collected and stored. Their DNA is extracted, and the SNP information of the high-generation and later-generation soybean is determined using gene chip technology. The gene chip determines the number M of SNP loci, and the genotype value of each SNP locus of each sample can only be -1 (homozygote 0/0), 0 (heterozygote 0/1) or 1 (homozygote 1/1). The protein content of the high-generation soybean is then measured with a protein content measuring instrument as the target phenotype data.
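The genotype coding described above can be expressed as a small lookup. The sketch assumes that calls arrive as strings such as `"0/1"`; the input format, the extra `"1/0"` key and the names are assumptions, while the -1/0/1 values follow the text.

```python
# Mapping from genotype call to numeric code, following the coding in the text.
GENOTYPE_CODES = {"0/0": -1, "0/1": 0, "1/0": 0, "1/1": 1}

def encode_sample(calls):
    """Map a list of genotype calls (one per SNP locus) to a vector of -1/0/1 values."""
    return [GENOTYPE_CODES[c] for c in calls]
```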
S2, an evaluation function is calculated based on a genetic algorithm to measure the relation between the genotype data of the high-generation crop and that of the corresponding later-generation crop, and a subset of the high-generation crop that is genetically related to the corresponding later-generation crop is screened, according to the evaluation function, from the genotype data obtained in step S1.
Specifically, referring to FIG. 2, the subset screening method is as follows:
Based on the genetic algorithm, a high-generation crop genotype dataset of sample size $N_{TRN}$ and a later-generation crop genotype dataset of sample size $N_{TST}$ are given. First, the number of iterations of the algorithm is set to $N$ and the sample size of the subset is set to $N_{OPT}$ ($N_{OPT} < N_{TRN}$). Then a subset of sample size $N_{OPT}$ is randomly initialized from the high-generation crop genotype dataset, and the evaluation function is calculated from the genotype data. In this example, the evaluation function is the average genetic distance. The average genetic distance between a high-generation crop genotype data sample and the later-generation crop genotype data samples is determined from the genetic relationship matrix.
High-generation crop genotype data samples with a small average genetic distance are selected, and the rest of the subset is selected randomly; after the selection, crossover and mutation operations, the subset for the next iteration is obtained and the evaluation function is calculated again. These steps are repeated, and iteration stops when the number of iterations is reached or the sum of the average genetic distances of the subset becomes stable (that is, it no longer decreases beyond a certain range), giving the final optimized subset.
S3, training a plurality of different machine learning models through the subset obtained in the step S2.
Specifically, referring to FIG. 3, the input of each base learner is SNP genotype data, where each sample is a vector of M SNP genotype values, and the output is the predicted protein content. Eight basic learners are trained on the optimized training set: ridge regression best linear unbiased prediction (rrBLUP), elastic net (ENT), support vector regression (SVR), random forest (RF), extreme gradient boosting (XGBoost), multi-layer perceptron (MLP), convolutional neural network (CNN) and locally connected neural network (LCNN).
The loss function of the machine learning model is determined by the phenotype data type. Protein content is a numerical variable, so the loss function is the mean square error (MSE), calculated as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
where $y_i$ and $\hat{y}_i$ are the true and predicted values of the protein content of the $i$-th sample, respectively, and $n$ is the number of samples.
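Alternatively, if the output is a categorical variable such as disease resistance, the loss function may be the cross entropy, calculated (for a two-class output) as:
$$L_{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$
where $y_i$ and $\hat{y}_i$ are the true value and the predicted value of the disease resistance of the $i$-th sample, respectively, and $n$ is the number of samples.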
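Both loss functions are short enough to write out directly. The sketch below assumes the two-class form of cross entropy, with labels coded 0/1 and predictions given as probabilities of class 1; the clipping constant `eps` and the function names are illustrative assumptions.

```python
import math

def mse(y_true, y_pred):
    """Mean square error between true and predicted numerical phenotypes."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross entropy for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)
```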
The hyper-parameters of each model are determined by a grid search and ten-fold cross-validation method.
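Grid search with k-fold cross-validation can be sketched with a deliberately tiny model. As a stand-in for the eight learners, the sketch uses one-feature ridge regression without intercept, whose penalty `lam` is the only hyperparameter; the model choice, the contiguous (unshuffled) folds and all names are assumptions made to keep the example self-contained. It requires at least `k` samples so that no fold is empty.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def ridge_fit(x, y, lam):
    """One-feature ridge regression without intercept: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def grid_search_cv(x, y, lams, k=10):
    """Return the penalty with the lowest mean validation MSE over k folds."""
    folds = k_fold_indices(len(x), k)
    best_lam, best_err = None, float("inf")
    for lam in lams:
        err = 0.0
        for fold in folds:
            held = set(fold)
            tr = [i for i in range(len(x)) if i not in held]
            w = ridge_fit([x[i] for i in tr], [y[i] for i in tr], lam)
            err += sum((y[i] - w * x[i]) ** 2 for i in fold) / len(fold)
        if err / k < best_err:
            best_lam, best_err = lam, err / k
    return best_lam
```

In practice the samples would usually be shuffled before splitting, and the grid would cover every hyperparameter of the model being tuned.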
S4, calculating evaluation indexes of the machine learning models, arranging the evaluation indexes, and selecting the first K machine learning models as basic learners.
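The evaluation index is determined by the phenotype data type. Protein content is a numerical variable, so the evaluation index is the Pearson correlation coefficient between the predicted and the true protein content, calculated as:
$$r = \frac{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}\ \sqrt{\sum_{i=1}^{n} \left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$$
where $y_i$ and $\hat{y}_i$ are the true and predicted protein content of the $i$-th sample, $\bar{y}$ and $\bar{\hat{y}}$ are the mean of the true values and the mean of the predicted values of the protein content of the high-generation target phenotype data samples, respectively, and $n$ is the number of samples.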
Alternatively, the evaluation index may be the Spearman correlation coefficient, calculated as:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}, \qquad d_i = \mathrm{rank}(y_i) - \mathrm{rank}(\hat{y}_i)$$
where $d_i$ is the difference between the rank of the true value and the rank of the predicted value of the protein content of the $i$-th high-generation target phenotype data sample; $\mathrm{rank}(y_i)$ is the rank of the true protein content of the $i$-th sample when the true values are sorted from small to large; $\mathrm{rank}(\hat{y}_i)$ is the rank of its predicted protein content when the predicted values are sorted from small to large; and $n$ is the number of samples.
Optionally, if the output is a classification variable such as disease resistance, the evaluation index may be the prediction accuracy, calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of positive samples predicted as positive, TN is the number of negative samples predicted as negative, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative.
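A short sketch of the accuracy calculation follows. The 0/1 label coding (1 taken as the positive class, for example resistant) and the toy labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a two-class prediction."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, y_pred):
        if p == positive and y == positive:
            tp += 1
        elif p != positive and y != positive:
            tn += 1
        elif p == positive:
            fp += 1  # negative sample predicted as positive
        else:
            fn += 1  # positive sample predicted as negative
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of correctly predicted samples among all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Toy labels, for illustration only
resistance_true = [1, 1, 0, 0, 1]
resistance_pred = [1, 0, 0, 1, 1]
counts = confusion_counts(resistance_true, resistance_pred)
print(counts, accuracy(*counts))
```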
After training is completed, the Pearson correlation coefficient of each machine learning model on the target phenotype data samples of the high-generation crops is calculated. The rrBLUP model is taken as the reference model, and the Pearson correlation coefficient between its predicted values and the true values is taken as the threshold; the rrBLUP model and the models whose correlation coefficient is higher than the threshold are retained. When too few models have a correlation coefficient higher than that of the reference model, the first K models ranked from high to low by Pearson correlation coefficient are selected directly, so that K basic learners are used in total.
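The selection rule can be sketched as follows. Apart from rrBLUP, the model names and all scores are hypothetical, and the parameter `min_above` is an assumed way of expressing "too few models above the reference".

```python
def select_base_learners(scores, reference="rrBLUP", k=3, min_above=2):
    """Pick basic learners from a dict of model name -> evaluation index.

    The reference model and every model scoring above it are kept. If fewer
    than min_above models beat the reference, the top-k models by score are
    returned instead.
    """
    threshold = scores[reference]
    above = [m for m, s in scores.items() if m != reference and s > threshold]
    if len(above) < min_above:
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:k]
    return [reference] + above

# Hypothetical Pearson correlation coefficients, for illustration only
scores = {"rrBLUP": 0.60, "SVR": 0.65, "RF": 0.58, "XGBoost": 0.66, "Lasso": 0.55}
print(select_base_learners(scores, min_above=2))
print(select_base_learners(scores, k=3, min_above=3))
```

Note that the top-K fallback ranks all models, so the reference model is included only if it falls within the first K.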
S5, stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner.
Specifically, referring to FIG. 3, after the training and selection of the basic learners are completed, the genotype data of the optimized training set is input into each basic learner to obtain the predicted protein content of the optimized training set from each basic learner. Each sample therefore has K predicted protein contents, which form a K-dimensional vector used as metadata. A multi-layer perceptron (MLP) model is trained as the meta learner with the metadata of the optimized training set as input; its loss function is the MSE, and its hyper-parameters are determined by grid search and ten-fold cross-validation.
S6, inputting genotype data of the later-generation crops into the basic learners to obtain test set metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.
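The flow of steps S5 and S6 can be sketched as below. This is not the implementation of the application: the two basic learners are fixed toy functions, the genotypes and protein values are simulated, and a linear model fitted by gradient descent on the MSE is swapped in for the MLP meta learner purely to keep the example short and dependency-free.

```python
import random

def build_metadata(base_learners, genotypes):
    """One K-dimensional vector of basic-learner predictions per sample."""
    return [[learner(g) for learner in base_learners] for g in genotypes]

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def train_linear_meta_learner(metadata, targets, lr=0.05, epochs=500):
    """Fit weights and bias by gradient descent on the MSE.

    A linear stand-in for the MLP meta learner (assumption of this sketch).
    """
    n, k = len(metadata), len(metadata[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for z, y in zip(metadata, targets):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - y
            for j in range(k):
                grad_w[j] += 2.0 * err * z[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return lambda z: sum(wi * zi for wi, zi in zip(w, z)) + b

# Hypothetical basic learners (K = 2) acting on genotypes coded 0/1/2
def learner_a(g):
    return 0.5 * g[0] + 0.2 * g[1]

def learner_b(g):
    return 0.3 * g[1] + 0.4 * g[2]

rng = random.Random(0)
train_genotypes = [[rng.choice([0, 1, 2]) for _ in range(3)] for _ in range(30)]
train_protein = [g[0] + 0.5 * g[1] + 0.8 * g[2] + rng.gauss(0, 0.1)
                 for g in train_genotypes]

base_learners = [learner_a, learner_b]
train_meta = build_metadata(base_learners, train_genotypes)         # S5: metadata
meta_learner = train_linear_meta_learner(train_meta, train_protein)

later_genotypes = [[2, 1, 0], [0, 2, 2]]
test_meta = build_metadata(base_learners, later_genotypes)          # S6: test metadata
predicted = [meta_learner(z) for z in test_meta]
print(predicted)
```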
Example 2
As shown in FIG. 4, the present application further provides an ensemble learning-based breeding cross-representation prediction system, the system comprising:
the data acquisition module is used for acquiring genotype data of the high-generation crops and the corresponding later-generation crops and acquiring target phenotype data of the high-generation crops;
the genotype data subset screening module is used for calculating an evaluation function based on a genetic algorithm, and screening genotype data subsets which are genetically related to corresponding later-generation crops in the higher-generation crops from the genotype data according to the evaluation function;
a machine learning model training module that trains a number of different machine learning models through the subset;
the basic learner selection module calculates the evaluation index of each machine learning model according to the data type of the target phenotype data of the high-generation crops, ranks the evaluation indexes, and selects the first K machine learning models as basic learners;
the meta learner training module stacks the K basic learners based on the integrated learning method and trains the K basic learners to obtain meta learners;
and the target phenotype data prediction module is used for inputting genotype data of the later-generation crops into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.
The specific manner in which each module of the system in the above embodiment performs its operations has been described in detail in the method embodiment and is not repeated here.
Since the system embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without creative effort.
Correspondingly, the application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the ensemble learning-based breeding cross-representation prediction method described above. FIG. 5 is a hardware structure diagram of any device with data processing capability on which the ensemble learning-based breeding cross-representation prediction method provided in the embodiment of the present application runs. In addition to the processor, the memory and the network interface shown in FIG. 5, the device with data processing capability in this embodiment may also include other hardware according to its actual function, which is not described here again.
Correspondingly, the application also provides a computer readable storage medium, wherein computer instructions are stored on the computer readable storage medium, and the instructions are executed by a processor to realize the breeding cross-representation prediction method based on the integrated learning. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the data processing enabled devices described in any of the previous embodiments. The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like, provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing device. The computer readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing apparatus, and may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are merely for illustrating the design concept and features of the present application, and are intended to enable those skilled in the art to understand the content of the present application and implement the same, the scope of the present application is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present application are within the scope of the present application.
Claims (10)
1. A cross-representation prediction method for breeding based on ensemble learning, the method comprising:
acquiring genotype data of a high-generation crop and a corresponding later-generation crop, and acquiring target phenotype data of the high-generation crop;
calculating an evaluation function based on a genetic algorithm, and screening genotype data subsets which are genetically related to corresponding offspring crops in the higher generation crops from the genotype data according to the evaluation function;
training a number of different machine learning models through the subset;
calculating evaluation indexes of each machine learning model according to the data type of the target phenotype data of the high-generation crops, arranging the evaluation indexes, and selecting the first K machine learning models as basic learners;
stacking K basic learners based on an integrated learning method, and training to obtain meta learners;
and inputting genotype data of the later-generation crop into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.
2. The ensemble learning-based breeding cross-representation prediction method according to claim 1, wherein the high-generation crops and the corresponding later-generation crops comprise grain crops including soybeans, rice, wheat and corn; wherein the later-generation crop is a next-generation or alternate-generation crop produced by crossing or selfing the high-generation crop.
3. The ensemble learning-based breeding cross-representation prediction method of claim 1, wherein genotype data is a genotype marker or dataset of single nucleotide polymorphisms of crops; the target phenotype data comprises continuous numerical variables including yield, plant height, hundred grain weight, protein content and oil content, discrete numerical variables including main stem node number, pod number, spike grain number and maturity and classification variables including color, disease resistance and cold resistance.
4. The ensemble learning-based breeding cross-representation prediction method of claim 1, wherein calculating an evaluation function based on a genetic algorithm, and selecting a subset of genotype data having genetic correlation with a corresponding subsequent generation crop from genotype data according to the evaluation function comprises:
based on the genetic algorithm, giving a high-generation crop genotype dataset with a sample size of $N_{TRN}$ and a later-generation crop genotype dataset with a sample size of $N_{TST}$; setting the number of iterations of the genetic algorithm to N and the sample size of the subset to $N_{OPT}$, wherein $N_{OPT} < N_{TRN}$;
randomly selecting an initial subset with a sample size of $N_{OPT}$ from the genotype data, and calculating the evaluation function; the evaluation function is selected from average genetic distance, Euclidean distance, Hamming distance and cosine similarity;
judging whether the value of the evaluation function meets the evaluation condition;
when the value of the evaluation function does not meet the evaluation condition, applying selection, crossover and mutation to the subsets, and then performing iterative optimization based on the genetic algorithm;
and stopping iteration when the value of the evaluation function meets the evaluation condition, or the value of the evaluation function tends to be stable, or when the iteration number is N, so as to obtain a final optimized subset.
5. The ensemble learning-based breeding cross-representation prediction method of claim 4, wherein the evaluation function is an average genetic distance, and the average genetic distance between a genotype data sample of the high-generation crop and the genotype data samples of the later-generation crop is determined from a genetic relationship matrix, expressed as follows:
$$\bar{A}_i = \frac{1}{N_{TST}}\sum_{j=1}^{N_{TST}} A_{ij}, \qquad A_{ij} = \frac{\sum_{k=1}^{M}\left(x_{ki} - 2p_k\right)\left(x_{kj} - 2p_k\right)}{2\sum_{k=1}^{M} p_k\left(1 - p_k\right)}$$
wherein $\bar{A}_i$ is the average genetic distance from the ith high-generation crop genotype data sample to all later-generation crop genotype data samples, $A_{ij}$ is the genetic distance between the ith high-generation crop genotype data sample and the jth later-generation crop genotype data sample in the genetic relationship matrix A, M is the number of SNP loci, $x_{ki}$ is the genotype value of the kth SNP locus of the ith high-generation crop genotype data sample, $x_{kj}$ is the genotype value of the kth SNP locus of the jth later-generation crop genotype data sample, $p_k$ is the frequency of the minor allele at the kth SNP locus across all populations, and the minor allele is the mutated base at the SNP locus.
6. The ensemble learning-based breeding cross-representation prediction method as claimed in claim 1, wherein the evaluation index of the machine learning model is determined by the data type of the target phenotype; when the data type of the target phenotype is a numerical variable, the evaluation index of the machine learning model is a correlation coefficient; and when the data type of the target phenotype is a classification variable, the evaluation index of the machine learning model is the accuracy.
7. The ensemble learning-based breeding cross-representation prediction method of claim 1 or 6, wherein selecting the first K machine learning models as the base learner includes:
selecting a reference model, and taking the reference model and a machine learning model with an evaluation index higher than that of the reference model as a basic learner; and when the number of the machine learning models with the evaluation indexes higher than the reference model is smaller than the threshold value, arranging the evaluation indexes in descending order, and selecting the first K machine learning models as a basic learner.
8. An ensemble learning-based breeding cross-representation prediction system, the system comprising:
the data acquisition module is used for acquiring genotype data of the high-generation crops and the corresponding later-generation crops and acquiring target phenotype data of the high-generation crops;
the genotype data subset screening module is used for calculating an evaluation function based on a genetic algorithm, and screening genotype data subsets which are genetically related to corresponding later-generation crops in the higher-generation crops from the genotype data according to the evaluation function;
a machine learning model training module that trains a number of different machine learning models through the subset;
the basic learner selection module calculates the evaluation index of each machine learning model according to the data type of the target phenotype data of the high-generation crops, ranks the evaluation indexes, and selects the first K machine learning models as basic learners;
the meta learner training module stacks the K basic learners based on the integrated learning method and trains the K basic learners to obtain meta learners;
and the target phenotype data prediction module is used for inputting genotype data of the later-generation crops into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the ensemble learning based breeding cross-representation prediction method of any one of the preceding claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the ensemble learning based breeding cross-representation prediction method of any one of claims 1-7.
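For illustration, the subset screening recited in claims 4 and 5 might be sketched as follows. Several points are assumptions of this sketch rather than statements of the application: genotypes are coded 0/1/2 as counts of the counted allele, the relationship $A_{ij}$ takes a VanRaden-style form, a larger mean relationship to the later-generation samples is treated as better, the loop stops only on the iteration limit, and the population size, mutation rate and simulated data are arbitrary.

```python
import random

def allele_freqs(genotypes):
    """Frequency p_k of the counted allele at each SNP over all samples."""
    n, m = len(genotypes), len(genotypes[0])
    return [sum(g[k] for g in genotypes) / (2 * n) for k in range(m)]

def relationship(gi, gj, p):
    """VanRaden-style relationship A_ij between two genotype samples."""
    num = sum((a - 2 * pk) * (b - 2 * pk) for a, b, pk in zip(gi, gj, p))
    den = 2 * sum(pk * (1 - pk) for pk in p)
    return num / den

def mean_relationship(train, test, p):
    """Average of A_ij over all later-generation samples j, per training sample i."""
    return [sum(relationship(gi, gj, p) for gj in test) / len(test) for gi in train]

def fitness(subset, a_bar):
    """Evaluation function of a subset: mean of its samples' averages."""
    return sum(a_bar[i] for i in subset) / len(subset)

def genetic_subset_search(a_bar, n_opt, n_iter=50, pop_size=20, seed=0):
    """Search for a subset of n_opt training indices by selection, crossover, mutation."""
    rng = random.Random(seed)
    n_trn = len(a_bar)
    population = [rng.sample(range(n_trn), n_opt) for _ in range(pop_size)]
    for _ in range(n_iter):
        population.sort(key=lambda s: fitness(s, a_bar), reverse=True)
        parents = population[: pop_size // 2]             # selection (elitist)
        children = []
        while len(parents) + len(children) < pop_size:
            pa, pb = rng.sample(parents, 2)
            pool = list(set(pa) | set(pb))                # crossover: merge parents
            child = rng.sample(pool, n_opt)
            if rng.random() < 0.3:                        # mutation: swap one index
                outside = [i for i in range(n_trn) if i not in child]
                child[rng.randrange(n_opt)] = rng.choice(outside)
            children.append(child)
        population = parents + children
    return max(population, key=lambda s: fitness(s, a_bar))

# Simulated genotypes, for illustration only
rng = random.Random(42)
train = [[rng.choice([0, 1, 2]) for _ in range(25)] for _ in range(40)]  # N_TRN = 40
test = [[rng.choice([0, 1, 2]) for _ in range(25)] for _ in range(10)]   # N_TST = 10
p = allele_freqs(train + test)
a_bar = mean_relationship(train, test, p)
N_OPT = 10
best = genetic_subset_search(a_bar, N_OPT)
print(sorted(best), fitness(best, a_bar))
```

Because this particular evaluation function is a plain average over the subset, the best possible subset is simply the N_OPT samples with the largest averages; the genetic search is shown only because it also applies to evaluation functions that do not decompose in this way.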
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310373424.1A CN116580773A (en) | 2023-04-10 | 2023-04-10 | Breeding cross-representation type prediction method and system based on ensemble learning and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310373424.1A CN116580773A (en) | 2023-04-10 | 2023-04-10 | Breeding cross-representation type prediction method and system based on ensemble learning and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116580773A true CN116580773A (en) | 2023-08-11 |
Family
ID=87540295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310373424.1A Pending CN116580773A (en) | 2023-04-10 | 2023-04-10 | Breeding cross-representation type prediction method and system based on ensemble learning and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116580773A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117461500A (en) * | 2023-12-27 | 2024-01-30 | 北京市农林科学院智能装备技术研究中心 | Plant factory system, method, device, equipment and medium for accelerating crop breeding |
CN117672360A (en) * | 2024-01-30 | 2024-03-08 | 北京市农林科学院信息技术研究中心 | Genome selection method, device, equipment and medium based on transfer learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117461500A (en) * | 2023-12-27 | 2024-01-30 | 北京市农林科学院智能装备技术研究中心 | Plant factory system, method, device, equipment and medium for accelerating crop breeding |
CN117461500B (en) * | 2023-12-27 | 2024-04-02 | 北京市农林科学院智能装备技术研究中心 | Plant factory system, method, device, equipment and medium for accelerating crop breeding |
CN117672360A (en) * | 2024-01-30 | 2024-03-08 | 北京市农林科学院信息技术研究中心 | Genome selection method, device, equipment and medium based on transfer learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116580773A (en) | Breeding cross-representation type prediction method and system based on ensemble learning and electronic equipment | |
CN113519028B (en) | Methods and compositions for estimating or predicting genotypes and phenotypes | |
US8321147B2 (en) | Statistical approach for optimal use of genetic information collected on historical pedigrees, genotyped with dense marker maps, into routine pedigree analysis of active maize breeding populations | |
EP3326093B1 (en) | Improved computer implemented method for predicting true agronomical value of a plant | |
Spindel et al. | Genomic selection in rice breeding | |
CN115331732B (en) | Gene phenotype training and predicting method and device based on graph neural network | |
CN109727641B (en) | Whole genome prediction method and device | |
Jeon et al. | Digitalizing breeding in plants: A new trend of next-generation breeding based on genomic prediction | |
CN116168766A (en) | Variety identification method, system and terminal based on ensemble learning | |
CN109727642B (en) | Whole genome prediction method and device based on random forest model | |
CN115050419A (en) | Breeding method for selecting corn bract tightness based on whole genome | |
Tang et al. | A strategy for the acquisition and analysis of image-based phenome in rice during the whole growth period | |
CN114743601B (en) | Breeding method, device and equipment based on multigroup data and deep learning | |
CN113838519B (en) | Gene selection method and system based on adaptive gene interaction regularization elastic network model | |
CN108416189A (en) | A kind of variety of crops Heterosis identification method based on molecular marking technique | |
Hodel et al. | Linking genome signatures of selection and adaptation in non-model plants: exploring potential and limitations in the angiosperm Amborella | |
Cudic et al. | Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs | |
CN112102880A (en) | Method for identifying variety, and method and device for constructing prediction model thereof | |
Howard et al. | Overview of Genomic Prediction Methods and the Associated Assumptions on the Variance of Marker Effect, and on the Architecture of the Target Trait | |
CN115995262B (en) | Method for analyzing corn genetic mechanism based on random forest and LASSO regression | |
He et al. | MINED: an efficient mutual information based epistasis detection method to improve quantitative genetic trait prediction | |
Cowling et al. | Stomata Detector: High-throughput automation of stomata counting in a population of African rice (Oryza glaberrima) using transfer learning | |
Zhai | Application of Various Genomic Selection Models in Cotton Fiber Quality | |
WO2024020441A1 (en) | Artificial intelligence-guided marker assisted selection | |
Urazaliev | Bioinformation Technologies in Plant Breeding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||