CN116580773A

CN116580773A - Ensemble-learning-based cross-generation phenotype prediction method and system for breeding, and electronic device

Info

Publication number: CN116580773A
Application number: CN202310373424.1A
Authority: CN
Inventors: 董成航; 陈红阳; 冯献忠
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-04-10
Filing date: 2023-04-10
Publication date: 2023-08-11

Abstract

The application discloses an ensemble-learning-based method and system for cross-generation phenotype prediction in breeding, and an electronic device. The method comprises the following steps: acquiring genotype data of a high-generation crop and of the corresponding later-generation crop, and acquiring target phenotype data of the high-generation crop; calculating an evaluation function based on a genetic algorithm, and screening from the genotype data, according to the evaluation function, a subset of high-generation crop genotype data that is genetically related to the corresponding later-generation crop; training a number of different machine learning models on the subset; calculating an evaluation index for each machine learning model, ranking the evaluation indexes, and selecting the first K machine learning models as basic learners; stacking the K basic learners by an ensemble learning method and training a meta learner; and inputting the genotype data of the later-generation crop into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.

Description

Ensemble-learning-based cross-generation phenotype prediction method and system for breeding, and electronic device

Technical Field

The application relates to the field of computational crop breeding, and in particular to a cross-generation phenotype prediction method and system for breeding, an electronic device and a storage medium.

Background

Crop breeding is the artificial optimization of crop development, a process controlled by complex genes. In early crop breeding, breeders relied on visual observation of phenotypic variation and on long-accumulated experience to select crops with high-quality phenotypes to be preserved as superior varieties, and purposefully obtained offspring through hybridization and other means in order to obtain and cultivate good varieties. Subsequently, thanks to the development of modern molecular biology, breeders have been able to breed crops more efficiently and accurately by using molecular marker or genome sequencing techniques to analyze the relationship between phenotypic variation and molecular markers or genotypes. In recent years, crop breeding and genetic data have accumulated rapidly and in large quantities, providing a foundation for new breeding modes. Computational breeding lies at the intersection of crop breeding and computer science, and refers to studying and guiding crop variety breeding by computational methods such as big data analysis and artificial intelligence.

Studying the biological association between crop genotypes and phenotypes is an important goal of crop breeding. Ridge regression best linear unbiased prediction (rrBLUP) is one of the most commonly used models for genotype-phenotype association prediction in crop breeding; it is a linear mixed model that obtains individual breeding values by predicting random effects. In addition, a variety of machine learning and deep learning models have been used to associate crop genotypes with phenotypes and to predict phenotypes from genotypes or markers. However, the prediction accuracy of these methods and models differs greatly between data sets, and no single method or model is optimal across different environments, populations and crop species. Moreover, the methods disclosed so far are only suitable when the training set and the test set belong to the same generation, which limits their practical value; the genotypes and phenotypes of crop populations of different generations differ greatly, and it is difficult for general artificial intelligence methods to establish the association between genotype and phenotype across generations.

Disclosure of Invention

The application aims to overcome the shortcomings of the prior art and provides a cross-generation phenotype prediction method and system for breeding, an electronic device and a storage medium.

In order to achieve the above purpose, the present application provides the following specific technical solutions:

According to a first aspect of the embodiments of the present application, there is provided an ensemble-learning-based cross-generation phenotype prediction method for breeding, the method including:

genotype data of a high-generation crop and corresponding later-generation crop are acquired, and target phenotype data of the high-generation crop are acquired;

calculating an evaluation function based on a genetic algorithm, and screening genotype data subsets which are genetically related to corresponding offspring crops in the higher generation crops from the genotype data according to the evaluation function;

training a number of different machine learning models through the subset;

calculating the evaluation index of each machine learning model according to the data type of the target phenotype data of the high-generation crops, ranking the evaluation indexes, and selecting the first K machine learning models as basic learners;

stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner;

and inputting genotype data of the later-generation crop into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crop.

According to a second aspect of the embodiments of the present application, there is provided an ensemble-learning-based cross-generation phenotype prediction system for breeding, the system comprising:

the data acquisition module is used for acquiring genotype data of the high-generation crops and the corresponding later-generation crops and acquiring target phenotype data of the high-generation crops;

the genotype data subset screening module is used for calculating an evaluation function based on a genetic algorithm, and screening genotype data subsets which are genetically related to corresponding later-generation crops in the higher-generation crops from the genotype data according to the evaluation function;

a machine learning model training module that trains a number of different machine learning models through the subset;

the basic learner selection module, which calculates the evaluation index of each machine learning model according to the data type of the target phenotype data of the high-generation crops, ranks the evaluation indexes, and selects the first K machine learning models as basic learners;

the meta learner training module, which stacks the K basic learners based on an ensemble learning method and trains them to obtain a meta learner;

and the target phenotype data prediction module, which is used for inputting genotype data of the later-generation crops into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.

According to a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory and a processor, the memory being coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the above ensemble-learning-based cross-generation phenotype prediction method for breeding.

According to a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above ensemble-learning-based cross-generation phenotype prediction method for breeding.

Compared with the prior art, the application has the beneficial effects that:

(1) The application calculates an evaluation function based on a genetic algorithm and, according to that function, screens from the genotype data a subset of high-generation crop genotype data that is genetically related to the corresponding later-generation crop. By measuring the relationship between the genotype data of the high-generation crop and that of the corresponding later-generation crop, and removing the samples of the original high-generation crop that are not genetically related to the later-generation crop, the application addresses the problem of low data quality when machine learning is applied to breeding phenotype prediction.

(2) The application trains a plurality of different machine learning models on the subset, screens the models by their evaluation indexes to obtain basic learners, stacks the basic learners by an ensemble learning method, and trains a meta learner. By combining the strengths of several machine learning models, which learn different types of hidden features, into one meta learner, the application addresses the low prediction accuracy and narrow applicability of any single machine learning model.

(3) The application uses the genotype data of the high-generation crops as the training set for building the ensemble learning model and the later-generation crops as the test set, so that the target phenotype can be predicted before maturity when only the genotype data of the later-generation crops are available. Later-generation crops with poor target phenotypes can therefore be screened out in advance, which reduces breeding cost, optimizes crop selection and cultivation, and improves breeding efficiency.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a flow chart of the cross-generation phenotype prediction method for breeding provided by the application;

FIG. 2 is a schematic diagram of a training set optimization process provided by the present application;

FIG. 3 is a schematic diagram of an ensemble learning process provided by the present application;

FIG. 4 is a block diagram of the cross-generation phenotype prediction system for breeding provided by the application;

FIG. 5 is a schematic diagram of an electronic device according to the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The features of the following examples and embodiments may be combined with each other without any conflict.

As shown in fig. 1, the present application provides a cross-generation phenotype prediction method for breeding, which comprises the following steps:

step S1, genotype data of the high-generation crops and corresponding later-generation crops are obtained, and target phenotype data of the high-generation crops are collected.

The high-generation crops and the corresponding later-generation crops comprise grain crops such as soybean, rice, wheat and corn. The later-generation crop refers to the next generation, or a generation further removed, produced from the high-generation crop by crossing or selfing.

Genotype data is a collection of genotype markers or single nucleotide polymorphisms (SNPs) obtained using a gene chip. The target phenotype data comprise continuous numerical variables, including yield, plant height, hundred-grain weight, protein content and oil content; discrete numerical variables, including main stem node number, pod number, grains per spike and maturity; and classification variables, including color, disease resistance and cold resistance.

Step S1 further includes preprocessing the collected genotype data and target phenotype data, where the preprocessing comprises imputation of missing values, normalization and other methods.
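The preprocessing in step S1 can be illustrated with a minimal Python sketch. The text does not say which imputation or normalization method is used, so per-SNP mean imputation and z-score normalization are assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def impute_genotypes(genotypes):
    """Replace missing genotype calls (NaN) with the per-SNP mean.

    Mean imputation is an assumption; the source only says "imputation".
    """
    genotypes = np.asarray(genotypes, dtype=float)
    col_means = np.nanmean(genotypes, axis=0)
    # Keep observed calls and fill the gaps column by column.
    return np.where(np.isnan(genotypes), col_means, genotypes)

def standardize(values):
    """Scale a phenotype vector to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    std = values.std()
    return centered / std if std > 0 else centered
```

Rows are samples and columns are SNP loci; the same layout is assumed for any later step that consumes the imputed matrix.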

Step S2, calculating an evaluation function based on a genetic algorithm to measure the relationship between the genotype data of the high-generation crops and that of the corresponding later-generation crops, and screening from the genotype data obtained in step S1, according to the evaluation function, the subset of the high-generation crops that is genetically related to the corresponding later-generation crops.

Further, as shown in fig. 2, the step S2 specifically includes:

Step S201, based on the genetic algorithm, given a high-generation crop genotype dataset of sample size N_TRN and a later-generation crop genotype dataset of sample size N_TST, setting the number of iterations of the genetic algorithm to N and the sample size of the subset to N_OPT (N_OPT < N_TRN).

Step S202, randomly selecting an initial subset of sample size N_OPT and calculating the evaluation function from its genotype data. The evaluation function is selected from the group consisting of average genetic distance, Euclidean distance, Hamming distance and cosine similarity.

Step S203, judging whether the value of the evaluation function meets the evaluation condition;

when the value of the evaluation function does not meet the evaluation condition, performing selection, crossover and mutation on the subsets to obtain the subset for the next genetic algorithm iteration;

and stopping iteration when the value of the evaluation function meets the evaluation condition, or the value of the evaluation function tends to be stable, or when the iteration number is N, so as to obtain a final optimized subset.
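Steps S201 to S203 can be sketched as a small genetic algorithm over index subsets. This is an illustrative sketch, not the patented implementation: the population size, mutation rate, the elitist selection rule, the union-based crossover and the use of mean Euclidean distance as the evaluation function are all assumptions, and the function names are hypothetical. For brevity it stops only on the iteration count N and omits the "value has stabilized" stopping rule.

```python
import numpy as np

def mean_distance_to_test(train, test):
    """Mean Euclidean distance from each training sample to all test samples."""
    diff = train[:, None, :] - test[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).mean(axis=1)

def select_subset(train, test, n_opt, n_iter=50, pop_size=20,
                  mutation_rate=0.1, seed=0):
    """Search for n_opt training rows that lie close to the test set."""
    rng = np.random.default_rng(seed)
    n_trn = train.shape[0]
    per_sample = mean_distance_to_test(train, test)

    def fitness(subset):
        return per_sample[subset].mean()  # lower is better

    # S202: random initial subsets of size n_opt.
    population = [rng.choice(n_trn, size=n_opt, replace=False)
                  for _ in range(pop_size)]
    for _ in range(n_iter):
        # Selection: keep the better half of the population.
        population.sort(key=fitness)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            # Crossover: draw the child from the union of two parents.
            pool = np.union1d(parents[i], parents[j])
            child = rng.choice(pool, size=n_opt, replace=False)
            # Mutation: occasionally swap one member for an unused sample.
            if rng.random() < mutation_rate:
                unused = np.setdiff1d(np.arange(n_trn), child)
                if unused.size:
                    child[rng.integers(n_opt)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return np.sort(min(population, key=fitness))
```

The returned array holds row indices into the high-generation genotype matrix; `train[select_subset(train, test, n_opt)]` would then serve as the optimized training set.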

In this example, the evaluation function is preferably the average genetic distance. The average genetic distance between a high-generation crop genotype data sample and the later-generation crop genotype data samples is determined from a genetic relationship matrix A:

$$\bar{A}_i = \frac{1}{N_{TST}} \sum_{j=1}^{N_{TST}} A_{ij}$$

where $\bar{A}_i$ is the average genetic distance from the i-th high-generation crop genotype data sample to all later-generation crop genotype data samples; $A_{ij}$ is the genetic distance between the i-th high-generation crop genotype data sample and the j-th later-generation crop genotype data sample in the genetic relationship matrix A; M is the number of SNP loci; $x_{ki}$ is the genotype value of the k-th SNP locus of the i-th high-generation crop genotype data sample; $x_{kj}$ is the genotype value of the k-th SNP locus of the j-th later-generation crop genotype data sample; and $p_k$ is the frequency of the minor allele at the k-th SNP locus over all populations, the minor allele being the mutated base at the SNP locus.
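The entries of the genetic relationship matrix are not reproduced in this text. As an illustration only, the sketch below assumes the widely used form $A_{ij} = \frac{1}{M}\sum_k \frac{(x_{ki}-2p_k)(x_{kj}-2p_k)}{2p_k(1-p_k)}$ with genotypes coded 0/1/2 (genotypes coded -1/0/1 would first be shifted by +1). That formula, the coding and the function names are assumptions, not taken from the source.

```python
import numpy as np

def cross_relationship(train, test):
    """Relationship between each training sample and each test sample.

    Assumes genotypes coded 0/1/2 as copies of the counted allele.
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # Allele frequency p_k estimated over all samples of both populations.
    p = np.vstack([train, test]).mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # drop monomorphic SNPs (zero variance)
    scale = np.sqrt(2.0 * p[keep] * (1.0 - p[keep]))
    z_train = (train[:, keep] - 2.0 * p[keep]) / scale
    z_test = (test[:, keep] - 2.0 * p[keep]) / scale
    return z_train @ z_test.T / keep.sum()  # shape (N_TRN, N_TST)

def average_relationship(train, test):
    """Row means of A: one value per training sample."""
    return cross_relationship(train, test).mean(axis=1)
```

The row means correspond to the per-sample average over all later-generation samples described above.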

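Alternatively, the evaluation function may be the Euclidean distance, calculated as follows:

$$d^{E}_{ij} = \sqrt{\sum_{k=1}^{M} \left(x_{ki} - x_{kj}\right)^{2}}$$

Alternatively, the evaluation function may be the Hamming distance, calculated as follows:

$$d^{H}_{ij} = \sum_{k=1}^{M} x_{ki} \oplus x_{kj}$$

where $\oplus$ denotes exclusive or.

Alternatively, the evaluation function may be the cosine similarity, calculated as follows:

$$S_{ij} = \frac{\sum_{k=1}^{M} x_{ki}\, x_{kj}}{\sqrt{\sum_{k=1}^{M} x_{ki}^{2}}\; \sqrt{\sum_{k=1}^{M} x_{kj}^{2}}}$$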
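The three alternative evaluation functions can be sketched in Python for a pair of genotype vectors. The function names are hypothetical, and the Hamming distance is implemented here as a count of loci whose genotype values differ, which is one reading of the exclusive-or for non-binary genotype codes.

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

def hamming_distance(x_i, x_j):
    # Counts loci whose genotype values differ.
    return int(np.sum(np.asarray(x_i) != np.asarray(x_j)))

def cosine_similarity(x_i, x_j):
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    denom = np.linalg.norm(x_i) * np.linalg.norm(x_j)
    # An all-zero vector has no direction; return 0.0 by convention.
    return float(x_i @ x_j / denom) if denom > 0 else 0.0
```

Note that the two distances decrease as samples become more alike, whereas cosine similarity increases, so the evaluation condition has to be oriented accordingly.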

In step S2, measuring the relationship between the genotype data of the high-generation crops and that of the corresponding later-generation crops allows the samples of the original high-generation crops that are not genetically related to the corresponding later-generation crops to be removed, which addresses the problem of low data quality when machine learning is applied to breeding phenotype prediction.

Step S3, training a plurality of different machine learning models through the subset obtained in the step S2.

The machine learning models are selected from a ridge regression best linear unbiased prediction model, a linear model, a support vector machine, a random forest, gradient boosting, a deep neural network, a convolutional neural network and a locally connected neural network.

Step S4, calculating the evaluation index of each machine learning model according to the data type of the target phenotype data of the high-generation crops, ranking the evaluation indexes, and selecting the first K machine learning models as basic learners.

The evaluation index of the machine learning model is determined by the data type of the target phenotype; when the data type of the target phenotype is a numerical variable, the evaluation index of the machine learning model is a correlation coefficient; when the data type of the target phenotype is a classification variable, the evaluation index of the machine learning model is the accuracy.

Selecting the first K machine learning models as the basic learners comprises: in this example, a ridge regression best linear unbiased prediction model is used as a reference model, and the reference model together with every machine learning model whose evaluation index is higher than that of the reference model is selected as a basic learner; when the number of machine learning models whose evaluation index is higher than that of the reference model is smaller than a threshold, the evaluation indexes are arranged in descending order and the first K machine learning models are selected as the basic learners.
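The selection rule can be written as a short Python function. The values of `min_count` (the threshold on how many models must beat the reference) and `k` are illustrative assumptions, since the text leaves them open, and the function name is hypothetical.

```python
def select_base_learners(scores, reference="rrBLUP", min_count=3, k=4):
    """Pick basic learners from a {model name: evaluation index} mapping.

    min_count and k are illustrative values, not taken from the source.
    """
    ref_score = scores[reference]
    better = [name for name, score in scores.items()
              if name != reference and score > ref_score]
    if len(better) >= min_count:
        # Keep the reference model plus every model that beats it.
        return [reference] + sorted(better, key=scores.get, reverse=True)
    # Too few models beat the reference: fall back to the top K overall.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

For example, with evaluation indexes `{"rrBLUP": 0.5, "SVR": 0.6, "RF": 0.55, "XGBoost": 0.7, "MLP": 0.4}` and the defaults, three models beat the reference, so the function keeps rrBLUP together with XGBoost, SVR and RF.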

Step S5, stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner.

It should be noted that the meta learner is selected from a linear model, a decision tree and a shallow neural network.

In step S5, the strengths of multiple machine learning models, which learn different types of hidden features, are combined in the meta learner, which addresses the low prediction accuracy and narrow applicability of any single machine learning model.

Step S6, inputting genotype data of the later-generation crops into the basic learners to obtain metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.

Example 1:

As shown in fig. 1, this embodiment takes soybean protein content prediction as an example to describe the ensemble-learning-based cross-generation phenotype prediction method for breeding in further detail. The method specifically includes the following steps:

s1, genotype data of the high-generation crops and corresponding later-generation crops are acquired, and target phenotype data of the high-generation crops are acquired.

Specifically, the high-generation crops and the corresponding later-generation crops are high-generation soybeans and later-generation soybeans, respectively; samples of both are collected and stored. Their DNA is extracted, and the SNP information of the high-generation soybeans and later-generation soybeans is determined using gene chip technology. The gene chip determines the number M of SNP loci, and the genotype value of each SNP locus of each sample can only be -1 (homozygous 0/0), 0 (heterozygous 0/1) or 1 (homozygous 1/1). The protein content of the high-generation soybeans is then measured with a protein content measuring instrument and used as the target phenotype data.
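The genotype coding described above can be sketched as a lookup table. The string format of the calls and the handling of "1/0" as a heterozygote are assumptions for illustration; the source only gives the three numeric codes, and the function name is hypothetical.

```python
import numpy as np

# Coding from the text: 0/0 -> -1, 0/1 -> 0, 1/1 -> 1.
# Treating "1/0" like "0/1" is an assumption.
GENOTYPE_CODES = {"0/0": -1, "0/1": 0, "1/0": 0, "1/1": 1}

def encode_genotypes(calls):
    """Convert rows of genotype-call strings into an integer matrix."""
    return np.array([[GENOTYPE_CODES[call] for call in row] for row in calls],
                    dtype=int)
```

Each row of the result is one sample and each column one of the M SNP loci.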

S2, calculating an evaluation function based on a genetic algorithm to measure the relation between genotype data of the higher generation crops and genotype data of the corresponding later generation crops, and screening out subsets which are genetically related to the corresponding later generation crops in the higher generation crops from the genotype data obtained in the step S1 according to the evaluation function.

Specifically, referring to fig. 2, the subset screening method is:

Based on the genetic algorithm, given a high-generation crop genotype dataset of sample size N_TRN and a later-generation crop genotype dataset of sample size N_TST, the number of iterations of the algorithm is first set to N and the sample size of the subset to N_OPT (N_OPT < N_TRN). An initial subset of sample size N_OPT is then selected at random, and the evaluation function is calculated from its genotype data. In this example, the evaluation function is the average genetic distance. The average genetic distance between a high-generation crop genotype data sample and the later-generation crop genotype data samples is determined from a genetic relationship matrix.

High-generation crop genotype data samples with a small average genetic distance are retained and the rest of the subset is selected at random; after the selection, crossover and mutation operations, the subset for the next iteration is obtained and the evaluation function is calculated again. These steps are repeated, and iteration stops when the number of iterations is reached or when the sum of the average genetic distances of the subset becomes stable (that is, no longer decreases beyond a certain range), yielding the final optimized subset.

S3, training a plurality of different machine learning models through the subset obtained in the step S2.

Specifically, referring to fig. 3, the input of each machine learning model is SNP genotype data, where each sample is a vector of M SNP genotype values, and the output is the predicted protein content. Eight candidate models are trained on the optimized training set: ridge regression best linear unbiased prediction (rrBLUP), elastic net (ENT), support vector regression (SVR), random forest (RF), extreme gradient boosting (XGBoost), multi-layer perceptron (MLP), convolutional neural network (CNN) and locally connected neural network (LCNN).

The loss function of each machine learning model is determined by the phenotype data type. Protein content is a numerical variable, so the loss function is the mean squared error (MSE), calculated as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}$$

where $y_i$ and $\hat{y}_i$ are the true and predicted values of the protein content of the i-th sample, respectively, and $n$ is the number of samples.

Alternatively, if the output is a classification variable such as disease resistance, the loss function may be the cross entropy between $y_i$ and $\hat{y}_i$, which are the true value and the predicted value of the disease resistance of the i-th sample, respectively.
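Both loss functions can be sketched in a few lines of Python. The cross-entropy formula is not reproduced in this text, so the binary form used below (labels in {0, 1}, prediction given as the probability of class 1) is an assumption, as are the function names.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross entropy; y_prob is the predicted probability of class 1."""
    y_true = np.asarray(y_true, float)
    # Clip to avoid log(0) for fully confident predictions.
    y_prob = np.clip(np.asarray(y_prob, float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))
```

A phenotype with more than two classes would need the multi-class form with one predicted probability per class.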

The hyper-parameters of each model are determined by a grid search and ten-fold cross-validation method.
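Grid search with ten-fold cross-validation can be sketched with a closed-form ridge regression standing in for any of the models. The choice of ridge regression, the penalty grid and the function names are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, n_folds=10, seed=0):
    """Mean validation MSE of ridge regression over n_folds folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        trn_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[trn_idx], y[trn_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

def grid_search(X, y, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Return the penalty in grid with the lowest cross-validated MSE."""
    return min(grid, key=lambda lam: cv_mse(X, y, lam))
```

Using the same seed for every candidate keeps the folds identical across the grid, so the candidates are compared on the same splits.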

S4, calculating evaluation indexes of the machine learning models, arranging the evaluation indexes, and selecting the first K machine learning models as basic learners.

The evaluation index is determined by the phenotype data type. Protein content is a numerical variable, so the evaluation index is the Pearson correlation coefficient between the predicted protein content and the true protein content, calculated as follows:

$$r = \frac{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^{2}}\; \sqrt{\sum_{i=1}^{n} \left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}$$

where $\bar{y}$ and $\bar{\hat{y}}$ are the mean of the true values and the mean of the predicted values, respectively, of the protein content of the target phenotype data samples of the high-generation crops.

Alternatively, the evaluation index may be the Spearman correlation coefficient, calculated as follows:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\left(n^{2} - 1\right)}, \qquad d_i = \mathrm{rank}(y_i) - \mathrm{rank}(\hat{y}_i)$$

where $d_i$ is the difference between the rank of the true value and the rank of the predicted value of the protein content of the i-th high-generation crop target phenotype data sample; $\mathrm{rank}(y_i)$ is the rank of the true protein content of the i-th sample when the true values are sorted in ascending order; and $\mathrm{rank}(\hat{y}_i)$ is the rank of the predicted protein content of the i-th sample when the predicted values are sorted in ascending order.

Alternatively, if the output is a classification variable such as disease resistance, the evaluation index may be the prediction accuracy, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of positive samples predicted as positive, TN is the number of negative samples predicted as negative, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative.
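The three evaluation indexes can be sketched in Python as follows. The function names are hypothetical, and the Spearman implementation uses the rank-difference shortcut, which assumes there are no tied values.

```python
import numpy as np

def pearson(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    dt, dp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return float((dt @ dp) / np.sqrt((dt @ dt) * (dp @ dp)))

def spearman(y_true, y_pred):
    """Rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)); assumes no ties."""
    def rank(values):
        # Ascending ranks 0..n-1; the offset cancels in the difference.
        return np.argsort(np.argsort(np.asarray(values)))
    d = rank(y_true) - rank(y_pred)
    n = len(d)
    return float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, i.e. (TP + TN) / all samples."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

With tied phenotype values, a tie-aware rank (for example average ranks) would be needed instead of the shortcut formula.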

After training is completed, pearson correlation coefficients of each machine learning model on the target phenotype data samples of the high generation crops are calculated. Taking the rrBLUP model as a reference model, taking a Pearson correlation coefficient between a predicted value and a true value of the reference model as a threshold value, and reserving the rrBLUP model and a model with the correlation coefficient higher than the threshold value; when the number of models whose correlation coefficient is higher than the reference is too small, the first K models arranged from high to low in Pearson correlation coefficient are directly selected, and K basic learners are used in total.

S5, stacking the K basic learners based on an ensemble learning method, and training to obtain a meta learner.

Specifically, referring to fig. 3, after the basic learners have been trained and selected, the genotype data of the optimized training set are input into each basic learner to obtain the protein content predicted by that learner for the optimized training set. Each sample thus has K predicted protein contents, which form a K-dimensional vector used as metadata. A multi-layer perceptron (MLP) model is trained as the meta learner with the metadata of the optimized training set as input; its loss function is the MSE, and its hyperparameters are determined by grid search and ten-fold cross-validation.

S6, inputting genotype data of the later-generation crops into the basic learners to obtain test set metadata, and inputting the metadata into the meta learner to obtain predicted target phenotype data of the later-generation crops.
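The stacking in steps S5 and S6 can be sketched with NumPy alone. To stay self-contained, the sketch swaps in K ridge regressions with different penalties as basic learners and a least-squares linear model as the meta learner, rather than the eight models and the MLP named in the text; these substitutions, the penalty values and the function names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def add_bias(X):
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_stack(X_train, y_train, penalties=(0.1, 10.0, 1000.0)):
    """Train K ridge basic learners, then a linear meta learner."""
    Xb = add_bias(X_train)
    base = [ridge_fit(Xb, y_train, lam) for lam in penalties]
    # Metadata: one K-dimensional vector of predictions per sample.
    meta_train = np.column_stack([Xb @ w for w in base])
    # lstsq tolerates the strong collinearity between base predictions.
    meta = np.linalg.lstsq(add_bias(meta_train), y_train, rcond=None)[0]
    return base, meta

def predict_stack(model, X_test):
    """S6: basic learners produce metadata, the meta learner predicts."""
    base, meta = model
    meta_test = np.column_stack([add_bias(X_test) @ w for w in base])
    return add_bias(meta_test) @ meta
```

Here `X_train` plays the role of the optimized high-generation subset and `X_test` the later-generation genotypes, for which no phenotype is needed at prediction time.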

Example 2

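As shown in fig. 4, the present application further provides an ensemble-learning-based cross-generation phenotype prediction system for breeding, the system comprising the data acquisition module, the genotype data subset screening module, the machine learning model training module, the basic learner selection module, the meta learner training module and the target phenotype data prediction module described in the second aspect above.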

The specific manner in which each module of the system performs its operations has been described in detail in the method embodiments and is not repeated here.

Because the system embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without creative effort.

Correspondingly, the application also provides an electronic device comprising: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the ensemble-learning-based cross-generation phenotype prediction method for breeding described above. Fig. 5 shows the hardware structure of a device with data processing capability on which the method provided by the embodiments of the present application runs. In addition to the processor, memory and network interface shown in fig. 5, such a device may include other hardware according to its actual function, which is not described further here.

Correspondingly, the application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the ensemble-learning-based cross-generation phenotype prediction method for breeding described above. The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the previous embodiments. It may also be an external storage device provided on the device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.

The above embodiments merely illustrate the design concept and features of the present application and are intended to enable those skilled in the art to understand and implement it; the scope of the present application is not limited to these embodiments. All equivalent changes or modifications made according to the principles and design ideas of the present application therefore fall within its scope.

Claims

1. An ensemble-learning-based cross-generation phenotype prediction method for breeding, the method comprising:

training a number of different machine learning models through the subset;

calculating evaluation indexes of each machine learning model according to the data type of the target phenotype data of the high-generation crops, arranging the evaluation indexes, and selecting the first K machine learning models as basic learners;

2. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 1, wherein the high-generation crops and the corresponding later-generation crops comprise grain crops including soybean, rice, wheat and corn; and wherein the later-generation crop is the next generation, or a generation further removed, produced by crossing or selfing the high-generation crop.

3. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 1, wherein the genotype data is a dataset of genotype markers or single nucleotide polymorphisms of the crops; and the target phenotype data comprise continuous numerical variables including yield, plant height, hundred-grain weight, protein content and oil content, discrete numerical variables including main stem node number, pod number, grains per spike and maturity, and classification variables including color, disease resistance and cold resistance.

4. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 1, wherein calculating an evaluation function based on a genetic algorithm, and screening from the genotype data, according to the evaluation function, a subset of genotype data that is genetically related to the corresponding later-generation crop comprises:

based on the genetic algorithm, given a high-generation crop genotype dataset of sample size N_TRN and a later-generation crop genotype dataset of sample size N_TST, setting the number of iterations of the genetic algorithm to N and the sample size of the subset to N_OPT, wherein N_OPT < N_TRN;

randomly selecting an initial subset of sample size N_OPT and calculating the evaluation function from its genotype data, the evaluation function being selected from average genetic distance, Euclidean distance, Hamming distance and cosine similarity;

judging whether the value of the evaluation function meets the evaluation condition;

when the value of the evaluation function does not meet the evaluation condition, performing selection, crossover and mutation on the subsets, and then performing iterative optimization based on the genetic algorithm;

5. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 4, wherein the evaluation function is the average genetic distance, and the average genetic distance between the genotype data sample of the high-generation crop and the genotype data sample of the later-generation crop is determined by a genetic relationship matrix expressed as follows:

6. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 1, wherein the evaluation index of the machine learning model is determined by the data type of the target phenotype; when the data type of the target phenotype is a numerical variable, the evaluation index of the machine learning model is a correlation coefficient; and when the data type of the target phenotype is a classification variable, the evaluation index of the machine learning model is the accuracy.

7. The ensemble-learning-based cross-generation phenotype prediction method for breeding according to claim 1 or 6, wherein selecting the first K machine learning models as the basic learners includes:

selecting a reference model, and taking the reference model and a machine learning model with an evaluation index higher than that of the reference model as a basic learner; and when the number of the machine learning models with the evaluation indexes higher than the reference model is smaller than the threshold value, arranging the evaluation indexes in descending order, and selecting the first K machine learning models as a basic learner.

8. An ensemble-learning-based cross-generation phenotype prediction system for breeding, the system comprising:

9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the ensemble-learning-based cross-generation phenotype prediction method for breeding of any one of claims 1-7.

10. A computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the ensemble-learning-based cross-generation phenotype prediction method for breeding of any one of claims 1-7.