CN114360651A - Genome prediction method, prediction system and application - Google Patents


Info

Publication number: CN114360651A
Application number: CN202111620265.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 刘士凯, 杨奔, 李琪
Current Assignee: Ocean University of China (as listed; may be inaccurate)
Original Assignee: Ocean University of China
Legal status: Pending (as listed; an assumption, not a legal conclusion)
Prior art keywords: genome, genotype, prediction, value, prediction model
Application filed by Ocean University of China
Priority to CN202111620265.8A
Publication of CN114360651A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the technical fields of genomics, bioinformatics, genome prediction and genomic breeding, and discloses a genome prediction method, a prediction system and an application. The genome prediction method comprises: acquiring genotype information; constructing a genome prediction model; estimating an effect value for each genotype; ranking all genotypes from smallest to largest effect value and stepwise discarding the genotypes with the smallest effect values from the genome prediction model; estimating the prediction accuracy of the model rebuilt after each discard; and selecting the genotype matrix whose number of genotypes gave the highest prediction accuracy to construct the final genome prediction model. The invention reduces genotyping cost and improves prediction accuracy, widening the applicability of genome prediction in human disease risk prediction and in animal and plant breeding.

Description

Genome prediction method, prediction system and application
Technical Field
The invention belongs to the technical field of genomics, bioinformatics, genome prediction and genome breeding, and particularly relates to a genome prediction method, a prediction system and application.
Background
With the rapid development of high-throughput sequencing technology, researchers can now assemble genomes and sequence whole genomes of target species accurately and quickly, and genetics and genomics research has entered a stage of rapid development. High-quality genome assemblies and inexpensive, efficient genotyping techniques make highly accurate genotypes available for research and thereby drive the development of genome prediction. Genome prediction is an efficient and powerful tool for human complex traits, disease risk prediction and selective breeding: a genome prediction model is built in a reference population from phenotypic and genotypic information, and is then used to predict the potential performance of a candidate population for which only genotypic information is available.
The most classical genome prediction model is genomic best linear unbiased prediction (GBLUP), in which a genomic relationship matrix G built from genotype information replaces the pedigree relationship matrix A of traditional best linear unbiased prediction. This avoids the error caused by Mendelian sampling and gives better predictions, and GBLUP is widely used in animal and plant selective breeding. Beyond GBLUP, further genome prediction models have been developed, such as various Bayesian methods. Owing to their higher prediction accuracy, these Bayesian genome prediction models have become the most widely used. However, because the genetic control of different traits in different species is complex, no single model is optimal for all species and all traits, and the best model must still be chosen for each species and trait.
In addition, widely used genome prediction models estimate the genomic breeding values of a candidate population from a large number of genome-wide genotypes and at most a few thousand individuals, which causes severe overfitting and degrades prediction performance. Although it has long been recognized that an excess of genotypes can harm prediction performance, there is still no good solution. Preventing overfitting is widely discussed in machine learning, but few solutions address overfitting in genome prediction models. Conventional genome prediction models are tuned by refining the assumed distributions of genotype effect values and of their variances, and the prediction performance of genomic models has rarely been improved from the standpoint of preventing overfitting.
For these reasons, the overfitting caused by introducing high-dimensional genotype data into genome prediction models urgently needs to be solved, so as to reduce noise in the model, improve its prediction performance, and allow genome prediction to be applied more widely in disease risk prediction and in animal and plant breeding.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Traditional genome prediction models use tens of thousands of genotypes to build the prediction model. Introducing such high-dimensional data creates the "large p, small n" problem and leads to severe overfitting. Although existing models such as the Bayesian LASSO tune the model parameters and add penalty terms, overfitting in genome prediction remains unsolved, so genome prediction accuracy in disease risk prediction and in animal and plant breeding is low.
(2) Building existing genome prediction models requires genotyping variation across the whole genome, so genotyping costs are high, which limits the application of genome prediction in disease risk prediction and in animal and plant breeding and makes it less practical.
The difficulties in solving the above problems and defects are: (1) the biological significance of how genomic variation influences the model is ignored; models are optimized purely from a data perspective, without considering the burden that trait-irrelevant genotype information places on the model; (2) overfitting has rarely been studied in the field of genome prediction.
The significance of solving the above problems and defects is that the prediction accuracy of existing genome prediction models is greatly improved, high prediction accuracy can be obtained by typing only the few genotypes related to the trait, and genotyping cost is reduced.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide a genome prediction method, a prediction system, a readable storage medium and an application, and in particular a method for improving the prediction accuracy of a genome prediction model.
The technical scheme is as follows: a genome prediction method applied to an information data processing terminal, comprising: stepwise discarding the genotypes with relatively low effect values from the full genotype matrix, retaining only the optimal genotype matrix of large-effect genotypes to construct a genome prediction model, and predicting the phenotypes of different species from genomic information.
In one embodiment, the genome prediction method specifically comprises the following steps:
First, measure the phenotypic value of the trait in the target species to obtain phenotype information, and genotype the individuals that have phenotypic values to obtain genotype information;
Second, construct a genome prediction model based on all genotypes using the genotype information and the phenotype information;
Third, estimate the effect value of each genotype in the constructed genome prediction model. Based on the association between the phenotype information and the genotype information, and on each model's prior assumptions about the distribution of effect values and of their variances, the parameters of the distribution followed by the marker effect values are obtained by maximum likelihood estimation or by a Bayesian method based on Gibbs sampling, and the effect value of each genotype is estimated from them;
Fourth, rank all genotypes from smallest to largest effect value and stepwise discard genotypes from the genome prediction model, starting with the smallest effect values and removing a fixed number at each step (for example, 200 genotypes at a time), so that a new genome prediction model is built from the retained genotypes with larger effect values;
Fifth, use cross-validation to evaluate the prediction accuracy of each genome prediction model built from the genotypes retained after a discard step, by computing the Pearson correlation coefficient, or the area under the ROC curve, between the genomic breeding values of the individuals whose phenotypic values were masked and their actual phenotypic values.
The Pearson correlation coefficient is calculated as:
$$r = \frac{\operatorname{cov}(y, \mathrm{GEBV})}{\sqrt{\operatorname{var}(y)\,\operatorname{var}(\mathrm{GEBV})}}$$
where $y$ is the masked phenotypic value and GEBV is the genomic breeding value. The area under the ROC curve (AUC) is calculated as:
$$\mathrm{AUC} = \frac{U_1}{n_1 n_0}$$
$$U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}$$
where $n_1$ is the number of surviving individuals (phenotypic value recorded as 1), $n_0$ is the number of dead individuals (phenotypic value recorded as 0), and $R_1$ is the sum of the ranks of the surviving individuals.
Sixth, once all genotypes have been discarded from the genome prediction model, the accuracy evaluation is complete; the genotype matrix corresponding to the number of genotypes at which prediction accuracy was highest is then selected to construct the genome prediction model, which gives the highest genome prediction accuracy.
The genome prediction model is built on the optimal genotype matrix obtained by iteratively discarding the markers with smaller effect values, which markedly reduces the overfitting caused by introducing high-dimensional data.
In one embodiment, the genotype information in step one is based on single nucleotide polymorphism (SNP) typing. Only biallelic loci are retained, coded as 0, 1 and 2 for the major homozygous genotype AA, the heterozygous genotype Aa and the minor homozygous genotype aa, respectively, and the genotype information used to construct the genome prediction model contains no missing values.
In one embodiment, the genome prediction model in step two is constructed from all genotype data by an indirect method, and every constructed genome prediction model is a linear model based on additive effects;
the linear model is expressed as y = μ + Zu + e;
where y is the vector of phenotypic values of the individuals, μ is the intercept, Z is the genotype matrix with n rows and m columns, n is the number of samples used to construct the genome prediction model, m is the number of genotypes, u is the effect value matrix with m rows and 1 column whose i-th row is the effect value of the genotype in the i-th column of Z, and e is the residual;
the sum of the products of all genotypes and their effect values is the genomic breeding value.
In one embodiment, the effect values in step three are estimated in the genome prediction model constructed from all genotype data, and the model produces one effect value per genotype;
in step four, the genotypes are sorted from smallest to largest by the absolute value of their effect values.
In one embodiment, the cross-validation in step five comprises: masking the phenotype information of a subset of individuals while retaining their genotype information, estimating the genomic breeding values of the masked individuals with the genome prediction model, and correlating these values with the actual phenotypes to obtain the prediction accuracy of the model;
the genotype matrix used in step six is obtained by iteratively discarding the genotypes with smaller effect values;
the genome prediction model is constructed from the optimal genotype matrix, that is, the matrix with the highest prediction accuracy identified while discarding the markers with smaller effect values.
Another object of the present invention is to provide a genome prediction system comprising an information data processing terminal that includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
stepwise discarding the genotypes with relatively low effect values from the full genotype matrix, retaining only the optimal genotype matrix of large-effect genotypes to construct a genome prediction model, and performing prediction on the phenotype and genotype data of different species.
Another object of the present invention is to provide an application of the genome prediction method to prediction on the phenotypic and genotypic data of different species: sorghum, dairy cattle, horses, rainbow trout and carp.
By combining all the technical schemes, the invention has the advantages and positive effects that:
By stepwise discarding the genotypes with lower effect values from the genome prediction model and retaining only large-effect genotypes, the invention reduces the harm of overfitting caused by introducing high-dimensional genotype data into the model. Compared with the current common practice of building the genome prediction model from all genotype data, the method markedly improves the accuracy of genome prediction; the highest prediction accuracy is obtained by typing only the large-effect genotypes, which lowers genotyping cost and widens the applicability of genome prediction in human disease risk prediction and in animal and plant breeding.
In the invention, "all genotypes" denotes the prediction accuracy obtained when the genome prediction model is built from all genotypes, as in the prior art, and "optimal genotype" denotes the prediction accuracy obtained with the optimal genotype matrix. Relative to the prior-art models built from all genotypes, the genome prediction method markedly improves prediction accuracy in the 12 traits of the 5 species tested.
In addition, the invention markedly improves the prediction accuracy of existing genome prediction models: typing only the large-effect genotypes is sufficient, and the resulting accuracy is markedly higher than that of a prediction model built from all genotypes, so both prediction accuracy and suitability for large-scale sample typing improve over existing prediction models.
The optimal genotype matrix obtained by the invention can be used for various traditional genome prediction models based on additive effect, and can also be used for various prediction models based on machine learning algorithm, such as random forest, support vector machine and the like.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart of a genome prediction method according to an embodiment of the present invention.
FIG. 2 shows the improvement in prediction accuracy obtained with the optimal genotype matrix, relative to a genome prediction model constructed from all genotypes, in the sorghum data set according to an embodiment of the present invention: FIG. 2(a), stalk length; FIG. 2(b), stalk diameter; FIG. 2(c), total weight; FIG. 2(d), stalk number.
FIG. 3 shows the improvement in prediction accuracy obtained with the optimal genotype matrix, relative to a genome prediction model constructed from all genotypes, in the dairy cattle and horse data sets according to an embodiment of the present invention: FIG. 3(a), milk fat percentage in dairy cattle; FIG. 3(b), somatic cell score in dairy cattle; FIG. 3(c), milk yield in dairy cattle; FIG. 3(d), coat color in horses.
FIG. 4 shows the improvement in prediction accuracy obtained with the optimal genotype matrix, relative to a genome prediction model constructed from all genotypes, in the rainbow trout and carp data sets according to an embodiment of the present invention: FIG. 4(a), survival status in rainbow trout; FIG. 4(b), survival time in rainbow trout; FIG. 4(c), body weight in carp; FIG. 4(d), body length in carp.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, it is capable of modification in various respects, all without departing from the spirit and scope of the present invention.
The invention provides a method for solving the overfitting caused by introducing high-dimensional genotype data into conventional genome prediction models, namely a genome prediction method comprising: stepwise discarding the genotypes with lower effect values and retaining only the optimal genotype matrix of large-effect genotypes to construct the genome prediction model, which yields the highest prediction accuracy. Compared with the conventional approach of building the model from all genotype information, the method markedly improves the prediction performance of the genome prediction model; the highest prediction accuracy is obtained by typing only the optimal genotype matrix, and genotyping cost is reduced.
The technical solution of the present invention is further described below with reference to specific examples.
Example 1:
As shown in FIG. 1, the present invention provides a genome prediction method comprising:
S101, measuring the phenotypic value of the target trait under study (for example growth, body weight or disease risk) to obtain phenotype information, and genotyping the individuals that have phenotypic values (for example by whole-genome resequencing, reduced-representation genome sequencing or gene chips) to obtain genotype information.
S102, constructing a genome prediction model based on all genotypes using the genotype information and the phenotype information.
S103, estimating the effect value of each genotype in the constructed genome prediction model by maximum likelihood estimation or by a Bayesian method based on Gibbs sampling, according to the association between the phenotype information and the genotype information.
S104, ranking all genotypes from smallest to largest effect value and stepwise discarding genotypes from the genome prediction model, starting with the smallest effect value.
S105, using cross-validation, evaluating the prediction accuracy of the model rebuilt after each discard step by calculating the correlation between the genomic breeding values of the individuals with masked phenotypic values and their actual phenotypic values.
S106, once all genotypes have been discarded from the genome prediction model, completing the accuracy evaluation and selecting the genotype matrix corresponding to the number of genotypes at which prediction accuracy was highest to construct the genome prediction model, so that the highest genome prediction accuracy is obtained.
In a preferred embodiment of the present invention, the genotype information of step S101 is based on single nucleotide polymorphism (SNP) typing. Only biallelic loci are retained, coded as 0, 1 and 2 for the major homozygous genotype AA, the heterozygous genotype Aa and the minor homozygous genotype aa, respectively, and the genotype information used to construct the genome prediction model must contain no missing values.
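The 0/1/2 coding can be illustrated with a minimal sketch. It assumes each SNP call arrives as a two-letter string of allele letters (for example "AG"); the input format and the function name `encode_snp` are illustrative assumptions, not part of the patent. Loci that are not biallelic, or that contain a missing call, are rejected so that they can be dropped before the genotype matrix is built.

```python
from collections import Counter

def encode_snp(calls):
    """Encode one SNP as minor-allele counts: 0 = AA, 1 = Aa, 2 = aa.

    `calls` holds one two-letter genotype string per individual.
    Returns None when the locus is not biallelic or has a missing call.
    """
    if any(len(c) != 2 or not c.isalpha() for c in calls):
        return None                       # missing or malformed call
    counts = Counter("".join(calls))      # allele counts across individuals
    if len(counts) != 2:
        return None                       # monomorphic or multi-allelic locus
    # Minor allele = rarer allele; ties are broken alphabetically (assumption).
    minor = min(counts, key=lambda a: (counts[a], a))
    return [c.count(minor) for c in calls]

print(encode_snp(["AA", "AG", "GG", "AA"]))   # [0, 1, 2, 0]
```

Each encoded SNP becomes one column of the genotype matrix Z; a locus that returns None is simply left out.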
In a preferred embodiment of the present invention, the genome prediction model in step S102 is constructed from all genotype data by an indirect method, and every constructed genome prediction model is a linear model based on additive effects;
the linear model is expressed as y = μ + Zu + e;
where y is the vector of phenotypic values of the individuals, μ is the intercept, Z is the genotype matrix with n rows and m columns, n is the number of samples used to construct the genome prediction model, m is the number of genotypes, u is the effect value matrix with m rows and 1 column whose i-th row is the effect value of the genotype in the i-th column of Z, and e is the residual;
the sum of the products of all genotypes and their effect values is the genomic breeding value.
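As a small worked illustration of the last statement, the sketch below computes the genomic breeding value of each individual as the product Zu. The function name `genomic_breeding_values` and the toy numbers are assumptions made for the example.

```python
def genomic_breeding_values(Z, u):
    """Return Z u: for each individual, the sum of genotype code x effect value."""
    if any(len(row) != len(u) for row in Z):
        raise ValueError("each row of Z needs one entry per effect value")
    return [sum(z * e for z, e in zip(row, u)) for row in Z]

# Two individuals, three markers coded 0/1/2.
Z = [[0, 1, 2],
     [2, 0, 1]]
u = [0.5, -1.0, 0.25]
print(genomic_breeding_values(Z, u))   # [-0.5, 1.25]
```

For the first individual the value is 0(0.5) + 1(-1.0) + 2(0.25) = -0.5.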
In a preferred embodiment of the present invention, step S103 obtains, in the constructed model, the parameters of the effect value distribution according to each model's prior assumptions about the effect values and their variance distribution, and estimates the effect value of each genotype from them.
In a preferred embodiment of the present invention, the effect values in step S103 are estimated in the genome prediction model constructed from the data of all genotypes, and the model produces one effect value per genotype.
In a preferred embodiment of the present invention, the genotypes in step S104 are sorted from smallest to largest by the absolute value of each genotype's effect value rather than by its signed value.
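The ranking and stepwise discard of step S104 can be sketched as follows. This is a minimal illustration: the helper names `rank_by_abs_effect` and `retained_sets` are hypothetical, and the batch size is a parameter (the patent's example uses 200).

```python
def rank_by_abs_effect(effects):
    """Marker indices ordered from smallest to largest |effect value|."""
    return sorted(range(len(effects)), key=lambda j: abs(effects[j]))

def retained_sets(effects, batch=200):
    """Yield the marker indices retained at each step.

    The first set holds all markers; each later set has lost the next
    `batch` markers with the smallest |effect value|.
    """
    order = rank_by_abs_effect(effects)
    for start in range(0, len(order), batch):
        yield sorted(order[start:])

effects = [0.8, -0.05, 0.3, -1.2, 0.01]
for kept in retained_sets(effects, batch=2):
    print(kept)
# [0, 1, 2, 3, 4]
# [0, 2, 3]
# [3]
```

Sorting on the absolute value is what keeps marker 3 (effect -1.2) until the end; sorting on the signed value would have discarded it first.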
In a preferred embodiment of the present invention, the prediction accuracy of the genome prediction model in step S105 is calculated by cross-validation: the phenotype information of a subset of individuals is masked while their genotype information is retained, the genomic breeding values of the masked individuals are estimated with the genome prediction model, and these values are compared with the actual phenotypes to obtain the prediction accuracy of the model. For a binary trait, accuracy is measured as the area under the ROC curve of the genomic breeding values against the actual phenotypic values; for a non-binary trait, it is measured as the Pearson correlation coefficient between the genomic breeding values and the actual phenotypic values. The closer the value is to 1, the higher the prediction accuracy.
The Pearson correlation coefficient is calculated as:
$$r = \frac{\operatorname{cov}(y, \mathrm{GEBV})}{\sqrt{\operatorname{var}(y)\,\operatorname{var}(\mathrm{GEBV})}}$$
where $y$ is the masked phenotypic value and GEBV is the genomic breeding value. The area under the ROC curve (AUC) is calculated as:
$$\mathrm{AUC} = \frac{U_1}{n_1 n_0}$$
$$U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}$$
where $n_1$ is the number of surviving individuals (phenotypic value recorded as 1), $n_0$ is the number of dead individuals (phenotypic value recorded as 0), and $R_1$ is the sum of the ranks of the surviving individuals.
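Both accuracy measures can be computed with a few lines of standard-library Python. The sketch below is illustrative only. It assumes that the ranks entering the sum for surviving individuals are ascending ranks of the genomic breeding values over all masked individuals, and that tied values share their mean rank; the patent text does not state how ties are handled.

```python
from math import sqrt

def pearson(y, gebv):
    """Pearson correlation between phenotypic values and breeding values."""
    n = len(y)
    my, mg = sum(y) / n, sum(gebv) / n
    cov = sum((a - my) * (b - mg) for a, b in zip(y, gebv))
    var_y = sum((a - my) ** 2 for a in y)
    var_g = sum((b - mg) ** 2 for b in gebv)
    return cov / sqrt(var_y * var_g)

def average_ranks(values):
    """Ascending 1-based ranks; tied values share their mean rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1        # positions i..j are ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def auc(status, gebv):
    """Rank-based AUC: (R1 - n1(n1 + 1)/2) / (n1 * n0), with status 1 = survived."""
    ranks = average_ranks(gebv)
    n1 = sum(1 for s in status if s == 1)
    n0 = len(status) - n1
    r1 = sum(r for r, s in zip(ranks, status) if s == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)

print(pearson([1, 2, 3], [2, 4, 6]))              # 1.0
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

In the second call the survivors hold ranks 2 and 4, so the rank sum is 6, the numerator is 6 - 3 = 3, and the AUC is 3 / (2 x 2) = 0.75.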
In a preferred embodiment of the present invention, the genotype matrix used for the highest prediction accuracy of the genome prediction model is obtained by successively discarding genotypes with smaller effect values and repeating the iteration.
In a preferred embodiment of the invention, the prediction accuracy of the model constructed after discarding the genotype with smaller effect value each time is evaluated by calculating the Pearson correlation coefficient or the area under the ROC curve of the genomic breeding value and the actual phenotype value of the individuals with masked phenotype values.
In a preferred embodiment of the present invention, the model finally used for genome prediction in step S106 is constructed based on the optimal genotype matrix with the highest prediction accuracy determined by discarding the smaller effect value markers, rather than the genome prediction model constructed using all genotypes, which is commonly used now.
Example 2:
the invention provides a genome prediction method (a method for improving the prediction accuracy of a genome prediction model), which comprises the following steps:
(1) phenotypic and genotypic data was collected for different species, including sorghum, cows, horses, rainbow trout and carp.
The sorghum data comprise 3260 genotypes and 1020 samples, genotyped by restriction-site-associated DNA sequencing; the phenotype data are stalk length (CL), total weight (TW), stalk diameter (CD) and stalk number (CN). The dairy cattle data comprise 42551 genotypes and 5024 samples, genotyped with a 50K SNP chip; the phenotype data are milk fat percentage (fp), somatic cell score (scs) and milk yield (my). The horse data comprise 50621 genotypes and 480 samples, genotyped with a 54K SNP chip; the phenotype data are coat color (cc). The rainbow trout data comprise 1934 samples and 27490 genotypes, genotyped with a 57K SNP chip; the phenotype data are survival time (sd) and survival status (ss) after challenge with Piscirickettsia salmonis, the latter being a binary trait that records the death or survival of each individual after challenge as 0 or 1. The carp data comprise 1259 samples and 15615 genotypes, genotyped by restriction-site-associated DNA sequencing; the phenotype data are standard length (sl) and body weight (bw). All of the above data were taken from published articles: sorghum (Ishimori M, Takanashi H, Hamazaki K et al. Dissecting the Genetic Architecture of Biofuel-Related Traits in a Sorghum Breeding Population. G3 Genes|Genomes|Genetics 2020; 10: 4565-4577), dairy cattle (Accuracy of whole-genome prediction using a genetic architecture-enhanced variance-covariance matrix. G3 Genes|Genomes|Genetics; 5: 615-627) and horse (KAML: improving genomic prediction accuracy of complex traits using machine learning determined parameters); the rainbow trout and carp data were likewise taken from published articles.
(2) Genome prediction model construction
All genotype and phenotype data were imported into R, and a genome prediction model was constructed. The models were built with the R packages BGLR and rrBLUP, which correspond to genome prediction models based on Bayesian methods and on ridge regression best linear unbiased prediction, respectively.
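As an illustration of the additive linear model y = μ + Zu + e that underlies both packages, the following minimal Python sketch estimates genotype effects by plain ridge regression. It is a stand-in and not the rrBLUP or BGLR implementation: the penalty `lam` is fixed by hand here (an assumption), whereas rrBLUP derives the amount of shrinkage from estimated variance components, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ridge_effects(Z, y, lam=1.0):
    """Return (mu, u) for y = mu + Z u + e with a ridge penalty lam on u.

    lam is a hand-picked illustration value, not an estimated variance ratio.
    """
    n, m = len(Z), len(Z[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in Z) / n for j in range(m)]
    # Centre genotypes and phenotypes so the intercept is not penalised.
    Zc = [[row[j] - col_means[j] for j in range(m)] for row in Z]
    yc = [v - y_mean for v in y]
    # Normal equations: (Zc'Zc + lam I) u = Zc'yc
    lhs = [[sum(r[p] * r[q] for r in Zc) + (lam if p == q else 0.0)
            for q in range(m)] for p in range(m)]
    rhs = [sum(r[p] * v for r, v in zip(Zc, yc)) for p in range(m)]
    u = solve_linear_system(lhs, rhs)
    # Express the intercept on the scale of the uncentred genotype codes.
    mu = y_mean - sum(col_means[j] * u[j] for j in range(m))
    return mu, u


def genomic_breeding_values(Z, u):
    """Sum of genotype code times effect value for each individual (Z u)."""
    return [sum(z * e for z, e in zip(row, u)) for row in Z]


# Toy data: 6 individuals, 3 genotypes coded 0/1/2 (made-up values).
Z = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [0, 2, 1], [1, 1, 1], [2, 0, 2]]
y = [1.0, 3.0, 5.0, 1.5, 3.0, 4.0]
mu, u = fit_ridge_effects(Z, y, lam=1.0)
gebv = genomic_breeding_values(Z, u)
```

Solving the penalised normal equations directly is only practical for small numbers of genotypes; it is used here because it keeps the sketch dependency-free.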
(3) Genotype effect value estimation and ranking
The constructed genome prediction model produces an effect value for each genotype, and the genotypes are then ranked by the absolute value of their effect values. The rrBLUP method estimates the effect value of each genotype by maximum likelihood estimation, whereas the Bayesian methods build Markov chain Monte Carlo chains by Gibbs sampling, draw samples from the posterior distribution of each genotype parameter, and compute the effect value of each genotype from those samples. The genotypes are then sorted in ascending order of effect value using the order function in R.
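The ranking step can be sketched in a few lines of Python. The snippet below is an illustrative equivalent of calling order on the absolute effect values in R, except that the returned indices are 0-based; the function names and the effect values are made up for the example.

```python
def order_by_abs_effect(effects):
    """Return genotype indices sorted by |effect value|, smallest first.

    Ties keep their original order because Python's sort is stable.
    """
    return sorted(range(len(effects)), key=lambda j: abs(effects[j]))


def keep_after_discarding(ranking, n_discard):
    """Column indices left after dropping the n_discard smallest-effect genotypes."""
    return sorted(ranking[n_discard:])


effects = [0.8, -0.05, 0.3, -1.2, 0.01]       # hypothetical effect values
ranking = order_by_abs_effect(effects)
print(ranking)                                # → [4, 1, 2, 0, 3]
print(keep_after_discarding(ranking, 2))      # → [0, 2, 3]
```

Sorting indices rather than the effect values themselves keeps the link back to the columns of the genotype matrix, which is what the later discarding step needs.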
(4) Genotype matrix determination for highest prediction accuracy
Following the ranking of the genotype effect values, genotypes are discarded from the genome prediction model step by step, starting with the smallest effects, with 200 genotypes discarded at each step. After each batch of 200 genotypes is discarded, five-fold cross-validation is performed on the genome prediction model built from the remaining genotype matrix to evaluate its prediction accuracy; this is repeated until the model no longer contains any genotype data. Five-fold cross-validation is implemented as follows: all genotype and phenotype data are randomly divided into 5 subsets; the phenotype information of one subset is masked, leaving only its genotype information; a genome prediction model is built from the other 4 subsets and the genotype effect values are calculated; the effect value matrix is then multiplied by the genotype data of the masked subset to produce a genomic breeding value for each individual. If the phenotype is a continuous trait, accuracy is expressed as the Pearson correlation coefficient between the genomic breeding values and the phenotypes; if the phenotype is a binary trait, the area under the ROC curve is used as the accuracy. To limit sampling error in the five-fold cross-validation, every prediction accuracy evaluation was repeated 50 times, and the mean and standard deviation of the 50 replicates were taken as the final result.
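The stepwise discarding and five-fold cross-validation described above can be sketched as follows. This is a minimal illustration under several assumptions: a fixed-penalty ridge regression stands in for the rrBLUP and BGLR models, the data are simulated, the step size is 3 rather than 200 because the toy matrix has only 12 genotypes, only the continuous-trait (Pearson) accuracy is shown, and the cross-validation is run once rather than 50 times. All function names are hypothetical.

```python
import random
from math import sqrt


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))) / aug[i][i]
    return x


def ridge_effects(Z, y, lam):
    """Ridge estimate of u in y = mu + Z u + e (assumed stand-in model)."""
    n, m = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(m)]
    Zc = [[row[j] - means[j] for j in range(m)] for row in Z]
    ybar = sum(y) / n
    lhs = [[sum(r[p] * r[q] for r in Zc) + (lam if p == q else 0.0)
            for q in range(m)] for p in range(m)]
    rhs = [sum(r[p] * (v - ybar) for r, v in zip(Zc, y)) for p in range(m)]
    return solve(lhs, rhs)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def five_fold_accuracy(Z, y, cols, lam, rng, folds=5):
    """Mean Pearson accuracy over the folds, using only the genotype columns in cols."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]                      # individuals with masked phenotypes
        held = set(test)
        train = [i for i in idx if i not in held]
        u = ridge_effects([[Z[i][j] for j in cols] for i in train],
                          [y[i] for i in train], lam)
        gebv = [sum(Z[i][j] * e for j, e in zip(cols, u)) for i in test]
        scores.append(pearson(gebv, [y[i] for i in test]))
    return sum(scores) / folds


def backward_elimination(Z, y, step, lam=1.0, seed=0):
    """Rank genotypes once on the full model, then drop `step` of them at a time."""
    rng = random.Random(seed)
    m = len(Z[0])
    u_full = ridge_effects(Z, y, lam)
    ranking = sorted(range(m), key=lambda j: abs(u_full[j]))
    results = []
    for n_drop in range(0, m, step):
        cols = sorted(ranking[n_drop:])
        results.append((len(cols), five_fold_accuracy(Z, y, cols, lam, rng)))
    return results, max(results, key=lambda r: r[1])


# Simulated toy data: 60 individuals, 12 genotypes, two of which carry real effects.
rng = random.Random(1)
Z = [[rng.randint(0, 2) for _ in range(12)] for _ in range(60)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0.0, 0.5) for row in Z]
results, best = backward_elimination(Z, y, step=3)
for n_kept, acc in results:
    print(n_kept, round(acc, 3))
print("number of genotypes at highest accuracy:", best[0])
```

The list `results` holds one (number of retained genotypes, accuracy) pair per discarding step, and `best` is the pair with the highest accuracy, which identifies the genotype matrix to keep.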
(5) To verify the validity of the present invention, it was tested on all the data sets in (1). Fig. 2 shows the improvement obtained for the 4 phenotypic traits of sorghum: compared with a genome prediction model constructed using the full genotype matrix, the model constructed using the optimal genotype matrix is significantly more accurate, and this holds for all 4 traits. Fig. 3 shows the improvement in prediction accuracy of the genome prediction models in the cattle and horse data sets, and Fig. 4 shows the improvement in prediction accuracy of the genome prediction models for rainbow trout and carp. For all 12 traits of the 5 species tested, the optimal genome prediction model constructed after discarding small-effect genotypes had higher prediction accuracy than the genome prediction model constructed using all genotypes, indicating that the method is applicable to different traits and different species.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure should be limited only by the attached claims.

Claims (10)

1. A genome prediction method applied to an information data processing terminal, the genome prediction method comprising: gradually discarding genotypes with relatively small effect values from the full genotype matrix, retaining only the optimal genotype matrix with large effect values to construct a genome prediction model, and predicting phenotypes of different species by using genome information.
2. The genome prediction method according to claim 1, specifically comprising the steps of:
step one, determining a phenotypic value of a trait of a target species to obtain phenotype information, and genotyping the individuals having the phenotypic value to obtain genotype information;
step two, constructing a genome prediction model based on all genotypes by using the genotype information and the phenotype information;
step three, estimating the effect value of each genotype in the constructed genome prediction model by maximum likelihood estimation or by a Bayesian method based on Gibbs sampling, according to the association between the phenotype information and the genotype information;
step four, ranking all genotypes from small to large according to the size of each genotype effect value, and gradually discarding the genotypes with the smallest effect values from the genome prediction model;
step five, using a cross-validation method to evaluate the prediction accuracy of the model after each discarding of small-effect genotypes, by calculating the correlation between the genomic breeding values of the individuals whose phenotypic values are masked and their actual phenotypic values;
step six, after all genotypes have been discarded from the genome prediction model, finishing the evaluation of prediction accuracy, and selecting the genotype matrix corresponding to the number of genotypes at which the prediction accuracy of the genome prediction model was highest to construct the genome prediction model, thereby obtaining the highest genome prediction accuracy.
3. The genome prediction method of claim 2, wherein in step one, the genotype information is based on single nucleotide polymorphism typing; only genotypes having two alleles are retained, and they are encoded as 0, 1 and 2, representing the major homozygous genotype AA, the heterozygous genotype Aa and the minor homozygous genotype aa, respectively; and the genotype information used to construct the genome prediction model contains no missing values.
4. The genome prediction method according to claim 2, wherein in the second step, the genome prediction model is constructed by using all genotype data indirectly, and the constructed genome prediction models are linear models based on additive effect;
the linear model formula is expressed as y = μ + Zu + e;
wherein y is the phenotype value of an individual, μ is the intercept, Z is a genotype matrix of n rows and m columns, n is the number of samples used for constructing the genome model, m is the number of genotypes, u is an effect value matrix of m rows and 1 column whose ith row is the effect value of the genotype in the ith column of the Z matrix, and e is the residual;
the sum of the products of all genotypes and their effect values is the genome breeding value.
5. The method for genome prediction according to claim 2, wherein in step three, the estimation of the effect value of a genotype is performed in a genome prediction model constructed from data of all genotypes, and the genome prediction model generates an effect value for each genotype.
6. The method for genome prediction according to claim 2, wherein in step four, the genotypes are ranked from small to large according to the absolute value of the effect value of each genotype.
7. The genome prediction method of claim 2, wherein the cross-validation method of step five comprises: masking the phenotype information of some of the individuals and retaining only their genotype information, estimating the genomic breeding values of the masked individuals with the genome prediction model, and comparing them with the actual phenotypes by correlation;
wherein calculating the correlation between the genomic breeding values and the actual phenotypic values of the masked individuals, so as to evaluate the prediction accuracy of the model built from only the larger-effect genotypes after each discarding of smaller-effect genotypes, comprises:
evaluating that prediction accuracy by calculating the Pearson correlation coefficient, or the area under the ROC curve, between the genomic breeding values and the actual phenotypic values of the masked individuals;
wherein the Pearson correlation coefficient is calculated as:
$$r = \frac{\operatorname{cov}(y, \mathrm{GEBV})}{\sigma_{y}\,\sigma_{\mathrm{GEBV}}}$$
wherein y is the masked phenotypic value and GEBV is the genomic breeding value; the area under the ROC curve is calculated as:
$$U_{1} = R_{1} - \frac{n_{1}(n_{1}+1)}{2}$$
$$\mathrm{AUC} = \frac{U_{1}}{n_{1}\,n_{0}}$$
wherein $n_{1}$ is the number of surviving individuals, whose phenotypic value is recorded as 1, $n_{0}$ is the number of dead individuals, whose phenotypic value is recorded as 0, and $R_{1}$ is the sum of the ranks of the surviving individuals when all masked individuals are ranked by genomic breeding value.
8. The genome prediction method according to claim 2, wherein the genotype matrix used in the sixth step is obtained by successively discarding genotypes with smaller effect values and performing successive iterations;
the genome prediction model is constructed based on the optimal genotype matrix with the highest prediction accuracy determined by discarding the minor effect value markers.
9. A genome prediction system implementing the genome prediction method according to any one of claims 1 to 8, wherein the genome prediction system comprises an information data processing terminal including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
gradually discarding genotypes with relatively small effect values from the full genotype matrix, retaining only the optimal genotype matrix with large effect values to construct a genome prediction model, and predicting phenotypes of different species by using genome information.
10. Use of the method of genome prediction according to any one of claims 1 to 8 for the prediction of phenotypic and genotypic data for different species of sorghum, dairy cows, horses, rainbow trout and carp.
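
The two accuracy measures named in claim 7 can be illustrated with a short Python sketch. It assumes the standard definitions: the Pearson correlation between masked phenotypes and genomic breeding values, and the rank-sum form of the area under the ROC curve, AUC = (R1 - n1(n1+1)/2) / (n1·n0), where R1 is the rank sum of the individuals recorded as 1. Giving tied values their average rank is an assumption of this sketch, and the function names and data are hypothetical.

```python
from math import sqrt


def pearson_r(y, gebv):
    """Pearson correlation between masked phenotypes y and breeding values gebv."""
    n = len(y)
    my, mg = sum(y) / n, sum(gebv) / n
    cov = sum((a - my) * (b - mg) for a, b in zip(y, gebv))
    var_y = sum((a - my) ** 2 for a in y)
    var_g = sum((b - mg) ** 2 for b in gebv)
    return cov / sqrt(var_y * var_g)


def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_auc(status, gebv):
    """Area under the ROC curve from the rank sum of individuals with status 1."""
    ranks = average_ranks(gebv)
    n1 = sum(1 for s in status if s == 1)
    n0 = len(status) - n1
    r1 = sum(r for r, s in zip(ranks, status) if s == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)


# Hypothetical masked individuals: 1 = survived, 0 = died.
status = [1, 1, 0, 0, 1]
gebv = [0.9, 0.4, 0.5, 0.1, 0.8]
print(round(rank_auc(status, gebv), 4))   # → 0.8333
```

In the example, 5 of the 6 survivor-versus-dead pairs are ordered correctly by breeding value, which is why the result is 5/6.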
CN202111620265.8A 2021-12-28 2021-12-28 Genome prediction method, prediction system and application Pending CN114360651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111620265.8A CN114360651A (en) 2021-12-28 2021-12-28 Genome prediction method, prediction system and application

Publications (1)

Publication Number Publication Date
CN114360651A true CN114360651A (en) 2022-04-15

Family

ID=81104056

Country Status (1)

Country Link
CN (1) CN114360651A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743601A (en) * 2022-04-18 2022-07-12 中国农业科学院农业基因组研究所 Breeding method, device and equipment based on multigroup data and deep learning
CN114743601B (en) * 2022-04-18 2023-02-03 中国农业科学院农业基因组研究所 Breeding method, device and equipment based on multigroup data and deep learning
WO2024065070A1 (en) * 2022-09-26 2024-04-04 之江实验室 Graph clustering-based genetic coding breeding prediction method and apparatus
CN116072226A (en) * 2023-01-17 2023-05-05 中国农业大学 Machine learning method and system for selecting laying hen egg-laying character genome
CN116467596A (en) * 2023-04-11 2023-07-21 广州国家现代农业产业科技创新中心 Training method of rice grain length prediction model, morphology prediction method and apparatus
CN116467596B (en) * 2023-04-11 2024-03-26 广州国家现代农业产业科技创新中心 Training method of rice grain length prediction model, morphology prediction method and apparatus

Similar Documents

Publication Publication Date Title
CN114360651A (en) Genome prediction method, prediction system and application
Speed et al. Relatedness in the post-genomic era: is it still useful?
Goddard et al. Genomic selection in livestock populations
Taylor Implementation and accuracy of genomic selection
CN113519028B (en) Methods and compositions for estimating or predicting genotypes and phenotypes
Lee et al. Comparison of alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef cattle
Misztal et al. Emerging issues in genomic selection
US20020119451A1 (en) System and method for predicting chromosomal regions that control phenotypic traits
Bâlteanu et al. The footprint of recent and strong demographic decline in the genomes of Mangalitza pigs
Jiménez-Montero et al. Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle
Henshall et al. Quantitative analysis of low-density SNP data for parentage assignment and estimation of family contributions to pooled samples
CN115083518A (en) SNP double-channel coding method
Hassani et al. Accuracy of prediction of simulated polygenic phenotypes and their underlying quantitative trait loci genotypes using real or imputed whole-genome markers in cattle
Sottile et al. Penalized classification for optimal statistical selection of markers from high-throughput genotyping: application in sheep breeds
Whalen et al. Evolving SNP panels for genomic prediction
Atefi et al. Accuracy of genomic prediction under different genetic architectures and estimation methods
JP2022537443A (en) Systems, computer program products and methods for determining genomic ploidy
Carlborg New methods for mapping quantitative trait loci
CN111354417B (en) Novel method for estimating aquatic animal genome variety composition based on ADMIXTURE-MCP model
Bani Saadat et al. Comparing machine learning algorithms and linear model for detecting significant SNPs for genomic evaluation of growth traits in F2 chickens
CN113470744B (en) Pedigree inference method and device based on SNP locus data and electronic equipment
CN115995262B (en) Method for analyzing corn genetic mechanism based on random forest and LASSO regression
Cesarani et al. Strategies for choosing core animals in the algorithm for proven and young and their impact on the accuracy of single-step genomic predictions in cattle
Elias Genomic selection models for plant breeding schemes: the power of choice
Ramírez-Flores et al. Accuracy of genomic values predicted using deregressed predicted breeding values as response variables

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination