CN116863998B - Genetic algorithm-based whole genome prediction method and application thereof - Google Patents

Info

Publication number
CN116863998B
Authority
CN
China
Prior art keywords
molecular marker
model
genome
fitness
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310741264.1A
Other languages
Chinese (zh)
Other versions
CN116863998A (en)
Inventor
徐扬
张宇翔
周恺
于广宁
李成
杨文艳
王欣
徐辰武
杨泽峰
鲁月
陈茹佳
陶天云
李鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN202310741264.1A priority Critical patent/CN116863998B/en
Publication of CN116863998A publication Critical patent/CN116863998A/en
Application granted granted Critical
Publication of CN116863998B publication Critical patent/CN116863998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/10 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Public Health (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the field of bioinformatics and relates to a genetic algorithm-based whole genome prediction method and its application, in which a genomic best linear unbiased prediction (GBLUP) model is used to predict the breeding value of an individual. The method comprises the following steps: obtaining the molecular markers of the crop to be predicted; repeatedly and randomly selecting a certain proportion of the molecular markers to form marker subsets that initialize the genetic algorithm; constructing a genome prediction model and calculating the fitness of the different marker subsets; retaining the subsets with higher fitness; subjecting the retained subsets to mutation, pairing and crossover at a certain rate to generate new marker subsets; recalculating the fitness of the different subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached; and using the final marker subset to construct a GBLUP model for whole genome prediction. The method can improve the accuracy of whole genome selection of hybrids and can provide important technical support for precision hybrid breeding.

Description

Genetic algorithm-based whole genome prediction method and application thereof
Technical Field
The invention belongs to the field of biological information, and relates to a whole genome prediction method based on a genetic algorithm and application thereof.
Background
Breeding high-quality, high-yield, green and efficient crop varieties is a top priority in current crop genetics and breeding. Traditional crop breeding relies on phenotypic selection: breeders examine the phenotypes of crop lines in the field and in the laboratory and, drawing on their breeding experience, select lines with the trait of interest for further evaluation. However, many yield- and quality-related traits are quantitative traits controlled by a large number of minor-effect quantitative trait loci; they are susceptible to environmental influence, and selection by phenotype alone is unreliable. Advances in molecular biology made marker-assisted selection possible, but marker-assisted selection is only applicable to traits controlled by a few major quantitative trait loci and is ineffective for traits such as crop yield and quality. Whole genome selection builds statistical models from crop phenotypes and high-density molecular markers covering the whole genome in order to predict the performance of materials with known genotype but unknown phenotype. It incorporates the effects of all markers on the genome into the model regardless of their significance level and is therefore particularly suited to quantitative traits controlled by minor-effect polygenes, such as crop yield and quality. The genomic best linear unbiased prediction (Genomic Best Linear Unbiased Prediction, GBLUP) model is the most robust and general of the whole genome selection models. However, GBLUP assumes that all molecular markers contribute equally to the target trait, which is contrary to the findings of modern molecular genetics and limits further improvement in the prediction accuracy of the GBLUP method.
Disclosure of Invention
The invention aims to provide the application of GA-GBLUP, a genetic algorithm-based genomic best linear unbiased prediction algorithm, in predicting agronomic traits of hybrids. The GA-GBLUP algorithm can effectively improve the predictive power for agronomic traits such as the yield of rice and maize hybrids. The invention can therefore be used to improve the accuracy of whole genome selection of hybrids, is of great significance for exploiting heterosis in rice and maize, and can provide important technical support for precision hybrid breeding.
The aim of the invention is achieved by the following technical scheme:
A genetic algorithm-based whole genome prediction method uses a genetic algorithm to select optimal molecular markers and combines it with a genomic best linear unbiased prediction (GBLUP) model to predict individual breeding values, comprising the following steps:
obtaining the molecular markers of the crop to be predicted;
randomly selecting a certain proportion of the molecular markers to initialize the genetic algorithm, constructing a genome prediction model, calculating the fitness of the different molecular marker subsets, retaining the subsets with higher fitness, and subjecting the retained subsets to mutation, pairing and crossover at a certain rate to generate new molecular marker subsets;
and recalculating the fitness of the different molecular marker subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached, obtaining the final molecular marker subset, and constructing the GBLUP model.
Further, the method for constructing the genome predictive model comprises the following steps:
$y$ is an $n \times 1$ vector representing a quantitative trait, and the mixed linear model containing $m$ markers is expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k\gamma_k + \varepsilon$$
where $X$ is an $n \times q$ fixed-effect matrix; $\beta$ is a $q \times 1$ vector representing the magnitudes of the fixed effects; $Z_k$ is an $n \times 1$ vector representing the genotypes of all individuals at the $k$th marker; $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$; $m$ is the total number of markers, $n$ the number of samples and $q$ the number of fixed effects; and $\gamma_k$ is the magnitude of the $k$th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive power of the model.
Further, the step of random selection comprises: encoding all $m$ molecular markers in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, where $\delta_k = 0$ means that the $k$th marker is excluded and $\delta_k = 1$ means that it is retained; this random process is repeated 100 times to obtain 100 different $\delta$ vectors for initializing the GA algorithm.
Further, the fitness of the different molecular marker subsets is calculated by any one of the following methods:
Akaike information criterion:
AIC=2m-2ln(L)
where m is the number of estimated parameters and L is the likelihood value of the model; AIC denotes the fitness computed with the Akaike information criterion;
Bayesian information criterion:
BIC=mln(n)-2ln(L)
where m is the number of estimated parameters, L is the likelihood value of the model, and n is the sample size; BIC denotes the fitness computed with the Bayesian information criterion;
FIT function:
FIT=1-SSE/SST
where SST is the total sum of squares of the phenotype values and SSE is the residual sum of squares; FIT denotes the fitness computed with the FIT function;
HAT function:
HAT=1-PRESS/SST
where PRESS is the predicted residual sum of squares of the mixed linear model and SST is the total sum of squares of the phenotype values; HAT denotes the fitness computed with the HAT function.
Further, subjecting the retained molecular marker subsets to mutation, pairing and crossover at a certain rate to generate new molecular marker subsets comprises:
mutating each site of the retained molecular marker vectors from 1 to 0 or from 0 to 1 with a probability of 0.1; randomly selecting one pair of δ vectors at a time and performing crossover between the two paired vectors, so that information at several sites or over large regions of the two vectors is recombined; pairing and crossover together produce new molecular marker vectors.
Further, the predictive power of the model is evaluated by 10-fold cross-validation.
The invention also provides a method for predicting the agronomic traits of crop hybrids.
Further, the crop is rice or maize.
Further, the agronomic trait is a quantitative trait controlled by minor-effect polygenes.
Further, the agronomic traits include crop yield and quality traits.
Further, the agronomic traits include yield, tiller number per plant, single-spike weight, thousand-grain weight and plant height.
In the embodiment of the invention, the plant is specifically rice and corn which are gramineous plants. The algorithm is specifically a genetic algorithm and a genome optimal linear unbiased estimation algorithm.
The method comprises the following steps. First, 1% of all molecular markers supplied by the user are randomly selected, and this is repeated 100 times to obtain 100 different chromosomes for initializing the GA algorithm. A genome prediction model is then constructed from the markers selected by each chromosome, and the fitness function of each chromosome is calculated. The 5 chromosomes with the highest fitness are retained and the rest are eliminated. The 5 retained chromosomes then undergo mutation at a certain rate (a marker at a given position changes from selected to unselected, or from unselected to selected), pairing (chromosomes are paired two by two to generate new chromosomes) and crossover, finally generating 100 new chromosomes. Markers are again selected according to these chromosomes to construct the relationship matrix, and the above process is repeated until the maximum number of iterations or convergence is reached. A genomic best linear unbiased prediction (GBLUP) model is constructed from the marker subset finally selected by the algorithm, the phenotypes of the training set are predicted, and the predictive power of the model is evaluated; on this basis the phenotypes of all potential hybrids are predicted, and hybrids with better target traits are selected for field evaluation.
The algorithm provided by the invention is named as a whole genome prediction method based on a genetic algorithm.
Advantageous effects
By adopting a genetic algorithm-based whole genome prediction method, the invention can improve the accuracy of whole genome selection of hybrids. Compared with the conventional GBLUP method, the GA-GBLUP method can effectively improve the predictive power for agronomic traits of rice and maize hybrids such as yield, grain weight and plant height; it is of great significance for rice and maize hybrid breeding and provides an effective tool for improving the efficiency of crop variety breeding.
The genetic algorithm is combined with the conventional genomic best linear unbiased prediction method to form the GA-GBLUP algorithm. The algorithm can effectively improve the predictive power of whole genome selection for agronomic traits of rice and maize hybrids, improve the accuracy of whole genome selection of hybrids, and provide an accurate and reliable quantitative reference for breeding new crop varieties, thereby raising the level of breeding research and breeding efficiency.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows the performance of the invention on a rice hybrid dataset.
FIG. 3 shows the performance of the invention on a maize hybrid dataset.
Detailed Description
The following describes the technical scheme provided by the invention in detail by combining examples, but the invention is not limited to the following.
The genotype and phenotype data used in the following examples, for the rice IMF2 population (Hua, J.P., Xing, Y.Z., Wu, W.R., Xu, C.G., Sun, X.L., Yu, S.B., & Zhang, Q.F. (2003). Single-locus heterotic effects and dominance by dominance interactions can adequately explain the genetic basis of heterosis in an elite rice hybrid. Proceedings of the National Academy of Sciences of the United States of America, 100(5), 2574-2579) and the maize population of 305 hybrids (Wang, X., Zhang, Z., Xu, Y., Li, P., Zhang, X., & Xu, C. (2020). Using genomic data to improve the estimation of general combining ability based on sparse partial diallel cross designs in maize. The Crop Journal, 8(5), 819-829), are all publicly available.
Example 1
Implementation of GA-GBLUP algorithm
The GA-GBLUP algorithm uses a genomic best linear unbiased prediction (GBLUP) model to predict individual breeding values. $y$ is an $n \times 1$ vector representing a quantitative trait, and the mixed linear model containing $m$ markers can be expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k\gamma_k + \varepsilon$$
where $X$ is an $n \times q$ fixed-effect matrix; $\beta$ is a $q \times 1$ vector representing the magnitudes of the fixed effects; $Z_k$ is an $n \times 1$ vector representing the genotypes of all individuals at the $k$th marker; and $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$. $m$ is the total number of markers, $n$ the number of samples and $q$ the number of fixed effects, and $\gamma_k$ is the magnitude of the $k$th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive power of the model.
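For orientation, the same model can be written in matrix form. The sketch below assumes, as is conventional for GBLUP although the text does not state it explicitly, that all marker effects share a single variance; the symbol $\sigma_\gamma^2$ is introduced here for illustration and does not appear in the source.

```latex
\begin{aligned}
y &= X\beta + Z\gamma + \varepsilon, \qquad
Z = [\,Z_1\ Z_2\ \cdots\ Z_m\,], \quad
\gamma = (\gamma_1, \gamma_2, \dots, \gamma_m)^{\top},\\
\gamma &\sim N(0,\ I\sigma_\gamma^{2}), \qquad
\varepsilon \sim N(0,\ I\sigma^{2}),\\
\operatorname{Var}(y) &= ZZ^{\top}\sigma_\gamma^{2} + I\sigma^{2}.
\end{aligned}
```

Under this assumption the product $ZZ^{\top}$, up to scaling, plays the role of the relationship matrix among individuals, which is why changing the set of markers in $Z$ changes the model's predictions.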
The GA-GBLUP algorithm mainly comprises the following steps:
1) Chromosome representation
All $m$ molecular markers are encoded in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, where $\delta_k = 0$ means that the $k$th marker is excluded and $\delta_k = 1$ means that it is retained. This random process is repeated 100 times, yielding 100 different $\delta$ vectors for initializing the GA-GBLUP algorithm.
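The initialization step can be sketched in Python as follows. This is a minimal illustration only: the function name `init_population` and its parameters are hypothetical, and the 1% selection proportion is taken from the description of the method earlier in the document.

```python
import random

def init_population(n_markers, n_chromosomes=100, proportion=0.01, seed=None):
    """Return n_chromosomes 0/1 lists, each retaining about `proportion` of the markers."""
    rng = random.Random(seed)
    n_keep = max(1, round(n_markers * proportion))
    population = []
    for _ in range(n_chromosomes):
        # Draw the retained marker positions without replacement.
        kept = set(rng.sample(range(n_markers), n_keep))
        population.append([1 if k in kept else 0 for k in range(n_markers)])
    return population

# Example with the marker count of the rice dataset (1619 bin markers).
population = init_population(n_markers=1619, seed=1)
```

Each list in `population` corresponds to one δ vector; positions holding 1 are the markers that would enter the prediction model.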
2) Fitness calculation
For each δ vector above, the markers with $\delta_k = 1$ are retained from among all markers and used to construct a genome prediction model, whose fitness is then calculated with a fitness function. Any one of the following fitness functions may be used:
Akaike information criterion (AIC)
AIC=2m-2ln(L)
Wherein: m is the number of estimated parameters and L is the likelihood value of the model; AIC denotes the fitness computed with the Akaike information criterion.
Bayesian Information Criterion (BIC)
BIC=mln(n)-2ln(L)
Wherein: m is the number of parameters estimated, L is the likelihood value of the model, and n is the sample size; BIC represents a fitness calculation result calculated using bayesian information criteria.
FIT function
FIT=1-SSE/SST
Wherein: SST is the sum of squares of total variations of the phenotype values, SSE is the sum of squares of residuals; FIT represents the fitness calculation result calculated using the FIT function.
HAT function
HAT=1-PRESS/SST
Wherein: PRESS is the predicted residual sum of squares of the mixed linear model and SST is the total sum of squares of the phenotype values. HAT denotes the fitness computed with the HAT function.
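The four fitness functions above can be sketched as plain Python functions. Several points here are assumptions rather than statements from the source: the function names are hypothetical; the log-likelihood, fitted values and leverages are taken as given inputs from an already fitted model; and PRESS is computed with the leave-one-out identity for linear smoothers, which is one common way to obtain it but is not specified in the text. Note also that, by convention, lower AIC and BIC indicate a better model, so a GA that maximizes fitness would need to negate them; the source does not say how this is handled.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = 2m - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """BIC = m ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

def _sst(y):
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)

def fit_score(y, y_hat):
    """FIT = 1 - SSE/SST, using fitted values y_hat."""
    sse = sum((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1 - sse / _sst(y)

def hat_score(y, y_hat, leverage):
    """HAT = 1 - PRESS/SST.

    Assumption: PRESS = sum(((y_i - y_hat_i) / (1 - h_ii))**2), where h_ii
    are the diagonal elements of the hat matrix of the fitted model.
    """
    press = sum(((v - p) / (1 - h)) ** 2 for v, p, h in zip(y, y_hat, leverage))
    return 1 - press / _sst(y)

# Small worked example with made-up numbers.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 3.0, 3.5]
print(fit_score(y, y_hat))                         # SSE = 0.5, SST = 5
print(hat_score(y, y_hat, [0.5, 0.5, 0.5, 0.5]))   # PRESS = 2, SST = 5
```

FIT measures goodness of fit on the training data, whereas HAT penalizes overfitting because each residual is inflated by the leverage of its own observation.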
After the fitness calculation is complete, the δ vectors are sorted by fitness, the top 5% with the highest fitness are retained, and the remaining δ vectors are eliminated. The 10-fold cross-validation results show that the FIT and HAT functions are the more effective fitness functions.
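The ranking and elimination step amounts to a sort followed by truncation. A minimal sketch, assuming that higher fitness is better and using the hypothetical name `select_elite`:

```python
def select_elite(population, fitness_values, fraction=0.05):
    """Keep the top `fraction` of chromosomes, ranked by descending fitness."""
    n_keep = max(1, round(len(population) * fraction))
    ranked = sorted(range(len(population)),
                    key=lambda i: fitness_values[i], reverse=True)
    return [population[i] for i in ranked[:n_keep]]

# Toy example: 100 one-site "chromosomes" whose fitness equals their index.
toy_population = [[i] for i in range(100)]
toy_fitness = list(range(100))
elite = select_elite(toy_population, toy_fitness)
```

With 100 chromosomes and a fraction of 5%, five δ vectors survive, matching the count used in the genetic manipulation step.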
3) Genetic manipulation
For the 5 retained δ vectors, each site first mutates from 1 to 0 or from 0 to 1 with a probability of 0.1, for example:
δ(i) = [1 0 1 1 0 0 1 0 1 1] (before mutation)
δ(j) = [1 0 1 1 0 0 1 0 0 1] (after mutation)
Here the 9th site of the δ vector has mutated from 1 to 0, so the 9th marker is excluded from the model.
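A per-site bit-flip mutation of this kind can be sketched as follows. The function name `mutate` and the `rng` parameter are hypothetical; the rate of 0.1 is the value given in the text.

```python
import random

def mutate(delta, rate=0.1, rng=random):
    """Flip each site (1 -> 0 or 0 -> 1) independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in delta]

before = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
after = mutate(before, rate=0.1, rng=random.Random(0))
```

Because every site is tested independently, a vector of length m changes at about 0.1·m sites on average, so long marker vectors change at many positions in a single mutation pass.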
Next, a pair of δ vectors is randomly selected each time, and crossover is performed between the two paired vectors so that information at several sites or over large regions of the two vectors is recombined.
parent(i) = [1 0 1 0 0 0 1 0 1 1]
parent(j) = [1 0 1 1 0 0 1 0 0 1]
child(i) = [1 0 1 0 0 0 1 0 1 1]
child(j) = [1 0 1 1 0 0 1 0 0 1]
Each pairing and crossover produces new individuals, and the process is repeated 50 times until 100 different δ vectors have been produced.
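Pairing and crossover can be sketched as below. The text does not specify the crossover operator, so single-point crossover is used here purely as an assumption; the names `crossover` and `breed` are hypothetical. Each pairing yields two children, so 50 pairings give the 100 vectors mentioned above.

```python
import random

def crossover(parent_i, parent_j, rng=random):
    """Single-point crossover (an assumed operator): swap the tails of two parents."""
    point = rng.randrange(1, len(parent_i))
    child_i = parent_i[:point] + parent_j[point:]
    child_j = parent_j[:point] + parent_i[point:]
    return child_i, child_j

def breed(elite, n_pairings=50, rng=random):
    """Randomly pair retained vectors and collect two children per pairing."""
    offspring = []
    for _ in range(n_pairings):
        a, b = rng.sample(elite, 2)  # pick two distinct retained vectors
        offspring.extend(crossover(a, b, rng))
    return offspring

parent_i = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
parent_j = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
child_i, child_j = crossover(parent_i, parent_j, random.Random(3))
```

Crossover only redistributes existing bits between the two parents, so at every site the children jointly carry exactly the values the parents carried.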
The fitness of the different δ vectors is then calculated again, and the 5 individuals with the highest fitness are selected. These steps are repeated until the fitness of the model no longer increases or the preset number of iterations is reached. The molecular marker subset finally selected by the algorithm is substituted into the mixed linear model as the new molecular marker matrix, the model is solved by restricted maximum likelihood, and the fixed and random effects are estimated. On this basis, the genotypes of the test data are substituted into the mixed linear model to obtain the phenotype values of the test set, and the predictive power of the model is evaluated by 10-fold cross-validation. The phenotypes of all potential hybrids are then predicted, and hybrids with better target traits are selected for field evaluation.
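Putting the steps together, the overall iteration might look like the sketch below. It is an illustration under stated assumptions, not the patented implementation: `run_ga` and all of its parameters are hypothetical, `patience` (the number of non-improving generations tolerated before stopping) is an invented stand-in for "fitness no longer increases", single-point crossover is assumed, and the toy fitness function replaces the GBLUP-based fitness that the method actually uses.

```python
import random

def run_ga(n_markers, fitness, n_chrom=100, n_elite=5, proportion=0.01,
           mutation_rate=0.1, max_iter=50, patience=5, seed=0):
    """Marker-subset search: initialize, then select, mutate and cross over."""
    rng = random.Random(seed)
    n_keep = max(1, round(n_markers * proportion))
    population = []
    for _ in range(n_chrom):
        kept = set(rng.sample(range(n_markers), n_keep))
        population.append([int(k in kept) for k in range(n_markers)])

    best, best_score, stale = None, float("-inf"), 0
    for _ in range(max_iter):
        # Rank by fitness and keep the best n_elite chromosomes.
        elite = sorted(population, key=fitness, reverse=True)[:n_elite]
        top = fitness(elite[0])
        if top > best_score:
            best, best_score, stale = elite[0], top, 0
        else:
            stale += 1
            if stale >= patience:  # fitness has stopped increasing
                break
        # Mutate each site of the elite with probability mutation_rate.
        mutated = [[1 - b if rng.random() < mutation_rate else b for b in d]
                   for d in elite]
        # Pair and cross over until the population is refilled.
        population = []
        while len(population) < n_chrom:
            a, b = rng.sample(mutated, 2)
            point = rng.randrange(1, n_markers)
            population.append(a[:point] + b[point:])
            population.append(b[:point] + a[point:])
    return best, best_score

# Toy fitness (hypothetical): reward hitting ten "causal" sites, penalize size.
causal = set(range(0, 200, 20))

def toy_fitness(delta):
    return sum(delta[k] for k in causal) - 0.01 * sum(delta)

best, score = run_ga(200, toy_fitness, seed=0)
```

In a real application the `fitness` callable would fit the mixed linear model on the markers selected by each δ vector and return FIT, HAT or another criterion, which makes the fitness evaluation the dominant computational cost of each generation.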
Example 2
Use of GA-GBLUP algorithm on rice hybrid population
The 1619 bin markers of 278 hybrids of the rice IMF2 population are used as genotype data, and four traits (yield, tiller number per plant, single-spike weight and thousand-grain weight) are used as phenotype data. The 278 hybrids are randomly divided into 10 equal parts, with 9 parts used as the training set and 1 part as the test set. On the training set, GA-GBLUP models with different hyperparameters are used to select markers from all 1619 markers; marker selection is complete once the specified number of iterations is reached. The selected marker subset is then used to construct the genetic relationship matrix for the training and test sets, and the traits of the test set are predicted. This procedure is carried out in turn until every part has served as the test set once, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy; the whole procedure is repeated 15 times to offset the random variation introduced by the GA algorithm. The dashed line in FIG. 2 shows the predictive power of the GBLUP method, and the box plots show the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that combining GA-GBLUP with the FIT and HAT fitness functions effectively improves the predictive power of whole genome selection for rice hybrids. Compared with the conventional GBLUP algorithm, GA-GBLUP improves predictive power by up to 24.2%, 12.6%, 3.9% and 2.2% for yield, tiller number per plant, single-spike weight and thousand-grain weight, respectively, which is particularly significant for low-heritability traits such as yield.
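Constructing a relationship matrix from only the selected markers can be sketched as follows. The scaling is an assumption: marker columns are centered and the cross-products are divided by the number of selected markers, which is a simplified stand-in for whatever scaling the original implementation uses. The function name and the tiny genotype table are hypothetical.

```python
def relationship_matrix(genotypes, delta):
    """Relationship matrix from the markers with delta_k == 1.

    genotypes: list of rows, one per individual, one column per marker.
    Assumed scaling: centered cross-products divided by the number of
    selected markers.
    """
    selected = [k for k, bit in enumerate(delta) if bit == 1]
    if not selected:
        raise ValueError("delta selects no markers")
    n = len(genotypes)
    means = {k: sum(row[k] for row in genotypes) / n for k in selected}
    centered = [[row[k] - means[k] for k in selected] for row in genotypes]
    m_sel = len(selected)
    return [[sum(a * b for a, b in zip(ri, rj)) / m_sel for rj in centered]
            for ri in centered]

# Three individuals, three markers coded 0/1/2; the middle marker is excluded.
G = relationship_matrix([[0, 1, 2], [2, 1, 0], [1, 1, 1]], delta=[1, 0, 1])
```

Because the matrix is built over training and test individuals together, the test individuals borrow information from related training individuals even though their own phenotypes are never used.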
Example 3
Use of GA-GBLUP algorithm on corn hybrid population
The 11255 SNP markers of 305 maize hybrids are used as genotype data, and two traits (ear weight and plant height) are used as phenotype data. The 305 hybrids are randomly divided into 10 equal parts, with 9 parts used as the training set and 1 part as the test set. On the training set, GA-GBLUP models with different hyperparameters are used to select markers from all 11255 markers; marker selection is complete once the specified number of iterations is reached. The selected marker subset is then used to construct the genetic relationship matrix for the training and test sets, and the traits of the test set are predicted. This procedure is carried out in turn until every part has served as the test set once, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy; the whole procedure is repeated 15 times to offset the random variation introduced by the GA algorithm. The dashed line in FIG. 3 shows the predictive power of the GBLUP method, and the box plots show the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that combining GA-GBLUP with the FIT and HAT fitness functions effectively improves the predictive power of whole genome selection for maize hybrids. For the prediction of ear weight, GA-GBLUP increases predictive power by 11.2% compared with the GBLUP method.
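The evaluation protocol used in Examples 2 and 3 can be sketched with two small helpers. Both names are hypothetical, and the accuracy measure is taken here to be the squared Pearson correlation between predicted and observed values, which is one common reading of "coefficient of determination"; the source does not give the formula.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov * cov / (var_t * var_p)

# Fold assignment for the 305 maize hybrids; each fold serves once as test set.
folds = kfold_indices(305, k=10, seed=0)
```

Marker selection has to be run inside each training split, as the examples describe, because selecting markers on the full dataset would leak test-set information into the accuracy estimate.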

Claims (1)

1. The application of a genetic algorithm-based whole genome prediction method in predicting agronomic traits of crop hybrids, characterized in that the crop is rice or maize; the agronomic trait is yield, tiller number per plant, single-spike weight, thousand-grain weight or plant height;
the whole genome prediction method based on the genetic algorithm comprises the following steps:
selecting an optimal molecular marker by adopting a genetic algorithm, and predicting the breeding value of an individual by combining a genome optimal linear unbiased estimation model on the basis of the optimal molecular marker, wherein the method comprises the following steps of:
obtaining a molecular marker of crops to be predicted;
randomly selecting a certain proportion of the molecular markers to initialize the genetic algorithm, constructing a genome prediction model, calculating the fitness of the different molecular marker subsets, retaining the subsets with higher fitness, and subjecting the retained subsets to mutation, pairing and crossover at a certain rate to generate new molecular marker subsets;
recalculating the fitness of the different molecular marker subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached, obtaining the final molecular marker subset, and constructing the genomic best linear unbiased prediction (GBLUP) model; substituting the genotype of the crop to be predicted into the GBLUP model to obtain the phenotype value of the crop to be predicted;
the method for constructing the genome genetic relationship matrix prediction model comprises the following steps:
$y$ is an $n \times 1$ vector representing a quantitative trait, and the mixed linear model containing $m$ markers is expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k\gamma_k + \varepsilon$$
where $X$ is an $n \times q$ fixed-effect matrix; $\beta$ is a $q \times 1$ vector representing the magnitudes of the fixed effects; $Z_k$ is an $n \times 1$ vector representing the genotypes of all individuals at the $k$th marker; $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$; $m$ is the total number of markers, $n$ the number of samples and $q$ the number of fixed effects; and $\gamma_k$ is the magnitude of the $k$th marker effect; the mixed linear model is solved by restricted maximum likelihood (REML) to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive power of the model;
the step of random selection comprises: encoding all $m$ molecular markers in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, where $\delta_k = 0$ means that the $k$th marker is excluded and $\delta_k = 1$ means that it is retained; this random process is repeated 100 times to obtain 100 different $\delta$ vectors for initializing the GA algorithm;
the calculation method of the suitability of the different molecular marker subsets is any one of the following methods:
Akaike information criterion:
AIC=2m-2ln(L)
where m is the number of estimated parameters and L is the likelihood value of the model; AIC denotes the fitness computed with the Akaike information criterion;
bayesian information criterion:
BIC=mln(n)-2ln(L)
where m is the number of parameters being estimated, L is the likelihood value of the model, and n is the sample size; BIC represents a fitness calculation result calculated by adopting a Bayesian information criterion;
FIT function:
FIT=1-SSE/SST
where SST is the sum of the squares of the total variations of the phenotype values and SSE is the sum of the squares of the residuals; FIT represents a fitness calculation result obtained by FIT function calculation;
HAT function:
HAT=1-PRESS/SST
where PRESS is the predicted residual sum of squares of the mixed linear model and SST is the total sum of squares of the phenotype values; HAT denotes the fitness computed with the HAT function;
subjecting the retained molecular marker subsets to mutation, pairing and crossover at a certain rate to generate new molecular marker subsets comprises:
after the fitness calculation is complete, sorting the different δ vectors by fitness, retaining the top 5% with the highest fitness, and eliminating the remaining δ vectors; mutating each site of the retained molecular marker vectors from 1 to 0 or from 0 to 1 with a probability of 0.1; randomly selecting one pair of δ vectors at a time and performing crossover between the two paired vectors, so that information at several sites or over large regions of the two vectors is recombined; pairing and crossover together producing new molecular marker vectors; then calculating the fitness of the different δ vectors again and selecting the individuals with the highest fitness, and repeating these steps until the fitness of the model no longer increases or the preset number of iterations is reached; substituting the molecular marker subset finally selected by the algorithm into the mixed linear model as the new molecular marker matrix, solving the mixed linear model by restricted maximum likelihood, and estimating the fixed and random effects; on this basis, substituting the genotypes of the test data into the mixed linear model to obtain the phenotype values of the test set, and evaluating the predictive power of the model by 10-fold cross-validation; and then predicting the phenotypes of all potential hybrids and selecting hybrids with better target traits for field evaluation.
CN202310741264.1A 2023-06-21 2023-06-21 Genetic algorithm-based whole genome prediction method and application thereof Active CN116863998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310741264.1A CN116863998B (en) 2023-06-21 2023-06-21 Genetic algorithm-based whole genome prediction method and application thereof

Publications (2)

Publication Number Publication Date
CN116863998A CN116863998A (en) 2023-10-10
CN116863998B CN116863998B (en) 2024-04-05

Family

ID=88220750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310741264.1A Active CN116863998B (en) 2023-06-21 2023-06-21 Genetic algorithm-based whole genome prediction method and application thereof

Country Status (1)

Country Link
CN (1) CN116863998B (en)

Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005061731A1 (en) * 2003-12-24 2005-07-07 Nanyang Polytechnic Method and system for unbiased genome amplification using genetic algorithms to select primers for genomic dna amplification
WO2005078133A2 (en) * 2004-02-09 2005-08-25 Monsanto Technology Llc Marker assisted best linear unbiased predicted (ma-blup): software adaptions for practical applications for large breeding populations in farm animal species
WO2008025093A1 (en) * 2006-09-01 2008-03-06 Innovative Dairy Products Pty Ltd Whole genome based genetic evaluation and selection process
AU2007214360A1 (en) * 2006-09-01 2008-03-20 Innovative Dairy Products Pty Ltd Whole genome based genetic evaluation and selection process
WO2009035560A1 (en) * 2007-09-12 2009-03-19 Pfizer, Inc. Methods of using genetic markers and related epistatic interactions
WO2010020252A1 (en) * 2008-08-19 2010-02-25 Viking Genetics Fmba Methods for determining a breeding value based on a plurality of genetic markers
CN103026361A (en) * 2010-06-03 2013-04-03 先正达参股股份有限公司 Methods and compositions for predicting unobserved phenotypes (PUP)
WO2013107048A1 (en) * 2012-01-20 2013-07-25 深圳华大基因健康科技有限公司 Method and system for determining whether copy number variation exists in sample genome, and computer readable medium
CN106028798A (en) * 2013-12-31 2016-10-12 美国陶氏益农公司 Selection based on optimal haploid value to create elite lines
CN106255764A (en) * 2013-12-20 2016-12-21 比勒陀利亚大学 The Disease Resistance labelling of Semen Maydis
CN106779076A (en) * 2016-11-18 2017-05-31 栾图 Breeding variety system and its algorithm based on biological information
WO2017210102A1 (en) * 2016-06-01 2017-12-07 Institute For Systems Biology Methods and system for generating and comparing reduced genome data sets
CN109688805A (en) * 2016-07-11 2019-04-26 先锋国际良种公司 The method for generating gray leaf spot resistance maize
CN109997192A (en) * 2016-06-15 2019-07-09 哈佛学院董事及会员团体 Method for rule-based genome design
CN110273007A (en) * 2019-06-27 2019-09-24 广西扬翔农牧有限责任公司 SNP marker relevant to the effective sperm count of boar and its preparation method and application
CN110476214A (en) * 2017-03-30 2019-11-19 孟山都技术有限公司 System and method for identifying the Aggregate effect of the genome editor of multiple genome editors and prediction identification
CA3105404A1 (en) * 2018-07-03 2020-01-09 New West Genetics Inc. Cannabis variety which produces greater than 50% female plants
CN111640508A (en) * 2020-05-28 2020-09-08 上海生物信息技术研究中心 Method for constructing pan-tumor targeted drug susceptibility state evaluation model based on high-throughput sequencing data and clinical phenotype and application
CN111863137A (en) * 2020-05-28 2020-10-30 上海朴岱生物科技合伙企业(有限合伙) Complex disease state evaluation method established based on high-throughput sequencing data and clinical phenotype and application
CN112204156A (en) * 2018-05-25 2021-01-08 先锋国际良种公司 Systems and methods for improving breeding by modulating recombination rates
CN112601826A (en) * 2018-02-27 2021-04-02 康奈尔大学 Ultrasensitive detection of circulating tumor DNA by whole genome integration
CN112802548A (en) * 2021-01-07 2021-05-14 深圳吉因加医学检验实验室 Method for predicting allele-specific copy number variation of single-sample whole genome
CN112980962A (en) * 2019-12-12 2021-06-18 深圳华大生命科学研究院 SNP marker related to birth weight trait of pig and application thereof
CN113223606A (en) * 2021-05-13 2021-08-06 浙江大学 Genome selection method for genetic improvement of complex traits
CN113234848A (en) * 2021-05-26 2021-08-10 北京林业大学 Molecular marker related to poplar stomatal morphology and photosynthetic efficiency and application thereof
WO2021202910A1 (en) * 2020-04-02 2021-10-07 Embark Veterinary, Inc. Methods and systems for determining pigmentation phenotypes
CN114317779A (en) * 2022-01-19 2022-04-12 华中农业大学 SNP molecular marker related to pig carcass traits and application
CN114863991A (en) * 2022-06-21 2022-08-05 沈阳农业大学 Method for improving whole genome prediction precision based on two-step prediction model establishment
CN116210571A (en) * 2023-03-06 2023-06-06 广州市林业和园林科学研究院 Three-dimensional greening remote sensing intelligent irrigation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874420B2 (en) * 2010-11-30 2014-10-28 Syngenta Participations Ag Methods for increasing genetic gain in a breeding population
US20140283152A1 (en) * 2013-03-14 2014-09-18 University Of Florida Research Foundation, Inc. Method for artificial selection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
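Incorporating genomic annotation into single-step genomic prediction with imputed whole-genome sequence data; TENG Jin-yan et al.; Journal of Integrative Agriculture; Vol. 21 (No. 4); 1126-1136 *
Comparison of the genomic prediction accuracy of the GBLUP and BayesB methods for broiler slaughter traits; ZHU Mo et al.; Scientia Agricultura Sinica (《中国农业科学》); Vol. 54 (No. 23); 5125-5131 *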

Also Published As

Publication number Publication date
CN116863998A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
Lind et al. The genomics of local adaptation in trees: are we out of the woods yet?
Hartfield et al. The evolutionary interplay between adaptation and self-fertilization
Yin et al. Genetic dissection on rice grain shape by the two-dimensional image analysis in one japonica× indica population consisting of recombinant inbred lines
CN107278877B (en) A kind of full-length genome selection and use method of corn seed-producing rate
Pace et al. Genomic prediction of seedling root length in maize (Zea mays L.)
Liu et al. The impact of genetic relationship and linkage disequilibrium on genomic selection
Gonzalo et al. Direct mapping of density response in a population of B73× Mo17 recombinant inbred lines of maize (Zea mays L.)
Gosseau et al. Heliaphen, an outdoor high-throughput phenotyping platform for genetic studies and crop modeling
Mir et al. Allelic diversity, structural analysis, and Genome-Wide Association Study (GWAS) for yield and related traits using unexplored common bean (Phaseolus vulgaris L.) germplasm from Western Himalayas
Geuten et al. Conflicting phylogenies of balsaminoid families and the polytomy in Ericales: combining data in a Bayesian framework
CN113223606B (en) Genome selection method for genetic improvement of complex traits
CN114292928B (en) Molecular marker related to sow breeding traits and screening method and application
Pégard et al. Favorable conditions for genomic evaluation to outperform classical pedigree evaluation highlighted by a proof-of-concept study in poplar
McGaugh et al. The utility of genomic prediction models in evolutionary genetics
Hodgins et al. Asymmetrical mating patterns and the evolution of biased style-morph ratios in a tristylous daffodil
CN108197435B (en) Marker locus genotype error-containing multi-character multi-interval positioning method
CN113053459A (en) Hybrid prediction method for integrating parental phenotypes based on Bayesian model
Fu et al. A statistical model for mapping morphological shape
CN116863998B (en) Genetic algorithm-based whole genome prediction method and application thereof
You et al. Genomic cross prediction for linseed improvement
Barjasteh et al. Comparing different marker densities and various reference populations using pedigree-marker best linear unbiased prediction (BLUP) model
Sehgal et al. Genomic selection in wheat: Progress, opportunities and challenges
Harms Genomic Selection for Yield and Seed Composition Stability in an Applied Soybean Breeding Program
Alekya et al. Chapter-7 whole genome strategies for marker assisted selection in plant breeding
Ganesamurthy et al. Analysis of the Efficiency of Genomic Selection Models for Predicting Sheath Blight Resistance in Rice (Oryza sativa L.,)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant