WO2024056056A1 - Rice grain cadmium accumulation trait prediction device and early warning system based on whole-genome selection research - Google Patents


Info

Publication number
WO2024056056A1
Authority
WO
WIPO (PCT)
Prior art keywords
rice
genome
cadmium
population
cadmium content
Prior art date
Application number
PCT/CN2023/119026
Other languages
English (en)
French (fr)
Inventor
何振艳
闫慧莉
骆永明
虞轶俊
许文秀
Original Assignee
中国科学院植物研究所
中国科学院南京土壤研究所
浙江省耕地质量与肥料管理总站
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院植物研究所, 中国科学院南京土壤研究所, 浙江省耕地质量与肥料管理总站
Publication of WO2024056056A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/13 Plant traits
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • The invention belongs to the field of biotechnology and specifically relates to a device for predicting the rice grain cadmium accumulation trait, and an early warning system, based on whole-genome selection research.
  • Rice (Oryza sativa L.) is one of the main staple food crops, and more than 60% of the population depends on rice as their staple food. Compared with other cereal crops, rice readily absorbs cadmium from the soil during growth.
  • Cadmium (Cd) is a non-essential element for the human body. It is a silver-white metal with a density of 8.65 g/cm³ and is a toxic heavy metal.
  • IARC: International Agency for Research on Cancer
  • Cadmium can enter the human body through the food chain and accumulate there. When the amount of cadmium accumulated in the human body reaches 2.6 g, it has toxic effects.
  • The biological half-life of cadmium in the human body is 15 to 45 years (Nordberg and Gunnar, 2015).
  • Cadmium pollution in farmland soil comes from two main types of source: natural and anthropogenic.
  • Natural sources include various geological activities, such as volcanic eruptions.
  • The deposited cadmium changes the environmental background value of cadmium in the soil.
  • Anthropogenic sources include ore mining, waste discharge, sewage irrigation and other human activities, among which mining and metallurgical emissions are the main source.
  • The impacts of electronic waste dismantling, sewage irrigation and road traffic accounted for 58.8%, 44.8% and 57.1%, respectively.
  • The impact of these human activities on cadmium accumulation in farmland soil cannot be ignored (Cui Xiangfen et al., 2021).
  • (Cadmium toxicity in plants: Impacts and remediation strategies. Ecotoxicol Environ Saf. 2021 Mar 15;211:111887).
  • Physiological damage includes reduced photosynthetic efficiency, reduced water content, and inhibited absorption of essential elements.
  • Cadmium inhibits carbon fixation and chlorophyll synthesis in plants, thereby affecting plant photosynthesis.
  • The accumulation of cadmium in plants induces excessive production of reactive oxygen species, causing physiological damage to plant organelles.
  • Cadmium also interferes with the absorption of essential plant elements such as Ca, P, Mg, Fe, and Zn, leading to leaf chlorosis, impaired root growth, and ultimately plant death.
  • Because cadmium persists in the human body for decades, its long-term accumulation can be toxic to the respiratory, circulatory, urinary, nervous, and skeletal systems, causing symptoms such as osteoporosis, renal failure, kidney stones, and emphysema.
  • Breeding rice varieties with low cadmium accumulation is the most economical and feasible way to address cadmium pollution in rice. This approach has developed from conventional breeding, centered on phenotype, to molecular marker-assisted breeding, centered on molecular markers associated with target traits.
  • Conventional breeding is a widely studied method that selects low-accumulation rice varieties by their grain cadmium accumulation phenotype after planting different varieties in the same soil environment.
  • However, conventional breeding is time-consuming and geographically restricted, and the grain cadmium accumulation phenotype is unstable because it is easily affected by environmental factors.
  • Molecular marker-assisted breeding uses DNA molecular markers or functional markers closely linked to cadmium accumulation traits to select for those traits indirectly, and is then combined with conventional breeding methods to develop new varieties.
  • Molecular marker-assisted breeding is efficient and accurate and gives stable results, and it is currently one of the main methods for breeding low-cadmium-accumulation rice varieties. However, because the rice grain cadmium accumulation trait is a quantitative trait controlled by multiple genes and easily affected by environmental factors, existing conventional breeding and low-density molecular markers related to grain cadmium accumulation fall far short of the practical needs of breeding low-cadmium-accumulation rice varieties. New technologies suited to the rapid breeding of stable, low-cadmium-accumulation varieties are urgently needed.
  • Whole-genome selection (genomic selection, GS) is the most promising breeding method for accelerating the development of new varieties and has broad application prospects.
  • Whole-genome selection is a form of molecular marker-assisted selection that uses high-density molecular markers covering the entire genome to predict the genomic estimated breeding values (GEBV) of individuals.
  • GEBV: genomic estimated breeding value
  • Whole-genome selection evaluates the effects of all markers simultaneously and is therefore more accurate in predicting complex traits.
  • Whole-genome selection requires a training population (TRN), which is phenotyped for the target traits and genotyped using molecular markers covering the entire genome.
  • TRN: training population
  • The training population is used to build a statistical model relating molecular markers to the corresponding phenotypes; by fitting all markers together, the model estimates the effect of each marker on the target trait.
  • The constructed model is then used to predict the genomic estimated breeding values of individuals in the test population (TST), for which genotypes are available.
  • TST: test population
  • The computational methods of whole-genome selection are mainly algorithms for estimating breeding values from genomic data.
  • BLUP: best linear unbiased prediction
  • MCMC: Markov chain Monte Carlo
  • The BLUP method is based on the mixed linear model. It assumes that all SNPs contribute uniformly to the phenotype, accounts for both random effects and the fixed effects of genetic grouping, and calculates the genomic estimated breeding value of each individual from the phenotype and the pedigree relationship matrix A.
  • BLUP methods include GBLUP, based on the genome-wide kinship matrix (G matrix), and rrBLUP, based on allele effects. Both run quickly and are suitable for modeling and prediction when the population is large.
  • The Bayesian methods are based on the linkage between SNPs and QTL. They are nonlinear models and mainly include Bayes A, Bayes B, Bayes C and Bayesian Lasso. The methods differ in their choice of prior distribution, that is, in how the effect of each SNP is modeled. Bayes A assumes that each SNP has its own effect variance; Bayes B assumes that only a few markers have an effect; Bayes C assumes that the SNPs with an effect share the same variance; and Bayesian Lasso changes the distribution of the effect variance, assuming that marker effects follow a double exponential distribution. Bayesian models generally have long computation times, and their prediction accuracies are close to one another.
  • Machine learning methods use computer algorithms to learn from large amounts of data in order to predict target traits; they mainly include support vector machine (SVM), random forest (RF), and LightGBM (Light Gradient Boosting Machine).
  • SVM: support vector machine
  • RF: Random Forest
  • LightGBM: Light Gradient Boosting Machine
  • G matrix: genome-wide kinship matrix
  • rrBLUP: ridge regression best linear unbiased prediction
  • Prediction accuracy refers to the correlation coefficient between the actual breeding value and the estimated breeding value. The closer the coefficient is to 1, the higher the prediction accuracy.
  • Factors that affect the prediction accuracy of whole-genome selection mainly include the heritability of the target trait, the selected algorithm, the density and source of molecular markers, the size of the training population, and the genetic relationship between the training population and the test population.
  • Heritability is the proportion of phenotypic variance that is due to genetic variance. The greater the heritability, the more the trait is controlled by genes and the less it is affected by environmental factors, and the higher the prediction accuracy of whole-genome selection. For traits with low heritability, prediction accuracy can be improved by increasing the number of generations in which the phenotype is recorded.
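  • As a worked illustration of the heritability definition above, the ratio can be written in standard quantitative-genetics notation. The symbols, the assumption that phenotypic variance is simply the sum of genetic and environmental variance, and the numbers are all illustrative and are not taken from the source:

```latex
H^2 = \frac{\sigma_G^2}{\sigma_P^2} = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2},
\qquad
\text{e.g. } \sigma_G^2 = 0.03,\ \sigma_E^2 = 0.01
\ \Rightarrow\ H^2 = \frac{0.03}{0.04} = 0.75 .
```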
  • Molecular marker density and source refer to the number and distribution of the molecular markers covering the genome of the training population, and to their degree of association with the target traits.
  • Prediction accuracy increases with molecular marker density, but once the number of markers reaches a certain level, accuracy peaks and then declines.
  • The size of the training population is one of the important factors affecting prediction accuracy. Usually, prediction accuracy improves as the training population grows.
  • The ratio of the training population to the test population also affects prediction accuracy. Studies have shown that increasing this ratio helps improve the prediction accuracy of whole-genome selection.
  • Prediction accuracy also increases with the closeness of the genetic relationship between the training population and the test population: the smaller the genetic distance between the two, the higher the prediction accuracy.
  • The target traits included yield, plant height and flowering time, and the prediction abilities were 0.31, 0.34 and 0.63, respectively (Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, Atlin G, Jannink JL, McCouch SR. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015 Feb 17;11(2):e1004982). Júnior et al. used 9 prediction models to predict yield, plant height, days to flowering, heading rate, brown spot severity, whole grain yield, aspect ratio, and grain whiteness.
  • The models used include BayesA, GBLUP, RKHS, BayesC, MLR, etc., and the prediction accuracy ranges from 0.15 to 0.725 (Ahmadi N, Ramanantsoanirina A, Santos JD, Frouin J, Radanielina T. Evolutionary Processes Involved in the Emergence and Expansion of an Atypical O. sativa Group in Madagascar. Rice (N Y). 2021 May 20;14(1):44; Frouin J, Labeyrie A, Boisnard A, Sacchi GA, Ahmadi N. Genomic prediction offers the most effective marker assisted breeding approach for ability to prevent arsenic accumulation in rice grains. PLoS One. 2019 Jun 13;14(6); Huang Y, Chen H, Reinfelder JR, Liang X, Sun C, Liu C, Li F, Yi J. A transcriptomic (RNA-seq) analysis of genes responsive to both cadmium and arsenic stress
  • Hybrid breeding is the main means of increasing rice yield by exploiting heterosis. Research shows that hybrid rice yields 20% more than inbred varieties. Whole-genome selection can efficiently pick the desired hybrid combination from many potential combinations, and GS can predict the breeding value of all combinations of genotyped parents, thereby reducing the time and cost of field evaluation.
  • Commonly used rice populations include NCII, RIL and some populations associated with target traits. Multiple traits of hybrid progeny are predicted, including yield per plant, thousand-grain weight, effective panicle number, plant height, the number of primary branches, the number of secondary branches, the number of filled grains on the main panicle, and panicle length. The prediction ability for different traits ranges from low to high.
  • The models used include GBLUP, MV-ADV, Lasso, SVM, etc. The trait with the highest prediction ability is thousand-grain weight (0.7-0.8), whereas the prediction ability for yield per plant and panicle length is below 0.5.
  • The technical problem to be solved by the present invention is how to use whole-genome selection to predict cadmium content in rice grains, and/or how to establish a whole-genome selection model for the cadmium accumulation trait in rice grains, and/or how to provide early warning of cadmium accumulation risk in rice, and/or how to cultivate low-cadmium rice.
  • The present invention first provides a device for predicting cadmium content in rice grains.
  • The device may include the following modules:
  • Phenotypic data set acquisition module: used to obtain the phenotypic data set of grain cadmium content for the rice in the model construction population;
  • Genotype data set acquisition module: used to obtain SNP molecular markers associated with rice grain cadmium content through genome-wide association analysis, yielding a genotype data set;
  • Whole-genome selection model building module: used to construct, through a whole-genome selection algorithm, a whole-genome selection model for predicting cadmium content in rice grains from the phenotypic data set and the genotype data set;
  • Genomic estimated breeding value calculation module: used to calculate the genomic estimated breeding value of the rice to be tested from the whole-genome selection model and the SNP genotyping, and to predict the grain cadmium content of the rice to be tested from that genomic estimated breeding value.
  • The whole-genome selection algorithm may be rrBLUP or gBLUP.
  • The model construction population may consist of a training population and a test population, both composed of rice materials. The ratio of the number of rice materials in the training population to that in the test population may be 1:1.
  • The SNP molecular markers are evenly distributed across the 12 chromosomes of rice. The density of the SNP molecular markers may be 60K per rice genome.
  • The number of rice materials in the model construction population may be 500.
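  • The module chain above (phenotypes and SNP genotypes in, model fitted, genomic estimated breeding values out) can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses ridge regression with a fixed penalty `lam` as a stand-in for rrBLUP (which normally estimates the shrinkage from variance components), and the function names, the -1/0/1 genotype coding, and the simulated data are all assumptions.

```python
import numpy as np

def fit_marker_effects(X, y, lam=1.0):
    """Ridge-regression estimate of marker effects (rrBLUP-like, fixed penalty).

    X: (n_lines, n_snps) genotype matrix, assumed coded -1/0/1.
    y: grain cadmium phenotypes of the training lines.
    Returns (mu, g): the phenotypic mean and the marker effect vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    # Dual form g = X' (XX' + lam*I)^-1 (y - mu): cheap when n_lines << n_snps.
    K = X @ X.T + lam * np.eye(X.shape[0])
    g = X.T @ np.linalg.solve(K, y - mu)
    return mu, g

def predict_gebv(X_new, mu, g):
    """Predicted phenotype (mean plus summed marker effects) for new genotypes."""
    return mu + np.asarray(X_new, dtype=float) @ g

# Simulated example: 200 lines, 1000 SNPs, split 1:1 into training and test sets.
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(200, 1000)).astype(float)
true_g = np.zeros(1000)
true_g[:20] = rng.normal(0.0, 0.05, 20)          # a few causal markers (made up)
y = 0.3 + X @ true_g + rng.normal(0.0, 0.05, 200)

train, test = np.arange(0, 100), np.arange(100, 200)
mu, g = fit_marker_effects(X[train], y[train], lam=50.0)
predicted = predict_gebv(X[test], mu, g)
accuracy = np.corrcoef(predicted, y[test])[0, 1]  # correlation-based accuracy
print(f"accuracy on held-out lines: {accuracy:.2f}")
```

  • The dual form is used because a 60K-marker panel has far more SNPs than lines, so solving an n-by-n system is much cheaper than a p-by-p one; both forms give the same marker effects.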
  • The present invention also provides a rice cadmium accumulation risk early warning device, which may include the following modules:
  • Phenotypic data set acquisition module: used to obtain the phenotypic data set of grain cadmium content for the rice in the model construction population;
  • Genotype data set acquisition module: used to obtain SNP molecular markers associated with rice grain cadmium content through genome-wide association analysis, yielding a genotype data set;
  • Whole-genome selection model building module: used to construct, through a whole-genome selection algorithm, a whole-genome selection model for predicting cadmium content in rice grains from the phenotypic data set and the genotype data set;
  • SNP genotyping acquisition module for the rice to be tested: used to assay the SNP molecular markers in the rice to be tested to obtain its SNP genotyping;
  • Genomic estimated breeding value calculation module: used to calculate the genomic estimated breeding value of the rice to be tested from the whole-genome selection model and the SNP genotyping, and to predict the grain cadmium content of the rice to be tested from that genomic estimated breeding value;
  • Cadmium content risk early warning module: used to output the names of the rice materials to be tested whose predicted cadmium content, as obtained by the genomic estimated breeding value calculation module (B5), is higher than the cadmium content risk value.
  • The whole-genome selection algorithm may be rrBLUP or gBLUP.
  • The model construction population may consist of a training population and a test population.
  • The training population and the test population are both composed of rice materials; the ratio of the number of rice materials in the training population to that in the test population may be 1:1.
  • The SNP molecular markers are evenly distributed across the 12 chromosomes of rice.
  • The density of the SNP molecular markers may be 60K per rice genome.
  • The number of rice materials in the model construction population may be 500.
  • The cadmium content risk value may be 0.2 mg/kg.
  • The output may be visual output.
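  • The risk early warning step described above reduces to comparing each predicted grain cadmium content with the risk value. A minimal sketch, assuming predictions are held in a name-to-value mapping; the function name and the variety names are hypothetical, and only the 0.2 mg/kg risk value comes from the text:

```python
CD_RISK_VALUE_MG_PER_KG = 0.2  # risk value given in the text

def flag_risky_varieties(predicted_cd, risk_value=CD_RISK_VALUE_MG_PER_KG):
    """Return the names of varieties whose predicted grain Cd content
    (mg/kg) is strictly above the risk value, highest prediction first."""
    risky = {name: cd for name, cd in predicted_cd.items() if cd > risk_value}
    return sorted(risky, key=risky.get, reverse=True)

# Hypothetical predictions for three varieties.
predicted = {"variety_A": 0.12, "variety_B": 0.35, "variety_C": 0.21}
print(flag_risky_varieties(predicted))  # ['variety_B', 'variety_C']
```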
  • The present invention also provides a system for early warning of the risk of cadmium accumulation in rice.
  • The system may be configured as described above.
  • The system may also include instruments, reagents and/or kits for determining rice SNP typing.
  • The system may also include instruments, reagents and/or kits for measuring cadmium content in rice grains.
  • The present invention also provides a computer-readable storage medium that enables a computer to run the following steps:
  • C2: Obtain SNP molecular markers associated with cadmium content in rice grains through genome-wide association analysis, yielding a genotype data set;
  • C3: Using a whole-genome selection algorithm, construct a whole-genome selection model for predicting cadmium content in rice grains from the phenotypic data set and the genotype data set;
  • C5: Use the whole-genome selection model and the SNP genotyping to calculate the genomic estimated breeding value of the rice to be tested, and predict the grain cadmium content of the rice to be tested from that genomic estimated breeding value;
  • The output may be visual output.
  • The whole-genome selection algorithm may be rrBLUP or gBLUP.
  • The model construction population may consist of a training population and a test population, both composed of rice materials.
  • The ratio of the number of rice materials in the training population to that in the test population may be 1:1.
  • The SNP molecular markers are evenly distributed across the 12 chromosomes of rice.
  • The density of the SNP molecular markers may be 60K per rice genome.
  • The number of rice materials in the model construction population may be 500.
  • Figure 1 shows the phenotypic and genotypic data sets of the whole-genome selection model for the cadmium accumulation trait in rice.
  • A: Geographical origin (top) and genetic relationship (bottom) of the 500 rice germplasms. The ordinate is the number of materials from different geographical origins (top) and different subpopulations (bottom), and the abscissa is the different subpopulations. The letters range in color from light to dark gray and in size from small to large, representing the number of rice varieties.
  • B: Frequency distribution of the OsGCd values of the 500 rice materials. The ordinate is the number of materials and the abscissa is the cadmium content of rice grains.
  • Figure 2 shows the maximum accuracy and time consumed by the 12 modeling algorithms.
  • A: Comparison of the time taken to build a whole-genome selection model with the 12 statistical methods;
  • B: Comparison of the average accuracy of the 12 statistical methods using SNPs from strategy I;
  • C: Comparison of the average accuracy of the 12 statistical methods using SNPs from strategy II;
  • D: Comparison of the average accuracy of the 12 statistical methods using SNPs from strategy III. Model parameters: the ratio of training population to test population was 1:1, the SNP density was 60k, and the population size was 500 (rrBLUP, gBLUP, RF, LightGBM, ANN and SVM) or 219 (Bayes A, Bayes B, Bayes C, Bayes Lasso, Bayes BRR and Bayes RKHS).
  • Figure 3 shows the optimal population size and the optimal ratio of training population to test population for the cadmium accumulation trait in rice grains.
  • A: With the ratio of training population to test population and the SNP density held constant, the average accuracy for 11 population sizes is compared using rrBLUP and gBLUP as statistical methods. The ordinate is the model accuracy and the abscissa is the population size;
  • B: Using rrBLUP and gBLUP as statistical methods, the average accuracy is compared for 9 ratios of training population to test population. The ordinate is the model accuracy and the abscissa is the ratio of training population to test population.
  • Figure 4 shows the optimal SNP marker density for the cadmium accumulation trait in rice grains.
  • A: With the ratio of training population to test population and the population size held constant, the average accuracy for 9 SNP marker densities is compared using rrBLUP and gBLUP as statistical methods. The ordinate is the model accuracy and the abscissa is the number of SNP markers;
  • B: Change in the negative logarithm of the P value, -log10(P), as the SNP marker density increases under the three strategies. The ordinate is -log10(P) and the abscissa is the SNP marker density.
  • Figure 5 shows the application of the “intelligent cadmium early warning system” in early warning of cadmium accumulation risk in rice.
  • A: Basic workflow of the intelligent cadmium early warning system;
  • B: Comparison of the risk varieties predicted by the whole-genome selection model with those found in field trials in Fuyang;
  • C: Comparison of the risk varieties predicted by the whole-genome selection model with those found in field trials in Wenling.
  • The light gray part represents the value measured in the field trial (measured OsGCd), and the dark part represents the predicted value (predicted OsGCd); the ordinate of the upper part is the exceedance rate, and the ordinate of the lower part is the measured cadmium content;
  • D: Correlation coefficient between the OsGCd values measured in field trials of 44 rice varieties in Fuyang and the predicted values;
  • E: Correlation coefficient between the OsGCd values measured in field trials of 44 rice varieties in Wenling and the predicted values. The ordinate is the predicted value and the abscissa is the measured value. MAE, mean absolute error.
  • The experimental methods in the following examples are all conventional methods unless otherwise specified.
  • The materials, reagents, instruments, etc. used in the following examples can all be obtained from commercial sources unless otherwise specified.
  • The quantitative experiments in the following examples were repeated three times, and the results were averaged.
  • The first position of each nucleotide sequence in the sequence list is the 5' terminal nucleotide of the corresponding DNA, and the last position is the 3' terminal nucleotide of the corresponding DNA.
  • Example 1 Method for genome-wide selection study of cadmium accumulation traits in rice grains
  • Land preparation and ridge digging: Turn over the soil of the seedling field as a whole so that the soil is uniform throughout. Then dig ridges, each 70 cm wide, with the length depending on the length of the seedling field. After digging the ridges, water the field and spray pesticides and herbicides. The field must be dried for 1-2 days before sowing.
  • Seed soaking and germination: Soak the seeds in warm water for three days, changing the water twice a day to ensure there is no odor. Once the seeds turn white, start germination. Keep the temperature relatively high during germination, which is best carried out for one and a half to two days. The seeds can be sown once the buds are 5 mm long.
  • Sowing and raising seedlings: Divide each ridge into two halves and mark out grids for sowing, each grid 25-30 cm. Sow the germinated seeds in the middle of each grid and let them grow into seedlings.
  • Transplanting: Plant the rice micro-core germplasm materials in the following environment: average soil cadmium content of 1.12 mg/kg, average available cadmium content of 0.91 mg/kg, and pH 6.04. Each germplasm material is planted in two rows with a row spacing of 25 cm, with 8 plants per row at a plant spacing of 20 cm. To ensure the accuracy of the data, a control material (CK) was also planted for later data correction. The CK variety was the local conventional japonica rice variety Jiahexiang No. 1. Three plants were planted in each row, with the same row and plant spacing as the germplasm materials. Protective rows were set up around the planted materials.
  • The CK material was collected as mixed samples: one CK sample for every 20 materials (10 varieties on each side of the CK).
  • The collected rice grains, together with their identification labels, are placed in mesh bags and dried in the sun to avoid mold.
  • The harvested rice grain samples were dried in the sun or in an oven at 60°C for 3 days. Once the mass was constant, the grains were hulled with a rice huller. The resulting brown rice samples were placed in 5 mL centrifuge tubes and ground using a high-throughput silent tissue grinder for subsequent determination of cadmium content.
  • Samples were digested by the single-acid digestion method in a far-infrared temperature-controlled digestion furnace, using glass digestion tubes as containers.
  • Cadmium content in rice grains was measured using inductively coupled plasma mass spectrometry (ICP-MS).
  • Strategy I ranks all SNPs by P value, regardless of which chromosome they are located on. The first 60, 120, 600, 1200, 6k, 12k, 60k, 120k, and 600k SNPs were extracted in this way to build 9 SNP data sets. Strategy II takes into account the uniformity of the SNP distribution across chromosomes: it extracts the first 5, 10, 50, 100, 500, 1000, 5k, 10k, 50k, and 100k SNPs on each of the 12 chromosomes and pools them to form 9 data sets.
  • In strategy III, SNPs were randomly selected to form 9 SNP data sets of the same sizes as in strategies I and II. Compared with strategy I, SNPs in strategies II and III were more evenly distributed (D in Figure 1). The SNPs in strategies I and II showed more significant P values, that is, higher -log10(P), than those in strategy III (E in Figure 1).
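  • The three SNP selection strategies described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: each SNP is assumed to be a `(snp_id, chromosome, p_value)` tuple from a genome-wide association analysis, and the function names, the seed, and the simulated P values are hypothetical.

```python
import random
from collections import defaultdict

def strategy_one(snps, n_total):
    """Strategy I: the n_total most significant SNPs genome-wide."""
    return sorted(snps, key=lambda s: s[2])[:n_total]

def strategy_two(snps, n_per_chrom):
    """Strategy II: the n_per_chrom most significant SNPs on each chromosome."""
    by_chrom = defaultdict(list)
    for snp in snps:
        by_chrom[snp[1]].append(snp)
    picked = []
    for chrom in sorted(by_chrom):
        picked.extend(sorted(by_chrom[chrom], key=lambda s: s[2])[:n_per_chrom])
    return picked

def strategy_three(snps, n_total, seed=0):
    """Strategy III: n_total SNPs drawn at random, without replacement."""
    return random.Random(seed).sample(snps, n_total)

# Simulated GWAS output: 12 chromosomes with 50 SNPs each and random P values.
rng = random.Random(1)
snps = [(f"chr{c}_snp{i}", c, rng.random()) for c in range(1, 13) for i in range(50)]

set_one = strategy_one(snps, 60)      # 60 SNPs, possibly clustered on few chromosomes
set_two = strategy_two(snps, 5)       # 5 per chromosome x 12 chromosomes = 60 SNPs
set_three = strategy_three(snps, 60)  # 60 random SNPs
print(len(set_one), len(set_two), len(set_three))  # 60 60 60
```

  • Taking 5 SNPs per chromosome in strategy II yields the same data set size (60) as the smallest set in strategy I, which is how the two size series in the text line up.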
  • This invention uses a total of 12 algorithms for whole-genome selection to predict rice grain cadmium content, of which 8 are linear algorithms and 4 are machine learning algorithms.
  • The linear algorithms are rrBLUP, gBLUP, and Bayes A/B/C/Lasso/BRR/RKHS.
  • The machine learning algorithms are support vector machine (SVM), random forest (RF), LightGBM, and multi-layer perceptron (MLP). Each prediction is cross-validated 100 times, and the average is taken as the final prediction result.
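  • The repeated validation described above can be sketched as follows. The sketch assumes that "cross-validated 100 times" means 100 random splits into training and test populations at the 1:1 ratio used elsewhere in the text, with accuracy measured as the Pearson correlation between predicted and observed values; the source does not spell out the resampling scheme. The `fit_predict` interface and the least-squares stand-in model are hypothetical.

```python
import numpy as np

def pearson_accuracy(predicted, observed):
    """Prediction accuracy as the Pearson correlation of predicted vs observed."""
    return float(np.corrcoef(predicted, observed)[0, 1])

def repeated_split_accuracy(X, y, fit_predict, n_repeats=100,
                            train_fraction=0.5, seed=0):
    """Average accuracy over repeated random training/test splits.

    fit_predict(X_train, y_train, X_test) is any model wrapper that returns
    predictions for X_test (hypothetical interface).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n_train = int(round(len(y) * train_fraction))
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        train, test = order[:n_train], order[n_train:]
        predicted = fit_predict(X[train], y[train], X[test])
        scores.append(pearson_accuracy(predicted, y[test]))
    return float(np.mean(scores))

def least_squares_fit_predict(X_train, y_train, X_test):
    """Toy stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

# Usage with simulated data: 60 lines, 3 markers.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 3, size=(60, 3)).astype(float)
y_demo = X_demo @ np.array([0.05, -0.02, 0.03]) + rng.normal(0.0, 0.02, 60)
print(repeated_split_accuracy(X_demo, y_demo, least_squares_fit_predict))
```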
  • The rrBLUP algorithm is an indirect method model.
  • The analysis is carried out with the rrBLUP package of R (Lozada et al., 2019). The model is $Y = \mu + Xg + e$, where:
  • Y is the phenotype vector of the rice varieties in the training population;
  • μ is the fixed effect, that is, the phenotypic mean of the varieties in the training population;
  • X is the incidence matrix obtained by encoding the genotypes;
  • g is the vector of molecular marker effects estimated by the model;
  • e is the residual error (Endelman, 2011).
  • The gBLUP algorithm uses a mixed linear model for prediction (Yao Ji, 2018), and the analysis is carried out with the sommer package of R (Perez and de los Campos, 2014). The model is $Y = Z\beta + Xg + e$, where:
  • Y is the phenotype vector of the rice varieties in the training population;
  • Z is the fixed effect matrix;
  • β is the fixed effect vector;
  • X is the random effect matrix;
  • g is the vector of molecular marker effects estimated by the model;
  • e is the random error (VanRaden, 2008).
  • Bayes A/B/C were proposed by Meuwissen et al. (Meuwissen et al., 2001). Bayes A assumes that every SNP has an effect, that the effects follow a normal distribution, and that the effect variances follow a scaled inverse chi-square distribution. Bayes B assumes, in line with the actual situation across the genome, that a few SNPs have an effect while the others have none, and that the effect variances follow an inverse chi-square distribution. Gibbs and MH (Metropolis-Hastings) sampling are applied jointly in Bayes B to sample the marker effects and variances. Bayes C is an optimization of Bayes B. Bayes A/B/C can be expressed by the following unified formula.
  • Bayes Lasso (Mallick et al., 2014) assumes that marker effects follow a Laplace distribution, which allows very large or very small effects to occur with greater probability.
  • The difference between Bayes A/B/C and Bayes Lasso lies in the distribution of marker effects.
  • Bayes A/B/C assume that marker effects follow a normal distribution, whereas Bayes Lasso assumes a Laplace distribution.
  • Bayes BRR places a Gaussian prior on marker effects and thereby assumes that all markers have small or medium effects (Habier et al., 2007). It can be expressed by the following formula:
  • Bayes RKHS is a statistical method that combines the Bayes approach with RKHS (reproducing kernel Hilbert space) regression (de los Campos et al., 2010). In this study, the Bayes models were implemented with the BGLR package in R.
  • The support vector machine is a supervised machine learning method that can be used for classification and regression analysis (Cortes et al., 1995).
  • A linear decision surface is constructed from input vectors that are mapped non-linearly into a high-dimensional feature space. By finding the maximum margin and setting up a classifier, new, unknown data can be classified.
  • Support vector machines were implemented with the e1071 package in R.
  • The random forest algorithm is a classifier that makes predictions by combining multiple decision trees (Zhang Libin and Song Kaili, 2019). Its basic principle is to use bootstrap resampling to obtain different sample sets for model building; because the resulting models differ from one another, predictive ability improves (Dong Hongyao et al., 2021). The analysis is carried out with the randomForest package in R.
  • LightGBM uses a histogram-based method to find the best split points (related literature: Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, Ma C, Yan J, Wang X. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021 Sep 20; 22(1):271). Built on decision tree algorithms, LightGBM is a fast, memory-efficient, high-performance gradient boosting framework that can be used for ranking, classification, regression, and many other machine learning tasks. In this study, the Python package lightgbm v3.3.2 was used to build the LightGBM model.
  • MLP is a fully connected neural network with at least one hidden layer.
  • The output of each hidden layer is transformed by an activation function.
  • The method takes the neural network as its basic framework and attempts to imitate the learning pattern of natural biological neural networks.
  • The Python package d2lzh v1.0.0 was used to build the MLP.
  • The Bayes algorithms are inferior to the other statistical methods in terms of population size and time consumption. Apart from the Bayes algorithms, all the statistical methods can build a genome-wide selection model for a population of 500 within 4 hours, with rrBLUP and gBLUP the fastest (under 1 hour) (A in Figure 2).
  • The Bayes algorithms can only build genome-wide selection models for populations of up to 219, which takes about 7 hours (A in Figure 2). Time consumption and computational efficiency have always been considerations in Bayesian analysis, because the model effects must be sampled over thousands of Markov chain Monte Carlo iterations. As the number of response variables increases, each iteration requires the inversion and decomposition of a larger matrix, which makes it time-consuming.
  • Among the linear methods, the prediction accuracy of rrBLUP and gBLUP (average accuracy > 0.75) is higher than that of the Bayesian methods (0.67 < average accuracy < 0.7).
  • In general, the performance of linear methods such as rrBLUP is limited by population size but is not sensitive to the number of SNPs.
  • Machine learning, by contrast, is better able to exploit very large data sets but needs larger training populations to achieve high prediction accuracy. For example, in one case with a population size of 100,000, rrBLUP failed to train the model, whereas LightGBM completed training in 15 minutes with 40 GB of memory. The optimal statistical method for a genome-wide selection model therefore depends on population size and SNP density. For predicting OsGCd (cadmium content) with a population size of 500 and a SNP density of 60k, the present research shows that rrBLUP and gBLUP are the best statistical methods in terms of prediction accuracy and computational efficiency.
  • The prediction accuracy of genome-wide selection is related to the actual effects of the chromosome segments that the SNP markers can represent. Markers located in genomic regions that influence the trait have been shown to be an important factor in the average accuracy of models. Obtaining a large number of SNPs that are highly correlated with the trait is therefore key to building an accurate genome-wide selection model.
  • GWAS provides a feasible method for detecting SNP markers associated with traits.
  • The molecular markers used in this invention are markers associated with rice grain cadmium accumulation, screened by genome-wide association analysis.
  • Three strategies are used to screen the associated markers into SNP data sets. Strategy I selects, genome-wide, the 60, 120, 600, 1200, 6k, 12k, 60k, 120k, and 600k most strongly associated SNPs as marker densities. Strategy II selects the 5, 10, 50, 100, 500, 1000, 5k, 10k, and 50k most strongly associated SNPs within each chromosome as marker densities. Strategy III selects SNPs at random across the whole genome as marker densities. The effect of marker density on the accuracy of genome-wide selection prediction is then analysed.
  • High marker density is another way to ensure that marker-QTL (quantitative trait locus) associations are maintained, and thus to secure high prediction accuracy.
  • However, each trait has an optimal SNP marker density, beyond which average accuracy begins to decline.
  • With 60k SNPs, there is no significant difference in prediction accuracy between Strategy I and Strategy II, indicating that both strategies contain enough SNPs for accurate genome-wide selection modelling. The invention therefore examines the modelling performance of the intersection of the two 60k SNP sets.
  • A total of 45,805 SNPs were identified in the intersection of the 60k sets from Strategy I and Strategy II (C in Figure 4). They were evenly distributed across the 12 chromosomes, with -log10(P) values ranging from 1.794 to 8.043 (D in Figure 4).
  • The average accuracy reached 0.752 ± 0.035 (rrBLUP) and 0.756 ± 0.035 (gBLUP) (E in Figure 4), indicating that 45,805 SNPs are sufficient to predict OsGCd.
  • Balancing the relationship between the training and test populations also affects average accuracy.
  • Studies of the effect of the training-to-test population ratio show that the optimal ratio varies with plant species and trait.
  • For maize tar spot complex resistance prediction, relatively high prediction accuracy and the smallest standard error were observed when 50% of the total genotypes were used as the training population.
  • A ratio of 9:1, by contrast, is the optimal parameter for predicting maize heading, plant height, and ear weight.
  • For OsGCd content prediction in the present invention, 1:1 was likewise observed to be the optimal training-to-test population ratio. With this parameter, the average accuracy reaches 0.77 ± 0.003 (B in Figure 3). A population size of 500 (the number of rice accessions) and a training-to-test population ratio of 1:1 are therefore the best parameters for predicting OsGCd in the present invention.
  • The present invention combines high-throughput sequencing, genome-wide selection model prediction, and other modules with risk assessment to develop a system, the intelligent cadmium early warning system, for early warning of OsGCd risk in rice grain.
  • The intelligent cadmium early warning system comprises four main analysis modules: modelling, genotyping, OsGCd content prediction, and risk assessment.
  • The first module, modelling, builds a high-accuracy genome-wide selection model using the method and parameters of Example 1.
  • The second module, genotyping, obtains the SNPs of the rice varieties to be assessed by whole-genome resequencing or with a custom low-cadmium SNP array.
  • The third module, OsGCd content prediction, runs the genome-wide selection model with each rice variety's SNPs (single nucleotide polymorphisms) as the query and returns the predicted grain OsGCd content of each variety.
  • The fourth module performs risk assessment and basic data visualization: a rice variety is highlighted when its OsGCd exceeds the maximum allowable level of cadmium in rice (0.2 mg/kg) specified by the Ministry of Health of China (MHPRC, 2012) (the workflow is shown in A in Figure 5).
  • A genotype data set of 44 rice accessions containing 45,805 SNPs was derived from whole-genome resequencing.
  • The results show that the genome-wide selection model built from the 500-accession modelling population with the method and parameters of Example 1 predicted rice grain cadmium content in Wenling and Fuyang with accuracies of 0.756 ± 0.035 and 0.795 ± 0.023, respectively. The predicted grain cadmium content was on average about 2.5 times higher in Fuyang than in Wenling, which may be due to lower soil pH.
  • A total of 32 and 12 rice varieties were identified as risk varieties in Fuyang and Wenling, respectively (accessions exceeding the standard are shown in Table 1).
  • The rice OsGCd "intelligent cadmium early warning" system developed by this invention is the first OsGCd risk assessment and early warning system built from genotype to phenotype. For the OsGCd trait, the system demonstrated superior performance and broad environmental significance in flagging at-risk rice varieties. The "intelligent early warning" system is expected to be extensible to a wider range of hazardous materials and crop species, and thus to play a role in risk assessment and environmental protection.
  • The genome-wide selection approach to rice grain cadmium content established by the present invention differs from marker-assisted selection (MAS). In MAS, only a limited number of previously identified, most strongly associated markers are used to select the best lines, whereas the method of the present invention exploits genotype-phenotype relationships at the genome-wide level to produce reliable genome-wide selection models for samples without phenotypes.
  • The method requires two steps: (i) constructing a genome-wide selection model by combining molecular (high-density SNP marker) and phenotypic data sets in a training population (TRN), and (ii) using the established model to obtain genomic estimates of the phenotype for individuals in the test population (TST) that have been genotyped but not phenotyped. In this way, elite rice lines with low cadmium content can be screened in advance, without phenotyping in the later stages of breeding.
  • On this basis, the present invention also developed an innovative rice "intelligent cadmium early warning" system.
  • It is the first cadmium (OsGCd) content risk assessment and early warning system built from genotype to phenotype.
  • For the OsGCd trait, the system demonstrated superior performance and broad environmental significance in flagging at-risk rice varieties. The "intelligent early warning" system is expected to be extensible to a wider range of hazardous materials and crop species, and thus to play a role in risk assessment and environmental protection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Botany (AREA)
  • Mycology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

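The present invention discloses a device for predicting the rice grain cadmium accumulation trait and an early warning system, both based on genome-wide selection research. The invention constructs a genome-wide selection model for predicting rice grain cadmium content. The model is built with the rrBLUP or gBLUP algorithm; the rice model-building population contains 500 accessions, with a 1:1 ratio of training-population to test-population accessions; the cadmium-content-associated SNP molecular markers in the genotype data set are obtained by GWAS and are evenly distributed across the 12 rice chromosomes; the SNP marker density is 60K per rice genome. The model allows elite rice lines with low cadmium content to be screened in advance, without phenotyping in the later stages of breeding. The invention also establishes, for the first time, a rice "intelligent cadmium early warning" system that can be applied to a wider range of hazardous materials and crop species, thereby playing a role in risk assessment and environmental protection.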

Description

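Device and early warning system for predicting the rice grain cadmium accumulation trait based on genome-wide selection research
Technical Field
The present invention belongs to the field of biotechnology and specifically relates to a device and an early warning system for predicting the rice grain cadmium accumulation trait based on genome-wide selection research.
Background Art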
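Rice (Oryza sativa L.) is one of the major staple crops, and more than 60% of the population relies on rice as a staple food. Compared with other cereal crops, rice readily takes up cadmium from the soil during growth.
Cadmium (Cd) is an element that is not essential to the human body. The elemental form is a silvery-white metal with a density of 8.65 g/cm³, and it is a toxic heavy metal. In 2012, cadmium and its compounds were classified as Group 1 carcinogens by the International Agency for Research on Cancer (IARC). Cadmium can enter the human body through the food chain and accumulate there; toxic effects occur when accumulation in the body reaches 2.6 g. The biological half-life of cadmium in the human body is 15 to 45 years (Nordberg and Gunnar, 2015). Long-term accumulation harms the respiratory, circulatory, urinary, nervous, and skeletal systems, causing osteoporosis, renal failure, kidney stones, emphysema, and other conditions (Li Peixuan, Zhong Li, Guo Rui. Potential mechanisms of and therapeutic strategies for cardiovascular disease caused by the heavy metal cadmium [J]. Scientia Sinica Vitae, 2021, 51(9): 1241-1253; Lin HC, Hao WM, Chu PH. Cadmium and cardiovascular disease: An overview of pathophysiology, epidemiology, therapy, and predictive value. Rev Port Cardiol (Engl Ed). 2021 Aug; 40(8): 611-617; Kim MS, Kim SH, Jeon D, Kim HY, Han JY, Kim B, Lee K. Low-dose cadmium exposure exacerbates polyhexamethylene guanidine-induced lung fibrosis in mice. J Toxicol Environ Health A. 2018; 81(11): 384-396; Chung S, Chung JH, Kim SJ, Koh ES, Yoon HE, Park CW, Chang YS, Shin SJ. Blood lead and cadmium levels and renal function in Korean adults. Clin Exp Nephrol. 2014 Oct; 18(5): 726-34), and can in turn induce cancer.
Cadmium contamination of farmland soil has two kinds of source, natural and anthropogenic. Natural sources include geological activity such as volcanic eruptions; the deposited cadmium changes the environmental background level of cadmium in soil. Anthropogenic sources include ore mining, discharge of the "three wastes" (waste gas, wastewater, and solid waste), and sewage irrigation. Mining and smelting emissions are the main source; electronic-waste dismantling, sewage irrigation, and road traffic account for 58.8%, 44.8%, and 57.1%, respectively, so the influence of these human activities on cadmium accumulation in farmland soil cannot be ignored either (Cui Xiangfen et al., 2021).
Cadmium is toxic to plants, causing physiological damage and growth inhibition (Haider FU, Liqun C, Coulter JA, Cheema SA, Wu J, Zhang R, Wenjun M, Farooq M. Cadmium toxicity in plants: Impacts and remediation strategies. Ecotoxicol Environ Saf. 2021 Mar 15; 211:111887). Physiological damage includes reduced photosynthetic efficiency, lower water content, and inhibited uptake of essential elements. In plants, cadmium inhibits carbon fixation and chlorophyll synthesis and thereby impairs photosynthesis. Cadmium accumulation induces excessive production of reactive oxygen species, which damages plant organelles. In addition, cadmium interferes with the uptake of essential elements such as Ca, P, Mg, Fe, and Zn, leading to leaf chlorosis and impaired root growth, and ultimately to plant death.
Breeding low-cadmium-accumulating rice varieties is the most economical and feasible way to address cadmium contamination of rice. Such breeding has progressed from conventional breeding centred on phenotype to marker-assisted breeding centred on molecular markers associated with the target trait. Conventional breeding is the more extensively studied approach: different rice varieties are grown in the same soil environment, and low-accumulating varieties are selected by their grain cadmium phenotype. It is time-consuming, the grain cadmium phenotype is unstable because it is easily affected by environmental factors, it is regionally limited, and few commercial varieties exist at present. Marker-assisted breeding uses DNA molecular markers or functional markers tightly linked to the cadmium accumulation trait to select for the trait indirectly, and then combines this with conventional breeding to develop new varieties. It is efficient, accurate, and gives stable results, and it is currently one of the main approaches to breeding low-cadmium rice. However, because grain cadmium accumulation is a quantitative trait controlled by multiple genes and is easily influenced by the environment, existing conventional breeding and low-density markers associated with grain cadmium accumulation fall far short of practical breeding needs, and new technology suited to the rapid breeding of stably low-cadmium varieties is urgently required.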
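Genomic selection (GS) is the most promising breeding method for accelerating the development of new varieties and has broad application prospects. As a form of marker-assisted selection, GS uses high-density molecular markers covering the whole genome to predict the genomic estimated breeding value (GEBV) of individuals. Compared with traditional marker-assisted breeding, GS evaluates the effects of all markers simultaneously and predicts complex traits more accurately.
GS requires a training population (TRN) that is phenotyped for the target trait and genotyped with markers covering the whole genome. The training set is used to build a statistical model relating the markers to the corresponding phenotypes; by fitting the effects of all markers, the model estimates the influence of each marker on the target trait. The model is then used to predict the GEBVs of individuals in a genotyped test population (TST). The computational methods of GS are essentially algorithms for estimating GEBVs and currently fall into three classes: BLUP (Best Linear Unbiased Prediction) methods based on mixed linear models, Bayesian (Bayes) methods based on MCMC (Markov chain Monte Carlo) and Gibbs sampling, and machine learning methods.
BLUP methods are based on the mixed linear model. They assume that all SNPs contribute equally to the phenotypic trait, take both random effects and the fixed effects of genetic groups into account, and then compute each individual's GEBV from the phenotypes and the pedigree matrix A. Commonly used BLUP methods are GBLUP (genomic best linear unbiased prediction), built on the genome-wide relationship matrix (G matrix), and RRBLUP (ridge regression best linear unbiased prediction), based on allelic effects. Both run quickly and are suited to modelling and prediction with large populations.
Bayesian methods were proposed on the basis of linkage between SNPs and QTL and are non-linear models. They mainly include Bayes A, Bayes B, Bayes C, and Bayesian Lasso. Different Bayesian methods choose different prior distributions, that is, they estimate the effects of individual SNPs differently. Bayes A assumes that each SNP has its own variance; Bayes B assumes that only a few markers have effects; Bayes C assumes that the SNPs with effects share the same variance; Bayesian Lasso changes the distribution of the effect variances and assumes that marker effects follow a double exponential distribution. Bayesian models generally have long run times, and the prediction accuracies of the different Bayesian models are similar.
Machine learning methods use computer algorithms to fit large amounts of data repeatedly in order to predict the target trait. They mainly include the support vector machine (SVM), random forest (RF), and LightGBM (Light Gradient Boosting Machine). Compared with traditional algorithms, machine learning methods are efficient and intelligent, can predict complex traits fairly accurately, and are less prone to overfitting, but their parameters still need tuning to obtain the most accurate predictions.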
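In genome-wide selection research, prediction accuracy is the correlation coefficient between the true and the estimated breeding values; the closer the coefficient is to 1, the higher the accuracy. The main factors affecting prediction accuracy are the heritability of the target trait, the algorithm chosen, the density and source of the molecular markers, the size of the training population, and the relatedness between the training and test populations. Heritability is the proportion of phenotypic variance accounted for by genetic variance; the higher the heritability, the more the trait is controlled by genes and the less it is affected by the environment, and the higher the prediction accuracy of genome-wide selection. For traits with low heritability, accuracy can be improved by increasing the number of generations with phenotypic records. Marker density and source refer to the number and distribution of the markers covering the training population's genome and their degree of association with the target trait. Prediction accuracy generally rises with marker density, but once the number of markers reaches a certain point, accuracy peaks and then declines. Training population size is one of the important factors: accuracy generally improves as the training population grows. The ratio of training to test population also affects accuracy, and studies show that increasing this ratio helps to raise it. Relatedness between the training and test populations is likewise positively related to accuracy: the smaller the genetic distance and the closer the relationship, the higher the accuracy.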
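The two quantities defined in the paragraph above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original: $g$ is the true breeding value, $\hat{g}$ the estimated breeding value, and $\sigma^2_g$, $\sigma^2_e$, $\sigma^2_p$ the genetic, residual (environmental), and phenotypic variances, with no genotype-by-environment term.

```latex
r = \operatorname{cor}(g, \hat{g})
  = \frac{\operatorname{cov}(g, \hat{g})}{\sigma_{g}\,\sigma_{\hat{g}}},
\qquad
h^{2} = \frac{\sigma^{2}_{g}}{\sigma^{2}_{p}}
      = \frac{\sigma^{2}_{g}}{\sigma^{2}_{g} + \sigma^{2}_{e}}
```

In practice the true breeding value is unknown, so the correlation is usually computed between predicted values and observed phenotypes of held-out lines.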
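GS has been studied extensively in rice, mainly for pure-line selection and hybrid breeding. GS research in rice has focused on designing training populations and evaluating predictive ability within and between populations.
Using different rice breeding populations, GS studies have been carried out on a range of quantitative traits, including yield, flowering time, plant height, thousand-grain weight, and resistance; prediction accuracy varies with the statistical model and the number of molecular markers used (Table 1.1). These studies demonstrate the feasibility of GS in rice pure-line breeding. Xu et al. used three prediction models to predict yield, tiller number, grains per panicle, and thousand-grain weight for 21,945 hybrids, with mean predictive abilities of 0.1269, 0.2259, 0.3471, and 0.6797, respectively (Xu et al., 2014). Spindel et al. carried out GS analysis on 363 elite breeding lines from the International Rice Research Institute (IRRI) for yield, plant height, and flowering time, with predictive abilities of 0.31, 0.34, and 0.63, respectively (Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, Atlin G, Jannink JL, McCouch SR. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015 Feb 17; 11(2):e1004982). Júnior et al. used nine prediction models to predict yield, plant height, days to flowering, heading rate, brown spot severity, whole-grain yield, length-to-width ratio, and grain whiteness; the Bayes Cπ model performed stably for all traits. Yabe et al. carried out GS analysis of grain-filling characteristics, predicting two traits related to grain weight, the proportion of filled grains and the mean weight of filled grains, with predictive abilities of 0.30 and 0.28, respectively (Yabe S, Hara T, Ueno M, Enoki H, Kimura T, Nishimura S, Yasui Y, Ohsawa R, Iwata H. Potential of Genomic Selection in Mass Selection Breeding of an Allogamous Crop: An Empirical Study to Increase Yield of Common Buckwheat. Front Plant Sci. 2018 Mar 21; 9:276). For resistance in rice, GS studies exist on blast resistance and arsenic resistance, using models including BayesA, GBLUP, RKHS, BayesC, and MLR, with prediction accuracies from 0.15 to 0.725 (Ahmadi N, Ramanantsoanirina A, Santos JD, Frouin J, Radanielina T. Evolutionary Processes Involved in the Emergence and Expansion of an Atypical O. sativa Group in Madagascar. Rice (N Y). 2021 May 20; 14(1):44; Frouin J, Labeyrie A, Boisnard A, Sacchi GA, Ahmadi N. Genomic prediction offers the most effective marker assisted breeding approach for ability to prevent arsenic accumulation in rice grains. PLoS One. 2019 Jun 13; 14(6); Huang Y, Chen H, Reinfelder JR, Liang X, Sun C, Liu C, Li F, Yi J. A transcriptomic (RNA-seq) analysis of genes responsive to both cadmium and arsenic stress in rice root. Sci Total Environ. 2019 May 20; 666:445-460).
Hybrid breeding is the main means of using heterosis to raise rice yield; studies show that hybrid rice yields 20% more than inbred varieties. GS can efficiently select the desired crosses from the many potential cross combinations: it can predict the breeding values of all combinations of genotyped parents, thereby reducing the time and cost of field evaluation. In GS research on hybrid rice breeding, commonly used populations include NCII, RIL, and populations associated with the target trait. Many traits of hybrid progeny have been predicted, including yield per plant, thousand-grain weight, effective panicle number, plant height, number of primary branches, number of secondary branches, filled grains on the main panicle, and panicle length, with predictive ability ranging from low to high across trait types. The models used include GBLUP, MV-ADV, Lasso, and SVM. The best-predicted trait is thousand-grain weight (0.7 to 0.8), whereas predictive ability for yield per plant and panicle length is below 0.5.
Rapid global industrialization has led to the widespread dispersal of cadmium and to contamination of agricultural soils and products. A considerable proportion of rice consumers are exposed to cadmium levels above the provisional safe intake limit, which has raised widespread concern about risk management. The seed industry has existed for centuries and has created a rich diversity of rice varieties. Unlike traits such as plant height and yield, OsGCd cannot be assessed directly by field observation, which poses a challenge for evaluating germplasm quality.
Traditionally, rice varieties must first be grown in the field, and the OsGCd risk is then assessed by phenotypic testing after maturity. This is undoubtedly time-consuming and costly. How to give early warning of OsGCd risk before planting has therefore long been a key issue in the environmental sector.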
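Disclosure of the Invention
The technical problem to be solved by the present invention is how to use genome-wide selection to predict rice grain cadmium content, and/or how to establish a genome-wide selection model for the rice grain cadmium accumulation trait, and/or how to predict rice grain cadmium content, and/or how to give early warning of rice cadmium accumulation risk, and/or how to breed low-cadmium rice.
To solve the above technical problem, the present invention first provides a device for predicting rice grain cadmium content, which may comprise the following modules:
A1) Phenotype data set acquisition module: for obtaining the grain cadmium content phenotype data set of the rice model-building population;
A2) Genotype data set acquisition module: for obtaining, by genome-wide association analysis, the SNP molecular markers associated with rice grain cadmium content, to give the genotype data set;
A3) Genome-wide selection model construction module: for constructing, with a genome-wide selection algorithm, a genome-wide selection model for predicting rice grain cadmium content from the phenotype data set and the genotype data set;
A4) Test-rice SNP genotyping module: for assaying the SNP molecular markers in the rice to be tested to obtain its SNP genotypes;
A5) Genomic estimated breeding value calculation module: for calculating the genomic estimated breeding value of the rice to be tested from the genome-wide selection model and the SNP genotypes, and predicting the grain cadmium content of the rice to be tested from the genomic estimated breeding value.
In the above device, the genome-wide selection algorithm may be rrBLUP or gBLUP.
In the above device, the model-building population may consist of a training population and a test population, both composed of rice accessions. The ratio of the number of rice accessions in the training population to that in the test population may be 1:1. The SNP molecular markers are evenly distributed across the 12 rice chromosomes. The density of the SNP molecular markers may be 60K per rice genome.
In the above device, the number of rice accessions in the model-building population may be 500.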
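To solve the above technical problem, the present invention also provides a rice cadmium accumulation risk early warning device, which may comprise the following modules:
B1) Phenotype data set acquisition module: for obtaining the grain cadmium content phenotype data set of the rice model-building population;
B2) Genotype data set acquisition module: for obtaining, by genome-wide association analysis, the SNP molecular markers associated with rice grain cadmium content, to give the genotype data set;
B3) Genome-wide selection model construction module: for constructing, with a genome-wide selection algorithm, a genome-wide selection model for predicting rice grain cadmium content from the phenotype data set and the genotype data set;
B4) Test-rice SNP genotyping module: for assaying the SNP molecular markers in the rice to be tested to obtain its SNP genotypes;
B5) Genomic estimated breeding value calculation module: for calculating the genomic estimated breeding value of the rice to be tested from the genome-wide selection model and the SNP genotypes, and predicting the grain cadmium content of the rice to be tested from the genomic estimated breeding value;
B6) Cadmium content risk warning module: for outputting the names of the tested rice accessions whose cadmium content obtained in B5) is higher than the cadmium content risk value.
In the above device, the genome-wide selection algorithm may be rrBLUP or gBLUP. The model-building population may consist of a training population and a test population, both composed of rice accessions; the ratio of the number of rice accessions in the training population to that in the test population may be 1:1. The SNP molecular markers are evenly distributed across the 12 rice chromosomes. The density of the SNP molecular markers may be 60K per rice genome. The number of rice accessions in the model-building population may be 500.
The cadmium content risk value may be 0.2 mg/kg. The output in B6) may be a visual output.
To solve the above technical problem, the present invention also provides a system for early warning of rice cadmium accumulation risk. The system may comprise the device described above. The system may further comprise instruments, reagents, and/or kits for determining rice SNP genotypes.
The system may further comprise instruments, reagents, and/or kits for determining rice grain cadmium content.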
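To solve the above technical problem, the present invention also provides a computer-readable storage medium that causes a computer to run the following steps:
C1) obtaining the grain cadmium content phenotype data set of the rice model-building population;
C2) obtaining, by genome-wide association analysis, the SNP molecular markers associated with rice grain cadmium content, to give the genotype data set;
C3) constructing, with a genome-wide selection algorithm, a genome-wide selection model for predicting rice grain cadmium content from the phenotype data set and the genotype data set;
C4) assaying the SNP molecular markers in the rice to be tested to obtain its SNP genotypes;
C5) calculating the genomic estimated breeding value of the rice to be tested from the genome-wide selection model and the SNP genotypes, and predicting the grain cadmium content of the rice to be tested from the genomic estimated breeding value;
C6) outputting the names of the tested rice accessions whose cadmium content obtained in C5) is higher than the cadmium content risk value.
The output may be a visual output.
In the above computer-readable storage medium, the genome-wide selection algorithm may be rrBLUP or gBLUP. The model-building population may consist of a training population and a test population, both composed of rice accessions. The ratio of the number of rice accessions in the training population to that in the test population may be 1:1. The SNP molecular markers are evenly distributed across the 12 rice chromosomes. The density of the SNP molecular markers may be 60K per rice genome. The number of rice accessions in the model-building population may be 500.
Any of the following applications of the device described above and/or the system described above and/or the computer-readable storage medium described above also falls within the scope of protection of the present invention:
P1. Application in breeding rice with low cadmium content;
P2. Application in screening, or assisting in screening, rice with low cadmium content;
P3. Application in assessing, or assisting in assessing, the risk of environmental cadmium pollution.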
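Brief Description of the Drawings
Figure 1 shows the phenotype and genotype data sets for the genome-wide selection model of the rice cadmium accumulation trait. (A) Geographic origin (top) and relatedness (bottom) of the 500 rice accessions; the y-axis is the number of accessions by geographic origin (top) and by subpopulation (bottom), and the x-axis is the subpopulation. Letter colour, from light to dark grey, and letter size, from small to large, represent the number of rice varieties. (B) Frequency distribution of the OsGCd values of the 500 rice accessions; the y-axis is the number of accessions and the x-axis is rice grain cadmium content. Dark grey line: maximum allowable level of cadmium in rice (MHPRC, 2012); dark grey bars: varieties whose OsGCd exceeds the standard; light grey bars: varieties that meet the OsGCd standard; OsGCd: rice grain cadmium concentration. (C) Manhattan plot of OsGCd obtained with the MLM GWAS method; the y-axis is the negative logarithm of the P value, -log10(P), and the x-axis is the SNPs on each chromosome. The bar below shows the SNP density used for GWAS. (D) SNP density on the 12 chromosomes for Strategies I, II, and III. Light to dark grey indicates low to high SNP density. (E) -log10(P) values of the SNPs from Strategies I, II, and III.
Figure 2 shows the maximum accuracy achieved and the time consumed by the 12 modelling algorithms. (A) Comparison of the time consumed by the 12 statistical methods in building the genome-wide selection model; (B) comparison of the average accuracy of the 12 statistical methods using Strategy I SNPs; (C) comparison of the average accuracy of the 12 statistical methods using Strategy II SNPs; (D) comparison of the average accuracy of the 12 statistical methods using Strategy III SNPs. Model parameters: training-to-test population ratio 1:1; SNP density 60k; population size 500 (rrBLUP, gBLUP, RF, LightGBM, ANN, and SVM) and 219 (Bayes A, Bayes B, Bayes C, Bayes Lasso, Bayes BRR, and Bayes RKHS).
Figure 3 shows the optimal population size and training-to-test population ratio for the rice grain cadmium accumulation trait. (A) With the training-to-test population ratio and SNP density held constant, average accuracy was compared across 11 population sizes using rrBLUP and gBLUP as statistical methods; the y-axis is model accuracy and the x-axis is population size. (B) Using rrBLUP and gBLUP as statistical methods, average accuracy was compared across 9 training-to-test population ratios; the y-axis is model accuracy and the x-axis is the training-to-test population ratio.
Figure 4 shows the optimal SNP marker density for the rice grain cadmium accumulation trait. (A) With the training-to-test population ratio and population size held constant, average accuracy was compared across 9 SNP marker densities using rrBLUP and gBLUP as statistical methods; the y-axis is model accuracy and the x-axis is the number of SNP markers. (B) Change in -log10(P) with increasing SNP marker density under the three strategies; the y-axis is -log10(P) and the x-axis is SNP marker density. (C) Overlap of the SNP markers among the three strategies. (D) Manhattan plot of the SNPs in the intersection of Strategies I and II; the y-axis is -log10(P) and the x-axis is the SNPs on each chromosome. The bar below shows the SNP density used for GWAS. (E) Models built with the SNP markers in the intersection of Strategies I and II, the optimal population size, and the optimal training-to-test ratio, using rrBLUP and gBLUP as statistical methods; the y-axis is model accuracy and the x-axis is the two statistical methods.
Figure 5 shows the application of the "intelligent cadmium early warning system" to early warning of rice cadmium accumulation risk. (A) Basic workflow of the intelligent cadmium early warning system; (B) comparison of the risk varieties identified in Fuyang by genome-wide selection model prediction and by field trial; (C) comparison of the risk varieties identified in Wenling by genome-wide selection model prediction and by field trial. Light grey represents values measured in the field trial (Measured OsGCd) and dark grey represents predicted values (predicted OsGCd); the y-axis of the upper part is the exceedance rate and the y-axis of the lower part is the measured cadmium content. (D) Correlation coefficient between measured and predicted OsGCd values in the field trial of 44 rice varieties in Fuyang; (E) correlation coefficient between measured and predicted OsGCd values in the field trial of 44 rice varieties in Wenling; the y-axis is the predicted value and the x-axis is the measured value. MAE: mean absolute error.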
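Best Mode for Carrying Out the Invention
The present invention is described in further detail below with reference to specific embodiments. The examples are given only to illustrate the invention, not to limit its scope. They may serve as a guide for further improvement by those of ordinary skill in the art and do not limit the invention in any way.
Unless otherwise specified, the experimental methods in the following examples are conventional methods, and the materials, reagents, and instruments used are commercially available. All quantitative tests in the following examples were run in triplicate, and the results were averaged. Unless otherwise specified, in the following examples the first position of each nucleotide sequence in the sequence listing is the 5′-terminal nucleotide of the corresponding DNA and the last position is the 3′-terminal nucleotide.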
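Example 1. Method for genome-wide selection research on the rice grain cadmium accumulation trait
1. Determination of rice grain cadmium content and phenotypic data analysis
1.1 Planting and harvesting of rice materials
In the present invention, 500 rice mini-core germplasm accessions with wide geographic origins and sufficient relatedness were grown in two different cadmium-contaminated farmland sites to collect phenotypic data on rice grain cadmium content.
The 500 rice accessions are shown in Figure 1. Varieties from East Asia, the Americas, Europe, Africa, and Australia account for 54.7%, 18.7%, 10.7%, 9.3%, and 6.7%, respectively, and together comprise five subpopulations (A in Figure 1).
Planting of the rice materials starts from sowing. The procedure is as follows:
(1) Land preparation and ridging: The nursery field is ploughed throughout so that the soil is uniform. Ridges are then formed, each 70 cm wide, with length determined by the length of the nursery field. After ridging, the field is irrigated and pesticide and herbicide are applied. The field is sun-dried for 1-2 days before sowing.
(2) Seed soaking and pre-germination: Seeds are soaked in warm water for three days, with the water changed twice a day so that no off odour develops. Once the seeds show white tips, pre-germination begins. A relatively high temperature is maintained during pre-germination, ideally for one and a half to two days; the seeds can be sown once the sprouts are 5 mm long.
(3) Sowing and seedling raising: Each ridge is divided in two and marked out into sowing cells of 25-30 cm. Germinated seeds are sown in the middle of each cell and grown into seedlings.
(4) Pulling and ordering seedlings: After two weeks, the seedlings are pulled by accession number and tied together with their field label using rice straw or nylon cord; the roots are gently inserted into mud to keep the seedlings alive. The ordered seedlings are transported from the nursery to the transplanting field.
(5) Transplanting: The rice mini-core germplasm accessions are planted in the following environment: mean soil cadmium content 1.12 mg/kg, mean available cadmium content 0.91 mg/kg, pH 6.04. Each germplasm accession is planted in two rows with 25 cm between rows; each row has 8 plants with 20 cm between plants. To ensure data accuracy, a control material (CK) is also planted for later data correction. The CK variety is the local conventional japonica variety Jiahexiang 1, planted with 3 plants per row and the same row and plant spacing as the germplasm accessions. Guard rows are planted around the trial.
After the rice matured, grain samples were collected. To avoid edge effects, the two plants at either end next to the aisle were discarded, and the grain from the remaining plants was bulked. For CK, one bulked CK sample was collected for every 20 accessions (10 on each side). The harvested grain was placed in mesh bags together with the field labels and dried in the sun to prevent mould.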
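1.2 Determination of rice grain cadmium content
The collected grain samples were dried in the sun or in an oven at 60 °C for 3 days. Once the mass was constant, the grain was dehulled with a rice huller and the resulting brown rice was placed in 5 mL centrifuge tubes. The brown rice was then ground in a high-throughput low-noise tissue grinder for subsequent cadmium determination.
Rice grain cadmium content was determined by single-acid digestion, using a far-infrared temperature-controlled digestion furnace and glass digestion tubes. The steps, in brief, are as follows:
(1) Weighing: Accurately weigh 0.2000 g (to 0.0001 g) of ground rice grain sample into a glass digestion tube, avoiding powder sticking to the wall.
(2) Acid addition: Add 1 mL of guaranteed-reagent-grade nitric acid and cold-digest overnight.
(3) Digestion: Cover with a bent-neck funnel and digest at 200 °C for 6 h, until the digest is colourless and clear or slightly yellow.
(4) Dilution to volume: Rinse the digest out of the tube with distilled water, transfer the rinsings to a 15 mL volumetric tube, and make up to 15 mL.
(5) Filtration: After mixing, filter the diluted liquid through a 0.22 μm aqueous membrane filter into a 10 mL centrifuge tube for measurement.
Quality control: Each digestion batch included 2 blanks and 3 samples of rice flour certified reference material for component analysis (national reference material GBW100349, NCS Testing Technology) to ensure that the grain cadmium results were accurate and reliable. All samples were measured in triplicate.
Rice grain cadmium content was measured by inductively coupled plasma mass spectrometry (ICP-MS).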
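The weighing and dilution figures above (0.2 g of sample made up to 15 mL) imply a simple conversion from the ICP-MS reading of the digest to grain cadmium in mg/kg. The following is a minimal sketch of that arithmetic; the function names, the assumption that the instrument reports µg/L, and the blank-subtraction step are illustrative and are not taken from the original text.

```python
def grain_cd_mg_per_kg(solution_ug_per_l, blank_ug_per_l=0.0,
                       final_volume_ml=15.0, sample_mass_g=0.2):
    """Convert an ICP-MS reading of the digest (ug/L) to grain Cd (mg/kg).

    Assumes the reading is in ug/L and that a reagent blank is subtracted.
    """
    net_ug_per_l = solution_ug_per_l - blank_ug_per_l
    cd_ug = net_ug_per_l * (final_volume_ml / 1000.0)  # ug of Cd in the whole digest
    return cd_ug / sample_mass_g  # ug/g is numerically equal to mg/kg


def mean_of_replicates(values):
    """Average the replicate measurements of one sample."""
    return sum(values) / len(values)


# Hypothetical readings for one sample measured three times.
readings = [4.1, 3.9, 4.0]
print(round(mean_of_replicates([grain_cd_mg_per_kg(r) for r in readings]), 3))
```

With these volumes, each 1 µg/L in the digest corresponds to 0.075 mg/kg in the grain, so the 0.2 mg/kg limit corresponds to roughly 2.67 µg/L.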
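1.3 Analysis of rice grain cadmium content phenotypic data
Descriptive statistics of the grain cadmium content of the rice varieties were computed in Excel 2019. Across all genotypes, the mean rice grain cadmium concentration (OsGCd) of the 500 accessions ranged from 0.0015 mg/kg to 0.96 mg/kg; the upper part of this range exceeds the maximum allowable level of cadmium in rice (0.2 mg/kg) specified by the Ministry of Health of China (MHPRC, 2012) (B in Figure 1).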
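2. Genome-wide association analysis of the rice grain cadmium accumulation trait
Using the relative grain cadmium content phenotypes obtained in step 1 and the genotypes of the 500 rice germplasm accessions, a genome-wide association analysis of the rice grain cadmium accumulation trait was carried out with the MLM model in the MVP package in R. The MLM results were used to identify markers associated with grain cadmium accumulation in the modelling population, yielding the SNPs associated with OsGCd. Manhattan plots were drawn with the qqman and ggplot2 packages in R.
The results showed that the SNPs highly associated with rice grain cadmium accumulation were unevenly distributed across chromosomes, with -log10(P) reaching up to 8.04 on chromosome 8 (C in Figure 1). Because the genotype-phenotype association and the uniformity of SNP distribution are two important factors affecting the accuracy of genome-wide selection models, the present invention used three strategies to build the genotype data sets.
Strategy I ranks all SNPs by P value, regardless of the chromosome on which they lie. The top 60, 120, 600, 1200, 6k, 12k, 60k, 120k, and 600k SNPs were extracted to build 9 SNP data sets. Strategy II takes the uniformity of SNP distribution across chromosomes into account: the top 5, 10, 50, 100, 500, 1000, 5k, 10k, and 50k SNPs were extracted from each of the 12 chromosomes and pooled to form 9 data sets. As a control (Strategy III), SNPs were randomly selected to form 9 SNP data sets of the same sizes as in Strategies I and II. Compared with Strategy I, the SNPs in Strategies II and III were more evenly distributed (D in Figure 1), whereas the SNPs in Strategies I and II showed higher -log10(P) values than those in Strategy III (E in Figure 1).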
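The three selection strategies can be sketched in a few lines. This is an illustration only: the record layout (dictionaries with hypothetical keys `id`, `chrom`, and `p`) and the function names are assumptions made here, and the original analysis was carried out in R on GWAS output.

```python
import random
from collections import defaultdict


def strategy_one(snps, n_total):
    """Top n_total SNPs genome-wide, ranked by ascending GWAS P value."""
    return sorted(snps, key=lambda s: s["p"])[:n_total]


def strategy_two(snps, n_per_chrom):
    """Top n_per_chrom SNPs within each chromosome, pooled into one set."""
    by_chrom = defaultdict(list)
    for s in snps:
        by_chrom[s["chrom"]].append(s)
    picked = []
    for chrom in sorted(by_chrom):
        picked.extend(sorted(by_chrom[chrom], key=lambda s: s["p"])[:n_per_chrom])
    return picked


def strategy_three(snps, n_total, seed=0):
    """Random control set of the same size, ignoring P values."""
    return random.Random(seed).sample(snps, n_total)


# Toy GWAS table: 12 chromosomes x 10 SNPs with made-up P values.
toy = [{"id": f"chr{c}_{i}", "chrom": c, "p": (i + 1) / (100.0 * c)}
       for c in range(1, 13) for i in range(10)]
print(len(strategy_one(toy, 60)), len(strategy_two(toy, 5)), len(strategy_three(toy, 60)))
```

Taking 5 SNPs from each of 12 chromosomes gives 60 SNPs, which is why the per-chromosome sizes in Strategy II line up with the genome-wide sizes in Strategy I.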
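3. Construction and parameter setting of the genome-wide selection model for the rice grain cadmium accumulation trait
3.1 Construction of the genome-wide selection model for the rice grain cadmium accumulation trait
The rice grain cadmium content of a single modelling population obtained in step 1 was used as the phenotype data, and the SNP molecular markers associated with rice grain cadmium accumulation obtained in step 2 were used as the genotype data. A genome-wide selection prediction model suited to a single environment was built by comparing prediction accuracy under three parameters: algorithm, marker density, and proportion of the training set. To build a genome-wide selection model of the rice grain cadmium accumulation trait suited to two environments, the intersection of the markers associated with grain cadmium accumulation in environments A and B was used as the genotype data, and the grain cadmium content of the modelling population in the two environments was used as the phenotype data. The mean over 100 repeats of 10-fold cross-validation was taken as the final prediction accuracy.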
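The evaluation scheme described above, 10-fold cross-validation repeated 100 times with the mean taken as the accuracy, can be sketched as follows. The sketch assumes accuracy is the Pearson correlation between predicted and observed phenotypes, and `fit_predict` is a hypothetical callback standing in for whichever model is being evaluated; neither name comes from the original.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def repeated_kfold_accuracy(genotypes, phenotypes, fit_predict,
                            k=10, repeats=100, seed=0):
    """Mean correlation between out-of-fold predictions and observed phenotypes.

    fit_predict(train_genotypes, train_phenotypes, test_genotypes) must return
    one predicted value per test line.
    """
    rng = random.Random(seed)
    n = len(phenotypes)
    scores = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k disjoint folds covering all lines
        predicted = [0.0] * n
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            preds = fit_predict([genotypes[i] for i in train],
                                [phenotypes[i] for i in train],
                                [genotypes[i] for i in fold])
            for i, p in zip(fold, preds):
                predicted[i] = p
        scores.append(pearson(predicted, phenotypes))
    return sum(scores) / len(scores)
```

Each line is predicted exactly once per repeat, always by a model that did not see its phenotype, which is what makes the resulting correlation an out-of-sample estimate.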
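3.2 Parameter setting and optimization of the genome-wide selection model
3.2.1 Genome-wide selection algorithms
The invention uses a total of 12 algorithms for genome-wide selection to predict rice grain cadmium content: 8 linear algorithms and 4 machine learning algorithms. The linear algorithms are rrBLUP, gBLUP, and Bayes A/B/C/Lasso/BRR/RKHS. The machine learning algorithms are support vector machine (SVM), random forest (RF), LightGBM, and multi-layer perceptron (MLP). Each prediction is cross-validated 100 times, and the mean is taken as the final prediction result.
The rrBLUP algorithm is an indirect-method model. The analysis is carried out with the rrBLUP package in R (Lozada et al., 2019), using the following formula:
Y = μ + Xg + e
where Y is the vector of phenotypes of the rice varieties in the training population; μ is the calculated fixed effect, i.e., the phenotypic mean of the varieties in the training population; X is the incidence matrix obtained by encoding the genotypes; g is the vector of molecular marker effects estimated by the model; and e is the residual error (Endelman, 2011).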
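To make the model Y = μ + Xg + e concrete, the following is a minimal ridge-regression sketch in NumPy. It is not the rrBLUP R package: that package estimates the shrinkage parameter from variance components by REML, whereas here `lam` is simply passed in by the caller, and the function names are assumptions made for illustration.

```python
import numpy as np


def rrblup_fit(X, y, lam):
    """Ridge solution of Y = mu + X g + e for a fixed shrinkage parameter lam."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = y.mean()                      # fixed effect: phenotypic mean of the training lines
    col_means = X.mean(axis=0)
    Xc = X - col_means                 # centre each marker column
    # Dual form: g = Xc' (Xc Xc' + lam I)^-1 (y - mu), so only an n x n system is solved.
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y - mu)
    g = Xc.T @ alpha                   # one shrunken effect per marker
    return mu, col_means, g


def rrblup_predict(model, X_new):
    """Predict phenotypes of new lines from their marker genotypes."""
    mu, col_means, g = model
    return mu + (np.asarray(X_new, dtype=float) - col_means) @ g
```

The dual form matters when markers far outnumber lines (for example 60k SNPs and 250 training lines), because the linear system has the size of the number of lines rather than the number of markers.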
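The gBLUP algorithm predicts with a mixed linear model (Yao Ji, 2018) and is run with the sommer package in R (Perez and de los Campos, 2014), using the following formula:
Y = Zβ + Xg + ε
where Y is the vector of phenotypes of the rice varieties in the training population; Z is the calculated fixed-effect matrix; β is the fixed-effect vector; X is the random-effect matrix; g is the vector of molecular marker effects estimated by the model; and ε is the random error (VanRaden, 2008).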
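gBLUP is often written in terms of a genomic relationship matrix rather than marker effects. The sketch below builds such a matrix following the first method of VanRaden (2008) and predicts unphenotyped lines from it. It is an illustration under stated assumptions, not the sommer workflow used in the original: genotypes are assumed to be coded as 0/1/2 allele counts, the only fixed effect is the mean, and the variance ratio `lam` is supplied by the caller instead of being estimated.

```python
import numpy as np


def vanraden_g(M):
    """Genomic relationship matrix from a 0/1/2 allele-count matrix (lines x SNPs)."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0            # allele frequency of each SNP
    W = M - 2.0 * p                     # centre each marker column
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))


def gblup_predict(G, y_train, train_idx, test_idx, lam):
    """Predict test lines as mean + G[test, train] (G[train, train] + lam I)^-1 (y - mean)."""
    y_train = np.asarray(y_train, dtype=float)
    mu = y_train.mean()
    G_tt = G[np.ix_(train_idx, train_idx)]
    G_st = G[np.ix_(test_idx, train_idx)]
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + G_st @ alpha
```

Written this way, a test line's prediction is a relationship-weighted combination of the training phenotypes, which is one way to see why relatedness between the training and test populations affects accuracy.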
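Bayes A/B/C were proposed by Meuwissen et al. (Meuwissen et al., 2001). Bayes A assumes that every SNP has an effect, that the effects follow a normal distribution, and that the effect variances follow a scaled inverse chi-square distribution. Bayes B assumes, in line with the actual situation across the genome, that a few SNPs have an effect while the others have none, and that the effect variances follow an inverse chi-square distribution. Gibbs and MH (Metropolis-Hastings) sampling are applied jointly in Bayes B to sample the marker effects and variances. Bayes C is an optimization of Bayes B. Bayes A/B/C can be expressed by the following unified formula.
Mallick proposed the Bayes Lasso method (Mallick et al., 2014). Bayes Lasso assumes that marker effects follow a Laplace distribution, which allows very large or very small effects to occur with greater probability. The difference between Bayes A/B/C and Bayes Lasso lies in the distribution of marker effects: Bayes A/B/C assume a normal distribution, whereas Bayes Lasso assumes a Laplace distribution.
The Bayes BRR method places a Gaussian prior on marker effects and thereby assumes that all markers have small or medium effects (Habier et al., 2007). It can be expressed by the following formula:
Bayes RKHS is a statistical method that combines the Bayes approach with RKHS (de los Campos et al., 2010). In this study, the Bayes models were implemented with the BGLR package in R.
The support vector machine is a supervised machine learning method that can be used for classification and regression analysis (Cortes et al., 1995). In a support vector machine, a linear decision surface is constructed from input vectors that are mapped non-linearly into a high-dimensional feature space. By finding the maximum margin and setting up a classifier, new, unknown data can be classified. In this study, support vector machines were implemented with the e1071 package in R.
The random forest algorithm is a classifier that makes predictions by combining multiple decision trees (Zhang Libin and Song Kaili, 2019). Its basic principle is to use bootstrap resampling to obtain different sample sets for model building; because the resulting models differ from one another, predictive ability improves (Dong Hongyao et al., 2021). The analysis was carried out with the randomForest package in R.
LightGBM uses a histogram-based method to find the best split points (related literature: Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, Ma C, Yan J, Wang X. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021 Sep 20; 22(1):271). Built on decision tree algorithms, LightGBM is a fast, memory-efficient, high-performance gradient boosting framework that can be used for ranking, classification, regression, and many other machine learning tasks. In this study, the Python package lightgbm v3.3.2 was used to build the LightGBM model.
MLP is a fully connected neural network with at least one hidden layer. The output of each hidden layer is transformed by an activation function. The method takes the neural network as its basic framework and attempts to imitate the learning pattern of natural biological neural networks. In this study, the Python package d2lzh v1.0.0 was used to build the MLP.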
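The results show that linear models are better suited than machine learning models to building a genome-wide selection model for OsGCd.
Specifically, the Bayes algorithms were inferior to the other statistical methods in terms of population size and time consumption. Apart from the Bayes algorithms, all the statistical methods could build a genome-wide selection model for a population of 500 within 4 hours, with rrBLUP and gBLUP the fastest (under 1 hour) (A in Figure 2). The Bayes algorithms could only build genome-wide selection models for populations of up to 219, taking about 7 hours (A in Figure 2). Time consumption and computational efficiency have always been considerations in Bayesian analysis, because the model effects must be sampled over thousands of Markov chain Monte Carlo iterations. As the number of response variables increases, each iteration requires the inversion and decomposition of a larger matrix, which makes it time-consuming. The other statistical methods were computationally more efficient than the Bayes algorithms, indicating a greater capacity for exploring large data. In terms of prediction accuracy, the maximum average accuracies obtained by the 12 statistical methods ranked in descending order as: rrBLUP ≈ gBLUP > Bayes BRR ≈ Bayes RKHS > Bayes A ≈ Bayes B ≈ Bayes C ≈ Bayes Lasso ≈ SVM > RF > LightGBM > MLP (B-D in Figure 2). Overall, genome-wide selection models using linear statistical methods (average accuracy > 0.68) were more accurate than those using machine learning (average accuracy < 0.59). Among the linear statistical methods, rrBLUP and gBLUP (average accuracy > 0.75) were more accurate than the Bayesian methods (0.67 < average accuracy < 0.7). In general, the performance of linear methods such as rrBLUP is limited by population size but is not sensitive to the number of SNPs. Machine learning, on the other hand, is better able to exploit very large data sets but needs larger training populations to achieve high prediction accuracy. For example, in one case with a population size of 100,000, rrBLUP failed to train the model, whereas LightGBM completed training in 15 minutes with 40 GB of memory. The optimal statistical method for a genome-wide selection model therefore depends on population size and SNP density. For predicting OsGCd (cadmium content) with a population size of 500 and a SNP density of 60k, the research of the present invention shows that rrBLUP and gBLUP are the best statistical methods in terms of prediction accuracy and computational efficiency.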
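3.2.2 Improving prediction accuracy with high-density markers and integrated GWAS results
The prediction accuracy of genome-wide selection is related to the actual effects of the chromosome segments that the SNP markers can represent. Markers located in genomic regions that influence the trait have been shown to be an important factor in the average accuracy of models. Obtaining a large number of SNPs that are highly correlated with the trait is therefore key to building an accurate genome-wide selection model. GWAS provides a feasible method for detecting SNP markers associated with traits.
The molecular markers used in the present invention are markers associated with rice grain cadmium accumulation, screened by genome-wide association analysis. Three strategies were used to screen the associated markers into SNP data sets: Strategy I selects, genome-wide, the 60, 120, 600, 1200, 6k, 12k, 60k, 120k, and 600k most strongly associated SNPs as marker densities; Strategy II selects the 5, 10, 50, 100, 500, 1000, 5k, 10k, and 50k most strongly associated SNPs within each chromosome as marker densities; Strategy III selects SNPs at random across the whole genome as marker densities. The effect of marker density on the accuracy of genome-wide selection prediction was then analysed.
By incorporating the GWAS results, the average accuracies of Strategy I and Strategy II reached 0.73 ± 0.03 and 0.75 ± 0.03, respectively, whereas the average accuracy of Strategy III (random selection) was only 0.43 ± 0.04 (A in Figure 4). The -log10(P) values in Strategy III were on average about 5.5 times lower than in Strategies I and II (B in Figure 4), indicating that integrating GWAS results is a feasible way to improve the prediction accuracy of genome-wide selection models.
High marker density is another way to ensure that marker-QTL (quantitative trait locus) associations are maintained, and thus to secure high prediction accuracy. However, each trait has an optimal SNP marker density, beyond which average accuracy begins to decline. For OsGCd prediction, the highest accuracy was reached when the number of SNPs in Strategy I (average accuracy = 0.73 ± 0.003) and Strategy II (average accuracy = 0.75 ± 0.003) was 60k (A in Figure 4). With 60k SNPs, there was no significant difference in prediction accuracy between Strategy I and Strategy II, indicating that both strategies contain enough SNPs for accurate genome-wide selection modelling. The invention therefore examined the modelling performance of the intersection of the two 60k SNP sets. A total of 45,805 SNPs were identified in the intersection of the 60k sets from Strategy I and Strategy II (C in Figure 4); they were evenly distributed across the 12 chromosomes, with -log10(P) values ranging from 1.794 to 8.043 (D in Figure 4). With the 45,805 SNPs as the genotype data set, the average accuracy reached 0.752 ± 0.035 (rrBLUP) and 0.756 ± 0.035 (gBLUP) (E in Figure 4), indicating that 45,805 SNPs are sufficient to predict OsGCd.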
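Taking the intersection of the two 60k sets is a plain set operation on SNP identifiers. The sketch below shows the idea on toy data, together with a per-chromosome count of the kind one would use to check how evenly the shared SNPs are spread. The identifiers and the `(snp_id, chromosome)` tuple layout are made up for illustration.

```python
def intersect_snp_sets(strategy_one_snps, strategy_two_snps):
    """Return the SNPs present in both sets, keeping the order of the first set."""
    ids_two = {snp_id for snp_id, _chrom in strategy_two_snps}
    return [(snp_id, chrom) for snp_id, chrom in strategy_one_snps if snp_id in ids_two]


def count_per_chromosome(snps):
    """Count SNPs per chromosome to see how evenly the shared set is spread."""
    counts = {}
    for _snp_id, chrom in snps:
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts


# Toy example with made-up SNP identifiers.
s1 = [("snp_a", 1), ("snp_b", 1), ("snp_c", 8)]
s2 = [("snp_b", 1), ("snp_c", 8), ("snp_d", 12)]
shared = intersect_snp_sets(s1, s2)
print(shared, count_per_chromosome(shared))
```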
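3.2.3 Optimizing the model by increasing population size and balancing the training-to-test population ratio
Larger populations generally carry broader genetic diversity that can be used for prediction. Average accuracy usually rises with population size until it reaches a plateau, but the population size required always varies with plant species and variety. In a case of maize heading, plant height, and ear weight prediction, the accuracy of rrBLUP and LightGBM declined gradually from about 0.75, 0.79, and 0.65 for the respective traits as population size decreased, and 6210 was the optimal population size. For maize kernel oil prediction, 250 was identified as the optimal population size.
A similar trend was observed for OsGCd prediction in the present invention. As population size increased from 50 to 500, average accuracy rose to a maximum of 0.75 ± 0.003 in Strategy I, 0.77 ± 0.003 in Strategy II, and 0.43 ± 0.004 in Strategy III (A in Figure 3), indicating that 500 is the optimal population size for OsGCd G2P modelling.
Besides population size, balancing the relationship between the training and test populations also affects average accuracy. Studies of the effect of the training-to-test population ratio show that the optimal ratio varies with plant species and trait. For maize tar spot complex resistance prediction, relatively high prediction accuracy and the smallest standard error were observed when 50% of the total genotypes were used as the training population, whereas a ratio of 9:1 is the optimal parameter for predicting maize heading, plant height, and ear weight. For OsGCd content prediction in the present invention, 1:1 was likewise observed to be the optimal training-to-test population ratio. With this parameter, the average accuracy reaches 0.77 ± 0.003 (B in Figure 3). A population size of 500 (the number of rice accessions) and a training-to-test population ratio of 1:1 are therefore the best parameters for predicting OsGCd in the present invention.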
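Comparing training-to-test ratios comes down to splitting the accessions at random in a given proportion and scoring the model on the held-out part. A minimal sketch of the splitting step follows; the function name and the seeded shuffle are illustrative assumptions rather than the procedure actually used.

```python
import random


def split_by_ratio(n_lines, train_parts, test_parts, seed=0):
    """Randomly split line indices into training and test sets at train_parts:test_parts."""
    order = list(range(n_lines))
    random.Random(seed).shuffle(order)
    n_train = round(n_lines * train_parts / (train_parts + test_parts))
    return sorted(order[:n_train]), sorted(order[n_train:])


# With 500 accessions, 1:1 gives 250 training and 250 test lines; 9:1 gives 450 and 50.
for ratio in [(1, 1), (9, 1)]:
    train, test = split_by_ratio(500, *ratio)
    print(ratio, len(train), len(test))
```

Repeating the split with different seeds and averaging the resulting accuracies gives the kind of mean ± spread figures reported for each ratio.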
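Example 2. Application of the "intelligent cadmium early warning system" to early warning of rice cadmium accumulation risk
To support data-driven decisions in OsGCd risk early warning, the present invention combines high-throughput sequencing, genome-wide selection model prediction, and other modules with risk assessment to develop a system, the intelligent cadmium early warning system, for early warning of OsGCd risk in rice grain. The system comprises four main analysis modules: modelling, genotyping, OsGCd content prediction, and risk assessment.
The first module, modelling, builds a high-accuracy genome-wide selection model using the method and parameters of Example 1. The second module, genotyping, obtains the SNPs of the rice varieties to be assessed by whole-genome resequencing or with a custom low-cadmium SNP array. The third module, OsGCd content prediction, runs the genome-wide selection model with each rice variety's SNPs (single nucleotide polymorphisms) as the query and returns the predicted grain OsGCd content of each variety. The fourth module performs risk assessment and basic data visualization: a rice variety is highlighted when its OsGCd exceeds the maximum allowable level of cadmium in rice (0.2 mg/kg) specified by the Ministry of Health of China (MHPRC, 2012) (the workflow is shown in A in Figure 5).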
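The fourth module reduces to a threshold test against the 0.2 mg/kg limit. The following minimal sketch shows that step on made-up predictions; the variety names and the function name are hypothetical, and the real system also includes visualization, which is not shown.

```python
MAX_CD_MG_PER_KG = 0.2  # maximum allowable level cited in the text (MHPRC, 2012)


def flag_risk_varieties(predicted_cd, limit=MAX_CD_MG_PER_KG):
    """Return (name, value) pairs whose predicted grain Cd exceeds the limit, highest first."""
    flagged = [(name, cd) for name, cd in predicted_cd.items() if cd > limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical predicted OsGCd values in mg/kg.
predicted = {"variety_A": 0.08, "variety_B": 0.35, "variety_C": 0.21}
for name, cd in flag_risk_varieties(predicted):
    print(f"RISK {name}: predicted OsGCd {cd:.2f} mg/kg")
```

A variety exactly at the limit is not flagged here, because the text describes highlighting only values higher than the maximum allowable level.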
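To investigate the effectiveness of the intelligent cadmium early warning system, the present invention ran an experiment on 44 rice accessions at two sites, Fuyang and Wenling in Zhejiang Province, to predict the risk of cadmium contamination of rice (the accessions were a gift from the laboratory of Li Zichao, China Agricultural University; related literature: Zhao Y, Zhang H, Xu J, Jiang C, Yin Z, Xiong H, Xie J, Wang X, Zhu X, Li Y, Zhao W, Rashid MAR, Li J, Wang W, Fu B, Ye G, Guo Y, Hu Z, Li Z, Li Z. Loci and natural alleles underlying robust roots and adaptive domestication of upland ecotype rice in aerobic conditions. PLoS Genet. 2018 Aug 10; 14(8):e1007521).
The genotype data set of the 44 rice accessions, containing 45,805 SNPs, was derived from whole-genome resequencing.
The results show that the genome-wide selection model built from the 500-accession modelling population with the method and parameters of Example 1 predicted rice grain cadmium content in Wenling and Fuyang with accuracies of 0.756 ± 0.035 and 0.795 ± 0.023, respectively. The predicted grain cadmium content was on average about 2.5 times higher in Fuyang than in Wenling, which may be due to lower soil pH. A total of 32 and 12 rice varieties were identified as risk varieties in Fuyang and Wenling, respectively (accessions exceeding the standard are shown in Table 1).
The 44 rice accessions were then grown in cadmium-contaminated farmland at the two sites, Fuyang and Wenling in Zhejiang Province, to determine actual rice grain cadmium content. The planting and cadmium determination methods were the same as in Example 1.
The field trial results show a correlation between measured and predicted values (D and E in Figure 5), confirming the effectiveness of the intelligent cadmium early warning system. The values for rice grain cadmium (OsGCd) content in Fuyang (shown as Fuyang in B in Figure 5) and Wenling (shown as Wenling in C in Figure 5) reached 0.79 and 0.81, respectively, and the risk varieties detected in the field trial were consistent with the predicted results (Table 1).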
Table 1. Measured values, predicted values and risk assessment for the 44 accessions

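The "Intelligent Cadmium Early-Warning" system developed in the present invention for rice OsGCd is the first OsGCd risk assessment and early-warning system built from a genotype-to-phenotype perspective. For the OsGCd trait, it demonstrates superior performance in flagging at-risk rice varieties and has broad environmental significance. The "Intelligent Early-Warning" system is expected to be extensible to a wider range of hazardous substances and crop varieties, and thus to contribute to risk assessment and environmental protection.
The present invention has been described in detail above. For those skilled in the art, without departing from the spirit and scope of the invention and without undue experimentation, the invention can be practiced over a wide range under equivalent parameters, concentrations and conditions. Although specific embodiments are given, it should be understood that further modifications can be made to the invention. In short, in accordance with the principles of the invention, this application is intended to cover any variations, uses or improvements of the invention, including departures from the scope disclosed in this application that are made using conventional techniques known in the art.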
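Industrial Applicability
The genomic selection approach to rice grain cadmium content established in the present invention differs from marker-assisted selection (MAS). In MAS, only a limited number of previously identified, most strongly associated markers are used to select the best lines, whereas the method of the present invention exploits genotype-phenotype relationships at the whole-genome level to build a reliable genomic selection model for samples without phenotypes. In brief, the method requires two steps: (i) building a genomic selection model by combining molecular data (high-density SNP markers) and phenotype datasets in a training population (TRN), and (ii) using the established model to obtain genome-estimated phenotypes for individuals in a test population (TST) that have been genotyped but not phenotyped. In this way, superior low-cadmium rice lines can be screened out in advance, without phenotypic analysis late in the breeding cycle.
On this basis, the present invention also developed the "Intelligent Cadmium Early-Warning" system for rice, which is the first risk assessment and early-warning system for cadmium (OsGCd) content built from a genotype-to-phenotype perspective. For the OsGCd trait, it demonstrates superior performance in flagging at-risk rice varieties and has broad environmental significance. The "Intelligent Early-Warning" system is expected to be extensible to a wider range of hazardous substances and crop varieties, and thus to contribute to risk assessment and environmental protection.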

Claims (10)

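  1. A device for predicting the cadmium content of rice grains, characterized in that the device comprises the following modules:
    A1) a phenotype dataset acquisition module, for obtaining a phenotype dataset of grain cadmium content for the rice of a model-building population;
    A2) a genotype dataset acquisition module, for obtaining, by genome-wide association analysis, SNP molecular markers associated with rice grain cadmium content, so as to obtain a genotype dataset;
    A3) a genomic selection model construction module, for constructing, by a genomic selection algorithm and on the basis of said phenotype dataset and said genotype dataset, a genomic selection model that predicts rice grain cadmium content;
    A4) a test-rice SNP genotyping module, for assaying said SNP molecular markers in the rice to be tested to obtain the SNP genotypes of said rice to be tested;
    A5) a genomic estimated breeding value calculation module, for calculating the genomic estimated breeding value of said rice to be tested using said genomic selection model and said SNP genotypes, and predicting the grain cadmium content of said rice to be tested from said genomic estimated breeding value.
  2. The device according to claim 1, characterized in that the genomic selection algorithm is rrBLUP or gBLUP.
  3. The device according to claim 1 or 2, characterized in that: said model-building population consists of a training population and a test population, both of which consist of rice accessions; the ratio of the number of rice accessions in said training population to that in said test population is 1:1; said SNP molecular markers are evenly distributed on the 12 chromosomes of rice; and the density of said SNP molecular markers is 60K per rice genome.
  4. The device according to any one of claims 1-3, characterized in that the number of rice accessions in said model-building population is 500.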
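  5. A device for early warning of cadmium accumulation risk in rice, characterized in that the device comprises the following modules:
    B1) a phenotype dataset acquisition module, for obtaining a phenotype dataset of grain cadmium content for the rice of a model-building population;
    B2) a genotype dataset acquisition module, for obtaining, by genome-wide association analysis, SNP molecular markers associated with rice grain cadmium content, so as to obtain a genotype dataset;
    B3) a genomic selection model construction module, for constructing, by a genomic selection algorithm and on the basis of said phenotype dataset and said genotype dataset, a genomic selection model that predicts rice grain cadmium content;
    B4) a test-rice SNP genotyping module, for assaying said SNP molecular markers in the rice to be tested to obtain the SNP genotypes of said rice to be tested;
    B5) a genomic estimated breeding value calculation module, for calculating the genomic estimated breeding value of said rice to be tested using said genomic selection model and said SNP genotypes, and predicting the grain cadmium content of said rice to be tested from said genomic estimated breeding value;
    B6) a cadmium content risk warning module, for outputting the names of the rice accessions to be tested whose cadmium content obtained in B5) is higher than the cadmium content risk value.
  6. The device according to claim 5, characterized in that: the genomic selection algorithm is rrBLUP or gBLUP; said model-building population consists of a training population and a test population, both of which consist of rice accessions; the ratio of the number of rice accessions in said training population to that in said test population is 1:1; said SNP molecular markers are evenly distributed on the 12 chromosomes of rice; the density of said SNP molecular markers is 60K per rice genome; and the number of rice accessions in said model-building population is 500.
  7. A system for early warning of cadmium accumulation risk in rice, characterized in that the system comprises the device according to claim 5 or 6, and further comprises instruments, reagents and/or kits for determining rice SNP genotypes.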
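  8. A computer-readable storage medium storing a computer program, characterized in that the computer program causes a computer to perform the following steps:
    C1) obtaining a phenotype dataset of grain cadmium content for the rice of a model-building population;
    C2) obtaining, by genome-wide association analysis, SNP molecular markers associated with rice grain cadmium content, so as to obtain a genotype dataset;
    C3) constructing, by a genomic selection algorithm and on the basis of said phenotype dataset and said genotype dataset, a genomic selection model that predicts rice grain cadmium content;
    C4) assaying said SNP molecular markers in the rice to be tested to obtain the SNP genotypes of said rice to be tested;
    C5) calculating the genomic estimated breeding value of said rice to be tested using said genomic selection model and said SNP genotypes, and predicting the grain cadmium content of said rice to be tested from said genomic estimated breeding value;
    C6) outputting the names of the rice accessions to be tested whose cadmium content obtained in C5) is higher than the cadmium content risk value.
  9. The computer-readable storage medium according to claim 8, characterized in that: the genomic selection algorithm is rrBLUP or gBLUP; said model-building population consists of a training population and a test population, both of which consist of rice accessions; the ratio of the number of rice accessions in said training population to that in said test population is 1:1; said SNP molecular markers are evenly distributed on the 12 chromosomes of rice; the density of said SNP molecular markers is 60K per rice genome; and the number of rice accessions in said model-building population is 500.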
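  10. Any one of the following uses of the device according to any one of claims 1-6, and/or the system according to claim 7, and/or the computer-readable storage medium according to claim 8 or 9:
    P1. use in breeding rice with low cadmium content;
    P2. use in screening, or assisting in screening, rice with low cadmium content;
    P3. use in assessing, or assisting in assessing, the risk of environmental cadmium pollution.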
PCT/CN2023/119026 2022-09-15 2023-09-15 Device for predicting the rice grain cadmium accumulation trait and early-warning system, based on genomic selection research WO2024056056A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211132783.XA CN115579057A (zh) 2022-09-15 2022-09-15 Device for predicting the rice grain cadmium accumulation trait and early-warning system, based on genomic selection research
CN202211132783.X 2022-09-15

Publications (1)

Publication Number Publication Date
WO2024056056A1 (zh) 2024-03-21

Family

ID=84582091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/119026 WO2024056056A1 (zh) 2022-09-15 2023-09-15 Device for predicting the rice grain cadmium accumulation trait and early-warning system, based on genomic selection research

Country Status (2)

Country Link
CN (1) CN115579057A (zh)
WO (1) WO2024056056A1 (zh)



Also Published As

Publication number Publication date
CN115579057A (zh) 2023-01-06


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864785

Country of ref document: EP

Kind code of ref document: A1