CN111898807B - Tobacco leaf yield prediction method based on whole genome selection and application - Google Patents

Tobacco leaf yield prediction method based on whole genome selection and application

Info

Publication number
CN111898807B
Authority
CN
China
Prior art keywords
tobacco
whole genome
model
data
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010675520.8A
Other languages
Chinese (zh)
Other versions
CN111898807A (en
Inventor
童治军
肖炳光
方敦煌
陈学军
曾建敏
焦芳婵
姚恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Academy of Tobacco Agricultural Sciences
Original Assignee
Yunnan Academy of Tobacco Agricultural Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Academy of Tobacco Agricultural Sciences
Priority to CN202010675520.8A
Publication of CN111898807A
Application granted
Publication of CN111898807B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Mining
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention belongs to the technical field of biology, and particularly relates to a tobacco leaf yield prediction method based on whole genome selection and its application. The method acquires tobacco leaf whole genome data in a candidate prediction model, screens and optimizes the tobacco leaf whole genome data in real time, and generates tobacco yield prediction data. Using genotype data from the tobacco seedling stage (or another early stage), the method can calculate or simulate the phenotype values of tobacco yield traits at the maturity stage. By analyzing the genotype data of a tobacco population or tobacco variety, it predicts the mature-stage yield trait phenotype values across the whole genome, so that these values are available while the tobacco is still at the seedling stage.

Description

Tobacco leaf yield prediction method based on whole genome selection and application
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a tobacco leaf yield prediction method based on whole genome selection and application thereof.
Background
Tobacco is a special cash crop grown for its leaves. It plays an important role in China's national economy and is the basis of China's tobacco industry, so breeding tobacco varieties with higher yield and better quality has become an important goal of breeding research.
Studies show that tobacco leaf yield is significantly correlated with traits such as leaf number, leaf length and leaf width. These are quantitative traits controlled by multiple genes and influenced by the environment, and their inheritance is relatively complex.
In addition, measuring the phenotype values of tobacco yield traits at the maturity stage in the traditional way is time-consuming, labor-intensive and inefficient. Obtaining reasonably accurate yield trait phenotype values requires a complete and long field growth period followed by field measurement after maturity, and the results are easily affected by environmental and human factors, which makes them uncertain.
Disclosure of Invention
In view of the technical problems that measuring mature-stage tobacco yield trait phenotype values is currently time-consuming, labor-intensive and inefficient, and that tobacco yield traits are easily affected by environmental and human factors so that measurement results are uncertain, the invention provides a tobacco yield prediction method based on whole genome selection and its application, in which the mature-stage yield trait phenotype values of a tobacco material are predicted from genotype data of the tobacco seedling (early) stage.
The invention is realized by the following technical scheme:
A tobacco leaf yield prediction method based on whole genome selection comprises the following steps:
acquiring tobacco leaf whole genome data in a candidate prediction model;
screening and optimizing the tobacco leaf whole genome data in real time;
generating tobacco yield prediction data.
Further, in the step of obtaining the tobacco leaf whole genome data in the candidate prediction model, the method further comprises the following steps:
setting core parameters of candidate prediction models;
establishing a candidate prediction model;
and preliminarily screening the tobacco leaf whole genome data using the candidate prediction model in combination with the core parameters.
Further, in the step of screening and optimizing the whole genome data of the tobacco leaves in real time, the method further comprises the following steps:
establishing a whole genome selection model;
verifying core parameters of a tobacco leaf candidate prediction model;
and secondarily screening the tobacco leaf whole genome data in real time using the whole genome selection model in combination with the core parameters.
Further, the core parameters include: the number of molecular markers, the size of training population, the proportion of training population to testing population and the model prediction precision value.
Further, the whole genome selection model includes:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
Further, the tobacco yield prediction data includes: natural leaf number phenotype data, maximum waist leaf length phenotype data, and maximum waist leaf width phenotype data.
Further, the calculation formula of the natural leaf number phenotype value data is as follows:
the calculation formula of the maximum waist leaf length phenotype value data is as follows:
the calculation formula of the maximum waist leaf width phenotype value data is as follows:
wherein BayesC_LN is the whole genome selection model for the natural leaf number phenotype value, LN is the natural leaf number, and BayesC is the candidate prediction model; BayesB_LL is the whole genome selection model for the maximum waist leaf length phenotype value, LL is the maximum waist leaf length, and BayesB is the candidate prediction model; BayesC_WL is the whole genome selection model for the maximum waist leaf width phenotype value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the ratio of the training population to the test population, and n4 is the model prediction accuracy value.
To achieve the above purpose, the invention also provides an application of the tobacco yield prediction method based on whole genome selection: the method is applied to analyzing seedling-stage genotype data of a tobacco population and predicting, across the whole genome, the mature-stage yield trait phenotype values of each plant in the population;
the method is also applied to analyzing genotype data of a tobacco population or tobacco variety and predicting mature-stage yield trait phenotype values across the whole genome, so that the mature-stage yield trait phenotype values are obtained while the tobacco is at the seedling stage.
In order to achieve the above object, the present invention further provides a tobacco yield prediction system based on whole genome selection, the system specifically comprising:
the acquisition unit is used for acquiring the whole genome data of the tobacco leaves in the candidate prediction model;
the screening unit is used for screening and optimizing the whole genome data of the tobacco leaves in real time;
the generation unit is used for generating tobacco yield prediction data;
the acquisition unit further comprises:
the setting module is used for setting core parameters of the candidate prediction model;
the first modeling module is used for establishing a candidate prediction model;
the first screening module is used for primarily screening the tobacco leaf whole genome data by combining the core parameters through the candidate prediction model;
the screening unit further comprises:
the second modeling module is used for establishing a whole genome selection model;
the verification module is used for verifying core parameters of the tobacco leaf candidate prediction model;
and the second screening module is used for secondarily screening the tobacco leaf whole genome data in real time by combining the core parameters through the whole genome selection model.
In order to achieve the above object, the present invention also provides a tobacco yield prediction platform based on whole genome selection, comprising:
a processor, a memory, and a tobacco yield prediction platform control program based on whole genome selection;
wherein the control program is stored in the memory and, when executed by the processor, implements the steps of the tobacco yield prediction method based on whole genome selection.
In order to achieve the above object, the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a tobacco yield prediction platform control program based on whole genome selection, and the tobacco yield prediction platform control program based on whole genome selection implements the tobacco yield prediction method based on whole genome selection.
To achieve the above object, the present invention also provides a chip system comprising at least one processor, wherein program instructions, when executed in the at least one processor, cause the chip system to perform the steps of the tobacco yield prediction method based on whole genome selection.
Compared with the prior art, the invention has the following beneficial effects:
Through the tobacco yield prediction method, application, system, platform and storage medium based on whole genome selection, the invention can calculate or simulate mature-stage tobacco yield trait phenotype values from genotype data of the tobacco seedling (early) stage. It is convenient to operate, fast, efficient and scientific, and its prediction results are accurate and reliable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a tobacco yield prediction method based on whole genome selection;
FIG. 2 is a schematic diagram of the whole genome selection model BayesC_LN established by the method of the present invention for the natural leaf number (LN) phenotype value among mature-stage tobacco yield traits;
panel (a) of FIG. 2 shows the influence of the number of molecular markers (n1) on the LN phenotype value prediction accuracy of the BayesC_LN model;
panel (b) of FIG. 2 shows the influence of the training population size (n2) on the LN phenotype value prediction accuracy of the BayesC_LN model;
panel (c) of FIG. 2 shows the influence of the training population to test population ratio (n3) on the LN phenotype value prediction accuracy of the BayesC_LN model;
panel (d) of FIG. 2 shows the influence of the different candidate prediction models on the prediction accuracy value (n4) for the BayesC_LN model;
FIG. 3 is a schematic diagram of the whole genome selection model BayesB_LL established by the method of the present invention for the maximum waist leaf length (LL) phenotype value among tobacco yield traits;
panel (a) of FIG. 3 shows the influence of the number of molecular markers (n1) on the LL phenotype value prediction accuracy of the BayesB_LL model;
panel (b) of FIG. 3 shows the influence of the training population size (n2) on the LL phenotype value prediction accuracy of the BayesB_LL model;
panel (c) of FIG. 3 shows the influence of the training population to test population ratio (n3) on the LL phenotype value prediction accuracy of the BayesB_LL model;
panel (d) of FIG. 3 shows the influence of the different candidate prediction models on the prediction accuracy value (n4) for the BayesB_LL model;
FIG. 4 is a schematic diagram of the whole genome selection model BayesC_WL established by the method of the present invention for the maximum waist leaf width (WL) phenotype value among tobacco yield traits;
panel (a) of FIG. 4 shows the influence of the number of molecular markers (n1) on the WL phenotype value prediction accuracy of the BayesC_WL model;
panel (b) of FIG. 4 shows the influence of the training population size (n2) on the WL phenotype value prediction accuracy of the BayesC_WL model;
panel (c) of FIG. 4 shows the influence of the training population to test population ratio (n3) on the WL phenotype value prediction accuracy of the BayesC_WL model;
panel (d) of FIG. 4 shows the influence of the different candidate prediction models on the prediction accuracy value (n4) for the BayesC_WL model;
FIG. 5 is a schematic diagram of the architecture of the tobacco yield prediction system based on whole genome selection according to the present invention;
FIG. 6 is a schematic diagram of the architecture of the tobacco yield prediction platform based on whole genome selection;
FIG. 7 is a schematic diagram of a computer-readable storage medium architecture according to an embodiment of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
For a better understanding of the present invention, its objects, technical solutions and advantages, the invention is further described below with reference to the drawings and the detailed description; further advantages and effects will be readily apparent to those skilled in the art from the present disclosure.
The invention may also be practiced or carried out in other embodiments, and its details may vary within the scope and range of equivalents of its features and advantages.
It should be noted that, if directional indications (such as up, down, left, right, front and rear) are included in the embodiments of the present invention, they are used only to explain the relative positional relationship, movement and the like between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. Secondly, the technical solutions of the embodiments may be combined with each other, but it is necessary to be based on the fact that those skilled in the art can realize the technical solutions, and when the technical solutions are contradictory or cannot be realized, the technical solutions are considered to be absent and are not within the scope of protection claimed in the present invention.
Preferably, the tobacco yield prediction method based on whole genome selection is applied to one or more terminals or servers. A terminal is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, etc.
The terminal can be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The terminal can perform man-machine interaction with a client through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The invention provides a tobacco leaf yield prediction method, system, platform and storage medium based on whole genome selection.
As shown in FIG. 1, a flowchart of a method for predicting tobacco yield based on whole genome selection is provided in an embodiment of the present invention.
In this embodiment, the method for predicting tobacco leaf yield based on whole genome selection may be applied to a terminal or a fixed terminal with a display function, where the terminal is not limited to a personal computer, a smart phone, a tablet computer, a desktop computer or an all-in-one machine with a camera, etc.
The whole genome-based tobacco yield prediction method can also be applied to a hardware environment formed by a terminal and a server connected with the terminal through a network. Networks include, but are not limited to: a wide area network, a metropolitan area network, or a local area network. The tobacco leaf yield prediction method based on whole genome selection in the embodiment of the invention can be executed by a server, a terminal or both.
For example, for a terminal that needs to perform tobacco yield prediction based on whole genome selection, the prediction function provided by the method of the present invention may be integrated directly on the terminal, or a client implementing the method may be installed. As another example, the method may run on a server or other device in the form of a software development kit (SDK): an interface to the prediction function is provided through the SDK, and a terminal or other device can perform tobacco yield prediction based on whole genome selection through that interface.
The invention is further elucidated below in connection with the accompanying drawings.
As shown in FIG. 1, the invention provides a tobacco leaf yield prediction method based on whole genome selection, which comprises the following steps:
S1, acquiring tobacco leaf whole genome data in a candidate prediction model;
S2, screening and optimizing the tobacco leaf whole genome data in real time;
S3, generating tobacco yield prediction data.
In the scheme of the invention, tobacco yield prediction data is generated by acquiring the tobacco leaf whole genome data in the candidate prediction model and then screening and optimizing it in real time. The method can calculate or simulate mature-stage tobacco yield trait phenotype values from genotype data of the tobacco seedling (early) stage, and is convenient, fast, efficient and scientific, with accurate and reliable results.
That is, starting from the candidate prediction models BayesA, BayesB and BayesC obtained by preliminary screening, the prediction results for mature-stage tobacco yield, namely the natural leaf number (LN), the maximum waist leaf length (LL) and the maximum waist leaf width (WL), are continuously screened, optimized and verified through experiments to obtain the tobacco yield prediction data. The same process also yields the core parameter values: the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4).
Specifically, in the step of obtaining the tobacco leaf whole genome data in the candidate prediction model, the method further comprises the following steps:
S11, setting core parameters of the candidate prediction models;
S12, establishing the candidate prediction models;
S13, preliminarily screening the tobacco leaf whole genome data using the candidate prediction models in combination with the core parameters.
In the embodiment of the invention, the core parameters of the candidate prediction models are explicitly set, the candidate prediction models are established, and the tobacco leaf whole genome data is preliminarily screened using the candidate prediction models in combination with the core parameters.
That is, in order to optimize the prediction accuracy of tobacco yield, the core parameter values of the candidate prediction models BayesA, BayesB, BayesC and rrBLUP are explicitly specified: the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4).
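Of the candidate models named above, rrBLUP is the simplest to write down: it is ridge regression on the marker genotypes, shrinking all marker effects by the same penalty. The following is a minimal sketch under stated assumptions, not the patent's implementation. The function names (`solve_linear`, `fit_rrblup`, `predict`), the toy genotype matrix and the penalty `lam` are illustrative assumptions. BayesA, BayesB and BayesC use different priors on marker effects and are usually fitted by sampling, which is not shown here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_rrblup(genotypes, phenotypes, lam):
    """Ridge regression of phenotype on centred marker codes.

    genotypes: one row per plant, one numeric code per marker.
    lam: ridge penalty (assumed here; in practice derived from variance components).
    Returns (phenotype mean, marker column means, marker effects).
    """
    n, p = len(genotypes), len(genotypes[0])
    mu = sum(phenotypes) / n
    y = [v - mu for v in phenotypes]
    col_means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    z = [[row[j] - col_means[j] for j in range(p)] for row in genotypes]
    # Normal equations of ridge regression: (Z'Z + lam * I) u = Z'y
    ztz = [[sum(z[i][a] * z[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    zty = [sum(z[i][a] * y[i] for i in range(n)) for a in range(p)]
    return mu, col_means, solve_linear(ztz, zty)


def predict(model, genotypes):
    """Predicted phenotype values for new genotype rows."""
    mu, col_means, effects = model
    return [mu + sum((g - c) * e for g, c, e in zip(row, col_means, effects))
            for row in genotypes]


# Toy data (made up): 4 plants x 2 markers, phenotype = 10 + 1*m1 + 2*m2.
G = [[0, 2], [2, 0], [2, 2], [0, 0]]
y = [14.0, 12.0, 16.0, 10.0]
model = fit_rrblup(G, y, lam=4.0)
print(model[2])                  # marker effects, shrunk towards zero
print(predict(model, [[2, 2]]))  # prediction for one genotype row
```

With the penalty set to 4.0 the estimated effects are half of the true effects used to generate the toy data, which illustrates the uniform shrinkage that distinguishes rrBLUP from the Bayesian candidates.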
Specifically, in the step of screening and optimizing the whole genome data of the tobacco leaves in real time, the method further comprises the following steps:
S21, establishing whole genome selection models;
S22, verifying the core parameters of the tobacco candidate prediction models;
S23, secondarily screening the tobacco leaf whole genome data in real time using the whole genome selection models in combination with the core parameters.
In the embodiment of the invention, whole genome selection models are established: three models, BayesC_LN, BayesB_LL and BayesC_WL, are built for the natural leaf number (LN), the maximum waist leaf length (LL) and the maximum waist leaf width (WL), respectively, which are the traits used to predict tobacco yield. The core parameters of the tobacco candidate prediction models are then verified, and the whole genome selection models are combined with the core parameters to secondarily screen the tobacco leaf whole genome data in real time. The resulting mature-stage tobacco yield predictions are continuously screened, optimized and verified to form the final tobacco yield prediction data, namely the tobacco yield trait phenotype values.
Specifically, the core parameters include: the number of molecular markers, the size of training population, the proportion of training population to testing population and the model prediction precision value.
That is, while the mature-stage yield prediction results are continuously screened, optimized and verified, the method obtains the core parameter values: the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4).
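The ratio n3 and the accuracy value n4 can be made concrete with a short sketch. It assumes that prediction accuracy is measured as the Pearson correlation between predicted and observed phenotype values in the test population, a common convention in genomic selection; the text here does not define n4 explicitly. The helper names `split_by_ratio` and `pearson` are hypothetical.

```python
import math
import random


def split_by_ratio(ids, train_parts, test_parts, seed=0):
    """Shuffle ids and split them so that train:test = train_parts:test_parts."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * train_parts // (train_parts + test_parts)
    return ids[:n_train], ids[n_train:]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# A population of 300 individuals split 5:1 gives 250 training and 50 test individuals.
train, test = split_by_ratio(range(300), 5, 1)
print(len(train), len(test))  # prints "250 50"
```

In a full evaluation, a model would be fitted on the training individuals and `pearson` applied to the predicted and observed values of the test individuals, repeated over several random splits.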
Specifically, the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
That is, in the scheme of the invention, the whole genome selection models established for the 3 mature-stage tobacco yield trait phenotype values, natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL), are the natural leaf number whole genome selection model BayesC_LN, the maximum waist leaf length whole genome selection model BayesB_LL and the maximum waist leaf width whole genome selection model BayesC_WL.
Preferably, the tobacco yield prediction data includes: natural leaf number phenotype data, maximum waist leaf length phenotype data, and maximum waist leaf width phenotype data.
The calculation formula of the natural leaf number phenotype value data is as follows:
the calculation formula of the maximum waist leaf length phenotype value data is as follows:
the calculation formula of the maximum waist leaf width phenotype value data is as follows:
wherein BayesC_LN is the whole genome selection model for the natural leaf number phenotype value, LN is the natural leaf number, and BayesC is the candidate prediction model; BayesB_LL is the whole genome selection model for the maximum waist leaf length phenotype value, LL is the maximum waist leaf length, and BayesB is the candidate prediction model; BayesC_WL is the whole genome selection model for the maximum waist leaf width phenotype value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the ratio of the training population to the test population, and n4 is the model prediction accuracy value.
That is, the whole genome selection models for predicting tobacco yield are BayesC_LN, BayesB_LL and BayesC_WL, respectively, and the core parameter values of the models are:
a) Whole genome selection model BayesC_LN for the natural leaf number (LN) phenotype value: the number of molecular markers is n1 = 2000 markers; the training population size is n2 = 250 individuals; the training population to test population ratio is n3 = 5:1 (number of individuals in the training population : number of individuals in the test population = 5:1); the model prediction accuracy value is n4 = 0.72.
b) Whole genome selection model BayesB_LL for the maximum waist leaf length (LL) phenotype value: the number of molecular markers is n1 = 16000 markers; the training population size is 250 individuals or more, i.e., n2 ≥ 250 individuals; the training population to test population ratio is n3 = 6:1; the model prediction accuracy value is n4 = 0.40.
c) Whole genome selection model BayesC_WL for the maximum waist leaf width (WL) phenotype value: the number of molecular markers is n1 = 1000 markers; the training population size is 250 individuals or more, i.e., n2 ≥ 250 individuals; the training population to test population ratio is n3 = 6:1; the model prediction accuracy value is n4 = 0.32.
In other words, the whole genome selection models of the method for predicting mature-stage tobacco leaf yield are BayesC_LN, BayesB_LL and BayesC_WL. Their core parameters, namely the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4), take specific values obtained through experimental screening, optimization and verification on the basis of the 3 Bayesian candidate prediction models (BayesA, BayesB and BayesC) obtained by preliminary screening.
The 4 core parameter values of the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL are respectively:
Number of molecular markers (n1): a) for natural leaf number (LN), n1 = 2000 markers; b) for maximum waist leaf length (LL), n1 = 16000 markers; c) for maximum waist leaf width (WL), n1 = 1000 markers.
Training population size (n2): a) for natural leaf number (LN), n2 = 250 individual plants; b) for maximum waist leaf length (LL), n2 ≥ 250 individual plants; c) for maximum waist leaf width (WL), n2 ≥ 250 individual plants.
Training population to test population ratio (n3): a) for natural leaf number (LN), n3 = 5:1; b) for maximum waist leaf length (LL), n3 = 6:1; c) for maximum waist leaf width (WL), n3 = 6:1.
Model prediction accuracy value (n4): a) for natural leaf number (LN), n4 = 0.72; b) for maximum waist leaf length (LL), n4 = 0.40; c) for maximum waist leaf width (WL), n4 = 0.32.
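The core parameter values listed above can be collected in a small lookup table. The following is a minimal sketch in base R; the object and function names (core_params, params_for, train_fraction) are hypothetical and are not part of the patent, and n2 is stored as the smallest admissible training population size.

```r
# Core parameters of the three trait-specific models, transcribed from the text.
# n2 is stored as the minimum training population size; n3 is stored as
# separate training and test parts (e.g. 5:1 becomes 5 and 1).
core_params <- data.frame(
  trait          = c("LN", "LL", "WL"),
  base_model     = c("BayesC", "BayesB", "BayesC"),
  n1_markers     = c(2000, 16000, 1000),
  n2_min_train   = c(250, 250, 250),
  n3_train_parts = c(5, 6, 6),
  n3_test_parts  = c(1, 1, 1),
  n4_accuracy    = c(0.72, 0.40, 0.32),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the single row of settings for one trait.
params_for <- function(trait) {
  row <- core_params[core_params$trait == trait, ]
  if (nrow(row) != 1) stop("unknown trait: ", trait)
  row
}

# Fraction of the population assigned to training under the n3 ratio.
train_fraction <- function(trait) {
  p <- params_for(trait)
  p$n3_train_parts / (p$n3_train_parts + p$n3_test_parts)
}
```

For example, params_for("LL")$n1_markers gives 16000, and train_fraction("LN") gives 5/6.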
The invention is further illustrated by the following examples:
Example:
Construction and application of the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting mature-stage tobacco leaf yield
1. Experimental materials
A recombinant inbred line population (RILs, F7) comprising 300 lines was constructed from the flue-cured tobacco varieties Y3 and K326; in addition, to build a prediction model that applies to tobacco in general, a natural tobacco population distinct from this parent-derived population was assembled, consisting of 347 different tobacco varieties (lines).
2. Obtaining phenotypic data of 3 tobacco yield traits of two different types of tobacco populations
The test materials were transplanted to the field once the seedlings were established, and measurement and recording of the 3 tobacco yield traits (phenotype values) began after the field plants matured. The resulting phenotype data show that the 3 yield traits follow a continuous normal or near-normal distribution in all populations, which is typical of quantitative traits controlled jointly by minor-effect polygenes and environmental factors (i.e., genetic and non-genetic factors).
3. Molecular marker analysis (taking SNP markers as an example)
Extraction of tobacco genomic DNA: the conventional CTAB method or a plant tissue DNA extraction kit can be used, following the existing literature or the kit instructions. The extracted tobacco DNA must then be purified to remove RNA, protein and other organic impurities so that it meets the requirements of SNP chip genotyping; if SNP markers are instead mined by genome re-sequencing of the tobacco samples, the DNA quality must meet the requirements of the sequencing company.
4. Construction and application of the whole genome selection model Bayes C_LN (taking natural leaf number (LN) as an example)
4.1 preliminary screening of candidate predictive models
The basic parameters (functions) of the 4 original models rrBLUP, BayesA, BayesB and BayesC provided in R packages were optimized using the SNP genotype data of each tobacco line and the phenotype values of the mature-stage yield trait natural leaf number (LN). This was done separately for the tobacco recombinant inbred line population (RILs, 300 lines), the tobacco natural population (347 different tobacco varieties) and the integrated tobacco population formed by mixing the two (647 lines). The basic parameters (functions) of each candidate model were thereby obtained.
A) Basic functions of the rrBLUP original model:
mixed.solve: models the marker effects as random effects, or uses the genotypic values of the lines together with the A.mat function (which computes the additive relationship matrix) to predict breeding values;
kinship.BLUP: genotypic value prediction, including epistatic effects;
GWA: performs association mapping;
Specific parameter settings:
Imputing and processing genotype data: A.mat (additive relationship matrix)
impute <- A.mat(markers, max.missing=0.5, impute.method="EM", n.core=4, return.imputed=TRUE)
markers: genotype data;
max.missing: maximum missing rate allowed before imputation; an SNP site whose missing rate across the samples exceeds 0.5 is removed;
impute.method: imputation method;
n.core: number of threads;
return.imputed: if TRUE, the imputed data are returned.
Training set and test set (the split can be set as required; ratios of 9:1, 8:2 or 6:4 are typical)
train <- as.matrix(sample(1:271, 217))
test <- setdiff(1:271, train)
Unless otherwise specified, the training and test sets are split at the default ratio of 8:2.
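The split above hard-codes 217 of 271 lines. A minimal sketch of the same split driven by a ratio is given below in base R; the helper name split_by_ratio and its arguments are hypothetical, and rounding the training size to the nearest integer is an assumption.

```r
# Hypothetical helper: split indices 1..n into training and test sets
# according to a train:test ratio such as 8:2 or 5:1.
split_by_ratio <- function(n, train_parts, test_parts, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Training size is rounded to the nearest integer (an assumption).
  n_train <- round(n * train_parts / (train_parts + test_parts))
  train <- sort(sample(seq_len(n), n_train))
  test <- setdiff(seq_len(n), train)
  list(train = train, test = test)
}

# The 8:2 split of 271 lines used in the text: 217 training, 54 test lines.
idx <- split_by_ratio(271, 8, 2, seed = 42)
n_train <- length(idx$train)
n_test <- length(idx$test)
```

The same helper covers the other ratios examined in the text, for example split_by_ratio(647, 5, 1) for a 5:1 split of the integrated population.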
Model training: mixed.solve (mixed-model solver)
height_answer <- mixed.solve(y=Height, Z=m_train, K=NULL, SE=FALSE, method="REML", return.Hinv=FALSE)
y: training set phenotype data
Z: training set genotype data
K: covariance matrix
SE: whether standard errors are calculated
method: model fitting method
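How the fitted mixed.solve object is turned into predictions and a prediction accuracy is not spelled out in the text. The sketch below runs on simulated data and assumes the rrBLUP package is installed; taking the prediction as the intercept plus the test genotypes times the marker effects, and measuring accuracy as the Pearson correlation between predicted and observed values, are assumptions. All data and variable names are illustrative.

```r
library(rrBLUP)

set.seed(1)
n <- 120   # simulated lines (illustrative, not the patent's populations)
m <- 200   # simulated markers
# Marker matrix coded -1/0/1, the coding rrBLUP expects.
markers <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
true_effects <- rnorm(m, sd = 0.2)
pheno <- as.vector(20 + markers %*% true_effects + rnorm(n))

# Training and test split.
train <- sample(seq_len(n), 100)
test <- setdiff(seq_len(n), train)

# Fit marker effects as random effects on the training set.
fit <- mixed.solve(y = pheno[train], Z = markers[train, ], method = "REML")

# Predict test phenotypes: intercept plus genotypes times marker effects.
pred <- as.vector(markers[test, ] %*% fit$u) + as.vector(fit$beta)

# Prediction accuracy taken as the Pearson correlation (an assumption).
accuracy <- cor(pred, pheno[test])
```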
B) Basic functions of the Bayes original models:
The data input is the same as for the rrBLUP original model.
R package: BGLR (Bayesian Generalized Linear Regression)
ETA <- list(list(X=markers, model='BayesB', probIn=0.05))
ETA: two-level list specifying the regression function (linear predictor)
X: genotype file
model: model selection
Model training:
system.time(fit_BB <- BGLR(y=Height, ETA=ETA, nIter=10000, burnIn=5000, thin=5, saveAt="", df0=5, S0=NULL, weights=NULL, R2=0.5))
y: phenotype file
ETA: two-level list specifying the regression function (linear predictor)
nIter, burnIn, thin: (integer) the number of iterations, the burn-in and the thinning interval.
saveAt: path and prefix for the files saved while the program runs
R2: the proportion of variance expected, a priori, to be explained by the regression.
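The following sketch shows one common way to obtain test-set predictions from BGLR: the test phenotypes are set to NA before fitting, and the fitted values returned for those lines are used as predictions. It assumes the BGLR package is installed and uses simulated data; the model choice (BayesC), the short chain length and the Pearson correlation as the accuracy measure are illustrative assumptions rather than the settings of the patent.

```r
library(BGLR)

set.seed(1)
n <- 120   # simulated lines (illustrative)
m <- 200   # simulated markers
markers <- matrix(sample(0:2, n * m, replace = TRUE), n, m)
true_effects <- rnorm(m, sd = 0.2)
pheno <- as.vector(20 + markers %*% true_effects + rnorm(n))

# Mask the test phenotypes; BGLR predicts the missing values during fitting.
test <- sample(seq_len(n), 20)
y_masked <- pheno
y_masked[test] <- NA

ETA <- list(list(X = markers, model = "BayesC"))

# A short chain keeps the sketch fast; real analyses need longer chains.
fit <- BGLR(y = y_masked, ETA = ETA, nIter = 3000, burnIn = 1000, thin = 5,
            saveAt = file.path(tempdir(), "bayesC_"), verbose = FALSE)

# Fitted values for the masked lines serve as predictions.
pred <- fit$yHat[test]
accuracy <- cor(pred, pheno[test])
```

Writing the sample files to tempdir() through saveAt keeps the working directory clean.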
Preferably, the Bayes C_LN model for natural leaf number (LN) among the tobacco yield traits is constructed and applied as follows.
Using the original prediction models with determined basic parameters (i.e., the candidate prediction models), the whole genome selection model for predicting natural leaf number (LN) phenotype values was screened, optimized and verified on the recombinant inbred line population (300 lines), the natural population (347 lines) and the integrated tobacco population (647 lines) produced by mixing the two. The specific method is as follows:
First, the integrated tobacco population of 647 individual plants (lines) was genotyped with 50000 high-quality SNP markers uniformly distributed across the whole tobacco genome to obtain genotype data.
Second, the natural leaf number (LN) phenotype value of each individual plant (line) of the integrated tobacco population was measured and recorded at the mature stage, after curing, to obtain LN phenotype data.
Third, the genotype data and LN phenotype data of each individual plant (line) in the integrated population were modelled with the 4 candidate prediction models of 2 types with determined basic parameters (rrBLUP, BayesA, BayesB and BayesC). The specific values of the 4 core parameters, namely the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4), were screened, optimized and verified, and the whole genome selection model Bayes C_LN for predicting natural leaf number (LN) phenotype values was finally constructed. Specifically:
Number of molecular markers (n1): the 50000 SNP markers were divided into 10 gradients (1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000); the results are shown in FIG. 2, panel (a).
Training population size (n2): subsets of 50, 100, 150, 200, 250 and 300 individual plants (lines) of the integrated population were used as training populations, and the recombinant inbred line population (300 lines), the natural population (347 lines) and the integrated population (647 lines) were also each used as a training population; the results are shown in FIG. 2, panel (b).
Training population to test population ratio (n3): the 647 lines of the integrated population were split at 11 gradients of (number of tobacco plants in the training population):(number of tobacco plants in the test population), namely 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1; the results are shown in FIG. 2, panel (c).
Model prediction accuracy value (n4): the Bayes C_LN model with the specific core parameters (n1, n2 and n3) was used to predict LN phenotype values; the predicted values were compared with the respective true LN values to obtain the prediction accuracy, and the highest prediction accuracy value (n4) was determined.
Finally, after experimental verification, optimization and practical application, the whole genome selection model Bayes C_LN for predicting mature-stage natural leaf number (LN) phenotype values among the tobacco yield traits was constructed. The formula is as follows:
wherein the symbol in the formula denotes that the 4 core parameters n1, n2, n3 and n4 are substituted in turn into the Bayes C candidate prediction model.
Similarly, the whole genome selection model for natural leaf number (LN) is established:
Specifically, FIG. 2 shows the whole genome selection model Bayes C_LN established for natural leaf number (LN) phenotype values among the mature-stage tobacco yield traits, using the genotype data and measured phenotype data of the integrated tobacco population (the mixture of the recombinant inbred line population and the natural population) in combination with bioinformatics.
In FIG. 2, panel (a) shows the effect of the number of molecular markers (n1) on the LN phenotype prediction accuracy of the Bayes C_LN model: the abscissa is the number of molecular markers; the ordinate is the accuracy of the Bayes C_LN model in predicting LN phenotype values; specifically, 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers used to genotype the integrated tobacco population, respectively.
Panel (b) of FIG. 2 shows the effect of the training population size (n2) on the LN phenotype prediction accuracy of the Bayes C_LN model: the abscissa is the training population size (the number of tobacco plants in the training population); the ordinate is the accuracy of the Bayes C_LN model in predicting LN phenotype values.
Panel (c) of FIG. 2 shows the effect of the training population to test population ratio (n3) on the LN phenotype prediction accuracy of the Bayes C_LN model: the abscissa is the ratio of training population to test population (i.e., the number of tobacco plants in the training population : the number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the accuracy of the Bayes C_LN model in predicting LN phenotype values.
Panel (d) of FIG. 2 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa lists the 4 original models provided in R packages (BayesA, BayesB, BayesC and rrBLUP); the ordinate is the accuracy of each original model in predicting LN phenotype values of the tobacco population under test.
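The marker-number screening summarized in panel (a) can be sketched as a loop over marker counts that records the prediction accuracy at each gradient. The sketch below is base R on simulated data. The predictor is a fixed-penalty ridge regression used only as a stand-in for the Bayes models; drawing the marker subsets at random, the Pearson correlation as the accuracy measure, and all function names (ridge_predict, screen_marker_counts) are assumptions, not the procedure of the patent.

```r
# Stand-in predictor: ridge regression with a fixed penalty.
# This is NOT one of the patent's Bayes models; it only makes the loop runnable.
ridge_predict <- function(X_train, y_train, X_test, lambda = 10) {
  mu <- mean(y_train)
  # Dual form: the linear system is n_train x n_train, which suits the
  # usual case of many more markers than plants.
  K <- X_train %*% t(X_train)
  alpha <- solve(K + lambda * diag(nrow(X_train)), y_train - mu)
  as.vector(mu + X_test %*% (t(X_train) %*% alpha))
}

# Accuracy for each marker count, using a random subset of markers per count.
screen_marker_counts <- function(X, y, counts, train, test,
                                 predict_fun = ridge_predict) {
  sapply(counts, function(k) {
    cols <- sample(ncol(X), min(k, ncol(X)))
    pred <- predict_fun(X[train, cols, drop = FALSE], y[train],
                        X[test, cols, drop = FALSE])
    cor(pred, y[test])
  })
}

set.seed(7)
n <- 150   # simulated plants (illustrative)
m <- 500   # simulated markers
X <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
y <- as.vector(X %*% rnorm(m, sd = 0.2) + rnorm(n))
train <- sample(seq_len(n), 125)
test <- setdiff(seq_len(n), train)

counts <- c(50, 100, 200, 500)   # toy gradient, far smaller than 1000 to 50000
acc <- screen_marker_counts(X, y, counts, train, test)
best_n1 <- counts[which.max(acc)]
```

The training population size (n2) and the training to test ratio (n3) could be screened with the same pattern by looping over candidate sizes or ratios instead of marker counts.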
Similarly, whole genome selection models are established for the phenotype values of the other 2 mature-stage tobacco yield traits (LL and WL); the specific models are as follows:
For LL phenotype values, the whole genome selection model is:
Specifically, the whole genome selection model Bayes B_LL for maximum waist leaf length (LL) phenotype values among the tobacco yield traits is shown in FIG. 3, in which:
Panel (a) of FIG. 3 shows the effect of the number of molecular markers (n1) on the LL phenotype prediction accuracy of the Bayes B_LL model: the abscissa is the number of molecular markers; the ordinate is the accuracy of the Bayes B_LL model in predicting LL phenotype values; specifically, 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers used to genotype the integrated tobacco population, respectively.
Panel (b) of FIG. 3 shows the effect of the training population size (n2) on the LL phenotype prediction accuracy of the Bayes B_LL model: the abscissa is the training population size (the number of tobacco plants in the training population); the ordinate is the accuracy of the Bayes B_LL model in predicting LL phenotype values.
Panel (c) of FIG. 3 shows the effect of the training population to test population ratio (n3) on the LL phenotype prediction accuracy of the Bayes B_LL model: the abscissa is the ratio of training population to test population (i.e., the number of tobacco plants in the training population : the number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the accuracy of the Bayes B_LL model in predicting LL phenotype values.
Panel (d) of FIG. 3 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa lists the 4 original models provided in R packages (BayesA, BayesB, BayesC and rrBLUP); the ordinate is the accuracy of each original model in predicting LL phenotype values of the tobacco population under test.
For WL phenotype values, the whole genome selection model is:
Specifically, the whole genome selection model Bayes C_WL for maximum waist leaf width (WL) phenotype values among the tobacco yield traits is shown in FIG. 4, in which:
Panel (a) of FIG. 4 shows the effect of the number of molecular markers (n1) on the WL phenotype prediction accuracy of the Bayes C_WL model: the abscissa is the number of molecular markers; the ordinate is the accuracy of the Bayes C_WL model in predicting WL phenotype values; specifically, 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers used to genotype the integrated tobacco population, respectively.
Panel (b) of FIG. 4 shows the effect of the training population size (n2) on the WL phenotype prediction accuracy of the Bayes C_WL model: the abscissa is the training population size (the number of tobacco plants in the training population); the ordinate is the accuracy of the Bayes C_WL model in predicting WL phenotype values.
Panel (c) of FIG. 4 shows the effect of the training population to test population ratio (n3) on the WL phenotype prediction accuracy of the Bayes C_WL model: the abscissa is the ratio of training population to test population (i.e., the number of tobacco plants in the training population : the number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the accuracy of the Bayes C_WL model in predicting WL phenotype values.
Panel (d) of FIG. 4 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa lists the 4 original models provided in R packages (BayesA, BayesB, BayesC and rrBLUP); the ordinate is the accuracy of each original model in predicting WL phenotype values of the tobacco population under test.
Preferably, in formulas (1), (2) and (3): Bayes C_LN is the whole genome selection model for natural leaf number phenotype values, LN is the natural leaf number, and Bayes C is the candidate prediction model; Bayes B_LL is the whole genome selection model for maximum waist leaf length phenotype values, LL is the maximum waist leaf length, and Bayes B is the candidate prediction model; Bayes C_WL is the whole genome selection model for maximum waist leaf width phenotype values, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the training population to test population ratio, and n4 is the model prediction accuracy value.
In order to achieve the above purpose, the invention also provides an application of the whole genome selection-based tobacco yield prediction method: analyzing genotype data from the seedling stage of a tobacco population and predicting, within the whole genome range, the phenotype values of the mature-stage tobacco yield traits of each plant in the population;
The method is applied to analyzing genotype data of tobacco populations or tobacco varieties and predicting, within the whole genome range, the phenotype values of the mature-stage yield traits, so that mature-stage yield trait phenotype data are obtained while the tobacco is still at the seedling stage.
That is, the application of the whole genome selection method for predicting tobacco yield uses the 3 whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL to analyze seedling-stage genotype data of a tobacco population and accurately predict, within the whole genome range, the phenotype values of the 3 mature-stage tobacco yield traits of each plant in the population.
Preferably, genotype data of a tobacco population or tobacco variety (line) are analyzed with the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL respectively, and the phenotype values of the 3 mature-stage yield traits are accurately predicted within the whole genome range, so that accurate phenotype values of the 3 mature-stage tobacco yield traits can be obtained at the seedling (early) stage.
The 3 whole genome selection models provided by the invention are applied to analyzing the genotype data of each plant in a tobacco population or variety (line) at the seedling (early) stage, so as to predict the phenotype values of the 3 mature-stage tobacco yield traits.
In the application of the whole genome selection models for predicting mature-stage tobacco leaf yield, the Bayes C_LN, Bayes B_LL and Bayes C_WL models are used to analyze the genotype data of the tobacco population or variety under test at the early seedling stage, so as to obtain the 3 tobacco yield traits after maturity: natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL).
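The application described above, predicting the three mature-stage traits from seedling-stage genotypes, can be sketched as follows in base R on simulated data. A fixed-penalty ridge regression stands in for the trained Bayes C_LN, Bayes B_LL and Bayes C_WL models; the function names, the use of the first n1 marker columns per trait, and the scaled-down marker counts are all illustrative assumptions.

```r
# Stand-in predictor (ridge regression, fixed penalty); NOT the patent's models.
ridge_predict <- function(X_train, y_train, X_test, lambda = 10) {
  mu <- mean(y_train)
  K <- X_train %*% t(X_train)
  alpha <- solve(K + lambda * diag(nrow(X_train)), y_train - mu)
  as.vector(mu + X_test %*% (t(X_train) %*% alpha))
}

# Hypothetical wrapper: one prediction per trait, each trait using its own
# number of markers (n1). Markers are assumed to be ordered so that the
# first n1 columns are the ones selected for that trait.
predict_yield_traits <- function(X_train, Y_train, X_new, n1,
                                 predict_fun = ridge_predict) {
  out <- lapply(names(n1), function(trait) {
    cols <- seq_len(min(n1[[trait]], ncol(X_train)))
    predict_fun(X_train[, cols, drop = FALSE], Y_train[, trait],
                X_new[, cols, drop = FALSE])
  })
  names(out) <- names(n1)
  as.data.frame(out)
}

set.seed(3)
n <- 100   # simulated training plants with known phenotypes
m <- 300   # simulated markers
X <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
Y <- cbind(
  LN = as.vector(20 + X %*% rnorm(m, sd = 0.1) + rnorm(n)),
  LL = as.vector(70 + X %*% rnorm(m, sd = 0.1) + rnorm(n)),
  WL = as.vector(30 + X %*% rnorm(m, sd = 0.1) + rnorm(n))
)

# Genotypes of 5 new seedlings whose mature-stage traits are to be predicted.
seedlings <- matrix(sample(c(-1, 0, 1), 5 * m, replace = TRUE), 5, m)

# Toy marker counts per trait; the text uses 2000, 16000 and 1000.
n1 <- c(LN = 200, LL = 300, WL = 100)
pred <- predict_yield_traits(X, Y, seedlings, n1)
```

Each row of pred corresponds to one seedling and each column to one predicted trait.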
In summary, the first object of the invention is to provide the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting mature-stage tobacco leaf yield; the second object is to apply these models to analyzing early (seedling-stage) genotype data of a tobacco population or variety (line) so as to accurately predict the 3 yield trait phenotype values of each tobacco plant in the population.
The first object of the invention is achieved as follows: the whole genome selection models for predicting mature-stage tobacco leaf yield are Bayes C_LN, Bayes B_LL and Bayes C_WL, and their respective core parameter values n1, n2, n3 and n4 are explicitly specified.
The second object of the invention is achieved as follows: the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL are applied to analyzing early (seedling-stage) genotype data of each plant in a tobacco population or variety (line) so as to accurately predict the 3 yield trait phenotype values of each tobacco plant in the population.
In order to select tobacco varieties with high mature-stage yield scientifically, efficiently and accurately, and to select higher-yielding progeny material in a targeted way, the invention provides the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting mature-stage tobacco leaf yield. The models are built by collecting and analyzing early-stage genotype data and mature-stage yield trait values of the recombinant inbred line population (RILs), the natural population and the integrated population formed by mixing the two, and by screening, optimizing and verifying the 4 core parameters of each model, namely the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4), on the basis of the initially selected BayesA, BayesB and BayesC candidate prediction models. The resulting models can be used for assisted selection, within the whole genome range, of genes/QTL loci for the 3 mature-stage tobacco yield traits, improving the efficiency of molecular marker-assisted selection and of breeding high-yield tobacco varieties.
The invention uses the flue-cured tobacco varieties Y3 and K326 to construct a recombinant inbred line (RILs, F7) population, and also constructs a natural population containing 347 different tobacco varieties (lines); the two populations together are taken to represent all tobacco populations or varieties (lines). Starting from the 3 initially selected candidate prediction models BayesA, BayesB and BayesC, the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting mature-stage tobacco yield are further screened, optimized and constructed, which accelerates molecular marker selection breeding of high-yield, high-quality tobacco varieties within the whole genome range.
The whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting mature-stage tobacco leaf yield are scientific, efficient, accurate and low-cost, and can be applied to breeding new high-quality tobacco varieties (lines) with the desired yield.
In order to achieve the above object, the present invention further provides a tobacco yield prediction system based on whole genome selection, as shown in fig. 5, the system specifically includes:
the acquisition unit is used for acquiring the whole genome data of the tobacco leaves in the candidate prediction model;
the screening unit is used for screening and optimizing the whole genome data of the tobacco leaves in real time;
the generation unit is used for generating tobacco yield prediction data;
In this system scheme, tobacco yield prediction data are generated by acquiring the tobacco leaf whole genome data in the candidate prediction model and screening and optimizing those data in real time. Genotype data from the tobacco seedling (early) stage can thus be used to calculate or simulate mature-stage tobacco yield trait phenotype values; the approach is convenient, fast, efficient and scientific, with accurate and reliable results.
That is, the tobacco yield prediction data are obtained by continuously screening, optimizing and verifying, through experiments, the predictions of mature-stage tobacco yield (natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL)) on the basis of the candidate prediction models BayesA, BayesB and BayesC obtained through preliminary screening; the core parameter values, namely the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4), are obtained in the same way.
The acquisition unit further comprises:
the setting module is used for setting core parameters of the candidate prediction model;
the first modeling module is used for establishing a candidate prediction model;
the first screening module is used for primarily screening the tobacco leaf whole genome data by combining the core parameters through the candidate prediction model;
In this embodiment of the system, the core parameters of the candidate prediction model are explicitly set, the candidate prediction model is established, and the tobacco leaf whole genome data are primarily screened by the candidate prediction model in combination with the core parameters.
That is, in order to optimize the prediction accuracy of tobacco yield, the core parameter values of the candidate prediction models BayesA, BayesB, BayesC and rrBLUP, namely the number of molecular markers (n1), the training population size (n2), the training population to test population ratio (n3) and the model prediction accuracy value (n4), are explicitly specified.
The screening unit further comprises:
the second modeling module is used for establishing a whole genome selection model;
the verification module is used for verifying core parameters of the tobacco leaf candidate prediction model;
and the second screening module is used for secondarily screening the tobacco leaf whole genome data in real time by combining the core parameters through the whole genome selection model.
In the embodiment of the invention, whole genome selection models are established for predicting tobacco leaf yield, i.e., the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL are established for the 3 traits natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL), respectively. The core parameters of the tobacco leaf candidate prediction model are verified, the tobacco leaf whole genome data are secondarily screened in real time by the whole genome selection models in combination with the core parameters, and the resulting predictions of mature-stage tobacco yield are continuously screened, optimized and verified to form the final tobacco leaf yield prediction data, i.e., the tobacco yield trait phenotype values.
Specifically, the core parameters include: the number of molecular markers, the size of training population, the proportion of training population to testing population and the model prediction precision value.
That is, when the prediction result of the mature period yield is continuously screened, optimized and verified, the method obtains core parameter values such as the number of molecular markers (n 1), the training population scale (n 2), the training population-to-test population ratio (n 3), the model prediction accuracy value (n 4) and the like.
Specifically, the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
That is, in the scheme of the invention, the whole genome selection models established for the 3 mature-stage tobacco yield trait phenotype values, natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL), are the natural leaf number whole genome selection model Bayes C_LN, the maximum waist leaf length whole genome selection model Bayes B_LL and the maximum waist leaf width whole genome selection model Bayes C_WL.
Preferably, the tobacco yield prediction data include: natural leaf number phenotype data, maximum waist leaf length phenotype data, and maximum waist leaf width phenotype data.
Specifically, the whole genome selection models for predicting tobacco leaf yield are Bayes C_LN, Bayes B_LL and Bayes C_WL respectively. The formulas for calculating the phenotype value data with these models, and the core parameter values of the models, are the same as in the method described above and are not repeated here.
To achieve the above object, the present invention also provides a tobacco yield prediction platform based on whole genome selection, as shown in fig. 6, comprising:
a processor, a memory, and a tobacco yield prediction platform control program based on the whole genome;
wherein the whole genome-based tobacco yield prediction platform control program is stored in the memory and executed on the processor, and when executed implements the steps of the whole genome selection-based tobacco yield prediction method, for example:
S1, acquiring tobacco leaf whole genome data in a candidate prediction model;
S2, screening and optimizing tobacco leaf whole genome data in real time;
S3, generating tobacco yield prediction data.
The details of the steps are set forth above and are not repeated here.
In the embodiment of the invention, the processor built into the whole genome selection-based tobacco leaf yield prediction platform may consist of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor connects the components through various interfaces and lines, and performs the functions and data processing of whole genome selection-based tobacco yield prediction by running or executing programs or units stored in the memory and invoking data stored in the memory;
The memory is used for storing program code and various data; it is installed in the whole genome selection-based tobacco leaf yield prediction platform and provides high-speed, automatic access to programs and data during operation.
The memory includes read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), one-time programmable read-only memory (One-Time Programmable Read-Only Memory, OTPROM), electrically erasable programmable read-only memory (Electrically-Erasable Programmable Read-Only Memory, EEPROM), compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other computer-readable medium that can be used to carry or store data.
To achieve the above object, the present invention further provides a computer readable storage medium, as shown in fig. 7, where the computer readable storage medium stores a tobacco yield prediction platform control program based on whole genome selection, and the tobacco yield prediction platform control program based on whole genome selection implements the steps of the tobacco yield prediction method based on whole genome selection, for example:
S1, acquiring tobacco leaf whole genome data in a candidate prediction model;
S2, screening and optimizing tobacco leaf whole genome data in real time;
S3, generating tobacco yield prediction data.
The details of the steps are set forth above and are not repeated here.
In the description of embodiments of the invention, it should be noted that any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and that scope of preferred embodiments of the invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, as would be understood by those reasonably skilled in the art of the embodiments of the invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a system that includes a processing module, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM).
In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
To achieve the above object, the present invention also provides a chip system comprising at least one processor, wherein program instructions, when executed in the at least one processor, cause the chip system to perform the whole genome-based tobacco yield prediction method steps, for example:
s1, acquiring tobacco leaf whole genome data in a candidate prediction model;
s2, screening and optimizing tobacco leaf whole genome data in real time;
s3, generating tobacco yield prediction data.
The details of the steps are set forth above and are not repeated here.
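The three steps can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the `Candidate` record, the function names, the 0/1/2 marker coding, the accuracy threshold and all numeric values are hypothetical, and the linear predictors merely stand in for fitted Bayesian models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Genotype = List[int]  # one plant's marker codes; the 0/1/2 coding is an assumption


@dataclass
class Candidate:
    """Hypothetical candidate prediction model together with its measured accuracy."""
    name: str
    accuracy: float                       # assumed: correlation of predicted vs. measured values
    predict: Callable[[Genotype], float]  # maps a seedling genotype to a trait value


def s1_acquire(candidates: Sequence[Candidate], min_accuracy: float) -> List[Candidate]:
    """s1: preliminary screening; keep candidates that reach a minimum accuracy."""
    return [c for c in candidates if c.accuracy >= min_accuracy]


def s2_screen(candidates: Sequence[Candidate]) -> Candidate:
    """s2: secondary screening; keep the single most accurate candidate."""
    if not candidates:
        raise ValueError("no candidate model passed the preliminary screening")
    return max(candidates, key=lambda c: c.accuracy)


def s3_generate(model: Candidate, seedlings: Dict[str, Genotype]) -> Dict[str, float]:
    """s3: predict the mature-stage trait value for every seedling genotype."""
    return {plant: model.predict(genotype) for plant, genotype in seedlings.items()}


# Toy inputs: the accuracies and linear predictors below are invented for illustration.
candidates = [
    Candidate("BayesA", 0.55, lambda g: 18.0 + 0.5 * sum(g)),
    Candidate("BayesB", 0.62, lambda g: 18.5 + 0.5 * sum(g)),
    Candidate("BayesC", 0.70, lambda g: 19.0 + 0.5 * sum(g)),
]
seedlings = {"plant_1": [0, 1, 2, 1], "plant_2": [2, 2, 0, 0]}

selected = s2_screen(s1_acquire(candidates, min_accuracy=0.6))
predictions = s3_generate(selected, seedlings)
print(selected.name, predictions)
```

Keeping the three steps as separate functions mirrors the acquisition, screening and generation stages, so that each stage could be replaced independently.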
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
Through the whole genome-based tobacco yield prediction method, application, system, platform and storage medium, the invention uses genotype data from the tobacco seedling stage (early stage) to calculate or simulate the phenotype values of the yield traits at the mature stage. It is convenient to operate, fast and efficient, and its prediction results are scientifically grounded, accurate and reliable.
That is, to construct the whole genome selection models, the invention on the one hand builds a biparental recombinant inbred line (RIL, F7) population derived from the flue-cured tobacco varieties Y3 and K326, and also a natural population of 347 accessions of different tobacco varieties (lines); these two different types of populations represent the whole tobacco population or tobacco varieties (lines). On the other hand, starting from the BayesA, BayesB and BayesC candidate prediction models obtained by preliminary screening, and combining them with the measured phenotype values of the tobacco materials, the invention further screens and optimizes them to construct the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL for predicting the phenotype values of three yield traits at the mature stage of tobacco: natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL). This accelerates the application of genome-wide molecular marker selection to predicting mature tobacco leaf yield, so that high-yield, high-quality tobacco varieties (lines) can be bred scientifically, efficiently, accurately and reliably.
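The selection of one model per trait can be illustrated with a short Python sketch. It assumes, hypothetically, that prediction accuracy is measured as the Pearson correlation between predicted and measured phenotype values, and that the predictions of each candidate model are already available; the function names and every number in the example are invented for illustration and are not results of the invention.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def best_model_per_trait(
    measured: Dict[str, Sequence[float]],
    predicted: Dict[str, Dict[str, Sequence[float]]],
) -> Dict[str, str]:
    """For each trait, return the candidate whose predictions correlate best
    with the measured phenotype values."""
    best = {}
    for trait, observed in measured.items():
        scores = {name: pearson(values, observed)
                  for name, values in predicted[trait].items()}
        best[trait] = max(scores, key=scores.get)
    return best


# Toy data for a single trait (LN); all values are made up.
measured = {"LN": [18.0, 20.0, 22.0, 24.0]}
predicted = {
    "LN": {
        "BayesA": [20.0, 19.0, 23.0, 21.0],
        "BayesC": [18.5, 19.5, 22.5, 23.5],
    }
}
print(best_model_per_trait(measured, predicted))
```

Scoring each trait separately allows a different candidate to be retained for LN, LL and WL, which is consistent with the trait-specific models named above.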
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (4)

1. A tobacco leaf yield prediction method based on whole genome selection, characterized by comprising the following steps:
A. acquiring whole genome data of the tobacco leaves in a candidate prediction model, comprising:
A1, setting core parameters of the candidate prediction model;
A2, establishing the candidate prediction model;
A3, preliminarily screening the whole genome data of the tobacco leaves by combining the core parameters through the candidate prediction model;
B. screening and optimizing the whole genome data of the tobacco leaves in real time, comprising:
B1, establishing a whole genome selection model, wherein the whole genome selection model comprises a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model;
B2, verifying the core parameters of the tobacco leaf candidate prediction model;
B3, secondarily screening the tobacco leaf whole genome data in real time by combining the core parameters through the whole genome selection model;
the core parameters in the steps A1, A3, B2 and B3 comprise the number of molecular markers, the scale of training population, the proportion of training population to testing population and the model prediction precision value;
C. generating tobacco yield prediction data, including natural leaf number phenotype value data, maximum waist leaf length phenotype value data and maximum waist leaf width phenotype value data;
the calculation formula of the natural leaf number phenotype value data is as follows:
(1)
the calculation formula of the maximum waist leaf length phenotype value data is as follows:
(2)
the calculation formula of the maximum waist leaf width phenotype value data is as follows:
(3)
wherein Bayes C_LN is the whole genome selection model for the natural leaf number phenotype value, LN is the natural leaf number, and Bayes C is the candidate prediction model; Bayes B_LL is the whole genome selection model for the maximum waist leaf length phenotype value, LL is the maximum waist leaf length, and Bayes B is the candidate prediction model; Bayes C_WL is the whole genome selection model for the maximum waist leaf width phenotype value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population scale, n3 is the ratio of training population to test population, and n4 is the model prediction accuracy value.
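Purely as an illustration, the four core parameters n1 to n4 could be held in a small record and used to divide a population into training and test sets. The following Python sketch assumes that n3 is expressed as training size divided by test size; the class name, field names, function name and numeric values are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CoreParameters:
    n_markers: int              # n1: number of molecular markers
    training_size: int          # n2: training population scale
    train_test_ratio: float     # n3: training size / test size (assumed definition)
    prediction_accuracy: float  # n4: model prediction accuracy value

    def test_size(self) -> int:
        """Test population size implied by n2 and n3."""
        return round(self.training_size / self.train_test_ratio)


def split_population(
    plant_ids: Sequence[str], params: CoreParameters, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Randomly split plant identifiers into training and test populations."""
    needed = params.training_size + params.test_size()
    if needed > len(plant_ids):
        raise ValueError("population is smaller than training size plus test size")
    shuffled = list(plant_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:params.training_size], shuffled[params.training_size:needed]


# Illustrative values only: 100 plants, 80 for training, ratio 4:1.
params = CoreParameters(n_markers=500, training_size=80,
                        train_test_ratio=4.0, prediction_accuracy=0.0)
train, test = split_population([f"plant_{i}" for i in range(100)], params)
print(len(train), len(test))
```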
2. An application of the whole genome-based tobacco yield prediction method, characterized in that the method is applied to analyzing genotype data from the seedling stage of a tobacco population and predicting, over the whole genome range, the phenotype value data of the mature-stage yield traits of each plant in the population;
the method is applied to analyzing genotype data of tobacco populations or tobacco varieties and predicting, over the whole genome range, the phenotype value data of the mature-stage yield traits of the tobacco, so that these data are obtained while the tobacco is still at the seedling stage.
3. A prediction system for implementing the whole genome-based tobacco yield prediction method of claim 1, characterized in that the prediction system specifically comprises:
the acquisition unit is used for acquiring the whole genome data of the tobacco leaves in the candidate prediction model;
the screening unit is used for screening and optimizing the whole genome data of the tobacco leaves in real time;
the generation unit is used for generating tobacco yield prediction data;
the acquisition unit further comprises:
the setting module is used for setting core parameters of the candidate prediction model;
the first modeling module is used for establishing a candidate prediction model;
the first screening module is used for primarily screening the tobacco leaf whole genome data by combining the core parameters through the candidate prediction model;
the screening unit further comprises:
the second modeling module is used for establishing a whole genome selection model;
the verification module is used for verifying core parameters of the tobacco leaf candidate prediction model;
and the second screening module is used for secondarily screening the tobacco leaf whole genome data in real time by combining the core parameters through the whole genome selection model.
4. A prediction platform for implementing the whole genome-based tobacco yield prediction method of claim 1, comprising:
a processor, a memory, and a whole genome-based tobacco yield prediction platform control program;
wherein the processor executes the whole genome-based tobacco yield prediction platform control program, which is stored in the memory, the whole genome-based tobacco yield prediction platform control program implementing the steps of the whole genome-based tobacco yield prediction method of claim 1.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675520.8A CN111898807B (en) 2020-07-14 2020-07-14 Tobacco leaf yield prediction method based on whole genome selection and application

Publications (2)

Publication Number Publication Date
CN111898807A CN111898807A (en) 2020-11-06
CN111898807B true CN111898807B (en) 2024-02-27

Family

ID=73191755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010675520.8A Active CN111898807B (en) 2020-07-14 2020-07-14 Tobacco leaf yield prediction method based on whole genome selection and application

Country Status (1)

Country Link
CN (1) CN111898807B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116083622A (en) * 2022-11-22 2023-05-09 云南省烟草农业科学研究院 Genes qLL and qWL related to tobacco leaf types, linked SSR markers and application thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102269570A (en) * 2011-04-28 2011-12-07 中国烟草总公司郑州烟草研究院 Method for measuring maximum length and maximum width of flue-cured tobacco leaf based on canopy multi-spectra
CN105734141A (en) * 2016-03-31 2016-07-06 湖北省烟草科学研究院 Molecular biology method for identifying purity of tobacco varieties
CN106447079A (en) * 2016-08-31 2017-02-22 贵州师范大学 Prediction method for tobacco production of karst mountainous area based on Radarsat-2
WO2017069607A1 (en) * 2015-10-23 2017-04-27 Sime Darby Plantation Sdn. Bhd. Methods for predicting palm oil yield of a test oil palm plant
CN107354203A (en) * 2017-07-10 2017-11-17 中国烟草总公司郑州烟草研究院 Primer for identifying flue-cured tobacco Bi Na 1 combines and kit, application and detection method
CN110610744A (en) * 2019-09-11 2019-12-24 华中农业大学 Efficient whole genome selection method capable of realizing parallel operation and high accuracy
CN111197101A (en) * 2018-11-20 2020-05-26 云南省烟草农业科学研究院 Codominant SSR marker closely linked with tobacco leafy gene mLN and application thereof

Also Published As

Publication number Publication date
CN111898807A (en) 2020-11-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant