CN111898807A - Tobacco yield prediction method based on whole genome selection and application

Publication number
CN111898807A
Authority
CN
China
Prior art keywords
tobacco
whole genome
yield
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010675520.8A
Other languages
Chinese (zh)
Other versions
CN111898807B (en
Inventor
童治军
肖炳光
方敦煌
陈学军
曾建敏
焦芳婵
姚恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Academy of Tobacco Agricultural Sciences
Original Assignee
Yunnan Academy of Tobacco Agricultural Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Academy of Tobacco Agricultural Sciences
Priority to CN202010675520.8A
Publication of CN111898807A
Application granted
Publication of CN111898807B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Mining
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Resources & Organizations (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Strategic Management (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Economics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Analytical Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Marketing (AREA)
  • Mining & Mineral Resources (AREA)
  • Bioethics (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Animal Husbandry (AREA)
  • Agronomy & Crop Science (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)

Abstract

The invention belongs to the field of biotechnology and relates in particular to a tobacco yield prediction method based on whole genome selection and to its application. The method comprises: obtaining tobacco leaf whole genome data in a candidate prediction model; screening and optimizing the tobacco leaf whole genome data in real time; and generating tobacco yield prediction data. Using genotype data collected at the seedling (early) stage, the method calculates or simulates the phenotypic values of tobacco yield traits at the mature stage. By analyzing the genotype data of tobacco populations or tobacco varieties, it predicts mature-stage yield trait phenotypic values across the whole genome and thereby obtains the mature-stage yield trait phenotypic value data.

Description

Tobacco yield prediction method based on whole genome selection and application
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a tobacco yield prediction method based on whole genome selection and application thereof.
Background
Tobacco is a special economic crop grown for its leaves. It plays an important role in China's national economy and is the foundation of China's tobacco industry, so breeding high-yield, high-quality tobacco varieties has become an important goal of breeding research.
Research shows that tobacco leaf yield is significantly correlated with leaf traits such as leaf number, leaf length and leaf width. These are quantitative traits controlled by multiple genes and influenced by the environment, and their inheritance is complex.
In addition, conventional measurement of mature-stage yield trait phenotypic values is time-consuming, labor-intensive and inefficient. Obtaining reasonably accurate phenotypic values requires a complete and lengthy field growing period followed by field measurement after the leaves mature. The measurements are also easily affected by environmental and human factors, so the results are uncertain.
Disclosure of Invention
In view of the technical problems that measuring mature-stage tobacco yield trait phenotypic values is time-consuming, labor-intensive and inefficient, that yield traits are easily affected by environmental and human factors, and that the measurement results are uncertain, the invention provides a tobacco yield prediction method based on whole genome selection and its application. The method predicts the mature-stage yield trait phenotypic values of a tobacco material from genotype data collected at the seedling (early) stage.
The invention is realized by the following technical scheme:
A tobacco yield prediction method based on whole genome selection comprises the following steps:
acquiring tobacco leaf whole genome data in the candidate prediction model;
screening and optimizing the tobacco leaf whole genome data in real time;
and generating tobacco yield prediction data.
Further, the step of acquiring the tobacco leaf whole genome data in the candidate prediction model comprises the following steps:
setting core parameters of the candidate prediction model;
establishing the candidate prediction model;
and preliminarily screening the tobacco leaf whole genome data using the candidate prediction model in combination with the core parameters.
Further, the step of screening and optimizing the tobacco leaf whole genome data in real time comprises the following steps:
establishing a whole genome selection model;
verifying the core parameters of the tobacco candidate prediction model;
and performing real-time secondary screening of the tobacco leaf whole genome data using the whole genome selection model in combination with the core parameters.
Further, the core parameters include: the number of molecular markers, the training population size, the ratio of training population to test population, and the model prediction accuracy value.
Further, the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
Further, the tobacco yield prediction data comprise: natural leaf number phenotypic value data, maximum waist leaf length phenotypic value data, and maximum waist leaf width phenotypic value data.
Further, the calculation formula for the natural leaf number phenotypic value data is:
Figure BDA0002583901910000021
The calculation formula for the maximum waist leaf length phenotypic value data is:
Figure BDA0002583901910000031
The calculation formula for the maximum waist leaf width phenotypic value data is:
Figure BDA0002583901910000032
where Bayes C_LN is the whole genome selection model for the natural leaf number phenotypic value, LN is the natural leaf number, and Bayes C is the candidate prediction model; Bayes B_LL is the whole genome selection model for the maximum waist leaf length phenotypic value, LL is the maximum waist leaf length, and Bayes B is the candidate prediction model; Bayes C_WL is the whole genome selection model for the maximum waist leaf width phenotypic value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the ratio of training population to test population, and n4 is the model prediction accuracy value.
To achieve the above object, the invention also provides an application of the tobacco yield prediction method based on whole genome selection. The method is applied to analyzing seedling-stage genotype data of a tobacco population and predicting, across the whole genome, the mature-stage yield trait phenotypic values of each plant in the population;
the method is also applied to analyzing the genotype data of tobacco populations or tobacco varieties and predicting mature-stage yield trait phenotypic values across the whole genome, so that mature-stage yield trait phenotypic value data are obtained at the seedling stage.
To achieve the above object, the invention further provides a tobacco yield prediction system based on whole genome selection, the system comprising:
an acquisition unit for acquiring the tobacco leaf whole genome data in the candidate prediction model;
a screening unit for screening and optimizing the tobacco leaf whole genome data in real time;
a generating unit for generating tobacco yield prediction data;
the acquisition unit further comprises:
a setting module for setting core parameters of the candidate prediction model;
a first modeling module for establishing the candidate prediction model;
a first screening module for preliminarily screening the tobacco leaf whole genome data using the candidate prediction model in combination with the core parameters;
the screening unit further comprises:
a second modeling module for establishing a whole genome selection model;
a verification module for verifying the core parameters of the tobacco candidate prediction model;
and a second screening module for performing real-time secondary screening of the tobacco leaf whole genome data using the whole genome selection model in combination with the core parameters.
To achieve the above object, the invention further provides a tobacco yield prediction platform based on whole genome selection, comprising:
a processor, a memory, and a control program for the tobacco yield prediction platform based on whole genome selection;
wherein the control program is stored in the memory and executed by the processor, and the control program implements the steps of the tobacco yield prediction method based on whole genome selection.
To achieve the above object, the invention further provides a computer-readable storage medium that stores the control program of the tobacco yield prediction platform based on whole genome selection, the control program implementing the steps of the tobacco yield prediction method based on whole genome selection.
To achieve the above object, the invention also provides a chip system comprising at least one processor; when program instructions are executed in the at least one processor, the chip system performs the steps of the tobacco yield prediction method based on whole genome selection.
Compared with the prior art, the invention has the following beneficial effects:
With the method, application, system, platform and storage medium for tobacco yield prediction based on whole genome selection, genotype data from the tobacco seedling (early) stage can be used to calculate or simulate the phenotypic values of mature-stage yield traits. The approach is convenient, rapid, efficient and scientific to operate, and its prediction results are accurate and reliable.
Drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of the tobacco yield prediction method based on whole genome selection according to the invention;
FIG. 2 is a schematic diagram of the whole genome selection model Bayes C_LN established for the natural leaf number (LN) phenotypic value among the mature-stage tobacco yield traits;
panel (a) of FIG. 2 shows the effect of the number of molecular markers (n1) on the LN phenotypic value prediction accuracy of the Bayes C_LN model;
panel (b) of FIG. 2 shows the effect of the training population size (n2) on the LN phenotypic value prediction accuracy of the Bayes C_LN model;
panel (c) of FIG. 2 shows the effect of the ratio of training population to test population (n3) on the LN phenotypic value prediction accuracy of the Bayes C_LN model;
panel (d) of FIG. 2 shows, for the Bayes C_LN model, the effect of different candidate prediction models on the prediction accuracy value (n4);
FIG. 3 is a schematic diagram of the whole genome selection model Bayes B_LL established for the maximum waist leaf length (LL) phenotypic value among the tobacco yield traits;
panel (a) of FIG. 3 shows the effect of the number of molecular markers (n1) on the LL phenotypic value prediction accuracy of the Bayes B_LL model;
panel (b) of FIG. 3 shows the effect of the training population size (n2) on the LL phenotypic value prediction accuracy of the Bayes B_LL model;
panel (c) of FIG. 3 shows the effect of the ratio of training population to test population (n3) on the LL phenotypic value prediction accuracy of the Bayes B_LL model;
panel (d) of FIG. 3 shows, for the Bayes B_LL model, the effect of different candidate prediction models on the prediction accuracy value (n4);
FIG. 4 is a schematic diagram of the whole genome selection model Bayes C_WL established for the maximum waist leaf width (WL) phenotypic value among the tobacco yield traits;
panel (a) of FIG. 4 shows the effect of the number of molecular markers (n1) on the WL phenotypic value prediction accuracy of the Bayes C_WL model;
panel (b) of FIG. 4 shows the effect of the training population size (n2) on the WL phenotypic value prediction accuracy of the Bayes C_WL model;
panel (c) of FIG. 4 shows the effect of the ratio of training population to test population (n3) on the WL phenotypic value prediction accuracy of the Bayes C_WL model;
panel (d) of FIG. 4 shows, for the Bayes C_WL model, the effect of different candidate prediction models on the prediction accuracy value (n4);
FIG. 5 is a schematic diagram of the architecture of the tobacco yield prediction system based on whole genome selection according to the invention;
FIG. 6 is a schematic diagram of the architecture of the tobacco yield prediction platform based on whole genome selection according to the invention;
FIG. 7 is a block diagram of a computer-readable storage medium according to an embodiment of the invention.
The objects, features and advantages of the invention are further explained below with reference to the accompanying drawings.
Detailed Description
For a better understanding of the objects, technical solutions and advantages of the invention, the invention is described in detail below with reference to the accompanying drawings; other advantages and capabilities of the invention will be apparent to those skilled in the art from this description.
The invention can also be implemented in other, different embodiments, and its details can be modified in various respects without departing from the spirit and scope of the invention.
It should be noted that any directional indications (such as up, down, left, right, front and back) in the embodiments of the invention are used only to explain the relative positions, movements and the like of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, descriptions such as "first" and "second" in the embodiments are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may therefore explicitly or implicitly include at least one such feature. Furthermore, the technical solutions of the embodiments can be combined with one another, but only on the basis that a person skilled in the art can realize the combination; where technical solutions contradict each other or cannot be realized, the combination should be regarded as nonexistent and outside the protection scope of the invention.
Preferably, the tobacco yield prediction method based on whole genome selection is applied in one or more terminals or servers. A terminal is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device and the like.
The terminal may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The terminal can interact with a user through a keyboard, a mouse, a remote control, a touch panel, a voice control device or the like.
The invention provides a method, a system, a platform and a storage medium for tobacco yield prediction based on whole genome selection.
Fig. 1 is a flowchart of a method for predicting tobacco yield based on whole genome selection according to an embodiment of the present invention.
In this embodiment, the tobacco yield prediction method based on whole genome selection can be applied to a terminal with a display function or to a fixed terminal; the terminal includes, but is not limited to, a personal computer, a smartphone, a tablet computer, or a desktop computer or all-in-one machine with a camera.
The method can also be applied in a hardware environment consisting of a terminal and a server connected to the terminal through a network. Networks include, but are not limited to, a wide area network, a metropolitan area network or a local area network. The method in the embodiments of the invention can be executed by a server, by a terminal, or by both together.
For example, for a terminal that needs to perform tobacco yield prediction based on whole genome selection, the prediction function provided by the method of the invention can be integrated directly on the terminal, or a client implementing the method can be installed on it. As another example, the method can run on a device such as a server in the form of a Software Development Kit (SDK); the SDK exposes an interface to the prediction function, through which the terminal or other devices can use it.
The invention is further elucidated with reference to the drawing.
As shown in FIG. 1, the invention provides a tobacco yield prediction method based on whole genome selection, which comprises the following steps:
S1, acquiring tobacco leaf whole genome data in the candidate prediction model;
S2, screening and optimizing the tobacco leaf whole genome data in real time;
and S3, generating tobacco yield prediction data.
In this scheme, the tobacco leaf whole genome data in the candidate prediction model are acquired, and the data are screened and optimized in real time to finally generate the tobacco yield prediction data.
That is, starting from the candidate prediction models Bayes A, Bayes B and Bayes C obtained by preliminary screening, the prediction results for mature-stage tobacco yield (namely the natural leaf number (LN), the maximum waist leaf length (LL) and the maximum waist leaf width (WL)) are continuously screened, optimized and verified through experiments to generate the tobacco yield prediction data. The same procedure also yields the core parameter values, including the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4).
Specifically, the step of acquiring the tobacco leaf whole genome data in the candidate prediction model comprises the following steps:
S11, setting core parameters of the candidate prediction model;
S12, establishing the candidate prediction model;
S13, preliminarily screening the tobacco leaf whole genome data using the candidate prediction model in combination with the core parameters.
In this embodiment, the core parameters of the candidate prediction model are explicitly set, the candidate prediction model is established, and the tobacco leaf whole genome data are preliminarily screened using the candidate prediction model in combination with the core parameters.
That is, to optimize the prediction accuracy for tobacco yield, the core parameter values of the candidate prediction models Bayes A, Bayes B, Bayes C and rrBLUP are specified, including the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4).
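As a rough illustration of how one of the candidate models named above might be fitted and scored, the sketch below approximates rrBLUP by ridge regression on marker genotypes with a fixed shrinkage parameter (rrBLUP implementations typically estimate that parameter from variance components instead). Prediction accuracy is taken here as the Pearson correlation between predicted and observed phenotypes in the test population. The patent does not define the accuracy measure, the shrinkage parameter or the data layout, so the correlation metric, `lam`, the 0/1/2 genotype coding, the simulated data and all function names are assumptions made for illustration only. A smaller marker count than the values in the text is used to keep the example fast.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Solve (Z'Z + lam*I) b = Z'y on centered data for marker effects b."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    n_cols = Z.shape[1]
    return np.linalg.solve(Zc.T @ Zc + lam * np.eye(n_cols), Zc.T @ yc)

def predict(Z_new, Z_train, y_train, effects):
    """Predict phenotypes from genotypes using training means and effects."""
    return y_train.mean() + (Z_new - Z_train.mean(axis=0)) @ effects

# Simulated data only (assumption): 0/1/2 genotype codes, additive effects.
rng = np.random.default_rng(0)
n_train, n_test, n_markers = 250, 50, 200
Z = rng.integers(0, 3, size=(n_train + n_test, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.1, size=n_markers)
y = 20.0 + Z @ true_effects + rng.normal(0.0, 1.0, size=n_train + n_test)

# Training and test populations in a 5:1 ratio.
Z_tr, Z_te = Z[:n_train], Z[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

effects = ridge_marker_effects(Z_tr, y_tr, lam=50.0)  # lam is an assumed value
y_hat = predict(Z_te, Z_tr, y_tr, effects)

# Assumed accuracy measure: correlation of predicted vs observed phenotypes.
accuracy = float(np.corrcoef(y_hat, y_te)[0, 1])
print(f"prediction accuracy on test population: {accuracy:.2f}")
```

The Bayes A, Bayes B and Bayes C models differ from this sketch in the prior placed on marker effects and are normally fitted by sampling, so this block shows only the general fit-then-score pattern.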
Specifically, the step of screening and optimizing the tobacco leaf whole genome data in real time comprises the following steps:
S21, establishing a whole genome selection model;
S22, verifying the core parameters of the tobacco candidate prediction model;
and S23, performing real-time secondary screening of the tobacco leaf whole genome data using the whole genome selection model in combination with the core parameters.
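One way to read the verification of core parameters in S22 is as a scan: evaluate the model at several candidate values of a parameter and keep the value with the highest score. The sketch below shows only that selection logic. The function `best_setting`, the candidate marker counts and the stand-in scoring function are hypothetical; in practice the score would come from fitting the model and measuring prediction accuracy on the test population.

```python
from typing import Callable, Iterable, Tuple

def best_setting(candidates: Iterable[int],
                 score: Callable[[int], float]) -> Tuple[int, float]:
    """Return the candidate with the highest score, together with that score."""
    scored = [(value, score(value)) for value in candidates]
    if not scored:
        raise ValueError("no candidate values supplied")
    return max(scored, key=lambda pair: pair[1])

def toy_score(n_markers: int) -> float:
    # Stand-in only, NOT real data: an arbitrary curve that peaks at 2000.
    return 1.0 - abs(n_markers - 2000) / 20000.0

# Hypothetical candidate values for the number of molecular markers (n1).
marker_counts = [500, 1000, 2000, 4000, 8000, 16000]
chosen, chosen_score = best_setting(marker_counts, toy_score)
print(chosen)
```

The same helper could be reused for the training population size (n2) or the training-to-test ratio (n3) by passing a different candidate list and scoring function.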
In this embodiment, whole genome selection models are established for the three tobacco yield traits, namely natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL): Bayes C_LN, Bayes B_LL and Bayes C_WL respectively. The core parameters of the tobacco candidate prediction model are then verified, and the tobacco leaf whole genome data undergo real-time secondary screening using the whole genome selection models in combination with the core parameters. By continuously screening, optimizing and verifying the resulting predictions of mature-stage yield, the tobacco yield prediction data, namely the yield trait phenotypic values, are finally produced.
Specifically, the core parameters include: the number of molecular markers, the training population size, the ratio of training population to test population, and the model prediction accuracy value.
That is, while the mature-stage yield predictions are continuously screened, optimized and verified, the method obtains the core parameter values, including the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4).
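The ratio of training population to test population (n3) can be illustrated with a simple random split. The sketch below is an assumption about how such a split might be done; the text does not say how individuals are assigned to each population, and the function name, the seed and the plant identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

def split_by_ratio(ids: Sequence[str], train_parts: int, test_parts: int,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split ids so train:test is close to train_parts:test_parts."""
    if train_parts <= 0 or test_parts <= 0:
        raise ValueError("ratio parts must be positive")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_parts / (train_parts + test_parts))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers for 300 genotyped individual plants.
plants = [f"plant_{i:03d}" for i in range(300)]
train, test = split_by_ratio(plants, train_parts=5, test_parts=1)
print(len(train), len(test))
```

With 300 individuals and a 5:1 ratio, the split yields 250 training and 50 test plants.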
Specifically, the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
That is, in this embodiment, separate whole genome selection models are established for the three mature-stage yield trait phenotypic values: the model for natural leaf number (LN) is Bayes C_LN, the model for maximum waist leaf length (LL) is Bayes B_LL, and the model for maximum waist leaf width (WL) is Bayes C_WL.
Preferably, the tobacco yield prediction data includes: natural leaf number phenotype value data, maximum waist leaf length phenotype value data, and maximum waist leaf width phenotype value data.
The calculation formula for the natural leaf number phenotypic value data is:
Figure BDA0002583901910000101
The calculation formula for the maximum waist leaf length phenotypic value data is:
Figure BDA0002583901910000102
The calculation formula for the maximum waist leaf width phenotypic value data is:
Figure BDA0002583901910000111
where Bayes C_LN is the whole genome selection model for the natural leaf number phenotypic value, LN is the natural leaf number, and Bayes C is the candidate prediction model; Bayes B_LL is the whole genome selection model for the maximum waist leaf length phenotypic value, LL is the maximum waist leaf length, and Bayes B is the candidate prediction model; Bayes C_WL is the whole genome selection model for the maximum waist leaf width phenotypic value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the ratio of training population to test population, and n4 is the model prediction accuracy value.
That is, the whole genome selection models for predicting tobacco yield are Bayes C_LN, Bayes B_LL and Bayes C_WL, and their core parameter values are as follows:
A) Whole genome selection model Bayes C_LN for the natural leaf number (LN) phenotypic value: the number of molecular markers is n1 = 2000 markers; the training population size is n2 = 250 individual plants; the ratio of training population to test population is n3 = 5:1 (individuals in the training population : individuals in the test population); the model prediction accuracy value is n4 = 0.72.
B) Whole genome selection model Bayes B_LL for the maximum waist leaf length (LL) phenotypic value: the number of molecular markers is n1 = 16000 markers; the training population size is n2 ≥ 250 individual plants; the ratio of training population to test population is n3 = 6:1; the model prediction accuracy value is n4 = 0.40.
C) Whole genome selection model Bayes C_WL for the maximum waist leaf width (WL) phenotypic value: the number of molecular markers is n1 = 1000 markers; the training population size is n2 ≥ 250 individual plants; the ratio of training population to test population is n3 = 6:1; the model prediction accuracy value is n4 = 0.32.
In other words, in the method of the invention the whole genome selection models for predicting tobacco leaf yield at the mature stage are Bayes CLN, Bayes BLL and Bayes CWL. Starting from the 3 Bayes candidate prediction models obtained by preliminary screening (namely Bayes A, Bayes B and Bayes C), the models were experimentally screened, optimized and verified to obtain the specific values of their respective core parameters: the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4).
The 4 core parameter values corresponding to the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL are respectively:
Number of molecular markers (n1): A) for natural leaf number (LN), n1 = 2000 markers; B) for maximum waist leaf length (LL), n1 = 16000 markers; C) for maximum waist leaf width (WL), n1 = 1000 markers.
Training population size (n2): A) for natural leaf number (LN), n2 = 250 individuals; B) for maximum waist leaf length (LL), n2 ≥ 250 individuals; C) for maximum waist leaf width (WL), n2 ≥ 250 individuals.
Ratio of training population to test population (n3): A) for natural leaf number (LN), n3 = 5:1; B) for maximum waist leaf length (LL), n3 = 6:1; C) for maximum waist leaf width (WL), n3 = 6:1.
Model prediction accuracy value (n4): A) for natural leaf number (LN), n4 = 0.72; B) for maximum waist leaf length (LL), n4 = 0.40; C) for maximum waist leaf width (WL), n4 = 0.32.
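For reference, the core parameter values listed above can be gathered into a single table. The following R sketch only restates those values; the data-frame layout, the column names and the derived training fraction are illustrative assumptions and are not part of the described method.

```r
# Core parameters of the three models, copied from the values stated above.
# The layout and column names are illustrative assumptions.
core_params <- data.frame(
  model      = c("Bayes CLN", "Bayes BLL", "Bayes CWL"),
  trait      = c("LN", "LL", "WL"),
  n1_markers = c(2000, 16000, 1000),  # number of molecular markers
  n2_min     = c(250, 250, 250),      # training population size (LL, WL: at least 250)
  n3_train   = c(5, 6, 6),            # training part of the train:test ratio
  n3_test    = c(1, 1, 1),            # test part of the train:test ratio
  n4         = c(0.72, 0.40, 0.32),   # model prediction accuracy value
  stringsAsFactors = FALSE
)

# Fraction of individuals assigned to the training population under each ratio.
core_params$train_fraction <- with(core_params, n3_train / (n3_train + n3_test))
print(core_params)
```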
The invention is further illustrated by the following specific examples:
Example:
Construction and application of the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL for predicting tobacco leaf yield at the mature stage
First, experimental materials
A recombinant inbred line (RIL, F7) population comprising 300 lines was constructed from the flue-cured tobacco varieties Y3 and K326. In addition, so that the prediction model would be generally applicable to tobacco, a natural tobacco population distinct from this parent-derived population was assembled, consisting of 347 different tobacco varieties (lines).
Second, obtaining phenotypic data for the 3 tobacco leaf yield traits in the two types of tobacco populations
The test materials were raised as seedlings and transplanted to the field; after the field plants matured, the data (phenotypic values) for the 3 tobacco leaf yield traits were measured and recorded. The resulting phenotypic data show that the phenotypic values of the 3 yield traits in all populations follow a continuous normal or approximately normal distribution, which is typical of quantitative traits controlled jointly by multiple minor-effect genes and environmental factors (i.e., genetic and non-genetic factors).
Third, SNP marker analysis (taking SNP markers as an example)
Extraction of tobacco genomic DNA: the conventional CTAB method or a plant tissue DNA extraction kit can be used, following the existing literature or the kit instructions. The extracted tobacco DNA must be purified to remove RNA, protein and other organic impurities so that it meets the requirements of the SNP chip; if SNP markers are instead mined by genome re-sequencing of the tobacco samples, the DNA quality must meet the requirements of the sequencing company.
Fourth, construction and application of the whole genome selection model Bayes CLN (taking natural leaf number (LN) as an example)
4.1 preliminary screening of candidate predictive models
The basic parameters (functions) of the 4 original models rrBLUP, BayesA, BayesB and BayesC provided in R packages were optimized using the SNP genotype data of each line in the tobacco recombinant inbred line population (RILs, 300 lines), the tobacco natural population (347 different lines) and the tobacco comprehensive population (647 lines) formed by mixing the two, finally yielding the basic parameters (functions) of each candidate model.
A) Basic functions of the rrBLUP original model:
1. mixed.solve: models marker effects as random effects, or predicts genotypic (breeding) values from an additive relationship matrix computed from the genotype data with the A.mat function;
2. kinship.BLUP: includes epistatic effects in genotypic value prediction;
3. GWA: association mapping;
Specific parameter settings:
Imputation and processing of genotype data: A.mat (additive relationship matrix)
impute <- A.mat(markers, max.missing=0.5, impute.method="EM", n.core=4, return.imputed=TRUE)
markers: genotype data;
max.missing: maximum missing rate allowed before imputation; a SNP site whose missing rate across all samples exceeds 0.5 is deleted;
impute.method: imputation method;
n.core: number of threads (cores);
return.imputed: if TRUE, the imputed data are returned.
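The filtering and imputation step described above can be illustrated in base R. The sketch below is not the A.mat implementation: it swaps the EM method for plain marker-mean imputation, and the helper name filter_and_impute and the toy genotype matrix are hypothetical.

```r
# Hypothetical helper: drop markers whose missing rate exceeds max_missing,
# then fill the remaining gaps with the marker mean (simpler than the EM method).
filter_and_impute <- function(markers, max_missing = 0.5) {
  miss_rate <- colMeans(is.na(markers))                 # missing rate per marker
  kept <- markers[, miss_rate <= max_missing, drop = FALSE]
  for (j in seq_len(ncol(kept))) {
    gaps <- is.na(kept[, j])
    kept[gaps, j] <- mean(kept[!gaps, j])               # marker-mean imputation
  }
  kept
}

# Toy genotype matrix coded -1/0/1: 4 individuals x 3 markers.
geno <- matrix(c( 1,  0, NA,
                 -1, NA, NA,
                  1,  1, NA,
                  0,  1,  1), nrow = 4, byrow = TRUE)
imputed <- filter_and_impute(geno)   # marker 3 (75% missing) is removed
```

In this toy case the third marker is discarded because three of its four calls are missing, and the single gap in the second marker is replaced by the mean of its observed calls.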
Training set and test set (the ratio can be set according to actual needs, generally 9:1, 8:2 or 6:4)
train <- as.matrix(sample(1:271, 217))
test <- setdiff(1:271, train)
Analyses that do not specify the ratio of training set to test set uniformly use the 8:2 ratio (default).
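A split at an arbitrary ratio can be written as a small function. The sketch below is a minimal base-R illustration; the helper name split_by_ratio, the rounding rule and the fixed seed are assumptions introduced here.

```r
# Hypothetical helper: split individuals 1..n into training and test sets so that
# training : test is approximately train_part : test_part.
split_by_ratio <- function(n, train_part, test_part, seed = 1) {
  set.seed(seed)  # fixed seed only to make the sketch reproducible
  n_train <- round(n * train_part / (train_part + test_part))
  train <- sort(sample(seq_len(n), n_train))
  test <- setdiff(seq_len(n), train)
  list(train = train, test = test)
}

parts <- split_by_ratio(271, 8, 2)   # 8:2 split of 271 individuals
length(parts$train)                  # 217, the same size as sample(1:271, 217)
```

The same function covers the ratios examined later in the text, for example split_by_ratio(647, 5, 1) for a 5:1 split of the comprehensive population.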
Model training: mixed.solve
height_answer <- mixed.solve(y=Height, Z=m_train, K=NULL, SE=FALSE, method="REML", return.Hinv=FALSE)
y: training-set phenotype data
Z: training-set genotype data
K: covariance matrix of the random effects
SE: whether standard errors are returned
method: model fitting method
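To make the role of y and Z concrete, the following base-R sketch fits a ridge-regression BLUP with a fixed shrinkage parameter. It is a simplification and not the mixed.solve implementation: mixed.solve estimates the variance components by REML, whereas here lambda is simply assumed and the intercept is taken as the phenotype mean. The function name rr_blup_fixed and all simulated data are hypothetical.

```r
# Ridge-regression BLUP with an assumed, fixed lambda
# (lambda = residual variance / marker-effect variance).
rr_blup_fixed <- function(y, Z, lambda) {
  mu <- mean(y)                                        # intercept (simple mean, an assumption)
  alpha <- solve(tcrossprod(Z) + lambda * diag(nrow(Z)), y - mu)
  u <- crossprod(Z, alpha)                             # Z'(ZZ' + lambda I)^-1 (y - mu)
  list(beta = mu, u = as.vector(u))
}

set.seed(42)
n <- 60; m <- 200
Z <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)  # simulated genotypes
true_u <- rnorm(m, sd = 0.2)                                   # simulated marker effects
y <- 20 + as.vector(Z %*% true_u) + rnorm(n)                   # simulated phenotype

train <- 1:50; test <- 51:60
fit <- rr_blup_fixed(y[train], Z[train, ], lambda = 25)
pred <- fit$beta + as.vector(Z[test, ] %*% fit$u)              # predicted test phenotypes
```

The predicted values for the test individuals are the intercept plus the genotype matrix multiplied by the estimated marker effects, which is the quantity compared with the measured phenotypes when prediction accuracy is evaluated.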
B) Basic functions of the Bayes original models:
Data input is the same as for the rrBLUP original model.
R package: BGLR (Bayesian Generalized Linear Regression)
ETA <- list(list(X=markers, model='BayesB', probIn=0.05))
ETA: two-level list specifying the regression function (linear predictor)
X: genotype data
model: model selection
Model training:
system.time(fit_BB <- BGLR(y=Height, ETA=ETA, nIter=10000, burnIn=5000, thin=5, saveAt="", df0=5, S0=NULL, weights=NULL, R2=0.5))
y: phenotype data
ETA: two-level list specifying the regression function (linear predictor)
nIter, burnIn, thin: (integer) the number of iterations, burn-in and thinning.
saveAt: path and prefix for the files saved while the program runs
R2: proportion of variance expected, a priori, to be explained by the regression (set before model training).
Construction and application of the natural leaf number (LN) model Bayes CLN among the tobacco yield traits
Using the original prediction models with determined basic parameters (i.e., the candidate prediction models), the whole genome selection model for predicting natural leaf number (LN) phenotypic values among the tobacco yield traits was screened, optimized and verified on the recombinant inbred line population (300 lines), the natural population (347 lines) and the tobacco comprehensive population (647 lines) formed by mixing the two. The specific method is as follows:
First, 50000 high-quality SNP markers uniformly distributed over the whole tobacco genome were used to genotype the tobacco comprehensive population of 647 individuals (lines), yielding the genotype data. Second, the natural leaf number (LN) phenotypic value of each individual (line) in the comprehensive population was measured and recorded at the mature stage, yielding the LN phenotypic data. Third, the genotype data and LN phenotypic data of each individual (line) in the comprehensive population were entered into the 4 candidate prediction models of 2 types (rrBLUP, BayesA, BayesB and BayesC) with determined basic parameters for simulation; the specific values of the 4 core parameters of the prediction model, namely the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4), were screened, optimized and verified, and the whole genome selection model Bayes CLN for predicting natural leaf number (LN) phenotypic values among the tobacco leaf yield traits was finally constructed.
Specifically, the 50000 SNP markers were divided into 10 gradients (1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000) to determine the number of molecular markers (n1) in the Bayes CLN model; the results are shown in graph (a) of FIG. 2. Subsets of 50, 100, 150, 200, 250 and 300 individuals (lines) of the comprehensive population were used as training populations, and the recombinant inbred line population (300 lines), the natural population (347 lines) and the comprehensive population (647 lines) were each also used separately as training populations, to determine the training population size (n2) in the Bayes CLN model; the results are shown in graph (b) of FIG. 2. The 647 lines of the comprehensive population were divided according to 11 gradients of the ratio of the number of tobacco plants in the training population to the number in the test population (1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1) to determine the ratio of training population to test population (n3) in the Bayes CLN model; the results are shown in graph (c) of FIG. 2. The Bayes CLN model with the specific core parameters (n1, n2 and n3) thus obtained was used to predict LN phenotypic values; the predicted LN values were compared with the corresponding true LN values to obtain the prediction accuracy, and the highest prediction accuracy value (n4) was determined. Finally, after the above experimental verification, optimization and practical application, the whole genome selection model Bayes CLN for predicting natural leaf number (LN) phenotypic values among the mature-stage tobacco leaf yield traits was constructed, with the following formula:
Figure BDA0002583901910000161
wherein
Figure BDA0002583901910000162
indicates that the 4 core parameters n1, n2, n3 and n4 are substituted in sequence into the Bayes candidate prediction model.
Similarly, a natural Leaf Number (LN) whole genome selection model was established:
Figure BDA0002583901910000163
Specifically, FIG. 2 shows the whole genome selection model Bayes CLN established for natural leaf number (LN) phenotypic values among the mature-stage tobacco leaf yield traits, using the genotype data and measured phenotypic data of the tobacco comprehensive population (the mixed population of the recombinant inbred line population and the natural population) in combination with bioinformatics.
In FIG. 2, graph (a) shows the effect of the number of molecular markers (n1) on the accuracy with which the Bayes CLN model predicts LN phenotypic values: the abscissa is the number of molecular markers and the ordinate is the prediction accuracy of the Bayes CLN model for LN phenotypic values; 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers, respectively, used to genotype the tobacco comprehensive population.
Graph (b) of FIG. 2 shows the effect of the training population size (n2) on the accuracy with which the Bayes CLN model predicts LN phenotypic values: the abscissa is the training population size (the number of tobacco plants in the training population) and the ordinate is the prediction accuracy of the Bayes CLN model for LN phenotypic values.
Graph (c) of FIG. 2 shows the effect of the ratio of training population to test population (n3) on the accuracy with which the Bayes CLN model predicts LN phenotypic values: the abscissa is the ratio of training population to test population (i.e., number of tobacco plants in the training population : number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the prediction accuracy of the Bayes CLN model for LN phenotypic values.
Graph (d) of FIG. 2 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa shows the 4 original models provided in R packages (Bayes A, Bayes B, Bayes C and rrBLUP); the ordinate is the prediction accuracy of each original model for the LN phenotypic values of the tobacco population under test.
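The gradient screening summarized in graph (a) can be sketched as a loop over marker numbers. The R code below is an illustration on simulated data only: it uses a fixed-lambda ridge regression as a stand-in for the Bayes models, takes prediction accuracy to be the Pearson correlation between predicted and observed values (the text does not define the accuracy measure, so this is an assumption), and uses a scaled-down gradient. The function names ridge_predict and accuracy_for are hypothetical.

```r
# Stand-in predictor: ridge regression with an assumed, fixed lambda.
ridge_predict <- function(y_train, Z_train, Z_test, lambda = 10) {
  mu <- mean(y_train)
  alpha <- solve(tcrossprod(Z_train) + lambda * diag(nrow(Z_train)), y_train - mu)
  u <- crossprod(Z_train, alpha)                    # estimated marker effects
  mu + as.vector(Z_test %*% u)
}

# Evaluate one gradient: draw n_markers at random, fit on train, score on test.
# Accuracy is assumed here to be the Pearson correlation.
accuracy_for <- function(n_markers, y, Z, train, test) {
  cols <- sample(ncol(Z), n_markers)
  pred <- ridge_predict(y[train], Z[train, cols, drop = FALSE],
                        Z[test, cols, drop = FALSE])
  cor(pred, y[test])
}

set.seed(7)
n <- 120; m <- 500                                  # toy sizes, far smaller than 647 x 50000
Z <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
y <- as.vector(Z %*% rnorm(m, sd = 0.1)) + rnorm(n)
train <- sample(n, 100)                             # 5:1 training : test ratio
test <- setdiff(seq_len(n), train)

gradient <- c(10, 20, 40, 70, 110, 160, 220, 290, 370, 500)  # scaled-down n1 gradient
acc <- sapply(gradient, accuracy_for, y = y, Z = Z, train = train, test = test)
best_n1 <- gradient[which.max(acc)]                 # gradient with the highest accuracy
```

The same loop structure applies to the training population size (n2) and the training-to-test ratio (n3): only the quantity that is varied changes, while the accuracy on the held-out individuals is recorded for each gradient.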
Similarly, whole genome selection models were established for the phenotypic values of the other 2 mature-stage tobacco leaf yield traits (LL and WL); the specific models are as follows:
for LL phenotype values, the whole genome selection model was:
Figure BDA0002583901910000171
Specifically, the whole genome selection model Bayes BLL for maximum waist leaf length (LL) phenotypic values among the tobacco yield traits is shown in FIG. 3, as follows:
In FIG. 3, graph (a) shows the effect of the number of molecular markers (n1) on the accuracy with which the Bayes BLL model predicts LL phenotypic values: the abscissa is the number of molecular markers and the ordinate is the prediction accuracy of the Bayes BLL model for LL phenotypic values; 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers, respectively, used to genotype the tobacco comprehensive population.
Graph (b) of FIG. 3 shows the effect of the training population size (n2) on the accuracy with which the Bayes BLL model predicts LL phenotypic values: the abscissa is the training population size (the number of tobacco plants in the training population) and the ordinate is the prediction accuracy of the Bayes BLL model for LL phenotypic values.
Graph (c) of FIG. 3 shows the effect of the ratio of training population to test population (n3) on the accuracy with which the Bayes BLL model predicts LL phenotypic values: the abscissa is the ratio of training population to test population (i.e., number of tobacco plants in the training population : number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the prediction accuracy of the Bayes BLL model for LL phenotypic values.
Graph (d) of FIG. 3 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa shows the 4 original models provided in R packages (Bayes A, Bayes B, Bayes C and rrBLUP); the ordinate is the prediction accuracy of each original model for the LL phenotypic values of the tobacco population under test.
For WL phenotype values, the whole genome selection model was:
Figure BDA0002583901910000181
Specifically, the whole genome selection model Bayes CWL for maximum waist leaf width (WL) phenotypic values among the tobacco yield traits is shown in FIG. 4, as follows:
In FIG. 4, graph (a) shows the effect of the number of molecular markers (n1) on the accuracy with which the Bayes CWL model predicts WL phenotypic values: the abscissa is the number of molecular markers and the ordinate is the prediction accuracy of the Bayes CWL model for WL phenotypic values; 1K, 2K, 4K, 7K, 11K, 16K, 22K, 29K, 37K and All on the abscissa denote 1000, 2000, 4000, 7000, 11000, 16000, 22000, 29000, 37000 and 50000 SNP markers, respectively, used to genotype the tobacco comprehensive population.
Graph (b) of FIG. 4 shows the effect of the training population size (n2) on the accuracy with which the Bayes CWL model predicts WL phenotypic values: the abscissa is the training population size (the number of tobacco plants in the training population) and the ordinate is the prediction accuracy of the Bayes CWL model for WL phenotypic values.
Graph (c) of FIG. 4 shows the effect of the ratio of training population to test population (n3) on the accuracy with which the Bayes CWL model predicts WL phenotypic values: the abscissa is the ratio of training population to test population (i.e., number of tobacco plants in the training population : number of tobacco plants in the test population), where 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6 and 10 denote ratios of 1:5, 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1 and 10:1, respectively; the ordinate is the prediction accuracy of the Bayes CWL model for WL phenotypic values.
Graph (d) of FIG. 4 shows the effect of the different candidate prediction models on the prediction accuracy value (n4): the abscissa shows the 4 original models provided in R packages (Bayes A, Bayes B, Bayes C and rrBLUP); the ordinate is the prediction accuracy of each original model for the WL phenotypic values of the tobacco population under test.
Preferably, in formulas (1), (2) and (3), Bayes CLN is the whole genome selection model for natural leaf number phenotypic values, LN is the natural leaf number and Bayes C is the candidate prediction model; Bayes BLL is the whole genome selection model for maximum waist leaf length phenotypic values, LL is the maximum waist leaf length and Bayes B is the candidate prediction model; Bayes CWL is the whole genome selection model for maximum waist leaf width phenotypic values and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the training population size, n3 is the ratio of the training population to the test population, and n4 is the model prediction accuracy value.
To achieve the above purpose, the invention also provides an application of the tobacco leaf yield prediction method based on whole genome selection: the method is applied to analyze the seedling-stage genotype data of a tobacco population and to predict, across the whole genome, the mature-stage tobacco leaf yield trait phenotypic value data of each plant in the population;
the method is applied to analyze the genotype data of tobacco populations or tobacco varieties and to predict the mature-stage tobacco yield trait phenotypic value data across the whole genome, so that the mature-stage yield trait phenotypic value data are obtained at the tobacco seedling stage.
That is to say, the 3 whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL are used for analyzing the genotype data of the tobacco population at the seedling stage and accurately predicting the phenotypic values of the 3 tobacco yield traits of each plant in the population at the mature stage in the whole genome range.
Preferably, genotype data of tobacco populations or tobacco varieties (lines) are analyzed by using a whole genome selection model BayesCLN, BayesBLL and BayesCWL respectively, and 3 yield trait phenotypic values of tobacco maturity stages are accurately predicted in a whole genome range, so that accurate phenotypic values of 3 tobacco yield traits of maturity stages can be obtained in a seedling stage (early stage).
Namely, the 3 whole genome selection models of the invention are applied to the prediction of the phenotypic value of the yield traits of 3 kinds of tobacco leaves in the mature period by analyzing the genotype data of each plant in the tobacco population or variety (line) in the seedling period (early period).
The application of the whole genome selection models for predicting tobacco leaf yield at the mature stage consists in using the Bayes CLN, Bayes BLL and Bayes CWL models to analyze the seedling-stage (early-stage) genotype data of the tobacco population or variety (line) under test, thereby predicting the phenotypic values of the 3 tobacco leaf yield traits at maturity, namely natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL).
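As an illustration of this application step, once a model has been fitted, the predicted phenotypic value of a seedling can be computed from its genotype as the intercept plus the sum of its marker genotypes weighted by the estimated marker effects. The R sketch below uses made-up toy values throughout; the helper name predict_trait, the intercept and the marker effects are assumptions for illustration and are not results from the patent.

```r
# Hypothetical helper: predicted phenotypic value = intercept + genotypes x marker effects.
predict_trait <- function(Z_seedling, mu, u) {
  mu + as.vector(Z_seedling %*% u)
}

# Toy values only: 3 seedlings x 4 markers, with made-up intercept and effects.
Z_seedling <- matrix(c( 1,  0, -1,  1,
                        0,  0,  1, -1,
                       -1,  1,  1,  0), nrow = 3, byrow = TRUE,
                     dimnames = list(c("plant1", "plant2", "plant3"), NULL))
mu <- 22                       # assumed mean natural leaf number
u  <- c(0.5, -0.2, 0.1, 0.3)   # assumed marker effects

ln_pred <- setNames(predict_trait(Z_seedling, mu, u), rownames(Z_seedling))
ranking <- names(sort(ln_pred, decreasing = TRUE))  # seedlings ordered by predicted LN
```

Ranking seedlings by predicted value in this way is what allows selection at the seedling stage, before the mature-stage traits can be measured in the field.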
In view of the above, the first object of the invention is to provide the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL for predicting tobacco leaf yield at the mature stage; the second object is to provide the application of these models, in which the early-stage (seedling-stage) genotype data of each plant in a tobacco population or variety (line) are analyzed so as to accurately predict the phenotypic values of the 3 yield traits of each plant in the population at the mature stage.
The first object of the invention is achieved as follows: the whole genome selection models for predicting tobacco leaf yield at the mature stage are Bayes CLN, Bayes BLL and Bayes CWL, and their respective core parameter values n1, n2, n3 and n4 are clearly defined.
The second object of the invention is achieved as follows: the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL are applied to analyze the early-stage (seedling-stage) genotype data of each plant in a tobacco population or variety (line), so as to accurately predict the phenotypic values of the 3 yield traits of each plant in the population at the mature stage.
In order to select high-yield tobacco varieties scientifically, efficiently and accurately, and in particular to select progeny tobacco materials with higher yield, the invention provides the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL for predicting tobacco leaf yield at the mature stage. For the recombinant inbred line population (RILs), the natural population and the comprehensive population mixing the two, early-stage genotype data and mature-stage tobacco leaf yield trait phenotypic values were collected and analyzed, and, on the basis of the BayesA, BayesB and BayesC candidate prediction models obtained by preliminary selection, the 4 core parameters of each model, namely the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4), were screened, optimized and verified. The models finally established can be used for whole-genome assisted selection of genes/QTL loci for the 3 mature-stage tobacco leaf yield traits, improving the efficiency of molecular marker-assisted selection and of breeding high-yield tobacco varieties.
On the one hand, a recombinant inbred line (RIL, F7) population was constructed from the flue-cured tobacco varieties Y3 and K326, and a natural population containing 347 different tobacco varieties (lines) was also constructed; these two populations are used to represent all tobacco populations or varieties (lines). On the other hand, starting from the 3 candidate prediction models Bayes A, Bayes B and Bayes C obtained by preliminary selection, the whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL for predicting tobacco leaf yield at the mature stage were further screened, optimized and constructed, accelerating whole-genome molecular marker selection in the breeding of high-yield, high-quality tobacco varieties.
The whole genome selection models Bayes CLN, Bayes BLL and Bayes CWL of the invention for predicting tobacco leaf yield at the mature stage are scientific, efficient, accurate and low-cost, and can be applied to breeding new high-quality tobacco varieties (lines) with ideal yield.
In order to achieve the above object, the present invention further provides a system for predicting tobacco yield based on whole genome selection, as shown in fig. 5, the system specifically includes:
the acquisition unit is used for acquiring the tobacco leaf whole genome data in the candidate prediction model;
the screening unit is used for screening and optimizing the whole genome data of the tobacco leaves in real time;
a generating unit for generating tobacco yield prediction data;
In the system of the invention, the tobacco leaf whole genome data in the candidate prediction model are acquired, and these data are screened and optimized in real time to finally generate the tobacco leaf yield prediction data.
That is, the tobacco leaf yield prediction data are generated by continuously screening, optimizing and verifying, through experiments, the prediction results for mature-stage tobacco leaf yield (namely natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL)) on the basis of the candidate prediction models BayesA, BayesB and BayesC obtained by preliminary screening; the core parameter values, namely the number of molecular markers (n1), the training population size (n2), the ratio of training population to test population (n3) and the model prediction accuracy value (n4), are obtained in the same way.
The acquiring unit further comprises:
the setting module is used for setting core parameters of the candidate prediction model;
the first modeling module is used for establishing a candidate prediction model;
the first screening module is used for primarily screening the whole genome data of the tobacco leaves by combining the candidate prediction model and the core parameters;
in the embodiment of the system, the core parameters of the candidate prediction model are definitely set, the candidate prediction model is established, and the tobacco leaf whole genome data is initially screened by combining the core parameters through the candidate prediction model.
That is, in order to optimize the prediction accuracy for the yield of tobacco leaves, the number of molecular markers (n1), the training population size (n2), the training population-to-test population ratio (n3), the model prediction accuracy value (n4), and other core parameter values of candidate prediction models BayesA, BayesB, BayesC, and rrBLUP are specified.
The screening unit further comprises:
the second modeling module is used for establishing a whole genome selection model;
the verification module is used for verifying core parameters of the tobacco leaf candidate prediction model;
and the second screening module is used for secondarily screening the whole genome data of the tobacco leaves in real time by combining the whole genome selection model and the core parameters.
In this embodiment of the invention, whole genome selection models are established for predicting the tobacco yield traits, namely Bayes CLN, Bayes BLL and Bayes CWL for the 3 traits natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL), respectively. The core parameters of the tobacco leaf candidate prediction model are then verified; the tobacco leaf whole genome data are secondarily screened in real time using the whole genome selection models in combination with the core parameters; and the resulting predictions of mature-stage tobacco leaf yield are continuously screened, optimized and verified to finally form the tobacco leaf yield prediction data, i.e., the tobacco yield trait phenotypic values.
Specifically, the core parameters include: the number of molecular markers, the scale of a training population, the proportion of the training population to a testing population and the prediction accuracy value of the model.
That is, when the prediction results of the yield in the mature period are continuously screened, optimized and verified, the method obtains core parameter values such as the number of molecular markers (n1), the scale of a training population (n2), the proportion of the training population to a test population (n3) and the model prediction accuracy value (n 4).
Specifically, the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
That is, in this embodiment, of the whole genome selection models established for the phenotypic values of the 3 mature-stage tobacco yield traits, namely natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL), the natural leaf number whole genome selection model is Bayes CLN, the maximum waist leaf length whole genome selection model is Bayes BLL, and the maximum waist leaf width whole genome selection model is Bayes CWL.
Preferably, the tobacco yield prediction data includes: natural leaf number phenotype value data, maximum waist leaf length phenotype value data, and maximum waist leaf width phenotype value data.
Specifically, the whole genome selection models for predicting tobacco leaf yield are Bayes CLN, Bayes BLL and Bayes CWL, respectively. The calculation formulas of the phenotypic value data corresponding to these models, and the core parameter values of the models, are the same as in the method and are not repeated here.
In order to achieve the above object, the present invention further provides a platform for predicting tobacco yield based on whole genome selection, as shown in fig. 6, comprising:
a processor, a memory and a control program of the tobacco leaf yield prediction platform based on whole genome selection;
wherein the control program of the platform is stored in the memory and executed on the processor, and implements the steps of the tobacco leaf yield prediction method based on whole genome selection, such as:
s1, acquiring tobacco leaf whole genome data in the candidate prediction model;
s2, screening and optimizing the whole genome data of the tobacco leaves in real time;
and S3, generating tobacco yield prediction data.
The details of the steps have been set forth above and will not be described herein.
In an embodiment of the invention, the built-in processor of the tobacco leaf yield prediction platform based on whole genome selection may consist of integrated circuits, for example a single packaged integrated circuit or a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors and combinations of various control chips. The processor connects the components through various interfaces and lines, and performs the various functions of tobacco leaf yield prediction based on whole genome selection and processes data by running or executing the programs or units stored in the memory and calling the data stored in the memory;
the memory stores program code and various data, is installed in the tobacco leaf yield prediction platform based on whole genome selection, and provides high-speed, automatic access to programs or data during operation.
The memory includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, tape storage, or any other computer-readable medium that can be used to carry or store data.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, as shown in Fig. 7. The computer-readable storage medium stores the control program of the tobacco yield prediction platform based on whole genome selection, and the control program implements the steps of the tobacco yield prediction method based on whole genome selection, for example:
S1, acquiring tobacco leaf whole genome data in the candidate prediction model;
S2, screening and optimizing the tobacco leaf whole genome data in real time;
S3, generating tobacco yield prediction data.
The details of the steps have been set forth above and will not be described herein.
In describing embodiments of the present invention, it should be noted that any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and that the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processing module-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM).
Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
To achieve the above object, the present invention also provides a chip system comprising at least one processor. When program instructions are executed on the at least one processor, the chip system performs the steps of the tobacco yield prediction method based on whole genome selection, for example:
S1, acquiring tobacco leaf whole genome data in the candidate prediction model;
S2, screening and optimizing the tobacco leaf whole genome data in real time;
S3, generating tobacco yield prediction data.
The details of the steps have been set forth above and will not be described herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
With the method, application, system, platform and storage medium for predicting tobacco leaf yield based on whole genome selection, genotype data from tobacco seedlings (the early growth stage) can be used to calculate or simulate the phenotypic values of tobacco leaf yield traits at the mature stage. The operation is convenient, rapid and efficient, and the prediction results are accurate and reliable.
That is, to construct the whole genome selection models, the present invention on the one hand constructed a recombinant inbred line population (RILs, F7) with the flue-cured tobacco varieties Y3 and K326 as parents, and in addition constructed a natural population containing 347 different tobacco varieties (lines); these two different types of populations are representative of tobacco populations and tobacco varieties (lines) as a whole. On the other hand, starting from the Bayes C candidate prediction model obtained by primary screening and combining the measured phenotype values of the tobacco materials, the invention further screened, optimized and constructed the whole genome selection models Bayes C_LN, Bayes B_LL and Bayes C_WL, which predict the phenotype values of three yield traits at the tobacco mature stage: natural leaf number (LN), maximum waist leaf length (LL) and maximum waist leaf width (WL). This accelerates the application of genome-wide molecular marker selection to predicting mature-stage tobacco leaf yield, so that high-yield, high-quality varieties (lines) can be bred scientifically, efficiently, accurately and reliably.
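The core parameters named in this document (number of molecular markers, scale of the training population, proportion of training to test population, and model prediction accuracy value) can be related to one another with a small evaluation routine. The sketch below is only illustrative: it assumes that prediction accuracy is measured as the Pearson correlation between observed and predicted phenotype values in the test population, which is a common convention in genomic selection but is not stated in this section, and it again substitutes ridge regression for the Bayes B and Bayes C models. All names are hypothetical.

```python
import numpy as np

def split_population(n_plants, train_fraction, seed=0):
    """Randomly split plant indices into a training set and a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_plants)
    n_train = int(round(n_plants * train_fraction))
    return order[:n_train], order[n_train:]

def prediction_accuracy(observed, predicted):
    """Assumed accuracy measure: Pearson correlation of observed vs. predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def evaluate(genotypes, phenotypes, n_markers, train_fraction, penalty=1.0, seed=0):
    """Fit on the training set with the first n_markers markers, score on the test set."""
    x = genotypes[:, :n_markers]
    train, test = split_population(len(phenotypes), train_fraction, seed)
    mean_x = x[train].mean(axis=0)
    mean_y = phenotypes[train].mean()
    xc = x[train] - mean_x
    # Ridge regression stand-in for the Bayesian marker-effect estimates.
    effects = np.linalg.solve(xc.T @ xc + penalty * np.eye(n_markers),
                              xc.T @ (phenotypes[train] - mean_y))
    predicted = mean_y + (x[test] - mean_x) @ effects
    return prediction_accuracy(phenotypes[test], predicted)
```

Calling `evaluate` over a grid of `n_markers` and `train_fraction` values is one way such parameters could be compared; a separate run would be made for each trait (LN, LL, WL).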
The embodiments described above represent only several embodiments of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for predicting tobacco leaf yield based on whole genome selection, characterized by comprising the following steps:
acquiring tobacco leaf whole genome data in the candidate prediction model;
screening and optimizing the whole genome data of the tobacco leaves in real time;
and generating tobacco yield prediction data.
2. The method for predicting the yield of the tobacco leaves based on the whole genome selection according to claim 1, wherein acquiring the tobacco leaf whole genome data in the candidate prediction model further comprises the following steps:
setting core parameters of the candidate prediction model;
establishing a candidate prediction model;
and primarily screening the whole genome data of the tobacco leaves by the candidate prediction model and the core parameters.
3. The method for predicting the yield of the tobacco leaves based on the whole genome selection according to claim 1, wherein the method for screening and optimizing the whole genome data of the tobacco leaves in real time further comprises the following steps:
establishing a whole genome selection model;
verifying core parameters of the tobacco leaf candidate prediction model;
and performing real-time secondary screening of the tobacco leaf whole genome data through the whole genome selection model in combination with the core parameters.
4. The method for predicting the yield of the tobacco leaves based on the whole genome selection according to claim 2 or 3, wherein the core parameters comprise: the number of molecular markers, the scale of a training population, the proportion of the training population to a testing population and the prediction accuracy value of the model.
5. The method for predicting the yield of tobacco leaves based on whole genome selection according to claim 3, wherein the whole genome selection model comprises:
a natural leaf number whole genome selection model, a maximum waist leaf length whole genome selection model and a maximum waist leaf width whole genome selection model.
6. The method for predicting the yield of tobacco leaves based on whole genome selection according to claim 1, wherein the tobacco leaf yield prediction data comprises: natural leaf number phenotype value data, maximum waist leaf length phenotype value data, and maximum waist leaf width phenotype value data.
7. The method for predicting the yield of the tobacco leaves based on the whole genome selection according to claim 6, wherein the calculation formula of the natural leaf number phenotype value data is as follows:
Figure FDA0002583901900000021
the calculation formula of the maximum waist leaf length phenotype value data is as follows:
Figure FDA0002583901900000022
the calculation formula of the maximum waist leaf width phenotypic value data is as follows:
Figure FDA0002583901900000023
wherein Bayes C_LN is the whole genome selection model for the natural leaf number phenotype value, LN is the natural leaf number, and Bayes C is a candidate prediction model; Bayes B_LL is the whole genome selection model for the maximum waist leaf length phenotype value, LL is the maximum waist leaf length, and Bayes B is a candidate prediction model; Bayes C_WL is the whole genome selection model for the maximum waist leaf width phenotype value, and WL is the maximum waist leaf width; n1 is the number of molecular markers, n2 is the scale of the training population, n3 is the proportion of the training population to the test population, and n4 is the model prediction accuracy value.
8. An application of the method for predicting tobacco leaf yield based on whole genome selection, characterized in that the method is applied to analyzing seedling-stage genotype data of a tobacco population, and the phenotypic value data of the mature-stage tobacco leaf yield traits of each plant in the population are predicted at the whole genome level;
the method is applied to analyzing the genotype data of tobacco populations or tobacco varieties, and the phenotypic value data of mature-stage tobacco leaf yield traits are predicted at the whole genome level, so that the mature-stage yield trait phenotypic value data are obtained at the tobacco seedling stage.
9. The tobacco yield prediction system based on whole genome selection is characterized by specifically comprising:
the acquisition unit is used for acquiring the tobacco leaf whole genome data in the candidate prediction model;
the screening unit is used for screening and optimizing the whole genome data of the tobacco leaves in real time;
a generating unit for generating tobacco yield prediction data;
the acquiring unit further comprises:
the setting module is used for setting core parameters of the candidate prediction model;
the first modeling module is used for establishing a candidate prediction model;
the first screening module is used for primarily screening the whole genome data of the tobacco leaves by combining the candidate prediction model and the core parameters;
the screening unit further comprises:
the second modeling module is used for establishing a whole genome selection model;
the verification module is used for verifying core parameters of the tobacco leaf candidate prediction model;
and the second screening module is used for secondarily screening the whole genome data of the tobacco leaves in real time by combining the whole genome selection model and the core parameters.
10. A tobacco yield prediction platform based on whole genome selection is characterized by comprising:
a processor, a memory, and a control program for the tobacco yield prediction platform based on whole genome selection;
wherein the control program is stored in the memory and executed on the processor, and implements the steps of the tobacco yield prediction method based on whole genome selection according to any one of claims 1 to 7.
CN202010675520.8A 2020-07-14 2020-07-14 Tobacco leaf yield prediction method based on whole genome selection and application Active CN111898807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675520.8A CN111898807B (en) 2020-07-14 2020-07-14 Tobacco leaf yield prediction method based on whole genome selection and application

Publications (2)

Publication Number Publication Date
CN111898807A true CN111898807A (en) 2020-11-06
CN111898807B CN111898807B (en) 2024-02-27

Family

ID=73191755

Country Status (1)

Country Link
CN (1) CN111898807B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116083622A (en) * 2022-11-22 2023-05-09 云南省烟草农业科学研究院 Genes qLL and qWL related to tobacco leaf types, linked SSR markers and application thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102269570A (en) * 2011-04-28 2011-12-07 中国烟草总公司郑州烟草研究院 Method for measuring maximum length and maximum width of flue-cured tobacco leaf based on canopy multi-spectra
CN105734141A (en) * 2016-03-31 2016-07-06 湖北省烟草科学研究院 Molecular biology method for identifying purity of tobacco varieties
CN106447079A (en) * 2016-08-31 2017-02-22 贵州师范大学 Prediction method for tobacco production of karst mountainous area based on Radarsat-2
WO2017069607A1 (en) * 2015-10-23 2017-04-27 Sime Darby Plantation Sdn. Bhd. Methods for predicting palm oil yield of a test oil palm plant
CN107354203A (en) * 2017-07-10 2017-11-17 中国烟草总公司郑州烟草研究院 Primer for identifying flue-cured tobacco Bi Na 1 combines and kit, application and detection method
CN110610744A (en) * 2019-09-11 2019-12-24 华中农业大学 Efficient whole genome selection method capable of realizing parallel operation and high accuracy
CN111197101A (en) * 2018-11-20 2020-05-26 云南省烟草农业科学研究院 Codominant SSR marker closely linked with tobacco leafy gene mLN and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant