CN117169388A

CN117169388A - Method for predicting optimal harvest period of Chinese yam by using marker metabolite model based on machine learning

Info

Publication number: CN117169388A
Application number: CN202311291786.2A
Authority: CN
Inventors: 安莉; 吴绪金; 陈贺; 李萌; 张迪; 周娟; 马欢; 马婧玮; 李通; 许海康; 李委; 王铁良; 段然; 马莹
Original assignee: Institute Of Agricultural Product Quality And Safety Henan Academy Of Agricultural Sciences
Current assignee: Institute Of Agricultural Product Quality And Safety Henan Academy Of Agricultural Sciences
Priority date: 2023-10-08
Filing date: 2023-10-08
Publication date: 2023-12-05

Abstract

The invention provides a method for predicting the optimal harvest period of Chinese yam using a machine-learning-based marker metabolite model, comprising the following steps: collecting yam samples at different harvest periods and analyzing them by metabolomics to obtain metabolomics data; preprocessing the metabolomics data; selecting features related to the yam growth period using a machine learning algorithm to obtain potential marker metabolites; screening the potential marker metabolites by LASSO regression to construct a marker metabolite prediction model; verifying the constructed model based on the area under the ROC curve; and inputting new yam metabolomics data into the model to obtain a model score, from which it is judged whether the yam is suitable for harvesting. The invention can accurately predict the optimal harvest period of Chinese yam, eliminate subjectivity and dependence on experience, improve scientific rigor, reduce the influence of the external environment, maximize yam yield, and provide reliable technical support for agricultural production.

Description

Method for predicting optimal harvest period of Chinese yam by using marker metabolite model based on machine learning

Technical Field

The invention relates to the technical field of yam harvesting period prediction, in particular to a method for predicting the optimal yam harvesting period based on a marker metabolite model of machine learning.

Background

Yam is a rhizomatous plant of the genus Dioscorea in the family Dioscoreaceae. Its rhizome is the edible part; it is not only a food but also has health-care functions, and it is listed in China as a plant resource used as both medicine and food. Its distinctive nutritional composition and rich bioactive substances give it wide application value in traditional Chinese medicine and modern health care. Yam is rich in amino acids, organic acids, saccharides and other nutrients, and also contains abundant secondary metabolites such as saponins, dioscin, flavonoids, alkaloids and other bioactive compounds. In traditional Chinese medicine, yam is neutral in nature and sweet in taste, and has the effects of enhancing immunity, strengthening the spleen, relieving diarrhea, tonifying the lung, tonifying the kidney and the like. Yam digging usually runs from the end of September, around the Mid-Autumn Festival, to early December, lasting up to three months. Harvesting is an important link in yam production and directly affects quality and yield. However, no research on the correlation between harvest period and changes in the chemical composition of yam has been reported so far, nor has any report identified maturity markers for yam; in actual production, whether yam is mature is still judged by traditional experience. For a resource used as both medicine and food, harvesting at the optimal period ensures quality and efficacy when the yam is used as a medicinal material, and a higher content of nutritional and functional components and a better taste when it is used as food. This can improve the market competitiveness of the yam industry and promote its sustainable and healthy development.

Existing protocols for yam harvest time are generally based on traditional empirical judgment and external observations. The following are some common existing schemes:

growth cycle observation method: the growth period and the collection period of the Chinese yam are judged by periodically observing Chinese yam plants in a Chinese yam planting area, mainly comprising a growth state, a leaf color, a stem state and the like and combining experience of a producer. The method is simple and visual, but the accuracy is influenced by personal experience and subjective judgment of the producer, and certain errors can exist.

Method for observing underground organs: the growth stage and the collection stage of the Chinese yam are judged by digging part of Chinese yam plants and observing the characteristics of the size, shape, color and the like of underground organs (tubers). Compared with a growth period observation method, the method reflects the growth state of the Chinese yam more directly, but is also influenced by personal experience and subjective judgment of a producer.

Growth model prediction method: based on historical planting data and meteorological data of the Chinese yam, a mathematical model is adopted to establish a Chinese yam growth model, and the growth period and the acquisition period of the Chinese yam are predicted. The method is relatively scientific, but a large amount of data is required for building a model, and the method is greatly influenced by external factors such as weather and the like.

It should be noted that the above-mentioned existing solutions have certain limitations, mainly in terms of accuracy and relying on experience judgment. Therefore, a more scientific and accurate prediction method is established, and the method has important promotion effect on the development of the yam planting industry and the improvement of yield and quality.

The prior art has the following disadvantages:

Subjectivity and dependence on experience: traditional methods for judging the yam harvest period often rely on the grower's planting and harvesting experience, or on market price, and therefore lack a scientific basis. As a result, there is no scientific standard for judging the harvest period, prices differ greatly from year to year, and because individual growers differ, judgments made from personal experience also differ.

Lack of data: traditional methods for determining the harvest period typically rely on limited observational data and ignore other potentially important factors, which limits the reliability and accuracy of prediction, especially in the face of complex natural environments and climate change.

Low prediction accuracy: judgments of the yam harvest period based on traditional methods often have large errors, leading to unreasonable harvest times that affect the yield and quality of the yam.

Time and resource waste: the conventional method requires long-term observation and data accumulation, and a large number of trial and error processes, thus resulting in waste of time and resources.

Influence of the external environment: traditional methods do not consider external environmental factors (such as climate and air temperature), although these factors have an important influence on the growth and development of yam, so the accuracy of judgment is limited. Traditional visual inspection can hardly determine the optimal harvest period accurately, so some yam is harvested before reaching its optimal state, which reduces quality and yield.

The chemical components of yam differ between harvest periods, which affects its quality and medicinal value. The maturity of the yam directly influences the accumulation of its metabolites and hence its quality and medicinal value, so a suitable harvest period has an important influence on yam quality and nutrient accumulation. Determining the optimal harvest period helps to improve yam quality and to guarantee its nutritional and medicinal value.

Metabonomics and machine learning are two independent but mutually combined technical fields, and have important popularization and application values for predicting the optimal harvesting period of Chinese yam.

Metabolomics reflects the metabolic characteristics of an organism at different stages of growth by analyzing the overall changes in its metabolites in different growth states. In predicting the yam harvest period, metabolomics can be used to acquire metabolic profiles of yam over different growth periods and to identify specific metabolites associated with the growth period. Yam samples from different growth periods are collected, and their metabolites are extracted and analyzed to obtain full-spectrum metabolite data. By comparing the metabolite compositions of the different growth periods, marker metabolites closely related to the growth period of yam can be found. These marker metabolites can serve as indicators of the growth state of the yam and provide a basis for predicting its optimal harvest period.

Machine learning is a branch of artificial intelligence that enables predictions or decisions for specific tasks by letting computers learn and adapt to the data. In predicting the yam harvest time, machine learning can help build a predictive model relating metabonomic data to the growth period of yam. Feature selection and pattern recognition can be performed on the omics data using PCA, OPLS-DA, LASSO, etc. algorithms. By establishing a marker metabolite model, the selected metabolite characteristics are correlated with the growth period of yam, so that the optimal harvest time of yam can be predicted.

By combining metabolomics and machine learning, metabolite data are collected from yam samples, a prediction model is constructed using a machine learning algorithm, and marker metabolite features related to the yam growth period are identified, so that the yam harvest period can be predicted accurately.

Disclosure of Invention

To address the technical problems of conventional methods for predicting the yam harvest period, namely subjectivity, dependence on experience, low accuracy and strong influence by the external environment, the invention provides a method for predicting the optimal harvest period of Chinese yam using a machine-learning-based marker metabolite model.

In order to achieve the above purpose, the technical scheme of the invention is realized as follows: a method for predicting the optimal harvest time of Chinese yam by using a marker metabolite model based on machine learning comprises the following steps:

step one, data acquisition: collecting yam samples in different harvesting periods, and analyzing the yam samples by using a metabonomics technology to obtain metabonomics data of the yam in different harvesting periods;

step two, data preprocessing: preprocessing the acquired metabolomic data;

step three, selection of potential marker metabolites: selecting characteristics related to the growth period of the Chinese yam from the preprocessed metabolomic data by using a machine learning algorithm to obtain potential marker metabolites;

step four, construction and verification of the prediction model: screening the potential marker metabolites by LASSO regression to construct a marker metabolite prediction model; verifying the constructed marker metabolite prediction model using the area under the ROC curve as the evaluation index;

step five, model application: inputting new yam metabolomics data into the verified marker metabolite prediction model to obtain a model score, and judging whether the yam is suitable for harvesting according to the model score.

Preferably, the number of yam samples in each harvesting period is not less than 15;

before the metabolomics analysis, the yam sample is freeze-dried, an extraction solvent is added, ultrasonic extraction is carried out, the mixture is centrifuged, and the supernatant is collected and passed through a filter membrane to obtain the injection solution;

the metabolomics analysis acquires metabolite fingerprints from the injection solutions of the yam samples from the different harvest periods.

Preferably, the metabolite fingerprints are acquired using a chromatograph-mass spectrometer. The chromatographic conditions are: mobile phase flow rate 0.2-0.4 mL·min⁻¹, column temperature 35-45 °C, and injection volume 2-4 μL. The mass spectrometric conditions are: electrospray ion source temperature 500-600 °C; ion spray voltage 4500-6500 V in positive ion mode and -4000 to -6000 V in negative ion mode; ion source gas I, gas II and curtain gas set to 50, 60 and 25 psi, respectively; and the collision-induced ionization parameter set to high.

Preferably, the preprocessing includes data cleansing, outlier removal and data normalization.

Preferably, the selection of the potential marker metabolite is performed under an R software platform comprising the steps of:

(1) performing baseline filtering, peak identification, retention time correction, peak alignment and mass spectrum fragment structure analysis on data acquired by a mass spectrum, and converting the spectrogram data into two-dimensional matrix data;

(2) performing principal component analysis of an unsupervised mode on all the two-dimensional matrix data, finding out differences among groups, and regrouping;

(3) redefining a new group by using a supervised orthogonal partial least square discriminant analysis method to find out potential marker metabolites.

Preferably, the potential marker metabolites are selected by the variable importance (VIP) value, the significance P value and the fold change value; variables with VIP > 1.0, P < 0.05 and |log2(fold change)| > 1 are taken as the screened potential marker metabolites.

Preferably, the marker metabolite prediction model is obtained by analyzing the potential marker metabolites with LASSO regression. The LASSO regression is analyzed through its cross-validation curve and coefficient path: the penalty is chosen so that the cross-validated mean square error is minimized, and irrelevant potential marker metabolites are eliminated. The optimal penalty value is determined by the lowest point of the cross-validation curve; the metabolites whose coefficients are non-zero on the coefficient path at that optimal penalty are the marker metabolites, and the data of these marker metabolites are fitted to construct the marker metabolite prediction model.

Preferably, the area under the ROC curve is larger than 0.95, and the accuracy, sensitivity and specificity of the marker metabolite prediction model are good.

Preferably, the marker metabolite selected is allantoin, 5-oxo-L-proline, 4-hydroxymandelonitrile, L-methionine, 6, 7-dihydroxy-2, 4-dimethoxyphenanthrene, N-feruloylagmatine, glucosyl syringic acid.

Preferably, 7 marker metabolites screened are used as indexes, and a marker metabolite prediction model constructed by the LASSO regression method is as follows:

model score = -117.2510
  + allantoin × (-2.7340)
  + 5-oxo-L-proline × (-0.8350)
  + 4-hydroxymandelonitrile × 5.5810
  + L-methionine × 1.7373
  + 6,7-dihydroxy-2,4-dimethoxyphenanthrene × 0.4416
  + N-feruloylagmatine × 1.9960
  + glucosyl syringic acid × (-0.7759);

wherein model score is a model score;

threshold judgment is carried out on the model score, and when the model score is greater than zero, the Chinese yam sample is suitable for harvesting; when the model score is less than zero, the yam sample is not suitable for harvesting.
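The score formula and threshold rule above can be sketched in a few lines of Python. This is only an illustration: the intercept and coefficients are those stated in the text, while the function names, the dictionary layout, and the assumption that each input is the normalized peak value of the named metabolite are hypothetical. The text does not say how a score of exactly zero is handled; the sketch treats it as not suitable.

```python
# Intercept and coefficients as stated in the text.
INTERCEPT = -117.2510
COEFFICIENTS = {
    "allantoin": -2.7340,
    "5-oxo-L-proline": -0.8350,
    "4-hydroxymandelonitrile": 5.5810,
    "L-methionine": 1.7373,
    "6,7-dihydroxy-2,4-dimethoxyphenanthrene": 0.4416,
    "N-feruloylagmatine": 1.9960,
    "glucosyl syringic acid": -0.7759,
}


def model_score(levels):
    """Linear score: intercept plus coefficient times level for each marker.

    `levels` maps metabolite name -> measured value (assumed here to be the
    normalized peak value; the scale is not specified in the text).
    """
    missing = sorted(set(COEFFICIENTS) - set(levels))
    if missing:
        raise KeyError(f"missing marker metabolites: {missing}")
    return INTERCEPT + sum(c * levels[name] for name, c in COEFFICIENTS.items())


def suitable_for_harvest(levels):
    """Threshold rule from the text: score > 0 means suitable for harvesting.

    A score of exactly 0 is not covered by the text; it is treated as
    not suitable here (an assumption).
    """
    return model_score(levels) > 0
```

Keeping the coefficients in one dictionary makes it easy to check that all seven markers are present before a score is computed.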

Compared with the prior art, the invention has the beneficial effects that: the method solves the problems of subjectivity, insufficient data, low prediction precision and the like of the traditional Chinese yam harvesting period judging method, provides a more scientific, accurate and reliable judging method for Chinese yam growers, provides an advanced and efficient harvesting period judging solution for the Chinese yam industry, and promotes sustainable development and modernization processes of the Chinese yam industry; the invention can provide scientific and reasonable harvesting period for Chinese yam growers, ensure higher nutrition quality while improving the yield of Chinese yam, increase the economic benefit of Chinese yam growers, improve the market competitiveness of Chinese yam industry and promote the sustainable healthy development of Chinese yam industry. Has the following advantages:

eliminating subjectivity and experience dependence: establishing a prediction model based on metabolite characteristics by adopting a machine learning algorithm; the machine learning algorithm can learn and generalize rules from a large amount of data, eliminates subjectivity and experience dependence of the traditional judging method, and enables a Chinese yam producer to obtain a more scientific and accurate harvesting period through the support of scientific data and algorithms instead of relying on personal experience.

Accuracy is improved: the metabolic analysis technology is adopted, so that the metabolite composition information of the Chinese yam in different growth periods can be comprehensively obtained, and the problem that the traditional method lacks scientific data support is avoided; the machine learning algorithm combines metabolite feature selection and pattern recognition, and can more accurately predict the optimal harvesting period of the Chinese yam, so that the harvesting time is optimized, and the yield and quality of the Chinese yam are improved.

The external environment influence is less: when the prediction model is built, factors related to external environment fluctuation are removed through reasonably screening the characteristics, so that the stability of prediction is improved, and the accuracy of prediction is ensured; the Chinese yam producer can more reliably collect Chinese yam according to the prediction result, and the yield and quality of Chinese yam are improved to the greatest extent.

Maximizing productivity: the advantage of the machine learning algorithm based on the metabolite characteristics is fully utilized, so that the Chinese yam is ensured to be harvested in the optimal harvesting period, the yield of the Chinese yam is increased, the quality of the Chinese yam is improved, and the economic benefit of crops is improved; in addition, accurate prediction of the optimal harvesting period of the Chinese yam is also beneficial to reasonably planning the planting period and making a production plan, so that resources are utilized to the greatest extent, and the overall economic benefit of the Chinese yam planting industry is improved.

In summary, the invention provides a scientific and accurate prediction method, which eliminates subjectivity and experience dependence, improves scientificity, reduces external environment influence, realizes maximization of yam productivity, and provides reliable technical support for agricultural production.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a total ion chromatogram of a yam sample of the present invention.

Fig. 2 is a PCA score plot of the present invention.

FIG. 3 is a graph of the OPLS-DA score of the present invention.

FIG. 4 is a diagram of a potential differential metabolite screening process of the present invention, wherein A is VIP >1.0, B is P < 0.05 and |log2 (fold change) | >1, and C is the intersection of the three.

FIG. 5 is a graph of LASSO regression according to the present invention, wherein A is a cross validation curve and B is a LASSO coefficient path graph.

FIG. 6 shows the results of ROC curve analysis according to the present invention.

FIG. 7 is a box plot of scores for the predictive model of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

Example 1

As shown in fig. 1, a method for predicting the optimal harvest period of Chinese yam using a machine-learning-based marker metabolite model comprises the following steps:

Step one, data acquisition: first, yam samples from different harvest periods are collected and analyzed by metabolomics to obtain metabolomics data of the yam at the different harvest periods. The number of yam samples collected in each harvest period is not less than 10.

The metabolomics data are acquired by collecting metabolite fingerprints of the yam samples from the different harvest periods. The chromatographic conditions are: mobile phase flow rate 0.2-0.4 mL·min⁻¹, column temperature 35-45 °C, and injection volume 2-4 μL. The mass spectrometric conditions are: electrospray ion source temperature 500-600 °C; ion spray voltage 4500-6500 V in positive ion mode and -4000 to -6000 V in negative ion mode; ion source gas I, gas II and curtain gas set to 50, 60 and 25 psi, respectively; and the collision-induced ionization parameter set to high.

Step two, data preprocessing: preprocessing the acquired metabolomic data, including data cleaning, abnormal value removal, normalization and other processing steps, so as to ensure the quality and consistency of the data.
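As a rough illustration of the three preprocessing steps, the following Python sketch uses only the standard library. The concrete rules are assumptions, since the text does not specify them: cleaning drops metabolite columns containing missing values, outliers are clipped to the 1.5 × IQR fences (clipping stands in for removal here), and normalization is a z-score per metabolite. All function names are hypothetical.

```python
from statistics import mean, quantiles, stdev


def clean(matrix):
    """Data cleaning (assumed rule): keep metabolite columns with no None values."""
    n_cols = len(matrix[0])
    keep = [j for j in range(n_cols) if all(row[j] is not None for row in matrix)]
    return [[row[j] for j in keep] for row in matrix], keep


def clip_outliers(column, k=1.5):
    """Outlier handling (assumed rule): clip to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(column, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in column]


def zscore(column):
    """Normalization (assumed): mean 0 and unit sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s if s > 0 else 0.0 for v in column]


def preprocess(matrix):
    """Rows are samples, columns are metabolites; returns (matrix, kept columns)."""
    cleaned, kept = clean(matrix)
    columns = [zscore(clip_outliers(list(col))) for col in zip(*cleaned)]
    return [list(row) for row in zip(*columns)], kept
```

Processing each metabolite column independently keeps the sketch simple; a real pipeline might also filter samples or apply a different scaling.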

Step three, feature selection of metabolite markers: features related to the yam growth period are selected from the preprocessed metabolomics data using metabolomics and machine learning algorithms to obtain potential differential (marker) metabolites. These features are used as inputs for building the marker metabolite prediction model.

The selection of the characteristics related to the yam growth period is implemented under an R software platform and is carried out by adopting the following steps:

(1) the data acquired by mass spectrometry are processed by baseline filtering, peak identification, retention time correction, peak alignment, mass spectrum fragment structure analysis and the like, and the spectral data are converted into two-dimensional matrix data for the analysis of potential differential metabolites;

(2) unsupervised principal component analysis (Principal Component Analysis, PCA) is performed on all data to find differences between groups, and the samples are regrouped;

(3) the redefined new groups are analyzed using supervised orthogonal partial least squares discriminant analysis (Orthogonal Partial Least Squares Discriminant Analysis, OPLS-DA) to find potential marker metabolites.

PCA is a dimensionality-reduction algorithm that maps high-dimensional data into a low-dimensional space, thereby reducing the number of features. OPLS-DA is a multivariate statistical analysis method for finding discriminating metabolite features. Features related to the yam growth period are selected according to the variable importance in projection (Variable Importance in Projection, VIP) value, the significance P value and the fold change value. Variables with VIP > 1.0, P < 0.05 and |log2(fold change)| > 1 are the important features screened by the model and the final potential marker metabolites.

Among the screening criteria for significantly different metabolites, the VIP value is, in addition to fold change and P value, a very important indicator for identifying differential metabolites. VIP is the variable weight value of an OPLS-DA model variable; it measures the strength of influence and explanatory power of each metabolite's accumulation difference on the classification and discrimination of the sample groups. VIP ≥ 1 is a common screening standard for differential metabolites.
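The three-way screening rule can be written as a simple filter. The sketch below is illustrative only: it assumes the VIP values and P values have already been computed elsewhere (for example by the OPLS-DA and significance testing steps), and the function names and input layout are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means (group_b relative to group_a)."""
    return log2(mean(group_b) / mean(group_a))


def is_potential_marker(vip, p_value, log2_fc):
    """Screening rule from the text: VIP > 1.0, P < 0.05, |log2(fold change)| > 1."""
    return vip > 1.0 and p_value < 0.05 and abs(log2_fc) > 1.0


def screen(stats):
    """stats maps metabolite name -> (vip, p_value, log2_fc); returns passing names."""
    return [
        name
        for name, (vip, p_value, log2_fc) in stats.items()
        if is_potential_marker(vip, p_value, log2_fc)
    ]
```

A metabolite must satisfy all three conditions at once, which corresponds to taking the intersection of the three candidate sets.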

Step four, construction of the prediction model: after feature selection, a marker metabolite prediction model is constructed using the LASSO (Least Absolute Shrinkage and Selection Operator) regression algorithm, which learns from the input metabolite features the association between those features and the optimal harvest period of the yam.

The characteristic marker model of the optimal yam harvest period is established from the potential marker metabolites screened by LASSO regression analysis. The LASSO regression is analyzed through its cross-validation curve and coefficient path; irrelevant features are eliminated, the target metabolites are screened out, and a marker metabolite prediction model is constructed to predict the optimal harvest period of the yam.

LASSO is a linear regression method that can be used for feature selection, because it shrinks the coefficients of irrelevant features to zero. Applying this machine learning algorithm gives the metabolite model efficient and accurate feature selection and pattern recognition capability.
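To make the shrink-to-zero behaviour concrete, here is a minimal LASSO fit by cyclic coordinate descent in plain Python. It is a sketch under stated assumptions, not the solver used in the invention (the text names only the R platform): the objective is taken as (1/2n)·RSS + λ·Σ|β| with an unpenalized intercept, the function names are hypothetical, and choosing λ by cross-validation is not shown.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Fit LASSO by cyclic coordinate descent.

    Minimizes (1/(2n)) * sum((y - b0 - X @ beta)^2) + lam * sum(|beta|).
    X is a list of rows, y a list of targets. Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (zj / n) if zj > 0 else 0.0
        # The intercept is not penalized: set it to the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta
```

With a feature that carries no information about the target, the soft-thresholding step returns exactly 0.0 for its coefficient, which is why non-zero coefficients can be read off as the selected markers.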

Step five, verification and optimization of the prediction model: to verify and optimize the established marker metabolite prediction model, the area under the curve (Area Under the Curve, AUC) of the receiver operating characteristic (Receiver Operating Characteristic, ROC) curve is used as the evaluation index. ROC AUC is commonly used in classification model evaluation to measure the performance and discriminating power of a classification model.

A ROC curve is a tool for evaluating the performance of a classification model; it shows the behavior of the model at different thresholds, with the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis. AUC is the area under the ROC curve and measures the overall performance of the classification model. The closer the AUC is to 1, the better the model performs. In the invention, using AUC as the evaluation index allows the classification accuracy and robustness of the model to be comprehensively evaluated and optimized.
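The construction of a ROC curve and its AUC can be sketched as follows. This is an illustrative implementation with hypothetical function names: the threshold is swept over every distinct score, a sample is called positive when its score is at or above the threshold, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    labels are 1 (positive, e.g. suitable for harvest) or 0 (negative).
    The threshold is swept from the highest score downward.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

When every positive sample scores higher than every negative sample, the curve passes through the top-left corner and the AUC equals 1.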

The marker metabolite prediction model built from the characteristic markers screened by LASSO regression is evaluated by the area under the ROC curve (AUC); the corresponding AUC is greater than 0.95, indicating that the accuracy, sensitivity and specificity of the model are good.

Step six, model application: after establishing a prediction model passing verification, the method can be applied to actual production of the Chinese yam. Inputting new metabonomics data of the Chinese yam into the prediction model which passes verification to obtain model scores, and judging whether the Chinese yam is suitable for harvesting or not according to the model scores.

When new metabolomics data are input, the model outputs a value, the "model score", that represents how the yam sample relates to the optimal harvest period. The model score is thresholded: a model score greater than zero means that the yam sample is suitable for harvesting; a model score less than zero means that it is not. Farmers can scientifically judge the optimal harvest period from the model score, improving the quality of the yam while improving its yield.

Example 2

In this method for predicting the optimal harvest period of Chinese yam using a machine-learning-based marker metabolite model, yam samples were collected at six different harvest periods: September (S1), October (S2), October (S3), November (S4), November (S5) and December (S6), with 15 samples taken in each period. After freeze-drying, every five samples were uniformly mixed to form one yam analysis sample and passed through a 150-mesh sieve, giving 3 yam analysis samples per period. A quality control (QC) sample was prepared by mixing equal amounts (5 g each) of the yam analysis samples from the 6 periods. The three yam analysis samples of each period and the quality control sample were weighed; 5 mL of 70% ethanol solution was added per 1.0 g, followed by ultrasonic extraction for 70 min and centrifugation at 12000 r/min for 5 min; the supernatant was collected and filtered through a 0.22 μm filter membrane to obtain the sample solution.

The metabolite fingerprints of the yam samples from the different harvest periods were acquired using a chromatograph-mass spectrometer. The liquid chromatography conditions were: C18 column (Agilent SB-C18, 1.8 μm, 2.1 mm × 100 mm); the aqueous phase of the mobile phase was ultrapure water containing 0.05% formic acid and the organic phase was acetonitrile containing 0.05% formic acid; the elution gradient was water/acetonitrile 95:5 (V/V) at 0 min, 5% aqueous phase at 9.0 min, 5% aqueous phase at 10.0 min, 95% aqueous phase at 11.1 min and 95% aqueous phase at 14.0 min; the mobile phase flow rate was 0.2 mL·min⁻¹, the column temperature was 40 °C and the injection volume was 2 μL. The mass spectrometric conditions were: electrospray ion source temperature 550 °C; ion spray voltage 4500 V in positive ion mode and -5500 V in negative ion mode; ion source gas I, gas II and curtain gas set to 50, 60 and 25 psi, respectively; and the collision-induced ionization parameter set to high. The resulting total ion chromatogram is shown in fig. 1.
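The elution gradient can be read as a small time table. The sketch below encodes the five time points given in the text and interpolates between them; the linear ramp between listed points is an assumption (the text gives only the time points), and the function names are hypothetical.

```python
# Gradient time points from the text: (time in min, % aqueous phase).
GRADIENT = [(0.0, 95.0), (9.0, 5.0), (10.0, 5.0), (11.1, 95.0), (14.0, 95.0)]


def aqueous_percent(t):
    """Percent aqueous phase at time t (min), assuming linear ramps."""
    if not GRADIENT[0][0] <= t <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-14 min gradient program")
    for (t0, a0), (t1, a1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t <= t1:
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)


def organic_percent(t):
    """Percent organic phase (acetonitrile with 0.05% formic acid) at time t."""
    return 100.0 - aqueous_percent(t)
```

Read this way, the program ramps to 95% organic phase over the first 9 min, holds for 1 min, returns to the starting composition by 11.1 min and then holds until 14 min.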

After the fingerprints acquired by mass spectrometry were processed by baseline filtering, peak identification, retention time correction, peak alignment, mass spectrum fragment structure analysis and the like, they were converted into two-dimensional matrix data for the analysis of potential differential metabolites. The PCA score plot is shown in FIG. 2; the 6 groups of samples clearly separate into two broad categories, S1-S3 and S4-S6. The QC samples also cluster tightly, which indicates that the instrument is stable and the method is repeatable.

FIG. 3 is the OPLS-DA score plot, which clearly shows that the data of the two new groups are well separated. Potential differential metabolites were then screened by VIP > 1.0, P < 0.05 and |log2(fold change)| > 1; as shown in FIG. 4, a total of 41 metabolites met all three conditions, and these are the important features and final potential differential metabolites screened by OPLS-DA. Preferably, potential differential metabolites are screened by jointly considering the VIP value, the significance P value and the fold change value; further preferably, variables with VIP > 1.0, P < 0.05 and |log2(fold change)| > 1 are the important features screened by the model and the final potential differential metabolites.

Characteristic metabolites were screened from the 41 differential metabolites using LASSO regression, a machine learning algorithm. The smaller the cross-validation mean square error on the cross-validation curve, the better the LASSO fit, so the penalty value λ is determined from the lowest point of the cross-validation curve. The penalty compresses to 0 the coefficients of variables that have no significant influence on the prediction result. When the penalty value λ was 0.005, the corresponding mean square error reached its minimum, and 7 characteristic metabolites were obtained: allantoin, 5-oxo-L-proline, 4-hydroxymandelonitrile, L-methionine, 6,7-dihydroxy-2,4-dimethoxyphenanthrene, N-feruloylagmatine and glucosyl syringic acid. A LASSO regression model was constructed with the 7 screened metabolites as indexes. The model formula obtained is as follows:

Model score = -117.2510
+ allantoin × (-2.7340)
+ 5-oxo-L-proline × (-0.8350)
+ 4-hydroxymandelonitrile × 5.5810
+ L-methionine × 1.7373
+ 6,7-dihydroxy-2,4-dimethoxyphenanthrene × 0.4416
+ N-feruloylagmatine × 1.9960
+ glucosyl syringic acid × (-0.7759)

The harvest period is then judged according to the calculated model score:

when model score <0, harvest is not appropriate;

when model score >0, harvesting is appropriate.
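The formula and decision rule above can be written as a short Python sketch. The intercept and coefficients are copied from the model formula; the function names, the assumption that the inputs are already preprocessed and normalized metabolite intensities, and the example input values are hypothetical. The text does not say how a score of exactly 0 is treated, so the sketch assumes it counts as not suitable.

```python
INTERCEPT = -117.2510

# Coefficients copied from the model score formula in the description.
COEFFICIENTS = {
    "allantoin": -2.7340,
    "5-oxo-L-proline": -0.8350,
    "4-hydroxymandelonitrile": 5.5810,
    "L-methionine": 1.7373,
    "6,7-dihydroxy-2,4-dimethoxyphenanthrene": 0.4416,
    "N-feruloylagmatine": 1.9960,
    "glucosyl syringic acid": -0.7759,
}

def model_score(intensities):
    """Weighted sum of the 7 marker metabolite values plus the intercept."""
    missing = set(COEFFICIENTS) - set(intensities)
    if missing:
        raise KeyError(f"missing marker metabolites: {sorted(missing)}")
    return INTERCEPT + sum(
        coef * intensities[name] for name, coef in COEFFICIENTS.items()
    )

def harvest_decision(score):
    """Score > 0: suitable. Score exactly 0 is an assumption (treated as not suitable)."""
    return "suitable for harvest" if score > 0 else "not suitable for harvest"

# Hypothetical inputs, not measurements from the invention.
baseline = {name: 0.0 for name in COEFFICIENTS}
example = dict(baseline)
example["4-hydroxymandelonitrile"] = 30.0

for label, sample in (("baseline", baseline), ("example", example)):
    score = model_score(sample)
    print(label, round(score, 4), harvest_decision(score))
```

Keeping the coefficients in a dictionary keyed by metabolite name makes a missing marker fail loudly instead of silently contributing 0 to the score.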

Evaluation of the model: 10 samples of yam not yet suitable for harvest (yam grown for 9 months) and of yam suitable for harvest (yam grown for 12 months) were collected, and mass spectrometry and data normalization were performed according to the above method. The peaks of the 7 marker metabolites were then extracted and substituted into the Model score formula of the prediction model, and the Model score of each group was calculated. The model was evaluated by the area under the ROC curve (AUC), which was found to be 1, as shown in FIG. 6. The Model score values for yam not suitable for harvest were all less than 0, while the Model score values for yam suitable for harvest were all greater than 0, as shown in FIG. 7. The results show that the accuracy of the model is 100% and its specificity is good; the model has high stability and predictive capability and can reliably identify whether the Chinese yam meets the harvesting condition.
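As an illustration of why scores that are all below 0 for one group and all above 0 for the other give an AUC of 1, the sketch below computes the AUC as the fraction of (suitable, unsuitable) pairs in which the suitable sample has the higher score, with ties counting one half. The function name roc_auc and all score values are hypothetical; they are not the measurements behind FIG. 6 or FIG. 7.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the probability that a positive sample outscores a negative one."""
    if not scores_positive or not scores_negative:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical model scores with perfect separation around 0.
suitable = [0.6, 2.3, 1.1, 4.0, 0.2]
unsuitable = [-3.2, -1.5, -0.4, -2.8, -0.9]
auc_separated = roc_auc(suitable, unsuitable)

# Hypothetical overlapping scores, for contrast.
auc_overlap = roc_auc([1.0, -0.5], [0.0, -1.0])

print(auc_separated, auc_overlap)
```

This pairwise form is equivalent to the area under the ROC curve and avoids having to build the curve explicitly for small sample sets.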

Example 3

A method for predicting the optimal harvest period of Chinese yam based on a machine learning marker metabolite model: 10 batches of Chinese yam samples of unknown growth time were collected as research objects; after freeze-drying, every five samples were uniformly mixed to obtain one yam analysis sample; 5 mL of 70% ethanol solution was added per 1.0 g of sample, followed by ultrasonic extraction for 70 minutes and centrifugation at 12000 r/min for 5 minutes; the supernatant was taken and filtered through a 0.22 μm filter membrane to obtain the injection sample.

Metabolite fingerprints were collected using a chromatograph-mass spectrometer. The liquid chromatography conditions were as follows: C18 column (Agilent SB-C18, 1.8 μm, 2.1 mm × 100 mm); the aqueous phase of the mobile phase was ultrapure water containing 0.05% formic acid, and the organic phase was acetonitrile containing 0.05% formic acid; the elution gradient was 95% aqueous phase at 0 min (water/acetonitrile 95:5, V/V), 5% aqueous phase at 9.0 min, 5% aqueous phase at 10.0 min, 95% aqueous phase at 11.1 min and 95% aqueous phase at 14.0 min; the mobile phase flow rate was 0.2 mL·min⁻¹, the column temperature was 40 ℃, and the injection volume was 2 μL. The mass spectrometry conditions were as follows: the electrospray ion source temperature was set to 550 ℃, the ion spray voltage was set to 4500 V in positive ion mode and -5500 V in negative ion mode, ion source gas I, gas II and the curtain gas were set to 50, 60 and 25 psi, respectively, and the collision-induced ionization parameter was set to high.

The fingerprints acquired by mass spectrometry were subjected to baseline filtering, peak identification, retention time correction and peak alignment, and the data were standardized and normalized. The metabolite profile data for allantoin, 5-oxo-L-proline, 4-hydroxymandelonitrile, L-methionine, 6,7-dihydroxy-2,4-dimethoxyphenanthrene, N-feruloylagmatine and glucosyl syringic acid were extracted and substituted into the Model score formula. The Model score formula is:

Model score = -117.2510
+ allantoin × (-2.7340)
+ 5-oxo-L-proline × (-0.8350)
+ 4-hydroxymandelonitrile × 5.5810
+ L-methionine × 1.7373
+ 6,7-dihydroxy-2,4-dimethoxyphenanthrene × 0.4416
+ N-feruloylagmatine × 1.9960
+ glucosyl syringic acid × (-0.7759)

The harvest period was judged according to the calculated model score, with the following results:

6 batches had a model score < 0, indicating that the yam from these 6 plots was not yet suitable for harvesting.

4 batches had a model score > 0, indicating that the yam from these 4 plots could be harvested.

The invention establishes a marker metabolite model for predicting the optimal harvest period of Chinese yam by adopting several machine learning algorithms such as PCA, OPLS-DA and LASSO. Relying on these algorithms eliminates subjectivity and dependence on experience and improves accuracy and stability. The machine learning model provided by the invention can rapidly predict the harvest period of Chinese yam by analyzing its metabolite data in a short time. Compared with traditional time-consuming methods, the method provided by the invention offers an efficient and rapid means of judgment. The invention is suitable for harvesting Chinese yam in large-scale production and provides a practical and reliable tool for judging the harvesting time point. Yam is an important cash crop; accurately predicting and reasonably managing its optimal harvest time can improve yield and quality and reduce production cost, so the technology of the invention has practical application value for agricultural production.

Metabonomics is a technique for studying the overall composition and variation of all metabolites in an organism. Analysis of the metabolites in a sample yields a large amount of metabolic profile data, from which the state and changes of the metabolic network in the organism can be understood. Machine learning is a class of artificial intelligence algorithms that automatically build models and make predictions by learning from and generalizing over large amounts of data. In the invention, the metabonomics technology provides data on the metabolite composition of Chinese yam during growth, and the machine learning algorithm constructs a prediction model from these data, so that the optimal harvest period of Chinese yam can be predicted accurately.

Metabolite feature selection method: in addition to the LASSO regression method employed in the present invention, other feature selection algorithms, such as random forest and recursive feature elimination (RFE), may also be considered. Each method has advantages and disadvantages, and a suitable method can be selected according to the actual data.
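To make the LASSO selection step concrete, the following is a minimal sketch of choosing the penalty value λ by cross-validated mean square error and reading off the variables whose coefficients are not compressed to 0. It is an illustrative stand-in written in pure Python, not the implementation used in the invention (the claims state that the selection is performed on an R software platform). The function names, the λ grid, the fold scheme, the continuous response and the synthetic data are all assumptions.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_lasso(X, y, lam, n_iter=100):
    """Minimise (1/2n)*sum(residual^2) + lam*sum(|beta_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]  # centred columns
    resid = [v - y_mean for v in y]  # residual when all coefficients are 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # a constant column carries no information
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def cv_mse(X, y, lam, k=5):
    """Mean square error over k interleaved cross-validation folds."""
    n = len(X)
    total = 0.0
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        b0, beta = fit_lasso([X[i] for i in train_idx],
                             [y[i] for i in train_idx], lam)
        for i in test_idx:
            pred = b0 + sum(b * v for b, v in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
    return total / n

# Synthetic data (assumption): only features 0 and 1 influence the response.
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(60)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0.0, 0.1) for row in X]

lambdas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
best_lam = min(lambdas, key=lambda lam: cv_mse(X, y, lam))  # lowest point of the CV curve
intercept, beta = fit_lasso(X, y, best_lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(best_lam, selected)
```

The soft-threshold step is what sets small coefficients exactly to 0, which is how LASSO performs feature selection rather than only shrinking coefficients.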

In the invention, several machine learning algorithms such as PCA, OPLS-DA and LASSO are adopted to construct the metabolite model. Many other machine learning algorithms can also be used for model construction, such as decision trees, support vector machines and neural networks. Different algorithms may differ in how well they process the data and recognize patterns, so trying different algorithms may be considered in order to obtain better predictive performance.

In addition to the AUC employed in the present invention as a model performance evaluation index, other indices may be used for model verification and optimization, such as accuracy, recall, F1-score, and the like. Different indexes are suitable for evaluating different problems, and the proper indexes can be selected according to specific conditions.
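The alternative indexes named above can be computed directly from predicted and true classes. The sketch below assumes class 1 means suitable for harvest and class 0 means not suitable, with the predicted class taken from the sign of the model score; the function name, scores and labels are hypothetical and are not results from the invention.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for binary labels (1 = suitable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical model scores and true classes.
scores = [2.1, 0.8, 1.5, -0.3, 0.4, -1.2, -2.0, -0.7]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1 if score > 0 else 0 for score in scores]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Unlike the AUC, these indexes depend on the decision threshold (here, a score of 0), so they complement the AUC rather than replace it.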

Metabolite data used in the present invention may be obtained from a particular experiment or sample, and other sources of data, such as data in public databases or other laboratory collected data, are contemplated. Different data sources may have an impact on the training and predictive results of the model.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. A method for predicting the optimal harvest time of Chinese yam by using a marker metabolite model based on machine learning is characterized by comprising the following steps:

step two, data preprocessing: preprocessing the acquired metabolomic data;

2. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of claim 1, wherein no fewer than 15 yam samples are selected for each harvest period;

and the metabonomics data are obtained by injecting the yam samples from the different harvest periods and collecting their metabolite fingerprints.

3. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of claim 2, wherein the metabolite fingerprints are acquired using a chromatograph-mass spectrometer; the chromatographic conditions of the chromatograph-mass spectrometer are: the mobile phase flow rate is 0.2-0.4 mL·min⁻¹, the column temperature is 35-45 ℃, and the injection volume is 2-4 μL; the mass spectrometry conditions of the chromatograph-mass spectrometer are: the electrospray ion source temperature is set to 500-600 ℃, the ion spray voltage in positive ion mode is set to 4500-6500 V, the ion spray voltage in negative ion mode is set to -4000 to -6000 V, ion source gas I, gas II and the curtain gas are set to 50, 60 and 25 psi, respectively, and the collision-induced ionization parameter is set to high.

4. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model according to any one of claims 1-3, wherein said preprocessing comprises data cleaning, outlier removal and data normalization.

5. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of claim 4, wherein said selection of potential marker metabolites is performed on an R software platform, comprising the steps of:

6. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of claim 5, wherein said potential marker metabolites are selected according to the importance ranking VIP value, the difference significance P value and the fold change value; the variables with VIP > 1.0, P < 0.05 and |log2(fold change)| > 1 are the screened potential marker metabolites.

7. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of any one of claims 1-5, wherein the marker metabolite prediction model is constructed by analyzing the potential marker metabolites with LASSO regression: the cross-validation mean square error of the cross-validation curve is minimized, and irrelevant potential marker metabolites are eliminated by analysis of the cross-validation curve and the variable coefficient path; the data of the remaining marker metabolites are then fitted to construct the marker metabolite prediction model.

8. The method for predicting the optimal harvest period of Chinese yam based on the marker metabolite model of claim 7, wherein the area under the ROC curve is greater than 0.95, and the marker metabolite prediction model has good accuracy, sensitivity and specificity.

9. The method for predicting the optimal harvest period of yam based on a machine learning marker metabolite model of claim 8, wherein the selected marker metabolites are allantoin, 5-oxo-L-proline, 4-hydroxymandelonitrile, L-methionine, 6,7-dihydroxy-2,4-dimethoxyphenanthrene, N-feruloylagmatine and glucosyl syringic acid.

10. The method for predicting the optimal harvest period of yam based on the machine learning marker metabolite model according to claim 9, wherein the marker metabolite prediction model constructed by the LASSO regression method, with the 7 screened marker metabolites as indexes, is:

Model score = -117.2510
+ allantoin × (-2.7340)
+ 5-oxo-L-proline × (-0.8350)
+ 4-hydroxymandelonitrile × 5.5810
+ L-methionine × 1.7373
+ 6,7-dihydroxy-2,4-dimethoxyphenanthrene × 0.4416
+ N-feruloylagmatine × 1.9960
+ glucosyl syringic acid × (-0.7759);

wherein model score is a model score;