CN117034774A

CN117034774A - Construction method of high-accuracy straw enzymolysis polysaccharide yield prediction model

Info

Publication number: CN117034774A
Application number: CN202311049518.XA
Authority: CN
Inventors: 田雨时; 杨旭; 陈年华; 杨武霖; 刘欣玥; 崔昕瞳
Original assignee: Northeast Agricultural University
Current assignee: Northeast Agricultural University
Priority date: 2023-08-21
Filing date: 2023-08-21
Publication date: 2023-11-10

Abstract

The application discloses a method for constructing a high-accuracy prediction model for the polysaccharide yield of straw enzymolysis, comprising the following steps: collecting a data set and preprocessing it to obtain a machine learning data set and a deep neural network data set; constructing an extreme gradient boosting (XGB) model and inputting the machine learning data set into it for training to obtain a machine learning prediction model; constructing a neural network model and inputting the deep neural network data set into it for training to obtain a deep neural network enzymolysis polysaccharide yield prediction model; and predicting the yield of straw enzymolysis polysaccharide based on the prediction models. XGB has the highest prediction accuracy, 95.6%; the prediction accuracies of the random forest (RF) and deep neural network (DNN) models are slightly lower, at 93.0% and 91.1%, respectively. The interpretability analysis of the XGB model quantifies, for the first time, the contribution of each input variable to the polysaccharide yield prediction. The method can also be extended to the resource utilization of other kinds of biomass.

Description

Construction method of high-accuracy straw enzymolysis polysaccharide yield prediction model

Technical Field

The application belongs to the field of artificial intelligence technology and corn straw recycling, and particularly relates to a construction method of a high-accuracy straw enzymolysis polysaccharide yield prediction model.

Background

Corn straw is one of the main agricultural wastes in China, and its efficient resource utilization has long been a difficult problem in the field of agricultural resource utilization. Hemicellulose, one of the main components of corn straw, is a heteropolymer composed of monosaccharides including xylose, arabinose, mannose and galactose. Converting this long-chain polymer into oligomeric polysaccharides by enzymolysis is an important route for the resource utilization of corn straw hemicellulose. The polysaccharides have various biological activities, such as anti-inflammatory, antioxidant and anti-tumor activity, have low toxicity, and are potential substitutes for related drugs.

However, complex interactions between the enzymolysis parameters (such as temperature, pH, enzyme addition amount, time and substrate concentration) make it difficult to predict polysaccharide yield accurately and to optimize production. It is therefore necessary to construct an enzymolysis polysaccharide yield prediction model with high prediction accuracy to assist production. Traditional multivariate statistical methods cannot capture the interdependence among variables in high-dimensional data, so their regression predictions are poor. By comparison, data-driven modeling based on artificial intelligence techniques such as machine learning and deep learning has attracted attention for its efficiency and accuracy, but the appropriate machine learning and deep learning modeling methods differ from one application scenario to another.

Disclosure of Invention

The application aims to provide a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, so as to solve the problems in the prior art.

In order to achieve the above purpose, the application provides a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, which comprises the following steps:

collecting a data set, preprocessing the data set, and obtaining a machine learning data set and a deep neural network data set;

constructing an extreme gradient boosting model, and inputting the machine learning data set into the extreme gradient boosting model for training to obtain a machine learning enzymolysis polysaccharide yield prediction model;

constructing a neural network model, inputting the deep neural network data set into the neural network model for training, and obtaining a deep neural network enzymolysis polysaccharide yield prediction model;

and predicting the yield of the straw enzymolysis polysaccharide based on the machine learning enzymolysis polysaccharide yield prediction model and the deep neural network enzymolysis polysaccharide yield prediction model.

Preferably, the process of obtaining a machine learning dataset comprises:

collecting a data set based on a straw enzymolysis polysaccharide reaction system;

carrying out correlation analysis on the data set to obtain a correlation relationship among variables;

removing outliers from the data set after the correlation analysis to obtain an outlier-removed data set;

and carrying out stratified sampling on the outlier-removed data set to obtain the machine learning data set.

Preferably, the linear correlation between variables in the correlation relationship among the variables is expressed as:

$$\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\rho_{xy}$ represents the Pearson correlation value between the two variables $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are the mean values of the respective variables.

Preferably, the acquiring process of the deep neural network data set includes:

and carrying out standardization processing on the data in the data set to obtain the deep neural network data set.

Preferably, the dataset comprises input variables and output variables;

the input variables include enzyme addition amount, time, temperature, substrate concentration, and pH;

the output variable comprises polysaccharide content.

Preferably, the process of obtaining a machine-learning enzymatic polysaccharide yield prediction model comprises:

dividing the machine learning data set to obtain a training set and a testing set;

constructing an extreme gradient boosting model, and inputting the training set into the extreme gradient boosting model for training to obtain a training model;

and inputting the test set into the training model for performance evaluation to obtain the machine learning enzymolysis polysaccharide yield prediction model.

Preferably, the deep neural network data set is divided into a training set and a validation set based on a cross-validation method;

constructing a deep neural network, and inputting the training set into the deep neural network for training to obtain a neural network training model;

inputting the validation set into the trained deep neural network for testing, and obtaining a test result;

and carrying out parameter adjustment on the neural network training model based on the test result to obtain the deep neural network enzymolysis polysaccharide yield prediction model.

The application has the technical effects that:

The application addresses the current lack of a modeling method that accurately predicts the yield of high-value target products prepared from enzymatically hydrolyzed biomass, and provides a method for constructing a high-accuracy product yield prediction model. Taking polysaccharide production by xylan enzymolysis of corn straw as an example, the application designs, for this scenario, the acquisition and preprocessing of a machine learning data set, the modeling methods of several machine learning models, and the data preprocessing and modeling method of an effective deep neural network model, and successfully constructs several enzymolysis polysaccharide yield prediction models with high prediction accuracy. The modeling method comprises four machine learning models (LR, Tree, RF, XGB) and an independently designed deep learning model (DNN). The prediction accuracy of XGB is 95.6%, and those of the RF and DNN models are slightly lower, at 93.0% and 91.1%, respectively. The interpretability analysis of the XGB model quantifies, for the first time, the contribution of each input variable to the polysaccharide yield prediction (enzyme addition amount: 43.7%, enzymolysis time: 20.7%, substrate concentration: 15%, temperature: 15%, pH: 5.6%). The method can also be extended to the resource utilization of other kinds of biomass.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:

FIG. 1 is a flow chart of a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model in the embodiment of the application;

FIG. 2 is a schematic diagram of a Pearson Correlation Coefficient (PCC) matrix of all data in an embodiment of the application;

FIG. 3 is a descriptive statistical box plot of target variables in a dataset in an embodiment of the application;

FIG. 4 is a stratified sampling histogram based on enzyme solution volume (ESV) in an embodiment of the application;

FIG. 5 is a schematic diagram of a decision tree model in an embodiment of the present application;

FIG. 6 is a schematic diagram of a random forest model in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an XGB model in an embodiment of the application;

FIG. 8 is a schematic diagram of the DNN model structure according to an embodiment of the present application;

FIG. 9 is a feature importance hierarchical clustering chart of XGB model input variables in an embodiment of the application;

FIG. 10 is a graph showing the predictive accuracy of Polysaccharide Yield (PY) by a predictive model in an embodiment of the application.

Detailed Description

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.

Example 1

As shown in fig. 1, the embodiment provides a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, which comprises the following steps:

The method for constructing the straw enzymolysis polysaccharide yield model comprises three parts: data set collection, machine learning model construction and deep learning model construction. In selecting model types for the machine learning and deep learning parts, model complexity is increased gradually, from linear to nonlinear and from low to high degrees of ensembling, so as to obtain a polysaccharide yield prediction model with high prediction accuracy. Following this strategy, the embodiment constructs one linear and three nonlinear machine learning models, and designs a new deep neural network model.

(1) Data set collection:

The data set, which contains five input variables (enzyme addition amount, time, temperature, substrate concentration, pH) and one output variable (polysaccharide yield), was collected from the straw enzymolysis polysaccharide reaction system; 179 data points were collected as the raw data set for modeling. Descriptive statistics of the raw data are given in Table 1. The mean and standard deviation of each variable show the distribution pattern of the collected data, and the minimum and maximum values give the range of each parameter. The quartiles, together with the minimum and maximum values, give a fuller picture of the distribution.

TABLE 1

(2) Machine learning model construction:

1) Data preprocessing (including three steps):

The first step: as shown in FIG. 2, the process starts with Pearson Correlation Coefficient (PCC) analysis. Correlation analysis is typically used to determine the statistical association between two or more variables and to analyze the strength and direction of that association. All input variables in the data set are continuous, so the linear correlation between two variables $x$ and $y$ can be measured by the Pearson correlation coefficient:

$$\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of the two variables. The sign of the coefficient defines the type of correlation between the variables, and its magnitude indicates the degree of linear influence of one variable on the other. The PCC between enzyme solution volume (ESV) and polysaccharide yield (PY) was higher than 0.5, with a value of 0.61, indicating a moderate positive correlation. ESV was therefore selected as the stratified sampling category. No obvious correlation was observed among the remaining variables (-0.5 < PCC < 0.8).
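The Pearson analysis described above can be sketched in a few lines. This is a minimal illustration, not the applicant's code: the function name `pearson` and the sample values are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical illustrative values, not data from the application.
esv = [50, 120, 180, 260, 340, 390]      # enzyme solution volume, uL
py = [1.1, 1.9, 2.0, 3.2, 3.1, 4.0]      # polysaccharide yield
rho = pearson(esv, py)
```

For a full data table, `pandas.DataFrame.corr()` would give the same coefficient for every pair of columns, which is one way a matrix like the one in FIG. 2 could be produced.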

The second step: outlier removal. Outliers are data points that lie far from most of the sample. Such points typically exhibit unreasonable characteristics in the data set, and ignoring them can lead to incorrect conclusions in machine learning modeling, so they must be identified and removed. Common ways of identifying outliers include graphical methods (e.g., box plots, normal distribution plots) and modeling methods (e.g., linear regression, clustering algorithms). In the present application, outliers are identified with the box plot method, which uses the quartiles of the data and is widely applied in academic research. A data value larger than the upper whisker or smaller than the lower whisker of the box plot is treated as an outlier, with the whiskers placed at 1.5 times the interquartile range (IQR) above the upper quartile and below the lower quartile.
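A minimal sketch of the box plot rule follows, assuming the conventional whisker positions of 1.5 times the interquartile range beyond the quartiles. The function name `remove_boxplot_outliers` and the sample values are hypothetical.

```python
import numpy as np

def remove_boxplot_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; return them with the bounds."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (v >= lower) & (v <= upper)
    return v[mask], (lower, upper)

# Hypothetical yields; 9.5 lies far above the rest of the sample.
py = [2.1, 2.4, 2.2, 2.8, 3.0, 2.6, 9.5]
kept, (lower, upper) = remove_boxplot_outliers(py)
```

In a tabular data set the same boolean mask would be applied to whole rows, so that the inputs of a removed target value are dropped with it.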

The third step: as shown in FIGS. 3-4, the outlier-removed data set is named the sub-dataset, and stratified sampling is then performed on it. The training set and the test set used to build the machine learning models are obtained by sampling from the sub-dataset. Common sampling methods include random sampling, stratified sampling, cluster sampling and systematic sampling. Considering the sample size of the data set, this study used stratified sampling to form the training and test sets. According to the Pearson Correlation Coefficient (PCC), the input variable ESV (enzyme solution volume) was the variable most strongly correlated with the target variable PY (polysaccharide yield), so the ideal training set and test set should each include all types of ESV. Four ESV types were created using Pandas: type 1, 0-100 μL; type 2, 100-200 μL; type 3, 200-300 μL; type 4, 300-400 μL. 80% of the sub-dataset samples were assigned to the training set, and the remaining 20% formed the test set used for final evaluation of the developed models.
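The stratified 80:20 split can be sketched with Pandas as follows. The column names (`esv`, `py`, `esv_type`) and the synthetic data are assumptions made for illustration; only the four bin edges come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sub-dataset (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "esv": rng.uniform(1, 400, size=200),   # enzyme solution volume, uL
    "py": rng.uniform(0, 5, size=200),      # polysaccharide yield
})

# Bin ESV into the four types: (0,100], (100,200], (200,300], (300,400].
bins = [0, 100, 200, 300, 400]
df["esv_type"] = pd.cut(df["esv"], bins=bins, labels=[1, 2, 3, 4]).astype(int)

# Draw 20% of every ESV type for the test set; the rest is the training set.
test = df.groupby("esv_type").sample(frac=0.2, random_state=0)
train = df.drop(test.index)
```

An equivalent split could be obtained with scikit-learn's `train_test_split(..., stratify=df["esv_type"])`; the groupby form is used here only to keep the sketch to one library.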

2) Linear Regression (LR) model construction:

A multiple linear regression (LR) model was applied to study the relationship between the variables. A linear model implements its prediction function by learning a linear combination of the attributes, as shown in formula (1):

$$f(x) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b \tag{1}$$

Linear models have many advantages, such as simplicity of form and ease of modeling. Assuming a linear correlation between the target variable and the input variables, the multiple linear regression (LR) model predicts the value of the target variable from a linear combination of the input variables. In the present application, the sub-dataset is divided into a training set and a test set at a ratio of 80:20. The linear regression model can capture only linear relationships between variables and cannot represent nonlinear relationships, such as hierarchical relationships. The present application therefore uses the LR model as a baseline control for the nonlinear models.
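As an illustration of the linear model f(x) = w·x + b, the weights and intercept can be fitted by ordinary least squares. The sketch below uses synthetic, noise-free data and hypothetical weights; it is not the applicant's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 5))             # five hypothetical input variables
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.5])   # hypothetical weights
y = X @ true_w + 0.3                            # noise-free synthetic target, b = 0.3

# Append a column of ones so the intercept b is fitted together with the weights.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

y_hat = X @ w + b                               # predictions of the fitted model
```

Because the synthetic target is exactly linear, the fit recovers the weights; on real enzymolysis data with nonlinear effects the same model would leave systematic residuals, which is why it serves only as a baseline.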

3) Decision Tree (Tree) model construction:

Since the decision tree model is the base model of the other two tree-based models (random forest and extreme gradient boosting), it is constructed first among the nonlinear models. A decision tree consists of nodes and directed edges, and each layer corresponds to a sample feature. The nodes include internal nodes, each representing a feature or attribute, and leaf nodes, each representing a class (FIG. 5). The core principle of the decision tree model is that similar inputs produce similar outputs; the model has low computational complexity and is considerably more interpretable than the linear regression model. In constructing the decision tree model in this study, the sub-dataset was divided in the same way as for the linear regression model. The embodiment uses grid search to optimize the hyperparameter max_depth (the maximum depth of the tree), with ten-fold cross-validation on the training set.
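A grid search over max_depth with ten-fold cross-validation might look like the sketch below. It assumes scikit-learn is the library in use, which the text does not state; the depth range 1 to 10, the RMSE scoring and the synthetic data are likewise assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data standing in for the training set (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, size=150)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},   # assumed search range
    scoring="neg_root_mean_squared_error",          # higher (less negative) is better
    cv=10,                                          # ten-fold cross-validation
)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_
```

After fitting, `search.best_estimator_` is the tree refitted on the whole training set with the selected depth.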

4) Building a random forest model:

Although the prediction accuracy of the decision tree model is higher than that of the linear regression model, it may be lower than that of an ensemble learning model. Ensemble learning can greatly improve on the prediction accuracy of a single tree by combining multiple decision trees. The standard ensemble learning algorithms are Bagging and Boosting.

The random forest model is a typical Bagging algorithm: each weak learner is trained on a training set formed by randomly sampling the raw data with replacement (FIG. 6). This study built the RF model with the same procedure as the LR model. The average RMSE over the validation subsets is obtained by cross-validation and used to evaluate the model. The final regression result of the RF model is the average of the results of the base models.
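The two points above, evaluation by cross-validated RMSE and prediction as the average over the base trees, can be checked with a short sketch. scikit-learn, the number of trees, the five folds and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 5))            # hypothetical inputs
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, size=120)

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# Average RMSE over the validation folds (scores are negated RMSE values).
scores = cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error")
mean_rmse = -scores.mean()

# The forest's prediction is the mean of the predictions of its trees.
rf.fit(X, y)
per_tree = np.stack([tree.predict(X[:3]) for tree in rf.estimators_])
forest_pred = rf.predict(X[:3])
```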

5) Extreme gradient boosting (XGB) model construction:

After the RF model is constructed, the present embodiment constructs the XGB model (FIG. 7). The construction and evaluation methods are the same as for the RF model. XGB is a typical machine learning model based on the Boosting algorithm, whose core idea is to promote weak learners into a strong learner. It differs from the Bagging algorithm in two ways: 1) in Bagging, all weak learners have the same weight in the final result, whereas Boosting assigns greater weight to weak learners whose predictions are more accurate after each round of training; 2) after each training round, Boosting changes the probability distribution of the training set, increasing the weight of samples that the weak learner mispredicted in the previous round and decreasing the weight of samples that were predicted correctly.
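To make the boosting idea concrete, the sketch below implements plain gradient boosting for squared error, in which each new shallow tree is fitted to the residuals of the current ensemble. It is a simplified stand-in and not the XGBoost library: XGBoost adds second-order gradient information and regularization on top of this scheme. All names, hyperparameters and data here are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit an additive ensemble of shallow trees by boosting on residuals."""
    base = float(np.mean(y))                    # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                     # negative gradient of 0.5 * squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees

def predict_boosting(model, X):
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2

model = fit_boosting(X, y)
mse_baseline = float(np.mean((y - y.mean()) ** 2))
mse_boosted = float(np.mean((y - predict_boosting(model, X)) ** 2))
```

Each round corrects the errors left by the previous rounds, so the samples that are currently predicted worst have the largest influence on the next tree.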

Machine learning models are generally regarded as black box models. Although their prediction accuracy is good, it is difficult to determine the contribution of each input variable to the predicted target variable, which creates a trust risk when a model is applied in an actual business scenario. Interpretability analysis also deepens insight into the model, helps iterate on the model and its features, and supports the later development of optimization algorithms. In this study, SHAP was used to perform an interpretability analysis of the developed model and to explore the relationships between variables. Conventional feature importance measures how important each feature is across a data set, but its conclusions can vary greatly from model to model, and it cannot account for the effect of each feature on each individual prediction. Without consistency, a feature with a high assigned attribution is not necessarily one the model depends on more, so attribution importance cannot be compared between two arbitrary models. SHAP is regarded as the only consistent, individualized feature-attribution method. Through SHAP, the model's conclusions are attributed both globally and for individual predictions, so that the black box algorithm can be understood, the interactions of the input variables observed, and the feature importance analyzed. The results are presented visually, and the SHAP dependence plot can be used as a substitute for the traditional partial dependence plot and the accumulated local effects plot. Feature importance hierarchical clustering plots and density scatter plots are used for macroscopic analysis of each feature, and the distribution heat map is used for macroscopic analysis of the model as a whole.
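SHAP attributions are Shapley values. For a model with only a few inputs they can be computed exactly by enumerating feature coalitions, which the following sketch does for a hypothetical toy model. This is not the `shap` library and not the applicant's model; replacing absent features with a fixed baseline value is a simplifying assumption.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attribution of f(x) - f(baseline) to each feature.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                s = set(s)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Relative contribution of each feature, as a share of total absolute attribution.
share = [abs(p) / sum(abs(q) for q in phi) for p in phi]
```

The attributions always sum to f(x) minus f(baseline), and the interaction term is split equally between the two features involved. Averaging absolute attributions over many samples and normalizing is one common way to obtain percentage contributions of the kind reported for the XGB model.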

(3) Deep Neural Network (DNN) model construction:

The process starts with data standardization. If data with widely different numerical ranges are fed into a neural network, the network may be able to adapt to them, but learning becomes harder. A common solution in practice is to standardize each feature: for each input variable, the mean of the variable is subtracted from every value and the result is divided by the standard deviation. Standardization removes the effect of differing scales, improves the accuracy of the deep learning model, and allows faster convergence during gradient descent. The mean and standard deviation used to standardize both the training set and the test set are calculated from the training set data only.
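A minimal sketch of this standardization follows, with statistics taken from the training set and reused unchanged for the test set. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_standardizer(train):
    """Column-wise mean and standard deviation of the training data."""
    return train.mean(axis=0), train.std(axis=0)

def standardize(data, mean, std):
    """Subtract the training mean and divide by the training standard deviation."""
    return (data - mean) / std

rng = np.random.default_rng(0)
scales = np.array([400.0, 48.0, 60.0, 10.0, 9.0])     # hypothetical feature ranges
X_train = rng.uniform(0, 1, size=(100, 5)) * scales
X_test = rng.uniform(0, 1, size=(25, 5)) * scales

mean, std = fit_standardizer(X_train)
X_train_z = standardize(X_train, mean, std)
X_test_z = standardize(X_test, mean, std)              # reuses the training statistics
```

Computing the statistics on the training set alone keeps information about the test set from leaking into the model.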

The second step is to build the Deep Neural Network (DNN). The model constructed in this study (FIG. 8) contains three fully connected (Dense) layers of 48 neurons each; a relatively small network structure helps reduce overfitting. The activation function gives the neural network its nonlinear character. The ReLU function alleviates the vanishing gradient problem, is simple to compute, and creates sparsity that helps prevent overfitting, so all hidden layers use ReLU as the activation function. L1 regularization is also added to the two hidden layers; in L1 regularization, the added cost is proportional to the absolute value of the weight coefficients. The DNN model is compiled with the 'Adam' optimizer, with 'MSE' as the loss function and 'MAE' monitored throughout training.
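The forward pass and the L1-regularized loss can be sketched without a deep learning framework. The layer layout used here (two ReLU hidden layers of 48 units followed by a single-unit linear output) is one reading of the description and is an assumption, as are the initialization and the penalty strength; training with Adam is not shown.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_layer(rng, n_in, n_out):
    # He-style initialization (an assumption; the text does not specify one).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def forward(params, X):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(X @ W1 + b1)            # hidden layer 1, 48 units
    h2 = relu(h1 @ W2 + b2)           # hidden layer 2, 48 units
    return (h2 @ W3 + b3).ravel()     # linear output: predicted yield

def loss(params, X, y, l1=1e-3):
    mse = np.mean((forward(params, X) - y) ** 2)
    # L1 penalty: proportional to the absolute values of the hidden-layer weights.
    penalty = l1 * sum(np.abs(W).sum() for W, _ in params[:2])
    return mse + penalty

rng = np.random.default_rng(0)
params = [init_layer(rng, 5, 48), init_layer(rng, 48, 48), init_layer(rng, 48, 1)]
X = rng.normal(size=(8, 5))           # eight standardized samples, five inputs
y = rng.normal(size=8)
y_hat = forward(params, X)
```

In Keras the same structure would be expressed with `Dense` layers and a kernel regularizer; the NumPy form is used only so that the sketch has no framework dependency.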

The k-fold cross-validation method is used to split validation sets from the training set, so that the DNN model can be evaluated while its parameters are adjusted. Finally, a callback function is registered before training and is invoked during training. The number of epochs needed to reach the lowest validation loss cannot be predicted before the neural network model is trained. This problem can be solved with the ModelCheckpoint callback of the Keras library, which keeps the best model weights obtained during the whole training process. The ModelCheckpoint parameters are set to: monitor='val_mae', save_best_only=True.
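The two mechanisms in this step can be imitated without Keras: partitioning the training samples into k folds, and keeping the weights from the epoch with the lowest monitored validation value. The class `BestCheckpoint` below only mimics the keep-the-best behaviour and is not the Keras API; the per-epoch values are made up.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs that partition the samples into k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

class BestCheckpoint:
    """Remember the weights from the epoch with the lowest monitored value."""

    def __init__(self):
        self.best_value = float("inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, value, weights):
        if value < self.best_value:
            self.best_value, self.best_epoch = value, epoch
            self.best_weights = np.copy(weights)

splits = list(kfold_indices(n_samples=20, k=4))

# Hypothetical validation MAE per epoch; the minimum is at epoch index 2.
val_mae_per_epoch = [0.90, 0.55, 0.41, 0.43, 0.47]
ckpt = BestCheckpoint()
for epoch, val_mae in enumerate(val_mae_per_epoch):
    ckpt.update(epoch, val_mae, weights=np.full(3, float(epoch)))
```

Later epochs with a higher validation MAE leave the stored weights untouched, which is the effect of `save_best_only=True`.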

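The loss function of the deep learning model is the mean square error (MSE), and the mean absolute error (MAE) is also monitored during training, as in equations (4) and (5). The coefficient of determination (R²) and the root mean square error (RMSE) are used as statistical indicators for evaluating the models, as described in equations (2) and (3): the higher the R² and the lower the RMSE, the higher the model accuracy. For each training run, the smaller the MSE and MAE, the better the trained model.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2} \tag{3}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \tag{4}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{5}$$

where $\hat{y}_i$, $y_i$ and $\bar{y}$ are the predicted value, the true value and the average of the true values, respectively, and $n$ is the total number of data points.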
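The four indicators named above (R², RMSE, MSE and MAE) follow their standard definitions and can be computed directly. The sketch below uses hypothetical values, and the function names are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    return float(np.sqrt(mse(y_true, y_pred)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical true and predicted yields.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = {
    "R2": r2(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
}
```

For these values the residual sum of squares is 0.10 and the total sum of squares is 5.0, so R² is 0.98, MSE is 0.025 and MAE is 0.15.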

In the production of straw enzymatic polysaccharide, polysaccharide yield is often affected by many complex nonlinear input variables, such as: temperature, enzyme addition amount, time, pH, etc. These input variables can affect the efficiency of polysaccharide production from enzymatic straw and thus polysaccharide yield. In addition, the complex interactions between the different input variables also make the optimization of enzymatic polysaccharide production very difficult. The traditional multivariable research method is difficult to fully mine out the nonlinear influence of different input variables on polysaccharide yield, so that the problem of poor prediction effect of the constructed yield prediction model is caused.

The application addresses the current lack of a modeling method that accurately predicts the yield of high-value target products prepared from enzymatically hydrolyzed biomass, and provides a method for constructing a high-accuracy product yield prediction model. Taking polysaccharide production by xylan enzymolysis of corn straw as an example, the application designs, for this scenario, the acquisition and preprocessing of a machine learning data set, the modeling methods of several machine learning models, and the data preprocessing and modeling method of an effective deep neural network model, and successfully constructs several enzymolysis polysaccharide yield prediction models with high prediction accuracy. The modeling method comprises four machine learning models (LR, Tree, RF, XGB) and an independently designed deep learning model (DNN). The prediction accuracy of XGB is 95.6%, and those of the RF and DNN models are slightly lower, at 93.0% and 91.1%, respectively. The interpretability analysis of the XGB model (FIG. 9) quantifies, for the first time, the contribution of each input variable to the polysaccharide yield prediction (enzyme addition amount: 43.7%, enzymolysis time: 20.7%, substrate concentration: 15%, temperature: 15%, pH: 5.6%). The method can also be extended to the resource utilization of other kinds of biomass.

Evaluation of model prediction accuracy: the prediction accuracy of the five constructed models was evaluated on the test set (FIG. 10). On the training set, the R² values of the LR, Tree, RF, XGB and DNN models were 0.514, 0.979, 0.972, 0.999 and 0.987, respectively. The values obtained on the test data were 0.514 (LR), 0.879 (Tree), 0.930 (RF), 0.956 (XGB), and 0.911 (DNN). Thus, the prediction of XGB is significantly better than that of the other four models, which is further supported by XGB having the lowest RMSE value (only 0.328) (Table 2). The higher R² values on the training data than on the test data indicate that some overfitting may exist. However, the R² of RF, XGB and DNN exceeded 0.9 on both the training set and the test set, so no severe overfitting occurred in these models.
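The comparison in this paragraph can be reproduced from the reported R² values alone. The sketch below only rearranges those numbers; the gap between training and test R² is used here as a rough overfitting indicator, which is an interpretive choice rather than something the application defines.

```python
# R² values as reported in the text above.
r2_train = {"LR": 0.514, "Tree": 0.979, "RF": 0.972, "XGB": 0.999, "DNN": 0.987}
r2_test = {"LR": 0.514, "Tree": 0.879, "RF": 0.930, "XGB": 0.956, "DNN": 0.911}

# Train-minus-test gap per model; a larger gap suggests more overfitting.
gap = {m: round(r2_train[m] - r2_test[m], 3) for m in r2_train}

best_on_test = max(r2_test, key=r2_test.get)
largest_gap = max(gap, key=gap.get)
above_090_on_both = [m for m in r2_train if r2_train[m] > 0.9 and r2_test[m] > 0.9]
```

The single decision tree shows the largest gap (0.100), while the three models that stay above 0.9 on both sets are RF, XGB and DNN, matching the conclusion drawn in the text.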

Compared with the LR model, the prediction accuracy of the Tree, RF, XGB and DNN models is markedly higher, which further confirms that the relationships among the variables are mainly nonlinear. The prediction accuracy of the DNN model is better than that of the LR and Tree models but slightly lower than that of the RF and XGB models. This shows that a deep neural network is suitable for this research scenario and that its prediction performance is comparable to that of the other nonlinear machine learning models, which provides a new approach to finding the optimal polysaccharide yield prediction model. Furthermore, such a deep neural network model can be combined with an optimization algorithm to find combinations of input variables that increase the target variable (polysaccharide yield) with less research time and production cost.

TABLE 2

The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

1. The method for constructing the high-accuracy straw enzymolysis polysaccharide yield prediction model is characterized by comprising the following steps of:

2. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for obtaining a machine learning data set comprises:

3. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 2, wherein the expression for obtaining the correlation relationship between variables is:

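$$\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\rho_{xy}$ represents the Pearson correlation value between the two variables $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are the mean values of the respective variables.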

4. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for acquiring the deep neural network data set comprises the following steps:

5. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the data set comprises an input variable and an output variable;

the output variable comprises polysaccharide content.

6. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for obtaining the machine learning enzymolysis polysaccharide yield prediction model comprises:

7. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, which is characterized in that,

dividing the deep neural network data set into a training set and a validation set based on a cross-validation method;