CN117034774A - Construction method of high-accuracy straw enzymolysis polysaccharide yield prediction model - Google Patents
- Publication number
- CN117034774A (application number CN202311049518.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- data set
- neural network
- training
- prediction model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application discloses a method for constructing a high-accuracy prediction model for the polysaccharide yield of straw enzymatic hydrolysis, comprising the following steps: collecting a data set and preprocessing it to obtain a machine learning data set and a deep neural network data set; constructing an extreme gradient boosting (XGB) model and training it on the machine learning data set to obtain a machine learning prediction model; constructing a neural network model and training it on the deep neural network data set to obtain a deep neural network (DNN) enzymolysis polysaccharide yield prediction model; and predicting the yield of straw enzymolysis polysaccharide based on the prediction models. XGB has the highest prediction accuracy, 95.6%; the random forest (RF) and DNN models are slightly lower, at 93.0% and 91.1%, respectively. Interpretability analysis of the XGB model quantifies, for the first time, the contribution of each input variable to the polysaccharide yield prediction. The method can also be extended to the resource utilization of other kinds of biomass.
Description
Technical Field
The application belongs to the field of artificial intelligence technology and corn straw recycling, and particularly relates to a construction method of a high-accuracy straw enzymolysis polysaccharide yield prediction model.
Background
Corn straw is one of the main agricultural wastes in China, and its efficient resource utilization has long been a difficult problem in the field of agricultural resource utilization. Hemicellulose, one of the main components of corn straw, is a heteropolymer composed of monosaccharides such as xylose, arabinose, mannose and galactose. Converting this long-chain polymer into oligomeric polysaccharides by enzymatic hydrolysis is an important way to utilize corn straw hemicellulose as a resource. The resulting polysaccharides have various biological activities, such as anti-inflammatory, antioxidant and anti-tumor activity, have low toxicity, and are potential substitutes for related drugs.
However, complex interactions between the enzymolysis parameters (such as temperature, pH, enzyme addition amount, time and substrate concentration) make it challenging to predict polysaccharide yield accurately and to optimize production. It is therefore necessary to construct an enzymolysis polysaccharide yield prediction model with high prediction accuracy to assist production. Traditional multivariate statistical methods cannot capture the interdependence among variables in high-dimensional data, so their regression predictions are poor. Data-driven modeling based on artificial intelligence techniques such as machine learning and deep learning has attracted attention for its efficiency and accuracy compared with traditional statistical methods, but the appropriate machine learning and deep learning modeling methods differ from one use scenario to another.
Disclosure of Invention
The application aims to provide a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, so as to solve the problems in the prior art.
In order to achieve the above purpose, the application provides a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, which comprises the following steps:
collecting a data set, preprocessing the data set, and obtaining a machine learning data set and a deep neural network data set;
constructing an extreme gradient boosting model, and inputting the machine learning data set into the extreme gradient boosting model for training to obtain a machine learning enzymolysis polysaccharide yield prediction model;
constructing a neural network model, inputting the deep neural network data set into the neural network model for training, and obtaining a deep neural network enzymolysis polysaccharide yield prediction model;
and predicting the yield of the straw enzymolysis polysaccharide based on the machine learning enzymolysis polysaccharide yield prediction model and the deep neural network enzymolysis polysaccharide yield prediction model.
Preferably, the process of obtaining a machine learning dataset comprises:
collecting a data set based on a straw enzymolysis polysaccharide reaction system;
carrying out correlation analysis on the data set to obtain a correlation relationship among variables;
after the correlation analysis, removing outliers from the data set to obtain an outlier-removed data set;
and performing stratified sampling on the outlier-removed data set to obtain the machine learning data set.
Preferably, the linear correlation between variables in the correlation analysis is expressed as:
$$\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where $\rho_{xy}$ is the Pearson correlation value between the two variables $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are the mean values of the two variables.
Preferably, the acquiring process of the deep neural network data set includes:
collecting a data set based on a straw enzymolysis polysaccharide reaction system;
and carrying out standardization processing on the data in the data set to obtain the deep neural network data set.
Preferably, the dataset comprises input variables and output variables;
the input variables include enzyme addition amount, time, temperature, substrate concentration, and pH;
the output variable comprises polysaccharide content.
Preferably, the process of obtaining a machine-learning enzymatic polysaccharide yield prediction model comprises:
dividing the machine learning data set to obtain a training set and a testing set;
constructing an extreme gradient boosting model, and inputting the training set into the extreme gradient boosting model for training to obtain a training model;
and inputting the test set into the training model for performance evaluation to obtain the machine learning enzymolysis polysaccharide yield prediction model.
Preferably, the deep neural network data set is divided into a training set and a validation set based on a cross-validation method;
constructing a deep neural network, and inputting the training set into the deep neural network for training to obtain a neural network training model;
inputting the validation set into the trained deep neural network for testing to obtain a test result;
and carrying out parameter adjustment on the neural network training model based on the test result to obtain the deep neural network enzymolysis polysaccharide yield prediction model.
The application has the technical effects that:
The application addresses the current lack of a modeling method for accurately predicting the yield of high-value target products prepared by enzymatic hydrolysis of biomass, and provides a method for constructing a high-accuracy product yield prediction model. Taking polysaccharide production from xylan enzymolysis of corn straw as an example, the application designs, for this scenario, the collection and preprocessing of a machine learning data set, the modeling methods for several machine learning models, and the data preprocessing and modeling method for an effective deep neural network model, and successfully constructs several enzymolysis polysaccharide yield prediction models with high prediction accuracy. The modeling method comprises four machine learning models (LR, Tree, RF, XGB) and an independently designed deep learning model (DNN). XGB achieves a prediction accuracy of 95.6%, while the RF and DNN models are slightly lower, at 93.0% and 91.1%, respectively. Interpretability analysis of the XGB model quantifies, for the first time, the contribution of each input variable to the polysaccharide yield prediction (enzyme addition: 43.7%, enzymatic hydrolysis time: 20.7%, substrate concentration: 15%, temperature: 15%, pH: 5.6%). The method can also be extended to the resource utilization of other kinds of biomass.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model in the embodiment of the application;
FIG. 2 is a schematic diagram of the Pearson correlation coefficient (PCC) matrix of all data in an embodiment of the application;
FIG. 3 is a descriptive statistical box plot of the target variable in the data set in an embodiment of the application;
FIG. 4 is a stratified sampling histogram based on enzyme solution volume (ESV) in an embodiment of the application;
FIG. 5 is a schematic diagram of a decision tree model in an embodiment of the application;
FIG. 6 is a schematic diagram of a random forest model in an embodiment of the application;
FIG. 7 is a schematic structural diagram of the XGB model in an embodiment of the application;
FIG. 8 is a schematic structural diagram of the DNN model in an embodiment of the application;
FIG. 9 is a hierarchical clustering chart of the feature importance of the XGB model input variables in an embodiment of the application;
FIG. 10 is a graph showing the prediction accuracy of the prediction models for polysaccharide yield (PY) in an embodiment of the application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
As shown in fig. 1, the embodiment provides a method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model, which comprises the following steps:
The method for constructing the straw enzymolysis polysaccharide yield model comprises three parts: data set collection, machine learning model construction, and deep learning model construction. In selecting model types for the machine learning and deep learning parts, model complexity is increased step by step, from linear to nonlinear and from low to high degrees of ensembling, in order to obtain a polysaccharide yield prediction model with high prediction accuracy. Following this strategy, the embodiment constructs one linear and three nonlinear machine learning models and designs a new deep neural network model.
(1) Data set collection:
The data set was collected from the straw enzymolysis polysaccharide reaction system and contains five input variables (enzyme addition amount, time, temperature, substrate concentration, pH) and one output variable (polysaccharide yield); 179 data points were collected as the raw data set for modeling. Descriptive statistics of the raw data are given in Table 1. The mean and standard deviation of each variable show the distribution pattern of the collected data, and the minimum and maximum give the range of each parameter. The quartiles, together with the minimum and maximum, give a fuller picture of how the data are distributed.
TABLE 1
(2) Machine learning model construction:
1) Data preprocessing (including three steps):
First step: as shown in FIG. 2, the process starts with Pearson correlation coefficient (PCC) analysis. All input variables in the data set are continuous. The sign of the PCC gives the direction of the correlation between two variables, and its magnitude indicates the strength of their linear relationship. Among the linear correlations observed, the PCC between enzyme solution volume (ESV) and polysaccharide yield (PY) was 0.61, above 0.5, indicating a moderate positive correlation. ESV was therefore selected as the stratification variable for sampling. No obvious correlation was observed among the remaining variables (-0.5 < PCC < 0.8).
Correlation analysis is typically used to determine whether two or more variables are statistically associated and to analyze the strength and direction of the association. Because all input variables in this experiment are continuous, the linear correlation between variables can be measured by the Pearson correlation coefficient, calculated by equation (1).
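$$\rho_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$
where $\rho_{xy}$ is the Pearson correlation value between the two variables $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are the mean values of the two variables.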
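As an illustration of the Pearson coefficient, the following minimal sketch evaluates the formula term by term with NumPy. The function name `pearson` and the sample values are hypothetical and are not data from the application.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical enzyme solution volumes (uL) and polysaccharide yields
esv = [50, 120, 180, 260, 340, 390]
py = [1.1, 1.9, 2.4, 2.2, 3.5, 3.9]
rho = pearson(esv, py)
print(f"PCC(ESV, PY) = {rho:.3f}")
```

For a full matrix such as the one in FIG. 2, the same quantity can be obtained for every pair of columns at once, for example with `numpy.corrcoef`.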
Second step: outlier removal. Outliers are data points that lie far from most of the sample points, and they typically exhibit unreasonable characteristics in the data set. Ignoring them may lead to incorrect conclusions in machine learning modeling, so they need to be identified and removed. The most common ways of identifying outliers are graphical methods (e.g., box plots, normal distribution plots) and modeling methods (e.g., linear regression, clustering algorithms). In the present application, outliers are identified with the box plot method, which uses the quartiles of the data and is widely used in academic research. A data value above the upper whisker or below the lower whisker of the box plot is treated as an outlier, with the whiskers placed 1.5 times the interquartile range beyond the upper and lower quartiles.
Third step: as shown in FIGS. 3-4, the outlier-removed data set is named the sub-data set, and stratified sampling is then performed on it. The training set and the test set used to build the machine learning models are obtained by sampling from the sub-data set. Common sampling methods include random sampling, stratified sampling, cluster sampling, and systematic sampling. Considering the sample size of the data set, the study used stratified sampling to form the training and test sets. According to the Pearson correlation coefficient (PCC), the input variable ESV (enzyme addition amount) is the input most strongly correlated with the target variable PY (polysaccharide yield). The ideal training set and test set should therefore cover all ESV types. Four ESV types were created using Pandas: type 1, 0-100 μL; type 2, 100-200 μL; type 3, 200-300 μL; type 4, 300-400 μL. 80% of the sub-data set samples were assigned to the training set, and the remaining 20% formed the test set used for the final evaluation of the developed models.
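The box plot rule can be sketched in a few lines of NumPy. This is an illustrative version only: the function name `remove_outliers_iqr` and the sample yields are hypothetical, and the quartiles use NumPy's default linear interpolation, which may differ slightly from the plotting tool used in the application.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; return kept values and mask."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                      # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values >= lower) & (values <= upper)
    return values[mask], mask

# Hypothetical polysaccharide yields with one implausible reading
py = np.array([2.1, 2.4, 2.2, 2.8, 3.0, 2.6, 9.5])
kept, mask = remove_outliers_iqr(py)
print(kept)  # the reading 9.5 lies above the upper fence and is dropped
```

The boolean mask can be applied to the full table so that the input variables of a removed data point are dropped together with its yield.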
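The stratified 80:20 split can be sketched as follows. The application states that Pandas was used; this sketch uses plain NumPy instead, and the function name `stratified_split`, the random ESV values, and the handling of values that fall exactly on a bin edge are assumptions made for illustration.

```python
import numpy as np

def stratified_split(strata, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction from every stratum."""
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    train_idx, test_idx = [], []
    for label in np.unique(strata):
        members = rng.permutation(np.flatnonzero(strata == label))
        n_test = max(1, int(round(test_fraction * len(members))))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

# Hypothetical ESV values in microlitres
rng = np.random.default_rng(42)
esv = rng.uniform(1, 400, size=100)
# ESV types: (0,100] -> 1, (100,200] -> 2, (200,300] -> 3, (300,400] -> 4
esv_type = np.digitize(esv, bins=[100, 200, 300], right=True) + 1
train_idx, test_idx = stratified_split(esv_type)
print(len(train_idx), len(test_idx))
```

Sampling inside each ESV type keeps the proportions of the four types roughly equal in the training and test sets, which is the property the stratification is meant to guarantee.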
2) Linear Regression (LR) model construction:
A multiple linear regression (LR) model was applied to study the relationship between variables. A linear model makes predictions by learning a linear combination of the input attributes, as shown in formula (1):
$$f(x) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b \tag{1}$$
Linear models have many advantages, such as a simple form and ease of modeling. Assuming a linear relationship between the target variable and the input variables, the multiple linear regression (LR) model predicts the value of the target variable from a linear combination of the input variables. In the present application, the sub-data set is divided into a training set and a test set at a ratio of 80:20. A linear regression model can only capture linear relationships between variables and cannot represent nonlinear ones, such as hierarchical relationships. The present application therefore uses the LR model as a baseline (control) for the nonlinear models.
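A minimal least-squares sketch of the linear model f(x) = w·x + b is given here for illustration. The helper names `fit_linear` and `predict_linear` and the synthetic data are hypothetical; the application does not state which library was used to fit the LR model.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares fit of f(x) = w . x + b; returns (w, b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])  # extra column of ones for b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_linear(X, w, b):
    return np.asarray(X, dtype=float) @ w + b

# Synthetic data with five inputs, mirroring the five enzymolysis parameters
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + 1.5
w, b = fit_linear(X, y)
print(np.round(w, 3), round(float(b), 3))
```

Because the synthetic target is exactly linear, the fit recovers the generating weights; on real enzymolysis data the residual error of this baseline indicates how much nonlinearity is left for the tree-based models to capture.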
3) Decision Tree (Tree) model construction:
Since the decision tree model is the base model of the other two tree models (random forest and extreme gradient boosting), it is constructed first among the nonlinear models. A decision tree consists of nodes and directed edges, and each layer corresponds to a sample feature. Nodes include internal nodes, which represent a feature or attribute, and leaf nodes, which represent a class (FIG. 5). The core principle of the decision tree model is that similar inputs produce similar outputs; it has low computational complexity and substantial interpretability advantages. The decision tree model is more interpretable than the linear regression model. In this study, the sub-data set is divided in the same way as for the linear regression model. During model construction, this embodiment uses grid search to optimize the hyperparameter max_depth (the maximum depth of the tree), with ten-fold cross-validation on the training set.
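The grid search over `max_depth` with ten-fold cross-validation might look like the following scikit-learn sketch. The candidate depths 1 to 10, the RMSE scoring, and the synthetic training data are assumptions; the application does not list the grid that was searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training set (five inputs, one target)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(120, 5))
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2 + rng.normal(0, 0.05, 120)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},  # assumed candidate depths
    scoring="neg_root_mean_squared_error",         # higher (less negative) is better
    cv=10,                                         # ten-fold cross-validation
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_
print(best_depth, round(best_rmse, 4))
```

After the search, `search.best_estimator_` is refitted on the whole training set and can be evaluated once on the held-out test set.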
4) Building a random forest model:
Although the prediction accuracy of the decision tree model is higher than that of the linear regression model, it may be lower than that of ensemble learning models. Ensemble learning can greatly improve prediction accuracy by combining multiple decision trees. The standard algorithms for ensemble learning models are Bagging and Boosting.
The random forest model is a typical Bagging algorithm: each weak learner is trained on a training set formed by randomly sampling the raw data with replacement (FIG. 6). The study built the RF model with the same procedure as the LR model. The average RMSE over the validation subsets is obtained by cross-validation and used to evaluate the model. The final regression result of the RF model is the average of the results of the base models.
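The two properties described for the random forest, evaluation by the average cross-validated RMSE and a final prediction equal to the average over the base trees, can be checked with a short scikit-learn sketch. The number of trees, the number of folds, and the synthetic data are assumptions and are not taken from the application.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(120, 5))
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2 + rng.normal(0, 0.05, 120)

# bootstrap=True: every tree is trained on a sample drawn with replacement
forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)

# Average RMSE over the validation folds
scores = cross_val_score(forest, X_train, y_train,
                         scoring="neg_root_mean_squared_error", cv=10)
mean_rmse = -scores.mean()

# The forest prediction is the mean of the individual tree predictions
forest.fit(X_train, y_train)
per_tree = np.stack([tree.predict(X_train[:5]) for tree in forest.estimators_])
print(round(mean_rmse, 4), np.allclose(per_tree.mean(axis=0), forest.predict(X_train[:5])))
```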
5) Extreme gradient boosting (XGB) model construction:
After the RF model is constructed, this embodiment constructs the XGB model (FIG. 7). The construction and evaluation methods are the same as for the RF model. XGB is a typical machine learning model based on the Boosting algorithm, whose core idea is to promote weak learners into a strong learner. Boosting differs from Bagging in two ways: 1) in Bagging, all weak learners have the same weight in the final result, whereas Boosting assigns greater weight after each round of training to the weak learners whose predictions are more accurate; 2) after each training round, Boosting changes the probability distribution of the training set, increasing the weight of the samples that the weak learner mispredicted in the previous round and decreasing the weight of the samples that were predicted correctly.
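The sequential nature of Boosting can be illustrated with scikit-learn's `GradientBoostingRegressor`, used here as a stand-in and not as the XGB implementation of the application. In gradient boosting each new tree is fitted to the errors left by the trees before it, which is a related but not identical mechanism to the sample reweighting described above. The hyperparameter values and data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(120, 5))
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2 + rng.normal(0, 0.05, 120)

booster = GradientBoostingRegressor(
    n_estimators=200,    # number of weak learners added one after another
    learning_rate=0.1,   # shrinkage applied to each learner's contribution
    max_depth=3,         # shallow trees act as weak learners
    random_state=0,
)
booster.fit(X_train, y_train)

# Training RMSE after 1, 2, ..., 200 learners: the ensemble gets stronger
stage_rmse = [float(np.sqrt(np.mean((y_train - pred) ** 2)))
              for pred in booster.staged_predict(X_train)]
print(round(stage_rmse[0], 4), round(stage_rmse[-1], 4))
```

The `xgboost` package provides an `XGBRegressor` class with the same `fit`/`predict` interface, so a sketch like this one can be adapted to it with few changes.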
Machine learning models are generally regarded as black boxes. Although their prediction accuracy is good, it is difficult to determine the contribution of each input variable to the predicted target variable, which creates a trust risk when the model is applied in an actual business scenario. Interpretability analysis also deepens insight into the model, supports iteration on the model and its features, and helps in developing optimization algorithms at a later stage. In this study, SHAP was used to perform an interpretability analysis of the developed model and to explore the relationships between variables. Feature importance measures the importance of each feature in a data set. However, feature importance conclusions can vary greatly from model to model and do not explain the effect of each feature on each individual prediction. Unless an attribution method is consistent, a feature with a higher assigned attribution is not necessarily one the model relies on more, and attribution importance cannot be compared between two arbitrary models. SHAP is the only consistent individualized feature attribution method. With SHAP, the model's conclusions are attributed both globally and for individual predictions, which makes it possible to understand the black-box algorithm, observe the interactions between input variables, and analyze feature importance. The results are presented visually, and the SHAP dependence plot can be used in place of traditional partial dependence and accumulated local effects plots. Feature importance hierarchical clustering plots and density scatter plots are used for macro analysis of each feature, and a distribution heat map is used for macro analysis of the model as a whole.
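For reference, SHAP attributions are based on the Shapley value from cooperative game theory. The notation here is introduced for illustration and does not appear in the application: $N$ is the set of input features, $v(S)$ is the expected model output when only the features in subset $S$ are known, and $\phi_i$ is the attribution of feature $i$.

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]$$

$$f(x) = \phi_0 + \sum_{i \in N} \phi_i$$

The second identity states that the attributions of a single prediction sum to the difference between that prediction and the baseline $\phi_0 = v(\varnothing)$. Percentage contributions such as those reported for the XGB model are commonly obtained by averaging $|\phi_i|$ over the data set and normalizing, although the application does not state the exact procedure.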
(3) Deep Neural Network (DNN) model construction:
the whole process starts with data normalization. It is assumed that data having widely different numerical ranges is fed into the neural network. In this case, the network may make learning more challenging by adapting to such data of different numerical ranges. For this reason, one popular solution in practice is to normalize each feature. For each data of the input variables, the average value of the variables is first subtracted and then divided by the standard deviation. Normalization eliminates the effects of amplitude, improves the accuracy of the deep learning model, and allows for faster convergence during gradient descent. The mean and standard deviation for the normalized calculation of the training set and test set data are also calculated from the training set data.
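The standardization step, including the reuse of training-set statistics for the test set, can be sketched as follows. The helper names `fit_standardizer` and `standardize` and the numeric values are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from the training set only."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def standardize(X, mean, std):
    """Subtract the mean, then divide by the standard deviation."""
    return (np.asarray(X, dtype=float) - mean) / std

# Hypothetical columns: temperature, pH, enzyme addition amount
X_train = np.array([[30.0, 4.5, 100.0],
                    [40.0, 5.0, 200.0],
                    [50.0, 5.5, 300.0],
                    [60.0, 6.0, 400.0]])
X_test = np.array([[45.0, 5.25, 250.0]])

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)  # reuses the training statistics
print(X_test_std)
```

Computing the statistics on the training set alone keeps information about the test set from leaking into the model during training.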
The second step is to build a Deep Neural Network (DNN). The model constructed in this study (fig. 8) contains three fully connected layers (Dense), 48 neurons each, with smaller network structures reducing overfitting. The activation function may impart a nonlinear characteristic to the neural network. The ReLU function alleviates the problem of gradient disappearance, is simple to calculate, and creates sparsity to prevent overfitting, so the hidden layers all use ReLU as the activation function. L1 regularization is also added to the two hidden layers. In L1 regularization, the added cost is proportional to the absolute value of the weighting coefficient. The DNN model is compiled with an optimizer 'Adam', and the loss functions 'MSE' and 'MAE' are monitored throughout the training of the model. Further, the structure of the DNN prediction model is shown in fig. 8.
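The activation function and the regularized training objective described above can be written compactly. The symbols are introduced here for illustration: $\lambda$ is the regularization strength, whose value is not given in the application, $w_j$ ranges over the weights of the regularized layers, and $\hat{y}_i$ is the network's prediction for data point $i$.

$$\mathrm{ReLU}(z) = \max(0, z)$$

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j} \lvert w_j \rvert$$

The first term is the MSE loss and the second is the L1 penalty, which grows in proportion to the absolute value of each weight and therefore pushes small weights toward zero.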
The k-fold cross-validation method is used to split validation sets from the training set in order to evaluate the DNN model while its parameters are tuned. Finally, a callback function is defined before the program is run and is called during training. The number of epochs needed to reach the lowest validation loss cannot be known before the neural network model is trained. This problem can be solved with the ModelCheckpoint callback of the Keras library, which saves the best model weights obtained during the whole training process. The ModelCheckpoint parameters are set to: monitor='val_mae', save_best_only=True.
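The two mechanisms in this paragraph can be sketched in plain Python as follows. The fold count, the sample count and the per-epoch validation MAE values are hypothetical, and the `BestCheckpoint` class is an illustrative stand-in for the behaviour of a save-best-only checkpoint, not the Keras implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


class BestCheckpoint:
    """Keep the weights from the epoch with the lowest monitored value."""

    def __init__(self):
        self.best_value = float("inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, value, weights):
        if value < self.best_value:
            self.best_value = value
            self.best_epoch = epoch
            self.best_weights = list(weights)  # copy so later epochs cannot overwrite it


folds = list(k_fold_indices(10, 4))

# Hypothetical validation MAE per epoch: it falls, then rises as overfitting sets in.
val_mae_history = [0.90, 0.61, 0.48, 0.52, 0.55]
checkpoint = BestCheckpoint()
for epoch, val_mae in enumerate(val_mae_history):
    weights = [epoch * 0.1]  # stand-in for the real model weights
    checkpoint.update(epoch, val_mae, weights)

print(checkpoint.best_epoch, checkpoint.best_value)
```

In practice the samples are usually shuffled before the folds are formed, and the checkpoint is kept separately for each fold.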
The loss function of the deep learning model is the mean square error (MSE), and the mean absolute error (MAE) is also monitored during training, as given in Equations 4 and 5. The coefficient of determination (R²) and the root mean square error (RMSE) are used as statistical indicators for evaluating the models, as described in Equations 2 and 3. The higher the R² and the lower the RMSE, the higher the model accuracy. For each training run, the smaller the MSE and MAE values, the better the trained model. Here $\hat{y}_i$, $y_i$ and $\bar{y}$ denote the predicted value, the true value and the mean of the true values, respectively, and n is the total number of data points.
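For reference, the standard definitions of these four metrics are written out below, with $\hat{y}_i$ the predicted value, $y_i$ the true value, $\bar{y}$ the mean of the true values and $n$ the number of data points. The equation numbers are assumed to correspond to Equations 2 to 5 cited in the text.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{3} \\
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{4} \\
\mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{5}
\end{align}
```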
In the production of polysaccharide by enzymolysis of straw, the polysaccharide yield is often affected by many input variables with complex nonlinear effects, such as temperature, enzyme addition amount, time and pH. These input variables affect the efficiency of polysaccharide production from enzymolysis of straw and thus the polysaccharide yield. In addition, the complex interactions between the different input variables make the optimization of enzymatic polysaccharide production very difficult. Traditional multivariable research methods can hardly capture the full nonlinear influence of the different input variables on the polysaccharide yield, so the yield prediction models built with them predict poorly.
The application aims to overcome the current lack of a modeling method that accurately predicts the yield of high-value target products prepared by enzymatic hydrolysis of biomass, and provides a method for constructing a high-accuracy product yield prediction model. Taking polysaccharide production by xylan enzymolysis of corn straw as an example, the application designs, for this scenario, the acquisition and preprocessing of a machine learning data set, the modeling methods of several machine learning models, and the data preprocessing and modeling method of an effective deep neural network model, and successfully constructs several enzymolysis polysaccharide yield prediction models with high prediction accuracy. The modeling method provided by the application includes four machine learning models (LR, Tree, RF, XGB) and an independently designed deep learning model (DNN). The prediction accuracy (test-set R²) of XGB is 95.6%, while that of the RF and DNN models is slightly lower, at 93.0% and 91.1% respectively. The interpretability analysis of the XGB model (Fig. 9) first quantified the contribution of each input variable to the polysaccharide yield prediction (enzyme addition amount: 43.7%, enzymolysis time: 20.7%, substrate concentration: 15%, temperature: 15%, pH: 5.6%). The method can also be extended to other resource-utilization processes for various kinds of biomass.
The five models were built with the five input variables, and the predictive accuracy of the developed models was evaluated on the test set (Fig. 10). On the training set, the R² values of the LR, Tree, RF, XGB and DNN models were 0.514, 0.979, 0.972, 0.999 and 0.987, respectively. The values obtained on the test data were 0.514 (LR), 0.879 (Tree), 0.930 (RF), 0.956 (XGB) and 0.911 (DNN). Thus, the prediction of XGB is significantly better than that of the other four models, which is further supported by its RMSE value, the lowest of all (only 0.328) (Table 2). The gap between the training and test results of these models indicates that some overfitting may exist. However, the R² of RF, XGB and DNN exceeded 0.9 on both the training set and the test set, so no severe overfitting occurred in these models.
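The evaluation above can be reproduced in outline with the short sketch below. The `r2_score` and `rmse` helpers follow the standard definitions, the four yield values used to exercise them are hypothetical, and the train/test R² pairs are the values reported in this section; the gap between them is used here as a simple, assumed indicator of overfitting.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical true and predicted yields, used only to exercise the helpers.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
example_r2 = r2_score(y_true, y_pred)
example_rmse = rmse(y_true, y_pred)

# (training R², test R²) as reported in the text.
reported = {
    "LR": (0.514, 0.514),
    "Tree": (0.979, 0.879),
    "RF": (0.972, 0.930),
    "XGB": (0.999, 0.956),
    "DNN": (0.987, 0.911),
}
gaps = {name: train - test for name, (train, test) in reported.items()}
largest_gap_model = max(gaps, key=gaps.get)
print(example_r2, example_rmse, gaps, largest_gap_model)
```

With the reported values, the single decision tree shows the largest drop from training to test, while RF and XGB show the smallest drops among the nonlinear models.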
Compared with the LR model, the prediction accuracy of the Tree, RF, XGB and DNN models is remarkably improved, which further confirms that the relationships among the variables are mainly nonlinear. The prediction accuracy of the DNN model is better than that of the LR and Tree models but slightly lower than that of the RF and XGB models. This shows that a deep neural network is suitable for this research scenario and that its prediction performance is comparable to that of other nonlinear machine learning models, which provides a new approach to finding the optimal polysaccharide yield prediction model. Furthermore, such a deep neural network model can be combined with an optimization algorithm to find input variable combinations that increase the target variable (polysaccharide yield) with less research time and production cost.
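As one possible illustration of combining a prediction model with an optimization algorithm, the sketch below runs an exhaustive grid search over the five input variables. The `surrogate_yield` function is a hypothetical stand-in for a trained model's prediction function, and the grid values and units are assumptions; neither is taken from the application.

```python
from itertools import product


def surrogate_yield(enzyme, time_h, temperature, substrate, ph):
    """Hypothetical stand-in for the prediction function of a trained model."""
    return (
        0.8 * enzyme
        + 0.3 * time_h
        - 0.01 * (temperature - 50.0) ** 2
        - 0.05 * (substrate - 4.0) ** 2
        - 1.5 * (ph - 5.0) ** 2
    )


# Candidate levels for each input variable (assumed values and units).
GRID = {
    "enzyme": [1.0, 2.0, 3.0],
    "time_h": [2.0, 4.0, 6.0],
    "temperature": [40.0, 50.0, 60.0],
    "substrate": [2.0, 4.0, 6.0],
    "ph": [4.0, 5.0, 6.0],
}


def grid_search(predict, grid):
    """Evaluate every combination and return the one with the highest prediction."""
    names = list(grid)
    best_combo, best_value = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        value = predict(*values)
        if value > best_value:
            best_combo, best_value = dict(zip(names, values)), value
    return best_combo, best_value


best_combo, best_value = grid_search(surrogate_yield, GRID)
print(best_combo, best_value)
```

Because each evaluation is a model prediction rather than an experiment, a search of this kind can narrow down the conditions worth testing in the laboratory.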
TABLE 2
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.
Claims (7)
1. The method for constructing the high-accuracy straw enzymolysis polysaccharide yield prediction model is characterized by comprising the following steps of:
collecting a data set, preprocessing the data set, and obtaining a machine learning data set and a deep neural network data set;
constructing an extreme gradient boosting model, and inputting the machine learning data set into the extreme gradient boosting model for training to obtain a machine learning enzymolysis polysaccharide yield prediction model;
constructing a neural network model, inputting the deep neural network data set into the neural network model for training, and obtaining a deep neural network enzymolysis polysaccharide yield prediction model;
and predicting the yield of the straw enzymolysis polysaccharide based on the machine learning enzymolysis polysaccharide yield prediction model and the deep neural network enzymolysis polysaccharide yield prediction model.
2. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for obtaining a machine learning data set comprises:
collecting a data set based on a straw enzymolysis polysaccharide reaction system;
carrying out correlation analysis on the data set to obtain a correlation relationship among variables;
removing outliers according to the correlation relationship among the variables to obtain an outlier-removed data set;
and carrying out stratified sampling on the outlier-removed data set to obtain the machine learning training data set.
3. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 2, wherein the expression for obtaining the correlation relationship between variables is:
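$$\rho_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $\rho_{xy}$ represents the Pearson correlation value between the two variables x and y, and $\bar{x}$ and $\bar{y}$ represent the mean values of the corresponding variables.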
4. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for acquiring the deep neural network data set comprises the following steps:
collecting a data set based on a straw enzymolysis polysaccharide reaction system;
and carrying out standardization processing on the data in the data set to obtain the deep neural network data set.
5. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the data set comprises an input variable and an output variable;
the input variables include enzyme addition amount, time, temperature, substrate concentration, and pH;
the output variable comprises polysaccharide content.
6. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for obtaining the machine learning enzymolysis polysaccharide yield prediction model comprises:
dividing the machine learning data set to obtain a training set and a testing set;
constructing an extreme gradient boosting model, and inputting the training set into the extreme gradient boosting model for training to obtain a training model;
and inputting the test set into the training model for performance evaluation to obtain the machine learning enzymolysis polysaccharide yield prediction model.
7. The method for constructing a high-accuracy straw enzymolysis polysaccharide yield prediction model according to claim 1, wherein the process for obtaining the deep neural network enzymolysis polysaccharide yield prediction model comprises:
dividing the deep neural network data set into a training set and a validation set based on a cross-validation method;
constructing a deep neural network, and inputting the training set into the deep neural network for training to obtain a neural network training model;
inputting the validation set into the trained deep neural network for testing to obtain a test result;
and carrying out parameter adjustment on the neural network training model based on the test result to obtain the deep neural network enzymolysis polysaccharide yield prediction model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311049518.XA CN117034774A (en) | 2023-08-21 | 2023-08-21 | Construction method of high-accuracy straw enzymolysis polysaccharide yield prediction model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117034774A true CN117034774A (en) | 2023-11-10 |
Family
ID=88627869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311049518.XA Pending CN117034774A (en) | 2023-08-21 | 2023-08-21 | Construction method of high-accuracy straw enzymolysis polysaccharide yield prediction model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117034774A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091236A (en) * | 2019-11-27 | 2020-05-01 | 长春吉电能源科技有限公司 | Multi-classification deep learning short-term wind power prediction method classified according to pitch angles |
CN114139596A (en) * | 2021-10-15 | 2022-03-04 | 惠州学院 | Tea variety identification method and system based on deep neural network |
CN114782775A (en) * | 2022-04-19 | 2022-07-22 | 平安科技(深圳)有限公司 | Method and device for constructing classification model, computer equipment and storage medium |
CN114781723A (en) * | 2022-04-22 | 2022-07-22 | 国网河北省电力有限公司 | Short-term photovoltaic output prediction method based on multi-model fusion |
CN115641153A (en) * | 2022-12-23 | 2023-01-24 | 广州市家庭经济核对和养老服务指导中心 | Vehicle price evaluation method based on deep neural network |
Non-Patent Citations (1)
Title |
---|
集智俱乐部: "How to use GPT-4 to accelerate knowledge mining and machine learning in synthetic biology?", pages 2, Retrieved from the Internet <URL:http://www.163.com/dy/article/IBNODTQF0511D05M.html> * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||