CN113380345A - Organic chemical coupling reaction yield prediction and analysis method based on deep forest - Google Patents

Organic chemical coupling reaction yield prediction and analysis method based on deep forest

Info

Publication number
CN113380345A
Authority
CN
China
Prior art keywords
yield
cascade
prediction
layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110761921.XA
Other languages
Chinese (zh)
Inventor
彭李超
杨晓慧
穆雪纯
董晶
邹雪艳
孙磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Publication of CN113380345A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Analytical Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions. The method comprises calculating feature descriptors, building a model, and intelligent regression and classification prediction of the yield, specifically: 1) calculating the feature descriptors of each coupling reaction component with chemical software and converting them into one-dimensional data; 2) building a deep forest model and training it on the feature descriptors, with the user adjusting the parameters to achieve the best prediction effect; the method combines the feature-learning idea of deep learning with ensemble learning to predict chemical reactions efficiently; 3) performing regression and classification prediction of the yield with the trained model and analyzing the prediction results; the importance of the one-dimensional feature descriptors is calculated and their influence on the yield is analyzed, providing more reliable decision information for users in production experiments. The method can help chemists predict yields quickly while saving cost.

Description

Organic chemical coupling reaction yield prediction and analysis method based on deep forest
Technical Field
The invention belongs to the field of organic synthesis based on pattern recognition and artificial intelligence, and particularly relates to a method for predicting and analyzing yield of organic chemical coupling reaction based on deep forest.
Background
A coupling reaction is a process in which two organic chemical units (molecules) undergo a chemical reaction to give one organic molecule; it includes cross-coupling and self-coupling reactions. Cross-coupling reactions offer high efficiency and mild reaction conditions and are often used in organic synthesis. As atoms and molecules can be manipulated more efficiently, materials that were difficult or even impossible to synthesize can now be made easily. Coupling reactions are mainly used in natural product synthesis, materials science, pesticide chemistry, ligand synthesis, and related fields, so increasing their yield benefits industrial production and daily life. To prepare coupling products effectively while reducing consumption, the yield of the coupling reaction needs to be predicted more accurately, the important factors influencing the yield need to be explored, and more reliable decision information needs to be provided for production experiments.
Coupling reactions have developed rapidly over the past century. In 1903, Ullmann's research group constructed a C-N bond through the coupling of an aryl halide with an amine; in 1972, Richard F. Heck discovered that palladium catalysts could link carbon atoms under milder conditions; in 1983, Migita et al. reported the first palladium-catalyzed reaction forming a C(sp2)-N bond; in 1995, the research teams of Stephen L. Buchwald and John F. Hartwig almost simultaneously discovered a palladium-catalyzed coupling of aryl bromides with amines that does not require organotin compounds; in 2010, Richard F. Heck, Ei-ichi Negishi, and Akira Suzuki were awarded the Nobel Prize in Chemistry for developing palladium-catalyzed cross-coupling in organic synthesis.
Palladium catalysts are now widely used, and although they are commercially available, preparation costs still need to be reduced for large-scale production. Traditional experimental chemistry suffers from high reaction cost, long reaction cycles, complicated experimental procedures, and poor use of experimental data. When arylamines are prepared through the Buchwald-Hartwig coupling reaction, Pd metal is expensive and toxic, and byproducts such as aromatic compounds can form during the reaction, so the yield of the Buchwald-Hartwig coupling reaction is low. To solve these problems, chemists hope to find a scientific and intelligent method for predicting organic chemical synthesis.
In recent years, with the rapid development of machine learning and in view of the multidimensional nature of chemical structure and reactivity, more and more researchers have applied machine learning algorithms to organic synthesis and chemical property prediction in order to improve coupling reaction yields while saving resources. In 2018, Doyle et al. achieved high-accuracy prediction of Buchwald-Hartwig coupling reaction yields with a random forest algorithm. This demonstrated that machine learning methods can predict reaction outcomes in a multidimensional chemical space using data obtained from high-throughput experiments. There is therefore an urgent need for a method for predicting and analyzing the yield of organic chemical coupling reactions that extracts feature descriptors, converts the coupling reaction into one-dimensional data, and uses machine learning on a computer to rapidly mine the correlations among complex reaction conditions, thereby reducing the consumption of human and chemical resources, helping chemists make reasonable analyses and predictions, and promoting research and development in organic chemical synthesis.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions, which quickly reaches the best prediction of coupling reaction yield with higher accuracy through user-adjusted parameters, identifies the important features influencing the yield, and helps chemists predict the yield while saving cost.
To achieve this purpose, the invention adopts the following technical scheme:
The deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions comprises calculating feature descriptors, building a model, and intelligent regression and classification prediction of the yield:
1) calculating the feature descriptors, namely calculating the feature descriptors of each coupling reaction component with chemical software and converting them into one-dimensional data for subsequent model training;
2) building a model, namely building a deep forest model and training it on the feature descriptors, with the best prediction effect achieved through user-adjusted parameters; the method combines the feature-learning idea of deep learning with ensemble learning to predict chemical reactions efficiently;
3) intelligent regression and classification prediction of the yield, namely predicting the yield with the trained deep forest model and analyzing the results, including yield prediction analysis and importance analysis of the feature descriptors.
The calculation of the feature descriptors comprises the following steps:
(1.1) importing the chemical reactants and reagents into the chemical software, which automatically calculates the feature descriptors of each coupling reaction component and converts the chemical reactants into one-dimensional data;
(1.2) dividing the one-dimensional feature descriptors into a training set and a test set, and matching each with the corresponding yield or yield category.
The building of the model comprises the following steps:
(2.1) reading and preprocessing the training set, and selecting a deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction of the coupling reaction yield with the deep forest algorithm. The training set is fed into the cascade layers for feature learning. The predictions of every random forest in a cascade layer are concatenated with the original features and used as the input of the next cascade layer, and training continues in this way. Each time a layer is added, the mean squared error of the whole cascade is estimated on a validation set; if there is no significant gain, or the set maximum number of layers is reached, training stops, so the number of cascade layers is determined automatically. The predictions of all random forests in the last layer are averaged to obtain the final predicted value, which is output; the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.3) performing classification prediction of the category to which the coupling reaction yield belongs with the deep forest algorithm. The training set is fed into the cascade layers for feature learning. The class probability vectors produced by every random forest in a cascade layer are concatenated with the original features and used as the input of the next cascade layer, and training continues in this way. Each time a layer is added, the prediction accuracy of the whole cascade is estimated on a validation set; if there is no significant gain, or the set maximum number of layers is reached, training stops, so the number of cascade layers is determined automatically. The class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.4) performing out-of-sample prediction with the trained model; if the out-of-sample prediction is effective, the validity of the model is verified, showing that the method can effectively predict the yield of the coupling reaction;
(2.5) the user can adjust the parameters according to the prediction effect and the user's own requirements: if the result is unsatisfactory, the user can adjust the type and number of forests in the deep forest, the number of decision trees in each forest, and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
Wherein, the specific calculation process of the step (2.2) comprises the following steps:
The deep forest model has $K$ cascade layers, each composed of $L$ forests. The training samples input to the $k$-th cascade layer are $(x_k, y)$, $k = 0, 1, \ldots, K$, where $x_k$ denotes the feature vector of a training sample input to the $k$-th cascade layer and $y$ denotes the true yield corresponding to each feature vector. The input features $x_k$ received by the $k$-th cascade layer are the concatenation of the original features $x_0$ with the output of the $(k-1)$-th cascade layer, so the combined features are expressed as:
$$x_k = (f_k(x_{k-1}), x_0),$$
where $f_k(x)$ denotes the real-valued output obtained for the features $x$ from training the $k$-th cascade layer.
The final predicted value is the average of the predictions of all forests in the last cascade layer:
$$\hat{y} = \frac{1}{L} \sum_{l=1}^{L} f_K^{(l)}(x_{K-1}),$$
where $f_K^{(l)}$ denotes the prediction of the $l$-th forest in the last cascade layer.
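The regression cascade described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes scikit-learn, a hypothetical layer layout of one random forest plus one extra-trees forest, synthetic data, and a hypothetical tolerance `tol` standing in for "no significant gain". For brevity each layer passes its in-sample predictions to the next layer, whereas the text mentions cross-validation during cascade generation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error


def fit_cascade(x0_train, y_train, x0_val, y_val, max_layers=5, tol=1e-3, seed=0):
    """Grow cascade layers until the validation MSE stops improving."""
    layers, best_mse = [], np.inf
    x_train, x_val = x0_train, x0_val
    for k in range(max_layers):
        # Hypothetical layer layout: one random forest and one extra-trees forest.
        forests = [
            RandomForestRegressor(n_estimators=30, random_state=seed + 2 * k),
            ExtraTreesRegressor(n_estimators=30, random_state=seed + 2 * k + 1),
        ]
        for forest in forests:
            forest.fit(x_train, y_train)
        # Layer output: one predicted yield per forest, stacked column-wise.
        out_train = np.column_stack([f.predict(x_train) for f in forests])
        out_val = np.column_stack([f.predict(x_val) for f in forests])
        mse = mean_squared_error(y_val, out_val.mean(axis=1))
        if layers and best_mse - mse < tol:
            break  # no significant gain: discard this layer and stop growing
        layers.append(forests)
        best_mse = mse
        # Concatenate the layer output with the original features x_0.
        x_train = np.hstack([out_train, x0_train])
        x_val = np.hstack([out_val, x0_val])
    return layers, best_mse


def predict_cascade(layers, x0):
    """Average the forest predictions of the last kept cascade layer."""
    x = x0
    for forests in layers:
        out = np.column_stack([f.predict(x) for f in forests])
        x = np.hstack([out, x0])
    return out.mean(axis=1)


# Synthetic stand-in data (not the Buchwald-Hartwig data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)
layers, val_mse = fit_cascade(X[:200], y[:200], X[200:], y[200:])
pred = predict_cascade(layers, X[200:])
print(f"kept layers: {len(layers)}, validation MSE: {val_mse:.2f}")
```

The number of kept layers is decided by the validation error rather than fixed in advance, which is the automatic depth selection the text describes.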
The specific calculation process of step (2.3) is as follows:
The deep forest model has $M$ cascade layers, each composed of $N$ forests. The training samples input to the $m$-th cascade layer are $(x_m, C)$, $m = 0, 1, \ldots, M$, where $x_m$ denotes the feature vector of a training sample input to the $m$-th cascade layer and $C$ denotes the category corresponding to each feature vector. The input features $x_m$ received by the $m$-th cascade layer are the concatenation of the original features $x_0$ with the class probability vector of the $(m-1)$-th cascade layer, so the combined features are expressed as:
$$x_m = (p_m(x_{m-1}), x_0),$$
where $p_m(x)$ denotes the class probability vector obtained for the features $x$ from training the $m$-th cascade layer.
The final class probability vector is the average of the predicted probabilities of all forests in the last cascade layer:
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} p_M^{(n)}(x_{M-1}),$$
where $p_M^{(n)}$ denotes the class probability vector of the $n$-th forest in the last cascade layer.
If the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \ldots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class:
$$\hat{C} = \arg\max_{i \in \{1, \ldots, c\}} p_i(x).$$
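The final classification step (averaging the class probability vectors of the last layer and taking the most probable class) is small enough to show directly. The sketch below is illustrative only: the three probability vectors are invented numbers for a single reaction, and the function names are hypothetical.

```python
def average_class_vectors(forest_probs):
    """Average the per-forest class probability vectors of one sample."""
    n_forests = len(forest_probs)
    n_classes = len(forest_probs[0])
    return [sum(p[i] for p in forest_probs) / n_forests for i in range(n_classes)]


def predict_class(forest_probs, class_names):
    """Return the class with the largest averaged probability, plus the vector."""
    p = average_class_vectors(forest_probs)
    best = max(range(len(p)), key=p.__getitem__)
    return class_names[best], p


# Hypothetical outputs of N = 3 forests in the last cascade layer for one reaction.
last_layer = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
]
label, p = predict_class(last_layer, ["low", "medium", "high"])
print(label, [round(v, 3) for v in p])
```

Because every forest outputs a vector that sums to one, the averaged vector also sums to one, so it can be read as a class probability vector.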
3) The intelligent regression and classification prediction of the yield specifically comprises the following steps:
(3.1) intelligent regression prediction of the yield, namely feeding the training set and the corresponding yields into the cascade layers for feature learning and, when the model stops training, averaging the predictions of every random forest in the last cascade layer to obtain the final prediction result;
(3.2) intelligent classification prediction of the yield, namely feeding the training set and the corresponding yield categories into the cascade layers for feature learning and, when the model stops training, averaging the class probability vectors output by all random forests in the last layer, the class with the maximum probability being the final predicted class;
(3.3) calculating the importance ranking of the one-dimensional feature descriptors with the deep forest algorithm, thereby finding the descriptors that significantly influence the reaction yield and providing reliable decision information for the user carrying out the organic chemical coupling reaction.
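The text does not say how the deep forest computes descriptor importance. As a hedged stand-in, the sketch below ranks descriptors by the impurity-based importance of a single scikit-learn random forest; the data are synthetic and the descriptor names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the descriptor matrix: 400 reactions, 5 descriptors.
names = [f"descriptor_{i}" for i in range(5)]
X = rng.normal(size=(400, 5))
# Only descriptor_2 (strongly) and descriptor_0 (weakly) drive the synthetic yield.
y = 8.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.5, size=400)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort descriptors from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranking = [(names[i], float(forest.feature_importances_[i])) for i in order]
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances are normalized to sum to one, so each score can be read as a share of the total split gain attributed to that descriptor.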
The invention has the following beneficial effects:
1. The invention provides an intelligent prediction method based on deep forest, addressing the lack of feature learning in traditional machine learning when predicting the yield of the Buchwald-Hartwig coupling reaction. The model combines the feature-learning idea of deep learning, so that the machine automatically learns useful data and features, and uses ensemble learning to increase model complexity and improve prediction accuracy; the user can adjust the parameters to achieve the best prediction effect. The importance ranking of the feature descriptors is calculated, providing reliable decision information for the user carrying out the organic chemical coupling reaction. The method can help chemists make reasonable analyses and predictions and carry out organic synthesis quickly while saving cost.
2. The deep forest algorithm self-adaptively adjusts the complexity of the model through training, the hyper-parameters have good robustness, and good results can be obtained even by using default parameters.
3. The method for predicting and analyzing the yield of the organic chemical coupling reaction based on the deep forest is simple to operate and easy to implement, and a user can quickly obtain a relatively accurate analysis result.
Drawings
FIG. 1 is a diagram of the reaction equations and reaction components of a chemical reaction in an example of the present invention;
FIG. 2 is a flow chart of an analysis method of the present invention.
Labels in FIG. 1: the Buchwald-Hartwig coupling reaction equation and its reaction components, namely aryl halide, base, ligand, and additive.
Detailed Description
As shown in FIG. 1, the invention provides a method for predicting and analyzing yield of organic chemical coupling reaction based on deep forest, which comprises 1) calculation of feature descriptors, 2) construction of a model, and 3) intelligent regression and classification prediction of yield.
Wherein, the step 1) of calculating the feature descriptor specifically comprises:
(1.1) importing all reaction components of the Buchwald-Hartwig coupling reaction shown in FIG. 1 (23 additives, 15 aryl halides, 3 bases, and 4 ligands) into the chemical software, which automatically calculates and extracts the one-dimensional descriptors of each reaction component, finally giving 120 feature descriptors and converting the chemical reactants into one-dimensional data;
(1.2) after removing the invalid reactions and the few reactions with missing values, the remaining 3955 reactions are used as experimental data. The feature descriptors in the one-dimensional data corresponding to these reactions are divided into a training set (70%) and a test set (30%), each matched with the corresponding yield or yield category;
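The 70%/30% division in step (1.2) can be sketched with the standard library alone. The reaction count and the split fraction come from the text; the random seed, the rounding rule, and the function name are assumptions, since the text does not say how the split was drawn.

```python
import random

N_REACTIONS = 3955     # reactions kept after cleaning (from the text)
TRAIN_FRACTION = 0.7   # 70% training, 30% test (from the text)


def split_indices(n, train_fraction, seed=0):
    """Shuffle the indices 0..n-1 and cut them into training and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = int(n * train_fraction)     # round down (assumption)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(N_REACTIONS, TRAIN_FRACTION)
print(len(train_idx), len(test_idx))
```

The same index lists would then be used to select both the descriptor rows and the matching yields or yield categories, so that features and targets stay aligned.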
Step 2), building the deep forest model, as shown in FIG. 2, comprises the following steps:
(2.1) reading the training set preprocessed in step (1.2), matching it with the corresponding yields and yield categories, and selecting a deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction of the coupling reaction yield with the deep forest algorithm: the 3955 reactions of step (1.2) are used as experimental data and fed into the deep forest model for training on the feature descriptors; each time a cascade layer is added, the mean squared error of the whole cascade is estimated on the validation set; if there is no significant gain, or the set maximum number of layers is reached, training stops; the best prediction effect is achieved through user-adjusted parameters, and the yield is then predicted by regression. The specific calculation process is as follows:
The deep forest model has $K$ cascade layers, each composed of $L$ forests. The training samples input to the $k$-th cascade layer are $(x_k, y)$, $k = 0, 1, \ldots, K$, where $x_k$ denotes the feature vector of a training sample input to the $k$-th cascade layer and $y$ denotes the true yield corresponding to each feature vector. The input features $x_k$ received by the $k$-th cascade layer are the concatenation of the original features $x_0$ with the output of the $(k-1)$-th cascade layer, so the combined features are expressed as:
$$x_k = (f_k(x_{k-1}), x_0),$$
where $f_k(x)$ denotes the real-valued output obtained for the features $x$ from training the $k$-th cascade layer.
The final predicted value is the average of the predictions of all forests in the last cascade layer:
$$\hat{y} = \frac{1}{L} \sum_{l=1}^{L} f_K^{(l)}(x_{K-1}),$$
where $f_K^{(l)}$ denotes the prediction of the $l$-th forest in the last cascade layer.
(2.3) performing classification prediction of the category to which the coupling reaction yield belongs with the deep forest algorithm. The 3955 reactions of step (1.2) are used as experimental data, and the 1/4 quantile and 3/4 quantile of the yields are taken as thresholds: yields at or below the 1/4 quantile are low yields, yields between the 1/4 quantile and the 3/4 quantile are medium yields, and yields at or above the 3/4 quantile are high yields. The reaction data are fed into the deep forest model for training on the feature descriptors; each time a cascade layer is added, the prediction accuracy of the whole cascade is estimated on the validation set, and if there is no significant gain, or the set maximum number of layers is reached, training stops, so the number of cascade layers is determined automatically. The class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting the parameters, and the model is saved. The specific calculation process is as follows:
The deep forest model has $M$ cascade layers, each composed of $N$ forests. The training samples input to the $m$-th cascade layer are $(x_m, C)$, $m = 0, 1, \ldots, M$, where $x_m$ denotes the feature vector of a training sample input to the $m$-th cascade layer and $C$ denotes the category corresponding to each feature vector. The input features $x_m$ received by the $m$-th cascade layer are the concatenation of the original features $x_0$ with the class probability vector of the $(m-1)$-th cascade layer, so the combined features are expressed as:
$$x_m = (p_m(x_{m-1}), x_0),$$
where $p_m(x)$ denotes the class probability vector obtained for the features $x$ from training the $m$-th cascade layer.
The final class probability vector is the average of the predicted probabilities of all forests in the last cascade layer:
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} p_M^{(n)}(x_{M-1}),$$
where $p_M^{(n)}$ denotes the class probability vector of the $n$-th forest in the last cascade layer.
If the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \ldots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class:
$$\hat{C} = \arg\max_{i \in \{1, \ldots, c\}} p_i(x).$$
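The quantile thresholds used in step (2.3) to turn yields into low, medium, and high categories can be sketched as follows. The yields are invented numbers, the quantile interpolation method is an assumption (the text does not specify one), and a yield falling exactly on a threshold is assigned to the outer class here, which is one possible reading of the text.

```python
import statistics


def yield_classes(yields):
    """Label each yield as low, medium, or high using the 1/4 and 3/4 quantiles."""
    # method="inclusive" is an assumption; the text does not name a method.
    q1, _, q3 = statistics.quantiles(yields, n=4, method="inclusive")
    labels = []
    for y in yields:
        if y <= q1:
            labels.append("low")
        elif y >= q3:
            labels.append("high")
        else:
            labels.append("medium")
    return labels, (q1, q3)


# Hypothetical yields in percent, not taken from the patent's data set.
yields = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95]
labels, (q1, q3) = yield_classes(yields)
print(q1, q3, labels)
```

By construction roughly a quarter of the reactions fall in each outer class and half in the medium class, so the three categories are not balanced.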
(2.4) selecting some of the additives (the 18th, 19th, 21st, and 22nd in FIG. 1) for out-of-sample prediction with the trained model; if the out-of-sample prediction is effective, the validity of the model is verified, showing that the method can effectively predict the yield of the coupling reaction.
(2.5) the user can adjust the parameters according to the prediction effect and the user's own requirements and check the prediction results of steps 2) and 3); if the results are unsatisfactory, the user can adjust the type and number of forests in the deep forest, the number of decision trees in each forest, and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
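The out-of-sample check in step (2.4) amounts to holding out every reaction that uses one of the named additives, so the model is tested on additives it never saw during training. A minimal sketch follows; the record layout and the placeholder yields are hypothetical, and only the additive numbers come from the text.

```python
HELD_OUT_ADDITIVES = {18, 19, 21, 22}  # additive numbers named in the text


def split_by_additive(reactions, held_out):
    """Put reactions with a held-out additive in the test set, the rest in training."""
    train = [r for r in reactions if r["additive"] not in held_out]
    test = [r for r in reactions if r["additive"] in held_out]
    return train, test


# Hypothetical records: one reaction per additive 1..23 with placeholder yields.
reactions = [{"additive": a, "yield": float(a)} for a in range(1, 24)]
train, test = split_by_additive(reactions, HELD_OUT_ADDITIVES)
print(len(train), len(test))
```

Unlike a random split, this grouping prevents reactions sharing an additive from appearing on both sides, which is what makes the prediction genuinely out of sample.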
Step 3), the intelligent regression and classification prediction of the yield, trains on the feature descriptors with the deep forest algorithm so that the machine automatically learns useful data and features, and achieves the best prediction effect through user-adjusted parameters, thereby predicting the value or the category of the yield. The specific steps are:
(3.1) intelligent regression prediction of the yield, namely feeding the training set and the corresponding yields into the cascade layers for feature learning; when the model stops training, the predictions of every random forest in the last cascade layer are averaged to obtain the final prediction result, and the coefficient of determination (R-squared, $R^2$) and the root mean square error (RMSE) of the regression prediction are output to evaluate the prediction effect of the model;
(3.2) intelligent classification prediction of the yield, namely feeding the training set and the corresponding yield categories into the cascade layers for feature learning; when the model stops training, the class probability vectors output by all random forests in the last layer are averaged, the class with the maximum probability being the final predicted class, and the classification accuracy and the kappa statistic of the classification prediction are output to evaluate the prediction effect of the model;
(3.3) calculating the importance of the one-dimensional feature descriptors with the deep forest algorithm and ranking them by importance; the ranking identifies the descriptors that significantly influence the reaction yield, so that the underlying patterns can be mined and analyzed, providing reliable decision information for the user carrying out the organic chemical coupling reaction.
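The four evaluation measures named in steps (3.1) and (3.2) have standard definitions, sketched below in plain Python. The kappa statistic is taken here to be Cohen's kappa, which is an assumption since the text does not name the variant; the example values are invented.

```python
import math
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(c_true, c_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(c_true, c_pred)) / len(c_true)


def cohen_kappa(c_true, c_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(c_true)
    p_o = accuracy(c_true, c_pred)
    true_counts, pred_counts = Counter(c_true), Counter(c_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


# Invented example values.
y_true, y_pred = [10, 20, 30, 40], [12, 18, 33, 37]
c_true = ["low", "low", "medium", "medium", "high", "high"]
c_pred = ["low", "medium", "medium", "medium", "high", "high"]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
print(accuracy(c_true, c_pred), cohen_kappa(c_true, c_pred))
```

Kappa is lower than accuracy in this example because part of the observed agreement would be expected by chance given the class frequencies.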
The deep forest model adapts its complexity through training, and its hyper-parameters are robust, so good results can be obtained even with default parameters. The user can also adjust the parameters and check the results on the test set; once the results are satisfactory, parameter adjustment stops and the prediction results are output.
Compared with traditional machine learning algorithms, the deep forest algorithm gives more accurate predictions: it combines the feature-learning idea of deep learning, so that the machine automatically learns useful data and features, and it uses ensemble learning to increase model complexity and improve prediction accuracy. Compared with general deep learning algorithms, the deep forest has fewer hyper-parameters and is robust to their values, so a good prediction result can be obtained even with default parameters, and it performs better on small samples. Cross-validation is used each time a cascade layer is generated, which reduces the risk of overfitting. The deep forest can also be computed in parallel, so the time required to run it on a single machine is similar to the time required to run a deep neural network with GPU acceleration, which improves the efficiency of the algorithm.
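The role of cross-validation in cascade generation can be illustrated for a single forest in a single layer: each training sample's class probability vector is produced by a model that was fitted without that sample, and these out-of-fold vectors are what is concatenated with the original features. The sketch assumes scikit-learn, three folds, and synthetic data; none of these specifics come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X0 = rng.normal(size=(150, 4))            # synthetic original features x_0
c = np.digitize(X0[:, 0], [-0.5, 0.5])    # three synthetic classes: 0, 1, 2

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold class vectors: each row is predicted by a forest that never saw it.
oof_probs = cross_val_predict(forest, X0, c, cv=3, method="predict_proba")

# Input to the next cascade layer: class vectors concatenated with x_0.
x_next = np.hstack([oof_probs, X0])
print(oof_probs.shape, x_next.shape)
```

Passing in-sample predictions instead would let later layers see targets that the forests have effectively memorized, which is the overfitting the cross-validation step is meant to limit.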
Simulation experiment:
The method of the invention is further illustrated by simulation experiments, taking the Buchwald-Hartwig coupling reaction as an example (the reaction equation is shown in FIG. 1). First, the descriptors of each reaction component are calculated and extracted with Spartan (commercial software); each group of reaction components gives 120 feature descriptors, comprising 64 atomic descriptors, 28 molecular descriptors, and 28 vibrational descriptors.
The feature descriptors and the corresponding yields are fed into the model for regression prediction; the simulation results are shown in Table 1.
Table 1. Comparison of regression prediction results
[Table 1 appears as an image in the source; its values are not recoverable from the text.]
The evaluation compares the prediction accuracy of the deep forest with that of five machine learning algorithms: linear regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), neural network (NN), and random forest (RF). The experimental results show that the $R^2$ of the deep forest is greater than that of the other five algorithms, indicating that its goodness of fit is the best among the six algorithms, and that the RMSE of the deep forest is 6.8, lower than that of the other five algorithms, indicating that its regression error is the smallest. In conclusion, the regression prediction of the deep forest is better than that of general machine learning algorithms, and the reaction yield can be predicted with high accuracy.
Taking the Buchwald-Hartwig coupling reaction as an example (the reaction equation is shown in FIG. 1), the feature descriptors and the corresponding yield categories are fed into the model for classification prediction; the simulation results are shown in Table 2.
Table 2. Comparison of classification prediction results
[Table 2 appears as an image in the source; its values are not recoverable from the text.]
As shown in Table 2, the evaluation compares the deep forest with five machine learning algorithms: logistic regression, k-nearest neighbors, support vector machine, neural network, and random forest. The experimental results show that the classification accuracy of the deep forest is 88.37%, the highest among the six algorithms, and that its kappa statistic is 0.813, higher than that of the other five algorithms and above 0.8, a level conventionally interpreted as almost perfect agreement between the predicted and true classes. In conclusion, the classification prediction of the deep forest is better than that of general machine learning algorithms, and the category of the coupling reaction yield can be predicted with higher accuracy.

Claims (6)

1. A method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest, characterized by comprising the following steps: 1) calculating feature descriptors; 2) building a model; 3) intelligent regression and classification prediction of yield;
calculating feature descriptors: the feature descriptors of each coupling reaction component are calculated using chemical software and converted into one-dimensional data for subsequent model training;
building a model: a deep forest model is built and trained on the feature descriptors, and the best prediction performance is achieved by adjusting the parameters;
intelligent regression and classification prediction of yield: the yield is predicted using the trained deep forest model and the results are analyzed, including yield prediction analysis and significance analysis of the feature descriptors.
2. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, characterized in that step 1), calculating the feature descriptors, specifically comprises:
(1.1) importing the chemical reactants and reagents into chemical software, which automatically calculates the feature descriptors of each coupling reaction component and converts the chemical reactants into one-dimensional data;
(1.2) dividing the feature descriptors in the one-dimensional data into a training set and a test set, and matching each set with the corresponding yields or yield categories.
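Step (1.2) can be sketched as follows. The split ratio, the random seed and the bin edges that turn a yield into a category are not given in the source; the values used here (`test_fraction=0.3`, edges at 25/50/75 percent) and the function names are assumptions chosen only to make the example runnable.

```python
import random

def split_dataset(descriptors, yields, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split (descriptor, yield) pairs into train and test sets."""
    indices = list(range(len(descriptors)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(descriptors[i], yields[i]) for i in train_idx]
    test = [(descriptors[i], yields[i]) for i in test_idx]
    return train, test

def yield_category(yield_percent, edges=(25.0, 50.0, 75.0)):
    """Map a yield to a class index 0..len(edges); the bin edges are illustrative."""
    return sum(yield_percent >= e for e in edges)

# Hypothetical one-dimensional descriptor vectors and yields.
descriptors = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]
yields = [5.0 + 10.0 * i for i in range(10)]

train, test = split_dataset(descriptors, yields)
print(len(train), "training samples,", len(test), "test samples")
print([yield_category(y) for _, y in test])
```

For regression the pairs are used as they are; for classification each yield would first be replaced by its category index.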
3. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, characterized in that step 2), building the model, comprises:
(2.1) reading and preprocessing the training set, and selecting a deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction of the coupling reaction yield using the deep forest algorithm: the training set is imported into the cascade layers for feature learning; the predictions obtained by each random forest in a cascade layer are concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the mean squared error of the whole cascade is estimated on a validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically; the predicted values of all random forests in the last layer are averaged to obtain the final predicted value, and the prediction result is output; the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.3) performing classification prediction of the category of the coupling reaction yield using the deep forest algorithm: the training set is imported into the cascade layers for feature learning; the class probability vectors obtained by each random forest in a cascade layer are concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the prediction accuracy of the whole cascade is estimated on a validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically; the class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.4) performing out-of-sample prediction with the trained model; if the out-of-sample predictions are good, the validity of the model is verified;
(2.5) the user may adjust the parameters according to the prediction performance and their own requirements; if not satisfied, the user may adjust the type and number of forests in the deep forest, the number of decision trees in each forest and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
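The layer-by-layer growth with validation-based stopping described in step (2.2) can be sketched as below. This is a simplified illustration under several assumptions, not the patented implementation: `fit_standin_forest` is a hypothetical stand-in (k-nearest-neighbor regression on a bootstrap sample) used so the example needs no third-party library, where a real system would train random forests; the parameters `n_forests`, `max_layers` and `tol` are invented names; and the layer outputs for the training set are computed directly rather than by cross-validation.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_standin_forest(X, y, seed, k=3):
    """Hypothetical stand-in for a random forest: k-NN regression on a bootstrap sample."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(X)) for _ in X]            # bootstrap sample
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    def predict(row):
        ranked = sorted((sum((a - b) ** 2 for a, b in zip(row, xr)), yi)
                        for xr, yi in zip(Xb, yb))
        nearest = ranked[:k]
        return sum(yi for _, yi in nearest) / len(nearest)
    return predict

def grow_cascade(X_tr, y_tr, X_val, y_val, n_forests=2, max_layers=5, tol=1e-3):
    """Add cascade layers until the validation MSE stops improving or max_layers is hit."""
    layers, best_mse = [], float("inf")
    in_tr, in_val = X_tr, X_val
    for depth in range(max_layers):
        forests = [fit_standin_forest(in_tr, y_tr, seed=depth * n_forests + j)
                   for j in range(n_forests)]
        out_tr = [[f(r) for f in forests] for r in in_tr]
        out_val = [[f(r) for f in forests] for r in in_val]
        layer_mse = mse(y_val, [sum(o) / len(o) for o in out_val])
        if best_mse - layer_mse <= tol:                 # no significant gain: discard layer, stop
            break
        layers.append(forests)
        best_mse = layer_mse
        # Next layer input: this layer's forest outputs concatenated with the original features.
        in_tr = [o + x for o, x in zip(out_tr, X_tr)]
        in_val = [o + x for o, x in zip(out_val, X_val)]
    return layers, best_mse

def cascade_predict(layers, row):
    """Pass one sample through the cascade and average the last layer's forest outputs."""
    current = list(row)
    for forests in layers:
        outputs = [f(current) for f in forests]
        current = outputs + list(row)
    return sum(outputs) / len(outputs)

# Synthetic data: two descriptors and a yield between 0 and 100 (not from the patent).
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(40)]
y = [100.0 * (0.7 * a + 0.3 * b) for a, b in X]
layers, val_mse = grow_cascade(X[:30], y[:30], X[30:], y[30:])
print("layers kept:", len(layers), "validation MSE:", val_mse)
print("prediction:", cascade_predict(layers, [0.5, 0.5]))
```

The classification variant of step (2.3) would follow the same loop, with class probability vectors in place of scalar predictions and validation accuracy in place of mean squared error.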
4. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, characterized in that step 3), intelligent regression and classification prediction of yield, specifically comprises:
(3.1) intelligent regression prediction of yield: the training set and the corresponding yields are imported into the cascade layers for feature learning, and when the model stops training, the predicted values of all random forests in the last cascade layer are averaged to obtain the final prediction result;
(3.2) intelligent classification prediction of yield: the training set and the corresponding yield categories are imported into the cascade layers for feature learning, and when the model stops training, the class probability vectors output by all random forests in the last layer are averaged, the class with the maximum probability being the final predicted class;
(3.3) calculating the importance ranking of the feature descriptors with the deep forest algorithm, thereby identifying the descriptors that have a significant influence on the reaction yield and providing reliable decision information for the user carrying out the organic chemical coupling reaction.
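The source does not say how the deep forest computes the importance scores in step (3.3). As one model-agnostic possibility, the sketch below uses permutation importance, a swapped-in technique rather than the patent's own: each descriptor column is shuffled in turn and the resulting increase in mean squared error is taken as that descriptor's score. The model, descriptor names and data are hypothetical.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, seed=0):
    """Score each feature by the MSE increase after shuffling its column."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in X])
    scores = []
    for j in range(len(X[0])):
        column = [r[j] for r in X]
        rng.shuffle(column)
        X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
        scores.append(mse(y, [predict(r) for r in X_perm]) - baseline)
    return scores

def rank_descriptors(names, scores):
    """Sort descriptor names from most to least important."""
    return sorted(zip(names, scores), key=lambda item: item[1], reverse=True)

# Hypothetical trained model that depends only on the first descriptor.
model = lambda row: 3.0 * row[0]
X = [[float(i), float(10 - i)] for i in range(8)]
y = [3.0 * row[0] for row in X]

scores = permutation_importance(model, X, y)
ranking = rank_descriptors(["descriptor_a", "descriptor_b"], scores)
print(ranking)
```

A descriptor the model ignores receives a score of zero, so it falls to the bottom of the ranking.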
5. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 3, characterized in that the specific calculation process of step (2.2) comprises:
the deep forest model has $K$ cascade layers, each consisting of $L$ forests, and the training samples input to the $k$-th cascade layer are $(x_k, y)$, $k = 0, 1, \ldots, K$, where $x_k$ denotes the feature vector of a training sample input to the $k$-th cascade layer and $y$ denotes the true yield corresponding to each feature vector; the input feature $x_k$ received by the $k$-th cascade layer is the concatenation of the original feature $x_0$ with the output of the $(k-1)$-th cascade layer, so the combined feature is expressed as:
$$x_k = (f_k(x_{k-1}), x_0),$$
where $f_k(x)$ denotes the real value obtained by training the feature $x$ through the $k$-th level;
the final predicted value is the average of the predicted values of all forests in the last cascade layer:
$$\hat{y} = \frac{1}{L} \sum_{l=1}^{L} f_K^{(l)}(x_{K-1}),$$
where $f_K^{(l)}$ denotes the predicted value of the $l$-th forest in the last cascade layer.
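A small worked example may help in reading the recursion of this claim: each layer's forest outputs are placed in front of the original features to form the next input, and the last layer's outputs are averaged. The "forests" below are hypothetical fixed functions standing in for trained models, chosen so the arithmetic can be followed by hand.

```python
def forward(x0, layers):
    """Apply x_k = (f_k(x_{k-1}), x_0) layer by layer, then average the last layer."""
    x_prev = list(x0)
    for forests in layers:
        outputs = [forest(x_prev) for forest in forests]  # one predicted yield per forest
        x_prev = outputs + list(x0)                       # forest outputs, then original features
    return sum(outputs) / len(outputs)

# Hypothetical stand-ins for trained forests (each maps a feature vector to a yield).
layer1 = [lambda x: 10.0 * x[0], lambda x: 10.0 * x[1]]
layer2 = [lambda x: 0.5 * (x[0] + x[1]), lambda x: x[0] + x[3]]

prediction = forward([2.0, 4.0], [layer1, layer2])
print(prediction)
```

With the original features (2, 4), the first layer outputs (20, 40), so the second layer receives (20, 40, 2, 4); its two forests return 30 and 24, and the final prediction is their mean.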
6. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 3, characterized in that the specific calculation process of step (2.3) comprises:
the deep forest model has $M$ cascade layers, each consisting of $N$ forests, and the training samples input to the $m$-th cascade layer are $(x_m, C)$, $m = 0, 1, \ldots, M$, where $x_m$ denotes the feature vector of a training sample input to the $m$-th cascade layer and $C$ denotes the category corresponding to each feature vector; the input feature $x_m$ received by the $m$-th cascade layer is the concatenation of the original feature $x_0$ with the class probability vector of the $(m-1)$-th cascade layer, so the combined feature is expressed as:
$$x_m = (p_m(x_{m-1}), x_0),$$
where $p_m(x)$ denotes the class probability vector obtained by training the feature $x$ through the $m$-th level;
the final class probability vector is the average of the prediction probabilities of all forests in the last cascade layer:
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} p_M^{(n)}(x_{M-1}),$$
where $p_M^{(n)}$ denotes the class probability vector output by the $n$-th forest in the last cascade layer;
if the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \ldots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class:
$$\hat{C} = \arg\max_{i \in \{1, \ldots, c\}} p_i(x).$$
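The classification pass of this claim can be traced with a toy example: every forest returns a class probability vector, the vectors of a layer are flattened and placed in front of the original features, and the last layer's vectors are averaged before taking the most probable class. The forests and class names below are hypothetical placeholders, not trained models.

```python
def predict_class(x0, layers, class_names):
    """Cascade forward pass for classification: average last-layer probabilities, take argmax."""
    x_prev = list(x0)
    for forests in layers:
        prob_vectors = [forest(x_prev) for forest in forests]
        # Next input: all class probability vectors flattened, then the original features.
        x_prev = [p for vec in prob_vectors for p in vec] + list(x0)
    n = len(prob_vectors)
    mean_probs = [sum(vec[i] for vec in prob_vectors) / n
                  for i in range(len(class_names))]
    best = max(range(len(class_names)), key=lambda i: mean_probs[i])
    return class_names[best], mean_probs

# Hypothetical stand-ins for trained forests (each returns [P(low), P(high)]).
layer1 = [lambda x: [0.6, 0.4] if x[0] > 1.0 else [0.2, 0.8],
          lambda x: [0.5, 0.5]]
layer2 = [lambda x: [x[0], 1.0 - x[0]],
          lambda x: [0.25, 0.75]]

label, mean_probs = predict_class([2.0], [layer1, layer2], ["low", "high"])
print(label, mean_probs)
```

Here the second layer's forests return (0.6, 0.4) and (0.25, 0.75); their mean is (0.425, 0.575), so the predicted class is the second one.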
CN202110761921.XA 2021-06-08 2021-07-06 Organic chemical coupling reaction yield prediction and analysis method based on deep forest Pending CN113380345A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021106387357 2021-06-08
CN202110638735 2021-06-08

Publications (1)

Publication Number Publication Date
CN113380345A true CN113380345A (en) 2021-09-10

Family

ID=77581058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761921.XA Pending CN113380345A (en) 2021-06-08 2021-07-06 Organic chemical coupling reaction yield prediction and analysis method based on deep forest

Country Status (1)

Country Link
CN (1) CN113380345A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728391A (en) * 2018-07-17 2020-01-24 广西大学 Depth regression forest short-term load prediction method based on expandable information
CN110516706A (en) * 2019-07-19 2019-11-29 中国人寿保险股份有限公司 A kind of improved depth forest method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Derek T. Ahneman et al.: "Predicting reaction performance in C–N cross-coupling using machine learning", Science *
Zhi-Hua Zhou et al.: "Deep Forest: Towards an Alternative to Deep Neural Networks", arXiv *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990405A (en) * 2021-10-19 2022-01-28 上海药明康德新药开发有限公司 Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent
CN113990405B (en) * 2021-10-19 2024-05-31 上海药明康德新药开发有限公司 Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent
CN115577858A (en) * 2022-11-21 2023-01-06 山东能源数智云科技有限公司 Block chain-based carbon emission prediction method and device and electronic equipment
CN116003260A (en) * 2023-03-27 2023-04-25 广州国家实验室 Method for preparing 1-naphthylamine compound from urea derivative and prediction model thereof
CN118486387A (en) * 2024-07-09 2024-08-13 江苏隆昌化工有限公司 Continuous acylation feedback regulation system based on artificial intelligence for 2, 4-dichloroacetophenone production

Similar Documents

Publication Publication Date Title
CN113380345A (en) Organic chemical coupling reaction yield prediction and analysis method based on deep forest
CN101105841B (en) Method for constructing gene controlled subnetwork by large scale gene chip expression profile data
CN110910951A (en) Method for predicting protein and ligand binding free energy based on progressive neural network
US20050288871A1 (en) Estimating the accuracy of molecular property models and predictions
CN115240772B (en) Method for analyzing single cell pathway activity based on graph neural network
CN112885415B (en) Quick screening method for estrogen activity based on molecular surface point cloud
CN115762654A (en) Smiles-based chemical reaction yield prediction method
CN115458039A (en) Single-sequence protein structure prediction method and system based on machine learning
CN111461286A (en) Spark parameter automatic optimization system and method based on evolutionary neural network
CN110853703A (en) Semi-supervised learning prediction method for protein secondary structure
CN117541095A (en) Agricultural land soil environment quality classification method
CN103605493A (en) Parallel sorting learning method and system based on graphics processing unit
Murayama et al. Characterizing reaction route map of realistic molecular reactions based on weight rank clique filtration of persistent homology
CN110516792A (en) Non-stable time series forecasting method based on wavelet decomposition and shallow-layer neural network
CN113380346A (en) Coupling reaction yield intelligent prediction method based on attention convolution neural network
CN117497038A (en) Method for rapidly optimizing culture medium formula based on nuclear method
CN116108963A (en) Electric power carbon emission prediction method and equipment based on integrated learning module
WO2023178118A1 (en) Directed evolution of molecules by iterative experimentation and machine learning
CN114861759B (en) Distributed training method for linear dynamic system model
CN107607723A (en) A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier
CN113517033B (en) XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment
CN112634993A (en) Prediction model and screening method for activation activity of estrogen receptor of chemicals
CN112925202B (en) Fermentation process stage division method based on dynamic feature extraction
Ramachandran et al. CLAMP-ViT: contrastive data-free learning for adaptive post-training quantization of ViTs
CN118395880B (en) Intermittent process temperature prediction method of multi-stage fusion cyclic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210910)