CN113380345A - Organic chemical coupling reaction yield prediction and analysis method based on deep forest - Google Patents
- Publication number
- CN113380345A (application number CN202110761921.XA)
- Authority
- CN
- China
- Prior art keywords
- yield
- cascade
- prediction
- layer
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/10—Analysis or design of chemical reactions, syntheses or processes
Abstract
The invention discloses a deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions. The method comprises calculating feature descriptors, building a model, and performing regression and classification prediction of the yield. Specifically: 1) a feature descriptor is calculated for each coupling reaction component using chemical software and converted into one-dimensional data; 2) a deep forest model is built and trained on the feature descriptors, and the best prediction effect is reached through user-adjusted parameters; the method combines the feature learning idea of deep learning with ensemble learning to predict chemical reactions efficiently; 3) the trained model is used for regression and classification prediction of the yield, and the prediction results are analyzed; the importance of the one-dimensional feature descriptors is calculated and their influence on the yield is analyzed, giving users more reliable decision information for production experiments. The method can help chemists predict yields quickly while saving cost.
Description
Technical Field
The invention belongs to the field of organic synthesis based on pattern recognition and artificial intelligence, and particularly relates to a method for predicting and analyzing yield of organic chemical coupling reaction based on deep forest.
Background
A coupling reaction is a process in which two organic chemical units (molecules) undergo a chemical reaction to give a single organic molecule; it includes cross-coupling and self-coupling reactions. Cross-coupling reactions are efficient and proceed under mild conditions, so they are widely used in organic synthesis. As atoms and molecules can be manipulated more and more efficiently, materials that were once difficult or even impossible to synthesize can now be made easily. Coupling reactions are mainly used in natural product synthesis, materials science, pesticide chemistry, ligand synthesis, and related fields. Increasing the yield of coupling reactions therefore benefits both production and everyday life. To prepare coupling products effectively while reducing consumption, the yield of the coupling reaction needs to be predicted more accurately, the important factors that influence the yield need to be explored, and more reliable decision information needs to be provided for production experiments.
Over the past few decades, coupling reactions have advanced rapidly. In 1903, Ullmann's group constructed a C-N bond through the coupling of an aryl halide with an amine; in 1972, Richard F. Heck discovered that palladium catalysts could link carbon atoms under milder conditions; in 1983, Migita et al. reported the first palladium-catalyzed reaction forming a C(sp2)-N bond; in 1995, the research groups of Stephen L. Buchwald and John F. Hartwig almost simultaneously discovered a palladium-catalyzed coupling of aryl bromides with amines that does not involve organotin compounds; in 2010, Richard F. Heck, Ei-ichi Negishi, and Akira Suzuki were awarded the Nobel Prize in Chemistry for "palladium-catalyzed cross couplings in organic synthesis".
Palladium catalysts are now widely used. Although these catalysts are commercially available, the preparation cost still needs to be reduced for large-scale production. Traditional experimental chemistry suffers from high reaction cost, long reaction cycles, complicated procedures, and poor use of experimental data. When arylamines are prepared by the Buchwald-Hartwig coupling reaction, palladium is expensive and toxic, and by-products such as aromatic compounds can form during the reaction, which lowers the yield. To address these problems, chemists hope to find a scientific, intelligent way to predict organic synthesis.
In recent years, with the rapid development of machine learning, and given the multidimensional nature of chemical structure and reactivity, more and more researchers have applied machine learning algorithms to organic synthesis and chemical property prediction in order to improve coupling reaction yields while saving resources. In 2018, Doyle et al. achieved high-accuracy prediction of Buchwald-Hartwig coupling reaction yields with a random forest algorithm. This showed that machine learning can predict reaction outcomes in a multidimensional chemical space using data from high-throughput experiments. A method for predicting and analyzing the yield of organic chemical coupling reactions is therefore urgently needed: one that extracts feature descriptors, converts coupling reactions into one-dimensional data, and uses machine learning on a computer to quickly mine the correlations among complex reaction conditions, thereby reducing the consumption of human and chemical resources, helping chemists make sound analyses and predictions, and advancing research and development in organic synthesis.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions. Through user-adjusted parameters, the method quickly reaches the best prediction of coupling reaction yield with high accuracy, identifies the important features that influence the yield, and helps chemists predict yields while saving cost.
To achieve this purpose, the invention adopts the following technical scheme:
The deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions comprises calculating feature descriptors, building a model, and performing regression and classification prediction of the yield:
1) calculating the feature descriptors: a feature descriptor is calculated for each coupling reaction component using chemical software and converted into one-dimensional data, so that the model can be trained on it later;
2) building the model: a deep forest model is built and trained on the feature descriptors, and the best prediction effect is reached through user-adjusted parameters; the method combines the feature learning idea of deep learning with ensemble learning to predict chemical reactions efficiently;
3) regression and classification prediction of the yield: the trained deep forest model is used to predict the yield and the results are analyzed, including yield prediction analysis and importance analysis of the feature descriptors.
The calculation of the feature descriptors specifically comprises the following steps:
(1.1) importing the chemical reactants and reagents through the interface of the chemical software, which automatically calculates the feature descriptor of each coupling reaction component and converts the chemical reactants into one-dimensional data;
(1.2) dividing the one-dimensional feature descriptors into a training set and a test set, and matching each with the corresponding yield or the category to which the yield belongs.
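As a minimal sketch of steps (1.1) and (1.2), the one-dimensional data for a reaction can be viewed as the concatenation of the descriptor lists of its components, paired with the measured yield. The descriptor tables, component identifiers, and numeric values below are hypothetical placeholders and are not the output of any particular chemical software.

```python
# Hypothetical per-component descriptor tables (all values are placeholders).
DESCRIPTORS = {
    "additive": {"A1": [0.12, 1.50], "A2": [0.30, 1.10]},
    "aryl_halide": {"H1": [2.00, 0.70, 0.05]},
    "base": {"B1": [9.10]},
    "ligand": {"L1": [0.44, 0.81]},
}
# Fixed component order so every reaction maps to the same feature layout.
COMPONENT_ORDER = ("additive", "aryl_halide", "base", "ligand")


def reaction_to_vector(reaction):
    """Concatenate the descriptors of each component into one flat list."""
    vector = []
    for role in COMPONENT_ORDER:
        vector.extend(DESCRIPTORS[role][reaction[role]])
    return vector


reaction = {"additive": "A2", "aryl_halide": "H1", "base": "B1", "ligand": "L1"}
x = reaction_to_vector(reaction)  # one-dimensional feature descriptor
sample = (x, 63.5)                # paired with a made-up yield in percent
```

Keeping the component order fixed matters because each position in the flat vector must refer to the same descriptor for every reaction.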
The model building specifically comprises the following steps:
(2.1) reading and preprocessing the training set, and selecting a deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction on the yield of the coupling reaction with the deep forest algorithm: the training set is fed into the cascade layers for feature learning; the prediction produced by each random forest in a cascade layer is concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the mean square error of the whole cascade is estimated on a validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically. The predicted values of all random forests in the last layer are averaged to give the final predicted value; the prediction result is then output, the best prediction result is selected by adjusting parameters, and the model is saved;
(2.3) performing classification prediction on the category to which the yield of the coupling reaction belongs with the deep forest algorithm: the training set is fed into the cascade layers for feature learning; the class probability vector produced by each random forest in a cascade layer is concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the prediction accuracy of the whole cascade is estimated on a validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically. The class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum class probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting parameters, and the model is saved;
(2.4) performing out-of-sample prediction with the trained model; if the out-of-sample prediction is effective, the validity of the model is verified, showing that the method can effectively predict the yield of the coupling reaction;
(2.5) the user can adjust the parameters according to the prediction effect and their own needs; if not satisfied, the user can adjust the type and number of forests in the deep forest, the number of decision trees in each forest, and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
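The cascade growth and stopping rule of step (2.2) can be illustrated with a small standard-library sketch. It is not the patented implementation. `StumpEnsemble` is a hypothetical stand-in for a random forest, used only so the example runs without external libraries. All function names, the tolerance `tol`, and the synthetic data are assumptions. As a simplification, each layer passes its in-sample predictions to the next layer.

```python
import random
from statistics import mean


class StumpEnsemble:
    """Hypothetical stand-in for a random forest: bagged one-split stumps."""

    def __init__(self, n_stumps=25, seed=0):
        self.n_stumps = n_stumps
        self.rng = random.Random(seed)
        self.stumps = []

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        self.stumps = []
        for _ in range(self.n_stumps):
            idx = [self.rng.randrange(n) for _ in range(n)]  # bootstrap sample
            j = self.rng.randrange(d)                        # random feature
            t = X[self.rng.choice(idx)][j]                   # random threshold
            left = [y[i] for i in idx if X[i][j] <= t]       # never empty
            right = [y[i] for i in idx if X[i][j] > t]
            lo = mean(left)
            self.stumps.append((j, t, lo, mean(right) if right else lo))
        return self

    def predict(self, X):
        return [mean(lo if x[j] <= t else hi for j, t, lo, hi in self.stumps)
                for x in X]


def augment(forest_outputs, X0):
    """Build x_k = (f_k(x_{k-1}), x_0): forest outputs, then original features."""
    return [list(outs) + list(x0) for outs, x0 in zip(zip(*forest_outputs), X0)]


def mse(y_true, y_pred):
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def fit_cascade(X_tr, y_tr, X_val, y_val, n_forests=2, max_layers=5, tol=1e-9):
    layers, best_err = [], float("inf")
    cur_tr, cur_val = X_tr, X_val
    for k in range(max_layers):                 # set maximum number of layers
        forests = [StumpEnsemble(seed=100 * k + i).fit(cur_tr, y_tr)
                   for i in range(n_forests)]
        out_val = [f.predict(cur_val) for f in forests]
        err = mse(y_val, [mean(col) for col in zip(*out_val)])
        if best_err - err <= tol:               # no significant gain: stop
            break
        layers.append(forests)
        best_err = err
        cur_tr = augment([f.predict(cur_tr) for f in forests], X_tr)
        cur_val = augment(out_val, X_val)
    return layers, best_err


def predict_cascade(layers, X):
    cur, outs = X, []
    for forests in layers:
        outs = [f.predict(cur) for f in forests]
        cur = augment(outs, X)
    return [mean(col) for col in zip(*outs)]    # average over last-layer forests


rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [100 * a for a, _ in X]                     # synthetic "yield" in percent
layers, val_mse = fit_cascade(X[:40], y[:40], X[40:], y[40:])
pred = predict_cascade(layers, X[40:])
```

In practice a real random forest implementation would replace `StumpEnsemble`; the cascade logic only needs objects with `fit` and `predict`.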
The specific calculation process of step (2.2) is as follows:
the deep forest model has $K$ cascade layers, each consisting of $L$ forests, and the training samples input to the $k$-th cascade layer are $(x_k, y)$, $k = 0, 1, \dots, K$, where $x_k$ denotes the feature vector of a training sample input to the $k$-th cascade layer and $y$ denotes the true yield corresponding to each feature vector. The input feature $x_k$ received by the $k$-th cascade layer is the concatenation of the original feature $x_0$ with the output of the $(k-1)$-th cascade layer, so the combined feature is expressed as:
$x_k = (f_k(x_{k-1}), x_0)$,
where $f_k(x)$ denotes the real-valued output obtained for feature $x$ from training at the $k$-th cascade level.
The final predicted value is the average of the predicted values of all forests in the last cascade layer.
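One way to write this average, offered as a hedged reconstruction rather than the original formula: assume $f_K^{(l)}$ (a notation not used in the original) denotes the prediction of the $l$-th of the $L$ forests in the last cascade layer, and $\hat{y}$ denotes the final predicted yield.

```latex
\hat{y} = \frac{1}{L} \sum_{l=1}^{L} f_K^{(l)}(x_{K-1})
```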
The specific calculation process of step (2.3) is as follows:
the deep forest model has $M$ cascade layers, each consisting of $N$ forests, and the training samples input to the $m$-th cascade layer are $(x_m, C)$, $m = 0, 1, \dots, M$, where $x_m$ denotes the feature vector of a training sample input to the $m$-th cascade layer and $C$ denotes the class corresponding to each feature vector. The input feature $x_m$ received by the $m$-th cascade layer is the concatenation of the original feature $x_0$ with the class probability vector of the $(m-1)$-th cascade layer, so the combined feature is expressed as:
$x_m = (p_m(x_{m-1}), x_0)$,
where $p_m(x)$ denotes the class probability vector obtained for feature $x$ from training at the $m$-th cascade level.
The final class probability vector is the average of the predicted probabilities of all forests in the last cascade layer.
If the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \dots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class.
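These two expressions could be written as follows. This is a hedged reconstruction: $p_M^{(n)}$ (a notation not used in the original) is assumed to denote the class probability vector output by the $n$-th of the $N$ forests in the last cascade layer, and $\hat{C}$ the predicted class.

```latex
p(x) = \frac{1}{N} \sum_{n=1}^{N} p_M^{(n)}(x_{M-1}), \qquad
\hat{C} = \arg\max_{i \in \{1, \dots, c\}} p_i(x)
```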
3) The regression and classification prediction of the yield specifically comprises the following steps:
(3.1) regression prediction of the yield: the training set and the corresponding yields are fed into the cascade layers for feature learning; when the model stops training, the predicted values of all random forests in the last cascade layer are averaged to give the final prediction result;
(3.2) classification prediction of the yield: the training set and the corresponding yield categories are fed into the cascade layers for feature learning; when the model stops training, the class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum class probability is the final predicted class;
(3.3) calculating the importance ranking of the one-dimensional feature descriptors with the deep forest algorithm, thereby finding the descriptors that significantly influence the reaction yield and providing reliable decision information for users carrying out organic chemical coupling reactions.
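The text does not say how descriptor importance is computed inside the deep forest. As an illustration of the ranking idea in step (3.3), the sketch below uses permutation importance, a model-agnostic technique swapped in here as an assumption: a descriptor is important if shuffling its column increases the prediction error. The toy model, data, and descriptor names are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in squared error when one descriptor column is shuffled."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((p - t) ** 2 for p, t in zip(predict(rows), y)) / len(y)

    baseline = mse(X)
    scores = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [list(row[:j]) + [v] + list(row[j + 1:])
                        for row, v in zip(X, column)]
            increase += mse(shuffled) - baseline
        scores.append(increase / n_repeats)
    return scores


def rank_descriptors(names, scores):
    """Sort descriptors from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Toy model that only uses descriptor 0 (hypothetical).
X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
toy_predict = lambda rows: [2.0 * row[0] for row in rows]
ranking = rank_descriptors(
    ["descriptor_0", "descriptor_1"],
    permutation_importance(toy_predict, X, y),
)
```

Because the toy model ignores the second descriptor, shuffling it leaves the error unchanged and it ranks last.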
The invention has the following beneficial effects:
1. To address the lack of feature learning in traditional machine learning when predicting the yield of the Buchwald-Hartwig coupling reaction, the invention provides an intelligent prediction method based on deep forest. The model borrows the feature learning idea of deep learning, so that the machine learns useful data and features automatically through the algorithm, and it uses ensemble learning to increase model complexity and thereby improve prediction accuracy; the user can adjust parameters to reach the best prediction effect. The importance ranking of the feature descriptors is also calculated, providing reliable decision information for users carrying out organic chemical coupling reactions. The method can help chemists make sound analyses and predictions and carry out organic synthesis quickly while saving cost.
2. The deep forest algorithm adaptively adjusts the complexity of the model through training, is robust to its hyper-parameters, and can give good results even with default parameters.
3. The deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions is simple to operate and easy to implement, and the user can quickly obtain a relatively accurate analysis result.
Drawings
FIG. 1 is a diagram of the reaction equations and reaction components of a chemical reaction in an example of the present invention;
FIG. 2 is a flow chart of an analysis method of the present invention.
Labels in FIG. 1: the Buchwald-Hartwig coupling reaction equation and its reaction components, namely aryl halide, base, ligand, and additive.
Detailed Description
As shown in FIG. 1, the invention provides a deep-forest-based method for predicting and analyzing the yield of organic chemical coupling reactions, which comprises 1) calculation of feature descriptors, 2) construction of a model, and 3) regression and classification prediction of the yield.
Step 1), the calculation of the feature descriptors, specifically comprises:
(1.1) importing all reaction components of the Buchwald-Hartwig coupling reaction shown in FIG. 1 (23 additives, 15 aryl halides, 3 bases, and 4 ligands) into chemical software, which automatically calculates and extracts the one-dimensional descriptors of each reaction component, giving 120 feature descriptors in total and converting the chemical reactants into one-dimensional data;
(1.2) after removing the invalid reactions and the few reactions with missing values, the remaining 3955 reactions are used as experimental data. The feature descriptors in the one-dimensional data corresponding to these reactions are divided into a training set (70%) and a test set (30%), and each is matched with the corresponding yield or the category to which the yield belongs;
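The 70%/30% division in step (1.2) can be sketched as below. The text does not state how reactions are assigned to the two sets, so the seeded random shuffle, the function name, and the use of integer indices in place of real reactions are all assumptions.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


reactions = list(range(3955))        # placeholder indices for the 3955 reactions
train, test = split_train_test(reactions)
```

With 3955 reactions this gives 2768 training and 1187 test reactions.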
Step 2), the construction of the deep forest model, as shown in FIG. 2, comprises the following steps:
(2.1) reading the training set prepared in step (1.2), matching it with the corresponding yields and yield categories, and selecting a deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction on the yield of the coupling reaction with the deep forest algorithm: the 3955 reactions of step (1.2) are used as experimental data and fed into the deep forest model to train on the feature descriptors; each time a cascade layer is added, the mean square error of the whole cascade is estimated on the validation set, and training stops if there is no significant gain or the set maximum number of layers is reached; the best prediction effect is reached through user-adjusted parameters, and regression prediction of the yield is then carried out. The specific calculation process is as follows:
the deep forest model has $K$ cascade layers, each consisting of $L$ forests, and the training samples input to the $k$-th cascade layer are $(x_k, y)$, $k = 0, 1, \dots, K$, where $x_k$ denotes the feature vector of a training sample input to the $k$-th cascade layer and $y$ denotes the true yield corresponding to each feature vector. The input feature $x_k$ received by the $k$-th cascade layer is the concatenation of the original feature $x_0$ with the output of the $(k-1)$-th cascade layer, so the combined feature is expressed as:
$x_k = (f_k(x_{k-1}), x_0)$,
where $f_k(x)$ denotes the real-valued output obtained for feature $x$ from training at the $k$-th cascade level.
The final predicted value is the average of the predicted values of all forests in the last cascade layer.
(2.3) performing classification prediction on the category to which the yield of the coupling reaction belongs with the deep forest algorithm. The 3955 reactions of step (1.2) are used as experimental data, and the 1/4 and 3/4 quantiles of the yields are taken as thresholds: yields at or below the 1/4 quantile are low yields, yields between the 1/4 and 3/4 quantiles are medium yields, and yields at or above the 3/4 quantile are high yields. The reaction data are fed into the deep forest model to train on the feature descriptors; each time a cascade layer is added, the prediction accuracy of the whole cascade is estimated on the validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically. The class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum class probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting parameters, and the model is saved. The specific calculation process is as follows:
the deep forest model has $M$ cascade layers, each consisting of $N$ forests, and the training samples input to the $m$-th cascade layer are $(x_m, C)$, $m = 0, 1, \dots, M$, where $x_m$ denotes the feature vector of a training sample input to the $m$-th cascade layer and $C$ denotes the class corresponding to each feature vector. The input feature $x_m$ received by the $m$-th cascade layer is the concatenation of the original feature $x_0$ with the class probability vector of the $(m-1)$-th cascade layer, so the combined feature is expressed as:
$x_m = (p_m(x_{m-1}), x_0)$,
where $p_m(x)$ denotes the class probability vector obtained for feature $x$ from training at the $m$-th cascade level.
The final class probability vector is the average of the predicted probabilities of all forests in the last cascade layer.
If the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \dots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class.
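The quantile-based yield categories of step (2.3) and the final class decision can be sketched as follows. The exact quantile method is an assumption; this example uses the standard-library default. The handling of yields exactly equal to a threshold is also an assumption: here a yield at the 1/4 quantile counts as low and a yield at the 3/4 quantile counts as high. The function names and example numbers are hypothetical.

```python
from statistics import quantiles

CLASSES = ("low", "medium", "high")


def yield_classes(yields):
    """Label each yield using the 1/4 and 3/4 quantiles as thresholds."""
    q1, _, q3 = quantiles(yields, n=4)

    def label(value):
        if value <= q1:
            return "low"
        if value >= q3:
            return "high"
        return "medium"

    return [label(v) for v in yields], (q1, q3)


def predict_class(forest_probs):
    """Average the class probability vectors of all forests, then take the arg max."""
    n = len(forest_probs)
    avg = [sum(col) / n for col in zip(*forest_probs)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return CLASSES[best], avg


labels, (q1, q3) = yield_classes(list(range(1, 12)))      # toy yields 1..11
predicted, avg = predict_class([[0.2, 0.5, 0.3],           # forest 1
                                [0.1, 0.3, 0.6]])          # forest 2
```

Although the first forest favors "medium", the averaged vector favors "high", which illustrates why the decision is taken after averaging and not per forest.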
(2.4) selecting some of the additives (the 18th, 19th, 21st, and 22nd in FIG. 1) for out-of-sample prediction with the trained model; if the out-of-sample prediction is effective, the validity of the model is verified, showing that the method can effectively predict the yield of the coupling reaction;
(2.5) the user can adjust the parameters according to the prediction effect and their own needs and check the prediction results of steps 2) and 3); if the results are not satisfactory, the user can adjust the type and number of forests in the deep forest, the number of decision trees in each forest, and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
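One reading of the out-of-sample check in step (2.4) is that every reaction containing one of the selected additives is withheld from training and used only for evaluation. The sketch below illustrates that reading; the record layout and function name are hypothetical.

```python
HELD_OUT_ADDITIVES = {18, 19, 21, 22}   # additive numbers named in step (2.4)


def split_by_additive(reactions, held_out=HELD_OUT_ADDITIVES):
    """Separate reactions whose additive is held out from all the others."""
    in_sample = [r for r in reactions if r["additive"] not in held_out]
    out_of_sample = [r for r in reactions if r["additive"] in held_out]
    return in_sample, out_of_sample


# Toy records: one reaction per additive, with a made-up yield.
reactions = [{"additive": a, "yield": float(a)} for a in range(1, 24)]
in_sample, out_of_sample = split_by_additive(reactions)
```

Splitting by additive, not by random reaction, tests whether the model generalizes to additives it has never seen.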
And 3) an intelligent regression and classification prediction module for the yield adopts a deep forest algorithm to train a feature descriptor, automatically learns useful data and features of the useful data by a machine by means of the algorithm, and achieves the optimal prediction effect by self-adjusting parameters so as to perform regression prediction on the value or the category of the yield. The concrete implementation steps comprise:
(3.1) the intelligent regression prediction of the yield is to introduce the training set and the corresponding yield into the cascade layers for feature learning, when the model stops training, the predicted values of each random forest in the last layer of cascade are averaged to obtain the final prediction result, and simultaneously, the coefficient (R-Square, R) of the regression prediction is output2) And Root Mean Square Error (RMSE), the prediction effect of the model is evaluated;
(3.2) intelligent classification prediction of yield, namely, introducing a training set and corresponding yield categories into a cascade layer for feature learning, averaging class probability vectors output by all random forests in the last layer when a model stops training, wherein the category to which the maximum class probability belongs is the final prediction category, and meanwhile, outputting the classification accuracy and kappa statistic of classification prediction to evaluate the prediction effect of the model;
(3.3) The deep forest algorithm calculates the importance of the one-dimensional feature descriptors and ranks them accordingly. The importance ranking identifies the descriptors with a significant influence on the reaction yield, supports mining and analysis of the underlying patterns, and provides reliable decision information for users carrying out organic chemical coupling reactions.
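The evaluation metrics named in steps (3.1) and (3.2) can be computed as in the sketch below. It uses only the Python standard library and assumes that the "kappa statistic" refers to Cohen's kappa, which the text does not state explicitly; the function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error of the predicted yields."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(c_true: Sequence[Hashable], c_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(c_true, c_pred)) / len(c_true)


def cohen_kappa(c_true: Sequence[Hashable], c_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa (assumed meaning of 'kappa statistic'): agreement beyond chance."""
    n = len(c_true)
    p_observed = accuracy(c_true, c_pred)
    true_counts = Counter(c_true)
    pred_counts = Counter(c_pred)
    # Chance agreement from the marginal class frequencies of both label lists.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it complements plain accuracy when the yield classes are unbalanced.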
The deep forest model adaptively adjusts its complexity during training and is robust to its hyper-parameters, so good results can be obtained even with default parameters. The user can also adjust the parameters and check the results on the test set; once the results are satisfactory, parameter adjustment stops and the prediction results are output.
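One way to picture this adaptive complexity is a cascade that grows one layer at a time and stops when the validation error no longer improves. The sketch below is illustrative only: it substitutes simple k-nearest-neighbour regressors for random forests so that it runs with the standard library alone, uses out-of-fold predictions as a stand-in for the per-layer cross-validation, and the names `knn_learner`, `out_of_fold`, `grow_cascade`, `cascade_predict` and the tolerance `tol` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Predictor = Callable[[Vector], float]
Learner = Callable[[List[Vector], List[float]], Predictor]


def knn_learner(k: int) -> Learner:
    """Stand-in for one forest: average the targets of the k nearest training points."""
    def fit(X: List[Vector], y: List[float]) -> Predictor:
        def predict(x: Vector) -> float:
            order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))
            nearest = order[:k]
            return sum(y[i] for i in nearest) / len(nearest)
        return predict
    return fit


def mse(y_true: List[float], y_pred: List[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def out_of_fold(learner: Learner, X: List[Vector], y: List[float], folds: int = 3) -> List[float]:
    """Predict each training sample with a model that did not see that sample."""
    preds = [0.0] * len(X)
    for f in range(folds):
        kept = [i for i in range(len(X)) if i % folds != f]
        model = learner([X[i] for i in kept], [y[i] for i in kept])
        for i in range(f, len(X), folds):
            preds[i] = model(X[i])
    return preds


def grow_cascade(X_tr, y_tr, X_val, y_val, learners: List[Learner],
                 max_layers: int = 5, tol: float = 1e-3) -> Tuple[list, float]:
    """Add layers until validation MSE stops improving or max_layers is reached."""
    layers, best = [], float("inf")
    cur_tr, cur_val = X_tr, X_val
    for _ in range(max_layers):
        models = [learner(cur_tr, y_tr) for learner in learners]
        val_out = [[m(x) for m in models] for x in cur_val]
        score = mse(y_val, [sum(o) / len(o) for o in val_out])
        if best - score < tol:
            break  # no significant gain: discard this layer and stop growing
        best = score
        layers.append(models)
        # Next-layer input: this layer's outputs concatenated with the original features.
        tr_out = zip(*[out_of_fold(learner, cur_tr, y_tr) for learner in learners])
        cur_tr = [list(o) + x0 for o, x0 in zip(tr_out, X_tr)]
        cur_val = [o + x0 for o, x0 in zip(val_out, X_val)]
    return layers, best


def cascade_predict(layers: list, x0: Vector) -> float:
    """Pass one sample through the cascade and average the last layer's outputs."""
    x, out = x0, []
    for models in layers:
        out = [m(x) for m in models]
        x = out + x0
    return sum(out) / len(out)
```

Calling `grow_cascade(X_tr, y_tr, X_val, y_val, [knn_learner(1), knn_learner(3)])` returns the accepted layers and the best validation error; the number of layers is therefore chosen by the data rather than fixed in advance.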
Compared with traditional machine learning algorithms, the deep forest algorithm gives more accurate predictions: it combines the feature-learning idea of deep learning, letting the machine automatically learn useful features from the data, with ensemble learning, which increases model complexity and thereby improves prediction accuracy. Compared with general deep learning algorithms, the deep forest has fewer hyper-parameters and is robust to their settings, so good predictions can be obtained even with default parameters. Unlike general deep learning algorithms, the deep forest also performs well on small samples. Cross-validation is used when each cascade layer is generated, which effectively mitigates overfitting. The deep forest can be computed in parallel, so the time needed to run it on a single machine is comparable to that of a GPU-accelerated deep neural network, which improves the efficiency of the algorithm.
Simulation experiment:
The system of the present invention is further illustrated by a simulation experiment, taking the Buchwald-Hartwig coupling reaction as an example (the reaction scheme is shown in FIG. 1). First, the descriptors of each reaction component are calculated and extracted with the Spartan software (a paid commercial package). Each set of reaction components yields 120 feature descriptors, comprising 64 atomic descriptors, 28 molecular descriptors and 28 vibrational descriptors.
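The conversion of the three descriptor groups into a single one-dimensional vector can be sketched as follows. The group sizes come from the text above; the dictionary layout, the group keys and the function name `flatten_descriptors` are assumptions made for illustration, and the export format of the chemistry software is not specified here.

```python
from typing import Dict, List, Sequence

# Group sizes stated in the text: 64 + 28 + 28 = 120 descriptors per reaction.
EXPECTED_SIZES: Dict[str, int] = {"atomic": 64, "molecular": 28, "vibrational": 28}


def flatten_descriptors(groups: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the descriptor groups in a fixed order into one feature vector."""
    vector: List[float] = []
    for name, size in EXPECTED_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} descriptors, got {len(values)}")
        vector.extend(float(v) for v in values)
    return vector
```

Keeping the group order fixed matters because the later importance ranking refers to descriptors by their position in this vector.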
The feature descriptors and the corresponding yields are fed into the model for regression prediction; the simulation results are shown in Table 1.
TABLE 1 comparison of regression predictions
The evaluation compares the prediction accuracy of the deep forest with that of several machine learning algorithms: linear regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), neural network (NN) and random forest (RF). The experimental results show that the R² of the deep forest is greater than that of the other five algorithms, indicating that the deep forest has the best goodness of fit among the six algorithms, and that the RMSE of the deep forest is 6.8, lower than that of the other five algorithms, indicating that its regression error is the smallest. In conclusion, the regression predictions of the deep forest are superior to those of general machine learning algorithms, and the reaction yield can be predicted with high accuracy.
Again taking the Buchwald-Hartwig coupling reaction as an example (the reaction scheme is shown in FIG. 1), the feature descriptors and the corresponding yield categories are fed into the model for classification prediction; the simulation results are shown in Table 2.
TABLE 2 comparison of classified predictions
As shown in Table 2, the evaluation compares the deep forest with several machine learning algorithms: logistic regression, k-nearest neighbors, support vector machine, neural network and random forest. The experimental results show that the classification accuracy of the deep forest is 88.37%, the highest of the six algorithms. The kappa statistic of the deep forest is 0.813, which is also higher than that of the other five algorithms and exceeds 0.8, indicating that, from a statistical viewpoint, the classes predicted by the deep forest agree almost perfectly with the true classes. In conclusion, the classification predictions of the deep forest are superior to those of general machine learning algorithms, and the class of the coupling reaction yield can be predicted with high accuracy.
Claims (6)
1. A method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest, characterized by comprising the following steps: 1) calculating feature descriptors; 2) building a model; 3) intelligent regression and classification prediction of yield;
calculating feature descriptors: the feature descriptors of each coupling reaction component are calculated with chemical software and converted into one-dimensional data for subsequent model training;
building a model: a deep forest model is built to train on the feature descriptors, and the best prediction effect is achieved by self-adjusting parameters;
intelligent regression and classification prediction of yield: the yield is intelligently predicted with the trained deep forest model and the results are analyzed, including yield prediction analysis and importance analysis of the feature descriptors.
2. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, wherein step 1), calculating the feature descriptors, specifically comprises:
(1.1) importing the chemical reactants and reagents into chemical software, wherein the software automatically calculates the feature descriptors of each coupling reaction component and converts the chemical reactants into one-dimensional data;
(1.2) dividing the feature descriptors in the one-dimensional data into a training set and a test set, and matching each with the corresponding yield or the category to which the yield belongs.
3. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, wherein step 2), building the model, comprises the following steps:
(2.1) reading and preprocessing the training set, and selecting the deep forest model for regression prediction or classification prediction as required;
(2.2) performing regression prediction of the coupling reaction yield with the deep forest algorithm: the training set is fed into the cascade layers for feature learning; the prediction obtained by each random forest in a cascade layer is concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the mean square error of the whole cascade is estimated on the validation set, and training stops if there is no significant gain or the maximum number of layers is reached, so that the number of cascade layers is determined automatically; the predicted values of all random forests in the last layer are averaged to obtain the final predicted value; the prediction result is then output, the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.3) performing classification prediction of the category of the coupling reaction yield with the deep forest algorithm: the training set is fed into the cascade layers for feature learning; the class probability vector obtained by each random forest in a cascade layer is concatenated with the original features and used as the input of the next cascade layer, and training continues in this way; each time a layer is added, the prediction accuracy of the whole cascade is estimated on the validation set, and training stops if there is no significant gain or the set maximum number of layers is reached, so that the number of cascade layers is determined automatically; the class probability vectors output by all random forests in the last layer are averaged, and the class with the maximum probability is the final predicted class; the prediction result is then output, the best prediction result is selected by adjusting the parameters, and the model is saved;
(2.4) performing out-of-sample prediction with the trained model, wherein the validity of the model is verified if the out-of-sample predictions are accurate;
(2.5) the user can adjust the parameters according to the prediction performance and the user's own requirements; if the results are unsatisfactory, the user can adjust the types and number of forests in the deep forest, the number of decision trees in each forest and the maximum depth of the deep forest, and return to step (2.3) until satisfied.
4. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 1, wherein step 3), intelligent regression and classification prediction of yield, specifically comprises:
(3.1) intelligent regression prediction of yield: the training set and the corresponding yields are fed into the cascade layers for feature learning, and when the model stops training, the predicted values of the random forests in the last cascade layer are averaged to obtain the final prediction;
(3.2) intelligent classification prediction of yield: the training set and the corresponding yield categories are fed into the cascade layers for feature learning, and when the model stops training, the class probability vectors output by all random forests in the last layer are averaged, the class with the maximum probability being the final predicted class;
(3.3) the deep forest algorithm calculates the importance ranking of the feature descriptors, so that the descriptors with a significant influence on the reaction yield are identified and reliable decision information is provided for users carrying out organic chemical coupling reactions.
5. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 3, wherein the specific calculation process of step (2.2) is as follows:
the deep forest model has K cascade layers, each consisting of L forests, and the training samples input to the k-th cascade layer are $(x_k, y)$, $k = 0, 1, \dots, K$; where $x_k$ denotes the feature vector of a training sample input to the k-th cascade layer and $y$ denotes the true yield corresponding to each feature vector; the input feature $x_k$ received by the k-th cascade layer is the concatenation of the original feature $x_0$ with the output of the (k-1)-th cascade layer, so the combined feature is expressed as:
$$x_k = (f_k(x_{k-1}), x_0),$$
where $f_k(x)$ denotes the real-valued output obtained by passing the feature $x$ through the k-th cascade layer;
the final predicted value is the average of the predicted values of all forests in the last cascade layer:
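The averaging formula does not appear in this text. A form consistent with the definitions above, writing $f_K^{(l)}$ (notation assumed here, not taken from the claim) for the prediction of the $l$-th forest in the last cascade layer and $\hat{y}$ for the final predicted yield, would be:

$$\hat{y} = \frac{1}{L} \sum_{l=1}^{L} f_K^{(l)}(x_{K-1})$$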
6. The method for predicting and analyzing the yield of organic chemical coupling reactions based on deep forest according to claim 3, wherein the specific calculation process of step (2.3) is as follows:
the deep forest model has M cascade layers, each consisting of N forests, and the training samples input to the m-th cascade layer are $(x_m, C)$, $m = 0, 1, \dots, M$; where $x_m$ denotes the feature vector of a training sample input to the m-th cascade layer and $C$ denotes the category corresponding to each feature vector; the input feature $x_m$ received by the m-th cascade layer is the concatenation of the original feature $x_0$ with the class probability vector of the (m-1)-th cascade layer, so the combined feature is expressed as:
$$x_m = (p_m(x_{m-1}), x_0),$$
where $p_m(x)$ denotes the class probability vector obtained by passing the feature $x$ through the m-th cascade layer;
the final class probability vector is the average of the class probabilities predicted by all forests in the last cascade layer:
if the training samples have $c$ classes in total, then $p(x) = (p_1(x), p_2(x), \dots, p_c(x))$, and the class corresponding to the maximum probability in the class probability vector is the predicted class:
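The two formulas referred to in this claim do not appear in this text. Forms consistent with the definitions above, writing $p_M^{(n)}$ (notation assumed here, not taken from the claim) for the class probability vector output by the $n$-th forest in the last cascade layer and $\hat{C}$ for the predicted class, would be:

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} p_M^{(n)}(x_{M-1}), \qquad \hat{C} = \arg\max_{i \in \{1, \dots, c\}} p_i(x)$$

Here $p_i(x)$ is the $i$-th component of the averaged vector $p(x)$, which should not be confused with the layer output $p_m(x)$ defined earlier in the claim.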
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2021106387357 | 2021-06-08 | ||
CN202110638735 | 2021-06-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113380345A (en) | 2021-09-10 |
Family
ID=77581058
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110761921.XA Pending CN113380345A (en) | 2021-06-08 | 2021-07-06 | Organic chemical coupling reaction yield prediction and analysis method based on deep forest |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113380345A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110728391A (en) * | 2018-07-17 | 2020-01-24 | 广西大学 | Depth regression forest short-term load prediction method based on expandable information |
CN110516706A (en) * | 2019-07-19 | 2019-11-29 | 中国人寿保险股份有限公司 | A kind of improved depth forest method |
Non-Patent Citations (2)
Title |
---|
DEREK T. AHNEMAN等: "Predicting reaction performance in C–N cross-coupling using machine learning", 《SCIENCE》 * |
ZHI-HUA ZHOU等: "Deep Forest: Towards an Alternative to Deep Neural Networks", 《ARXIV》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113990405A (en) * | 2021-10-19 | 2022-01-28 | 上海药明康德新药开发有限公司 | Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent |
CN113990405B (en) * | 2021-10-19 | 2024-05-31 | 上海药明康德新药开发有限公司 | Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent |
CN115577858A (en) * | 2022-11-21 | 2023-01-06 | 山东能源数智云科技有限公司 | Block chain-based carbon emission prediction method and device and electronic equipment |
CN116003260A (en) * | 2023-03-27 | 2023-04-25 | 广州国家实验室 | Method for preparing 1-naphthylamine compound from urea derivative and prediction model thereof |
CN118486387A (en) * | 2024-07-09 | 2024-08-13 | 江苏隆昌化工有限公司 | Continuous acylation feedback regulation system based on artificial intelligence for 2, 4-dichloroacetophenone production |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |