CN113517033A - XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment - Google Patents
- Publication number
- CN113517033A; CN202110535993.2A
- Authority
- CN
- China
- Prior art keywords
- dimensional
- prediction
- xgboost
- descriptor
- reaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/10—Analysis or design of chemical reactions, syntheses or processes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses an XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment. The method comprises three parts: data acquisition of three-dimensional descriptors, intelligent prediction, and result analysis. In data acquisition, the drawn two-dimensional structure diagram of each relevant chemical structure is converted into a three-dimensional structure diagram, and the descriptors of the three-dimensional structure are then calculated with suitable software. In intelligent prediction, the gradient boosting tree model XGBoost is trained on the resulting three-dimensional descriptors and used for prediction; a grid search is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results. In result analysis, the output of the intelligent prediction module is analyzed; an importance ranking of the descriptors can also be calculated, providing the user with more reliable decision information. The invention can assist chemists in making reasonable analyses and predictions and greatly accelerate chemical research and development.
Description
Technical Field
The invention belongs to the field of organic synthesis based on pattern recognition and artificial intelligence, and in particular relates to an XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment.
Background
Coupling reactions are very important in organic synthesis, and their products are widely used in medicines, pesticides, natural products and advanced functional materials. By reaction type, coupling reactions are divided into cross-coupling reactions (two different fragments are combined into one molecule) and self-coupling reactions (two identical fragments are combined into one molecule). Palladium-catalyzed cross-coupling reactions are a broad class of coupling reactions in which a palladium compound serves as the catalyst (in most cases a homogeneous catalyst).
To produce complex but very important organic materials, carbon atoms must be bonded together by chemical reaction. However, the chemical bonds between carbon atoms and their neighboring atoms in organic molecules are often very stable and do not readily react with other molecules. Earlier methods could make the carbon atoms more reactive, but overly reactive carbon atoms generate large amounts of by-products; using palladium as a catalyst solves this problem. The palladium atom acts like an intermediary, drawing different carbon atoms to its side so that they come very close to each other and combine easily, that is, couple. Such a reaction does not require the carbon atoms to be activated to a highly reactive state and gives fewer by-products, so it is more precise and efficient. As early as 1972, Richard F. Heck discovered that palladium as a catalyst can join carbon atoms under milder conditions, and in 1977 and 1979 respectively, Ei-ichi Negishi and Akira Suzuki developed further palladium-catalyzed C-C cross-coupling methods, expanding the range of substrates and products for this type of reaction. These methods allow stable carbon atoms to be coupled easily and efficiently to synthesize structurally more complex molecules. The advent of these cross-coupling methods brought unprecedented improvements in chemists' ability to manipulate atoms and molecules.
For developing "palladium-catalyzed cross-coupling in organic synthesis", the American scientist Richard F. Heck and the Japanese scientists Ei-ichi Negishi and Akira Suzuki were awarded the 2010 Nobel Prize in Chemistry. With these methods, many substances that were previously difficult or even impossible to synthesize can now be made easily. The methods have been widely applied in scientific research and industrial production in fields such as pharmaceuticals, the electronics industry and advanced materials.
Palladium catalysis can achieve not only C-C coupling reactions but also carbon-heteroatom coupling reactions. C-N bond formation is an important area of modern organic synthesis. Through the formation of C-N bonds, amines and their derivatives, nitrogen-containing heterocycles and related compounds can be prepared, many of which are biologically and pharmaceutically active compounds or important intermediates. Arylamines play an important role in fields such as medicines and functional materials, and the Buchwald-Hartwig amination reaction is an efficient and general method for synthesizing substituted arylamines and one of the research hotspots in palladium-catalyzed C-N bond construction. However, applying this reaction to complex drug-like molecules remains challenging. First, substrates with five-membered heterocycles that contain heteroatom-heteroatom bonds (such as isoxazoles) perform poorly; these heterocycles have drug-like characteristics but are under-represented among successful drug candidates. Second, Pd is a noble metal, so it is expensive and has a certain toxicity. Third, there are no mature rules for choosing the additives, substrates and ligands that give a higher yield and a more efficient reaction for a given set of reactants, so traditional manual experiments rely on repeated trial and error, which is inefficient, costly and difficult to carry out. How to select reaction conditions at lower cost and with higher efficiency so as to obtain a higher yield is therefore a major concern for researchers. For this reason, researchers have considered using machine learning to predict the performance of the Buchwald-Hartwig reaction in the presence of isoxazoles.
In recent years, the development of machine learning algorithms has provided a "shortcut" for finding suitable reaction substrates and reaction conditions for the Buchwald-Hartwig amination reaction. Machine learning has become a component of scientific research in many disciplines, has brought new opportunities to organic chemistry, and has been used to predict catalyst activation performance, chemical reaction performance, compound properties and the like. Information in a chemical system is screened or encoded into a particular representation of chemical information, namely a descriptor, so that research in chemistry can be turned into a data processing task and the dependence on personnel is reduced to a certain extent. Machine learning methods can mine correlations in the massive experimental data generated by chemical experiments, helping chemists make reasonable analyses and predictions and greatly accelerating chemical research and development.
Disclosure of Invention
To address the problems of the prior art and the shortage of available data, the invention aims to provide an XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment. By incorporating three-dimensional chemical structure information, the method predicts and analyzes chemical reaction yield automatically and efficiently, which facilitates follow-up work by researchers; the overall model has a short training time, high accuracy and good robustness.
In order to achieve the purpose, the invention adopts the technical scheme that:
The XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment comprises data acquisition of three-dimensional descriptors, intelligent prediction, and result analysis;
the data acquisition of the three-dimensional descriptors is realized by converting the drawn two-dimensional structure diagram of each relevant chemical structure into a three-dimensional structure diagram and then calculating the descriptors of the three-dimensional structure with suitable software;
the intelligent prediction trains the gradient boosting tree model XGBoost on the obtained three-dimensional descriptors and uses it for prediction; a grid search is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results;
the result analysis analyzes the output of the intelligent prediction module, and comprises analysis of the yield prediction results, analysis of the reaction conditions corresponding to each yield, and analysis of the feature importance of the three-dimensional descriptors.
The data acquisition of the three-dimensional descriptor specifically comprises the following steps:
(1.1) enumerating the combinations of all variables in the Buchwald-Hartwig amination reaction in a fixed order, and drawing a two-dimensional structure diagram for each combination;
(1.2) with one reactant or reaction condition taken as the variable and the others held fixed (a small-sample setting), converting each drawn two-dimensional structure diagram into a three-dimensional structure diagram and saving it to a file;
(1.3) calculating and outputting, with suitable software, the three-dimensional structure descriptors of each file saved in step (1.2), so as to retain both the structural information and the planar information of the organic compound;
(1.4) collecting the calculated three-dimensional structure descriptors of all reaction combinations, dividing them into a training set and a test set, and matching them with the corresponding reaction yields.
The intelligent prediction of the three-dimensional descriptor specifically comprises the following steps:
(2.1) importing the training set and test set data obtained in step (1.4) into the XGBoost algorithm; using a grid search to enumerate combinations of candidate values for several XGBoost parameters; calculating the loss function value at each iteration of the XGBoost algorithm until it converges to a minimum or a set number of iterations is reached, and outputting the prediction result together with the corresponding parameter values; and finally selecting the parameters that give the best prediction result and saving the model;
(2.2) performing out-of-sample prediction to demonstrate the effectiveness of the model; the model selected by the invention can predict chemical reaction yield in a small-sample environment and identify the combination of reactants and reaction conditions that corresponds to the highest yield.
The result analysis of the three-dimensional descriptor specifically includes:
after in-sample and out-of-sample prediction, the XGBoost algorithm calculates an importance ranking of the three-dimensional descriptors; the main descriptors that influence the reaction yield are identified from this ranking, and the underlying regularities are mined and analyzed.
When the two-dimensional structure diagrams are drawn in step (1.1), the reaction variables of the Buchwald-Hartwig amination reaction (halide, ligand, substrate and additive) are enumerated in all combinations, each combination being ordered as halide, ligand, substrate, additive. Using Spartan software, the first case is selected, in which the halide is the variable and the additive, substrate and ligand are held fixed, and a two-dimensional structure diagram is drawn for each reaction combination in a fixed order.
In the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:
The obtained training set and test set data are imported into the XGBoost algorithm, whose objective function is:

$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

where $T$ is the number of leaf nodes and $w$ is the vector of leaf-node scores; $\gamma$ controls the number of leaf nodes; $\lambda$ keeps the leaf-node scores from becoming too large, so as to prevent overfitting; $\sum_k \Omega(f_k)$ represents the complexity of the $K$ trees; $L(\phi)$ is an expression in linear space; $i$ indexes the $i$-th sample and $k$ the $k$-th tree; $\hat{y}_i$ is the predicted value for the $i$-th sample $x_i$; and $y_i$ is the true value;
The XGBoost objective function consists of two parts. The first part measures how well the currently generated model fits the training data; in the second part, XGBoost explicitly includes the complexity of the model in the objective function as a regularization term;
Because the loss function in the XGBoost algorithm can be chosen freely as long as it is twice differentiable, the invention selects the squared loss function:

$$l(\hat{y}_i, y_i) = (y_i - \hat{y}_i)^2$$

Finally, the predicted value is returned when the loss function reaches its minimum, and the prediction performance of the model is judged by evaluation indices. Meanwhile, in the course of searching for the optimal split points, the XGBoost algorithm calculates importance scores for the three-dimensional descriptors, yielding an importance ranking of the descriptors that provides decision information for the user.
The invention has the following beneficial effects:
1. Current machine learning prediction of Buchwald-Hartwig amination yield still depends on two-dimensional chemical descriptors and large amounts of reaction data. To address this, the invention provides an intelligent prediction method based on the gradient boosting tree model XGBoost for a small-sample environment. The model converts each chemical structure into three-dimensional feature descriptors, retaining both the planar information and the structural information of the structure; for small-sample data, the prediction accuracy of the model is greatly improved; an importance ranking of the descriptors can also be calculated, providing the user with more reliable decision information. The invention can assist chemists in making reasonable analyses and predictions and greatly accelerate chemical research and development.
2. The grid search can search the candidate values of one parameter or of several parameters simultaneously, so the optimal parameters of the model are obtained conveniently.
3. Combining XGBoost with grid search predicts chemical reaction yield more accurately and efficiently.
4. The XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment is simple to operate, easy to implement and accurate in its analysis results, which makes it convenient for users and meets their needs.
Drawings
FIG. 1 is a schematic diagram of the reaction scheme and the associated variable structures of the chemical reactions in the examples of the present invention;
FIG. 2 is a two-dimensional structure diagram of a combination of reaction conditions;
FIG. 3 is the three-dimensional structure diagram corresponding to FIG. 2;
FIG. 4 is a flow chart of the analysis method of the present invention.
Labels in FIG. 1: the Buchwald-Hartwig amination reaction and the range of variables selected for yield prediction; Aryl halide: halide; Additive: additive; Base: substrate; L (Ligand): ligand.
Detailed Description
As shown in FIGS. 1-4, the invention provides an XGBoost-based method for intelligent prediction and analysis of chemical reaction yield in a small-sample environment, which comprises data acquisition of three-dimensional descriptors, intelligent prediction, and result analysis.
The data acquisition of the three-dimensional descriptors is realized by converting the drawn two-dimensional structure diagram of each relevant chemical structure into a three-dimensional structure diagram and then calculating the descriptors of the three-dimensional structure with suitable software. Both the planar information and the structural information of the chemical structure are retained. The specific implementation steps are:
(1.1) Enumerate the combinations of all variables in the Buchwald-Hartwig amination reaction (halide, ligand, substrate and additive) in a fixed order, and draw a two-dimensional structure diagram of each combination with Spartan software in preparation for conversion into a three-dimensional structure.
In this example, shown in FIG. 1, the reaction variables of the Buchwald-Hartwig amination reaction (halide, ligand, substrate and additive) are enumerated in all combinations when the two-dimensional structure diagrams are drawn, each combination being ordered as halide, ligand, substrate, additive. Using Spartan software, the first case is selected, in which the halide is the variable and the additive, substrate and ligand are held fixed, and a two-dimensional structure diagram is drawn for each reaction combination in a fixed order. There are 15 halides, 4 ligands, 3 substrates and 23 additives, giving 15 × 4 × 3 × 23 = 4140 combinations; after the invalid reactions are removed, 3960 valid reactions remain, each matched one-to-one with its reaction yield.
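As an illustration of the enumeration in step (1.1), the following minimal Python sketch lists every halide, ligand, substrate, additive combination and confirms the count of 4140. The component labels (`halide_1`, and so on) are hypothetical placeholders; the real study uses concrete molecules, and the later removal of invalid reactions is not modeled here.

```python
from itertools import product

# Hypothetical placeholder labels standing in for the real molecules.
halides = [f"halide_{i}" for i in range(1, 16)]        # 15 halides
ligands = [f"ligand_{i}" for i in range(1, 5)]         # 4 ligands
substrates = [f"substrate_{i}" for i in range(1, 4)]   # 3 substrates
additives = [f"additive_{i}" for i in range(1, 24)]    # 23 additives

# Every combination, ordered as halide, ligand, substrate, additive.
combinations = list(product(halides, ligands, substrates, additives))

print(len(combinations))   # 15 * 4 * 3 * 23 = 4140
print(combinations[0])
```

Fixing three of the four lists to a single entry reproduces the small-sample case described above, for example 15 combinations when only the halide varies.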
(1.2) With one reactant or reaction condition taken as the variable and the others held fixed (a small-sample setting), convert each drawn two-dimensional structure diagram (shown in FIG. 2) into a three-dimensional structure diagram (shown in FIG. 3) and save it as a file in sdf format.
In this example, with the halide as the variable and the additive, substrate and ligand held fixed as a small sample, the two-dimensional structure diagrams of the 15 drawn reaction combinations are converted into three-dimensional structures in Spartan.
(1.3) Process each sdf file in Python with the RDKit toolkit, and calculate and output its three-dimensional structure descriptors, thereby retaining both the structural information and the planar information of the organic compound.
The calculation and extraction of the three-dimensional structure descriptors of the organic compounds relies mainly on RDKit, a toolkit for cheminformatics and machine learning. RDKit is an important tool in chemistry, biology, pharmacy and materials science for turning molecular data into inputs for machine learning and deep learning models. Its Python interface covers reading and writing molecules, calculating molecular fingerprints and molecular descriptors, comparing compounds, compound similarity search, scaffold analysis and substructure search, chemical reaction processing, and other processing methods.
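The patent computes its descriptors with RDKit. Purely to illustrate what a three-dimensional descriptor captures, the sketch below computes one well-known example, the mass-weighted radius of gyration, with the Python standard library only. The input format (a list of `(mass, x, y, z)` tuples) and the coordinates are assumptions made for this sketch, and this is not claimed to be one of the descriptors the patent actually uses.

```python
import math

def radius_of_gyration(atoms):
    """Mass-weighted radius of gyration of a 3D conformer.

    `atoms` is a list of (mass, x, y, z) tuples. This input format is an
    assumption for the sketch, not something the patent specifies.
    """
    total_mass = sum(m for m, _, _, _ in atoms)
    # Centre of mass of the conformer.
    cx = sum(m * x for m, x, _, _ in atoms) / total_mass
    cy = sum(m * y for m, _, y, _ in atoms) / total_mass
    cz = sum(m * z for m, _, _, z in atoms) / total_mass
    # Mass-weighted mean squared distance from the centre of mass.
    second_moment = sum(
        m * ((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
        for m, x, y, z in atoms
    )
    return math.sqrt(second_moment / total_mass)

# Two hypothetical geometries of the same three-carbon chain.
C = 12.011
linear = [(C, -1.5, 0.0, 0.0), (C, 0.0, 0.0, 0.0), (C, 1.5, 0.0, 0.0)]
bent = [(C, -1.0, 0.0, 0.0), (C, 0.0, 0.0, 1.0), (C, 1.0, 0.0, 0.0)]

print(radius_of_gyration(linear))
print(radius_of_gyration(bent))
```

The two geometries share the same connectivity, so a purely two-dimensional description cannot tell them apart, whereas the three-dimensional descriptor takes a different value for each.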
(1.4) Collect the calculated three-dimensional structure descriptors of all reaction combinations, divide them into a training set (70%) and a test set (30%), and match them with the corresponding reaction yields, so that the XGBoost algorithm can carry out in-sample and out-of-sample prediction and analysis.
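The 70%/30% division in step (1.4) can be sketched as follows. The patent does not say how samples are assigned to each set, so the random shuffle, the fixed seed and the helper name `split_train_test` are assumptions of this sketch; the descriptor vectors and yields are made-up values.

```python
import random

def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and test sets.

    Random shuffling with a fixed seed is an assumption of this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: (descriptor_vector, yield_percent) pairs for 20 reactions.
samples = [([float(i), float(i) / 2.0], 4.0 * i) for i in range(20)]

train, test = split_train_test(samples)
print(len(train), len(test))   # 14 training samples, 6 test samples
```

Each sample keeps its descriptor vector and its yield together, which preserves the one-to-one correspondence between descriptors and reaction yields.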
The intelligent prediction trains the gradient boosting tree model XGBoost on the obtained three-dimensional descriptors and uses it for prediction; a grid search is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results. It specifically comprises the following steps:
(2.1) Import the training set and test set data obtained in step (1.4) into the XGBoost algorithm; use a grid search to enumerate combinations of candidate values for several XGBoost parameters; calculate the loss function value at each iteration of the XGBoost algorithm until it converges to a minimum or a set number of iterations is reached, and output the prediction result together with the corresponding parameter values; finally, select the parameters that give the best prediction result and save the model.
In this embodiment the optimal parameters are obtained automatically by grid search. The XGBoost algorithm has many parameters, and a grid search can either search exhaustively over the candidate values of a single parameter or enumerate combinations of candidate values for several parameters to find the optimal setting, which is more efficient than manual tuning.
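The grid search described above can be sketched in a few lines of standard-library Python. The parameter names `max_depth`, `learning_rate` and `n_estimators` are real XGBoost parameters, but the candidate values are hypothetical, and `toy_validation_error` is only a stand-in for the step that would train an XGBoost model with the given parameters and measure its prediction error.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination; lower scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values for three XGBoost parameters.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 300],
}

def toy_validation_error(params):
    # Stand-in for "train a model with `params` and return its error".
    return (
        abs(params["max_depth"] - 5)
        + abs(params["learning_rate"] - 0.1)
        + abs(params["n_estimators"] - 300) / 100
    )

best_params, best_score = grid_search(param_grid, toy_validation_error)
print(best_params, best_score)
```

The grid above contains 3 × 3 × 2 = 18 combinations; the cost grows multiplicatively with each added parameter, which is why the candidate lists are usually kept short.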
(2.2) Perform out-of-sample prediction to demonstrate the effectiveness of the model. The model selected by the invention can predict chemical reaction yield in a small-sample environment and identify the combination of reactants and reaction conditions that corresponds to the highest yield.
The result analysis analyzes the output of the intelligent prediction module, and comprises analysis of the yield prediction results, analysis of the reaction conditions corresponding to each yield, and analysis of the feature importance of the three-dimensional descriptors. Specifically, after the out-of-sample prediction, the XGBoost algorithm calculates an importance ranking of the three-dimensional descriptors; the main descriptors that influence the reaction yield are identified from this ranking, and the underlying regularities are mined and analyzed.
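One simple way to obtain such a ranking is to count how often each descriptor is used as a split feature across all trees; XGBoost offers a count-based importance of this kind among several importance measures. The patent does not state which measure it uses, so the choice here is an assumption, and the descriptor names and tree contents below are hypothetical.

```python
from collections import Counter

def rank_by_split_count(trees):
    """Rank descriptors by how many split nodes use them, most used first."""
    counts = Counter(feature for tree in trees for feature in tree)
    return counts.most_common()

# Hypothetical model: each tree is listed as the descriptors used at its splits.
trees = [
    ["NPR1", "asphericity", "NPR1"],
    ["radius_of_gyration", "NPR1"],
    ["asphericity", "NPR1"],
]

ranking = rank_by_split_count(trees)
for descriptor, count in ranking:
    print(descriptor, count)
```

Descriptors near the top of the list are the ones the model relied on most often when partitioning the data, which makes them natural starting points for chemical interpretation.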
In the intelligent prediction of the three-dimensional descriptor, the specific calculation process of step (2.1) is as follows.
The obtained training set and test set data are imported into the XGBoost algorithm, whose objective function is:
$$L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
where $T$ is the number of leaf nodes and $w$ is the vector of leaf node scores; $\gamma$ controls the number of leaf nodes and $\lambda$ keeps the leaf node scores from becoming too large, both to prevent overfitting; $\sum_{k} \Omega(f_k)$ represents the complexity of the $K$ trees; $L(\phi)$ is an expression in linear space; $i$ indexes the $i$-th sample and $k$ the $k$-th tree; $\hat{y}_i$ is the predicted value for the $i$-th sample $x_i$; and $y_i$ is the true value.
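As a rough numerical illustration of this objective, the sketch below evaluates a training loss plus the complexity penalty for a small ensemble. It assumes the standard XGBoost regularizer $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ and a squared loss; the function name and the toy numbers are hypothetical.

```python
def regularized_objective(y_true, y_pred, trees, gamma, lam):
    """Return training loss plus complexity penalty.

    trees: one list of leaf scores per tree (assumed representation).
    Assumes squared loss and Omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2).
    """
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(
        gamma * len(leaf_scores) + 0.5 * lam * sum(w * w for w in leaf_scores)
        for leaf_scores in trees
    )
    return loss + penalty

# Two samples, two trees (with 2 leaves and 1 leaf respectively).
obj = regularized_objective(
    y_true=[1.0, 2.0],
    y_pred=[0.5, 2.5],
    trees=[[0.5, -0.5], [1.0]],
    gamma=1.0,
    lam=2.0,
)
```

Raising `gamma` penalizes every extra leaf, while raising `lam` penalizes large leaf scores; both push the model toward simpler trees.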
As the objective function shows, the XGBoost objective consists of two parts. The first part is the same as the objective of conventional GBDT and measures how well the currently generated model fits the training data. The second part is where XGBoost differs: it explicitly includes the complexity of the model in the objective as a regularization term, which itself has two parts.
The objective function therefore has many parameters, and manual tuning is laborious and time-consuming, so grid search is introduced as a convenient way to select the optimal parameters.
XGBoost is a tree ensemble model that sums the outputs of $K$ trees ($K$ is the number of trees) as the final predicted value, namely:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F$$
Assume a given sample set has $n$ samples with $m$ features, $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), where $x_i$ denotes the $i$-th sample, $y_i$ denotes the $i$-th label, and $\mathbb{R}$ is the set of real numbers. The space $F$ of regression trees (CART trees) is:
$$F = \{ f(x) = w_{q(x)} \} \qquad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T)$$
where $q$ represents the structure of a tree, which maps each sample to its leaf node; $T$ is the number of leaf nodes of the tree; each $f(x)$ corresponds to a tree structure $q$ and leaf node weights $w$; and $m$ is the feature dimension. The predicted value of XGBoost is the sum of the values of the corresponding leaf nodes of all trees.
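The additive prediction described here can be sketched as follows. Each tree is represented, purely for illustration, as a pair `(q, w)`: `q` maps a sample to a leaf index and `w` holds the leaf weights. The two stumps and their thresholds are hypothetical.

```python
def predict(trees, x):
    """Sum, over all trees, the weight of the leaf that sample x falls into.

    trees: list of (q, w) pairs, where q(x) returns a leaf index and
    w is the list of leaf weights for that tree (assumed representation).
    """
    return sum(w[q(x)] for q, w in trees)

# Two hypothetical one-split trees acting on a 2-feature sample.
trees = [
    (lambda x: 0 if x[0] < 1.0 else 1, [10.0, 20.0]),
    (lambda x: 0 if x[1] < 0.5 else 1, [-1.0, 3.0]),
]

# First tree: x[0] = 1.5 >= 1.0 -> leaf 1 (20.0).
# Second tree: x[1] = 0.2 < 0.5 -> leaf 0 (-1.0).
y_hat = predict(trees, [1.5, 0.2])
```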
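In GBDT, the first derivative of the loss function $L$ with respect to $f(x)$ is used to compute the pseudo-residuals on which $f_m(x)$ is fitted. XGBoost uses not only the first derivative but also the second derivative. At the $t$-th iteration the objective is:
$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Performing a second-order Taylor expansion on the above equation, where $g_i$ is the first derivative and $h_i$ is the second derivative of the loss with respect to $\hat{y}_i^{(t-1)}$:
$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
Dropping the constant term $l(y_i, \hat{y}_i^{(t-1)})$ and grouping all training samples by leaf node gives:
$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
Define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, where $I_j$ is the index set of the samples $x_i$ mapped to the $j$-th leaf node; $G_j$, the sum of the first partial derivatives of the samples in leaf node $j$, is a constant; and $H_j$, the sum of the second partial derivatives of the samples in leaf node $j$, is also a constant. Substituting these into the objective function yields:
$$Obj = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$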
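To make the quantities $g_i$, $h_i$, $G_j$ and $H_j$ concrete, the sketch below computes them for a squared loss. It assumes $l(y, \hat{y}) = (y - \hat{y})^2$, so that $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$; the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def grad_hess_squared_loss(y_true, y_pred):
    """First and second derivatives of (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_pred)]
    h = [2.0 for _ in y_true]
    return g, h

def leaf_sums(g, h, leaf_index):
    """Accumulate G_j and H_j over the samples assigned to each leaf j."""
    G, H = defaultdict(float), defaultdict(float)
    for gi, hi, j in zip(g, h, leaf_index):
        G[j] += gi
        H[j] += hi
    return dict(G), dict(H)

# Four samples, all currently predicted as 25; the first two fall in leaf 0.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [25.0, 25.0, 25.0, 25.0]
leaf_index = [0, 0, 1, 1]

g, h = grad_hess_squared_loss(y_true, y_pred)
G, H = leaf_sums(g, h, leaf_index)
```

Leaf 0 collects positive gradients (predictions too high) and leaf 1 negative ones, so the two leaves will receive corrections of opposite sign.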
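The optimal value is obtained by writing the objective as a quadratic function of one variable. For each leaf node $j$, the corresponding term of the objective function is:
$$f(w_j) = G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2$$
This is a univariate quadratic function of $w_j$, and since $(H_j + \lambda) > 0$, $f(w_j)$ attains its minimum at $w_j^* = -\dfrac{G_j}{H_j + \lambda}$, where the minimum value is:
$$f(w_j^*) = -\frac{1}{2} \frac{G_j^2}{H_j + \lambda}$$
The tree structure is best when the target value $Obj$ is smallest, and in that case $w_j^*$ is the optimal solution of the objective function. The simplified objective function is therefore:
$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$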
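The closed-form leaf weight and the resulting structure score can be checked numerically. The sketch below assumes the standard XGBoost results $w_j^* = -G_j / (H_j + \lambda)$ and $Obj^* = -\frac{1}{2} \sum_j G_j^2 / (H_j + \lambda) + \gamma T$; the function names and input values are illustrative only.

```python
def optimal_leaf_weight(G, H, lam):
    """Minimizer of G * w + 0.5 * (H + lam) * w**2, valid when H + lam > 0."""
    return -G / (H + lam)

def leaf_term(G, H, lam, w):
    """Per-leaf quadratic term of the objective, evaluated at weight w."""
    return G * w + 0.5 * (H + lam) * w * w

def structure_score(G_list, H_list, lam, gamma):
    """Objective value of a tree when every leaf takes its optimal weight."""
    T = len(G_list)
    fit = sum(G * G / (H + lam) for G, H in zip(G_list, H_list))
    return -0.5 * fit + gamma * T

w_star = optimal_leaf_weight(40.0, 4.0, lam=1.0)
score = structure_score([40.0, -40.0], [4.0, 4.0], lam=1.0, gamma=0.5)
```

A lower structure score means a better tree, which is what the split search uses to compare candidate structures.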
In a CART regression tree, the criterion for the best split point is the minimum mean squared error. XGBoost instead chooses the split point that maximizes the gain below, in which $\lambda$ and $\gamma$ come from the regularization term:
$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
where $I_L$ denotes the left leaf node, $I_R$ denotes the right leaf node, and $I$ is the index set of the samples in the corresponding leaf node; $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of $g_i$ and $h_i$ over $I_L$ and $I_R$ respectively. Because the objective (loss) function in the XGBoost algorithm can be chosen freely as long as it is twice differentiable, the invention, based on experiments, selects the squared loss function:
$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$
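A minimal sketch of the exact greedy split search on one feature follows. It assumes the standard XGBoost gain, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, and uses midpoints between neighboring feature values as thresholds; the function names and data are hypothetical.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Reduction in the objective obtained by splitting one leaf into two."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan sorted feature values and return (threshold, gain) of the best split."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        # Move sample i from the right side to the left side.
        GL += g[i]
        HL += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between identical feature values
        gain = split_gain(GL, HL, G_total - GL, H_total - HL, lam, gamma)
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[nxt]) / 2
    return best_threshold, best_gain

x = [1.0, 2.0, 3.0, 4.0]
g = [30.0, 10.0, -10.0, -30.0]
h = [2.0, 2.0, 2.0, 2.0]
threshold, gain = best_split(x, g, h, lam=1.0, gamma=0.0)
```

With a positive `gamma`, a split is only worthwhile when its gain stays above zero, which is how the regularization term limits the number of leaves.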
Finally, the predicted values obtained when the loss function reaches its minimum are returned, and the prediction performance of the model is judged by evaluation metrics. While searching for the optimal split points, the XGBoost algorithm also computes an importance score for each three-dimensional descriptor, which yields an importance ranking of the descriptors and provides decision information to the user.
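The text does not name the evaluation metrics, so the sketch below uses two common regression metrics, RMSE and the coefficient of determination R², as assumptions. It also shows how importance scores could be turned into a ranking; the descriptor names and scores are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted yields."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def rank_importance(scores):
    """Sort descriptors from most to least important."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
error = rmse(y_true, y_pred)
fit = r2(y_true, y_pred)

# Hypothetical descriptor names and importance scores.
ranking = rank_importance({"descriptor_a": 0.2, "descriptor_b": 0.7, "descriptor_c": 0.1})
```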
Compared with other boosting algorithms, the XGBoost algorithm gives more accurate predictions and runs more efficiently, for the following reasons. A regularization term is added to the objective function to prevent overfitting, and two further techniques serve the same purpose: shrinkage and column (feature) subsampling. XGBoost takes sparse training data into account and can specify a default branch direction for missing values or designated values, which greatly improves the efficiency and accuracy of the algorithm. When searching for the optimal split point of a feature, the traditional greedy method of enumerating every possible split point of every feature is too slow, so XGBoost implements an approximate algorithm: it enumerates several candidate split points by percentile and then, using the split formula, finds the best split point among those candidates. After the feature columns are sorted, they are stored in memory as blocks and reused across iterations. Although the iterations of a boosting algorithm must run serially, the feature columns can be processed in parallel. XGBoost also considers how to use the disk effectively when the data volume is large and memory is insufficient, mainly combining multithreading, data compression and sharding to improve the efficiency of the algorithm as much as possible.
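The percentile-based candidate selection mentioned above can be illustrated with a simplified sketch. It picks thresholds at evenly spaced, unweighted quantiles of one feature; XGBoost itself uses a weighted quantile sketch, so this is only an approximation of the idea, and the function name is hypothetical.

```python
def candidate_splits(values, n_candidates):
    """Return up to n_candidates thresholds at evenly spaced quantiles of values."""
    ordered = sorted(values)
    n = len(ordered)
    candidates = []
    for k in range(1, n_candidates + 1):
        # Index of the k-th of (n_candidates + 1) equal-sized quantile boundaries.
        idx = min(n - 1, (k * n) // (n_candidates + 1))
        # Skip repeated thresholds caused by duplicate feature values.
        if not candidates or ordered[idx] != candidates[-1]:
            candidates.append(ordered[idx])
    return candidates

feature_values = list(range(1, 101))  # 100 distinct values: 1, 2, ..., 100
splits = candidate_splits(feature_values, n_candidates=3)
```

Only these few candidates are then scored with the gain formula, instead of all 99 possible split positions.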
Simulation experiment:
The system of the invention is further demonstrated by a simulation experiment that takes the Buchwald-Hartwig amination reaction as an example (the chemical reaction formula is shown in FIG. 1); data from this reaction serve as the user-selected data imported for analysis.
Small-sample data: 15 samples were assembled by taking the halide as the variable and fixing the additive, substrate and ligand to the first of each kind. For prediction and analysis, a two-dimensional structure diagram of each reaction combination was drawn, converted into a three-dimensional structure and saved in sdf format; the data were then summarized and divided into a test set and a training set. As shown in the table above, better prediction results are obtained than with two-dimensional descriptors.
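The division of the 15 samples into a training set and a test set can be sketched as below. The split ratio and the random seed are not given in the text, so the values `test_fraction=0.2` and `seed=0` are assumptions chosen only for illustration.

```python
import random

def split_train_test(n_samples, test_fraction, seed):
    """Shuffle sample indices reproducibly and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test

# 15 samples as in the experiment; the 20% test share is a hypothetical choice.
train_idx, test_idx = split_train_test(15, test_fraction=0.2, seed=0)
```

With so few samples, the evaluation depends strongly on which reactions land in the test set, so fixing the seed keeps the comparison between descriptor types reproducible.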
Claims (6)
1. The XGboost-based chemical reaction yield intelligent prediction and analysis method in a small sample environment is characterized by comprising the following steps of: the method comprises the steps of data acquisition, intelligent prediction and result analysis of the three-dimensional descriptor;
the data acquisition of the three-dimensional descriptor is realized by converting a drawn two-dimensional structure diagram of a related chemical structure into a three-dimensional structure diagram and then calculating the descriptor of the three-dimensional structure by using related software;
the intelligent prediction of the three-dimensional descriptor trains a gradient boosting tree model, XGBoost, on the obtained three-dimensional descriptors and uses it for prediction; a grid search method is embedded in the algorithm, candidate values of a plurality of parameters are enumerated in all combinations, and the optimal parameters are selected according to the model prediction results;
analyzing the result of the three-dimensional descriptor, namely analyzing the output of the intelligent prediction module; the method comprises the steps of yield prediction result analysis, reaction condition analysis corresponding to yield and three-dimensional descriptor feature importance analysis.
2. The intelligent prediction and analysis method for chemical reaction yield according to claim 1, characterized in that: the data acquisition of the three-dimensional descriptor specifically comprises the following steps:
(1.1) arranging and combining all variables in Buchwald-Hartwig amination reaction according to a certain sequence, and drawing a two-dimensional structure diagram of each combination;
(1.2) under small-sample conditions in which one reactant or reaction condition is taken as the variable and the rest are held fixed, converting each drawn two-dimensional structure diagram combination into a three-dimensional structure diagram combination, and saving the file;
(1.3) calculating and outputting the three-dimensional structure descriptor of the file saved in the step (1.2) by using related software, so as to keep the structure information and the plane information of the organic compound;
and (1.4) summarizing the three-dimensional structure descriptors of all reaction combinations obtained through calculation, dividing the three-dimensional structure descriptors into a training set and a testing set, and corresponding the three-dimensional structure descriptors to corresponding reaction yields.
3. The intelligent predicting and analyzing method for chemical reaction yield according to claim 2, wherein: the intelligent prediction of the three-dimensional descriptor specifically comprises the following steps:
(2.1) importing the training set and test set data obtained in step (1.4) into the XGBoost algorithm; enumerating, by a grid search method, the combinations of candidate values of a plurality of parameters of the XGBoost algorithm; for each combination, computing the loss function value of every iteration of the XGBoost algorithm until the loss function value converges to a minimum or a set number of iterations is reached, and outputting the prediction result and the corresponding parameter values; and finally selecting the parameters corresponding to the best prediction result and saving the model;
(2.2) performing out-of-sample prediction to prove the effectiveness of the model; and the model selected by the invention can predict the chemical reaction yield in a small sample environment, and determine the combination of reactants and reaction conditions to provide the reaction combination corresponding to the highest yield.
4. The XGboost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment according to claim 3, wherein: the result analysis of the three-dimensional descriptor specifically includes:
after in-sample and out-of-sample prediction is carried out, computing an importance ranking of the three-dimensional descriptors with the XGBoost algorithm; and identifying, from the importance ranking, the main descriptors influencing the reaction yield, and mining and analyzing the underlying regularities.
5. The intelligent prediction and analysis method for chemical reaction yield according to claim 2, wherein: when the two-dimensional structure diagrams are drawn in step (1.1), the reaction variables of the Buchwald-Hartwig amination reaction (halide, ligand, substrate and additive) are arranged and combined, every reaction combination being ordered as halide, ligand, substrate, additive; and, using Spartan software, the halide is taken as the variable while the additive, substrate and ligand are each fixed to the first of their kind, and a two-dimensional structure diagram is drawn for each reaction combination in a fixed order.
6. The intelligent prediction and analysis method for chemical reaction yield according to claim 3, characterized in that: in the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:
importing the obtained training set and test set data into the XGBoost algorithm, wherein the objective function is:
$$L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
wherein $T$ represents the number of leaf nodes and $w$ represents the scores of the leaf nodes; $\gamma$ controls the number of leaf nodes; $\lambda$ keeps the scores of the leaf nodes from becoming too large, so as to prevent overfitting; $\sum_{k} \Omega(f_k)$ represents the complexity of the $K$ trees; $L(\phi)$ is an expression in linear space; $i$ is the $i$-th sample and $k$ is the $k$-th tree; $\hat{y}_i$ is the predicted value of the $i$-th sample $x_i$; and $y_i$ is the true value;
the XGBoost objective function consists of two parts, wherein the first part measures how well the currently generated model fits the training data, and in the other part XGBoost explicitly includes the complexity of the model in the objective function as the regularization term;
the objective (loss) function in the XGBoost algorithm can be chosen freely as long as it is twice differentiable, and the squared loss function is selected as the objective function of the XGBoost algorithm:
$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$
and finally, returning the predicted values obtained when the loss function reaches its minimum, and judging the prediction performance of the model by the evaluation metrics.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2021103081299 | 2021-03-23 | ||
CN202110308129 | 2021-03-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113517033A (en) | 2021-10-19 |
CN113517033B (en) | 2022-08-12 |
Family
ID=78064505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110535993.2A Active CN113517033B (en) | 2021-03-23 | 2021-05-17 | XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113517033B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019055499A1 (en) * | 2017-09-12 | 2019-03-21 | Massachusetts Institute Of Technology | Systems and methods for predicting chemical reactions |
US20190340316A1 (en) * | 2018-05-03 | 2019-11-07 | Lam Research Corporation | Predicting etch characteristics in thermal etching and atomic layer etching |
CN110491453A (en) * | 2018-04-27 | 2019-11-22 | Shanghai Jiao Tong University | A kind of yield prediction method of chemical reaction |
EP3712897A1 (en) * | 2019-03-22 | 2020-09-23 | Tata Consultancy Services Limited | Automated prediction of biological response of chemical compounds based on chemical information |
CN111863143A (en) * | 2020-07-31 | 2020-10-30 | China Petroleum & Chemical Corporation | Parameter estimation method and device for catalytic cracking kinetic model |
CN112272764A (en) * | 2018-01-30 | 2021-01-26 | SRI International | Computational generation of chemical synthetic routes and methods |
Non-Patent Citations (6)
Title |
---|
AKIRA YADA ET AL.: "Machine Learning Approach for Prediction of Reaction Yield with Simulated Catalyst Parameters", 《CHEMISTRY LETTERS》 * |
DAN ZHANG ET AL.: "iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins", 《COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE》 * |
DEREK T. AHNEMAN ET AL.: "Predicting reaction performance in C – N cross-coupling using machine learning", 《SCIENCE》 * |
TIANQI CHEN ET AL.: "XGBoost: A Scalable Tree Boosting System", 《ARXIV[CS.LG]》 * |
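FU ZUNYUN: "Small-molecule virtual screening and reaction yield prediction based on deep learning", 《China Excellent Doctoral and Master's Theses Full-text Database (Master's), Medicine and Health Sciences》 * |
LIU YIDI ET AL.: "Applications of machine learning in organic chemistry", 《Chinese Journal of Organic Chemistry》 * |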
Also Published As
Publication number | Publication date |
---|---|
CN113517033B (en) | 2022-08-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||