CN113517033A - XGBoost-based chemical reaction yield intelligent prediction and analysis method in small sample environment

Info

Publication number
CN113517033A
Authority
CN
China
Prior art keywords
dimensional
prediction
xgboost
descriptor
reaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110535993.2A
Other languages
Chinese (zh)
Other versions
CN113517033B (en)
Inventor
杨晓慧
彭李超
董晶
张普玉
张泽霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Publication of CN113517033A
Application granted
Publication of CN113517033B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment, comprising data acquisition, intelligent prediction, and result analysis of three-dimensional descriptors. In data acquisition, a drawn two-dimensional structure diagram of the relevant chemical structure is converted into a three-dimensional structure diagram, and the descriptors of the three-dimensional structure are then calculated with the relevant software. In intelligent prediction, the gradient boosting tree model XGBoost is trained on the obtained three-dimensional descriptors and used for prediction; a grid search method is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results. In result analysis, the output of the intelligent prediction module is analyzed; an importance ranking of the descriptors can also be calculated, providing more reliable decision information for the user. The invention can assist chemists in making reasonable analyses and predictions and can greatly accelerate chemical research and development.

Description

XGBoost-based chemical reaction yield intelligent prediction and analysis method in small sample environment
Technical Field
The invention belongs to the field of organic synthesis based on pattern recognition and artificial intelligence, and in particular relates to an XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment.
Background
Coupling reactions are very important in organic synthesis, and their products are widely used in medicines, pesticides, natural products, and even advanced functional materials. By reaction type, coupling reactions can be divided into cross-coupling reactions (two different fragments are combined into one molecule) and homocoupling (self-coupling) reactions (two identical fragments are combined into one molecule). Palladium-catalyzed cross-coupling reactions are a broad class of coupling reactions in which a palladium compound serves as the catalyst (in most cases a homogeneous catalyst).
To produce complex but very important organic materials, carbon atoms must be bonded together by chemical reaction. However, the bonds between carbon atoms and their neighboring atoms in organic molecules are often very stable and do not react readily with other molecules. Earlier methods could make carbon atoms more reactive, but overly reactive carbon atoms generate large amounts of by-products; using palladium as a catalyst solves this problem. The palladium atom acts like a matchmaker, drawing different carbon atoms to itself so that they come close enough to combine easily, that is, to couple. Such a reaction does not require activating the carbon atoms to a highly reactive state and gives fewer by-products, so it is more precise and efficient. Richard F. Heck found as early as 1972 that palladium catalysis can join carbon atoms under relatively mild conditions, and Ei-ichi Negishi and Akira Suzuki developed further palladium-catalyzed C-C cross-coupling methods in 1977 and 1979, respectively, expanding the range of substrates and products for this type of reaction. These methods allow stable carbon atoms to be coupled simply and efficiently to synthesize molecules of greater structural complexity. The advent of these cross-coupling methods brought an unprecedented improvement in chemists' ability to manipulate atoms and molecules.
The American scientist Richard F. Heck and the Japanese scientists Ei-ichi Negishi and Akira Suzuki were awarded the 2010 Nobel Prize in Chemistry for developing "palladium-catalyzed cross-coupling methods in organic synthesis". With these methods, many substances that were once difficult or even impossible to synthesize can now be made easily. The methods they invented have been widely applied in scientific research and industrial production in pharmaceuticals, the electronics industry, advanced materials, and other fields.
Palladium catalysis can achieve not only C-C coupling reactions but also carbon-heteroatom coupling reactions. C-N bond formation is an important area of modern organic synthesis. Forming C-N bonds makes it possible to prepare amines and their derivatives, nitrogen-containing heterocycles, and related compounds, many of which are biologically and pharmaceutically active compounds or important intermediates. Arylamines play an important role in medicines, functional materials, and other fields, and the Buchwald-Hartwig amination reaction is an efficient and general method for synthesizing substituted arylamines; it is one of the research hotspots in organic synthesis for constructing C-N bonds by palladium catalysis. However, applying this reaction to complex drug-like molecules remains challenging. First, substrates with five-membered heterocycles that contain heteroatom-heteroatom bonds (such as isoxazoles) perform poorly; these heterocycles have drug-like characteristics but are underrepresented among successful drug candidates. Second, Pd is a noble metal, so it is expensive and has some toxicity. Third, there is no mature rule for choosing the additives, substrates, and ligands that give a higher yield and a more efficient reaction for a given reaction substrate, so traditional manual experiments rely on repeated trial and error, which is inefficient, costly, and difficult to carry out. How to select reaction conditions efficiently and at low cost so as to obtain a higher yield is therefore a major concern for researchers. For this reason, researchers have considered using machine learning to predict the performance of the Buchwald-Hartwig reaction in the presence of isoxazoles.
In recent years, the development of machine learning algorithms has provided a "shortcut" for finding suitable reaction substrates and reaction conditions for the Buchwald-Hartwig amination reaction. Machine learning has become part of scientific research in many disciplines and has brought new opportunities to organic chemistry, enabling the prediction of catalyst activation performance, chemical reaction performance, compound properties, and the like. The information in a chemical system is screened or encoded into a form of chemical information called a descriptor, so that research in the chemical field can be converted into a data processing task and the dependence on personnel is reduced to a certain extent. Machine learning methods can mine the correlations in the large volumes of experimental data generated by chemical experiments, helping chemists make reasonable analyses and predictions and greatly accelerating chemical research and development.
Disclosure of Invention
In order to solve the problems of the prior art and of insufficient available data, the invention aims to provide an XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment. By incorporating three-dimensional chemical structure information, the method predicts and analyzes chemical reaction yield automatically and efficiently, which facilitates follow-up work by researchers in the field; the whole model has a short training time, high accuracy, and good robustness.
In order to achieve this purpose, the invention adopts the following technical scheme:
the XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment comprises data acquisition, intelligent prediction, and result analysis of three-dimensional descriptors;
in the data acquisition of the three-dimensional descriptors, a drawn two-dimensional structure diagram of the relevant chemical structure is converted into a three-dimensional structure diagram, and the descriptors of the three-dimensional structure are then calculated with the relevant software;
in the intelligent prediction of the three-dimensional descriptors, the gradient boosting tree model XGBoost is trained on the obtained three-dimensional descriptors and used for prediction; a grid search method is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results;
in the result analysis of the three-dimensional descriptors, the output of the intelligent prediction module is analyzed; this comprises analysis of the yield prediction results, analysis of the reaction conditions corresponding to each yield, and analysis of the feature importance of the three-dimensional descriptors.
The data acquisition of the three-dimensional descriptors specifically comprises the following steps:
(1.1) enumerating the combinations of all variables in the Buchwald-Hartwig amination reaction in a fixed order, and drawing a two-dimensional structure diagram of each combination;
(1.2) in a small-sample setting in which one reactant or reaction condition is taken as the variable and the others are held fixed, converting the drawn two-dimensional structure diagram of each combination into a three-dimensional structure diagram and saving it to a file;
(1.3) using the relevant software to calculate and output the three-dimensional structure descriptors of the files saved in step (1.2), thereby retaining both the three-dimensional structural information and the planar information of the organic compounds;
(1.4) collecting the calculated three-dimensional structure descriptors of all reaction combinations, dividing them into a training set and a test set, and pairing them with the corresponding reaction yields.
The intelligent prediction of the three-dimensional descriptors specifically comprises the following steps:
(2.1) importing the training set and test set data obtained in step (1.4) into the XGBoost algorithm; using a grid search method to enumerate combinations of candidate values for several XGBoost parameters; for each combination, calculating the loss function value at each iteration until it converges to a minimum or a set number of iterations is reached, and outputting the prediction result with the corresponding parameter values; and finally selecting the parameters that give the best prediction result and saving the model;
(2.2) performing out-of-sample prediction to demonstrate the effectiveness of the model; the model selected by the invention can predict chemical reaction yield in a small-sample environment and identify the combination of reactants and reaction conditions that corresponds to the highest yield.
The result analysis of the three-dimensional descriptors specifically includes:
after in-sample and out-of-sample prediction, calculating an importance ranking of the three-dimensional descriptors with the XGBoost algorithm; using this ranking to find the main descriptors that influence the reaction yield, and mining and analyzing the underlying rules.
In step (1.1), the two-dimensional structure diagrams are drawn for the permutations and combinations of the reaction variables of the Buchwald-Hartwig amination reaction (halide, ligand, substrate, and additive), and every reaction combination is ordered as halide, ligand, substrate, additive. Using Spartan software, the first case is selected, in which the halide is the variable and the additive, substrate, and ligand are held fixed, and the two-dimensional structure diagram of each reaction combination is drawn in that order.
In the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:
importing the obtained training set and test set data into the XGBoost algorithm, whose objective function is:
$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
wherein T represents the number of leaf nodes and w represents the scores of the leaf nodes; γ controls the number of leaf nodes; λ keeps the leaf scores from becoming too large, so as to prevent overfitting; $\sum_{k} \Omega(f_k)$ represents the complexity of the K trees; L(φ) is an expression in linear space; i denotes the ith sample and k the kth tree; $\hat{y}_i$ is the predicted value of the ith sample $x_i$:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
and $y_i$ is the true value;
the XGBoost objective function consists of two parts: the first part measures how well the currently generated model fits the training data; in the second part, XGBoost explicitly takes the complexity of the model as part of the objective function, i.e., the regularization term;
because the loss function in the XGBoost algorithm can be chosen freely as long as it is twice differentiable, the invention selects the squared loss function:
$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$
finally, the predicted values are returned when the loss function reaches its minimum, and the prediction performance of the model is judged by evaluation indices; meanwhile, the XGBoost algorithm calculates the importance scores of the three-dimensional descriptors while searching for the best split points, which yields the importance ranking of the descriptors and provides decision information for the user.
The invention has the following beneficial effects:
1. Current machine learning approaches to predicting Buchwald-Hartwig amination yield still depend on two-dimensional chemical descriptors and large amounts of reaction data. To address this, the invention provides an intelligent prediction method based on the gradient boosting tree model XGBoost for a small-sample environment. The model converts the chemical structure into three-dimensional feature descriptors, retaining both the planar information and the structural information of the chemical structure; for small-sample data, the prediction accuracy of the model is greatly improved; an importance ranking of the descriptors can also be calculated, providing more reliable decision information for the user. The invention can assist chemists in making reasonable analyses and predictions and can greatly accelerate chemical research and development.
2. The grid search method can search the candidate values of one parameter or of several parameters simultaneously, making it easy to obtain the optimal parameters of the model.
3. Combining XGBoost with the grid search method predicts chemical reaction yield more accurately and efficiently.
4. The XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment is simple to operate and easy to implement, gives accurate analysis results, is convenient for users, and meets user requirements.
Drawings
FIG. 1 is a schematic diagram of the reaction scheme and the structures of the associated variables for the chemical reaction in the embodiment of the invention;
FIG. 2 is a two-dimensional structure diagram of one combination of reaction conditions;
FIG. 3 is the three-dimensional structure diagram corresponding to FIG. 2;
FIG. 4 is a flow chart of the analysis method of the invention.
Labels in FIG. 1: the Buchwald-Hartwig amination reaction and the range of variables selected for yield prediction. Aryl halide: halide; Additive: additive; Base: substrate (the base); L (Ligand): ligand.
Detailed Description
As shown in FIGS. 1-4, the invention provides an XGBoost-based intelligent method for predicting and analyzing chemical reaction yield in a small-sample environment, comprising data acquisition, intelligent prediction, and result analysis of three-dimensional descriptors.
In the data acquisition of the three-dimensional descriptors, a drawn two-dimensional structure diagram of the relevant chemical structure is converted into a three-dimensional structure diagram, and the descriptors of the three-dimensional structure are then calculated with the relevant software. This retains both the planar information and the three-dimensional structural information of the chemical structure. The specific steps are as follows:
(1.1) The combinations of all variables in the Buchwald-Hartwig amination reaction (halide, ligand, substrate, and additive) are enumerated in a fixed order, and a two-dimensional structure diagram of each combination is drawn with Spartan software in preparation for conversion into a three-dimensional structure.
In this embodiment, shown in FIG. 1, the two-dimensional structure diagrams are drawn for the permutations and combinations of the reaction variables of the Buchwald-Hartwig amination reaction (halide, ligand, substrate, and additive), and every reaction combination is ordered as halide, ligand, substrate, additive. Using Spartan software, the first case is selected, in which the halide is the variable and the additive, substrate, and ligand are held fixed, and the two-dimensional structure diagram of each reaction combination is drawn in that order. There are 15 halides, 4 ligands, 3 substrates, and 23 additives, giving 15 × 4 × 3 × 23 = 4140 combinations; after the invalid reactions were removed, 3960 valid reactions remained, each paired one-to-one with its reaction yield.
(1.2) In a small-sample setting in which one reactant or reaction condition is taken as the variable and the others are held fixed, the drawn two-dimensional structure diagram of each combination (shown in FIG. 2) is converted into a three-dimensional structure diagram (shown in FIG. 3) and saved as a file in sdf format.
In this embodiment, with the halide as the variable and the additive, substrate, and ligand held fixed as a small-sample setting, the two-dimensional structure diagrams of the 15 drawn reaction combinations are converted into three-dimensional structures in Spartan software.
(1.3) Each sdf file is processed in Python with the RDKit toolkit, which calculates and outputs its three-dimensional structure descriptors, thereby retaining both the three-dimensional structural information and the planar information of the organic compound.
The invention relies mainly on RDKit, a toolkit for cheminformatics and machine learning, to calculate and extract the descriptors of the three-dimensional structure of organic compounds. RDKit is an important tool for preparing molecular data from chemistry, biology, pharmacy, and materials science as input to machine learning and deep learning models. Through its Python interface it covers reading and writing molecules, calculating molecular fingerprints and molecular descriptors, comparing compounds, similarity searching, scaffold analysis and substructure searching, and processing chemical reactions.
(1.4) The calculated three-dimensional structure descriptors of all reaction combinations are collected, divided into a training set (70%) and a test set (30%), and paired with the corresponding reaction yields, so that the XGBoost algorithm can carry out in-sample and out-of-sample prediction and analysis.
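Steps (1.1) and (1.4) can be illustrated with a minimal Python sketch that uses only the standard library. The variable counts (15, 4, 3, 23) and the 70%/30% split come from the embodiment; the labels such as `halide_1`, the helper `split_train_test`, the random shuffle, and the seed are assumptions made for illustration. Because the text does not say which reactions were discarded as invalid, the sketch splits all 4140 combinations rather than the 3960 valid ones.

```python
import itertools
import random

# Variable counts from the embodiment; the labels are hypothetical placeholders.
halides = [f"halide_{i}" for i in range(1, 16)]      # 15 halides
ligands = [f"ligand_{i}" for i in range(1, 5)]       # 4 ligands
substrates = [f"substrate_{i}" for i in range(1, 4)]  # 3 substrates
additives = [f"additive_{i}" for i in range(1, 24)]  # 23 additives

# Every reaction combination, ordered as halide, ligand, substrate, additive.
combinations = list(itertools.product(halides, ligands, substrates, additives))


def split_train_test(items, train_percent=70, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


train_set, test_set = split_train_test(combinations)
print(len(combinations), len(train_set), len(test_set))
```

In practice each combination would be replaced by its descriptor vector and paired with its measured yield before the split.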
In the intelligent prediction, the gradient boosting tree model XGBoost is trained on the obtained three-dimensional descriptors and used for prediction; a grid search method is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results. The steps are as follows:
(2.1) The training set and test set data obtained in step (1.4) are imported into the XGBoost algorithm. A grid search method enumerates combinations of candidate values for several XGBoost parameters; for each combination, the loss function value is calculated at each iteration until it converges to a minimum or a set number of iterations is reached, and the prediction result is output with the corresponding parameter values. Finally, the parameters that give the best prediction result are selected and the model is saved.
In this embodiment, the optimal parameters are obtained automatically by grid search, mainly because the XGBoost algorithm has many parameters. Grid search can exhaustively search the candidate values of a single specified parameter, or enumerate combinations of candidate values for several parameters, in order to find the optimal parameters; this is more efficient than manual tuning.
(2.2) Out-of-sample prediction is performed to demonstrate the effectiveness of the model. The model selected by the invention can predict chemical reaction yield in a small-sample environment and identify the combination of reactants and reaction conditions that corresponds to the highest yield.
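The grid search of step (2.1) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the invention: the parameter names `max_depth`, `learning_rate`, and `n_estimators` are common XGBoost parameters, but the text does not say which parameters or values were searched, and `toy_error` is a hypothetical stand-in for "train the model with these parameters and return its prediction error".

```python
import itertools


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score).

    score_fn maps a parameter dict to an error value (lower is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: the candidate values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 300],
}


def toy_error(params):
    """Hypothetical stand-in for training a model and measuring its error."""
    return (abs(params["max_depth"] - 5)
            + abs(params["learning_rate"] - 0.1)
            + abs(params["n_estimators"] - 300) / 100)


best_params, best_score = grid_search(param_grid, toy_error)
print(best_params, best_score)
```

The grid above has 3 × 3 × 2 = 18 combinations; the cost of an exhaustive search grows multiplicatively with every parameter added.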
In the result analysis, the output of the intelligent prediction module is analyzed; this comprises analysis of the yield prediction results, analysis of the reaction conditions corresponding to each yield, and analysis of the feature importance of the three-dimensional descriptors. Specifically, after the out-of-sample prediction, an importance ranking of the three-dimensional descriptors is calculated with the XGBoost algorithm; this ranking is used to find the main descriptors that influence the reaction yield and to mine and analyze the underlying rules.
In the intelligent prediction of the three-dimensional descriptor, the specific calculation process in the step (2.1) comprises the following steps:
The obtained training set and test set data are imported into the XGBoost algorithm, whose objective function is:
$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
where T represents the number of leaf nodes and w represents the scores of the leaf nodes; γ controls the number of leaf nodes, and λ keeps the leaf scores from becoming too large, so as to prevent overfitting; $\sum_{k} \Omega(f_k)$ represents the complexity of the K trees; L(φ) is an expression in linear space; i denotes the ith sample and k the kth tree; $\hat{y}_i$ is the predicted value of the ith sample $x_i$:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
and $y_i$ is the true value.
As can be seen from the objective function, the XGBoost objective consists of two parts. The first part is the same as the objective function of the conventional GBDT and measures how well the currently generated model fits the training data. The difference lies in the second part: XGBoost explicitly takes the complexity of the model as part of the objective function, namely the regularization term, which itself contains two parts.
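The two-part structure of the objective can be made concrete with a short Python sketch. It assumes the standard XGBoost form of the regularization term, Ω(f) = γT + ½λ‖w‖² (the two parts being the leaf-count penalty and the leaf-score penalty), and the squared loss selected by the invention; the function names and the numbers are hypothetical.

```python
def regularization(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2) for one tree."""
    n_leaves = len(leaf_scores)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_scores)


def objective(y_true, y_pred, trees, gamma, lam):
    """Squared-error fit term plus the complexity of every tree."""
    fit = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    complexity = sum(regularization(scores, gamma, lam) for scores in trees)
    return fit + complexity


# Hypothetical yields (percent) and two small trees given by their leaf scores.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 19.0, 30.0]
trees = [[1.0, -1.0], [0.5, 0.5, -0.5]]
print(objective(y_true, y_pred, trees, gamma=1.0, lam=2.0))
```

Raising `gamma` penalizes trees with many leaves, while raising `lam` penalizes large leaf scores; both push the model toward simpler trees.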
The objective function has many parameters, and tuning them manually is laborious and time-consuming, so a convenient grid search method is introduced to select the optimal parameters.
XGBoost is a tree ensemble model that sums the results of K trees (K being the number of trees) as the final predicted value, namely:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
Assume the given sample set has n samples with m features, $D = \{(x_i, y_i)\}$ $(|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$, where $x_i$ denotes the ith sample, $y_i$ denotes the ith label, and $\mathbb{R}$ is the set of real numbers. The space F of regression trees (CART trees) is:
$$F = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T)$$
where q represents the structure of each tree, which maps a sample to the corresponding leaf node; T is the number of leaf nodes of the tree; each f(x) corresponds to a tree structure q and leaf node weights w; and m is the feature dimension. The predicted value of XGBoost is the sum of the values of the corresponding leaf nodes of each tree.
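The definition f(x) = w_{q(x)} and the additive prediction can be illustrated with a small Python sketch. The two trees, their thresholds, and their leaf scores are hypothetical; `make_tree` and `predict` are names introduced here for illustration only.

```python
def make_tree(q, w):
    """Build f(x) = w[q(x)] from a structure function q and leaf scores w."""
    return lambda x: w[q(x)]


# Two hypothetical trees over a 2-feature sample x = (x0, x1).
# Each q maps a sample to a leaf index; each w lists the leaf scores.
tree_1 = make_tree(lambda x: 0 if x[0] < 0.5 else 1, [10.0, 40.0])
tree_2 = make_tree(lambda x: 0 if x[1] < 2.0 else 1, [-3.0, 5.0])


def predict(trees, x):
    """Ensemble prediction: the sum of the leaf scores selected by each tree."""
    return sum(tree(x) for tree in trees)


print(predict([tree_1, tree_2], (0.7, 1.0)))  # 40.0 + (-3.0)
```

Here the sample (0.7, 1.0) falls into the second leaf of the first tree and the first leaf of the second tree, so the prediction is the sum of those two leaf scores.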
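In GBDT, the first derivative of the loss function L with respect to f(x) is used to calculate the pseudo-residuals from which $f_m(x)$ is learned. XGBoost uses not only the first derivative but also the second derivative. At the tth iteration the objective is:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Performing a second-order Taylor expansion of the above equation, where g is the first derivative and h is the second derivative:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$$
Dropping the constant term $l(y_i, \hat{y}_i^{(t-1)})$ and grouping all training samples by leaf node gives:
$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
Define:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
where $I_j$ is the index set of the samples $x_i$ mapped to the jth leaf node; $G_j$, the sum of the first partial derivatives of the samples in leaf node j, is a constant; and $H_j$, the sum of the second partial derivatives of the samples in leaf node j, is a constant. Substituting these into the objective function yields:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$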
The optimal value can be obtained by treating the objective as a quadratic function of one variable. The objective function is known to be:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$
so the objective function for each leaf node j is:
$$f(w_j) = G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2$$
This is a quadratic function of the single variable $w_j$, and since $(H_j + \lambda) > 0$, $f(w_j)$ takes its minimum at
$$w_j^* = -\frac{G_j}{H_j + \lambda}$$
where the minimum value is:
$$-\frac{1}{2} \frac{G_j^2}{H_j + \lambda}$$
The tree structure is best when the objective value Obj is smallest, and $w_j^*$ is then the optimal solution of the objective function. The simplified objective function is therefore:
$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
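The closed-form leaf solution can be checked numerically with a short Python sketch. It assumes the standard XGBoost results, namely the optimal leaf score −G_j/(H_j+λ), the per-leaf minimum −½·G_j²/(H_j+λ), and the simplified objective obtained by summing these minima and adding γT; the (G_j, H_j) values and function names are hypothetical.

```python
def leaf_solution(G, H, lam):
    """Return (w_star, min_value) for f(w) = G*w + 0.5*(H + lam)*w**2."""
    w_star = -G / (H + lam)
    min_value = -0.5 * G * G / (H + lam)
    return w_star, min_value


def structure_score(leaves, lam, gamma):
    """Simplified objective: -0.5 * sum(G_j^2 / (H_j + lam)) + gamma * T."""
    per_leaf = sum(leaf_solution(G, H, lam)[1] for G, H in leaves)
    return per_leaf + gamma * len(leaves)


# Hypothetical (G_j, H_j) pairs for a tree with two leaves.
leaves = [(-4.0, 3.0), (6.0, 5.0)]
print(leaf_solution(-4.0, 3.0, lam=1.0))
print(structure_score(leaves, lam=1.0, gamma=0.5))
```

A lower structure score means a better tree structure, so the score can be used to compare candidate trees without fitting the leaf scores separately.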
the metric for finding the best segmentation point in the CART regression tree is the minimum mean square error, the metric for finding the segmentation point by XGboost is the maximum, and lambda, gamma are related to the regularization term:
Figure BDA0003069852620000131
wherein, ILRepresents the left leaf node, LRIs the right leaf node, I is the index set of the corresponding samples of the leaf node. Because the target function in the XGboost algorithm can be freely selected, only second-order conductibility is required, through experiments, the target function of the XGboost algorithm in the invention selects a square loss function:
Figure BDA0003069852620000132
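The leaf weight and split metric described above can be tried out on a toy example. The following is a minimal sketch, not the patent's implementation: it assumes the squared loss is l(y, ŷ) = (y − ŷ)², so that g = 2(ŷ − y) and h = 2, and it assumes the standard XGBoost gain formula. All function names and the six data points are made up for illustration.

```python
# Sketch under stated assumptions: loss l(y, y_hat) = (y - y_hat)**2 and
# regulariser Omega = gamma*T + 0.5*lam*sum(w**2). Names are hypothetical.

def grad_hess(y_true, y_pred):
    """First and second derivatives of (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - t) for t, p in zip(y_true, y_pred)]
    h = [2.0 for _ in y_true]
    return g, h

def leaf_weight(G, H, lam):
    """Optimal leaf score w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of one leaf."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of a split: children's scores minus the parent's, less the penalty gamma."""
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return 0.5 * (children - parent) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search on one feature: try every boundary between sorted values."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_thr = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue  # identical values cannot be separated by a threshold
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[nxt])
    return best_gain, best_thr

# Toy data (illustrative only): two clearly separated groups of targets.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 50.0, 52.0, 54.0]
g, h = grad_hess(y, [0.0] * len(y))   # start from an all-zero prediction
gain, threshold = best_split(x, g, h)
print(threshold, round(gain, 2))
```

On this data the search places the threshold between the two groups, because that split removes the largest share of the summed squared gradient.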
and finally, the predicted value is returned when the model reaches the minimum of the loss function, and the prediction performance of the model is judged through the evaluation index. Meanwhile, in the process of searching for the optimal split point, the XGboost algorithm calculates an importance score for each three-dimensional descriptor, which gives an importance ranking of the descriptors and provides decision information for the user.
Compared with other boosting algorithms, the XGboost algorithm gives more accurate predictions and runs more efficiently, for the following reasons. A regularization term is added to the objective function to prevent overfitting, and two further techniques are used for the same purpose: shrinkage and column (feature) subsampling. XGboost takes sparse training data into account and can specify a default branch direction for missing values or for a designated value, which greatly improves the efficiency and accuracy of the algorithm. When searching for the optimal split point of a feature, the traditional greedy method of enumerating every possible split point of every feature is too inefficient, so XGboost implements an approximate algorithm: it enumerates several candidate split points according to percentiles of the feature and then selects the best of these candidates using the split-point formula given above. After the feature columns are sorted, they are stored in memory in blocks that can be reused across iterations. Although the boosting iterations themselves must run serially, the feature columns can be processed in parallel. XGboost also considers how to use the disk effectively when the data volume is large and memory is insufficient, mainly combining multithreading, data compression and sharding to improve the efficiency of the algorithm as far as possible.
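The percentile-based approximate search described above can be illustrated as follows. This is a simplified sketch, not XGBoost's actual routine: the real library uses a quantile sketch weighted by the second derivatives, whereas the code below takes plain quantiles, which is equivalent only when h is constant, as it is for the assumed squared loss (y − ŷ)². The function names and toy data are hypothetical.

```python
import statistics

def candidate_splits(values, n_buckets):
    """Candidate thresholds at the interior quantiles of one feature column."""
    cuts = statistics.quantiles(values, n=n_buckets, method="inclusive")
    return sorted(set(cuts))

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Standard regularised gain of a split (assumed form)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_candidate(x, g, h, candidates, lam=1.0, gamma=0.0):
    """Evaluate only the candidate thresholds instead of every boundary."""
    G, H = sum(g), sum(h)
    best_gain, best_thr = float("-inf"), None
    for thr in candidates:
        G_L = sum(gi for xi, gi in zip(x, g) if xi < thr)
        H_L = sum(hi for xi, hi in zip(x, h) if xi < thr)
        if H_L == 0 or H_L == H:
            continue  # one side would be empty
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr

# Toy data (illustrative only): a step in the target between x = 10 and x = 11.
x = [float(i) for i in range(1, 21)]
y = [1.0 if xi <= 10 else 9.0 for xi in x]
g = [2.0 * (0.0 - yi) for yi in y]   # gradients at an all-zero prediction
h = [2.0] * len(y)

candidates = candidate_splits(x, n_buckets=4)   # three quartile cut points
best_gain, best_thr = best_candidate(x, g, h, candidates)
print(candidates, best_thr)
```

With four buckets only three thresholds are evaluated instead of nineteen. The trade-off is that the true best boundary is found only if it lies near one of the candidates.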
Simulation experiment:
the system of the present invention is further demonstrated by simulation experiments, taking the Buchwald-Hartwig amination reaction (chemical reaction formula shown in FIG. 1) as the example of user-selected data imported into the system.
Figure BDA0003069852620000141
The small sample data were assembled as follows: the halide was taken as the variable while the additive, substrate and ligand were each fixed at the first option, giving data for 15 samples. The prediction analysis is carried out by drawing a two-dimensional structure diagram of each reaction combination, converting it into a three-dimensional structure, saving it in sdf format, summarizing the data and dividing them into a test set and a training set. As shown in the table above, better prediction results are obtained than with two-dimensional descriptors.

Claims (6)

1. An XGboost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment, characterized by comprising the following steps: data acquisition of the three-dimensional descriptor, intelligent prediction of the three-dimensional descriptor, and result analysis of the three-dimensional descriptor;
the data acquisition of the three-dimensional descriptor is realized by converting a drawn two-dimensional structure diagram of a related chemical structure into a three-dimensional structure diagram and then calculating the descriptor of the three-dimensional structure by using related software;
the intelligent prediction of the three-dimensional descriptor trains and predicts on the obtained three-dimensional descriptors using the gradient boosting tree model XGboost; a grid search method is embedded in the algorithm, which enumerates the combinations of candidate values of several parameters and selects the optimal parameters according to the model prediction results;
the result analysis of the three-dimensional descriptor analyzes the output of the intelligent prediction step, and comprises analysis of the yield prediction results, analysis of the reaction conditions corresponding to the yield, and analysis of the feature importance of the three-dimensional descriptors.
2. The intelligent prediction and analysis method for chemical reaction yield according to claim 1, characterized in that: the data acquisition of the three-dimensional descriptor specifically comprises the following steps:
(1.1) arranging and combining all variables in the Buchwald-Hartwig amination reaction in a fixed order, and drawing a two-dimensional structure diagram of each combination;
(1.2) under the small sample condition in which one reactant or reaction condition is taken as the variable and the rest are held fixed, converting the drawn two-dimensional structure diagram of each combination into a three-dimensional structure diagram, and saving it to a file;
(1.3) calculating and outputting the three-dimensional structure descriptors of the files saved in step (1.2) by using related software, so as to retain the structural information and planar information of the organic compound;
(1.4) collecting the calculated three-dimensional structure descriptors of all reaction combinations, dividing them into a training set and a test set, and matching them with the corresponding reaction yields.
3. The intelligent predicting and analyzing method for chemical reaction yield according to claim 2, wherein: the intelligent prediction of the three-dimensional descriptor specifically comprises the following steps:
(2.1) importing the training set and test set data obtained in step (1.4) into the XGboost algorithm, enumerating the combinations of candidate values of several parameters of the XGboost algorithm by a grid search method, calculating the loss function value of each iteration of the XGboost algorithm until the loss function value converges to its minimum or a set number of iterations is reached, outputting the prediction result and the corresponding parameter values, and finally selecting the parameters corresponding to the best prediction result and saving the model;
(2.2) performing out-of-sample prediction to verify the effectiveness of the model; the selected model predicts the chemical reaction yield in a small sample environment and identifies the combination of reactants and reaction conditions corresponding to the highest yield.
4. The XGboost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment according to claim 3, wherein: the result analysis of the three-dimensional descriptor specifically includes:
after in-sample and out-of-sample prediction is carried out, obtaining the importance ranking of the three-dimensional descriptors as calculated by the XGboost algorithm; and using this ranking to find the main descriptors influencing the reaction yield, and mining and analyzing the underlying patterns.
5. The intelligent prediction and analysis method for chemical reaction yield according to claim 2, characterized in that: when the two-dimensional structure diagrams are drawn in step (1.1), the reaction variables in the Buchwald-Hartwig amination reaction (halide, ligand, substrate and additive) are arranged and combined, the order within every reaction combination being halide, ligand, substrate, additive; and, using Spartan software, the halide is taken as the variable while the additive, substrate and ligand are each fixed at the first option, and the two-dimensional structure diagram of each resulting reaction combination is drawn in that order.
6. The intelligent prediction and analysis method for chemical reaction yield according to claim 3, characterized in that: in the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:
importing the obtained training set and test set data into the XGboost algorithm, whose objective function is as follows:
Figure FDA0003069852610000031
Figure FDA0003069852610000032
wherein T represents the number of leaf nodes and w represents the scores of the leaf nodes; γ controls the number of leaf nodes; λ keeps the leaf node scores from becoming too large, so as to prevent overfitting;
Figure FDA0003069852610000033
represents the complexity of the K trees; L(φ) is an expression in linear space; i indexes the samples and k indexes the trees;
Figure FDA0003069852610000034
is the predicted value of the i-th sample x_i:
Figure FDA0003069852610000035
y_i is the true value;
the XGboost objective function is composed of two parts, wherein the first part is used for measuring the goodness of fit of the currently generated model to the training data; in another part, the XGboost explicitly takes the complexity of the model as part of the objective function, i.e., the regularization term;
the loss function in the XGboost algorithm can be chosen freely as long as it is twice differentiable; the function selected for the XGboost algorithm here is:
Figure FDA0003069852610000041
and finally, returning a predicted value when the model reaches the minimum value according to the loss function, and judging the prediction effect of the model through the evaluation index.
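The grid search of step (2.1) can be sketched without external libraries. The example below is an illustration only and is not the patented implementation: it boosts single-split trees on one feature with the assumed squared loss (y − ŷ)², and the parameter names n_rounds, eta and lam, the candidate values and the toy numbers are all hypothetical. A real run would use the three-dimensional descriptor matrix and an actual XGBoost model.

```python
import itertools

def fit_stump(x, g, h, lam):
    """Best single split by regularised gain; returns (threshold, w_left, w_right)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    w_all = -G / (H + lam)
    best_gain, best = 0.0, (None, w_all, w_all)   # fall back to a single leaf
    G_L = H_L = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam))
        if gain > best_gain:
            best_gain = gain
            best = (0.5 * (x[i] + x[nxt]), -G_L / (H_L + lam), -G_R / (H_R + lam))
    return best

def stump_predict(stump, xi):
    thr, w_left, w_right = stump
    return w_left if thr is None or xi < thr else w_right

def fit_boosted(x, y, n_rounds, eta, lam):
    """Additive training: each round fits a stump to the current gradients."""
    pred, stumps = [0.0] * len(y), []
    for _ in range(n_rounds):
        g = [2.0 * (p - t) for p, t in zip(pred, y)]   # squared-loss gradient
        h = [2.0] * len(y)                              # squared-loss hessian
        stump = fit_stump(x, g, h, lam)
        stumps.append(stump)
        pred = [p + eta * stump_predict(stump, xi) for p, xi in zip(pred, x)]
    return stumps

def mse(stumps, eta, x, y):
    preds = [sum(eta * stump_predict(s, xi) for s in stumps) for xi in x]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def grid_search(x_tr, y_tr, x_va, y_va, grid):
    """Try every combination of candidate values; keep the lowest validation error."""
    names = sorted(grid)
    results = []
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        model = fit_boosted(x_tr, y_tr, **params)
        results.append((mse(model, params["eta"], x_va, y_va), params))
    return min(results, key=lambda r: r[0]), results

# Toy numbers, invented for illustration (not data from the patent).
x_train = [float(i) for i in range(1, 13)]
y_train = [10.0, 12.0, 11.0, 13.0, 40.0, 42.0, 41.0, 43.0, 75.0, 78.0, 76.0, 79.0]
x_val, y_val = [2.5, 6.5, 10.5], [12.0, 41.0, 77.0]
grid = {"n_rounds": [5, 20], "eta": [0.1, 0.5], "lam": [0.0, 1.0]}

best, results = grid_search(x_train, y_train, x_val, y_val, grid)
print(best)
```

Scoring each combination on held-out samples rather than on the training samples is a design choice of this sketch; with very few samples, cross-validation would be a natural alternative.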
CN202110535993.2A 2021-03-23 2021-05-17 XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment Active CN113517033B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021103081299 2021-03-23
CN202110308129 2021-03-23

Publications (2)

Publication Number Publication Date
CN113517033A true CN113517033A (en) 2021-10-19
CN113517033B CN113517033B (en) 2022-08-12

Family

ID=78064505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110535993.2A Active CN113517033B (en) 2021-03-23 2021-05-17 XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment

Country Status (1)

Country Link
CN (1) CN113517033B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019055499A1 (en) * 2017-09-12 2019-03-21 Massachusetts Institute Of Technology Systems and methods for predicting chemical reactions
US20190340316A1 (en) * 2018-05-03 2019-11-07 Lam Research Corporation Predicting etch characteristics in thermal etching and atomic layer etching
CN110491453A (en) * 2018-04-27 2019-11-22 上海交通大学 A kind of yield prediction method of chemical reaction
EP3712897A1 (en) * 2019-03-22 2020-09-23 Tata Consultancy Services Limited Automated prediction of biological response of chemical compounds based on chemical information
CN111863143A (en) * 2020-07-31 2020-10-30 中国石油化工股份有限公司 Parameter estimation method and device for catalytic cracking kinetic model
CN112272764A (en) * 2018-01-30 2021-01-26 斯坦福国际研究院 Computational generation of chemical synthetic routes and methods

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
AKIRA YADA ET AL.: "Machine Learning Approach for Prediction of Reaction Yield with Simulated Catalyst Parameters", 《CHEMISTRY LETTERS》 *
DAN ZHANG ET AL.: "iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins", 《COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE》 *
DEREK T. AHNEMAN ET AL.: "Predicting reaction performance in C – N cross-coupling using machine learning", 《SCIENCE》 *
TIANQI CHEN ET AL.: "XGBoost: A Scalable Tree Boosting System", 《ARXIV[CS.LG]》 *
付尊蕴: "基于深度学习的小分子虚拟筛选和反应产率预测", 《中国优秀博硕士学位论文全文数据库(硕士)医药卫生科技辑》 *
刘伊迪 等: "机器学习在有机化学中的应用", 《有机化学》 *

Also Published As

Publication number Publication date
CN113517033B (en) 2022-08-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant