CN113517033A

CN113517033A - XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment

Info

Publication number: CN113517033A
Application number: CN202110535993.2A
Authority: CN
Inventors: 杨晓慧; 彭李超; 董晶; 张普玉; 张泽霖
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2021-03-23
Filing date: 2021-05-17
Publication date: 2021-10-19
Anticipated expiration: 2041-05-17
Also published as: CN113517033B

Abstract

The invention discloses an XGBoost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment, comprising three parts: data acquisition of three-dimensional descriptors, intelligent prediction, and result analysis. In data acquisition, a drawn two-dimensional structure diagram of the relevant chemical structure is converted into a three-dimensional structure diagram, and the descriptors of the three-dimensional structure are then calculated with suitable software. In intelligent prediction, the gradient boosting tree model XGBoost is trained on the obtained three-dimensional descriptors and used for prediction; a grid search is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model's prediction results. In result analysis, the output of the intelligent prediction module is analyzed; an importance ranking of the descriptors can also be calculated, providing more reliable decision information for the user. The invention can help chemists make sound analyses and predictions and can greatly accelerate chemical research and development.

Description

XGboost-based chemical reaction yield intelligent prediction and analysis method in small sample environment

Technical Field

The invention belongs to the field of organic synthesis based on pattern recognition and artificial intelligence, and particularly relates to an XGboost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment.

Background

Coupling reactions are very important in organic synthesis, and their products are widely used in medicines, pesticides, natural products and even advanced functional materials. By reaction type, coupling reactions can be divided into cross-coupling reactions (two different fragments are combined into one molecule) and self-coupling reactions (two identical fragments are combined into one molecule). Palladium-catalyzed cross-coupling reactions are a broad class of coupling reactions in which a palladium compound is used as the catalyst (mostly a homogeneous catalyst).

To produce complex but very important organic materials, carbon atoms must be bonded together by chemical reaction. However, the bonds between carbon atoms and their neighboring atoms in organic molecules are often very stable and do not readily react with other molecules. Earlier methods could make carbon atoms more reactive, but overly reactive carbon atoms generate large amounts of byproducts; using palladium as a catalyst solves this problem. The palladium atom acts like a matchmaker, drawing different carbon atoms to its side so that they come very close to each other and combine easily, that is, couple. Such a reaction does not require the carbon atoms to be activated to a highly reactive level and gives fewer byproducts, so it is more precise and efficient. As early as 1972, Richard F. Heck discovered that palladium as a catalyst can connect carbon atoms under milder conditions; Ei-ichi Negishi and Akira Suzuki further developed palladium-catalyzed C-C cross-coupling methods in 1977 and 1979, respectively, broadening the range of substrates and products of this type of reaction. These methods allow stable carbon atoms to be coupled easily and efficiently, so that structurally more complex molecules can be synthesized. The advent of these cross-coupling methods has brought unprecedented improvements in chemists' ability to manipulate atoms and molecules.

For developing "palladium-catalyzed cross couplings in organic synthesis", the American scientist Richard F. Heck and the Japanese scientists Ei-ichi Negishi and Akira Suzuki were jointly awarded the 2010 Nobel Prize in Chemistry. With these methods, many substances that were difficult or even impossible to synthesize in the past can now be made easily. Their methods have been widely applied in scientific research and industrial production in pharmacy, the electronics industry, advanced materials and other fields.

Palladium catalysis can realize not only C-C coupling reactions but also carbon-heteroatom coupling reactions. C-N bond formation is an important area of modern organic synthesis. Through the formation of C-N bonds, amines and their derivatives, nitrogen-containing heterocycles and the like can be prepared, many of which are biologically and pharmaceutically active compounds or important intermediates. Arylamines play an important role in medicines, functional materials and other fields, and the Buchwald-Hartwig amination is an efficient and general method for synthesizing substituted arylamines; it is one of the research hotspots in palladium-catalyzed C-N bond construction in organic synthesis. However, applying this reaction to complex drug-like molecules remains challenging. First, substrates containing five-membered heterocycles with heteroatom-heteroatom bonds (such as isoxazoles) perform poorly; these heterocycles have drug-like characteristics but are under-represented among successful drug candidates. Second, Pd is a noble metal, so it is expensive and has some toxicity. Third, there is no mature rule for choosing the additive, substrate and ligand for a given reaction substrate so as to obtain a higher yield and a more efficient reaction; traditional manual experiments rely on repeated trial and error, which is inefficient, costly and difficult to carry out. How to select reaction conditions efficiently and at low cost so as to obtain a higher yield is therefore a major concern for researchers. For this reason, researchers have considered using machine learning to predict the performance of the Buchwald-Hartwig reaction in the presence of isoxazoles.

In recent years, the development of machine learning algorithms has provided a "shortcut" for finding suitable reaction substrates and reaction conditions for the Buchwald-Hartwig amination. Machine learning has become part of scientific research in many disciplines and brings new opportunities to organic chemistry, enabling prediction of catalyst activity, chemical reaction performance, compound properties and so on. Information in a chemical system is screened or encoded into a particular representation of chemical information, namely a descriptor, so that research in chemistry can be converted into a data processing task and the dependence on personnel is reduced to a certain extent. Machine learning can mine correlations in the large volumes of experimental data generated by chemical experiments, helping chemists make sound analyses and predictions and greatly accelerating chemical research and development.

Disclosure of Invention

To solve the problems of the prior art and of insufficient available data, the invention aims to provide an XGBoost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment. By incorporating three-dimensional chemical structure information, the method predicts and analyzes chemical reaction yield automatically and efficiently, which facilitates follow-up work by researchers; the whole model has a short training time, high accuracy and good robustness.

In order to achieve the purpose, the invention adopts the technical scheme that:

the XGboost-based chemical reaction yield intelligent prediction and analysis method in a small sample environment comprises data acquisition, intelligent prediction and result analysis of a three-dimensional descriptor;

the data acquisition of the three-dimensional descriptor is realized by converting a drawn two-dimensional structure diagram of a related chemical structure into a three-dimensional structure diagram and then calculating the descriptor of the three-dimensional structure by using related software;

the intelligent prediction of the three-dimensional descriptor trains the gradient boosting tree model XGBoost on the obtained three-dimensional descriptors and uses it for prediction; a grid search method is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model prediction results;

analyzing the result of the three-dimensional descriptor, namely analyzing the output of the intelligent prediction module; the method comprises the steps of yield prediction result analysis, reaction condition analysis corresponding to yield and three-dimensional descriptor feature importance analysis.

The data acquisition of the three-dimensional descriptor specifically comprises the following steps:

(1.1) arranging and combining all variables in Buchwald-Hartwig amination reaction according to a certain sequence, and drawing a two-dimensional structure diagram of each combination;

(1.2) with one reactant or reaction condition taken as the variable and the rest held fixed (the small-sample setting), converting each drawn two-dimensional structure diagram into the corresponding three-dimensional structure diagram, and saving it as a file;

(1.3) calculating and outputting the three-dimensional structure descriptor of the file saved in the step (1.2) by using related software, so as to keep the structure information and the plane information of the organic compound;

and (1.4) collecting the calculated three-dimensional structure descriptors of all reaction combinations, dividing them into a training set and a test set, and matching them with their corresponding reaction yields.

The intelligent prediction of the three-dimensional descriptor specifically comprises the following steps:

(2.1) importing the training set and test set data obtained in step (1.4) into the XGBoost algorithm; enumerating combinations of candidate values for several XGBoost parameters with the grid search method; for each combination, computing the loss function value at every iteration of the XGBoost algorithm until the loss converges to a minimum or a set number of iterations is reached, and outputting the prediction result together with the corresponding parameter values; and finally selecting the parameters that give the best prediction result and saving the model;

(2.2) performing out-of-sample prediction to prove the effectiveness of the model; and the model selected by the invention can predict the chemical reaction yield in a small sample environment, and determine the combination of reactants and reaction conditions to provide the reaction combination corresponding to the highest yield.

The result analysis of the three-dimensional descriptor specifically includes:

after in-sample and out-of-sample prediction is carried out, the importance ranking of the three-dimensional descriptors is calculated by the XGBoost algorithm; the main descriptors influencing the reaction yield are identified from this ranking, and the underlying rules are mined and analyzed.

When the two-dimensional structure diagrams are drawn in step (1.1), the reaction variables of the Buchwald-Hartwig amination (halide, ligand, substrate and additive) are enumerated in all combinations, with every reaction combination ordered as halide, ligand, substrate, additive; in Spartan, the halide is taken as the variable while the additive, substrate and ligand are each fixed at their first option, and a two-dimensional structure diagram is drawn for each resulting reaction combination in that order.

In the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:

the obtained training set and test set data are imported into the XGBoost algorithm, whose objective function is:

$$L(\phi)=\sum_{i}l(y_i,\hat{y}_i)+\sum_{k}\Omega(f_k),\qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\lVert w\rVert^{2}$$

wherein T represents the number of leaf nodes and w represents the scores of the leaf nodes; $\gamma$ controls the number of leaf nodes; $\lambda$ keeps the leaf node scores from becoming too large, so as to prevent overfitting;

$\sum_{k}\Omega(f_k)$ represents the complexity of the K trees; $L(\phi)$ is an expression in linear space; i indexes the samples and k indexes the trees;

$\hat{y}_i$ is the predicted value of the ith sample $x_i$;

$y_i$ is the true value;

the XGboost objective function is composed of two parts, wherein the first part is used for measuring the goodness of fit of the currently generated model to the training data; in another part, the XGboost explicitly takes the complexity of the model as part of the objective function, i.e., the regularization term;

because the loss function in the XGBoost objective can be chosen freely as long as it is twice differentiable, the invention selects the squared loss function:

$$l(y_i,\hat{y}_i)=(y_i-\hat{y}_i)^{2}$$

and finally, returning a predicted value when the model reaches the minimum value according to the loss function, judging the prediction effect of the model through the evaluation index, and meanwhile, calculating the importance score of the three-dimensional descriptor by the XGboost algorithm in the process of searching the optimal segmentation point so as to obtain the importance ranking of the descriptor, thereby providing certain decision information for the user.

The invention has the following beneficial effects:

1. Current machine learning prediction of Buchwald-Hartwig amination yield still depends on two-dimensional chemical descriptors and a large amount of reaction data. To address this, the invention provides an intelligent prediction method based on the gradient boosting tree model XGBoost in a small sample environment. The model converts the chemical structure into three-dimensional feature descriptors, retaining both the planar information and the structural information of the chemical structure; for small sample data, the prediction accuracy of the model is greatly improved; an importance ranking of the descriptors can also be calculated, providing more reliable decision information for the user. The invention can help chemists make sound analyses and predictions and can greatly accelerate chemical research and development.

2. The grid searching method can simultaneously search possible values of a certain parameter or a plurality of parameters, thereby conveniently obtaining the optimal parameter of the model.

3. Combining XGBoost with the grid search method predicts the yield of the chemical reaction more accurately and efficiently.

4. The XGboost-based chemical reaction yield intelligent prediction and analysis method under the small sample environment is simple to operate, easy to implement, accurate in analysis result, greatly convenient for relevant users to use and capable of meeting user requirements.

Drawings

FIG. 1 is a schematic diagram of the reaction scheme and the associated variable structures of the chemical reactions in the examples of the present invention;

FIG. 2 is a two-dimensional structure diagram of a combination of reaction conditions;

FIG. 3 is the three-dimensional structure diagram corresponding to FIG. 2;

FIG. 4 is a flow chart of an analysis method of the present invention.

Labels in FIG. 1: the Buchwald-Hartwig amination reaction and the range of variables selected for yield prediction; Aryl halide: halide; Additive: additive; Base: substrate; L (Ligand): ligand.

Detailed Description

As shown in FIGS. 1-4, the invention provides an XGboost-based chemical reaction yield intelligent prediction and analysis method in a small sample environment, which comprises data acquisition, intelligent prediction and result analysis of a three-dimensional descriptor.

The data acquisition of the three-dimensional descriptor is realized by converting a drawn two-dimensional structure diagram of a related chemical structure into a three-dimensional structure diagram and then calculating the descriptor of the three-dimensional structure by using related software. The plane information of the chemical structure is kept, and the structural information is kept; the concrete implementation steps comprise:

(1.1) arranging and combining all variables comprising halide, ligand, substrate and additive in Buchwald-Hartwig amination reaction according to a certain sequence, and drawing a two-dimensional structure diagram of each combination by utilizing Spartan software to prepare for converting into a three-dimensional structure.

In this example, shown in FIG. 1, the reaction variables of the Buchwald-Hartwig amination (halide, ligand, substrate and additive) are enumerated in all combinations when the two-dimensional structure diagrams are drawn, with every reaction combination ordered as halide, ligand, substrate, additive. In Spartan, the halide is taken as the variable while the additive, substrate and ligand are each fixed at their first option, and a two-dimensional structure diagram is drawn for each resulting reaction combination in that order. There are 15 halides, 4 ligands, 3 substrates and 23 additives, giving 15 × 4 × 3 × 23 = 4140 combinations; after the ineffective reactions are removed, 3960 effective reactions remain, each matched one-to-one with its reaction yield.
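The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compound labels are hypothetical placeholders (the actual compounds appear in FIG. 1), and the code checks the combination count and the small-sample subset, not any chemistry.

```python
from itertools import product

# Hypothetical placeholder labels; the actual compounds are those of FIG. 1.
halides = [f"halide_{i}" for i in range(1, 16)]        # 15 halides
ligands = [f"ligand_{i}" for i in range(1, 5)]         # 4 ligands
substrates = [f"substrate_{i}" for i in range(1, 4)]   # 3 substrates
additives = [f"additive_{i}" for i in range(1, 24)]    # 23 additives

# Full factorial design in the order halide, ligand, substrate, additive.
all_combinations = list(product(halides, ligands, substrates, additives))
print(len(all_combinations))  # 15 * 4 * 3 * 23 = 4140

# Small-sample subset: vary the halide, fix the other three at their first option.
fixed = (ligands[0], substrates[0], additives[0])
small_sample = [c for c in all_combinations if c[1:] == fixed]
print(len(small_sample))  # 15
```

Removing the ineffective reactions (4140 down to 3960) would need the experimental yield table, which is not reproduced here.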

(1.2) With one reactant or reaction condition taken as the variable and the rest held fixed (the small-sample setting), each drawn two-dimensional structure diagram (shown in FIG. 2) is converted into the corresponding three-dimensional structure diagram (shown in FIG. 3) and saved as an sdf format file.

In this example, with the halide as the variable and the additive, substrate and ligand held fixed, the two-dimensional structure diagrams of the 15 drawn reaction combinations are converted into three-dimensional structures in Spartan.

(1.3) Each sdf format file is processed in Python with the RDKit toolkit, which calculates and outputs its three-dimensional structure descriptors, thereby retaining both the structural information and the planar information of the organic compound.

The calculation and extraction of the three-dimensional structure descriptors of the organic compounds relies mainly on RDKit, a toolkit for cheminformatics and machine learning. RDKit is an important tool for turning molecular data from chemistry, biology, pharmacy and materials science into inputs for machine learning and deep learning models. Used from Python, it covers molecule reading and writing, calculation of molecular fingerprints and molecular descriptors, compound comparison, similarity search, scaffold analysis and substructure search, chemical reaction handling, and more.

(1.4) The calculated three-dimensional structure descriptors of all reaction combinations are collected, divided into a training set (70%) and a test set (30%), and matched with their corresponding reaction yields, so that the XGBoost algorithm can carry out in-sample and out-of-sample prediction and analysis.
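A 70/30 split of this kind can be sketched with the Python standard library alone. The helper name `split_train_test`, the random seed and the descriptor and yield values below are all illustrative assumptions, not details from the patent.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical rows: (descriptor vector, yield) pairs with made-up numbers.
rows = [([0.1 * i, 1.0 - 0.05 * i], 10.0 + 5.0 * i) for i in range(15)]
train, test = split_train_test(rows)
print(len(train), len(test))  # 10 5
```

Keeping each descriptor vector paired with its yield inside one row ensures the two stay aligned through the shuffle.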

The intelligent prediction trains the gradient boosting tree model XGBoost on the obtained three-dimensional descriptors and uses it for prediction; a grid search method is embedded in the algorithm, which enumerates combinations of candidate values for several parameters and selects the optimal parameters according to the model prediction results. It specifically comprises the following steps:

(2.1) The training set and test set data obtained in step (1.4) are imported into the XGBoost algorithm, and combinations of candidate values for several XGBoost parameters are enumerated with the grid search method. For each combination, the loss function value is computed at every iteration of the XGBoost algorithm until the loss converges to a minimum or a set number of iterations is reached, and the prediction result is output together with the corresponding parameter values. Finally, the parameters that give the best prediction result are selected and the model is saved.

In the embodiment, the optimal parameters are automatically obtained by using the grid search method, mainly because the XGBoost algorithm contains more parameters, and the grid search method can specify a certain parameter value to perform exhaustive search and also can arrange and combine possible values of a plurality of parameters, so as to find the optimal parameters, which is more efficient than manual parameter adjustment.
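The grid search idea can be sketched as follows. The parameter names mirror common XGBoost settings, but the candidate values are assumptions, and `validation_error` is a toy stand-in for "train the model with these parameters and measure its error"; a real run would fit XGBoost at that point.

```python
from itertools import product

# Hypothetical search grid; values are illustrative, not taken from the patent.
param_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [50, 100],
}

def validation_error(params):
    """Toy stand-in for model training; minimal at depth 3, rate 0.1, 100 trees."""
    return (abs(params["max_depth"] - 3)
            + abs(params["learning_rate"] - 0.1)
            + abs(params["n_estimators"] - 100) / 100)

def grid_search(grid, score):
    """Evaluate every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

best_params, best_score = grid_search(param_grid, validation_error)
print(best_params)  # {'max_depth': 3, 'learning_rate': 0.1, 'n_estimators': 100}
```

The cost grows with the product of the list lengths (here 3 × 3 × 2 = 18 evaluations), which is why the grid is usually kept small.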

Wherein the result analysis is to analyze the output of the intelligent prediction module; the method comprises the steps of yield prediction result analysis, reaction condition analysis corresponding to yield and three-dimensional descriptor feature importance analysis. The method specifically comprises the following steps: after the out-of-sample prediction is carried out, calculating by using an XGboost algorithm to obtain the importance ranking of the three-dimensional descriptors; and finding main descriptors influencing the reaction yield through the importance ranking of the descriptors, mining internal rules and analyzing the internal rules.
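One way to picture the importance ranking is to sum, for every descriptor, the gains of the splits that used it, and then sort. The sketch below assumes this "total gain" convention, which is only one of several in use; the descriptor names and gain values are invented for illustration and are not results from the patent.

```python
from collections import defaultdict

# Hypothetical split log: (descriptor used for the split, gain of that split).
split_log = [
    ("radius_of_gyration", 12.0),
    ("asphericity", 3.5),
    ("radius_of_gyration", 4.0),
    ("eccentricity", 7.25),
]

def rank_by_total_gain(splits):
    """Sum the gain per descriptor; return (name, total_gain) sorted descending."""
    totals = defaultdict(float)
    for name, gain in splits:
        totals[name] += gain
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_total_gain(split_log)
for name, total in ranking:
    print(f"{name}: {total}")
```

The descriptors at the top of such a ranking are the candidates for the "main descriptors influencing the reaction yield" mentioned above.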

In the intelligent prediction of the three-dimensional descriptor, the specific calculation process in the step (2.1) comprises the following steps:

The obtained training set and test set data are imported into the XGBoost algorithm, whose objective function is:

$$L(\phi)=\sum_{i}l(y_i,\hat{y}_i)+\sum_{k}\Omega(f_k),\qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\lVert w\rVert^{2}$$

where T represents the number of leaf nodes and w represents the scores of the leaf nodes; $\gamma$ controls the number of leaf nodes, and $\lambda$ keeps the leaf node scores from becoming too large, so as to prevent overfitting;

$\hat{y}_i$ is the predicted value of the ith sample $x_i$;

$y_i$ is the true value.

As can be seen from the objective function, the objective function of the XGBoost is composed of two parts, the first part is the same as the objective function of the conventional GBDT and is used for measuring the goodness of fit of the currently generated model to the training data, and the difference is that in the second part: XGboost explicitly takes the complexity of the model as part of the objective function, namely the regularization term, which also contains two parts.

The objective function has many parameters, and manual parameter tuning is laborious and time-consuming, so the grid search method is introduced as a convenient way to select the optimal parameters.

XGBoost is a tree ensemble model that sums the outputs of K trees (K being the number of trees) to give the final predicted value, namely:

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i),\qquad f_k\in F$$

Assume a given sample set has n samples with m features, $D=\{(x_i,y_i)\}$ $(|D|=n,\ x_i\in\mathbb{R}^{m},\ y_i\in\mathbb{R})$, in which $x_i$ denotes the ith sample, $y_i$ denotes the ith label, and $\mathbb{R}$ is the set of real numbers. The space F of regression trees (CART trees) is:

$$F=\{f(x)=w_{q(x)}\}\quad(q:\mathbb{R}^{m}\to T,\ w\in\mathbb{R}^{T})$$

where q represents the structure of each tree, which maps a sample to the corresponding leaf node; T is the number of leaf nodes of the tree; each f(x) corresponds to a tree structure q and leaf node weights w; and m is the feature dimension. The predicted value of XGBoost is the sum of the values of the corresponding leaf nodes of each tree.
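The statement that the prediction is the sum of the leaf values selected by each tree can be illustrated with a toy example. The two trees below, their thresholds and their leaf weights are all made up; each tree is represented as a pair (q, w), where q maps a feature vector to a leaf index and w lists one weight per leaf.

```python
# Toy tree structures: each q maps a feature vector x to a leaf index.
def q1(x):
    return 0 if x[0] < 0.5 else 1

def q2(x):
    return 0 if x[1] < 2.0 else 1

# Each tree is (q, w); all thresholds and weights are illustrative only.
trees = [
    (q1, [10.0, 30.0]),
    (q2, [-5.0, 5.0]),
]

def predict(x, trees):
    """Return the ensemble prediction: the sum of w[q(x)] over all trees."""
    return sum(w[q(x)] for q, w in trees)

print(predict([0.2, 3.0], trees))  # 10.0 + 5.0 = 15.0
print(predict([0.9, 1.0], trees))  # 30.0 + (-5.0) = 25.0
```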

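In GBDT, the first derivative of the loss function with respect to f(x) is used to compute the pseudo-residuals from which $f_m(x)$ is learned. XGBoost uses not only the first derivative but also the second derivative. At the t-th iteration, the objective function is:

$$Obj^{(t)}=\sum_{i=1}^{n}l\left(y_i,\ \hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t)$$

Performing a second-order Taylor expansion on the above equation, where $g_i$ is the first derivative and $h_i$ is the second derivative of the loss with respect to $\hat{y}_i^{(t-1)}$:

$$Obj^{(t)}\simeq\sum_{i=1}^{n}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t)$$

Dropping the constant term $l(y_i,\hat{y}_i^{(t-1)})$ and grouping all training samples by leaf node gives:

$$Obj^{(t)}\simeq\sum_{j=1}^{T}\left[\left(\sum_{i\in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i\in I_j}h_i+\lambda\right)w_j^{2}\right]+\gamma T$$

Define:

$$I_j=\{i\mid q(x_i)=j\},\qquad G_j=\sum_{i\in I_j}g_i,\qquad H_j=\sum_{i\in I_j}h_i$$

where $I_j$ is the index set of the samples $x_i$ mapped to the jth leaf node; $G_j$, the sum of the first partial derivatives of the samples contained in leaf node j, is a constant; and $H_j$, the sum of the second partial derivatives of the samples contained in leaf node j, is a constant. Substituting these into the objective function yields:

$$Obj^{(t)}\simeq\sum_{j=1}^{T}\left[G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^{2}\right]+\gamma T$$

The optimal value is obtained by treating the objective as a quadratic function of a single variable. Given the objective function above, the objective for each leaf node j is:

$$f(w_j)=G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^{2}$$

This is a quadratic function of the single variable $w_j$, and $(H_j+\lambda)>0$, so $f(w_j)$ takes its minimum at

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

where the minimum value is:

$$f(w_j^{*})=-\frac{1}{2}\frac{G_j^{2}}{H_j+\lambda}$$

The tree structure is best when the objective value Obj is minimal; in this case $w_j^{*}$ is the optimal solution of the objective function. The simplified objective function is therefore:

$$Obj=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$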
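As a small numeric check of this derivation, the sketch below evaluates the standard closed forms $w_j^{*}=-G_j/(H_j+\lambda)$ and $Obj=-\frac{1}{2}\sum_j G_j^{2}/(H_j+\lambda)+\gamma T$ for a single leaf under the squared loss, for which $g_i=2(\hat{y}_i-y_i)$ and $h_i=2$. The function names, the yields and the values of λ and γ are illustrative assumptions.

```python
def squared_loss_grad_hess(y_true, y_pred):
    """First and second derivatives of (y_true - y_pred)**2 w.r.t. y_pred."""
    return 2.0 * (y_pred - y_true), 2.0

def optimal_leaf_weight(G, H, lam):
    """w* = -G / (H + lam), the minimiser of G*w + 0.5*(H + lam)*w**2."""
    return -G / (H + lam)

def structure_score(leaf_stats, lam, gamma):
    """Obj = -0.5 * sum_j G_j**2 / (H_j + lam) + gamma * T for a fixed structure."""
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T

# Toy leaf: three samples with made-up yields, current prediction 0 for each.
y_true = [40.0, 50.0, 60.0]
y_pred = [0.0, 0.0, 0.0]
pairs = [squared_loss_grad_hess(y, p) for y, p in zip(y_true, y_pred)]
G = sum(g for g, _ in pairs)  # -300.0
H = sum(h for _, h in pairs)  # 6.0
lam, gamma = 1.0, 0.0

w_star = optimal_leaf_weight(G, H, lam)       # 300 / 7, about 42.86
obj = structure_score([(G, H)], lam, gamma)   # -45000 / 7, about -6428.57
print(w_star, obj)
```

With λ = 0 the optimal weight would equal the mean yield of the leaf (50); the regularization term shrinks it toward zero.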

In a CART regression tree, the criterion for finding the best split point is the minimum mean squared error; in XGBoost, the criterion is the maximum gain, in which $\lambda$ and $\gamma$ come from the regularization term:

$$Gain=\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\right]-\gamma$$

where $I_L$ and $I_R$ are the index sets of the samples in the left and right leaf nodes after the split, $I=I_L\cup I_R$ is the index set of the samples in the node before the split, and $G_L, H_L$ (respectively $G_R, H_R$) are the sums of the first and second derivatives over $I_L$ (respectively $I_R$). Because the loss function in the XGBoost objective can be chosen freely as long as it is twice differentiable, and based on experiments, the invention selects the squared loss function:

$$l(y_i,\hat{y}_i)=(y_i-\hat{y}_i)^{2}$$
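The split criterion can be made concrete with a short exact-greedy scan over one feature, using the standard gain formula $\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\right]-\gamma$. This is a sketch under assumptions: the function names and data are invented, and the squared loss with a current prediction of 0 is assumed, so that $g_i=-2y_i$ and $h_i=2$.

```python
def leaf_score(G, H, lam):
    """G**2 / (H + lam), the building block of the gain formula."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 0.5 * [score(L) + score(R) - score(L + R)] - gamma."""
    return 0.5 * (leaf_score(G_L, H_L, lam)
                  + leaf_score(G_R, H_R, lam)
                  - leaf_score(G_L + G_R, H_L + H_R, lam)) - gamma

def best_split(feature_values, g, h, lam=1.0, gamma=0.0):
    """Scan all thresholds of one feature; return (best_gain, threshold)."""
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    G_total, H_total = sum(g), sum(h)
    G_L = H_L = 0.0
    best = (float("-inf"), None)
    for rank in range(len(order) - 1):
        i, nxt = order[rank], order[rank + 1]
        G_L += g[i]
        H_L += h[i]
        if feature_values[i] == feature_values[nxt]:
            continue  # no threshold can separate equal values
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, lam, gamma)
        if gain > best[0]:
            best = (gain, 0.5 * (feature_values[i] + feature_values[nxt]))
    return best

# Made-up data: squared loss with current prediction 0, so g = -2*y and h = 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 50.0, 52.0]
g = [-2.0 * yi for yi in y]
h = [2.0] * len(y)
gain, threshold = best_split(x, g, h)
print(threshold)  # 2.5, the split separating the low yields from the high ones
```

A split is only worth making when its gain is positive, which is how γ acts as a penalty on adding leaf nodes.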

Compared with other boosting algorithms, XGBoost gives more accurate predictions and is more efficient, for the following reasons. A regularization term is added to the objective function to prevent overfitting, and two additional techniques are used for the same purpose: shrinkage and column (feature) subsampling. XGBoost takes sparse training data into account and can specify a default branch direction for missing values or for a specified value, which greatly improves the efficiency and accuracy of the algorithm. When searching for the best feature split point, the traditional greedy method of enumerating every possible split point of every feature is too slow, so XGBoost implements an approximate algorithm: it proposes a number of candidate split points according to percentiles of the feature and then finds the best split among these candidates using the split formula. After the feature columns are sorted, they are stored in memory in blocks and can be reused across iterations; although boosting iterations must run serially, the feature columns can be processed in parallel. XGBoost also considers how to use the disk effectively when the data volume is large and memory is insufficient, mainly combining multithreading, data compression and sharding to make the algorithm as efficient as possible.

Simulation experiment:

The system of the invention is further demonstrated by simulation experiments, taking the Buchwald-Hartwig amination (reaction scheme shown in FIG. 1) as the example reaction from which the user data are drawn.

Small-sample data: 15 samples were assembled by taking the halide as the variable and fixing the additive, substrate and ligand at their first option. For the prediction analysis, the two-dimensional structure diagram of each reaction combination was drawn, converted into a three-dimensional structure and saved in sdf format; the data were then collected and divided into a test set and a training set. As shown in the table above, better prediction results can be obtained than with two-dimensional descriptors.

Claims

1. The XGboost-based chemical reaction yield intelligent prediction and analysis method in a small sample environment is characterized by comprising the following steps of: the method comprises the steps of data acquisition, intelligent prediction and result analysis of the three-dimensional descriptor;

2. The intelligent prediction and analysis method for chemical reaction yield according to claim 1, characterized in that: the data acquisition of the three-dimensional descriptor specifically comprises the following steps:

3. The intelligent predicting and analyzing method for chemical reaction yield according to claim 2, wherein: the intelligent prediction of the three-dimensional descriptor specifically comprises the following steps:

4. The XGboost-based intelligent prediction and analysis method for chemical reaction yield in a small sample environment according to claim 3, wherein: the result analysis of the three-dimensional descriptor specifically includes:

5. The intelligent prediction and analysis method for chemical reaction yield according to claim 2, wherein: when the two-dimensional structure diagrams are drawn in step (1.1), the reaction variables of the Buchwald-Hartwig amination (halide, ligand, substrate and additive) are enumerated in all combinations, with every reaction combination ordered as halide, ligand, substrate, additive; in Spartan, the halide is taken as the variable while the additive, substrate and ligand are each fixed at their first option, and a two-dimensional structure diagram is drawn for each resulting reaction combination in that order.

6. The intelligent prediction and analysis method for chemical reaction yield according to claim 3, characterized in that: in the intelligent prediction of the three-dimensional descriptor, the specific calculation process of the step (2.1) comprises the following steps:

$\hat{y}_i$ is the predicted value of the ith sample $x_i$;

$y_i$ is the true value;

the loss function in the XGBoost objective can be chosen freely as long as it is twice differentiable, and the objective function of the XGBoost algorithm selects the squared loss function:

and finally, returning a predicted value when the model reaches the minimum value according to the loss function, and judging the prediction effect of the model through the evaluation index.