CN113782109A - Reactant derivation method and reverse synthesis derivation method based on Monte Carlo tree - Google Patents
- Publication number
- CN113782109A (application number CN202111066691.1A)
- Authority
- CN
- China
- Prior art keywords
- template
- reverse reaction
- prediction model
- node
- reverse
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/10—Analysis or design of chemical reactions, syntheses or processes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Chemical & Material Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Analytical Chemistry (AREA)
- Biomedical Technology (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a reactant derivation method and a reverse synthesis derivation method based on a Monte Carlo tree. It belongs to the technical field of reverse synthesis and addresses the technical problem of improving the prediction accuracy of reverse synthesis analysis. The reverse synthesis derivation method comprises the following steps: extracting reverse reaction templates from the initial data set by the RDChiral method; cleaning the initial template library, obtaining a training set, a validation set and a test set, and constructing a reverse reaction template library from the non-repeated reverse reaction templates; constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, as the template prediction model; training the template prediction model; and constructing a Monte Carlo tree with the SMILES expression of the target compound as the root node, using the template prediction model within a Monte Carlo tree search to predict the reverse reaction template for each molecule in each node of the tree, and obtaining the precursor reactants corresponding to that reverse reaction template.
Description
Technical Field
The invention relates to the technical field of reverse synthesis, in particular to a reactant derivation method and a reverse synthesis derivation method based on a Monte Carlo tree.
Background
Retrosynthetic analysis is a method of planning the synthesis of a given compound, usually carried out by a chemist or a computer, in which the target is broken down step by step into intermediates or simpler reactants until commercially available building blocks are reached. Reverse synthesis analysis has traditionally been implemented by expert systems based on manually coded rules, whose range of application is limited and whose accuracy is low.
Based on the above analysis, how to improve the prediction accuracy of reverse synthesis analysis is the technical problem to be solved.
Disclosure of Invention
The technical task of the invention is to address the above shortcomings by providing a reactant derivation method and a reverse synthesis derivation method based on a Monte Carlo tree, so as to improve the prediction accuracy of reverse synthesis analysis.
In a first aspect, the invention provides a reactant derivation method based on a Monte Carlo tree, in which a Monte Carlo tree is constructed with the SMILES expression of the target compound as the root node, a template prediction model is used within a Monte Carlo tree search to predict the reverse reaction template for each molecule in each node of the tree, and the precursor reactants corresponding to that reverse reaction template are obtained, wherein the template prediction model is a neural network model that takes the product of a chemical equation as input and outputs a reverse reaction template; the method comprises the following steps:
in the selection stage, each iteration starts from the root node of the tree, calculates the UCB score of each node, and selects the leaf node with the highest UCB score as the leaf node to be expanded;
in the expansion stage, for each molecule of the leaf node to be expanded, the corresponding reverse reaction templates are predicted by the template prediction model, the precursor molecules corresponding to each reverse reaction template are obtained with RDKit, and leaf nodes are created;
in the simulation stage, selection and expansion are repeated starting from an unvisited leaf node until a stop condition is met and a termination node is reached, wherein the stop conditions comprise: all generated precursor molecules are present in a library of commercially available compounds; the maximum depth of the tree is reached; or the reverse reaction template fails to apply;
in the backtracking stage, the Q value and the N value of each node on the backtracking path are updated from bottom to top, starting from the leaf node to be expanded, until the root node is reached;
wherein, the calculation formula of the UCB score is as follows:
wherein N_{-1} represents the number of times the parent node of the current node has been traversed, and C represents a hyperparameter for balancing exploration and exploitation;
in calculating the UCB score of each node, Q represents the sum of the values accumulated in previous steps, and N represents the number of times the current child node has been traversed.
Preferably, the same template prediction model is used for predicting the corresponding reverse reaction template in the simulation stage and the expansion stage;
the template prediction model is obtained through the following steps:
acquiring a reaction equation to construct an initial data set, wherein the reaction equation comprises a reactant SMILES expression and a product SMILES expression;
extracting reverse reaction templates from the initial data set by the RDChiral method, hash-encoding the reaction equation and the reverse reaction template separately, and constructing an initial template library from the reaction equation hash code, the reactant SMILES expression, the product SMILES expression, the reverse reaction template and the reverse reaction template hash code;
after data cleaning of the initial template library, converting the reverse reaction template hash codes into label vectors and the product SMILES expressions into fingerprint vectors, dividing the data set on the basis of the fingerprint vectors and label vectors to obtain a training set, a validation set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and constructing a reverse reaction template library from the non-repeated reverse reaction templates;
constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
and training the template prediction model with the training set as input to optimize its parameters, monitoring the training process with the validation set as input to prevent overfitting and obtain a trained template prediction model, and testing the trained template prediction model on the test set to obtain the final template prediction model.
Preferably, the data cleaning of the initial template library comprises the following steps:
removing sample data whose product SMILES expression in the initial template library contains more than one product;
applying the reverse reaction template to the corresponding product with RDChiral, and removing the sample data if the application fails;
removing sample data whose reverse reaction template occurs fewer times than a threshold in the initial template library;
and removing duplicate sample data from the initial template library according to the reaction equation hash code.
Preferably, obtaining the precursor molecules corresponding to each reverse reaction template with RDKit and creating leaf nodes comprises the following steps:
retaining a predetermined number of the reverse reaction templates obtained for each molecule;
storing the reverse reaction templates of all molecules in the parent node, and initializing the associated Q values and N values.
Preferably, updating the Q value and the N value of each node on the backtracking path from bottom to top, from the leaf node to be expanded to the root node, comprises the following steps:
obtaining the value evaluation value of the termination node according to the value updating function;
adding the value evaluation value to the Q value of each node on the backtracking path, and adding 1 to its N value;
the calculation formula of the value updating function is as follows:
wherein Reward is the value evaluation value, N_in_stock is the number of compounds that are available, N is the number of compounds in the termination node, and transformations is the number of transformations separating each compound from the target compound.
In a second aspect, the invention provides a reverse synthesis derivation method based on a Monte Carlo tree, comprising the following steps:
acquiring a reaction equation to construct an initial data set, wherein the reaction equation comprises a reactant SMILES expression and a product SMILES expression;
extracting reverse reaction templates from the initial data set by the RDChiral method, hash-encoding the reaction equation and the reverse reaction template separately, and constructing an initial template library from the reaction equation hash code, the reactant SMILES expression, the product SMILES expression, the reverse reaction template and the reverse reaction template hash code;
after data cleaning of the initial template library, converting the reverse reaction template hash codes into label vectors and the product SMILES expressions into fingerprint vectors, dividing the data set on the basis of the fingerprint vectors and label vectors to obtain a training set, a validation set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and constructing a reverse reaction template library from the non-repeated reverse reaction templates;
constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
training the template prediction model with the training set as input to optimize its parameters, monitoring the training process with the validation set as input to prevent overfitting and obtain a trained template prediction model, and testing the trained template prediction model on the test set to obtain the final template prediction model;
and, according to the reactant derivation method based on the Monte Carlo tree of the first aspect, constructing a Monte Carlo tree with the SMILES expression of the target compound as the root node, using the template prediction model within a Monte Carlo tree search to predict the reverse reaction template for each molecule in each node of the tree, and obtaining the precursor reactants corresponding to that reverse reaction template.
Preferably, before reverse reaction templates are extracted from the initial data set by the RDChiral method, atom mapping is performed on the reaction equations with RXNMapper, and the atoms of the reactants and products are labeled with atom-mapping numbers.
Preferably, the data cleaning of the initial template library comprises the following steps:
removing sample data whose product SMILES expression in the initial template library contains more than one product;
applying the reverse reaction template to the corresponding product with RDChiral, and removing the sample data if the application fails;
removing sample data whose reverse reaction template occurs fewer times than a threshold in the initial template library;
and removing duplicate sample data from the initial template library according to the reaction equation hash code.
Preferably, the reverse reaction template is converted into a label vector by the LabelBinarizer label binarization method of the scikit-learn library, and the product SMILES expression is converted into a fingerprint vector by the Morgan algorithm of RDKit.
Preferably, the template prediction model includes:
an input layer whose number of neurons equals the length of the fingerprint vector;
a hidden layer configured with the ELU activation function;
and an output layer whose number of neurons equals the number of non-repeated reverse reaction templates and which is configured with the Softmax activation function.
The reactant derivation method and the reverse synthesis derivation method based on the Monte Carlo tree have the following advantages:
1. a Monte Carlo tree is constructed with the molecule in the SMILES expression of the target compound as the root node; within a Monte Carlo tree search, the template prediction model predicts, from the product, the reverse reaction template for each molecule in each node of the tree, and the precursor reactants corresponding to that reverse reaction template are obtained;
2. reverse reaction templates are obtained by the RDChiral method, and an initial template library is constructed from the reaction equation hash codes, reactant SMILES expressions, product SMILES expressions, reverse reaction templates and reverse reaction template hash codes; after cleaning, the initial template library is divided into a training set, a validation set and a test set, and a reverse reaction template library is constructed from the non-repeated reverse reaction templates; a fully connected neural network model with a single hidden layer is constructed as the template prediction model based on the Keras deep learning framework and is trained with the training set, validation set and test set to obtain the final template prediction model; the template prediction model predicts reverse reaction templates with the product as input, so that reverse reaction templates are obtained quickly and efficiently, and, combined with Monte Carlo tree search, reverse synthesis derivation can be carried out quickly, efficiently and accurately;
3. before the template prediction model is trained, the initial template library is cleaned to remove invalid and duplicate sample data, which improves the validity of the samples, the accuracy of the template prediction model and the operating efficiency;
4. the product SMILES expression is converted into a high-dimensional fingerprint vector, which helps the neural network model learn the chemical structure features in the SMILES expression and build a mapping between the reaction-center features in the product SMILES expression and the corresponding reverse reaction template;
5. a simple fully connected neural network with a single hidden layer is adopted as the template prediction model, which reduces overfitting to reverse reaction templates that occur rarely during training and improves the generalization ability of the template prediction model;
6. the reverse synthesis derivation method is capable of continual learning: after the reaction equation data set is updated, the template prediction model can be retrained to learn new chemical reaction knowledge, and the reverse synthesis routes are updated accordingly.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow diagram of the reactant derivation method based on a Monte Carlo tree in Example 1;
FIG. 2 is a flow diagram of the reverse synthesis derivation method based on a Monte Carlo tree in Example 2.
Detailed Description
The present invention is further described in the following with reference to the drawings and the specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not to be construed as limiting the present invention, and the embodiments and the technical features of the embodiments can be combined with each other without conflict.
The embodiment of the invention provides a reactant derivation method and a reverse synthesis derivation method based on a Monte Carlo tree, which are used for solving the technical problem of how to improve the prediction accuracy of reverse synthesis analysis.
Example 1:
The reactant derivation method based on a Monte Carlo tree of the invention constructs a Monte Carlo tree with the SMILES expression of the target compound as the root node, uses a template prediction model within a Monte Carlo tree search to predict the reverse reaction template for each molecule in each node of the tree, and obtains the precursor reactants corresponding to that reverse reaction template. The template prediction model is a neural network model that takes the product of a chemical equation as input and outputs a reverse reaction template.
In this embodiment, the method includes the steps of:
S100, in the selection stage, each iteration starts from the root node of the tree, calculates the UCB score of each node, and selects the leaf node with the highest UCB score as the leaf node to be expanded;
S200, in the expansion stage, for each molecule of the leaf node to be expanded, the corresponding reverse reaction templates are predicted by the template prediction model, the precursor molecules corresponding to each reverse reaction template are obtained with RDKit, and leaf nodes are created;
S300, in the simulation stage, selection and expansion are repeated starting from an unvisited leaf node until a stop condition is met and a termination node is reached, wherein the stop conditions comprise: all generated precursor molecules are present in a library of commercially available compounds; the maximum depth of the tree is reached; or the reverse reaction template fails to apply;
S400, in the backtracking stage, the Q value and the N value of each node on the backtracking path are updated from bottom to top, starting from the leaf node to be expanded, until the root node is reached, wherein Q represents the sum of the values accumulated in previous steps and N represents the number of times the current child node has been traversed.
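The three stop conditions of the simulation stage can be sketched as a single predicate. This is a minimal illustration, not the patented implementation: the `SearchNode` class, the `is_terminal` function, the `template_failed` flag and the tiny `stock` set are all hypothetical names and data introduced here.

```python
from dataclasses import dataclass


@dataclass
class SearchNode:
    """Hypothetical tree node: the molecules still to be resolved and its depth."""
    molecules: list  # SMILES strings held by this node
    depth: int = 0   # number of reverse reaction steps from the root


def is_terminal(node, stock, max_depth, template_failed=False):
    """Return True if any of the three stop conditions holds."""
    # Condition 1: every precursor molecule is commercially available.
    all_in_stock = all(smiles in stock for smiles in node.molecules)
    # Condition 2: the maximum tree depth has been reached.
    too_deep = node.depth >= max_depth
    # Condition 3: the reverse reaction template could not be applied.
    return all_in_stock or too_deep or template_failed


# Assumed example data: a two-compound stock library.
stock = {"CCO", "CC(=O)O"}
purchasable_leaf = SearchNode(molecules=["CCO", "CC(=O)O"], depth=2)
open_leaf = SearchNode(molecules=["CCOC(C)=O"], depth=1)
```

Here `is_terminal(purchasable_leaf, stock, max_depth=6)` holds because both molecules are in stock, while `open_leaf` would be expanded further.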
The template prediction model in the embodiment is obtained through the following steps:
(1) acquiring a reaction equation to construct an initial data set, wherein the reaction equation comprises a reactant SMILES expression and a product SMILES expression;
(2) extracting reverse reaction templates from the initial data set by the RDChiral method, hash-encoding the reaction equation and the reverse reaction template separately, and constructing an initial template library from the reaction equation hash code, the reactant SMILES expression, the product SMILES expression, the reverse reaction template and the reverse reaction template hash code;
(3) after data cleaning of the initial template library, converting the reverse reaction template hash codes into label vectors and the product SMILES expressions into fingerprint vectors, dividing the data set on the basis of the fingerprint vectors and label vectors to obtain a training set, a validation set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and constructing a reverse reaction template library from the non-repeated reverse reaction templates;
(4) constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
(5) training the template prediction model with the training set as input to optimize its parameters, monitoring the training process with the validation set as input to prevent overfitting and obtain a trained template prediction model, and testing the trained template prediction model on the test set to obtain the final template prediction model.
Before reverse reaction templates are extracted from the initial data set by the RDChiral method in step (2), atom mapping is performed on the reaction equations with RXNMapper, and the atoms of the reactants and products are labeled with atom-mapping numbers so that the reverse reaction templates can be extracted.
After the reverse reaction template of a reaction equation has been extracted with RDChiral, the reaction equation and the reverse reaction template are each hash-encoded, giving two new fields: the reaction equation hash code and the reverse reaction template hash code. The reaction equation hash code is used to eliminate duplicate sample data in the initial template library. The reverse reaction template hash code is used, on the one hand, to eliminate sample data whose reverse reaction template occurs too few times in the initial template library and, on the other hand, is converted into a label vector that serves as the training target.
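The two hash fields can be illustrated with a short sketch. The text does not name a hash function, so SHA-256 from the Python standard library is an assumption here, as are the `hash_code` helper, the field names of `record`, and the template string, which is a hypothetical placeholder rather than an RDChiral output.

```python
import hashlib


def hash_code(text: str) -> str:
    """Return a deterministic hexadecimal hash code for a text field."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Illustrative strings only; in the method the template comes from RDChiral.
reactants = "CC(=O)O.CCO"
product = "CCOC(C)=O"
template = "[C:1]-[O:2]>>[C:1].[O:2]"  # hypothetical placeholder template

# One row of the initial template library with the five fields named in the text.
record = {
    "reaction_hash": hash_code(f"{reactants}>>{product}"),
    "reactants": reactants,
    "product": product,
    "template": template,
    "template_hash": hash_code(template),
}
```

Because identical strings always give identical hash codes, duplicate reactions and repeated templates can later be found by comparing these two fields.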
In step (3), data cleaning is first carried out on the initial template library to obtain an initial template library without abnormal sample data, specifically:
(3-1) removing sample data whose product SMILES field in the initial template library contains more than one product;
(3-2) applying the reverse reaction template to the corresponding product with RDChiral, and rejecting the sample data if the application fails;
(3-3) obtaining an initial template library without abnormal sample data.
The initial template library without abnormal sample data is then preprocessed further, comprising the following steps:
(3-4) removing sample data whose reverse reaction template occurs too few times in the initial template library (to ensure the generalization ability of the trained model, the threshold is generally set to 3);
(3-5) removing duplicate sample data from the initial template library according to the reaction equation hash code.
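Steps (3-1) to (3-5) can be sketched as one filtering function. This is an illustration under stated assumptions: records are plain dictionaries with hypothetical keys `product`, `template`, `template_hash` and `reaction_hash`, and the RDChiral validity check is passed in as a callable (`template_is_valid`) rather than calling the library directly. A "." in a SMILES string separates disconnected components, so it is used to detect more than one product.

```python
from collections import Counter


def clean_template_library(records, template_is_valid, min_count=3):
    """Apply the cleaning steps in the order given in the text."""
    # (3-1) keep only samples with a single product.
    kept = [r for r in records if "." not in r["product"]]
    # (3-2) keep only samples whose template applies to its product.
    kept = [r for r in kept if template_is_valid(r["template"], r["product"])]
    # (3-4) drop templates that occur fewer than min_count times.
    counts = Counter(r["template_hash"] for r in kept)
    kept = [r for r in kept if counts[r["template_hash"]] >= min_count]
    # (3-5) drop duplicate reactions, keeping the first occurrence.
    seen, unique = set(), []
    for r in kept:
        if r["reaction_hash"] not in seen:
            seen.add(r["reaction_hash"])
            unique.append(r)
    return unique


def make(reaction_hash, template_hash, product):
    """Build a toy record; the template string is a placeholder."""
    return {"reaction_hash": reaction_hash, "template_hash": template_hash,
            "template": template_hash, "product": product}


toy_library = [
    make("r1", "t1", "CCO"), make("r2", "t1", "CCN"), make("r3", "t1", "CCC"),
    make("r3", "t1", "CCC"),   # duplicate reaction
    make("r4", "t2", "CCCl"),  # template seen only once
    make("r5", "t1", "CC.O"),  # more than one product
]
cleaned = clean_template_library(toy_library, lambda template, product: True)
```

In the toy library, the multi-product sample, the rare template and the duplicate reaction are removed, leaving the reactions `r1`, `r2` and `r3`.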
After this further preprocessing, the reverse reaction template hash codes are converted into label vectors by the LabelBinarizer label binarization method of the scikit-learn library, and the label vectors form a label vector sample set. The product SMILES expressions are converted with the Morgan algorithm of RDKit into ECFP fingerprints with a radius of 2 and a length of 2048, and the fingerprint vectors form a fingerprint vector sample set. The label vector sample set and the fingerprint vector sample set are combined into a sample data set; because the random seed is fixed, the label vectors and fingerprint vectors in the sample data set remain in one-to-one correspondence. The sample data set is then divided into a training set, a validation set and a test set in proportions of 90%, 5% and 5%. A reverse reaction template library is constructed from the non-repeated reverse reaction templates.
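The 90%/5%/5% division with a fixed random seed can be sketched with the standard library alone. The `split_dataset` function, the seed value 42 and the integer stand-ins for fingerprint and label vectors are assumptions made for illustration; the point shown is that shuffling one shared index list keeps each fingerprint paired with its label.

```python
import random


def split_dataset(fingerprints, labels, seed=42, percents=(90, 5, 5)):
    """Shuffle with a fixed seed and split into train, validation and test parts."""
    if len(fingerprints) != len(labels):
        raise ValueError("fingerprints and labels must have the same length")
    indices = list(range(len(fingerprints)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible order
    n_train = len(indices) * percents[0] // 100
    n_val = len(indices) * percents[1] // 100
    parts = (indices[:n_train],
             indices[n_train:n_train + n_val],
             indices[n_train + n_val:])
    # The same index selects both vectors, so pairs stay aligned.
    return [([fingerprints[i] for i in part], [labels[i] for i in part])
            for part in parts]


# Stand-in data: sample i has "fingerprint" i and "label" i.
train, validation, test = split_dataset(list(range(100)), list(range(100)))
```

With 100 samples this yields 90, 5 and 5 samples, and calling the function again with the same seed returns the same division.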
In this embodiment, the template prediction model is a fully connected neural network model with a single hidden layer, constructed with the Keras deep learning framework. The number of neurons in the input layer is set to the length of the fingerprint vector, and the number of neurons in the output layer is set to the number of non-repeated reverse reaction templates in the training set. The activation functions of the hidden layer and the output layer are ELU and Softmax, respectively, and the loss function is the cross-entropy loss. The template prediction model is then trained to obtain the final template prediction model. During training, Dropout and L2 regularization are used to prevent overfitting, the hidden layer has 512 neurons, the Adam optimizer is used, and the initial learning rate is 0.001. The neural network model is trained on the training set with these hyperparameter settings. The validation set is used to monitor the training of the model and prevent overfitting. The test set is used to verify the generalization ability of the model after training is complete.
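To make the architecture concrete, the forward pass of such a network (input, one ELU hidden layer, Softmax output) can be written without any framework. This is not the Keras model of the embodiment: the layer sizes are shrunk from 2048 and 512 to toy values, the weights are random rather than trained, and the function names are hypothetical.

```python
import math
import random


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, smooth saturation below zero."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(values):
    """Numerically stable softmax that returns probabilities summing to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, bias):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def predict_template_probabilities(fingerprint, w1, b1, w2, b2):
    """Fingerprint -> ELU hidden layer -> Softmax over reverse reaction templates."""
    hidden = [elu(h) for h in dense(fingerprint, w1, b1)]
    return softmax(dense(hidden, w2, b2))


# Toy sizes and untrained random weights, for illustration only.
rng = random.Random(0)
n_bits, n_hidden, n_templates = 16, 8, 4
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_bits)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_templates)]
b2 = [0.0] * n_templates
fingerprint = [rng.randint(0, 1) for _ in range(n_bits)]
probabilities = predict_template_probabilities(fingerprint, w1, b1, w2, b2)
```

The output holds one probability per reverse reaction template, which is what the expansion stage ranks when it chooses templates to apply.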
In step S100, the selection stage selects, as the leaf node to be expanded, the leaf node that is most promising for further development. Each iteration starts from the root node of the tree, calculates the UCB score of each node, and selects the leaf node with the highest UCB score for further expansion. The calculation formula of the UCB score is as follows:
wherein Q represents the sum of the values accumulated in previous steps, N represents the number of times the current child node has been traversed, N_{-1} represents the number of times the parent node of the current node has been traversed, and C represents a hyperparameter for balancing exploration and exploitation, with a default value of 1.4.
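The UCB formula itself is not reproduced in this text. As a sketch only, the code below assumes the common UCB1 form, Q/N + C * sqrt(ln(N_{-1}) / N), built from the symbols defined above; the formula in the original filing may differ (some implementations, for example, place a factor of 2 inside the square root). The function names and the dictionary layout of the children are hypothetical.

```python
import math


def ucb_score(q, n, n_parent, c=1.4):
    """Assumed UCB1 form: mean value plus an exploration bonus.

    q: accumulated value Q, n: visit count N of the child (at least 1),
    n_parent: visit count N_{-1} of the parent (at least 1),
    c: exploration/exploitation constant C.
    """
    return q / n + c * math.sqrt(math.log(n_parent) / n)


def select_child(children, n_parent, c=1.4):
    """Return the child with the highest UCB score."""
    return max(children, key=lambda ch: ucb_score(ch["Q"], ch["N"], n_parent, c))


# Assumed example: a rarely visited child against a well explored one.
children = [
    {"name": "a", "Q": 0.5, "N": 1},
    {"name": "b", "Q": 3.0, "N": 10},
]
best = select_child(children, n_parent=11)
```

In this example child `a` is selected even though its mean value is only slightly higher, because its low visit count earns a large exploration bonus; a smaller C would shift the balance toward exploitation.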
In step S200, in the expansion stage, for each molecule of the selected leaf node to be expanded, precursor molecules are generated from the reaction templates given by the template prediction model and leaf nodes are created, specifically as follows:
(1) obtaining a series of reverse reaction templates for each molecule of the leaf node to be expanded from the template prediction model;
(2) obtaining the precursor molecules corresponding to each reverse reaction template with RDKit and creating leaf nodes.
Specifically, of all the templates obtained for each molecule, only the 50 highest-scoring templates are retained, or templates are retained until their cumulative probability reaches 0.995. The reaction templates of all molecules are stored in the parent node, and the associated Q and N values are initialized to 0.5 and 1, respectively.
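The template cutoff and the initialization of Q and N can be sketched as follows. The text states the two limits (50 templates, cumulative probability 0.995) without saying how they combine; this sketch assumes that whichever limit is reached first stops the selection. The function names and the dictionary layout are hypothetical.

```python
def filter_templates(probabilities, max_templates=50, cumulative_cutoff=0.995):
    """Return indices of retained templates, most probable first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, probability in ranked:
        # Assumption: stop at whichever limit is reached first.
        if len(kept) >= max_templates or cumulative >= cumulative_cutoff:
            break
        kept.append(index)
        cumulative += probability
    return kept


def init_template_stats(template_indices, q_init=0.5, n_init=1):
    """Create the per-template Q and N values stored in the parent node."""
    return [{"template": i, "Q": q_init, "N": n_init} for i in template_indices]


# Assumed model output for one molecule over four templates.
model_output = [0.05, 0.7, 0.249, 0.001]
kept = filter_templates(model_output)
stats = init_template_stats(kept)
```

For this toy output the three most probable templates are kept, since together they exceed the 0.995 cutoff, and each starts with Q = 0.5 and N = 1.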
In step S300, selection and expansion are repeated starting from an unvisited leaf node until a stop condition is met and a termination node is reached. In these operations, the same template prediction model is used to predict the reverse reaction templates in the simulation stage and the expansion stage.
In step S400, in the backtracking stage, the Q value and the N value of each node on the backtracking path are updated from bottom to top, starting from the leaf node to be expanded, until the root node is reached, including the following steps:
(1) obtaining a value evaluation value of the termination node according to the value updating function;
(2) adding the value evaluation value once to the Q value of each node on the backtracking path, and adding 1 to the N value.
The calculation formula of the value updating function is as follows:
wherein Reward is the value evaluation value, N_{in_stock} is the number of available compounds, N is the number of compounds in the termination node, and transformations is the number of transformations separating each compound from the target compound.
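The update of step (2) can be sketched as below. Because the value updating function is not reproduced in this text, the reward is passed in as an already computed number; the `Node` class and its attribute names are illustrative assumptions.

```python
class Node:
    """Tree node holding the Q and N statistics (names are illustrative)."""

    def __init__(self, parent=None, q=0.5, n=1):
        self.parent = parent
        self.q = q  # accumulated value, initialized to 0.5
        self.n = n  # visit count, initialized to 1

def backpropagate(node, reward):
    """Add the reward to Q and 1 to N on every node from `node` up to the root."""
    while node is not None:
        node.q += reward
        node.n += 1
        node = node.parent
```

A usage example: after `backpropagate(leaf, 0.8)` on a three-level path, every node on the path has its visit count raised by one and its Q value raised by 0.8.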
In a specific implementation, the template prediction model can be constructed and trained according to the prior art: the product is used as input, and the reverse reaction template is predicted and output by the template prediction model. The construction and training of the template prediction model is optional in this embodiment.
Example 2:
The invention relates to a reverse synthesis derivation method based on a Monte Carlo tree, which comprises the following steps:
S100, acquiring reaction equations to construct an initial data set, wherein each reaction equation comprises a reactant SMILES expression and a product SMILES expression;
S200, extracting reverse reaction templates from the initial data set by the RdChiral method, hash-coding the reaction equations and the reverse reaction templates respectively, and constructing an initial template library based on the hash code of the reaction equation, the reactant SMILES expression, the product SMILES expression, the reverse reaction template and the hash code of the reverse reaction template;
S300, after data cleaning of the initial template library, converting the hash codes of the reverse reaction templates into label vectors and the product SMILES expressions into fingerprint vectors, dividing the data set based on the fingerprint vectors and the label vectors to obtain a training set, a verification set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and constructing a reverse reaction template library from the unique reverse reaction templates;
S400, constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, to serve as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
S500, training the template prediction model with the training set as input to optimize its parameters, monitoring the training process with the verification set as input to prevent overfitting, thereby obtaining a trained template prediction model, and testing the trained template prediction model on the test set to obtain the final template prediction model;
S600, constructing a Monte Carlo tree with the SMILES expression of the target compound as the root node by the Monte Carlo tree-based reactant derivation method disclosed in Example 1, predicting, based on the Monte Carlo tree search method, the reverse reaction template corresponding to the molecules in each node of the Monte Carlo tree with the template prediction model, and obtaining the previous-stage reactants corresponding to the reverse reaction template.
In step S200, before reverse reaction templates are extracted from the initial data set by the RdChiral method, atom mapping is performed on the reaction equations with RXNMapper, labeling the atoms of the reactants and products with atom-map numbers so that the reverse reaction templates can then be extracted.
After the reverse reaction template of each reaction equation has been extracted with RdChiral, the reaction equation and the reverse reaction template are hash-coded separately, giving two new fields: the hash code of the reaction equation and the hash code of the reverse reaction template. The hash code of the reaction equation is used to eliminate duplicate sample data in the initial template library. The hash code of the reverse reaction template is used, on the one hand, to eliminate sample data whose reverse reaction template occurs too few times in the initial template library and, on the other hand, is converted into a label vector that serves as the training target.
The result is an initial template library consisting of five fields: reaction equation hash code, reactant SMILES, product SMILES, reverse reaction template, and reverse reaction template hash code.
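A five-field record of this kind can be sketched as follows. The text does not name the hash function, so SHA-256 from the Python standard library is used here purely as an assumption; the field names, the `make_record` helper and the placeholder template string are likewise hypothetical.

```python
import hashlib

def hash_code(text):
    """Stable hash of a string; SHA-256 is an assumption, not the patent's choice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_record(reactants, product, retro_template):
    """Build one five-field record of the initial template library."""
    reaction = f"{reactants}>>{product}"  # reaction SMILES: reactants >> product
    return {
        "reaction_hash": hash_code(reaction),
        "reactants": reactants,
        "product": product,
        "retro_template": retro_template,
        "template_hash": hash_code(retro_template),
    }

# Example: esterification of acetic acid with ethanol; "T1" stands in for a
# real extracted template string.
record = make_record("CC(=O)O.OCC", "CC(=O)OCC", "T1")
```

Identical reactions always yield the same `reaction_hash`, which is what makes the later duplicate removal a simple set lookup.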
In step S300, data cleaning is first performed on the initial template library to obtain an initial template library free of abnormal sample data, specifically:
(1) removing sample data whose product SMILES field in the initial template library contains more than one product;
(2) applying the reverse reaction template to the corresponding product with RdChiral and, if the application fails, removing the corresponding sample data;
(3) obtaining an initial template library free of abnormal sample data.
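The cleaning steps above can be sketched as a single filter. The sketch assumes that several products show up as `.`-separated components in the product SMILES, and it takes the template check as a caller-supplied function `apply_template`, a hypothetical wrapper around RdChiral that returns an empty list when the template does not act on the product.

```python
def clean_library(records, apply_template):
    """Drop records with multi-component products or templates that fail to apply.

    records        -- list of dicts with "product" and "retro_template" keys
    apply_template -- callable(template, product) returning a list of outcomes;
                      a hypothetical wrapper around RdChiral, supplied by the caller
    """
    cleaned = []
    for rec in records:
        # A '.' in SMILES separates disconnected components, i.e. several products.
        if "." in rec["product"]:
            continue
        # An empty result means the template does not act on its own product.
        if not apply_template(rec["retro_template"], rec["product"]):
            continue
        cleaned.append(rec)
    return cleaned
```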
Then, further data preprocessing is performed on the initial template library free of abnormal sample data, comprising the following steps:
(1) removing sample data corresponding to reverse reaction templates that occur too few times in the initial template library (to ensure the generalization ability of the trained model, the threshold is generally set to 3);
(2) removing duplicate sample data from the initial template library according to the hash code of the reaction equation.
After this preprocessing, the reverse reaction template is converted into a label vector with the LabelBinarizer label binarization method of the scikit-learn library, and the product SMILES is converted with the Morgan algorithm of RDKit into an ECFP fingerprint with radius 2 and length 2048. The fingerprint vectors form a fingerprint vector sample set, which is combined with the label vectors into a sample data set; because the random seed is fixed, the label vectors and fingerprint vectors in the sample data set remain in one-to-one correspondence. The sample data set is then divided into a training set, a verification set and a test set in proportions of 90%, 5% and 5%. A reverse reaction template library is constructed from the unique reverse reaction templates.
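The filtering and splitting described above can be sketched with the standard library alone. The fingerprint and label conversion (RDKit Morgan fingerprints, scikit-learn LabelBinarizer) is not shown; samples are assumed to be already paired, and the record keys, the function names and the seed value 42 are illustrative assumptions.

```python
import random
from collections import Counter

def preprocess(records, min_count=3):
    """Drop rare templates, then drop duplicate reactions (by reaction hash)."""
    counts = Counter(rec["template_hash"] for rec in records)
    seen, kept = set(), []
    for rec in records:
        if counts[rec["template_hash"]] < min_count:
            continue  # template occurs fewer times than the threshold
        if rec["reaction_hash"] in seen:
            continue  # same reaction equation already kept
        seen.add(rec["reaction_hash"])
        kept.append(rec)
    return kept

def split_dataset(samples, seed=42, fractions=(0.90, 0.05, 0.05)):
    """Shuffle with a fixed seed and split into training/verification/test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Keeping each fingerprint and its label together in one tuple before shuffling is one simple way to preserve the one-to-one correspondence.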
In step S400, a fully connected neural network model with a single hidden layer is constructed based on the Keras deep learning framework. In this embodiment, the number of neurons in the input layer is set to the length of the fingerprint vector, the number of neurons in the output layer is set to the number of unique reverse reaction templates in the training set, the activation functions of the hidden layer and the output layer are set to ELU and Softmax respectively, and the loss function is the cross-entropy loss. The template prediction model is then trained in step S500 to obtain the final template prediction model.
In the training of step S500, Dropout and L2 regularization are used to prevent overfitting, the number of hidden-layer neurons is set to 512, and the Adam optimizer is used with an initial learning rate of 0.001. The neural network model is trained on the training set with these hyper-parameter settings. The verification set is used to monitor the training of the model and prevent overfitting. The test set is used to verify the generalization ability of the model after training is completed.
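In equation form, the model of steps S400 and S500 can be written as below, where x is the 2048-bit fingerprint vector, K is the number of unique reverse reaction templates and y is the one-hot label vector. The symbol names, the ELU parameter α = 1, the regularization coefficient λ and the choice of which weights are regularized are notational assumptions; Dropout is omitted.

```latex
\begin{aligned}
h &= \mathrm{ELU}(W_1 x + b_1), & W_1 &\in \mathbb{R}^{512 \times 2048},\\
p &= \mathrm{softmax}(W_2 h + b_2), & W_2 &\in \mathbb{R}^{K \times 512},\\
\mathcal{L} &= -\sum_{k=1}^{K} y_k \log p_k
  + \lambda \left( \lVert W_1 \rVert_F^2 + \lVert W_2 \rVert_F^2 \right),
\end{aligned}
\qquad
\mathrm{ELU}(z) =
\begin{cases}
z, & z > 0,\\
e^{z} - 1, & z \le 0,
\end{cases}
\qquad
\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.
```

The output p is a probability distribution over the K templates, which is what allows the expansion stage to rank templates and apply a cumulative-probability cutoff.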
After the final template prediction model is obtained, step S600 is executed: the target product is taken as input, the corresponding reverse reaction template is predicted and output by the final template prediction model, a Monte Carlo tree is constructed based on the product SMILES expression, and the preceding-stage reactants corresponding to the reverse reaction template are obtained through Monte Carlo tree search.
By using a deep learning algorithm together with the Monte Carlo tree search algorithm for reverse synthesis analysis, the method can effectively improve prediction accuracy and broaden the field of application.
While the invention has been shown and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the embodiments disclosed. It will be apparent to those skilled in the art that the technical means of the embodiments described above may be combined in various ways to obtain further embodiments of the invention, which also fall within its scope of protection.
Claims (10)
1. A reactant derivation method based on a Monte Carlo tree, characterized in that a Monte Carlo tree is constructed with the SMILES expression of a target compound as the root node, a template prediction model is used, based on the Monte Carlo tree search method, to predict the reverse reaction template corresponding to the molecules in each node of the Monte Carlo tree, and the preceding-stage reactants corresponding to the reverse reaction template are obtained, wherein the template prediction model is a neural network model that takes the product of a chemical equation as input and the reverse reaction template as output;
the method comprises the following steps:
in the selection stage, each iteration starts from the root node of the tree, the UCB score of each node is calculated, and the leaf node with the highest UCB score is selected as the leaf node to be expanded;
in the expansion stage, for each molecule of the leaf node to be expanded, predicting the corresponding reverse reaction templates through the template prediction model, obtaining the precursor molecule corresponding to each reverse reaction template based on RDKit, and creating a leaf node;
in the simulation stage, selection and expansion are performed repeatedly, starting from leaf nodes that have not been visited, until a stop condition is met and a termination node is reached, wherein the stop conditions comprise: all generated precursor molecules being present in a library of commercially available compounds, the maximum depth of the tree being reached, and the reverse reaction template being ineffective;
in the backtracking stage, updating the Q value and the N value of each node on the backtracking path from bottom to top, starting from the leaf node to be expanded, until the root node is reached;
wherein the calculation formula of the UCB score is as follows:
wherein N_{-1} represents the number of times the parent node of the current node has been traversed, and C represents a hyper-parameter for balancing exploration and exploitation;
Q represents the sum of the values from previous steps, and N represents the number of times the current child node has been traversed.
2. The Monte Carlo tree-based reactant derivation method according to claim 1, wherein the same template prediction model is used to predict the corresponding reverse reaction templates in the simulation stage and the expansion stage;
the template prediction model is obtained through the following steps:
acquiring a reaction equation to construct an initial data set, wherein the reaction equation comprises a reactant SMILES expression and a product SMILES expression;
extracting a reverse reaction template from the initial data set by an RdChiral method, performing Hash coding on a reaction equation and the reverse reaction template respectively, and constructing an initial template library based on the Hash coding of the reaction equation, a reactant SMILES expression, a product SMILES expression, the reverse reaction template and the Hash coding of the reverse reaction template;
after data cleaning is carried out on the initial template library, the hash code of the reverse reaction template is converted into a label vector, the product SMILES expression is converted into a fingerprint vector, and data set division is carried out on the basis of the fingerprint vectors and the label vectors to obtain a training set, a verification set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and a reverse reaction template library is constructed on the basis of the unique reverse reaction templates;
constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, to serve as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
and training the template prediction model by taking the training set as input to optimize parameters of the template prediction model, monitoring the training process of the template prediction model by taking the verification set as input to prevent overfitting to obtain a trained template prediction model, and testing the trained template prediction model based on the test set to obtain a final template prediction model.
3. The Monte Carlo tree-based reactant derivation method according to claim 2, wherein the data cleaning of the initial template library comprises the following steps:
removing sample data of which the product quantity is more than 1 in the product SMILES expression in the initial template library;
acting the reverse reaction template on the corresponding product based on the RdChiral, and if the action is invalid, removing corresponding sample data;
removing sample data corresponding to reverse reaction templates with the occurrence frequency less than a threshold value in the initial template library;
and removing repeated sample data in the initial template library according to the Hash code of the reaction equation.
4. The reactant derivation method according to any one of claims 1-3, wherein obtaining the precursor molecule corresponding to each reverse reaction template based on RDKit and creating a leaf node comprises the following steps:
retaining a predetermined number of the reverse reaction templates obtained for each molecule;
storing the reverse reaction templates of all molecules in the parent node, and initializing the associated Q values and N values respectively.
5. The method of any one of claims 1-3, wherein the Q and N values of each node in the backtracking path are updated from the leaf node to be expanded from bottom to top until the root node is reached, comprising the steps of:
obtaining a value evaluation value of the termination node according to the value updating function;
accumulating the value evaluation value once for the Q value of each node on the backtracking path, and adding 1 to the N value;
the calculation formula of the value updating function is as follows:
wherein Reward is the value evaluation value, N_{in_stock} is the number of available compounds, N is the number of compounds in the termination node, and transformations is the number of transformations separating each compound from the target compound.
6. The reverse synthesis derivation method based on the Monte Carlo tree is characterized by comprising the following steps:
acquiring a reaction equation to construct an initial data set, wherein the reaction equation comprises a reactant SMILES expression and a product SMILES expression;
extracting a reverse reaction template from the initial data set by an RdChiral method, performing Hash coding on a reaction equation and the reverse reaction template respectively, and constructing an initial template library based on the Hash coding of the reaction equation, a reactant SMILES expression, a product SMILES expression, the reverse reaction template and the Hash coding of the reverse reaction template;
after data cleaning is carried out on the initial template library, the hash code of the reverse reaction template is converted into a label vector, the product SMILES expression is converted into a fingerprint vector, and data set division is carried out on the basis of the fingerprint vectors and the label vectors to obtain a training set, a verification set and a test set, each of which comprises label vectors and fingerprint vectors in one-to-one correspondence, and a reverse reaction template library is constructed on the basis of the unique reverse reaction templates;
constructing a fully connected neural network model with a single hidden layer, based on the Keras deep learning framework, to serve as the template prediction model, wherein the template prediction model takes a product as input and predicts and outputs a reverse reaction template;
training the template prediction model by taking a training set as input to optimize parameters of the template prediction model, monitoring the training process of the template prediction model by taking a verification set as input to prevent overfitting to obtain a trained template prediction model, and testing the trained template prediction model based on a test set to obtain a final template prediction model;
by the Monte Carlo tree-based reactant derivation method as claimed in any one of claims 1-5, constructing the Monte Carlo tree with the target compound SMILES expression as the root node, and predicting the reverse reaction template corresponding to the molecule in each node of the Monte Carlo tree by using the template prediction model based on the Monte Carlo tree search method, and obtaining the previous stage reactant corresponding to the reverse reaction template.
7. The method of claim 6, wherein, before reverse reaction templates are extracted from the initial data set by the RdChiral method, atom mapping is performed on the reaction equations with RXNMapper, labeling the reactants and products with atom-map numbers.
8. The method of claim 6, wherein the data cleaning of the initial template library comprises the following steps:
removing sample data of which the product quantity is more than 1 in the product SMILES expression in the initial template library;
acting the reverse reaction template on the corresponding product based on the RdChiral, and if the action is invalid, removing corresponding sample data;
removing sample data corresponding to reverse reaction templates with the occurrence frequency less than a threshold value in the initial template library;
and removing repeated sample data in the initial template library according to the Hash code of the reaction equation.
9. The method of claim 6, wherein the hash code of the reverse reaction template is converted into a label vector by the LabelBinarizer label binarization method of the scikit-learn library, and the product SMILES expression is converted into a fingerprint vector by the Morgan algorithm of RDKit.
10. The Monte Carlo tree-based reverse synthesis derivation method according to claim 6, 7, 8 or 9, wherein the template prediction model comprises:
an input layer whose number of neurons is consistent with the length of the fingerprint vector;
a hidden layer configured with the activation function ELU;
an output layer whose number of neurons is consistent with the number of unique reverse reaction templates and which is configured with the activation function Softmax.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111066691.1A CN113782109A (en) | 2021-09-13 | 2021-09-13 | Reactant derivation method and reverse synthesis derivation method based on Monte Carlo tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113782109A (en) | 2021-12-10 |
Family
ID=78842709
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111066691.1A Pending CN113782109A (en) | 2021-09-13 | 2021-09-13 | Reactant derivation method and reverse synthesis derivation method based on Monte Carlo tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113782109A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109872780A (en) * | 2019-03-14 | 2019-06-11 | 北京深度制耀科技有限公司 | A kind of determination method and device of chemical synthesis route |
CN110659420A (en) * | 2019-09-25 | 2020-01-07 | 广州西思数字科技有限公司 | Personalized catering method based on deep neural network Monte Carlo search tree |
CN112397155A (en) * | 2020-12-01 | 2021-02-23 | 中山大学 | Single-step reverse synthesis method and system |
Non-Patent Citations (1)
Title |
---|
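Guo Shihao (郭世豪): "Design and Implementation of a Compound Retrosynthesis System Based on Deep Learning", China Master's Theses Full-text Database, Engineering Science and Technology I, no. 8, 15 August 2020 (2020-08-15), pages 1 - 57 * |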
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114530208A (en) * | 2022-02-18 | 2022-05-24 | 中山大学 | Planning method and system for chemical reverse synthesis path |
CN114974450A (en) * | 2022-06-28 | 2022-08-30 | 苏州沃时数字科技有限公司 | Method for generating operation steps based on machine learning and automatic testing device |
CN115083549A (en) * | 2022-07-18 | 2022-09-20 | 烟台国工智能科技有限公司 | Product raw material ratio reverse derivation method based on data mining |
WO2024032096A1 (en) * | 2022-08-09 | 2024-02-15 | 腾讯科技(深圳)有限公司 | Reactant molecule prediction method and apparatus, training method and apparatus, and electronic device |
CN115240788A (en) * | 2022-09-19 | 2022-10-25 | 杭州生奥信息技术有限公司 | Small molecule de novo design method based on molecular building block synthesis planning technology |
CN115588471A (en) * | 2022-11-23 | 2023-01-10 | 药融云数字科技(成都)有限公司 | Self-correcting single-step inverse synthesis method under continuous learning, terminal, server and system |
CN116189804B (en) * | 2023-04-17 | 2023-07-14 | 烟台国工智能科技有限公司 | Method and system for predicting reaction conditions based on graph convolution neural network |
CN116189804A (en) * | 2023-04-17 | 2023-05-30 | 烟台国工智能科技有限公司 | Method and system for predicting reaction conditions based on graph convolution neural network |
CN116578934A (en) * | 2023-07-13 | 2023-08-11 | 烟台国工智能科技有限公司 | Inverse synthetic analysis method and device based on Monte Carlo tree search |
CN116578934B (en) * | 2023-07-13 | 2023-09-19 | 烟台国工智能科技有限公司 | Inverse synthetic analysis method and device based on Monte Carlo tree search |
CN117457093A (en) * | 2023-12-20 | 2024-01-26 | 烟台国工智能科技有限公司 | Reverse synthesis method and device for organic reaction product based on data amplification |
CN117457093B (en) * | 2023-12-20 | 2024-03-08 | 烟台国工智能科技有限公司 | Reverse synthesis method and device for organic reaction product based on data amplification |
CN117972531A (en) * | 2024-03-29 | 2024-05-03 | 烟台国工智能科技有限公司 | Diversified inverse synthetic analysis model evaluation method and device |
CN117972531B (en) * | 2024-03-29 | 2024-06-11 | 烟台国工智能科技有限公司 | Diversified inverse synthetic analysis model evaluation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20211210 |