CN113990405A - Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent - Google Patents

Info

Publication number
CN113990405A
Authority
CN
China
Prior art keywords
reagent
data item
reagent compound
smiles
reaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111214079.4A
Other languages
Chinese (zh)
Other versions
CN113990405B (en
Inventor
陈德铭
马汝建
陈志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Apptec Co Ltd
Original Assignee
Wuxi Apptec Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Apptec Co Ltd filed Critical Wuxi Apptec Co Ltd
Priority to CN202111214079.4A priority Critical patent/CN113990405B/en
Publication of CN113990405A publication Critical patent/CN113990405A/en
Application granted granted Critical
Publication of CN113990405B publication Critical patent/CN113990405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10Analysis or design of chemical reactions, syntheses or processes
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30Prediction of properties of chemical compounds, compositions or mixtures
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)

Abstract

The invention discloses a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of a chemical reaction reagent. The model construction method comprises: representing reagent compounds as SMILES character sequences to generate a reagent compound restriction data table; representing chemical reaction formulas as SMILES character sequences; deleting from the chemical reaction formula SMILES data the SMILES data of any reagent compound present in the reagent compound restriction data table, using the remaining reaction as an input data item and the deleted reagent compound SMILES data as a target data item; and generating, through artificial intelligence deep learning, an output data item closest to the target data item, supplementing the output data item into the chemical reaction formula, and inputting the completed formula into a reaction prediction model, wherein, if the predicted product is consistent with the original reaction product, the output data item is output as the predicted reagent compound. The model is then used to automatically predict and complete chemical reaction reagents.

Description

Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent
Technical Field
The invention relates to the field of pharmaceutical chemistry application, in particular to a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of chemical reaction reagents.
Background
In medicinal chemistry, the organic synthesis of new chemical molecules requires that chemical reactions (whether proposed by organic chemists or generated virtually by computer algorithms) be predicted and judged in advance, so that the loss and waste caused by failed experiments are avoided. An automated synthesis apparatus likewise needs information on the reaction conditions (in particular, the compounds used as reagents) under which a chemical reaction can proceed before it can carry the reaction out. Automatic prediction of chemical reaction reagents, or completion of missing reagents, is therefore an important link in automated organic synthesis design.
Artificial intelligence (AI) trained on large amounts of organic chemical reaction data, in particular deep learning models, provides an effective modeling approach for forward reaction prediction and retrosynthetic reaction generation in organic synthesis automation. In retrosynthesis modeling, AI models generally consider only the transformed reactants, not the relevant reagent information (C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Central Science, 3, 2017. Connor W. Coley, William H. Green, and Klavs F. Jensen. RDChiral: An RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of Chemical Information and Modeling, 59(6): 2529-2537, 2019.); the few AI models that directly predict reactants and reagents have very low accuracy (Top-10 prediction accuracy < 25%) (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence, volume 3, pages 485-494, 2021), and the accuracy of their reagent information also needs to be greatly improved. Automatic reagent prediction and completion for chemical reactions with incomplete information is therefore also a key link in raising the reaction prediction accuracy of AI models to an experimentally executable level.
Even when a chemist designs an organic synthesis route by hand, each proposed synthesis step usually omits the conditions, including reagent information, because the design process focuses on the reaction conversion itself. To carry out the steps in a laboratory, the chemist generally searches for similar existing experimental reactions, collects their reaction conditions, and adjusts them according to experience. Accurate automatic prediction and completion of reagents can therefore also effectively improve the efficiency with which chemists design and implement organic synthesis routes manually.
Current AI models applied to organic synthesis route design focus mainly on two problems, retrosynthesis and reaction judgment, and have produced numerous model studies and methods. As shown in FIG. 1 (Ryan-Rhys Griffiths, Philippe Schwaller, Alpha A. Lee, Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design, NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018), retrosynthesis or synthesis design means that, given a desired compound to be synthesized, the reactant compounds that produce the desired product, and related information, are predicted; reaction prediction/judgment means that, given the reactants and associated information (e.g., reaction condition information) of a chemical reaction, the reaction product is predicted/generated, or a complete reaction including a given product is judged to be correct or not. The technical problem addressed by the present invention is reagent prediction (or reaction planning), i.e. predicting the reaction conditions (mainly chemical reagents) required for a reaction given its reactants and desired products. Reagent prediction is difficult to make unique because the definition of which compounds act as condition reagents in a reaction is fuzzy, and, being very challenging, it has rarely been studied in the AI and deep learning field. However, the lack of an effective automatic completion method for chemical reagents is a key gap that keeps retrosynthesis and reaction prediction AI for organic route design from meeting industrial expectations and applications.
In automated organic synthesis route design, the unknown target molecule of each step is virtually decomposed into corresponding candidate reactants, and it must then be verified whether the candidate reactants can actually react in an experiment and produce the target molecule. This requires reasonable reaction conditions, of which chemical reagents (broadly including solvents and catalysts) are important determinants. Other, more conventional conditions such as temperature and pressure (usually ambient) can be adjusted for the same combination of compounds, whereas a chemical reagent, being a compound, must be prepared (purchased or requisitioned) before the reaction experiment, just like a reactant. Accurate judgment and selection of the chemical reagent therefore plays a key role second only to that of the reactants and can determine whether the reaction succeeds.
In retrosynthesis, AI models typically consider only the transformed reactants and not the relevant reagent information; the few AI models that directly predict reactants and reagents have very low accuracy, with the true answer appearing among the model's first ten predictions (Top-10) in < 25% of cases (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence, volume 3, pages 485-494, 2021), so the accuracy of the reagent information also needs to be greatly improved.
Current research on condition prediction generally encodes expert rules for specific reaction types; its effectiveness and scale are constrained by expert knowledge and experience, and it cannot be scaled up automatically or applied to general reactions. At present, the main work on condition prediction for non-specific reactions is that of Hanyu Gao, Thomas J. Struble, Connor W. Coley, Yuran Wang, William H. Green, and Klavs F. Jensen (Using Machine Learning To Predict Suitable Conditions for Organic Reactions, ACS Cent. Sci., 4, 1465-). In real reactions, role classification is not available before the reaction product is determined, and the number of reagents is often greater than a hard limit of at most 2 allows. That method relies on the unique identification (ID) of condition compounds, divided by reaction role, in non-public reaction data (Reaxys, https://www.reaxys.com/) labeled manually over decades: catalysts, solvents, reaction reagents, and so on are classified and modeled separately. Yet even though its training and test data are limited to predicting at most 1 catalyst, 2 solvents, and 2 reagents, its accuracy is still far from that required for real organic synthesis: for chemical reagents in the broad sense (including catalysts, solvents, and reaction reagents), it reaches only 57.3% for the first three predictions (Top-3) and 66.0% for the first ten predictions (Top-10), a long way from actual industrial application.
A method that uses a reagent compound ID as the prediction result ignores the effect of reagent prediction/completion on reaction judgment: the predicted ID carries no chemical information about the predicted reagent, so it cannot be combined with the chemical information of the reactants of a proposed/simulated reaction to make a chemically meaningful reaction judgment.
Therefore, to raise the accuracy of forward and retrosynthetic reaction prediction by AI models in organic synthesis automation to the experimentally executable level required for industrial application, a method and an apparatus for automatically predicting reaction reagents or completing missing reagents are needed, and they must directly improve the accuracy of proposed/virtual reaction judgment.
Disclosure of Invention
In view of the gap between current technology and the industrial demands of organic synthesis design, the invention solves the problem of automatically predicting reaction reagents or completing missing reagents, i.e. automatic reaction reagent prediction or reaction planning. To this end, the present invention provides, in a first aspect, a method for constructing a reagent compound prediction model, comprising the steps of:
Step one, representing reagent compounds as SMILES character sequences to generate a reagent compound restriction data table;
Step two, representing chemical reaction formulas as SMILES character sequences and dividing them into a training set and a test set; the chemical reaction formulas in both the training set and the test set comprise full-information reactants, full-information reagent compounds, and full-information products;
Step three, deleting from the chemical reaction formula SMILES data in the training set the SMILES data of the reagent compounds present in the reagent compound restriction data table to obtain input data items, and using the deleted reagent compound SMILES data as target data items;
Step four, for the correspondence between the input data items and the target data items of step three, generating with the AI translation model, through artificial intelligence deep learning, n output data items closest to the target data items of step three from the input data items of step three; "closest" includes the completely consistent case; n is an integer not less than 0;
Step five, supplementing the output data items obtained in step four into the chemical reaction formula, inputting the completed formula into a reaction prediction model, and obtaining the reagent compound prediction model through optimization based on the result feedback: if the predicted product is consistent with the original reaction product, the output data item obtained in step four is a predicted reagent compound; if the predicted product is inconsistent with the original reaction product, the output data item obtained in step four is not taken as a predicted reagent compound, and the output data items remaining after such cases are excluded are recorded as the predicted reagent compounds of the chemical reaction formula.
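The filtering in step five can be illustrated with a minimal sketch. The function names, the `>>` reaction arrow (the text writes it as `->`), and the dictionary-based `toy_predictor` are assumptions made for illustration; a real reaction prediction model would replace the stub, and exact string comparison presumes canonical SMILES.

```python
from typing import Callable, List


def filter_by_round_trip(
    source: str,
    candidates: List[str],
    predict_product: Callable[[str], str],
    arrow: str = ">>",
) -> List[str]:
    """Keep only candidate reagent sets whose completed reaction reproduces the product."""
    left, product = source.split(arrow)
    kept = []
    for reagents in candidates:
        # An empty candidate means "no reagent needs to be added".
        completed_left = f"{left}.{reagents}" if reagents else left
        if predict_product(completed_left) == product:
            kept.append(reagents)
    return kept


def toy_predictor(left: str) -> str:
    """Hypothetical stand-in for a trained reaction prediction model."""
    known = {"CC(=O)Cl.NCc1ccccc1.CC#N": "CC(=O)NCc1ccccc1"}
    return known.get(left, "")


# Two hypothetical candidates from the translation model: acetonitrile and ethanol.
kept = filter_by_round_trip(
    "CC(=O)Cl.NCc1ccccc1>>CC(=O)NCc1ccccc1",
    ["CC#N", "CCO"],
    toy_predictor,
)
```

With this stub, only the acetonitrile candidate survives, because it is the only completion for which the stub returns the original product.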
In some embodiments, the AI translation model is the deep learning-based machine translation model Transformer of Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (Attention Is All You Need, 2017; hereinafter referred to as Vaswani 2017). The Transformer of Vaswani 2017 is a general-purpose technology that can be applied to text translation, to chemical reaction product prediction, or to the reagent prediction of the invention. The AI translation model used in the invention is not limited to the Transformer of Vaswani 2017 and may be another Transformer.
In some embodiments, the method further comprises: step six, testing the reagent compound prediction model with the test set, specifically:
S1, deleting from the chemical reaction formula SMILES data in the test set the SMILES data of the reagent compounds present in the reagent compound restriction data table to obtain input data items, and using the deleted reagent compound SMILES data as target data items;
S2, for the correspondence between the input data items and the target data items of step S1, generating with the AI translation model, through artificial intelligence deep learning, the output data item closest to the target data item of step S1 from the input data item of step S1;
S3, supplementing the output data item obtained in step S2 into the chemical reaction formula and inputting the completed formula into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item obtained in step S2 is a predicted reagent compound;
S4, comparing the probability P that the product predicted by the reaction prediction model is consistent with the original reaction product in two cases, with and without the output data item obtained in step S2 supplemented; if P with the output data item supplemented is significantly higher than P without it, the reagent compound prediction model has been constructed successfully.
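As a rough illustration of the comparison in S4, the sketch below estimates P as the fraction of test reactions whose predicted product matches the original product, once without and once with the predicted reagents. The record layout, the placeholder molecules A to K (used as in the text's A.B.C->D notation, not as real SMILES), and the stub predictor are all assumptions; the text does not define how "significantly higher" is tested, so no threshold is applied here.

```python
from typing import Callable, List, Tuple

# Each record: (reactant side without reagents, predicted reagents, original product).
Record = Tuple[str, str, str]


def consistency_rate(
    records: List[Record],
    predict_product: Callable[[str], str],
    with_reagents: bool,
) -> float:
    """Fraction of reactions whose predicted product equals the original product."""
    hits = 0
    for left, reagents, product in records:
        if with_reagents and reagents:
            left = f"{left}.{reagents}"
        hits += predict_product(left) == product
    return hits / len(records)


def toy_predictor(left: str) -> str:
    """Hypothetical reaction prediction model that needs the reagent to succeed."""
    known = {"A.B.C": "D", "E.F.G": "H", "I.J": "K"}
    return known.get(left, "")


records = [("A.B", "C", "D"), ("E.F", "G", "H"), ("I.J", "", "K")]
p_without = consistency_rate(records, toy_predictor, with_reagents=False)
p_with = consistency_rate(records, toy_predictor, with_reagents=True)
```

In this toy setting the third reaction needs no reagent, so it is predicted correctly in both cases, while the first two are recovered only after completion.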
In some embodiments, the artificial intelligence deep learning in step four generates the n output data items closest to the target data item of step three as follows:
After the input data items of step three pass through the weight computations of each layer of the AI translation model, the output layer produces raw weights for K different character symbols, which are normalized into probabilities by the Softmax function. During training, an optimization function is used and the weights of all layers of the AI translation model are updated by backpropagation, with the optimization target of minimizing the difference between the output symbol and the real target symbol. The optimization function uses cross entropy to measure the difference between the normalized probability distribution over the K character symbols and the probability distribution of the real target symbol, and minimizes that difference. After this minimization, the SMILES sequence formed by the character symbols with the highest normalized probability among the K character symbols is taken as the output data item of step four;
wherein K is an integer greater than or equal to 0; when the character symbol with the highest normalized probability is the "·" symbol in SMILES, the output data item of step four is a blank compound.
The value of K is the number of all possible character symbols of the chemical structures. In one implementation, the character symbols comprise the set of element symbols that occur (such as Cl, Br, I, N, C) and the non-element symbols (such as ".") in SMILES.
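The Softmax normalization and cross-entropy loss described above can be worked through for a single output position. This is a minimal sketch: the four-symbol vocabulary and the raw weights are made-up values, and with a one-hot target distribution the cross entropy reduces to the negative log-probability of the true symbol.

```python
import math
from typing import List


def softmax(raw_weights: List[float]) -> List[float]:
    """Normalize raw output-layer weights into probabilities (numerically stable)."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross entropy against a one-hot target: -log p(target symbol)."""
    return -math.log(probs[target_index])


vocab = ["C", "N", "Cl", "."]          # hypothetical vocabulary, K = 4
raw_weights = [2.0, 0.5, 0.1, -1.0]    # hypothetical output-layer values
probs = softmax(raw_weights)
loss = cross_entropy(probs, vocab.index("C"))   # true symbol assumed to be "C"
best_symbol = vocab[probs.index(max(probs))]    # symbol emitted at this position
```

Training lowers `loss` by adjusting the layer weights through backpropagation; at prediction time the symbol with the highest probability is emitted at each position.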
In some embodiments, step one is specifically: standard names of reagent compounds are collected and converted to SMILES using the open chemical name conversion tool OPSIN (https://opsin.ch.cam.ac.uk/); the accumulated reagent compound SMILES are pooled to form the reagent compound restriction data table, which contains the SMILES of more than 2000 reagent compounds.
In some embodiments, step two is specifically: chemical reaction formulas are collected from published chemical patents and processed into SMILES; a total of 1.5 to 2.3 million reactions are collected, of which 90 to 97% form the training set and the remainder (i.e., 10 to 3%) form the test set.
Further, a total of 1.8 to 2.1 million reactions are collected, of which 93 to 96% form the training set and the remainder (i.e., 7 to 4%) form the test set. Preferably, a total of 2 million ± 100 reactions are collected, of which 95% ± 0.5% form the training set and the remainder form the test set.
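A split in the preferred 95%/5% proportion could be sketched as follows. The random shuffle and the fixed seed are assumptions added for reproducibility; the text only specifies the proportions, not how reactions are assigned to each set.

```python
import random
from typing import List, Tuple


def split_reactions(
    reactions: List[str],
    train_fraction: float = 0.95,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle the reactions and split them into training and test sets."""
    shuffled = list(reactions)             # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers stand in for reaction SMILES strings.
all_reactions = [f"reaction_{i}" for i in range(100)]
train_set, test_set = split_reactions(all_reactions)
```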
In some embodiments, n in step four is ≤ 10. The invention does not need to limit the range of n beyond n ≥ 0: a reasonable number of reagents is obtained by generating the predicted reagents, and n = 0 is allowed, meaning that the reaction needs no additional reagent. In generally reasonable reactions, however, n ≤ 10 is typical.
In some embodiments, the reaction prediction model is the Molecular Transformer (MT) of Philippe Schwaller et al., Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Central Science, 2019 Sep 25; 5(9): 1572-1583 (hereinafter referred to as Schwaller 2019).
In a second aspect, the invention applies the model constructed by the above method to provide a method for automatic prediction and completion of a chemical reaction reagent, in which the reagent compound used for completion is obtained with the reagent compound prediction model constructed by the above construction method.
In some embodiments, the method for automatic prediction and completion of a chemical reaction reagent comprises the steps of:
A1, inputting the chemical reaction formula SMILES data to be completed into the reagent compound prediction model as an input data item, and obtaining the output data item of the reagent compound prediction model;
A2, supplementing the output data item of step A1 into the chemical reaction formula to be completed and inputting the result into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item of step A1 is the reagent compound used for completion.
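Steps A1 and A2 can be chained into one function, sketched below. Both models are passed in as callables; the two `toy_*` functions are hypothetical stand-ins for the trained reagent compound prediction model and the reaction prediction model, and the `>>` arrow is an assumed notation (the text writes `->`).

```python
from typing import Callable, Optional


def complete_reaction(
    source: str,
    predict_reagents: Callable[[str], str],
    predict_product: Callable[[str], str],
    arrow: str = ">>",
) -> Optional[str]:
    """Return the completed reaction SMILES, or None if verification fails."""
    left, product = source.split(arrow)
    reagents = predict_reagents(source)                 # step A1
    completed_left = f"{left}.{reagents}" if reagents else left
    if predict_product(completed_left) == product:      # step A2
        return f"{completed_left}{arrow}{product}"
    return None


def toy_reagent_model(source: str) -> str:
    """Hypothetical reagent model that always proposes acetonitrile."""
    return "CC#N"


def toy_reaction_model(left: str) -> str:
    """Hypothetical reaction model that succeeds only when acetonitrile is present."""
    return "CC(=O)NCc1ccccc1" if left.endswith(".CC#N") else ""


completed = complete_reaction(
    "CC(=O)Cl.NCc1ccccc1>>CC(=O)NCc1ccccc1",
    toy_reagent_model,
    toy_reaction_model,
)
```

Returning `None` on a failed check is one possible design choice; a caller could instead fall back to the next candidate from the reagent model.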
In a third aspect, the present invention further provides a device for automatic prediction and completion of a chemical reaction reagent, comprising an analysis module configured to generate the reagent compound used for completion from the input chemical reaction SMILES data to be completed; the analysis module is loaded with:
the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model described above, and the reaction prediction model.
The device is likewise a specific application of the model constructed by the construction method.
The invention discloses a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of a chemical reaction reagent. The model construction method comprises: representing reagent compounds as SMILES character sequences to generate a reagent compound restriction data table; representing chemical reaction formulas as SMILES character sequences; deleting from the chemical reaction formula SMILES data the SMILES data of any reagent compound present in the reagent compound restriction data table, using the remaining reaction as an input data item and the deleted reagent compound SMILES data as a target data item; and generating, through artificial intelligence deep learning (such as an optimization function, backpropagation, the Softmax function, and the like), an output data item closest to the target data item, supplementing the output data item into the chemical reaction formula, and inputting the completed formula into a reaction prediction model, wherein, if the predicted product is consistent with the original reaction product, the output data item is determined to be the predicted reagent compound. The model is then used to automatically predict and complete chemical reaction reagents.
The invention uses the reactant, reagent compound, and product data of chemical reaction formulas to construct reaction formulas with the reaction reagent (i.e., the reagent compound) removed as input data items, trains an AI machine learning model to predict the chemical structure of the reaction reagent, and merges the predicted reagent back into the chemical reaction to improve the accuracy of chemical reaction prediction/judgment (for example, by a reaction prediction model); the accuracy can be improved by 33%. Compared with the prior art, the method does not require experts to assign reaction roles in the chemical reaction, automatically identifies the target to be predicted from the reagent data, and is not limited by the category or number of reagents.
The construction and evaluation method can use existing chemical reaction data to train the AI reagent compound prediction model (i.e., the reagent compound prediction model) and can automatically evaluate and verify how much that model improves reaction judgment, without manual evaluation.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 shows the relationship among synthesis design, reaction prediction/judgment, and reaction reagent prediction (reaction planning in the broad sense). P is the target compound/target product, i.e. the product of the reaction (on the right of "->" in the reaction formula). Reaction reagent prediction (in the broad sense) predicts the compounds (reagents) required for the reaction given the reactant (A) and the target product (P); this is the technical problem addressed by the present invention. Reaction prediction/judgment, given the reactants and reagents (A + B + C, on the left of "->"), predicts the product or judges whether a certain product P (on the right of "->") is the product of the reaction; many method models already address this problem. Synthesis design, i.e. retrosynthesis, is given a target compound (P) and predicts the reactants that produce it; many method models address this problem as well.
FIG. 2 is a flow chart of reaction reagent prediction data construction, reagent compound prediction model training, and reaction prediction promotion auto-validation.
Figure 3 is a reagent compound predictive model training implementation: the AI translation model was trained to perform reagent compound predictions in chemical reactions (the ".
Detailed Description
In order to make the technical means, the characteristics, the purposes and the functions of the invention easy to understand, the invention is further described with reference to the specific drawings. However, the present invention is not limited to the following embodiments.
It should be understood that the structures, proportions, sizes, and the like shown in the drawings are provided only to accompany the disclosure of the specification so that those skilled in the art can understand and read it; they are not intended to limit the conditions under which the invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in proportion, or adjustment of size that does not affect the efficacy and achievable purpose of the invention still falls within its scope.
The technical concept of the invention is shown in FIG. 2 and comprises, in order of execution, reaction reagent prediction data construction, reagent compound prediction model training, and automatic verification of the improvement in reaction prediction. A reaction is written as A.B.C->D: the left side lists the reactants, including potential reagents, separated by ".", and the right side of "->" is the reaction product (D). By matching against the reagent compound restriction data table, C is found to be a reagent compound. Deleting the reagent compound from the chemical reaction yields the input data item (or source language) A.B->D (the reaction with the reagent missing) and the corresponding output/target data item C (a reagent group when there are several reagents), which are used to train a machine learning model (such as a machine translation model, i.e. the AI translation model). After training on a reaction reagent prediction data set constructed in this way (i.e., the data for the reagent compound prediction model), the AI translation model attempts to predict the reagent compound(s) C' when it encounters a reaction of the form A.B->D. C' may not be identical to the original answer C. Nevertheless, if an existing reaction prediction model (such as the Molecular Transformer of Schwaller 2019) still predicts the expected product D for A.B.C'->D, the invention judges the reagent compound C' to be reasonable, since chemists have verified that the same chemical conversion can often proceed under several reasonable reagent conditions (both C and C' may be reasonable).
Reaction reagent prediction data construction: in the invention, the reagent compounds to be predicted and completed come from a "reagent compound restriction data table", which limits the range of reagent compounds the model can predict, since not every compound on the reactant side is a reagent compound. The table can be constructed by obtaining chemical reagent names from chemical information websites and the chemical literature and representing the corresponding structures as SMILES character strings, or by collecting the structures of reagent compounds in the broad sense (solvents, reagents, catalysts, etc.) from expert experience. The table stores only the SMILES character strings of the reagent structures; it needs neither the names of the reagents nor a strict expert definition of the solvent, reagent, and catalyst roles. For example, ethanol, of formula C2H6O or CH3CH2OH, is represented by the standard SMILES string "CCO". The SMILES strings of further reagent compounds are illustrated in Table 1; it should be understood that the reagent compounds are not limited to those shown in Table 1.
Table 1. Examples of reagent compounds
Name of reagent compound            Reagent structure (SMILES)
Ethanol                             CCO
Propylene (PA)                      CC=C
Acetonitrile                        CC#N
Carbonic acid methyl ethyl ester    COC(=O)CC
The reaction reagent prediction data are constructed from existing chemical reaction data that contain the reagent compounds in full: the reagent compounds (and their combinations) that appear in the "reagent compound restriction data table" are deleted from each reaction, giving the chemical reaction with the reagent compounds missing as the input data item and the deleted reagent compound combination as the target data item. The matching can be done by comparing SMILES strings, since the complete chemical reaction data can also be expressed as SMILES. This way of constructing the missing-reagent reaction prediction dataset provides the data used by the AI machine learning model (the machine learning model for short) automatically and without manual intervention, namely paired input data items (incomplete reaction SMILES) and target data items (SMILES of the reagent combination to be predicted), which can be used to train the AI translation model and to evaluate its performance. In application, the chemical structure combination of the reagent compounds to be supplemented can be predicted directly from the chemical information features extracted from the input reaction. The prediction is not restricted to any reagent category, number or combination, which makes the method better suited to real scenarios.
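The pair construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `make_pair`, the `<blank>` placeholder and the use of ">>" as the separator are hypothetical choices, and the plain string comparison assumes that the table and the reactions are written in the same canonical SMILES form. The sample reaction is assembled for illustration from SMILES strings that appear in this document.

```python
from typing import FrozenSet, Tuple

# Hypothetical placeholder for "no reagent deleted"; the actual blank symbol is an assumption.
BLANK = "<blank>"


def make_pair(reaction: str, reagent_table: FrozenSet[str],
              sep: str = ">>", blank: str = BLANK) -> Tuple[str, str]:
    """Return (input data item, target data item) for one complete reaction SMILES."""
    left, _, product = reaction.partition(sep)
    kept, removed = [], []
    for molecule in left.split("."):
        # Plain string comparison: assumes the table and the reactions
        # use the same canonical SMILES form.
        if molecule in reagent_table:
            removed.append(molecule)
        else:
            kept.append(molecule)
    if not kept:
        raise ValueError("no reactant left after deleting reagents")
    source = ".".join(kept) + sep + product
    target = ".".join(removed) if removed else blank
    return source, target


table = frozenset({"CCO", "CC#N", "O=C1CCC(=O)N1Br"})
substrate = "CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O"
product = "CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O"
source, target = make_pair(substrate + ".O=C1CCC(=O)N1Br>>" + product, table)
print(source)  # the reaction with the reagent deleted
print(target)  # the deleted reagent combination
```

Keeping the deleted molecules in their original order is a design choice of this sketch; a fixed ordering of the target combination would be an equally plausible convention.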
Training of the reagent compound prediction model: an AI translation model (e.g., the deep-learning machine translation model in Vaswani 2017, the Transformer, which can be used for general translation tasks) is trained on the constructed missing-reagent reaction training dataset. The input data are reactions in which the deletion of the reagent compounds is simulated but in which reactants and products must be present, and the target output data are the sets of reagent compounds deleted from those reactions. The input-target pairs of each reaction, namely the paired input data items (incomplete reaction SMILES) and target data items (SMILES of the reagent combination to be predicted), form the missing-reagent reaction training dataset. The AI translation model is trained with the optimization target that, given the input data, it generates the corresponding target data with maximum probability, which finally yields the reagent compound prediction model for chemical reactions.
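The optimization target described above can be written, in assumed notation, as a maximum-likelihood objective. The symbols are not from the original text: x is an input data item, y = (y_1, ..., y_{T_y}) is the token sequence of its target data item, D is the missing-reagent reaction training dataset and θ denotes the model weights. This is the usual sequence-to-sequence formulation, given here only as a sketch.

```latex
\theta^{*} \;=\; \arg\max_{\theta} \sum_{(x,\,y)\in\mathcal{D}} \; \sum_{t=1}^{T_y} \log p_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
```

Maximizing this sum is equivalent to minimizing the token-level cross entropy between the model's output distribution and the true target symbols.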
Automatic verification of the improvement in reaction prediction: the trained reagent compound prediction model is used to predict/supplement reagents for each reaction containing at least one reactant in the constructed verification test set, and the reaction is reconstructed into a reaction supplemented with the predicted reagent information. If the predicted reagent is a blank compound, the original reaction needs no supplementary reagent information and is copied as the reconstructed reaction. One or more existing, trained reaction prediction machine learning models are then each used to predict the products of (i) the baseline reaction, without supplemented reagent compounds and with the original product withheld, and (ii) the reaction supplemented with the predicted reagent compounds, and the proportion of correctly predicted products and its improvement over the baseline are verified. Where several existing reaction prediction machine learning models are available, the model with the largest improvement index is selected as the reaction prediction model to be used subsequently together with the reagent compound prediction model. In the application stage, the reagent compound prediction model and the selected reaction prediction model are used as a combined component, which improves the accuracy of reaction judgment in synthesis design.
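The selection among several reaction prediction models can be sketched as below. The text does not define the improvement index, so this sketch assumes it is the absolute gain in Top-1 accuracy over the baseline; the function names and the toy lambda "models" are hypothetical stand-ins for trained predictors.

```python
from typing import Callable, Dict, Sequence, Tuple

# A reaction predictor maps the reactant side of a reaction SMILES to a product SMILES.
Predictor = Callable[[str], str]


def top1_accuracy(model: Predictor, inputs: Sequence[str], products: Sequence[str]) -> float:
    """Fraction of reactions whose predicted product equals the recorded product."""
    hits = sum(model(x) == y for x, y in zip(inputs, products))
    return hits / len(products)


def select_reaction_model(models: Dict[str, Predictor],
                          baseline_inputs: Sequence[str],
                          supplemented_inputs: Sequence[str],
                          products: Sequence[str]) -> Tuple[str, Dict[str, float]]:
    """Pick the model whose accuracy gains most from reagent supplementation."""
    gains = {}
    for name, model in models.items():
        baseline = top1_accuracy(model, baseline_inputs, products)
        supplemented = top1_accuracy(model, supplemented_inputs, products)
        gains[name] = supplemented - baseline  # assumed improvement index
    best = max(gains, key=gains.get)
    return best, gains


# Toy stand-ins for trained models (placeholders, not real predictors).
models = {
    "model_a": lambda x: "D" if x.endswith(".R") else "?",
    "model_b": lambda x: "D",
}
best, gains = select_reaction_model(models, ["A.B"], ["A.B.R"], ["D"])
print(best, gains)
```

A relative gain could be used as the index instead; the two can rank models differently when their baselines differ.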
Example 1
Reagent compound prediction model training as machine translation: modeling and training for reagent prediction are performed on the constructed missing-reagent reaction training dataset using a deep-learning machine translation model.
Step one, standard names of the relevant chemical reagents are collected and converted to SMILES using an open chemical name conversion tool (https://opsin.ch.cam.ac.uk/), and these are combined with SMILES accumulated by experts to form a "reagent compound restriction data table" containing the SMILES of >2000 reagents.
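Combining the two sources of SMILES into one table can be sketched as a simple merge with deduplication. The function name `build_restriction_table` and the sample lists are hypothetical; the name-to-SMILES conversion itself is not shown, and the sketch assumes both sources already use the same canonical SMILES form.

```python
from typing import FrozenSet, Iterable


def build_restriction_table(*sources: Iterable[str]) -> FrozenSet[str]:
    """Merge several collections of reagent SMILES into one deduplicated table."""
    table = set()
    for source in sources:
        for smiles in source:
            smiles = smiles.strip()
            if smiles:  # skip empty entries, e.g. names that could not be converted
                table.add(smiles)
    return frozenset(table)


converted_from_names = ["CCO", "CC#N ", ""]  # illustrative output of a name conversion step
expert_collected = ["CC#N", "CC=C"]           # illustrative expert-accumulated SMILES
table = build_restriction_table(converted_from_names, expert_collected)
print(sorted(table))
```

The table stores structures only, in line with the description: no reagent names and no solvent/reagent/catalyst roles are recorded.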
Step two, chemical reaction formulas are collected from published chemical patents and processed into SMILES; approximately 2 million reactions are collected in total, of which 95% (1.8 million) are used as the training set and the remaining 5% (about 95,000) as the test set.
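A 95%/5% split of the collected reactions can be sketched as below. The text does not say how reactions are assigned to the two sets, so the seeded random shuffle and the function name `split_reactions` are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple


def split_reactions(reactions: Sequence[str], train_fraction: float = 0.95,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly, then cut into a training set and a test set."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder reaction identifiers standing in for reaction SMILES.
reactions = ["rxn%d" % i for i in range(100)]
train_set, test_set = split_reactions(reactions)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the test set independent of the training set across reruns, which matters when the same split is reused for both RT and MT evaluation.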
Step three, the reagents and their combinations that appear in the reagent compound restriction data table are deleted from the existing chemical reaction SMILES data, which contain the reagents in full, to obtain the reaction SMILES without reagents as the input data item, with the deleted reagent combination as the target data item; where a reaction needs no supplementary reagent, a blank compound (which can be represented by a SMILES symbol) is used as the target data item.
Step four, the chemical structures of compounds, including reactants and generalized reagents, can be represented as character sequences in SMILES form, so generating the character sequence of the missing reagent combination from the character sequence of the reaction can be treated as a machine translation problem from a source language to a target language. The "source language" data are the reactions with the reagent compounds deleted but with reactants and products present, and the "target language" data are the reagent compound combinations deleted from those reactions. For each "source language"-"target language" pair, an AI translation model (such as the Transformer in Vaswani 2017) is trained with the optimization target of generating, from the "source language", the translation closest to the "target language", as shown in Fig. 3. In Fig. 3, the string to the left of ">>" is the reactant (with the reagent missing) and the string to the right is the reaction product. After passing through the Transformer in Vaswani 2017, the character symbols (tokens) of the predicted reagent combination SMILES are generated one by one. After the input has passed through the weight calculations of each layer of the Transformer in Vaswani 2017, the output layer of the model yields the raw weights z_i (> 0), i = 1, 2, …, K, of K (in this example, K = 5) different tokens, which are normalized into probabilities by Softmax; Br, which has the highest probability (0.90), is taken as the final output.
The output is a reagent combination, since one reaction can contain several reagents at the same time. The structure shown in the example corresponds to O=C1CCC(=O)N1Br; the ellipsis stands for the other reagents that the model predicts for this combination.
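Since the reagent SMILES is generated token by token, a SMILES string first has to be split into character symbols. The following tokenizer is only an illustrative sketch: the regular expression is an assumption (bracket atoms, the two-letter halogens Br and Cl, the ">>" separator and two-digit ring closures are kept whole, everything else is a single character) and is not taken from the original text.

```python
import re
from typing import List

# Illustrative pattern (an assumption): multi-character symbols first, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|>>|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES or reaction SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


tokens = tokenize("O=C1CCC(=O)N1Br")
print(tokens)
# ['O', '=', 'C', '1', 'C', 'C', 'C', '(', '=', 'O', ')', 'N', '1', 'Br']
```

Treating "Br" as one token matches the example in the text, where Br is a single output symbol with its own probability.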
Specifically, the Transformer in Vaswani 2017 takes the reactants, ">>" and the product as input, and generates the character symbols (tokens) of the reagent combination SMILES to be predicted one by one. After the input has passed through the weight calculations of each Transformer layer, the output layer of the model yields the raw weights z_i (> 0), i = 1, 2, …, K, of K (in this example, K = 5) different tokens, which are normalized into probabilities by Softmax:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, 2, \dots, K$$

where e = 2.71828… is the base of the natural logarithm. During training, an optimization function is used and the weights of all layers of the AI translation model are updated by backpropagation, with the optimization target of minimizing the difference between the output symbol and the true symbol. The optimization function may use cross entropy to calculate the difference between the model's output probability distribution over the K = 5 symbols, [0.02, 0.90, 0.05, 0.01, 0.02], and the probability distribution of the true target symbol, and minimize this difference; supposing that Br is the true symbol in the training data, its distribution is [0.0, 1.0, 0.0, 0.0, 0.0]. When the reagent compound prediction model is tested, the distribution of the true symbols is unknown, and the character symbol sequence with the highest probability is output directly as the output data item; in the figure, the symbol with the highest probability, Br (0.90), is the final output.
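The Softmax normalization and the cross-entropy difference can be worked through with the numbers given in the text. In this sketch the raw weights `z` are hypothetical: they are back-computed from the stated probabilities (shifted by a constant so that they are positive) purely to show that Softmax reproduces the distribution; only the probabilities and the position of Br come from the text.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Normalize raw weights into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], true_index: int) -> float:
    """Cross entropy against a one-hot true distribution: -ln p(true symbol)."""
    return -math.log(probs[true_index])


probs = [0.02, 0.90, 0.05, 0.01, 0.02]  # model output from the text, K = 5
br_index = 1                            # Br is the true symbol: [0, 1, 0, 0, 0]

z = [math.log(p) + 5.0 for p in probs]  # hypothetical positive raw weights
recovered = softmax(z)

loss = cross_entropy(probs, br_index)   # -ln(0.90), about 0.105
best = max(range(len(probs)), key=probs.__getitem__)
print(round(loss, 4), best)
```

Because the true distribution is one-hot, the cross entropy reduces to the negative log-probability of the true symbol, so pushing the probability of Br towards 1 drives the loss towards 0.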
The reagent compound prediction model outputs the reagent combination with the highest probability, since one reaction may contain several reagents at the same time. The structure shown in the example of Fig. 3 corresponds to O=C1CCC(=O)N1Br, and the ellipsis stands for the other reagents in the combination predicted by the reagent compound prediction model, which are omitted.
Step five, the trained reagent compound prediction model (RT) can be paired, for verification and application, with a reaction prediction model of the machine translation type (such as the Molecular Transformer in Schwaller 2019, MT for short). Specifically, for a test reaction, RT first predicts the reagents; if the prediction is a blank compound, no additional reagent is needed. The reaction supplemented with the reagents is fed into MT, and if the Top-1 (first-ranked) predicted product is identical to the original reaction product, the reagents predicted by RT are adopted.
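The RT + MT check of step five can be sketched with the two models passed in as plain callables. Everything named here is hypothetical: `complete_reaction`, the `<blank>` marker and the stub models are illustrative only, products are compared as strings (which assumes canonical SMILES), and returning the unchanged reaction for a blank prediction is one reading of the text.

```python
from typing import Callable, Optional

BLANK = "<blank>"  # hypothetical blank-compound marker


def complete_reaction(reaction: str,
                      rt: Callable[[str], str],
                      mt: Callable[[str], str],
                      sep: str = ">>",
                      blank: str = BLANK) -> Optional[str]:
    """Supplement a reaction with RT-predicted reagents and keep them only if MT agrees."""
    reactants, _, product = reaction.partition(sep)
    reagents = rt(reaction)
    if reagents == blank:
        return reaction  # no additional reagent is needed
    supplemented = reactants + "." + reagents
    if mt(supplemented) == product:  # Top-1 product matches the original product
        return supplemented + sep + product
    return None  # the predicted reagents are not adopted


# Stub models standing in for the trained RT and MT (placeholders only).
rt_stub = lambda reaction: "R"
mt_stub = lambda reactants: "D" if reactants == "A.B.R" else "X"
print(complete_reaction("A.B>>D", rt_stub, mt_stub))  # A.B.R>>D
```

Passing the models as callables keeps the verification logic independent of any particular RT or MT implementation.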
Step six, results and accuracy on the test set: the results were generated by experiments run independently using the test data and the associated models, and are not quoted directly from the datasets and result figures in Schwaller 2019.
Test effect 1: on the database extracted from published patent reactions, the reagent compound prediction model (RT) was trained on 1.8 million reactions, and reaction prediction was tested on approximately 95,000 reactions independent of the training reactions. As the baseline (incomplete reactions with the reagents missing), the validation experiment used a reaction prediction model (MT) trained on the corresponding data to predict the products of these reactions with the reagents missing. Without the RT of the invention, the MT Top-1 accuracy is 69% (baseline; the result of testing the Schwaller 2019 MT on this test dataset); after RT is used to predict the reagents for the missing-reagent reactions, the RT + MT Top-1 accuracy is 92%. RT thus raises the MT reaction prediction accuracy by 23 percentage points (from 69% to 92%, a relative increase of 33%).
Test effect 2: on a test set of real-scenario organic synthesis reactions that require supplementary reaction conditions (127 reactions), reactions supplemented by the reagent compound prediction model (RT) raise the proportion of correct predictions by the forward reaction prediction model (MT) by 90.7% in relative terms (an absolute increase of 39 percentage points for RT + MT).
Test effect 3: on a test of reactions extracted from real patents and judged by chemists to be erroneous and infeasible (20 reactions), the reagent compound prediction model (RT) effectively keeps down the rate at which an erroneous reaction is misjudged as correct (5%; 1 reaction was misjudged by MT).
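The arithmetic behind the reported figures can be checked with a few lines. The helper names are hypothetical. For test effect 2 the baseline accuracy is not stated in the text; the value computed below is only what the two reported numbers (39 points absolute, 90.7% relative) would imply if both refer to the same baseline.

```python
def absolute_gain(baseline_pct: float, improved_pct: float) -> float:
    """Gain in percentage points."""
    return improved_pct - baseline_pct


def relative_gain(baseline_pct: float, improved_pct: float) -> float:
    """Gain as a percentage of the baseline."""
    return (improved_pct - baseline_pct) / baseline_pct * 100.0


# Test effect 1: 69% -> 92% Top-1 accuracy.
effect1_abs = absolute_gain(69.0, 92.0)   # 23 points
effect1_rel = relative_gain(69.0, 92.0)   # about 33%

# Test effect 2: baseline implied by the reported gains (an inference, not a reported value).
effect2_baseline = 39.0 / 0.907           # about 43%
effect2_improved = effect2_baseline + 39.0

# Test effect 3: 1 misjudged reaction out of 20.
effect3_error_rate = 1 / 20 * 100.0       # 5%

print(effect1_abs, round(effect1_rel, 1), round(effect2_baseline, 1), effect3_error_rate)
```

Reporting both the absolute and the relative gain avoids ambiguity, since the same 23-point increase would be a much smaller relative gain on a higher baseline.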
An example test result:
Reaction with the reagent missing, in SMILES form:
CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O>>CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O
In structural form, the reactant is to the left of ">>" and the product is to the right of ">>":
[Figure: structures of the reactant and the product]
Without supplementation by the reagent compound prediction model RT, the product predicted by the reaction prediction model (MT) is wrong (the predicted product replaces the Br in the correct product on the right with I; drawing omitted).
The reagent set predicted by RT is ClCCl.O=C1CCC(=O)N1Br, with the structural form:
[Figure: structures of the predicted reagent set]
With the reagent set predicted by RT supplemented, MT predicts the correct product given above, as shown below.
[Figure: reaction supplemented with the predicted reagent set, giving the correct product]
The preferred embodiments of the invention have been described in detail above. It should be understood that those skilled in the art can devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, any technical solution that those skilled in the art can obtain through logical analysis, reasoning or limited experimentation on the basis of the prior art and in accordance with the concept of the invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. A method for constructing a reagent compound prediction model, characterized by comprising the following steps:
step one, representing reagent compounds as character sequences in SMILES form to generate a reagent compound restriction data table;
step two, representing chemical reaction formulas as character sequences in SMILES form and dividing them into a training set and a test set; the chemical reaction formulas in the training set and in the test set each comprise full-information reactants, full-information reagent compounds and full-information products;
step three, deleting, from the chemical reaction formula SMILES data in the training set, the SMILES data of the reagent compounds present in the reagent compound restriction data table, the result serving as input data items, and using the deleted SMILES data of the reagent compounds as target data items;
step four, for the correspondence between the input data items of step three and the target data items of step three, an AI translation model generates, through artificial-intelligence deep learning and according to the input data items of step three, n output data items closest to the target data items of step three; "closest" includes the case of being completely identical; n is an integer not less than 0;
step five, supplementing the output data item obtained in step four into the chemical reaction formula, inputting it into a reaction prediction model, and optimizing according to the result feedback to obtain the reagent compound prediction model: if the predicted product is consistent with the original reaction product, the output data item obtained in step four is the predicted reagent compound; if the predicted product is not consistent with the original reaction product, the output data item obtained in step four is not taken as the predicted reagent compound, and the output data items remaining after that case is excluded are recorded as the predicted reagent compounds of the chemical reaction formula.
2. The method for constructing a reagent compound prediction model according to claim 1, further comprising: step six, testing the reagent compound prediction model using the test set; specifically comprising the following steps:
S1, deleting, from the chemical reaction formula SMILES data in the test set, the SMILES data of the reagent compounds present in the reagent compound restriction data table, the result serving as input data items, and using the deleted SMILES data of the reagent compounds as target data items;
S2, for the correspondence between the input data item of step S1 and the target data item of step S1, the AI translation model generates, through artificial-intelligence deep learning and according to the input data item of step S1, an output data item closest to the target data item of step S1;
S3, supplementing the output data item obtained in step S2 into the chemical reaction formula and inputting it into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item obtained in step S2 is determined to be the predicted reagent compound;
S4, comparing the probability P that the product predicted by the reaction prediction model is consistent with the original reaction product in two cases: with the output data item obtained in step S2 supplemented, and without it; if the probability P with the output data item of step S2 supplemented is significantly higher than the probability P without it, the reagent compound prediction model has been constructed successfully.
3. The method for constructing a reagent compound prediction model according to claim 1, wherein in step four the artificial-intelligence deep learning generates the n output data items closest to the target data items of step three as follows:
after the input data items of step three have passed through the weight calculations of each layer of the AI translation model, the output layer yields the raw weights of K different character symbols, which are normalized into probabilities by a Softmax function; during training, an optimization function is used and the weights of all layers of the AI translation model are updated by backpropagation, with the optimization target of minimizing the difference between the output symbol and the true target symbol; the optimization function uses cross entropy to calculate the difference between the normalized probability distribution of the K different character symbols and the probability distribution of the true target symbol, and minimizes that difference; after the difference between the output symbol and the true target symbol has been minimized, the SMILES sequence formed by the character symbols with the highest normalized probability among the K different character symbols is taken as the output data item of step four;
wherein K is an integer greater than or equal to 0; when the character symbol with the highest normalized probability is the ". multidot." symbol in SMILES, the output data item in step four is a blank compound.
4. The method for constructing a reagent compound prediction model according to claim 1, wherein step one is specifically: collecting standard names of reagent compounds, converting them to SMILES using the open chemical name conversion tool https://opsin.ch.cam.ac.uk/, and combining them with accumulated SMILES of reagent compounds to form the reagent compound restriction data table containing the SMILES of >2000 reagent compounds.
5. The method for constructing a reagent compound prediction model according to claim 1, wherein step two is specifically: collecting chemical reaction formulas from published chemical patents and processing them into SMILES, 1.5 million to 2.3 million reactions being collected in total, of which 90-97% are used as the training set and the remainder as the test set.
6. The method of claim 1, wherein n is 10 or less.
7. The method for constructing a reagent compound prediction model according to claim 1, wherein the reaction prediction model is the model of Philippe Schwaller et al., Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Central Science, 2019 Sep 25; 1572-1583.
8. A method for automatic predictive completion of a chemical reaction reagent, characterized in that a reagent compound for completion is obtained by using the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 7.
9. The method for automatic predictive completion of a chemical reaction reagent according to claim 8, comprising the following steps:
A1, inputting the chemical reaction formula SMILES data to be completed, as an input data item, into the reagent compound prediction model to obtain an output data item of the reagent compound prediction model;
A2, supplementing the output data item of step A1 into the chemical reaction formula to be completed and inputting it into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item of step A1 is the reagent compound for completion.
10. A device for automatic predictive completion of a chemical reaction reagent, characterized by comprising an analysis module, a prediction module and a completion module, wherein the analysis module is used for generating a reagent compound for completion according to input chemical reaction formula SMILES data to be completed; the analysis module is loaded with:
the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 7, and the reaction prediction model.
CN202111214079.4A 2021-10-19 2021-10-19 Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent Active CN113990405B (en)

Publications (2)

Publication Number Publication Date
CN113990405A true CN113990405A (en) 2022-01-28
CN113990405B CN113990405B (en) 2024-05-31

Family

ID=79739298

Country Status (1)

Country Link
CN (1) CN113990405B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200027528A1 (en) * 2017-09-12 2020-01-23 Massachusetts Institute Of Technology Systems and methods for predicting chemical reactions
CN110827925A (en) * 2018-08-07 2020-02-21 国际商业机器公司 Intelligent personalized chemical synthesis planning
KR20200072585A (en) * 2018-11-30 2020-06-23 이율희 Method for predicting the HAZARD and RISK of target chemicals BASED ON AI
CN111524557A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence
US20210125060A1 (en) * 2019-10-29 2021-04-29 Samsung Electronics Co., Ltd Apparatus and method of optimizing experimental conditions using neural network
CN113380346A (en) * 2021-06-08 2021-09-10 河南大学 Coupling reaction yield intelligent prediction method based on attention convolution neural network
CN113380345A (en) * 2021-06-08 2021-09-10 河南大学 Organic chemical coupling reaction yield prediction and analysis method based on deep forest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PHILIPPE SCHWALLER ET AL.: "Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction", ACS CENTRAL SCIENCE *
张良顺: "有机合成中化学反应的机器学习", 功能高分子学报 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114530208A (en) * 2022-02-18 2022-05-24 中山大学 Planning method and system for chemical reverse synthesis path
CN115201431A (en) * 2022-07-20 2022-10-18 广州蓝勃生物科技有限公司 Reagent development experimental method and device
CN115312135A (en) * 2022-07-21 2022-11-08 苏州沃时数字科技有限公司 Method, system, device and storage medium for predicting chemical reaction conditions
CN115312135B (en) * 2022-07-21 2023-10-20 苏州沃时数字科技有限公司 Chemical reaction condition prediction method, system, device and storage medium
WO2024032096A1 (en) * 2022-08-09 2024-02-15 腾讯科技(深圳)有限公司 Reactant molecule prediction method and apparatus, training method and apparatus, and electronic device
CN118136140A (en) * 2024-05-06 2024-06-04 烟台国工智能科技有限公司 Organic synthesis multistage reaction condition prediction method and device
CN118136140B (en) * 2024-05-06 2024-07-23 烟台国工智能科技有限公司 Organic synthesis multistage reaction condition prediction method and device

Also Published As

Publication number Publication date
CN113990405B (en) 2024-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant