CN113990405B - Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent - Google Patents

Info

Publication number
CN113990405B
Authority
CN
China
Prior art keywords
reagent
data item
smiles
reagent compound
reaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111214079.4A
Other languages
Chinese (zh)
Other versions
CN113990405A (en)
Inventor
陈德铭
马汝建
陈志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Apptec Co Ltd
Original Assignee
Wuxi Apptec Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Apptec Co Ltd
Priority to CN202111214079.4A
Publication of CN113990405A
Application granted
Publication of CN113990405B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)

Abstract

The invention discloses a method for constructing a reagent compound prediction model, and a method and a device for automatically predicting and completing a chemical reaction reagent. The construction method comprises: representing reagent compounds as character sequences in SMILES form and generating a reagent compound limit data table; representing chemical reaction formulas as character sequences in SMILES form; deleting from the chemical reaction formula SMILES data the reagent compounds present in the reagent compound limit data table, taking the result as an input data item and the deleted SMILES data of the reagent compounds as a target data item; generating, through artificial intelligence deep learning, an output data item closest to the target data item; and supplementing the output data item into the chemical reaction formula and inputting it into a reaction prediction model, the output data item being output as a predicted reagent compound if the predicted product is consistent with the original reaction product. The model thus automatically predicts and completes chemical reagents.

Description

Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent
Technical Field
The invention relates to the field of pharmaceutical chemistry application, in particular to a method for constructing a reagent compound prediction model, and a method and a device for automatically predicting and completing a chemical reaction reagent.
Background
In pharmaceutical chemistry applications, the organic synthesis of new chemical molecules requires that the chemical reactions involved (whether envisaged by organic chemists or generated virtually by computer algorithms) be predicted and judged in advance, so as to avoid the loss and waste caused by failed experiments. In an automated synthesis apparatus, a chemical reaction can proceed only if information on the reaction conditions (in particular, the compounds used as reagents) is available. The automatic prediction of chemical reaction reagents, or the completion of missing reagents, is therefore an important link in realizing automated organic synthesis design.
Artificial intelligence (AI), particularly deep learning models trained on large amounts of organic chemical reaction data, provides an effective modeling approach to forward reaction prediction and reverse reaction generation for the automation of organic synthesis. When modeling reverse reaction generation, AI models generally consider only the reactants that are converted and ignore the related reagent information (C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Central Science, 3, 2017. Connor W Coley, William H Green, and Klavs F Jensen. Rdchiral: An rdkit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of chemical information and modeling, 59(6):2529–2537, 2019.). The few AI models that directly predict both reactants and reagents have very low accuracy (Top-10 prediction accuracy <25%) (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence volume 3, pages 485–494, 2021), so the accuracy of the reagent information also needs to be greatly improved. Automatically predicting and completing the reagents of chemical reactions with incomplete information is therefore a key link in raising the reaction prediction accuracy of AI models to an experimentally executable level.
Even when an organic synthesis route is designed by a chemist, each synthesis step often omits the conditions, including the reagent information, because the design focuses on the reaction conversion itself. The chemist then usually searches for similar existing experimental reactions, collects their reaction conditions, and adjusts and judges them from experience. Automatically and accurately predicting and completing the reagents would therefore effectively improve the efficiency with which chemists design and implement organic synthesis routes by hand.
The AI models currently applied to organic synthesis route design focus mainly on retrosynthesis and reaction judgment, for which numerous models and methods have been developed. As FIG. 1 shows (Ryan-Rhys Griffiths, Philippe Schwaller, Alpha A. Lee, Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design, NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018), retrosynthesis, or synthesis design, means predicting, given a desired compound product, the reactant compounds and related information that produce it. Reaction prediction/judgment means, given the reactants and related information (e.g., reaction condition information) of a chemical reaction, predicting or generating its reaction products, or judging whether the complete reaction containing a given reaction product is correct. The technical problem addressed by the present invention is reagent prediction (or reaction planning), i.e., predicting the reaction conditions (mainly chemical reagents) required for given reactants and a desired reaction product. Reagent prediction is very challenging because the role of a compound reagent as a reaction condition is only vaguely defined, and it is hard to predict uniquely because reaction conditions are diverse; it has therefore rarely been studied in AI and deep learning. However, the lack of an efficient method for automatic chemical reagent completion is a key gap that prevents retrosynthesis and reaction prediction AI for organic route design from meeting industry expectations and applications.
In automated organic synthesis route design, the unknown target molecule of each step is virtually decomposed into corresponding candidate reactants. Whether these candidate reactants can actually react in an experiment and generate the target molecule depends on reasonable reaction conditions, among which the chemical reagents (in the broad sense also including solvents and catalysts) are decisive. Other conditions such as temperature and pressure (usually normal temperature and pressure) are more conventional and can be adjusted for the same combination of compounds. A chemical reagent, by contrast, is a compound that, like a reactant, must be prepared (purchased or requisitioned) before the reaction experiment. Accurately judging and selecting the chemical reagents therefore plays a key role second only to that of the reactants and can determine whether the reaction succeeds.
In retrosynthesis, AI models generally consider only the reactants that are converted and do not consider the related reagent information. The few AI models that directly predict reactants and reagents have very low accuracy: the true answer matches any one of the model's first ten predictions (Top-10) in <25% of cases (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence volume 3, pages 485–494, 2021), so the accuracy of the reagent information needs to be greatly improved.
In current condition prediction research, expert rules are generally written for specific types of reactions; their effectiveness and scale are limited by expert knowledge and experience, and they cannot be applied automatically and at scale to general reactions. Current condition prediction for reactions of unspecified type mainly relies (Hanyu Gao, Thomas J. Struble, Connor W. Coley, Yuran Wang, William H. Green, and Klavs F. Jensen, Using Machine Learning To Predict Suitable Conditions for Organic Reactions, ACS Cent. Sci., 4, 1465-1476, 2018) on prior knowledge of the classification of reaction roles, which requires a large amount of manual labeling, and the number of reagents to be predicted is at most 2. In a real reaction, this role classification is not available before the reaction product is determined, and the number of reagents varies and often cannot satisfy the hard limit of at most 2. Specifically, that method builds its model on the unique identifiers (IDs) of condition compounds classified by reaction role (catalysts, solvents, reagents, etc.) in the non-public reaction data of Reaxys (https://www.reaxys.com/), which have been labeled manually over decades. Even though its training and test data are restricted to predicting at most 1 catalyst, 2 solvents, and 2 reagents, its accuracy is still some distance from that required for real organic synthesis: for chemical reagents in the broad sense (including catalysts, solvents, and reaction reagents), it reaches only 57.3% accuracy over the first three predictions (Top-3) and 66.0% over the first ten (Top-10), far from the accuracy required by practical industrial application.
A method that uses reagent compound IDs as the prediction result pays no attention to the effect of the predicted or completed reagent on reaction judgment: a predicted ID contains no chemical information about the predicted reagent and cannot be effectively combined with the chemical information of the reactants of the envisaged reaction to judge the reaction in a chemical sense.
Therefore, to raise the accuracy of forward and reverse reaction prediction by AI models in automated organic synthesis to the experimentally executable level required for industrial application, a method and a device for automatically predicting reaction reagents or completing missing reagents need to be established, and they need to directly improve the accuracy of judging envisaged or virtual reactions.
Disclosure of Invention
In view of the gap between current technology and the industrial demands of organic synthesis design, the invention solves the problem of automatic reagent prediction or missing-reagent completion, i.e., reaction planning. To this end, the present invention first provides a method for constructing a reagent compound prediction model, comprising the following steps:
Step one, representing the reagent compounds as character sequences in SMILES form to generate a reagent compound limit data table;
Step two, representing the chemical reaction formulas as character sequences in SMILES form and dividing them into a training set and a test set; the chemical reaction formulas in both the training set and the test set comprise reactants with complete information, reagent compounds with complete information, and products with complete information;
Step three, deleting, from the chemical reaction formula SMILES data in the training set, the SMILES data of the reagent compounds present in the reagent compound limit data table, taking the remaining data as input data items and the deleted SMILES data of the reagent compounds as target data items;
Step four, based on the correspondence between the input data items and the target data items of step three, the AI translation model generates, through artificial intelligence deep learning and from the input data item of step three, n output data items closest to the target data item of step three; "closest" includes the case of being completely identical; n is an integer not less than 0;
Step five, supplementing the output data item obtained in step four into the chemical reaction formula, inputting the supplemented formula into a reaction prediction model, and optimizing according to the result feedback to obtain the reagent compound prediction model: if the predicted product is consistent with the original reaction product, the output data item obtained in step four is the predicted reagent compound; if the predicted product is not consistent with the original reaction product, the output data item obtained in step four is not used as the predicted reagent compound, and it is recorded as excluded from the predicted reagent compounds of that chemical reaction formula.
In some embodiments, the AI translation model is the deep-learning-based machine translation model Transformer described in Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017 (hereinafter Vaswani 2017). The Transformer in Vaswani 2017 is a general technology that can be used for text translation, for chemical reaction product prediction, and for the reagent prediction of the present invention. The AI translation model used in the present invention is not limited to the Transformer in Vaswani 2017 and may be another Transformer.
In some embodiments, the method further comprises step six, testing the reagent compound prediction model with the test set, as follows:
S1, deleting, from the chemical reaction formula SMILES data in the test set, the SMILES data of the reagent compounds present in the reagent compound limit data table, taking the remaining data as input data items and the deleted SMILES data of the reagent compounds as target data items;
S2, based on the correspondence between the input data items and the target data items of step S1, the AI translation model generates, through artificial intelligence deep learning and from the input data item of step S1, an output data item closest to the target data item of step S1;
S3, supplementing the output data item obtained in step S2 into the chemical reaction formula and inputting the supplemented formula into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item obtained in step S2 is the predicted reagent compound;
S4, comparing the probability P that the product predicted by the reaction prediction model is consistent with the original reaction product in two cases: with the output data item obtained in step S2 supplemented, and without it; if the probability P with the output data item supplemented is significantly higher than the probability P without it, the reagent compound prediction model has been constructed successfully.
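The comparison in step S4 can be sketched as two match rates computed over the same test examples. This is only an illustration under stated assumptions: `match_rate`, `toy_predict`, and the lookup table `TOY_MODEL` are hypothetical names, the single letters are placeholders rather than real SMILES, and the toy lookup stands in for a real reaction prediction model.

```python
from typing import Callable, Iterable, Tuple

# (input item without reagents, predicted reagent SMILES, original product)
Example = Tuple[str, str, str]


def match_rate(examples: Iterable[Example],
               predict_product: Callable[[str], str],
               use_reagents: bool) -> float:
    """Fraction of examples whose predicted product equals the original product."""
    hits = 0
    total = 0
    for source, reagents, product in examples:
        left = source.split(">>")[0]
        if use_reagents and reagents:
            left = left + "." + reagents  # supplement the predicted reagents
        hits += predict_product(left) == product
        total += 1
    return hits / total if total else 0.0


# Hypothetical stand-in for a reaction prediction model (placeholder symbols).
TOY_MODEL = {"A.B.C": "D", "A.B": "X", "E.F": "G", "E.F.H": "G"}


def toy_predict(left: str) -> str:
    return TOY_MODEL.get(left, "")


examples = [("A.B>>D", "C", "D"), ("E.F>>G", "H", "G")]
p_with = match_rate(examples, toy_predict, use_reagents=True)
p_without = match_rate(examples, toy_predict, use_reagents=False)
print(p_with, p_without)  # 1.0 0.5
```

The text does not state how large the gap between the two probabilities must be to count as significant, so the sketch only reports both values.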
In some embodiments, generating in step four, through artificial intelligence deep learning, the n output data items closest to the target data item of step three is achieved as follows:
after the input data item of step three has passed through the weighted computation of each layer of the AI translation model, the output layer yields raw scores for K different character symbols, which are normalized into probabilities by a Softmax function; during training, an optimization function is used and the weights of all layers of the AI translation model are updated by backpropagation, so as to achieve the optimization target of minimizing the difference between the output symbol and the real target symbol; the optimization function uses cross entropy to compute the difference between the normalized probability distribution over the K different character symbols and the probability distribution of the real target symbol, and minimizes it; after this difference has been minimized, the SMILES sequence formed by the character symbols with the highest normalized probability among the K different character symbols is taken as the output data item of step four;
wherein K is an integer greater than or equal to 0; when the character symbol with the highest normalized probability is the "." symbol of SMILES, the output data item of step four is a blank compound.
The value of K is the number of all possible character symbols of the chemical structures. In one implementation, the character symbols include the set of elements that appear (e.g., Cl, Br, I, N, C) and the non-element symbols in SMILES (e.g., ").
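The Softmax normalization, the cross-entropy difference, and the selection of the highest-probability symbol described above can be illustrated numerically for a single output position. This is a minimal sketch, not the patented implementation: the five-symbol vocabulary `VOCAB` (so K = 5) and the raw scores are made-up values, and a real model would produce the scores from its output layer and update its weights by backpropagation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw output-layer scores into probabilities."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross entropy against a one-hot target, i.e. -log p(target symbol)."""
    return -math.log(probs[target_index])


# Hypothetical vocabulary of K = 5 character symbols and made-up raw scores.
VOCAB = ["C", "N", "O", "Cl", "."]
raw_scores = [2.0, 0.5, 1.0, -1.0, 0.0]

probs = softmax(raw_scores)
loss = cross_entropy(probs, VOCAB.index("C"))  # real target symbol is "C"
predicted = VOCAB[max(range(len(probs)), key=probs.__getitem__)]
print(predicted, loss)
```

Training lowers `loss` by raising the probability assigned to the real target symbol; at prediction time only the highest-probability symbol at each position is kept.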
In some embodiments, step one is specifically: collecting the standard names of reagent compounds, converting them to SMILES with the open chemical name conversion tool at https://opsin.ch.cam.ac.uk, and combining the accumulated SMILES of reagent compounds to form the reagent compound limit data table, which contains the SMILES of >2000 reagent compounds.
In some embodiments, step two is specifically: collecting chemical reaction formulas from published chemical patents and processing them into SMILES, for a total of 1.5 to 2.3 million reactions, of which 90% to 97% are used as the training set and the remaining reactions (i.e., 10% to 3%) as the test set.
Further, a total of 1.8 to 2.1 million reactions are collected, of which 93% to 96% are used as the training set and the remaining reactions (i.e., 7% to 4%) as the test set. Preferably, a total of 2 million ± 100 reactions are collected, of which 95% ± 0.5% are used as the training set and the remaining reactions as the test set.
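A split in the preferred 95%/5% proportion might look like the following sketch. The function name `split_reactions`, the seeded random shuffle, and the placeholder reaction strings are assumptions for illustration; the text does not say how reactions are assigned to the two sets.

```python
import random
from typing import List, Sequence, Tuple


def split_reactions(reactions: Sequence[str],
                    train_fraction: float = 0.95,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the reactions and cut them into a training set and a test set."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


reactions = [f"rxn_{i}" for i in range(1000)]  # placeholders, not real SMILES
train, test = split_reactions(reactions)
print(len(train), len(test))  # 950 50
```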
In some embodiments, n ≤ 10 in step four. Beyond requiring n ≥ 0, the invention does not need to limit the range of n: a reasonable number of reagents is obtained by generating the predicted reagents, and n = 0 predictions are allowed, meaning that the reaction needs no additional reagent. In a generally reasonable reaction, however, n ≤ 10 is typical.
In some embodiments, the reaction prediction model is the Molecular Transformer (MT) of Philippe Schwaller et al., Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction, 2019 Sep 25;5(9):1572-1583 (hereinafter SCHWALLER 2019).
In another aspect, the model constructed by the above method is used to provide a method for automatic prediction and completion of a chemical reaction reagent, i.e., the reagent compound for completion is obtained with the reagent compound prediction model built by the above construction method.
In some embodiments, the method for automatic prediction and completion of a chemical reaction reagent comprises the following steps:
A1, inputting the chemical reaction formula SMILES data to be completed into the reagent compound prediction model as an input data item, and obtaining the output data item of the reagent compound prediction model;
A2, supplementing the output data item of step A1 into the chemical reaction formula to be completed and inputting the supplemented formula into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item of step A1 is the reagent compound for completion.
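Steps A1 and A2 can be sketched as a propose-then-verify loop. The two models are passed in as plain callables; `complete_reagents`, `toy_propose`, and `toy_predict` are hypothetical names, and the toy functions below are stand-ins with made-up behavior, not real trained models.

```python
from typing import Callable, List


def complete_reagents(reaction: str,
                      propose_reagents: Callable[[str], List[str]],
                      predict_product: Callable[[str], str]) -> List[str]:
    """Keep only proposed reagents for which the predicted product is unchanged."""
    left, product = reaction.split(">>")
    accepted: List[str] = []
    for candidate in propose_reagents(reaction):  # step A1
        completed_left = left + "." + candidate if candidate else left
        if predict_product(completed_left) == product:  # step A2
            accepted.append(candidate)
    return accepted


# Hypothetical stand-ins for the reagent compound prediction model and the
# reaction prediction model; "A", "B", "P", "Q" are placeholder symbols.
def toy_propose(reaction: str) -> List[str]:
    return ["CCO", "CC#N"]


def toy_predict(left: str) -> str:
    return "P" if left.endswith(".CCO") else "Q"


result = complete_reagents("A.B>>P", toy_propose, toy_predict)
print(result)  # ['CCO']
```

An empty candidate string is treated here as "no reagent needed", which corresponds to the n = 0 case.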
In a third aspect, the present invention also provides an apparatus for automatic predictive completion of a chemical reaction reagent, including an analysis module for generating a reagent compound for completion according to inputted chemical reaction formula SMILES data to be completed; the analysis module is loaded with:
The reagent compound predictive model constructed by the method for constructing a reagent compound predictive model as described above, and the reaction predictive model.
The device also belongs to a specific application of the model constructed by the construction method.
The invention discloses a method for constructing a reagent compound prediction model, and a method and a device for automatically predicting and completing a chemical reaction reagent. The construction method comprises: representing reagent compounds as character sequences in SMILES form and generating a reagent compound limit data table; representing chemical reaction formulas as character sequences in SMILES form; deleting from the chemical reaction formula SMILES data the reagent compounds present in the reagent compound limit data table, taking the result as an input data item and the deleted SMILES data of the reagent compounds as a target data item; generating, through artificial intelligence deep learning (e.g., optimization functions, backpropagation, Softmax), an output data item closest to the target data item; and supplementing the output data item into the chemical reaction formula and inputting it into a reaction prediction model, the output data item being a predicted reagent compound if the predicted product is consistent with the original reaction product. The model thus automatically predicts and completes chemical reagents.
According to the invention, chemical reaction formulas with the reaction reagents (i.e., the reagent compounds) missing are constructed from the reactant, reagent compound, and product data in chemical reaction formulas and used as input data items; an AI machine learning model is trained to predict the chemical structures of the reaction reagents; and the predicted reaction reagents are merged into the chemical reaction to improve the accuracy of chemical reaction prediction/judgment (e.g., by a reaction prediction model), which can improve that accuracy by 33%. Compared with the prior art, the method does not require experts to classify the roles in a chemical reaction; the targets to be predicted are distinguished automatically through the reagent data, and the method is not limited by the type or number of reagents.
The construction and evaluation method of the invention can utilize the existing chemical reaction data, train the related AI reagent compound prediction model (namely the reagent compound prediction model of the invention) and automatically evaluate and verify the improvement effect of the AI reagent compound prediction model on the reaction judgment under the condition of no need of manual intervention evaluation.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 shows the relationship among retrosynthesis (synthesis design), reaction prediction/judgment (reaction prediction), and reaction (generalized) reagent prediction (reaction planning). P is the target compound/target product, i.e., the product of the reaction (to the right of "->" in the figure). Reaction (generalized) reagent prediction means predicting the compounds (reagents) required for the reaction of a known reactant (A) to give the target product (P); this is the technical problem of the invention. Reaction prediction/judgment means, given the reactants and reagents (A+B+C, to the left of "->"), predicting the product, or judging whether a product P (to the right of "->") is the product of this reaction; it is a technical problem addressed by many existing method models. Synthesis design, i.e., retrosynthesis, means predicting, given the target compound (P), the reactants that produce this target; it is likewise a technical problem addressed by many existing method models.
FIG. 2 is a flow chart of automated verification of reagent predictive data construction, reagent compound predictive model training, and reaction predictive boosting.
FIG. 3 is a reagent compound predictive model training implementation: training AI translation models to make predictions of reagent compounds in chemical reactions (model complement ".
Detailed Description
The invention is further described with reference to the following detailed description in order to make the technical means, the inventive features, the achieved objects and the effects of the invention easy to understand. The present invention is not limited to the following examples.
It should be understood that the structures, proportions, sizes, etc. shown in the drawings are for illustration purposes only and should not be construed as limiting the invention to the extent that it can be practiced, since modifications, changes in the proportions, or otherwise, used in the practice of the invention, are not intended to be critical to the essential characteristics of the invention, but are intended to fall within the spirit and scope of the invention.
The technical concept of the invention is shown in FIG. 2 and comprises, in order of execution, construction of the reagent prediction data, training of the reagent compound prediction model, and automatic verification of the improvement in reaction prediction. A reaction is written with the reactants, including potential reagents (A.B.C), to the left of "->" and the reaction product (D) to the right. By comparison and matching against the reagent compound limit data table, C in the reaction is found to be a reagent compound. Deleting the reagent compound from the chemical reaction yields an input data item (also called the source language) A.B->D (the reaction with the reagent missing) and a corresponding output/target data item C (a reagent group when there are several reagents) for training a machine learning model (e.g., a machine translation model, i.e., the AI translation model). Trained on the reagent prediction data set constructed in this manner (i.e., the data for the reagent compound prediction model), the AI translation model attempts to predict the reagent compound(s) C' when it encounters a reaction of the form A.B->D. C' may not be exactly identical to the original answer C. However, if an existing reaction prediction model (e.g., the Molecular Transformer in SCHWALLER 2019) predicts that the product of A.B.C'->D is still the desired D, the invention judges reagent compound C' to be reasonable as well; as chemists have verified, the same chemical reaction conversion can admit several reasonable reagent compound conditions (both C and C' are reasonable).
Construction of reagent prediction data: the reagent compounds that the invention predicts and fills in come from a "reagent compound limit data table". Because not all reactants are reagent compounds, the range of reagent compounds that the model can predict has to be limited. The table can be built by obtaining chemical reagent names from chemical information websites and the chemical literature and representing the corresponding structures as SMILES character strings, or by collecting the structures of reagents in the broad sense (compounds such as solvents, reagents, and catalysts) from expert experience. The table uses only the SMILES strings of the chemical reagent structures; it needs neither the names of the reagents nor an expert's strict definition of the roles of solvent, reagent, and catalyst. For example, ethanol has the chemical formula C2H6O or CH3CH2OH, and its structure in standard SMILES is "CCO". Further reagent compounds are shown in Table 1; the reagent compounds are not limited to those shown in Table 1.
Table 1. Examples of reagent compounds
Reagent compound name      Reagent structure (SMILES)
Ethanol                    CCO
Propylene                  CC=C
Acetonitrile               CC#N
Methyl ethyl carbonate     COC(=O)CC
The reagent prediction data are constructed from existing chemical reaction data that contain the complete set of reagent compounds: the reagent compounds present in the "reagent compound limit data table", and their combinations, are deleted from each reaction; the reaction with the reagent compounds deleted becomes the input data item, and the deleted reagent compounds, combined, become the target data item. The comparison can be performed by SMILES matching. The complete chemical reaction data are likewise represented as SMILES, with individual compounds separated by "." and with ">>" separating the reaction input compounds (i.e. the reactants) from the reaction products. This method of constructing the missing-reagent prediction data set supplies the data used by the AI machine learning model (machine learning model for short) automatically, without manual intervention: the paired input data items (incomplete reaction SMILES) and target data items (SMILES of the reagent combination to be predicted) can be used to train the AI translation model and to evaluate its performance. In application, the chemical structures of the reagent compounds to be supplemented are predicted directly from the chemical information features extracted from the input reaction, with no restriction on reagent type, number or combination, which better matches real scenarios.
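The construction of one input/target pair described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_training_pair`, the four-entry reagent table and the complete reaction string are assumptions made for the example, and matching is by exact string comparison (a real pipeline would presumably canonicalize the SMILES first). The "." target for a reaction with no reagents follows the blank-compound convention stated in claim 3.

```python
from typing import Set, Tuple

BLANK = "."  # blank compound symbol (claim 3), used when no reagent was deleted


def build_training_pair(reaction: str, reagent_table: Set[str]) -> Tuple[str, str]:
    """Split a complete reaction SMILES into (input data item, target data item)."""
    left, product = reaction.split(">>")
    kept, removed = [], []
    for smiles in left.split("."):
        # Exact string match only; canonicalization is deliberately left out here.
        (removed if smiles in reagent_table else kept).append(smiles)
    source = ".".join(kept) + ">>" + product
    target = ".".join(removed) if removed else BLANK
    return source, target


# Hypothetical miniature "reagent compound limit data table".
reagent_table = {"CCO", "CC#N", "ClCCl", "O=C1CCC(=O)N1Br"}

# Illustrative complete reaction: reactant, two reagents, then the product.
reaction = (
    "CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O.ClCCl.O=C1CCC(=O)N1Br"
    ">>CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O"
)
source, target = build_training_pair(reaction, reagent_table)
print(source)
print(target)
```

Keeping the deleted reagents in their original order is a design choice of this sketch; the text does not say how the target combination is ordered.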
Reagent compound prediction model training: an AI translation model (e.g. the deep-learning machine translation model Transformer in Vaswani 2017, which can be used for general translation tasks) is trained on the constructed missing-reagent reaction training data set. The input data are the simulated reactions in which the reagent compounds have been deleted but the reactants and products retained, and the target output data are the sets of reagent compounds deleted from those reactions. For each input-target pair in the training data set, i.e. each paired input data item (incomplete reaction SMILES) and target data item (SMILES of the reagent combination to be predicted), the AI translation model receives the input data and is trained with the optimization objective of generating output that matches the corresponding target data as closely as possible. The result is a reagent compound prediction model for chemical reactions.
Automatic verification of the improvement in reaction prediction: the trained reagent compound prediction model is used to predict and supplement reagents for each reaction containing at least one reactant in the constructed verification test set, and the reaction is rebuilt as a reaction supplemented with the predicted reagent information. If the predicted reagent is a blank compound, the original reaction needs no additional reagent information and the rebuilt reaction is a copy of the original. For one or more existing trained reaction prediction machine learning models, products are predicted, with the original products removed, both for the baseline reactions without supplemented reagent compounds and for the reactions supplemented with the predicted reagent compounds; the probability of correct product prediction and the gain in the proportion of correct predictions relative to the baseline reactions are then verified. When several existing reaction prediction machine learning models are available, the model with the largest improvement index is selected as the reaction prediction model to be used together with the reagent compound prediction model. In the application stage, the reagent compound prediction model and the selected reaction prediction model are used as a combined component, which improves the accuracy of reaction judgment in synthesis design.
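The selection of the reaction prediction model with the largest improvement can be sketched as below. The text does not define the improvement index precisely, so this sketch assumes it is the absolute gain in Top-1 accuracy between the baseline reactions and the reagent-supplemented reactions. The function names and the two dictionary-backed toy models are hypothetical stand-ins for trained reaction prediction models.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A reaction prediction model maps reactant-side SMILES to a predicted product.
Predictor = Callable[[str], Optional[str]]
Case = Tuple[str, str]  # (reactant-side SMILES, original product)


def top1_accuracy(model: Predictor, cases: List[Case]) -> float:
    """Fraction of cases whose predicted product equals the original product."""
    hits = sum(1 for reactants, product in cases if model(reactants) == product)
    return hits / len(cases)


def select_reaction_model(
    models: Dict[str, Predictor],
    baseline_cases: List[Case],
    completed_cases: List[Case],
) -> Tuple[str, Dict[str, float]]:
    """Return the model name with the largest accuracy gain, plus all gains."""
    gains = {
        name: top1_accuracy(m, completed_cases) - top1_accuracy(m, baseline_cases)
        for name, m in models.items()
    }
    return max(gains, key=gains.get), gains


# Toy data in the A.B -> D notation of the description.
baseline_cases = [("A.B", "D"), ("E", "F")]
completed_cases = [("A.B.C", "D"), ("E.G", "F")]
models = {
    "model_a": {"A.B": "X", "A.B.C": "D", "E": "F", "E.G": "F"}.get,
    "model_b": {"A.B": "X", "A.B.C": "X", "E": "F", "E.G": "F"}.get,
}
best, gains = select_reaction_model(models, baseline_cases, completed_cases)
print(best, gains)
```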
Example 1
In this example, the reagent compound prediction model is trained as a machine translation task: a deep-learning machine translation model is trained for reagent prediction on the constructed missing-reagent reaction training data set.
Step one: the standard names of relevant chemical reagents are collected and converted to SMILES with the open chemical name conversion tool (https://opsin.ch.cam.ac.uk/); these are combined with reagent SMILES accumulated by experts to form a "reagent compound limit data table" of more than 2000 reagent SMILES.
Step two: chemical reaction formulas are collected from published chemical patents and processed into SMILES. About 2 million reactions are collected in total, of which 95% (about 1.8 million) are used as the training set and the remaining 5% (about 95,000) as the test set.
Step three: the reagents present in the complete, reagent-containing chemical reaction SMILES data, and their combinations, are deleted to obtain the reagent-missing reaction SMILES as the input data item, and the deleted reagent combination is taken as the target data item. When a reaction needs no supplementary reagent, a blank compound (which can be represented by the "." symbol of SMILES) is used as the target data item.
Step four: the chemical structures of the compounds, including the reactants and reagents in the broad sense, can be represented as character sequences in SMILES form, so generating the character sequence of the missing reagent combination from the character sequence of the reaction can be treated as a machine translation problem from a "source language" to a "target language". The "source language" data are the simulated reactions with the reagent compounds deleted but the reactants and products retained, and the "target language" data are the sets of reagent compounds deleted from those reactions. For each "source language"-"target language" reaction pair, an AI translation model (such as the Transformer in Vaswani 2017) is optimized to generate from the "source language" the translation closest to the "target language", as shown in Fig. 3. In Fig. 3, the character string to the left of ">>" represents the reactants (with the reagents missing), and the string to the right is the reaction product. The Transformer then generates the character symbols (tokens) of the predicted reagent combination SMILES one by one. After the input has passed through the weights of each Transformer layer, the output layer of the model yields the raw weights z_i (> 0), i = 1, 2, …, K, of K different tokens (K = 5 in this example). Softmax converts them to normalized probabilities, and the token with the largest probability, Br (= 0.90), is taken as the final output.
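Step four treats a reaction SMILES as a sequence of character symbols (tokens), with multi-character symbols such as Br counted as one token. The patent does not specify its tokenizer, so the regular expression below is only an assumed, simplified scheme: bracket atoms, Br, Cl and ">>" are kept whole, and everything else is split into single characters.

```python
import re
from typing import List

# Assumed token pattern: bracket atom | Br | Cl | >> | letter | digit | other symbol.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|>>|[A-Za-z]|[0-9]|[^A-Za-z0-9\s]")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES or reaction SMILES string into tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        # Happens when the input contains whitespace, which is not valid here.
        raise ValueError(f"cannot tokenize {smiles!r}")
    return tokens


print(tokenize("O=C1CCC(=O)N1Br"))
print(tokenize("CCBr>>CCO"))
```

In this scheme a two-digit ring closure such as `%12` is split into three tokens, which is acceptable for an illustration but may differ from the tokenizer used with the actual model.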
The output is a combination of reagents, since one reaction may contain several reagents at the same time. The structure shown in the example corresponds to O=C1CCC(=O)N1Br; the ellipsis stands for the other reagents in the combination predicted by the model.
Specifically, the Transformer in Vaswani 2017 generates the character symbols (tokens) of the SMILES of the reagent combination to be predicted one by one through a multi-layer encoder and attention mechanism; the missing-reagent reaction input, also in the form of tokens, consists of the reactants, ">>" and the product. After the input has passed through the weights of each Transformer layer, the output layer of the model yields the raw weights z_i (> 0), i = 1, 2, …, K, of K different tokens (K = 5 in this example), and Softmax, $p_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$, converts them to normalized probabilities (e = 2.71828… is the base of the natural logarithm). During training, an optimization function is used and the weights of every layer of the AI translation model are updated by backpropagation, so as to minimize the difference between the output symbol and the real symbol. The optimization function may use cross entropy to compute, and minimize, the difference between the model's output probability distribution over the K = 5 symbols, [0.02, 0.90, 0.05, 0.01, 0.02], and the probability distribution of the real target symbol; if Br is the real symbol in the training data, its distribution is [0.0, 1.0, 0.0, 0.0, 0.0]. When the reagent compound prediction model is tested, the distribution of the real symbols is unknown, and the character symbol sequence with the highest probability is output directly as the output data item; in the figure, the symbol Br (= 0.90), which has the highest probability, is the final output.
The reagent compound prediction model outputs the reagent combination with the highest probability, since one reaction may contain several reagents at the same time. The structure shown in the example of Fig. 3 corresponds to O=C1CCC(=O)N1Br, and the ellipsis stands for the other reagents that the reagent compound prediction model predicts in this combination.
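The Softmax normalization, the cross-entropy objective and the highest-probability choice described in step four can be worked through numerically with the K = 5 distribution quoted in the text. In this sketch the vocabulary order and the raw weights are assumptions: only Br and its probability 0.90 come from the text, and the raw weights are chosen (log-probability plus a constant 5, which keeps them positive) so that Softmax reproduces the quoted distribution.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """p_i = exp(z_i) / sum_j exp(z_j), computed with the usual max shift."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_true: List[float], p_model: List[float]) -> float:
    """H(p_true, p_model) = -sum_i p_true_i * ln(p_model_i)."""
    return -sum(t * math.log(q) for t, q in zip(p_true, p_model) if t > 0)


# Hypothetical token order; only the position of Br is implied by the text.
vocab = ["C", "Br", "N", "O", "Cl"]
quoted = [0.02, 0.90, 0.05, 0.01, 0.02]      # model output quoted in the text
raw = [math.log(p) + 5.0 for p in quoted]    # assumed raw weights z_i (> 0)

p_model = softmax(raw)
p_true = [0.0, 1.0, 0.0, 0.0, 0.0]           # Br is the real symbol

loss = cross_entropy(p_true, p_model)        # equals -ln(p_model for Br)
predicted = vocab[p_model.index(max(p_model))]
print(predicted, round(loss, 4))
```

With a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the real symbol, so training pushes that probability towards 1.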
Step five: the trained reagent compound prediction model (Reagent Transformer, RT) can be paired with a machine-translation-type reaction prediction model (such as the Molecular Transformer in Schwaller 2019, abbreviated MT) for verification. Specifically, for a test reaction, RT first predicts the reagents; if the prediction is a blank compound, no additional reagent is needed. The reaction completed with the predicted reagents is input to MT, and if the Top-1 (first-ranked) predicted product is consistent with the original reaction product, the reagents predicted by RT are adopted.
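The RT-then-MT check of step five can be sketched as a small function. The two models are passed in as plain callables; `stub_rt` and `stub_mt` below are hypothetical stand-ins that mimic the behaviour reported for the worked example in this document (the correct bromo product only when O=C1CCC(=O)N1Br is present, otherwise the iodo analogue) and are not the trained models. Running the MT check even when RT returns the blank compound is an assumption of this sketch.

```python
from typing import Callable, Tuple

BLANK = "."  # blank compound symbol (claim 3)


def complete_and_verify(
    incomplete: str,
    rt: Callable[[str], str],
    mt: Callable[[str], str],
) -> Tuple[str, bool]:
    """Supplement RT's reagents, then accept them only if MT's Top-1 product matches."""
    reactants, product = incomplete.split(">>")
    reagents = rt(incomplete)
    if reagents != BLANK:  # blank compound: nothing to supplement
        reactants = reactants + "." + reagents
    accepted = mt(reactants) == product
    return reactants + ">>" + product, accepted


REACTANT = "CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O"
PRODUCT = "CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O"
NBS = "O=C1CCC(=O)N1Br"


def stub_rt(incomplete: str) -> str:
    """Stand-in for RT: always returns the reagent group of the worked example."""
    return "ClCCl." + NBS


def stub_mt(reactants: str) -> str:
    """Stand-in for MT: right product only when the brominating reagent is present."""
    if NBS in reactants.split("."):
        return PRODUCT
    return PRODUCT.replace("(Br)", "(I)")


completed, accepted = complete_and_verify(REACTANT + ">>" + PRODUCT, stub_rt, stub_mt)
print(completed)
print(accepted)
```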
Step six: test set results and test accuracy. The results were produced by running experiments with the test data and related models directly; they are not quoted from the data sets and result figures in Schwaller 2019.
Test effect 1: the reagent compound prediction model (RT) was trained on 1.8 million reactions from a database extracted from published patent reactions, and reaction prediction was tested on approximately 95,000 reactions independent of the training reactions. For comparison with the incomplete, reagent-missing reactions, the verification experiment used a reaction prediction model (MT) trained on the corresponding data to predict the reactions from the reagent-missing reaction data above. Without the RT of the invention, the MT Top-1 accuracy was 69% (the baseline reactions; the result of the MT of Schwaller 2019 on the present test data set). After RT predicted the reagents for the reagent-missing reactions, the RT+MT Top-1 accuracy was 92%. Overall, RT raised the reaction prediction accuracy of MT by 23 points (69% -> 92%, a 33% relative improvement).
Test effect 2: on a test set of real-scenario organic synthesis reactions (127 reactions) whose reaction conditions needed to be supplemented, reactions completed by the reagent compound prediction model (RT) raised the proportion of correct predictions by the forward reaction prediction model (MT) by 90.7% (an absolute gain of 39 points for RT+MT).
Test effect 3: on a test of infeasible reactions (20 reactions) extracted from actual patent reactions and confirmed as erroneous by chemists, the reagent compound prediction model (RT) effectively controlled the rate at which erroneous reactions were judged correct (5%; 1 reaction was misjudged as passing by MT).
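The figures in the three test effects can be cross-checked with a few lines of arithmetic. The baseline for test effect 2 is not stated in the text; the value computed below is only what the two quoted numbers would imply if the 90.7% figure is read as a relative improvement, which is an assumption of this sketch.

```python
from typing import Tuple


def improvement(baseline_pct: float, combined_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = combined_pct - baseline_pct
    relative = 100.0 * absolute / baseline_pct
    return absolute, relative


# Test effect 1: MT alone 69 %, RT+MT 92 %.
abs_gain, rel_gain = improvement(69.0, 92.0)
print(f"Test effect 1: +{abs_gain:.0f} points, {rel_gain:.1f}% relative")

# Test effect 2: absolute gain 39 points, relative gain 90.7 % (assumed reading).
implied_baseline = 39.0 / 0.907
print(f"Test effect 2: implied baseline about {implied_baseline:.1f}%")

# Test effect 3: 1 of 20 erroneous reactions was wrongly passed.
false_pass_rate = 100.0 * 1 / 20
print(f"Test effect 3: false pass rate {false_pass_rate:.0f}%")
```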
Test result example:
Reaction with the reagents missing, in SMILES form:
CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O>>CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O
In structural form, the reactant to the left of ">>" is drawn on the left and the product to the right of ">>" on the right (display omitted).
If the reagents predicted by the reagent compound prediction model RT are not supplemented, the product predicted by the reaction prediction model (MT) is incorrect (the predicted product replaces the Br in the correct product on the right with I; display omitted).
The reagent group predicted by RT is ClCCl.O=C1CCC(=O)N1Br (structural form; display omitted).
With the reagent group predicted by RT supplemented, MT predicts the correct product given above.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (9)

1. The method for constructing the reagent compound prediction model is characterized by comprising the following steps of:
Step one, representing the reagent compound by adopting a character sequence in a SMILES form to generate a reagent compound limit data table;
Step two, representing the chemical reaction formula by adopting a character sequence in an SMILES form, and dividing the chemical reaction formula into a training set and a testing set; the chemical reaction formulas in the training set and the testing set both comprise reactants with complete information, reagent compounds with complete information and products with complete information;
Step three, deleting, from the chemical reaction formula SMILES data in the training set, the SMILES data of the reagent compounds found in the reagent compound limit data table, to serve as input data items, and taking the deleted SMILES data of the reagent compounds as target data items;
Step four, for the correspondence between the input data item of step three and the target data item of step three, generating, by the AI translation model through artificial intelligence deep learning, n output data items closest to the target data item of step three according to the input data item of step three; "closest" includes the case of being completely consistent; n is an integer not less than 0;
Step five, supplementing the output data item obtained in step four into the chemical reaction formula, inputting it into a reaction prediction model, and obtaining the reagent compound prediction model through optimization according to the result feedback: if the predicted product is consistent with the original reaction product, the output data item obtained in step four is the predicted reagent compound; if the predicted product is not consistent with the original reaction product, the output data item obtained in step four is not used as the predicted reagent compound, and it is recorded that this output data item is excluded as a predicted reagent compound for the chemical reaction formula.
2. The method for constructing a reagent compound prediction model according to claim 1, further comprising step six, testing the reagent compound prediction model with the test set, as follows:
S1, deleting, from the chemical reaction formula SMILES data in the test set, the SMILES data of the reagent compounds found in the reagent compound limit data table, to serve as input data items, and taking the deleted SMILES data of the reagent compounds as target data items;
S2, for the correspondence between the input data item of step S1 and the target data item of step S1, generating, by the AI translation model through artificial intelligence deep learning, the output data item closest to the target data item of step S1 according to the input data item of step S1;
S3, supplementing the output data item obtained in step S2 into the chemical reaction formula and inputting it into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item obtained in step S2 is the predicted reagent compound;
S4, comparing the probability P that the product predicted by the reaction prediction model is consistent with the original reaction product in two cases: with the output data item obtained in step S2 supplemented, and without it; if the probability P with the output data item supplemented is significantly higher than the probability P without it, the reagent compound prediction model has been constructed successfully.
3. The method for constructing a reagent compound prediction model according to claim 1, wherein in step four the n output data items closest to the target data item of step three are generated through artificial intelligence deep learning as follows:
after the input data item of step three has passed through the weights of each layer of the AI translation model, the output layer yields the raw weights of K different character symbols, which are normalized to probabilities by the Softmax function; during training, an optimization function is used and the weights of every layer of the AI translation model are updated by backpropagation, so as to minimize the difference between the output symbol and the real target symbol; the optimization function uses cross entropy to compute, and minimize, the difference between the normalized probability distribution of the K different character symbols and the probability distribution of the real target symbol; after this difference has been minimized, the SMILES sequence formed by the character symbols with the largest normalized probability among the K different character symbols is taken as the output data item of step four;
wherein K is an integer greater than or equal to 0; when the character symbol with the highest normalized probability is the "." symbol of SMILES, the output data item of step four is a blank compound.
4. The method for constructing a reagent compound prediction model according to claim 1, wherein step one is specifically: standard names of reagent compounds are collected and converted to SMILES using the open chemical name conversion tool https://opsin.ch.cam.ac.uk, and combined with accumulated SMILES of reagent compounds to form the reagent compound limit data table containing the SMILES of more than 2000 reagent compounds.
5. The method for constructing a reagent compound prediction model according to claim 1, wherein step two is specifically: chemical reaction formulas are collected from published chemical patents and processed into SMILES, 1.5 to 2.3 million reactions being collected in total, of which 90-97% are used as the training set and the remaining reactions as the test set.
6. The method for constructing a reagent compound prediction model according to claim 1, wherein n ≤ 10.
7. A method for automatic prediction and completion of a chemical reaction reagent, characterized in that a reagent compound for completion is obtained using the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 6.
8. The method for automatic prediction and completion of a chemical reaction reagent according to claim 7, comprising the following steps:
A1, inputting the chemical reaction formula SMILES data to be completed into the reagent compound prediction model as an input data item, and obtaining the output data item of the reagent compound prediction model;
A2, supplementing the output data item of step A1 into the chemical reaction formula to be completed and inputting it into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item of step A1 is the reagent compound for completion.
9. A device for automatic prediction and completion of a chemical reaction reagent, characterized by comprising an analysis module, wherein the analysis module is used to generate a reagent compound for completion from the input chemical reaction formula SMILES data to be completed; the analysis module is loaded with:
the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 6, and the reaction prediction model.
CN202111214079.4A 2021-10-19 2021-10-19 Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent Active CN113990405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111214079.4A CN113990405B (en) 2021-10-19 2021-10-19 Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent

Publications (2)

Publication Number Publication Date
CN113990405A CN113990405A (en) 2022-01-28
CN113990405B true CN113990405B (en) 2024-05-31

Family

ID=79739298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111214079.4A Active CN113990405B (en) 2021-10-19 2021-10-19 Method for constructing reagent compound prediction model, method and device for automatic prediction and completion of chemical reaction reagent

Country Status (1)

Country Link
CN (1) CN113990405B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115312135B (en) * 2022-07-21 2023-10-20 苏州沃时数字科技有限公司 Chemical reaction condition prediction method, system, device and storage medium
CN115240786A (en) * 2022-08-09 2022-10-25 腾讯科技(深圳)有限公司 Method for predicting reactant molecules, method for training reactant molecules, device for performing the method, and electronic apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827925A (en) * 2018-08-07 2020-02-21 国际商业机器公司 Intelligent personalized chemical synthesis planning
KR20200072585A (en) * 2018-11-30 2020-06-23 이율희 Method for predicting the HAZARD and RISK of target chemicals BASED ON AI
CN111524557A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence
CN113380346A (en) * 2021-06-08 2021-09-10 河南大学 Coupling reaction yield intelligent prediction method based on attention convolution neural network
CN113380345A (en) * 2021-06-08 2021-09-10 河南大学 Organic chemical coupling reaction yield prediction and analysis method based on deep forest

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10622098B2 (en) * 2017-09-12 2020-04-14 Massachusetts Institute Of Technology Systems and methods for predicting chemical reactions
KR20210050952A (en) * 2019-10-29 2021-05-10 삼성전자주식회사 Apparatus and method for optimizing experimental conditions neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
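Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction; Philippe Schwaller et al.; ACS Central Science; full text *
Machine Learning of Chemical Reactions in Organic Synthesis; Zhang Liangshun; Journal of Functional Polymers; full text *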

Also Published As

Publication number Publication date
CN113990405A (en) 2022-01-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant