CN113990405A

CN113990405A - Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent

Info

Publication number: CN113990405A
Application number: CN202111214079.4A
Authority: CN
Inventors: 陈德铭; 马汝建; 陈志刚
Original assignee: Wuxi Apptec Co Ltd
Current assignee: Wuxi Apptec Co Ltd
Priority date: 2021-10-19
Filing date: 2021-10-19
Publication date: 2022-01-28
Anticipated expiration: 2041-10-19
Also published as: CN113990405B

Abstract

The invention discloses a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of a chemical reaction reagent. The model construction method comprises: representing reagent compounds as character sequences in SMILES form to generate a reagent compound restriction data table; representing chemical reaction formulas as character sequences in SMILES form; deleting, from the chemical reaction formula SMILES data, the SMILES data of any reagent compound present in the reagent compound restriction data table, using the remaining data as an input data item and the deleted reagent compound SMILES data as a target data item; generating, through artificial intelligence deep learning, an output data item closest to the target data item; and supplementing the output data item into the chemical reaction formula and inputting the result into a reaction prediction model, wherein, if the predicted product is consistent with the original reaction product, the output data item is output as the predicted reagent compound. The chemical reaction reagent is then automatically predicted and completed by means of the model.

Description

Construction method of reagent compound prediction model, and method and device for automatic prediction and completion of chemical reaction reagent

Technical Field

The invention relates to the field of pharmaceutical chemistry application, in particular to a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of chemical reaction reagents.

Background

In the field of medicinal chemistry, the organic synthesis of new chemical molecules requires prediction and judgment of the chemical reactions involved (whether proposed by organic chemists or generated virtually by computer algorithms), so that the loss and waste caused by failed experiments are avoided. An automated synthesis apparatus additionally requires information on the reaction conditions (in particular, the compounds serving as reagents) under which a chemical reaction can proceed. Automatic prediction of chemical reaction reagents, or completion of missing reagents, is therefore an important link in realizing automated organic synthesis design.

Artificial intelligence (AI) trained on large amounts of organic chemical reaction data, in particular deep learning models, provides an effective modeling approach to forward reaction prediction and retrosynthesis generation for organic synthesis automation. In retrosynthesis modeling, AI models generally consider only the transformed reactants and not the relevant reagent information (C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen, Computer-Assisted Retrosynthesis Based on Molecular Similarity, ACS Central Science, 3, 2017; Connor W. Coley, William H. Green, and Klavs F. Jensen, RDChiral: An RDKit Wrapper for Handling Stereochemistry in Retrosynthetic Template Extraction and Application, Journal of Chemical Information and Modeling, 59(6): 2529-2537, 2019). The few AI models that directly predict reactants and reagents have very low accuracy (Top-10 prediction accuracy < 25%) (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence, volume 3, pages 485-494, 2021), and the accuracy of their reagent information also needs to be greatly improved. Automatic reagent prediction and completion for chemical reactions with incomplete information is therefore also a key link in raising the reaction prediction accuracy of AI models to an experimentally executable level.

Even when a chemist designs an organic synthesis route manually, each synthesis step typically omits the conditions, including reagent information, because the design process focuses on the reaction transformation. To carry out the reaction steps in the laboratory, the chemist generally searches for similar existing experimental reactions, collects their reaction conditions, and adjusts them according to experience. Accurate automatic reagent prediction and completion can therefore also effectively improve the efficiency with which chemists manually design and implement organic synthesis routes.

Current AI models applied to organic synthesis route design focus mainly on two problems, retrosynthesis and reaction judgment, and have given rise to numerous model studies and methods. As shown in FIG. 1 (Ryan-Rhys Griffiths, Philippe Schwaller, Alpha A. Lee, Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design, NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018), retrosynthesis or synthesis design means that, given a desired compound to be synthesized, the reactant compounds that can produce the desired product, and related information, are predicted; reaction prediction/judgment means that, given the reactants and associated information (e.g., reaction condition information) of a chemical reaction, its reaction product is predicted or generated, or it is determined whether a complete reaction including a given reaction product is correct. The technical problem addressed by the present invention is reagent prediction (or reaction planning), i.e., predicting the reaction conditions (mainly chemical reagents) required for a reaction given its reactants and desired product. Reagent prediction is difficult because the definition of a compound reagent acting as a condition in a reaction is fuzzy, so the answer is hard to predict uniquely; it is very challenging and has rarely been studied in the AI and deep learning field. However, the lack of an effective automatic completion method for chemical reagents is a key gap that prevents retrosynthesis and reaction prediction AI for organic route design from meeting industrial expectations and applications.

In the design of an automated organic synthesis route, the unknown target molecule in each step is virtually decomposed into corresponding candidate reactants, and it must be verified whether the candidate reactants can actually react in an experiment and produce the target molecule. This requires reasonable reaction conditions to be considered, among which chemical reagents (broadly including solvents and catalysts) are important determinants. Unlike more conventional conditions such as temperature and pressure (usually normal temperature and pressure), which can be adjusted for the same combination of compounds, a chemical reagent is a compound that, like a reactant, must be prepared (purchased and dispensed) before the reaction experiment. Accurate judgment and selection of the chemical reagent is thus second in importance only to that of the reactants and can determine whether the reaction succeeds.

In retrosynthesis, AI models typically consider only the transformed reactants and not the relevant reagent information. The few AI models that directly predict reactants and reagents have very low accuracy: the proportion of cases in which the true answer appears among the model's first ten predictions (Top-10) is < 25% (Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino, Unassisted noise reduction of chemical reaction datasets, Nature Machine Intelligence, volume 3, pages 485-494, 2021), so the accuracy of the reagent information also needs to be greatly improved.

In current condition prediction research, expert rules are generally entered for specific types of reactions; the effectiveness and scale of this approach are constrained by expert knowledge and experience, and it cannot be scaled up automatically and applied to general reactions. Currently, the main work on condition prediction for non-specific reactions is that of Gao et al. (Hanyu Gao, Thomas J. Struble, Connor W. Coley, Yuran Wang, William H. Green, and Klavs F. Jensen, Using Machine Learning To Predict Suitable Conditions for Organic Reactions, ACS Cent. Sci., 4, 1465-). In a real reaction, the role classification is not available before the reaction product is determined, and a real reaction often involves several reagents, so a hard limit of at most 2 cannot be satisfied. That method relies specifically on the unique identifiers (IDs) of condition compounds, divided by reaction role, in non-public reaction data that has been manually annotated for decades (Reaxys, https://www.reaxys.com/): catalysts, solvents, reaction reagents and the like are classified and modeled separately. Yet even though the training and test data of that method are restricted to predicting at most 1 catalyst, 2 solvents and 2 reagents, its accuracy remains far from that required for real organic synthesis. For chemical reagents in the broad sense (including catalysts, solvents and reaction reagents), it reaches an accuracy of only 57.3% for the first three predictions (Top-3) and 66.0% for the first ten predictions (Top-10), which is a long way from what actual industrial application requires.
Moreover, a method that outputs reagent compound IDs as its prediction does not address the effect of reagent prediction/completion on reaction judgment: a predicted ID carries no chemical information about the predicted chemical reagent and cannot be combined effectively with the chemical information of the reactants of a proposed or simulated reaction to make a chemically meaningful reaction judgment.

Therefore, in order to raise the accuracy of forward and reverse reaction prediction by AI models in organic synthesis automation to the experimentally executable level required for industrial application, a method and an apparatus for automatically predicting reaction reagents or completing missing reagents need to be established, and they need to directly improve the accuracy of judging proposed/virtual reactions.

Disclosure of Invention

In view of the gap between current technology and the industrial demands of organic synthesis design, the invention solves the problem of automatically predicting reaction reagents or completing missing reagents, i.e., automatic reaction reagent prediction or reaction planning. To this end, the present invention provides, in a first aspect, a method for constructing a reagent compound prediction model, comprising the following steps:

step one, representing reagent compounds as character sequences in SMILES form to generate a reagent compound restriction data table;

step two, representing chemical reaction formulas as character sequences in SMILES form and dividing them into a training set and a test set; the chemical reaction formulas in both the training set and the test set comprise full-information reactants, full-information reagent compounds, and full-information products;

step three, deleting, from the chemical reaction formula SMILES data in the training set, the SMILES data of the reagent compounds present in the reagent compound restriction data table, using the remaining data as input data items and the deleted reagent compound SMILES data as target data items;

step four, based on the correspondence between the input data items and the target data items of step three, generating, by an AI translation model through artificial intelligence deep learning, n output data items closest to the target data items of step three from the input data items of step three; "closest" includes the case of being completely identical; n is an integer not less than 0;

step five, supplementing the output data items obtained in step four into the chemical reaction formula, inputting the result into a reaction prediction model, and obtaining the reagent compound prediction model through optimization based on the fed-back results: if the predicted product is consistent with the original reaction product, the output data item obtained in step four is taken as a predicted reagent compound; if the predicted product is inconsistent with the original reaction product, the output data item obtained in step four is not taken as a predicted reagent compound, and the output data items remaining after such cases are excluded are recorded as the predicted reagent compounds of the chemical reaction formula.
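
As a non-limiting illustration, the product-consistency check of step five can be sketched in Python as follows. The function name `filter_reagent_candidates`, the lookup-based `toy_predictor`, and the example reaction are assumptions introduced here for illustration only; a real implementation would call a trained reaction prediction model, and the plain string comparison presumes that all SMILES have been brought to one canonical form.

```python
from typing import Callable, List


def filter_reagent_candidates(
    reactants: str,
    product: str,
    candidates: List[str],
    predict_product: Callable[[str], str],
) -> List[str]:
    """Keep only the candidates whose completed reaction is predicted to give `product`."""
    accepted = []
    for reagent in candidates:
        # An empty candidate stands for "no reagent needs to be supplemented".
        left_side = f"{reactants}.{reagent}" if reagent else reactants
        if predict_product(left_side) == product:
            accepted.append(reagent)
    return accepted


def toy_predictor(left_side: str) -> str:
    """Hypothetical stand-in for a trained reaction prediction model."""
    known = {"CC(=O)O.CO.OS(=O)(=O)O": "CC(=O)OC"}
    return known.get(left_side, "")


# Two candidate reagents produced by the translation model for CC(=O)O.CO -> CC(=O)OC.
kept = filter_reagent_candidates(
    "CC(=O)O.CO", "CC(=O)OC", ["OS(=O)(=O)O", "CC#N"], toy_predictor
)
print(kept)
```

Only the candidates that survive this check would be recorded as predicted reagent compounds for the reaction.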

In some embodiments, the AI translation model is the deep learning-based machine translation model Transformer of Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (hereinafter referred to as Vaswani 2017). The Transformer in Vaswani 2017 is a general-purpose Transformer technique that can be applied to text translation, to chemical reaction product prediction, and to the reagent prediction of the invention. The AI translation model used in the present invention is not limited to the Transformer in Vaswani 2017 and may be another Transformer.

In some embodiments, the method further comprises step six, testing the reagent compound prediction model using the test set, which specifically comprises:

S1, deleting, from the chemical reaction formula SMILES data in the test set, the SMILES data of the reagent compounds present in the reagent compound restriction data table, using the remaining data as input data items and the deleted reagent compound SMILES data as target data items;

S2, based on the correspondence between the input data items and the target data items of step S1, generating, by the AI translation model through artificial intelligence deep learning, the output data item closest to the target data item of step S1 from the input data item of step S1;

S3, supplementing the output data item obtained in step S2 into the chemical reaction formula and inputting the result into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item obtained in step S2 is taken as a predicted reagent compound;

S4, comparing the probability P that the product predicted by the reaction prediction model is consistent with the original reaction product in two cases, with and without supplementing the output data item obtained in step S2; if P with the supplemented output data item is significantly higher than P without it, the reagent compound prediction model has been constructed successfully.
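
The comparison in step S4 can be illustrated with a minimal Python sketch. The helper `product_match_rate`, the lookup-based `toy_predictor`, and the two example reactions are hypothetical and introduced only for illustration; the text above does not fix a numerical threshold for "significantly higher", so the sketch merely reports the two probabilities and their difference.

```python
from typing import Callable, List, Tuple


def product_match_rate(
    reactions: List[Tuple[str, str]],
    predict_product: Callable[[str], str],
) -> float:
    """Return the fraction of (left side, original product) pairs predicted correctly."""
    if not reactions:
        return 0.0
    hits = sum(1 for left, product in reactions if predict_product(left) == product)
    return hits / len(reactions)


def toy_predictor(left_side: str) -> str:
    """Hypothetical stand-in for a trained reaction prediction model."""
    known = {
        "CC(=O)O.CO.OS(=O)(=O)O": "CC(=O)OC",
        "CCBr.N": "CCN",
    }
    return known.get(left_side, "")


# The same two test reactions, without and with the reagent predicted in step S2.
without_reagent = [("CC(=O)O.CO", "CC(=O)OC"), ("CCBr.N", "CCN")]
with_reagent = [("CC(=O)O.CO.OS(=O)(=O)O", "CC(=O)OC"), ("CCBr.N", "CCN")]

p_without = product_match_rate(without_reagent, toy_predictor)
p_with = product_match_rate(with_reagent, toy_predictor)
improvement = p_with - p_without
print(p_without, p_with, improvement)
```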

In some embodiments, the n output data items closest to the target data item of step three are generated through artificial intelligence deep learning in step four as follows:

after the input data item of step three passes through the weight calculations of each layer of the AI translation model, the output layer yields the raw weights of K different character symbols, which are normalized into probabilities by a Softmax function; during training, an optimization function is used and the weights of all layers of the AI translation model are updated by backpropagation, with the optimization objective of minimizing the difference between the output symbol and the true target symbol; the optimization function uses cross entropy to measure the difference between the normalized probability distribution over the K character symbols and the probability distribution of the true target symbol, and minimizes that difference; once this difference has been minimized, the SMILES sequence formed by the character symbols having the highest normalized probability among the K character symbols is taken as the output data item of step four;

wherein K is an integer greater than or equal to 0; when the character symbol with the highest normalized probability is the "." symbol in SMILES, the output data item of step four is a blank compound.

The value of K is the number of all possible character symbols of the chemical structures. In one implementation, the character symbols comprise the set of element symbols that occur (such as Cl, Br, I, N, C) and the non-element symbols (such as ".") in SMILES.
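
For illustration, the Softmax normalization p_i = exp(z_i) / Σ_j exp(z_j) and the cross-entropy loss for a single output position can be sketched in plain Python as follows. The five-symbol vocabulary and the raw weights are made-up values chosen only to show the calculation; they are not taken from the trained model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize raw output-layer weights into a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross entropy against a one-hot target: -log of the target's probability."""
    return -math.log(probs[target_index])


# Illustrative vocabulary of K = 5 character symbols and made-up raw weights.
vocab = ["C", "N", "Br", "Cl", "."]
logits = [1.0, 0.5, 4.0, 0.2, 0.1]

probs = softmax(logits)
best_symbol = vocab[max(range(len(vocab)), key=probs.__getitem__)]
loss_if_target_is_br = cross_entropy(probs, vocab.index("Br"))
loss_if_target_is_c = cross_entropy(probs, vocab.index("C"))
print(best_symbol, loss_if_target_is_br, loss_if_target_is_c)
```

The loss is small when the true target symbol already has the highest probability and large otherwise, which is the quantity that backpropagation drives down during training.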

In some embodiments, step one is specifically: standard names of reagent compounds are collected and converted to SMILES using the open chemical name conversion tool OPSIN (https://opsin.ch.cam.ac.uk/), and the results are pooled with previously accumulated reagent compound SMILES to form the reagent compound restriction data table containing the SMILES of more than 2000 reagent compounds.

In some embodiments, step two is specifically: chemical reaction formulas are collected from published chemical patents and processed into SMILES; a total of 1.5 to 2.3 million reactions are collected, of which 90 to 97% form the training set and the remainder (i.e., 10 to 3%) form the test set.

Further, a total of 1.8 to 2.1 million reactions are collected, of which 93 to 96% form the training set and the remainder (i.e., 7 to 4%) form the test set. Preferably, a total of 2 million ± 100 reactions are collected, of which 95% ± 0.5% form the training set and the remainder form the test set.
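
A minimal sketch of such a division is given below. The text does not state how reactions are assigned to the two sets; the seeded random shuffle, the function name `split_reactions`, and the placeholder reaction identifiers are assumptions made purely for illustration.

```python
import random
from typing import List, Tuple


def split_reactions(
    reactions: List[str], train_fraction: float = 0.95, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the reactions and cut it into training and test sets."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for reaction SMILES strings.
reactions = [f"rxn_{i}" for i in range(1000)]
train_set, test_set = split_reactions(reactions)
print(len(train_set), len(test_set))
```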

In some embodiments, n in step four is ≤ 10. Beyond requiring n ≥ 0, the invention does not need to limit the range of n: a reasonable number of reagents is obtained by generating the predicted reagents, and a prediction of n = 0 is allowed, meaning that the reaction does not need any reagent to be supplemented. In generally reasonable reactions, however, n ≤ 10 is typical.

In some embodiments, the reaction prediction model is the Molecular Transformer (MT) of Philippe Schwaller et al., Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Central Science, 2019 Sep 25; 5(9): 1572-1583 (hereinafter referred to as Schwaller 2019).

In a second aspect, as an application of the model constructed by the above method, the invention provides a method for automatic prediction and completion of a chemical reaction reagent, in which a reagent compound for completion is obtained using the reagent compound prediction model constructed by the above method for constructing a reagent compound prediction model.

In some embodiments, the method for automatic prediction and completion of a chemical reaction reagent comprises the following steps:

A1, inputting the chemical reaction formula SMILES data to be completed, as an input data item, into the reagent compound prediction model, and obtaining the output data item of the reagent compound prediction model;

A2, supplementing the output data item of step A1 into the chemical reaction formula to be completed and inputting the result into the reaction prediction model; if the predicted product is consistent with the original reaction product, the output data item of step A1 is the reagent compound for completion.

In a third aspect, the present invention further provides a device for automatic prediction and completion of a chemical reaction reagent, comprising an analysis module configured to generate a reagent compound for completion from input chemical reaction formula SMILES data to be completed; the analysis module is loaded with:

the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model described above, and the reaction prediction model.

The device is likewise a specific application of the model constructed by the construction method.

The invention discloses a construction method of a reagent compound prediction model, and a method and a device for automatic prediction and completion of a chemical reaction reagent. The model construction method comprises: representing reagent compounds as character sequences in SMILES form to generate a reagent compound restriction data table; representing chemical reaction formulas as character sequences in SMILES form; deleting, from the chemical reaction formula SMILES data, the SMILES data of any reagent compound present in the reagent compound restriction data table, using the remaining data as an input data item and the deleted reagent compound SMILES data as a target data item; generating, through artificial intelligence deep learning (such as an optimization function, backpropagation, the Softmax method and the like), an output data item closest to the target data item; and supplementing the output data item into the chemical reaction formula and inputting the result into a reaction prediction model, wherein, if the predicted product is consistent with the original reaction product, the output data item is determined to be the predicted reagent compound. The chemical reaction reagent is then automatically predicted and completed by means of the model.

The invention uses the reactant, reagent compound and product data in chemical reaction formulas to construct chemical reaction formulas without reaction reagents (i.e., the reagent compounds) as input data items, trains an AI machine learning model to predict the chemical structures of the reaction reagents, and merges the predicted reaction reagents back into the chemical reaction to improve the accuracy of chemical reaction prediction/judgment (e.g., by a reaction prediction model); the accuracy can be improved by 33%. Compared with the prior art, the method does not require experts to divide the chemical reaction into reaction roles, automatically identifies the targets to be predicted from the reagent data, and is not limited by reagent category, number, and the like.

The construction and evaluation method can train the AI reagent compound prediction model (i.e., the reagent compound prediction model) on existing chemical reaction data and can automatically evaluate and verify, without manual evaluation, how much the model improves reaction judgment.

The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.

Drawings

FIG. 1 shows the relationship among synthesis design, reaction prediction/judgment (reaction prediction), and reaction reagent prediction (reaction planning in the broad sense). P is the target compound/target product, i.e., the product of the reaction (to the right of "->" in the reaction scheme). Reaction reagent prediction (in the broad sense): given the reactant (A) and the target product (P), predict the compounds (reagents) required for the reaction; this is the technical problem addressed by the present invention. Reaction prediction/judgment: given the reactants and reagents (A + B + C, to the left of "->"), predict the product, or judge whether a certain product P (to the right of "->") is the product of the reaction; this technical problem is addressed by many existing method models. Synthesis design, i.e., retrosynthesis: given a target compound (P), predict the reactants that produce it; this technical problem is likewise addressed by many existing method models.

FIG. 2 is a flow chart of reaction reagent prediction data construction, reagent compound prediction model training, and automatic verification of the improvement in reaction prediction.

Figure 3 is a reagent compound predictive model training implementation: the AI translation model was trained to perform reagent compound predictions in chemical reactions (the ".

Detailed Description

In order to make the technical means, the characteristics, the purposes and the functions of the invention easy to understand, the invention is further described with reference to the specific drawings. However, the present invention is not limited to the following embodiments.

It should be understood that the structures, ratios, sizes, and the like shown in the drawings accompanying this specification serve only to complement the disclosure of the specification for the understanding of those skilled in the art and do not limit the conditions under which the present invention can be implemented; they therefore have no substantive technical significance, and any structural modification, change of ratio, or adjustment of size that does not affect the efficacy and achievable purpose of the present invention shall still fall within the scope of the present invention.

The technical concept of the invention is shown in FIG. 2 and comprises, in order of execution, reaction reagent prediction data construction, reagent compound prediction model training, and automatic verification of the improvement in reaction prediction. A reaction is written with "->" as the separator: on the left are the reactants, including potential reagents (A.B.C), and on the right is the reaction product (D). Matching against the reagent compound restriction data table shows that C in the reaction is a reagent compound. Deleting the reagent compound from the chemical reaction yields the input data item (also called the source language) A.B -> D (the reaction with the reagent missing) and the corresponding output/target data item C (a reagent group when there are several reagents), which are used to train a machine learning model (such as a machine translation model, i.e., the AI translation model). Trained on a reaction reagent prediction dataset constructed in this way (i.e., the data for the reagent compound prediction model), the AI translation model will attempt to predict the reagent compound(s) C' when it encounters a reaction of the form A.B -> D. Although C' may not be identical to the original answer C, the invention still judges the reagent compound C' to be reasonable if an existing reaction prediction model (such as the Molecular Transformer in Schwaller 2019) still predicts the product of A.B.C' -> D to be the expected D, since, as chemists have verified, the same chemical transformation may admit many reasonable reagent compound conditions (both C and C' can be reasonable).

Reaction reagent prediction data construction: in the present invention, the reagent compounds to be predicted and completed come from a "reagent compound restriction data table"; since not all compounds on the reactant side are reagent compounds, this table limits the range of reagent compounds that the model can predict. The table can be constructed by obtaining chemical reagent names from chemical information websites and the chemical literature and representing the corresponding structures as SMILES character strings, or by collecting the structures of reagent compounds in the broad sense (solvents, reagents, catalysts, and the like) based on expert experience. The table uses only the SMILES character strings of the chemical reagent structures; it does not need the names of the chemical reagents, nor does it need experts to strictly define the roles of solvent, reagent and catalyst. For example, ethanol, of formula C₂H₆O or CH₃CH₂OH, is represented in standard SMILES as "CCO". The SMILES of further reagent compounds are briefly illustrated in Table 1; it is to be understood that the reagent compounds are not limited to those shown in Table 1.

Table 1 Examples of reagent compounds

Name of reagent compound	Reagent structure (SMILES)
Ethanol	CCO
Propylene (PA)	CC=C
Acetonitrile	CC#N
Carbonic acid methyl ethyl ester	COC(=O)CC

The reaction reagent prediction data is constructed as follows: from existing chemical reaction data that fully contains the reagent compounds, the reagent compounds (and combinations thereof) present in the "reagent compound restriction data table" are deleted, giving the chemical reaction lacking reagent compounds as the input data item and the deleted combination of reagent compounds as the target data item. Because the complete chemical reaction data can also be expressed in SMILES, the matching can be done by SMILES comparison. This method of constructing the missing-reagent reaction prediction dataset supplies the data used by the AI machine learning model (machine learning model for short) automatically and without manual intervention, namely paired input data items (incomplete reaction SMILES) and target data items (SMILES of the reagent combination to be predicted), which can be used to train the AI translation model and evaluate its performance. In application, the chemical structure combination of the reagent compounds to be supplemented is predicted directly from the chemical information features extracted from the input reaction; the prediction is not restricted to any reagent category, number or combination, which makes it better suited to real scenarios.
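
The deletion described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `make_training_pair`, the two-entry reagent table, and the example reaction are assumptions. The sketch assumes reactions written as "reactants>>product" with molecules separated by ".", assumes that every SMILES string has already been canonicalized so that exact string matching is meaningful, and uses an empty string as a placeholder for the blank compound.

```python
from typing import Set, Tuple


def make_training_pair(reaction: str, reagent_table: Set[str]) -> Tuple[str, str]:
    """Split 'A.B.C>>D' into (reaction without reagents, deleted reagent combination)."""
    left, product = reaction.split(">>")
    kept, removed = [], []
    for molecule in left.split("."):
        # A molecule found in the restriction table is treated as a reagent.
        (removed if molecule in reagent_table else kept).append(molecule)
    source = ".".join(kept) + ">>" + product
    target = ".".join(removed)  # empty when no reagent was deleted
    return source, target


# Hypothetical restriction table with two entries (ethanol, acetonitrile).
reagent_table = {"CCO", "CC#N"}

source, target = make_training_pair(
    "CC(=O)Cl.NCc1ccccc1.CC#N>>CC(=O)NCc1ccccc1", reagent_table
)
print(source)
print(target)
```

Splitting on "." treats each fragment of a multi-component species (a salt, for example) as a separate molecule; a fuller implementation would need to decide how such species are grouped before matching.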

Training of the reagent compound prediction model: an AI translation model (e.g., the deep learning-based machine translation model in Vaswani 2017, a Transformer, which can be used for general translation tasks) is trained on the constructed missing-reagent reaction training dataset. The input data is a reaction from which the reagent compounds have been deleted to simulate their absence but which must still contain reactants and products, and the target output data is the set of reagent compounds deleted from that reaction. The input/target pairs of all reactions, i.e., the paired input data items (incomplete reaction SMILES) and target data items (SMILES of the reagent combination to be predicted), form the missing-reagent reaction training dataset. The AI translation model is trained with the optimization objective that, given the input data, it generates output as close as possible to the corresponding target data, which finally yields the reagent compound prediction model for chemical reactions.

Automatic verification of the improvement in reaction prediction: the trained reagent compound prediction model is used to predict/supplement reagents for each reaction containing at least one reactant in the constructed test set, and the reaction is reconstructed as a reaction supplemented with the predicted reagent information; if the predicted reagent is a blank compound, the original reaction needs no supplementary reagent information and is copied unchanged as the reconstructed reaction. One or more existing trained reaction prediction machine learning models are then each used to predict the products of the baseline reaction (with the original product removed and no reagent compound supplemented) and of the reaction supplemented with the predicted reagent compound, and the probability of correctly predicting the product, and its improvement relative to the baseline reaction, are verified. When several existing reaction prediction machine learning models are available, the one with the largest improvement index is selected as the reaction prediction model to be used subsequently together with the reagent compound prediction model. In the application stage, the reagent compound prediction model and the selected reaction prediction model are used as a combined component to improve the accuracy of reaction judgment in synthesis design.
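
Selecting among several reaction prediction models can be sketched as follows. The text does not define the improvement index precisely; this sketch assumes it is the absolute gain in the proportion of correctly predicted products. The function name `select_reaction_model`, the model names, and the accuracy figures are hypothetical values for illustration only.

```python
from typing import Dict, Tuple


def select_reaction_model(
    accuracies: Dict[str, Tuple[float, float]],
) -> Tuple[str, float]:
    """Pick the model with the largest gain over its own baseline.

    `accuracies` maps a model name to
    (baseline accuracy, accuracy with supplemented reagents).
    """
    gains = {name: with_r - base for name, (base, with_r) in accuracies.items()}
    best = max(gains, key=gains.get)
    return best, gains[best]


# Made-up accuracies for two hypothetical reaction prediction models.
results = {"model_a": (0.50, 0.70), "model_b": (0.55, 0.65)}
best_model, gain = select_reaction_model(results)
print(best_model, gain)
```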

Example 1

Reagent compound prediction model training as machine translation: a deep-learning machine translation model is trained for reagent prediction on the constructed missing-reagent reaction training data set.

Step one: the standard names of relevant chemical reagents are collected and converted to SMILES using an open chemical name conversion tool (https://opsin.ch.cam.ac.uk/), then combined with SMILES accumulated by other experts to form a "reagent compound restriction data table" containing the SMILES of >2000 reagents.

Step two: chemical reaction formulas are collected from published chemical patents and processed into SMILES. A total of 2 million reactions are collected, of which 95% (1.8 million) are used as the training set and the remaining 5% (95,000) as the test set.
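A training/test split of this kind might be produced as in the sketch below. The source does not say how reactions are assigned to each set, so the random shuffle, the fixed seed and the function name `split_reactions` are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def split_reactions(
    reactions: List[str], train_fraction: float = 0.95, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the reactions and split it into (training set, test set)."""
    shuffled = list(reactions)
    # Assumption: a seeded random shuffle; the patent does not describe the assignment.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```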

Step three: for existing chemical reaction SMILES data that fully contain their reagents, the reagents and reagent combinations listed in the reagent compound restriction data table are deleted, giving reagent-free reaction SMILES as the input data item and the deleted reagent combination as the target data item. When a reaction needs no supplementary reagent, a blank compound (which can be represented by a SMILES symbol) is used as the target data item.
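The deletion step can be sketched as plain string processing. The sketch assumes that reagents appear on the left of ">>" mixed with the reactants and separated by ".", and that all SMILES have been canonicalized the same way so that string equality identifies a reagent. The `BLANK` marker and the function name `make_training_pair` are hypothetical; the source does not name the blank-compound symbol.

```python
from typing import Iterable, Set, Tuple

# Hypothetical placeholder for the "blank compound"; the source does not name the symbol.
BLANK = "<blank>"


def make_training_pair(reaction: str, reagent_table: Iterable[str]) -> Tuple[str, str]:
    """Split 'a.b.c>>p' into (input without table reagents, deleted reagent combination)."""
    table: Set[str] = set(reagent_table)
    left, product = reaction.split(">>")
    kept, deleted = [], []
    for mol in left.split("."):
        # Molecules found in the restriction table are moved to the target side.
        (deleted if mol in table else kept).append(mol)
    source = ".".join(kept) + ">>" + product
    target = ".".join(deleted) if deleted else BLANK
    return source, target
```

Applied to a bromination whose left-hand side contains `ClCCl` and `O=C1CCC(=O)N1Br`, the function returns the reagent-free reaction as input and `ClCCl.O=C1CCC(=O)N1Br` as target.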

Step four: the chemical structures of compounds, including reactants and reagents in the broad sense, can be represented as character sequences in SMILES form, so generating the character sequence of the missing reagent combination from the character sequence of the reaction can be treated as a machine translation problem from a source language to a target language. The "source language" data are the simulated reactions from which the reagent compounds have been deleted but which still contain reactants and products, and the "target language" data are the sets of reagent compounds deleted from those reactions. For each "source language"-"target language" pair, an AI translation model (such as the Transformer of Vaswani 2017) is trained with the optimization objective of generating, from the "source language", the translation closest to the "target language", as shown in Fig. 3. In Fig. 3, the string to the left of ">>" represents the reactants (lacking reagents) and the string to the right of ">>" is the reaction product. The Transformer then generates the tokens (character symbols) of the predicted reagent combination SMILES one by one. After the input has passed through the weight calculations of each Transformer layer, the output layer of the model yields raw weights z_i (> 0), i = 1, 2, …, K, for K different tokens (K = 5 in this example); these are normalized into probabilities by Softmax, and Br, which has the highest probability (0.90), is taken as the final output.
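To make the notion of character symbols (tokens) concrete, the sketch below splits a SMILES string into tokens with a regular expression. The source does not specify the tokenization scheme, so the simplified pattern and the name `tokenize` are assumptions; note that multi-character atoms such as Br must be kept as a single token.

```python
import re
from typing import List

# Simplified SMILES token pattern (an assumption, not the patent's scheme):
# bracket atoms, two-letter halogens, the reaction arrow, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|>>|[A-Za-z]|\d|[=#()\.\-+/\\@%:~*$>]")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES or reaction SMILES string into tokens, checking nothing is lost."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens
```

For example, `tokenize("O=C1CCC(=O)N1Br")` yields 14 tokens, the last of which is `Br`.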

The output is a combination of reagents, since one reaction can contain several reagents at once. The structure shown in the example corresponds to O=C1CCC(=O)N1Br; the ellipsis stands for the other reagents that the model predicts for this combination.

Specifically, the Transformer of Vaswani 2017 takes the reactants, ">>" and the product as input and generates the tokens (character symbols) of the reagent combination SMILES to be predicted one by one. After the input has passed through the weight calculations of the Transformer layers, the output layer of the model yields raw weights z_i (> 0), i = 1, 2, …, K, for K different tokens (K = 5 in this example), and Softmax

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, 2, \ldots, K$$

produces the normalized probability output (e = 2.71828… is the base of the natural logarithm). During training, an optimization function is used and the weights of every layer of the AI translation model are updated by backpropagation so as to minimize the difference between the output symbols and the true symbols. The optimization function may use cross entropy to measure the difference between the model's output probability distribution over the K = 5 symbols, [0.02, 0.90, 0.05, 0.01, 0.02], and the probability distribution of the true target symbol, and minimize that difference; if Br is the true symbol in the training data, its distribution is [0.0, 1.0, 0.0, 0.0, 0.0]. When the reagent compound prediction model is tested, the distribution of the true symbols is unknown, and the token sequence with the highest probability is output directly as the output data item; in the figure, the symbol Br, with the highest probability (0.90), is the final output.
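The Softmax normalization and the cross-entropy comparison can be worked through numerically with the distributions quoted in the text. This is a sketch in plain Python rather than the training code of the model; the function names are illustrative, and the cross entropy is taken in its usual form, the negative sum of target probability times the natural log of predicted probability.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Normalize raw weights z_i into probabilities e^{z_i} / sum_j e^{z_j}."""
    m = max(z)  # subtracting the maximum improves numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(predicted: List[float], target: List[float]) -> float:
    """Cross entropy -sum_i target_i * ln(predicted_i), skipping zero-target terms."""
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)


probs = [0.02, 0.90, 0.05, 0.01, 0.02]   # output distribution quoted in the text
target = [0.0, 1.0, 0.0, 0.0, 0.0]       # one-hot distribution for the true token Br
loss = cross_entropy(probs, target)       # reduces to -ln(0.90) for a one-hot target
best_index = max(range(len(probs)), key=probs.__getitem__)  # position of Br
```

With a one-hot target the loss depends only on the probability assigned to the true token, so raising the probability of Br toward 1.0 drives the loss toward zero.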

Since a reaction may contain several reagents at once, the reagent compound prediction model outputs the reagent combination with the highest probability. The structure shown in the example of Fig. 3 corresponds to O=C1CCC(=O)N1Br, and the ellipsis stands for the other reagents in the combination predicted by the reagent compound prediction model, which are omitted.

Step five: the trained reagent compound prediction model (RT for short) can be paired with a machine-translation-based reaction prediction model (such as the Molecular Transformer of Schwaller 2019, MT for short) for verification and application. Specifically, for a test reaction, the reagents are first predicted by RT; if the prediction is a blank compound, no additional reagent is needed. The reaction supplemented with the predicted reagents is then fed into MT, and if the Top-1 (first-ranked) predicted product is identical to the original reaction product, the reagents predicted by RT are adopted.
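The RT + MT check in step five can be sketched with the two models passed in as plain callables. Everything named here is hypothetical: `complete_reaction`, the `BLANK` marker, and the choice to append the predicted reagents to the reactant side with "." are assumptions, since the source does not specify how the reagents are written back into the reaction SMILES.

```python
from typing import Callable, Optional

BLANK = "<blank>"  # hypothetical marker meaning "no reagent needed"


def complete_reaction(
    reaction: str,
    predict_reagents: Callable[[str], str],   # stand-in for RT
    predict_product: Callable[[str], str],    # stand-in for MT, returns the Top-1 product
) -> Optional[str]:
    """Return the RT-predicted reagents if MT's Top-1 product matches, else None."""
    left, product = reaction.split(">>")
    reagents = predict_reagents(reaction)
    if reagents == BLANK:
        supplemented = reaction               # nothing to add; the reaction is copied as-is
    else:
        # Assumption: reagents are appended to the reactant side, separated by ".".
        supplemented = f"{left}.{reagents}>>{product}"
    # MT sees only the left-hand side and must reproduce the recorded product.
    predicted = predict_product(supplemented.split(">>")[0])
    return reagents if predicted == product else None
```

Returning `None` when the products disagree keeps the rejected prediction out of downstream use; a real pipeline might instead fall back to lower-ranked reagent candidates.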

Step six: test set results and accuracy. The results below were generated from experiments run independently with the test data and associated models, and are not directly quoted from the data sets and result figures in Schwaller 2019.

Test effect 1: from the database extracted from published patent reactions, 1.8 million reactions were used to train the reagent compound prediction model (RT), and reaction prediction was tested on approximately 95,000 reactions independent of the training reactions. As a comparison against incomplete reactions lacking reagents, the validation experiment used a reaction prediction model (MT) trained on the corresponding data to predict the above reactions with their reagents missing. Without the RT of the invention, the MT Top-1 accuracy is 69% (the baseline, obtained by testing this test data set with the MT of Schwaller 2019); after RT is used to predict reagents for the reactions with missing reagents, the RT + MT Top-1 accuracy is 92%. RT therefore raises the MT reaction prediction accuracy by 23 points (from 69% to 92%, a relative increase of 33%).
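As a quick check, the absolute and relative gains quoted for test effect 1 follow directly from the two reported accuracies (the symbols Δ_abs and Δ_rel are introduced here only for this calculation):

```latex
\Delta_{\text{abs}} = 92\% - 69\% = 23 \text{ points}, \qquad
\Delta_{\text{rel}} = \frac{92\% - 69\%}{69\%} = \frac{23}{69} \approx 33\%
```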

Test effect 2: on a test set of real-world organic synthesis reactions that require supplementary reaction conditions (127 reactions), supplementing the reactions with the reagent compound prediction model (RT) effectively improves the correct-prediction proportion of the forward reaction prediction model (MT) by 90.7% (an absolute increase of 39 points for RT + MT).

Test effect 3: on a test of reactions (20 reactions) extracted from real patent reactions and confirmed by chemists to be erroneous and infeasible, the reagent compound prediction model (RT) effectively controls the rate at which an erroneous reaction is misjudged as a correct one (5%; 1 reaction was misjudged by MT).

An example test result:

Reaction with the reagents deleted, in SMILES form:

CC(C)(C)CCn1ccc(-c2ccn3nccc3n2)cc1=O>>CC(C)(C)CCn1ccc(-c2ccn3ncc(Br)c3n2)cc1=O

In structural form, the reactant is to the left of ">>" and the product is to the right.

Without supplementation by the reagent compound prediction model RT, the product predicted by the reaction prediction model (MT) is wrong (the predicted product replaces the Br in the correct product on the right with I; structure not shown).

The reagent set predicted by RT is ClCCl.O=C1CCC(=O)N1Br (structural form not shown).

With the RT-predicted reagent set supplemented, MT predicts the correct product given above.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A method for constructing a reagent compound prediction model is characterized by comprising the following steps:

2. The method of constructing a reagent compound predictive model of claim 1, further comprising: step six, testing the reagent compound prediction model by using the test set; the method specifically comprises the following steps:

3. The method for constructing reagent compound prediction model according to claim 1, wherein the deep learning of artificial intelligence in step four generates n output data items closest to the target data item in step three by the following method:

4. The method for constructing a reagent compound prediction model according to claim 1, wherein step one is specifically: the standard names of reagent compounds are collected and converted to SMILES using the open chemical name conversion tool https://opsin.ch.cam.ac.uk/, and combined with accumulated reagent compound SMILES to form the reagent compound restriction data table containing the SMILES of >2000 reagent compounds.

5. The method for constructing a reagent compound prediction model according to claim 1, wherein step two is specifically: chemical reaction formulas are collected from published chemical patents and processed into SMILES, with 1.5 to 2.3 million reactions collected in total, of which 90-97% are used as the training set and the remainder as the test set.

6. The method of claim 1, wherein n is 10 or less.

7. The method for constructing a reagent compound prediction model according to claim 1, wherein the reaction prediction model is the model of Philippe Schwaller et al., "Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction", 2019 Sep 25; 1572-1583.

8. A method for automatic predictive completion of a chemical reaction reagent, characterized in that a reagent compound for completion is obtained by using the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 7.

9. The method for automated predictive replenishment of a chemical reaction reagent of claim 8, comprising the steps of:

10. The device for automatically predicting and completing the chemical reaction reagent is characterized by comprising an analysis module, a prediction module and a completion module, wherein the analysis module is used for generating a reagent compound for completion according to input chemical reaction formula SMILES data to be completed; the analysis module is loaded with:

the reagent compound prediction model constructed by the method for constructing a reagent compound prediction model according to any one of claims 1 to 7, and the reaction prediction model.