WO2020023650A1

WO2020023650A1 - Retrosynthesis prediction using deep highway networks and multiscale reaction classification

Info

Publication number: WO2020023650A1
Application number: PCT/US2019/043261
Authority: WO
Inventors: Javier L. BAYLON; Nicholas A. CILFONE; Jeffrey R. Gulcher; Thomas W. Chittenden
Original assignee: Wuxi Nextcode Genomics Usa, Inc.
Priority date: 2018-07-25
Filing date: 2019-07-24
Publication date: 2020-01-30

Abstract

Retrosynthesis prediction using deep highway networks and multiscale reaction classification is provided. In various embodiments, a molecular fingerprint is determined for a chemical product. The molecular fingerprint is provided to a first trained classifier. A candidate reaction group is obtained from the first trained classifier. A second trained classifier is selected from a plurality of trained classifiers, the second trained classifier corresponding to the candidate reaction group. The molecular fingerprint is provided to the second trained classifier. A candidate reaction rule is obtained from the second trained classifier. A plurality of reactants yielding the chemical product is determined from the candidate reaction rule.

Description

RETROSYNTHESIS PREDICTION USING DEEP HIGHWAY NETWORKS AND

MULTISCALE REACTION CLASSIFICATION

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Patent Application No.

62/703,112, filed July 25, 2018, which is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] Embodiments of the present disclosure relate to retrosynthetic analysis, and more specifically, to retrosynthesis prediction using deep highway networks and multiscale reaction classification.

BRIEF SUMMARY

[0003] According to embodiments of the present disclosure, methods of and computer program products for retrosynthetic analysis is are provided. A molecular fingerprint is determined for a chemical product. The molecular fingerprint is provided to a first trained classifier. A candidate reaction group is obtained from the first trained classifier. A second trained classifier is selected from a plurality of trained classifiers, the second trained classifier corresponding to the candidate reaction group. The molecular fingerprint is provided to the second trained classifier. A candidate reaction rule is obtained from the second trained classifier. A plurality of reactants yielding the chemical product is determined from the candidate reaction rule.

[0004] According to embodiments of the present disclosure, methods of and computer program products for retrosynthetic analysis is are provided. A plurality of reaction rules is clustered into a plurality of groups. A first classifier is trained to select one of the plurality of groups based on an input molecular fingerprint of a chemical product. A plurality of additional classifiers are trained, each associated with one of the plurality of groups, to select one of the plurality of reaction rules from the associated group based on an input molecular fingerprint of a chemical product.

[0005] According to embodiments of the present disclosure, systems for retrosynthetic analysis is are provided. The system comprises a first trained classifier adapted to receive a molecular fingerprint for a chemical product, and determine therefrom a candidate reaction group from a plurality of candidate reaction groups. The system comprises a plurality of additional trained classifiers, each associated with one of the plurality of groups, each adapted to select a candidate reaction rule from its associated group based on an input molecular fingerprint of a chemical product.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0006] Fig. 1 is a schematic representation of automatic rule extraction according to

embodiments of the present disclosure.

[0007] Fig. 2 is a schematic view of a multiscale approach for retrosynthetic reaction prediction according to embodiments of the present disclosure.

[0008] Fig. 3 illustrates the chemical diversity associated with reaction rule classes according to embodiments of the present disclosure.

[0009] Fig. 4 is a visualization of identified reaction groups in a multiscale dataset according to embodiments of the present disclosure. [0010] Fig. 5 illustrates classification performance increases in multiscale models according to embodiments of the present disclosure.

[0011] Fig. 6 illustrates overlapping retrosynthetic reaction predictions generated with multiscale and rule-based models according to embodiments of the present disclosure.

[0012] Fig. 7 illustrates correct calls obtained with a rule-only according to embodiments of the present disclosure.

[0013] Fig. 8 illustrates correct predictions obtained with a multiscale model according to embodiments of the present disclosure.

[0014] Fig. 9 illustrates partially correct predictions obtained with the multiscale model according to embodiments of the present disclosure.

[0015] Fig. 10 illustrates the distribution of exemplary reaction groups according to

embodiments of the present disclosure.

[0016] Fig. 11 illustrates the percentage difference in balanced accuracy for reaction

classification in exemplary reaction groups according to embodiments of the present disclosure.

[0017] Fig. 12 illustrates examples of overlapping retrosynthetic reaction predictions generated with multiscale and rule-based models according to embodiments of the present disclosure.

[0018] Fig. 13 illustrates examples of correct calls obtained with a rule-only model, but miscalled with multiscale models according to embodiments of the present disclosure.

[0019] Fig. 14 illustrates examples of correct calls obtained with multiscale model, but miscalled with rule-only models according to embodiments of the present disclosure.

[0020] Fig. 15 is a schematic view of a data-driven, multiscale approach based on deep highway networks (DHNs) and reaction rule classification for retrosynthetic reaction prediction according to the present disclosure. [0021] Fig. 16 illustrates a method of retrosynthetic analysis according to embodiments of the present disclosure.

[0022] Fig. 17 illustrates a method of retrosynthetic analysis according to embodiments of the present disclosure.

[0023] Fig. 18 depicts a computing node according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0024] Planning chemical synthesis of molecules is a crucial aspect of organic chemistry that has applications spanning many different fields, including drug discovery and material science. The goal of chemical synthesis planning is to derive a pathway, often consisting of multiple reaction steps and reactants, by which a target molecule can be produced. Retrosynthetic analysis is a widely employed technique in chemical synthesis planning, where a target molecule is recursively transformed into simpler precursors, following a reversed chemical reaction. This process is carried out until a series of starting precursors ( e.g ., commercially available molecules) are obtained. An overall synthetic route for the target molecule can subsequently be produced by combining all of the derived reactions from the retrosynthetic analysis. Various computational techniques may be used for retrosynthetic analysis.

[0025] For example, computer-aided retrosynthetic analysis may be carried out using reaction rules that consist of a set of minimal transformations (e.g., changes at the reactive center and neighboring bonds and atoms) to characterize a chemical reaction. These reaction rules, which can be encoded by expert chemists or automatically extracted from a given dataset, are used as templates for chemical transformations applied to an input target molecule to derive retrosynthetic precursors. The result of such a template-based approach is a set of reactant molecules that transform into the target product by following the reaction rule.

[0026] However, template-based systems are limited to an initial set of reaction rules, irrespective of whether those rules are hand-coded or extracted from a dataset. Thus, rule-based systems are not well-adapted to predict retrosynthetic reactions for new target products.

[0027] To overcome this limitation, deep learning (DL) may be applied to retrosynthetic analysis. An advantage of such non-linear statistical-learning approaches over template-based retrosynthetic methods is the ability to extract generalizable patterns from large amounts of chemical data. For example, a given model may consider the molecular context in which a reaction occurs at a fraction of the computational cost required for template-based

implementations.

[0028] Molecules may be represented in a variety of ways for use in deep learning based retrosynthetic analysis. For example, retrosynthetic analysis may be formulated as a translation task using a sequences-to-sequence (seq2seq) architecture by representing molecules as simplified molecular-input line-entry system (SMILES) strings. In such approaches, a target product is encoded as a string of characters extracted from its corresponding SMILES string and converted ( e.g ., translated) to another sequence of characters, corresponding to reactant SMILES strings. One advantage of this approach is it naturally incorporates information about the global environment of molecules by considering their entire structure (encoded in its SMILES string), instead of an abstracted version of the reaction (e.g., only the reactive center). However, due to the nature of the translation task, such models are prone to predict chemically unfeasible precursors for a target molecule. [0029] In other approaches, the retrosynthetic prediction task may be formulated as a

classification problem by representing molecules as Morgan Fingerprints. In such approaches, reaction rules (for example, as shown in Fig. 1) are automatically extracted from a large dataset ( e.g ., the Reaxys chemical database consisting of several millions of reactions) and used as labels to train a multilabel classifier, such as one based on deep highway networks. Given an input target molecule encoded as a fingerprint, the model predicts the probability of all the possible reaction rules in the training set. The top predicted rules are then applied to the input target molecule to obtain a set of retrosynthetic precursors. This approach may be combined with Monte Carlo tree search to guide the reaction prediction task and derive robust retrosynthetic pathways.

[0030] A variety of sophisticated molecular representations may be used for machine learning tasks, including latent space representations and molecular graph convolutions. In addition, binary fingerprints may be employed in various cheminformatics tasks, including in drug discovery. Although fingerprints present several issues (including bit collisions and vector sparsity), their flexibility and ease of computation offer a useful featurization scheme for machine learning applications such as reaction prediction.

[0031] A limitation of an entirely data-driven retrosynthesis approach is the finite number of reaction rules extracted from a dataset. These numbers are determined by several factors, including the level of detail of the extracted reaction rule and the diversity of the chemical dataset. This limitation can be aggravated in smaller datasets, for example a private electronic lab notebook (ELN), which would necessarily contain less knowledge than the entirety of known reactions in organic chemistry (which amounts to approximately 12.4 million reactions), or even in the publicly available United States patent (USPTO) dataset. Thus, rule- based models trained on different datasets would learn different amounts of chemistry, determined by the number of extracted reaction rules, and thus would be more limited in their predictive scope. Moreover, because of the chemical diversity expected in a typical ELN, a model can be biased to learn a highly imbalanced type of reaction (one that is highly represented compared to other reactions), if the data is not properly stratified.

[0032] To address these and other shortcomings of alternative approaches, the present disclosure provides a multiscale approach for retrosynthesis analysis using reaction rules extracted from relatively smaller data sets. For example, in some embodiments, a USPTO dataset of reactions is used, which contains patented reactions from the last 50 years. This dataset contains a significantly smaller number of reactions than the set of known organic chemistry reactions (~l. l million vs. 12.4 million). This multiscale approach leads to a significant increase in classification performance (measured by balanced accuracy) compared to alternative reaction rule-based systems. By stratifying the data by reaction groups during the training phase, the model learns patterns over molecules that were obtained by similar chemical reactions, which may be overlooked in an alternative rule-based system. The multiscale approach is validated by predicting the first retrosynthetic step for 40 approved small molecules. For these drugs, the multiscale model correctly predicted more known retrosynthetic steps than the alternative rule- based system.

[0033] In various embodiments, multiscale, data-driven approaches for retrosynthetic analysis with deep highway networks (DHN). In exemplary embodiments, reaction rules are

automatically extracted from datasets consisting of chemical reactions derived from FT.S. patents. In exemplary embodiments, the retrosynthetic reaction prediction task is performed in two steps: first, a DHN model is built to predict which group of reactions (consisting of chemically similar reaction rules) was employed to produce a molecule. Once a reaction group is identified, a DHN trained on the subset of reactions within the identified reaction group is employed to predict the transformation rule used to produce a molecule.

[0034] To validate this approach, the first retrosynthetic step is predicted for 40 approved small- molecule drugs (including anti-cancer and anti-viral drugs) using the multiscale model. Its predictive performance is compared with a rule-based model. Multiscale approaches as set out herein perform better than a purely rule-based model at predicting retrosynthetic reactions and generating valid reactants, achieving > 80% match with known synthetic routes of the tested drugs.

[0035] Accordingly, the multi-scale, data-driven, deep learning based approaches for retrosynthetic analysis provided herein outperform alternative techniques, with over a 10% increase in accuracy of predicted reaction rules. These results demonstrate a significant improvement in retrosynthetic models, enabling synthetic route planning with greater computational efficiency, and thus can achieve results with less computational resources than alternatives.

[0036] Various embodiments described herein apply a classifier to input data to determine an output classification of that input data. In some embodiments, the input comprises a feature vector. It will be appreciated that a variety of classifiers are suitable for use according to the present disclosure. For example, the classifier may be a random decision forest, a linear classifier, a support vector machine (SVM), or a neural network such as a recurrent neural network (R N). [0037] In various embodiments, a given classifier is pre-trained using training data. In some embodiments training data is retrospective data. In some embodiments, the retrospective data is stored in a data store. In some embodiments, the learning system may be additionally trained through manual curation of previously generated outputs.

[0038] Various exemplary embodiments described herein use artificial neural networks (ANNs). ANNs are distributed computing systems, which consist of a number of neurons interconnected through connection points called synapses. Each synapse encodes the strength of the connection between the output of one neuron and the input of another. The output of each neuron is determined by the aggregate input received from other neurons that are connected to it. Thus, the output of a given neuron is based on the outputs of connected neurons from preceding layers and the strength of the connections as determined by the synaptic weights. An ANN is trained to solve a specific problem ( e.g ., pattern recognition) by adjusting the weights of the synapses such that a particular class of inputs produce a desired output.

[0039] Various algorithms may be used for this learning process. Certain algorithms may be suitable for specific tasks such as image recognition, speech recognition, or language processing. Training algorithms lead to a pattern of synaptic weights that, during the learning process, converges toward an optimal solution of the given problem. Backpropagation is one suitable algorithm for supervised learning, in which a known correct output is available during the learning process. The goal of such learning is to obtain a system that generalizes to data that were not available during training.

[0040] In general, during backpropagation, the output of the network is compared to the known correct output. An n error value is calculated for each of the neurons in the output layer. The error values are propagated backwards, starting from the output layer, to determine an error value associated with each neuron. The error values correspond to each neuron’s contribution to the network output. The error values are then used to update the weights. By incremental correction in this way, the network output is adjusted to conform to the training data.

[0041] When applying backpropagation, an ANN rapidly attains a high accuracy on most of the examples in a training-set. The vast majority of training time is spent trying to further increase this test accuracy. During this time, a large number of the training data examples lead to little correction, since the system has already learned to recognize those examples. While in general, ANN performance tends to improve with the size of the data set, this can be explained by the fact that larger data-sets contain more borderline examples between the different classes on which the ANN is being trained.

[0042] Suitable artificial neural networks include but are not limited to a feedforward neural network, a radial basis function network, a self-organizing map, learning vector quantization, a recurrent neural network, a Hopfield network, a Boltzmann machine, an echo state network, long short term memory, a bi-directional recurrent neural network, a hierarchical recurrent neural network, a stochastic neural network, a modular neural network, an associative neural network, a deep neural network, a deep belief network, a convolutional neural networks, a convolutional deep belief network, a large memory storage and retrieval neural network, a deep Boltzmann machine, a deep stacking network, a tensor deep stacking network, a spike and slab restricted Boltzmann machine, a compound hierarchical-deep model, a deep coding network, a multilayer kernel machine, or a deep Q-network.

[0043] In some embodiments, a highway network is used. A highway network is an ANN based on Long Short Term Memory (LSTM) recurrent networks that allows training of deep, efficient networks using gradient-based methods. Highway networks may be efficiently deployed even with large numbers of layers ( e.g ., hundreds of layers). In particular, highway networks use learned gating mechanisms to regulate information flow, for example using Long Short-Term Memory (LSTM) units to build a recurrent neural network. These gating mechanisms allow neural networks to provide paths for information to follow across different layers (which may be referred to as highways).

[0044] In general, LSTM units comprise a memory cell, an input gate, an output gate, and a forget gate. The cell persists a value over time. The input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. Each gate may be implemented as an artificial neuron, computing an activation of a weighted sum. It will be appreciated that a variety of LSTM unit architectures may be adopted, and moreover that alternative gating mechanisms for highway networks may be adopted in various embodiments.

[0045] In an exemplary embodiment of the present disclosure, a dataset of chemical reactions aggregated from patents granted in the U.S. between 1976 and 2016 is preprocessed. This dataset is referred to herein as the USPTO dataset. Reaction centers for individual reactions are extracted, giving the set of atoms and bonds that changed between reactants and products in a given reaction. In various embodiments, an established protocol is used to define reaction rules for multiclass classification.

[0046] It will be appreciated that a variety of techniques are available for defining reaction rules. For example, such techniques are described in Law, L; Zsoldos, Z.; Simon, A.; Reid, D.; Liu, Y.; Khew, S. Y.; Johnson, A. P.; Major, S.; Wade, R. A.; Ando, H. Y. Route Designer: A Retrosynthetic Analysis Tool Utilizing Automated Retrosynthetic Rule Generation. J. Chem. Inf. Model. 2009, 49, 593-602 and Christ, C. D.; Zentgraf, M.; Kriegl, J. M. Mining Electronic Laboratory Notebooks: Analysis, Retrosynthesis, and Reaction Based Enumeration. J. Chem. Inf. Model. 2012, 52, 1745-1756.

[0047] In the present example, atom properties between mapped atoms in reactants and products are algorithmically compared, and the atoms and bonds that changed in the reaction are identified ( e.g ., the reactive center). Reaction rules (RR) are defined that contain the reactive center and the shell of first-neighboring atoms (e.g., as in Fig. 1). In this example, the initial rule extraction step resulted in a total of 74,482 unique RRs, however a significant portion of the extracted rules were represented only once (e.g., 54,444 RRs). These reaction rules were discarded since they are associated with very specific reactions, which were not generalizable across the dataset.

[0048] Referring to Fig. 1, a schematic representation of automatic rule extraction is provided according to embodiments of the present disclosure. Mapped atoms from the reactant 101 and product 102 side are compared which bonds and atoms changed during the reaction (reactive center). Starting from the reactive center, first and second neighbors to the reactive atoms are identified and are used to assign the reaction rule.

[0049] For the extracted rule set, different cutoffs are defined for the number of times a reaction rule occurs in the dataset (e.g., at least 50, 100, 250, 500 and 1000 times). These numbers are selected to maintain sufficient number of samples per reaction class for statistical power, while keeping a relatively broad set of reaction rules (labels for classification) to encompass diverse chemistries. [0050] Performance of rule-only models is shown in Table 1. Standard deviation is shown in parentheses. A breakdown of per class performance is presented in Appendix 1. The resulting datasets contains 855, 462, 225, 129 and 73 unique reaction rules for classification. The five resulting datasets are stratified by reaction rule, and split into training (80% of the total reactions) and held-out test sets (20% of the total reactions).

Table 1

[0051] As set out below, in various embodiments, reactions are predicted with deep highway networks using different reaction rule set sizes. Referring to Fig. 2, a schematic view of a multiscale approach for retrosynthetic reaction prediction is illustrated. In Fig. 2A, a schematic representation is provided of unsupervised reaction rule grouping according to approaches described herein. Reaction rules ( e.g ., 201) are represented as gray dots, and reaction groups ( e.g ., 202) are represented as dashed lines. As set out below, reaction grouping may be based on chemical similarity of the reaction rules extracted from a dataset, such as the USPTO dataset. Reaction rules and reaction group membership are employed as labels for classification. Fig. 2B is a schematic representation of a deep highway network (DHN) trained to predict on all extracted reaction rules. A molecular fingerprint 203 is provided to DHN 204, which comprises a plurality of layers organized into highways of carrier of transformed bits. The output 205 comprises a classification into one or more of a plurality of reaction candidate rules.

[0052] Fig. 2C is a schematic representation of a multiscale approach according to the present disclosure. A DHN 206 is trained for prediction on all reaction groups, using the same training set data as for the model depicted in Fig. 2B, but stratified by reaction group. Accordingly, a molecular fingerprint is provided to DHN 206, which provides an output classification 207 into one or more classification group. Once a prediction is made at the group scale, another DHN 208, trained to predict only on rules belonging to the predicted group, is employed to make predictions at the reaction scale. Accordingly, group-specific DHN 208 takes molecular fingerprint 203, and provides an output classification 209 into one or more of a plurality of reaction candidate rules.

[0053] In an exemplary embodiment, to build the multiscale approach, an unsupervised method ( e.g ., TaylorButina algorithm) is employed to group the extracted reaction rules by chemical similarity (e.g., as illustrated in Fig. 2A). The model then works via two steps: first a deep neural network (DNN) predicts which group of reactions produces a molecule, and then a smaller, more focused DNN, trained only on that group of reactions, predicts which rule produces the molecule (e.g., as illustrated in Fig. 2C). This is in contrast to the alternative implementation of rule-based retrosynthesis, in which a model is trained to predict a reaction rule from all extracted reaction rules (e.g., as illustrated in Fig. 2B). Product molecules are represented as fingerprints, and each molecule has an associated reaction rule and reaction group, which are used as labels during the training phase.

[0054] The task of retrosynthetic reaction prediction is formulated as a multinomial

classification problem. Given an input product molecule (encoded as a molecular fingerprint, e.g., of 2048 bits), the retrosynthetic DL models predict which reaction rule (class label) was used to produce the molecule (as set out in the discussion of Fig. 2). In some embodiments, deep highway network architectures (DHN) an used, based on a combination of a single hidden layer, five highway layers, and a final softmax layer to output class label probabilities (as illustrated in Fig. 2B). In an exemplary embodiments, five separate DHN are initially trained on each of training datasets (as shown in Table 1).

[0055] The average balanced accuracies on the test set (consisting of 20% of total set) for different reaction rule occurrence cutoffs ranged from 0.774 to 0.8208 (as shown in Table 1). These accuracies are comparable to the performance of alternative retrosynthetic models. For instance, an accuracy of 0.83 has been reported with 137 rules using the Reaxys dataset, compared to an accuracy of 0.81 for the present model trained with 129 rules. However, a smaller dataset is used in this example, thus a direct comparison was difficult. Breaking down the predictive performance of the five models by class label (reaction rule) shows that the models perform significantly better on some reaction rules than others (balanced accuracies of > 0.99 compared to < 0.5, with a mean range = 0.4668 and s.d. = 0.0386 over the five models), despite having a similar number of test samples (as illustrated in Appendix 1). This may be attributed to the chemical diversity of the products within each reaction rule class.

[0056] To quantitatively assess this, product molecules in the five test sets were clustered (using the Taylor-Butina algorithm with a cutoff of 0.8) and the correlation between per class accuracy and the number of product molecule clusters in the respective test set were calculated (as illustrated in Table 2 and Appendix 2). Table 2 shows the correlation between per class accuracy and the number of product molecule clusters in the each of the five test sets. Cutoff for Taylor- Butina clustering was set to 0.8. A complete breakdown of clusters per label is presented in Appendix 2. A statistically significant negative correlation is observed between per class accuracy and the number of product molecule clusters (ranging from -0.4306 to -0.6619

(pO.OOl, by student’s t-test), Table 2). For instance, the best performing label in the set with rule occurrence cutoff of 100 {CC(0)=O.Ncl[nH][n][n][n]l»CC(=0)Ncl[nH][n][n][n]l with an accuracy of 1.0) had only one cluster, while one of the worst performing labels (CN(C)C ()»CNC with an accuracy of 0.5223) had 13 clusters (as illustrated in Fig. 3). This suggests that reduced chemical diversity is associated with increased model performance (generalizability vs. specificity) and that the deep highway networks learned more generalizable patterns for classification on chemical subsets with high similarity (characterized by a small number of clusters) (as illustrated in Fig. 3).

Table 2

[0057] Thus, one strategy to improve model performance would be to stratify /balance datasets based on molecular similarity ( e.g ., use an algorithm such as Taylor-Butina to cluster product molecules in the dataset). However, this would require formulating a priori assumptions about the dataset that could harm the generalizability and applicability of retrosynthetic reaction prediction. In short, by taking this approach, a model could learn to predict reaction rules for a very specific type of product molecule (highly populated cluster), but struggle with others (in less populated clusters) even if they were obtained using the same reaction rule.

[0058] Accordingly, an alternative route to improve model performance is to build smaller and more focused retrosynthetic models on a smaller number of similar reaction rules (e.g., reaction grouping).

[0059] Referring to Fig. 3, the chemical diversity associated with reaction rule classes is illustrated according to embodiments of the present disclosure. Representative examples of clusters (most populated) of product molecules in test set with rule occurrence > 100 are given for the best (in Fig. 3A) and worst (in Fig. 3B) performing classes

(CC(0)=0.Ncl [nH] [n] [n] [n] l»CC(=0)Ncl [nHJ [n] with 1.000 accuracy, and

CN(C)C=0»CNC with 0.5223 accuracy, respectively). Each dividing line represents a subset of members of different clusters. Clusters were obtained using the Taylor-Butina algorithm implemented in RDKit (cutoff = 0.8).

[0060] In various embodiments, classifying on multiscale reaction rules improves deep highway network performance, as set out below. In various embodiments a, strategy is employed that groups similar reaction rules together (termed reaction groups), thus creating a multiscale representation of each individual reaction rule. Each reaction rule has group and rule information, as illustrated in Fig. 4 and Fig. 10. This approach is similar to assigning a reaction type to each reaction in the dataset as a preprocessing step. Reaction type assignment may be performed based on a known, predefined set of reaction types ( e.g ., using NameRXN, https://www.nextmovesoftware.com/namerxn.html). However, the approach described herein provides for reaction similarity search that is entirely data-driven and can easily be extended to any reaction dataset.

[0061] Fig. 4 provides a visualization of identified reaction groups in the multiscale dataset. Distribution of first (in Fig. 4A) and last (in Fig. 4B) 17 reaction groups obtained from reaction rule clustering. Reaction rule clustering was performed with Taylor-Butina (cutoff = 0.7). The distribution of reaction groups was obtained using t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction over 2048 bits (dimensions) of the binary reaction fingerprints. The insets show structure of some representative reaction rules within the most populated groups. The distribution of the remaining reaction groups is presented in Fig. 10. [0062] In an exemplary embodiment, reaction rule grouping is performed by clustering reaction rule fingerprints, built from the reaction rules used in the five derived datasets, with RDKit (using cutoff = 0.7). Reaction rules that were not placed into groups by the clustering algorithm were grouped together into a single group (the last group for each data set, Appendix 5), in order to still consider them in the group model. The same five datasets as in the previous rule-based DHN models described above are employed. However, stratification of training (80% of total data) and test data (20% of total data) was based on balancing reaction group number followed by balancing reaction rules.

[0063] The DHNs were then trained on the multiscale reaction group labels (as described in Fig. 2C). Specifically, a DHN was trained to predict which reaction group (consisting of similar reactions obtained with reaction clustering) was employed to produce a molecule. Once the reaction group was predicted, another DHN (specific to the reaction group) was trained to predict the corresponding reaction rule (using only the reaction rules within the predicted reaction group, as opposed to all the derived rules as in our first two models as shown in Fig. 2B). The first step, corresponding to reaction group prediction, was performed for each of the five datasets to select an appropriate balance between reaction rules (contained within the reaction groups) and model performance (measured by balanced classification accuracy, as illustrated in Table 3 and

Appendix 3). Table 3 illustrates performance of reaction group classification. Standard deviation is shown in parentheses. A breakdown of per class performance is presented in

Appendix 3.

Table 3

[0064] Based on this, training of specific reaction groups models was continued using the dataset with reaction rule occurrence > 100 (hereafter referred to as multiscale dataset), resulting in 68 additional DHN models, one for each specific reaction group (as shown in Table 4). Table 4 provides a summary of per label classification performance for the multiscale dataset (with rule occurrence cutoff > 100). Average was taken over per rule balanced accuracies within each reaction group in the test set. Standard deviations are shown in parentheses. A complete breakdown of each group performance is presented in Appendix 4.

Table 4

[0065] Overall, classifiers trained using multiscale reaction rules (with group and rule information) performed significantly better ( e.g ., mean accuracy increase from 0.7863 to 0.8982) than the rule reaction models (as shown in Fig. 5 and Fig. 11).

[0066] Fig. 5 illustrates classification performance increases in the multiscale models.

Percentage difference in balanced accuracy for the first four (Fig. 5A) and the last eight (Fig.

5B) reaction groups derived from the multiscale set (rule occurrence > 100). The difference was taken between accuracies in the multiscale models and all rules models. The difference for the rest of the reaction groups is presented in Fig. 11.

[0067] The DHN models built using the multiscale dataset at the reaction group scale had a mean balanced accuracy of 0.8516 (as shown in Table 3) and a mean balanced accuracy of 0.8982 at the reaction rule scale (as shown in Table 4). This is an improvement compared to the previous DHN reaction-rule classifier for the same dataset, with average balanced accuracy of 0.7863 (Table 1). With the multiscale approach, 384 out of 462 reaction rules showed an accuracy increase (83%, as shown in Fig. 5 and Fig. 11). The smaller classifiers built on reaction rules from the same reaction group achieved near perfect classification in many examples for the multiscale set (50 rules with a balanced accuracy >0.99, as shown in Appendix

4)·

[0068] This enhancement in classification performance may be attributed to reduced number of classification labels for each model (e.g., from 462 rule labels to 68 group labels for the multiscale dataset). After reaction grouping, the largest number of labels for multinomial classification was 50, and several of the reaction group models were reduced to binomial classifiers (as illustrated in Table 4), which contributed to the observed performance increase in classification.

[0069] In various embodiments, retrosynthetic reactions are predicted with the multiscale reaction rule models provided herein. Retrosynthetic analysis approaches described work at two levels. In the first level, a DHN classifier predicts a (multiscale) reaction rule employed to make a product molecule, then, at the transformation level, the predicted reaction rule is applied to the molecule ( e.g ., using RDKit). If the predicted transformation is valid, a list of precursor molecules (reactants) is generated. This multiscale approach for reaction prediction

outperformed the previous rules-based reaction classification (e.g., 0.8982 vs. 0.7863 accuracy for multiscale and rule-only classification, respectively), which corresponds to the first or upper level of retrosynthetic analysis. To verify the applicability of these multiscale approaches for reactant generation (the second or transformation level of the approach), the first retrosynthetic step of a variety of approved small molecules obtained from DrugBank (Table 5) are predicted.

[0070] As set out below, to compare the performance of the two models at the reactant- generation level, the top prediction of the rule-based model, and the top group and top rule prediction of the multiscale model are considered. The transformation in the top prediction of each model was applied to each molecule in the DrugBank set using RDkit. This resulted in a subset of 40 small molecules with known synthetic routes (obtained from Pharmacodia, http://en.pharmacodia.com) for which the top prediction of either model yielded a retrosynthetic transformation that is consistent with a known precluding synthetic step (as illustrated in Table

5)· [0071] Table 5 provides a summary of small molecules employed for Rule-based and Multiscale model validation. Corresponding literature associated with each predicted reaction is included (patent or journal article). The number of total valid predicted retrosynthetic steps (based on match with known routes) is shown in the last row for each model.

Table 5

[0072] This subset includes a wide variety of small molecules, including antiviral and anticancer drugs ( e.g ., abacavir and dasatinib, respectively). For antiviral drug telaprevir, the multiscale model predicted two different valid retrosynthetic steps using the same multiscale rule. Thus, the total number of predicted retrosynthetic steps used to compare models was 41, including two different valid synthetic steps for telaprevir. Although U.S. patents have been filed associated with the tested small molecules (e.g., US5034394A for abacavir), these reactions were not included in any of the training/test datasets. Therefore, the models had no a priori knowledge of these molecules or their synthetic pathways.

[0073] The multiscale approach produced 34 retrosynthetic predictions out of 41 (82.9% of the total predicted retrosynthetic steps)) that were consistent with known synthetic routes of the tested molecules (as illustrated in Table 5). In contrast, the rule-only model produced 24 predictions out of 41 consistent with known routes (58.5% of the total predicted retrosynthetic steps). This indicated that the multiscale approach not only outperformed the rule-based model at the multinomial classification (upper) level, but also at the reactant-generation

(transformation) level. Of the resulting valid predictions, both models made 17 common calls that were consistent with known synthetic steps (illustrated in Fig. 6 and Fig. 12).

[0074] Fig. 6 illustrates overlapping retrosynthetic reaction predictions generated with multiscale and rule-based models. Fig. 6A shows different valid predictions obtained with each model for two small molecules. Multiscale probabilities (for reaction group and reaction rule) are shown in parentheses in the figures. Fig. 6B shows examples of the same predictions obtained with both models. The remaining examples are presented in Fig. 12.

[0075] For two of these predictions (abacavir and mycophenolate mofetil), the top prediction of each model yielded different, valid retrosynthetic steps (shown in Fig 6A). These

transformations included acetylation (transformation with rule-based prediction for abacavir), and functional group removal (transforming mycphonelate mofetil to mycophenolic acid by removing 4-(2hydroxyethyl) morpholinem). For the remaining 15 common predictions, transformations yielded the same retrosynthetic step (as shown in Fig. 6B and Fig. 12), which in most cases included the main precursor reactant and reported reagent molecule. Reactions predicted in these cases were diverse, and included functional group protection deprotections and interconversions (for cidofovir and selexipag, respectively), and heterocycle formation (for tezacaftor).

[0076] The multiscale model made 13 calls that yielded a known retrosynthetic step for the tested small molecules, which the rule-based model missed (illustrated in Fig. 8 and Fig. 14). Some of the miscalls by the rule-based model were subtle, missing a functional group in an otherwise similar reactant (predicting a carbonyl group instead of Br in darifencin, Fig. 8), which highlighted the advantage of the multiscale approach in recognizing structural patterns within the product molecules over the model trained on the entire dataset. In contrast, the rule-based model made 7 calls matching known precluding steps that the multiscale model missed (illustrated in Fig. 7 and Fig. 13). In most of these cases, the multiscale model predicted the correct region were the retrosynthetic transformation occurs, but with the incorrect functional group ( e.g ., protection of OH group in pralatrexate).

[0077] Fig. 7 illustrates correct calls obtained with a rule-only model. In particular, it provides examples of reactions that were correctly predicted by the rule-based model but miscalled with the multiscale model. The rest of the reactions are shown in Fig. 13.

[0078] Fig. 8 illustrates correct predictions obtained with the multiscale model. In particular, it provides examples of retrosynthetic reactions that were correctly predicted by the multiscale model but miscalled with the rule-only model. The rest of the reactions are shown in Fig. 14.

[0079] As set out here, reaction rule size affects retrosynthetic reaction prediction. In addition to the aforementioned predictions, the multiscale model made 4 partial calls that are consistent with the known precluding step (as illustrated in Fig. 9). In these four cases, the multiscale model correctly predicted the addition of a protecting group to the product molecule as part of the retrosynthetic step; however, the group did not fully match the protecting group in the known reaction, and instead resulted in the addition of a closely related functional group (illustrated in Fig. 9)

[0080] Fig. 9 illustrates partially correct predictions obtained with the multiscale model. In particular, a summary is provided of retrosynthetic reactions that were partially predicted by the multiscale model. In these cases, the functional group addition predicted by the multiscale model is limited by the results of the reaction rule extraction step, which only included reactive center and its first-neighbor atoms. The multiscale model did not know the rule to make the ground truth prediction, which would require more detailed rules from the dataset ( e.g ., first and second atoms).

[0081] Specifically, the multiscale model predicted the addition of maleimide instead of phthalimide (for antidiabetic drugs alogliptin and linagliptin), or TMS (Trimethylsilyl) instead of TBS (tert-Butyldimethylsilyl). This was likely due to an inherent limitation of our rule extraction step, which only considered the reactive center and its first shell of neighboring atoms (as discussed with regard to Fig. 1). Under this scheme, the model was not able to learn on a reaction rule that would yield the addition of a larger functional group (for example, TBS), which would require to extend rule extraction beyond only first neighbors. However, even with this inherent limitation, the multiscale model was able to make a partial prediction that is consistent with the known reaction step. In contrast, the rule-based model was not able to make the correct predictions, even though it was trained with the same set of reaction rules. This observation demonstrates the advantage of the multiscale approach over the rule-only models trained on the same dataset for retrosynthetic reaction prediction.

[0082] In an exemplary embodiment, reaction rules are extracted form a patent dataset. The set of over one million chemical reactions extracted from United States granted patents from the years between 1976 and 2016 was employed. This dataset is freely available online. The reactions found in this dataset were preprocessed with RDKit to eliminate reagents, in order for reaction to only contain reactants and products before the rule extraction step. This step was performed to minimize the possibility of incorrect mapping (by taking into account reagents), leading to incorrect reaction rule assignment. After this preprocessing step, reactions were atom mapped using the Indigo toolkit.

[0083] Rule extraction was implemented as shown in Fig. 1, employing a strategy described in detail above. The rule extraction step was performed with custom scripts using RDKit. Briefly, for each mapped reaction, the reactive core ( e.g ., atoms and bonds that change between reactants and products) was identified by comparing the attributes of corresponding mapped atoms. The considered attributes included charge, bond type, valence and number of neighbors, as previously described. The reactive center was extended to include first neighbors, in order to include more details about the chemical structure of the reactive center. The extracted small and large reaction rules were employed as labels for the reaction classification task. A total of 74,482 unique reaction rules were extracted from the dataset, respectively.

[0084] Training and testing sets were prepared for rule-based and multiscale retrosynthetic reaction prediction. Product molecules were encoded as Morgan fingerprints (FP), a form of extended- connectivity fingerprints (ECFP). Each molecule was converted to a 2048-bits FP (of radius 2) and vectorized using RDKit. The resulting vectors were employed as the input data for the deep learning (DL) models.

[0085] As noted above, various embodiments described herein use Morgan fingerprints having a length of 2048 bits. However, it will be appreciated that the present disclosure is applicable to additional fingerprints. For example, in additional embodiments, a torsion fingerprint may be used. In some embodiments, a three-dimensional fingerprint may be used. One example of a three-dimensional fingerprint, the extended three-dimensional fingerprint (E3FP), is described in Axen, el al. A Simple Representation of Three-Dimensional Molecular Structure. J. Med. Chem. 2017, 60, 7393-7409. Three-dimensional fingerprints explicitly account for 3D structural patterns of molecules. It will also be appreciated that various alternative fingerprint lengths may be adopted according to the present disclosure. In general, a fingerprint length that is sufficiently large to avoid bit collisions within the dataset of interest is desirable. Likewise, a fingerprint length that is short enough to maintain a space-efficient encoding is desirable. In particular, an excessively large fingerprint results in a sparse encoding that is not space-efficient. For the exemplary datasets described herein, a 2048-bit encoding represents a balance between these considerations.

[0086] Five sub-datasets were generated by defining a cutoff on the number of times that each reaction rule occurs (as illustrated in Table 1). The cutoffs employed were 50, 100, 250, 500 and 1000. These numbers were selected to maintain robustness in the dataset while preserving diversity in the number of reaction rules employed.

[0087] To build DL models for classification on a smaller number of similar reactions for the multiscale approach, we performed grouping of the reaction rules, and use reaction group membership as labels for classification to generate additional sub-datasets for modeling. For this step, each reaction rule was encoded as a difference reaction FP, and the pair-wise Tanimoto similarity matrix of these FPs was built. Grouping (clustering) was performed on this distance matrix with the Taylor-Butina method as implemented in RDKit37 (with a cutoff of 0.7, or distance to cluster center of 0.3). Finally, corresponding group labels were assigned to each product in the five datasets (as illustrated in Table 3). To visually inspect the resulting reaction groups, t-SNE was employed for dimensionality reduction.

[0088] Finally, the datasets were split 8:2 for training and testing, respectively. For each model, the same datasets were employed, but the data was cut differently, depending on the labels for classification. Data was stratified using reaction rules or cluster labels, in order to take care of the data imbalance in the dataset. This approach was taken instead of using a balanced dataset (with same number of samples per class) to account for the data imbalance that is expected in a typical chemical dataset, were some reaction would be more represented than others.

[0089] To quantify the degree of chemical diversity within each reaction rule class, Taylor- Butina clustering was performed on the product molecules within each of the five testing sets (one for each rule occurrence cutoff, Appendix 2). Product molecules were encoded as 2048-bits FP (of radius 2) RDKit. The pair-wise Tanimoto similarity matrix was obtained, and Taylor- Butina clustering was performed with RDKit, using a cutoff = 0.8 (distance to cluster center of 0.2). To assess the effect of chemical diversity in classification performance, the correlation coefficient between per class balanced accuracy and cluster number was calculated (as illustrated in Table 2).

[0090] The retrosynthetic reaction prediction task was formulated as a multicass classification problem. A neural network architecture based on a combination of a hidden layer and highway networks was used. Briefly, highway networks differ from typical neural networks in that they employ gating mechanisms to regulate the flow of information (as illustrated in Fig. 2B). This allows for a portion of unmodified input to be passed across layers together with activations.

The same core architecture was employed for the rule-based and multiscale models.

[0091] Models were built using Keras with the TensorFlow back end. The hidden layer included 2048 neurons (one for each bit of the FP we employed), with an exponential linear unit (ELU) (followed by dropout value of 0.2). This was followed by five highway layers with rectified linear units (ReLU), followed with a dropout value of 0.1. The last layer of the network was a softmax to output class label probabilities. All of the layers in the network had normal initialization. The ADAM optimizer (with learning rate of 0.001) was used to minimize the binary cross entropy loss function for classification. Class weights (determined by the number of samples in each class) were implemented to take care of data imbalance in the training set. As mentioned before, this approach was employed to consider the data imbalance expected in a typical chemical dataset. The number of training epochs for the models was determined by early stopping (with patience of 2), implemented by monitoring the loss on a validation dataset (10% of the training set).

[0092] Fig. 10 illustrates the distribution of reaction groups 18 to 51. The distribution of reaction groups was obtained using t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction over 2048 bits (dimensions) of the binary reaction fingerprints. The distribution of the remaining reaction groups is presented in Fig. 4.

[0093] Fig. 11 illustrates the percentage difference in balanced accuracy for reaction

classification in reaction groups 5 to 60. The difference was taken between accuracies in the multiscale models and all rules models. The difference for the rest of the reaction groups is presented in Fig. 5.

[0094] Fig. 12 illustrates examples of overlapping retrosynthetic reaction predictions generated with multiscale and rule-based models. The remaining examples are presented in Fig. 6.

[0095] Fig. 13 illustrates examples of correct calls obtained with rule-only model, but miscalled with the multiscale model. The rest of the reactions are shown in Fig. 7.

[0096] Fig. 14 illustrates examples of correct calls obtained with multiscale model, but miscalled with the rule-only model. The rest of the reactions are shown in Fig. 8.

[0097] Appendix 1 provide a performance summary of models trained for multinomial classification using reaction rules as labels. Results obtained with the test set are presented. [0098] Appendix 2 illustrates chemical diversity within each test dataset. Diversity was quantified as the number of clusters, using the Taylor-Butina algorithm with a cutoff = 0.8.

[0099] Appendix 3 illustrates performance of models trained for multinomial classification using reaction groups as labels. Results obtained with the test set are presented.

[0100] Appendix 4 provides a comparison of test set balanced accuracies for rule-based and multiscale models.

[0101] Referring to Fig. 15, a data-driven, multiscale approach based on deep highway networks (DHNs) and reaction rule classification for retrosynthetic reaction prediction is illustrated according to the present disclosure. In various embodiments, reaction prediction is performed in two steps. First, a DHN 1501 is used to predict which reaction group (consisting of reaction rules grouped by chemical similarity) was used to make a molecule. Once a reaction group prediction is made, a more specific model 1502, trained only on reaction rules within the predicted reaction group, is employed to predict a reaction rule. This results in a larger number of DHN models for the multiscale model (determined by the number of reaction groups extracted from the dataset), as opposed to a single DHN model for a rule-based approach. Once a reaction rule is obtained, the transformation is applied to the input molecule to derive chemically viable reactants.

[0102] As set out above, to compare the performance of the multiscale approach to the rule- based model, a set of approved small molecules was employed and their first precluding synthetic step was predicted. The multiscale model outperforms the rule-based multinomial classification approach, where a model is trained to make predictions over all the reaction rules of the dataset, both at the classification level (the multiscale model has a higher average balanced accuracy), and the reactant-generation level (the multiscale model produces more reactions that match known synthetic routes). This indicates that the multiscale approach enhances the performance of DL for retrosynthetic reaction prediction task relative to alternative models. Moreover, due to the molecular featurization used in various embodiments (fingerprints), this approach can easily be integrated into cheminformatic platforms for synthesis planning.

[0103] In various embodiments, a size restriction in the rule extraction step may have an impact in the chemical structure of predicted reactants. Accordingly, in alternative embodiments, a flexible reaction rule extraction step is performed , in which the shell of neighboring atoms around the reactive center is not fixed in size. Reactive centers and reaction rules may be learned, for example, by crawling along the edges of a molecular graph, parametrized by neural networks. In this flexible reaction rule extraction step, functional groups may be learned by exploring the neighborhood of the reactive center in both sides of a chemical reaction, and comparing reactants and products parametrized as molecular graphs, instead of mapped atoms.

[0104] In various embodiments, the multiscale model described herein may be integrated with other DL models built for optimizing chemical reaction conditions, for example a deep reinforcement learning model, to predict complete retrosynthetic routes. This allows the introduction of information in the model about the conditions in which the reaction occurs. This facilitates complete chemical synthesis planning.

[0105] With reference now to Fig. 16, a method of retrosynthetic analysis is illustrated. At 1601, a molecular fingerprint is determined for a chemical product. At 1602, the molecular fingerprint is provided to a first trained classifier. At 1603, a candidate reaction group is obtained from the first trained classifier. At 1604, a second trained classifier is selected from a plurality of trained classifiers, the second trained classifier corresponding to the candidate reaction group. At 1605, the molecular fingerprint is provided to the second trained classifier.

At 1606, a candidate reaction rule is obtained from the second trained classifier. At 1607, a plurality of reactants yielding the chemical product is determined from the candidate reaction rule.

[0106] With reference now to Fig. 17, a method of retrosynthetic analysis is illustrated. At 1701, a plurality of reaction rules is clustered into a plurality of groups. At 1702, a first classifier is trained to select one of the plurality of groups based on an input molecular fingerprint of a chemical product. At 1703, a plurality of classifiers are trained, each associated with one of the plurality of groups, to select one of the plurality of reaction rules from the associated group based on an input molecular fingerprint of a chemical product.

[0107] Referring now to Fig. 18, a schematic of an example of a computing node is shown. Computing node 10 is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments described herein. Regardless, computing node 10 is capable of being implemented and/or performing any of the functionality set forth hereinabove.

[0108] In computing node 10 there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or

configurations that may be suitable for use with computer system/server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.

[0109] Computer system/server 12 may be described in the general context of computer system- executable instructions, such as program modules, being executed by a computer system.

Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a

communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

[0110] As shown in Fig. 18, computer system/server 12 in computing node 10 is shown in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including system memory 28 to processor 16.

[0111] Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and Advanced Microcontroller Bus Architecture (AMBA).

[0112] Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.

[0113] System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk ( e.g ., a "floppy disk"), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.

[0114] Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments as described herein.

[0115] Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices ( e.g ., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples, include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.

[0116] The present disclosure may be embodied as a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.

[0117] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media ( e.g ., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0118] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

[0119] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more

programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the“C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0120] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

[0121] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

[0122] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0123] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0124] The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

CLAIMS What is claimed is:

1. A method comprising:

determining a molecular fingerprint for a chemical product;

providing the molecular fingerprint to a first trained classifier;

obtaining from the first trained classifier a candidate reaction group;

selecting a second trained classifier from a plurality of trained classifiers, the second trained classifier corresponding to the candidate reaction group;

providing the molecular fingerprint to the second trained classifier;

obtaining from the second trained classifier a candidate reaction rule.

2. The method of claim 1, further comprising:

determining from the candidate reaction rule a plurality of reactants yielding the chemical product.

3. The method of claim 1, wherein the first trained classifier comprises an artificial neural network.

4. The method of claim 3, wherein the artificial neural network comprises a highway network.

5. The method of claim 1, wherein the second trained classifier comprises an artificial neural network.

6. The method of claim 5, wherein the artificial neural network comprises a highway network.

7. The method of claim 1 , wherein the candidate reaction rule comprises a reaction center.

8. The method of claim 7, wherein the candidate reaction rule further comprises a shell of neighboring atoms.

9. The method of claim 8, wherein the shell consists of first-neighboring atoms.

10. The method of claim 1, wherein the molecular fingerprint has a length of about 2 kbits.

11. The method of claim 1 , wherein the molecular fingerprint comprises a Morgan fingerprint.

12. The method of claim 1, wherein the molecular fingerprint comprises an extended- connectivity fingerprint.

13. The method of claim 1 , wherein the molecular fingerprint comprises a torsion fingerprint.

14. The method of claim 1, wherein the molecular fingerprint comprises a three-dimensional fingerprint.

15. A system comprising:

a computing node comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising:

determining a molecular fingerprint for a chemical product;

providing the molecular fingerprint to a first trained classifier; obtaining from the first trained classifier a candidate reaction group;

providing the molecular fingerprint to the second trained classifier;

obtaining from the second trained classifier a candidate reaction rule.

16. The system of claim 15, the method further comprising: determining from the candidate reaction rule a plurality of reactants yielding the chemical product.

17. A computer program product for retrosynthetic analysis, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:

determining a molecular fingerprint for a chemical product;

providing the molecular fingerprint to the second trained classifier;

obtaining from the second trained classifier a candidate reaction rule.

18. The computer program product of claim 17, the method further comprising:

19. A method comprising:

clustering a plurality of reaction rules into a plurality of groups;

training a first classifier to select one of the plurality of groups based on an input molecular fingerprint of a chemical product;

training a plurality of classifiers, each associated with one of the plurality of groups, to select one of the plurality of reaction rules from the associated group based on an input molecular fingerprint of a chemical product.

20. The method of claim 19, wherein the plurality of reaction rules is clustered according to chemical similarly.

21. The method of claim 19, wherein the plurality of reaction rules is clustered by

unsupervised learning.

22. The method of claim 19, wherein the plurality of reaction rules is clustered by the Taylor- Butina algorithm.

23. The method of claim 19, wherein clustering comprises assigning a reaction type

24. The method of claim 19, wherein the first classifier comprises an artificial neural network.

25. The method of claim 24, wherein the artificial neural network comprises a highway network.

26. The method of claim 19, wherein each of the plurality of classifiers comprises an artificial neural network.

27. The method of claim 26, wherein the artificial neural network comprises a highway network.

28. The method of claim 19, wherein each reaction rule comprises a reaction center

29. The method of claim 28, wherein each reaction rule further comprises a shell of neighboring atoms.

30. The method of claim 29, wherein the shell consists of first-neighboring atoms

31. The method of claim 19, wherein the molecular fingerprint has a length of about 2 kbits

32. The method of claim 19, wherein the molecular fingerprint comprises a Morgan fingerprint.

33. The method of claim 19, wherein the molecular fingerprint comprises an extended- connectivity fingerprint.

34. The method of claim 19, wherein the molecular fingerprint comprises a torsion fingerprint.

35. The method of claim 19, wherein the molecular fingerprint comprises a three- dimensional fingerprint.

36. The method of claim 19, further comprising:

determining a molecular fingerprint for a chemical product;

providing the molecular fingerprint to the first classifier;

obtaining from the first classifier a candidate reaction group;

selecting a second classifier from the plurality of classifiers, the second classifier corresponding to the candidate reaction group;

providing the molecular fingerprint to the second classifier;

obtaining from the second trained classifier a candidate reaction rule.

37. The method of claim 36, further comprising:

38. A system comprising:

clustering a plurality of reaction rules into a plurality of groups; training a first classifier to select one of the plurality of groups based on an input molecular fingerprint of a chemical product;

39. A computer program product for retrosynthetic analysis, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:

clustering a plurality of reaction rules into a plurality of groups;

40. A system comprising:

a first trained classifier, adapted to receive a molecular fingerprint for a chemical product, and determine therefrom a candidate reaction group from a plurality of candidate reaction groups;

a plurality of trained classifiers, each associated with one of the plurality of groups, each adapted to select a candidate reaction rule from its associated group based on an input molecular fingerprint of a chemical product.

41. The system of claim 40, wherein the first trained classifier comprises an artificial neural network.

42. The system of claim 41, wherein the artificial neural network comprises a highway network.

43. The system of claim 40, wherein the plurality of trained classifiers each comprise an artificial neural network.

44. The system of claim 43, wherein the artificial neural network comprises a highway network.

45. The system of claim 40, wherein the candidate reaction rule comprises a reaction center

46. The system of claim 45, wherein the candidate reaction rule further comprises a shell of neighboring atoms.

47. The system of claim 46, wherein the shell consists of first-neighboring atoms

48. The system of claim 40, wherein the molecular fingerprint has a length of about 2 kbits

49. The system of claim 40, wherein the molecular fingerprint comprises a Morgan fingerprint.

50. The system of claim 40, wherein the molecular fingerprint comprises an extended- connectivity fingerprint.

51. The system of claim 40, wherein the molecular fingerprint comprises a torsion fingerprint.

52. The system of claim 40, wherein the molecular fingerprint comprises a three-dimensional fingerprint.