WO2020190887A1

WO2020190887A1 - Methods and systems for de novo molecular configurations

Info

Publication number: WO2020190887A1
Application number: PCT/US2020/023009
Authority: WO
Inventors: Alireza SHABANI; Hanwool YOON; Limeng Pu; Edward Williams; Adam Simon
Original assignee: Qulab Inc.
Priority date: 2019-03-18
Filing date: 2020-03-16
Publication date: 2020-09-24

Abstract

The present disclosure provides methods and systems for identifying a small molecule configured to modulate a macromolecule described herein. The methods and systems generally use generative modeling artificial intelligence (AI) procedures to identify the small molecule based on its energy of interaction with an interaction site of the macromolecule.

Description

METHODS AND SYSTEMS FOR DE NOVO MOLECULAR CONFIGURATIONS

CROSS-REFERENCE

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 62/820,105, filed March 18, 2019, which is entirely incorporated herein by reference for all purposes.

BACKGROUND

[0002] In chemistry and biology, the identification and the prediction of molecules capable of performing a particular function (such as modulating the activity of a target macromolecule) have significant importance as molecular function is inherently embedded in the physical and chemical interactions of the molecule with its target. Experimental approaches to identifying such molecules may require laborious chemical synthesis and may thus be slow or expensive. Existing computational approaches to identifying such molecules may rely on searching a very small portion of the available chemical space and may thus fail to consider a large majority of molecules that may better perform the function. Other existing computational approaches may require a random exhaustive search of the available chemical space and may thus be slow or expensive. Thus, there is a need for methods and systems for efficiently identifying and predicting molecules capable of modulating the activity of a target macromolecule.

SUMMARY

[0003] Recognized herein is the need for methods and systems for efficiently identifying and predicting molecules capable of modulating the activity of a target macromolecule.

[0004] The present disclosure provides methods and systems for de novo molecular configuration or design. Systems and methods provided herein may utilize generative modeling artificial intelligence (AI) procedures to produce a plurality of structures of a plurality of small molecules and to determine an energy of interaction of each of the plurality of small molecules with an interaction site of a macromolecule and to then determine a candidate small molecule based at least in part on the energy of interaction.

[0005] In an aspect, the present disclosure provides a method for identifying a small molecule configured to modulate a macromolecule, comprising: (a) obtaining (i) a representation of the macromolecule, wherein the macromolecule comprises an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule; (b) using at least one computer processor to individually or collectively perform a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises the small molecule having a structure of the plurality of structures; (c) using at least one computer processor to individually or collectively determine an energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule, to identify the small molecule having the structure; and (d) electronically outputting a report corresponding to the structure of the small molecule.
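As a rough illustration of steps (a)-(d), the following Python sketch wires the stages together. It is a minimal sketch under stated assumptions, not the disclosed implementation: `generate_structures` and `interaction_energy` are hypothetical stand-ins for the generative modeling AI procedure and for the energy calculation, the `Report` type and the site label "pocket-A" are invented for illustration, and the toy scoring rule has no chemical meaning.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Report:
    structure: str   # e.g. a SMILES string for the selected small molecule
    energy: float    # calculated energy of interaction (arbitrary units here)


def identify_small_molecule(
    site: str,
    generate_structures: Callable[[str], List[str]],
    interaction_energy: Callable[[str, str], float],
    threshold: Optional[float] = None,
) -> Optional[Report]:
    """Generate candidates (b), score them (c), and return a report (d)."""
    candidates = generate_structures(site)
    scored = [(interaction_energy(s, site), s) for s in candidates]
    if threshold is not None:
        # Optional variant: keep only candidates at or below a threshold energy.
        scored = [pair for pair in scored if pair[0] <= threshold]
    if not scored:
        return None
    energy, structure = min(scored)  # lowest energy of interaction wins
    return Report(structure=structure, energy=energy)


# Hypothetical stand-ins, for illustration only.
def toy_generator(site: str) -> List[str]:
    return ["CCO", "c1ccccc1O", "CC(=O)N"]


def toy_energy(smiles: str, site: str) -> float:
    # Placeholder score: longer strings get lower "energy". Not real chemistry.
    return -float(len(smiles))


report = identify_small_molecule("pocket-A", toy_generator, toy_energy)
print(report)
```

In practice the two callables would be replaced by a trained generative model and by a docking or energy calculation; the selection logic around them may stay largely the same.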

[0006] In some embodiments, the generative modeling AI procedure comprises a machine learning (ML) algorithm. In some embodiments, the ML algorithm comprises at least one ML training algorithm. In some embodiments, the ML algorithm comprises at least one ML inference algorithm. In some embodiments, the generative modeling AI procedure comprises at least one reinforcement learning (RL) procedure. In some embodiments, the generative modeling AI procedure comprises at least one tree search method. In some embodiments, the generative modeling AI procedure comprises at least one evolutionary algorithm. In some embodiments, the generative modeling AI procedure comprises at least one genetic algorithm. In some embodiments, the generative modeling AI procedure comprises at least one simulated annealing algorithm. In some embodiments, the method further comprises, prior to (b), training the generative modeling AI procedure to generate the plurality of structures. In some embodiments, training the generative modeling AI procedure comprises using a training library. In some embodiments, the training library comprises at least one representation selected from the group consisting of a simplified molecular-input line-entry system (SMILES) structure, a Wiswesser line notation, a ROSDAL representation, a SYBYL Line Notation (SLN), a structural drawing, a common name, a trivial name, an International Union of Pure and Applied Chemistry (IUPAC) name, a Chemical Abstracts Service (CAS) number, an International Chemical Identifier (InChI) identifier, a three-dimensional (3D) molecular structure, a molecule graph, and a molecular fingerprint. In some embodiments, the at least one representation is independent of the plurality of structures. In some embodiments, (c) comprises processing the energy of interaction of each of the plurality of small molecules against a threshold value to identify the small molecule.
In some embodiments, the method further comprises using at least one computer processor to individually or collectively perform an alteration artificial intelligence (AI) procedure to alter at least a subset of the plurality of structures based at least in part on one or more chemical properties of at least a subset of the plurality of small molecules. In some embodiments, the one or more chemical properties comprise one or more members selected from the group consisting of: physiological absorption of the plurality of small molecules, distribution of the plurality of small molecules, metabolism of the plurality of small molecules, excretion of the plurality of small molecules, toxicity of the plurality of small molecules, and ease of synthesis of the plurality of small molecules. In some embodiments, the macromolecule is a protein. In some embodiments, the interaction site is a protein binding site. In some embodiments, the protein binding site comprises a protein binding pocket. In some embodiments, (c) comprises using one or more non-classical computers to individually or collectively determine the energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule, to identify the small molecule having the structure. In some embodiments, the one or more non-classical computers comprise at least one quantum computer. In some embodiments, (c) comprises identifying the structure of the small molecule as having a minimum energy of interaction among structures of other small molecules of the plurality of small molecules. In some embodiments, the report includes the structure. In some embodiments, the report includes a calculated energy of interaction of the small molecule with the interaction site. In some embodiments, the method further comprises, subsequent to (c) and prior to (d), optimizing the structure of the small molecule.
In some embodiments, the method further comprises repeating (a)-(d) for each of a plurality of interaction sites of the macromolecule. In some embodiments, the report is generated within at most about 12 hours from a request to generate the report.

[0007] In some embodiments, (b) comprises, for each small molecule of the plurality of small molecules: (1) using at least one computer to individually or collectively perform (i) a first artificial intelligence (AI) procedure to identify one or more seed fragments of the small molecule, and (ii) a second AI procedure to identify one or more substituent fragments of the small molecule; and (2) using the one or more seed fragments and the one or more substituent fragments to generate the structure of the small molecule, which structure comprises at least one of the one or more seed fragments linked to at least one of the one or more substituent fragments.
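The linking step in (2) can be pictured with a toy enumeration. This is only a sketch under loud assumptions: fragments are written as SMILES-like strings, the `*` attachment marker and the helper names `link` and `enumerate_structures` are hypothetical, exhaustive enumeration is swapped in for the first and second AI procedures, and the length-based score is a placeholder rather than an energy of interaction.

```python
from itertools import product
from typing import Callable, List, Tuple


def link(seed: str, substituent: str) -> str:
    """Attach a substituent at the seed's single '*' attachment point."""
    if seed.count("*") != 1:
        raise ValueError("seed must contain exactly one '*' attachment point")
    return seed.replace("*", substituent)


def enumerate_structures(
    seeds: List[str],
    substituents: List[str],
    score: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Link every seed to every substituent and sort by ascending score."""
    structures = [link(s, r) for s, r in product(seeds, substituents)]
    return sorted((score(x), x) for x in structures)


# Hypothetical fragments; '*' marks where a substituent is attached.
seeds = ["c1ccccc1*", "C1CCCCC1*"]
substituents = ["O", "N", "C(=O)O"]

# Placeholder score (string length); a real score would be an interaction energy.
ranked = enumerate_structures(seeds, substituents, score=lambda s: float(len(s)))
print(ranked[0])
```

A learned selection procedure would replace the exhaustive product once the fragment libraries become too large to enumerate.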

[0008] In some embodiments, the first AI procedure or the second AI procedure comprises at least one machine learning (ML) algorithm. In some embodiments, the ML algorithm comprises at least one ML training algorithm. In some embodiments, the ML algorithm comprises at least one ML inference algorithm. In some embodiments, the first AI procedure or the second AI procedure comprises at least one reinforcement learning (RL) algorithm. In some embodiments, (1) comprises performing the first AI procedure to select the one or more seed fragments of the small molecule based at least in part on an energy of interaction of the one or more seed fragments with the interaction site. In some embodiments, (2) comprises linking the one or more substituent fragments to the one or more seed fragments to reduce the energy of interaction of the small molecule with the interaction site. In some embodiments, (1) comprises performing the first AI procedure to select the one or more seed fragments for the small molecule based at least in part on one or more chemical properties of the one or more seed fragments. In some embodiments, the one or more chemical properties of the one or more seed fragments comprise one or more members selected from the group consisting of: physiological absorption of the one or more seed fragments, distribution of the one or more seed fragments, metabolism of the one or more seed fragments, excretion of the one or more seed fragments, toxicity of the one or more seed fragments, and ease of synthesis of the one or more seed fragments. In some embodiments, (2) comprises linking the one or more substituent fragments to the one or more seed fragments based at least in part on one or more chemical properties of the one or more substituent fragments. 
In some embodiments, the one or more chemical properties of the one or more substituent fragments comprise one or more members selected from the group consisting of: physiological absorption of the one or more substituent fragments, distribution of the one or more substituent fragments, metabolism of the one or more substituent fragments, excretion of the one or more substituent fragments, toxicity of the one or more substituent fragments, and ease of synthesis of the one or more substituent fragments. In some embodiments, the one or more seed fragments or the one or more substituent fragments comprise one or more members selected from the group consisting of: atoms, molecules, and molecular fragments. In some embodiments, one or more of the one or more seed fragments are located a distance of at least 1 nanometer (nm) from the interaction site. In some embodiments, one or more of the one or more substituent fragments are located a distance of at most 1 nanometer (nm) from the interaction site.

[0009] In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a small molecule configured to modulate a macromolecule, the method comprising: (a) obtaining (i) a representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule; (b) performing a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises the small molecule having a structure of the plurality of structures; (c) determining an energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule, to identify the small molecule having the structure; and (d) electronically outputting a report corresponding to the structure of the small molecule.

[0010] In another aspect, the present disclosure provides a system for identifying a small molecule configured to modulate a macromolecule, comprising: a database comprising (i) a representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: obtain (i) the representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) the representation of the interaction site of the macromolecule; perform a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises the small molecule having a structure of the plurality of structures; determine an energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule, to identify the small molecule having the structure; and electronically output a report corresponding to the structure of the small molecule.

[0011] Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

[0012] Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

[0013] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

[0014] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:

[0016] FIG. 1 shows a computer control system that is programmed or otherwise configured to implement methods provided herein.

[0017] FIG. 2 shows a flowchart for an example of a method for identifying a small molecule configured to modulate a macromolecule.

[0018] FIG. 3 shows a flowchart for an example of a method for generating a plurality of small molecules using seed fragments and substituent fragments.

[0019] FIG. 4 shows an example of a scheme for identifying a small molecule configured to modulate a macromolecule using the methods and system described herein.

[0020] FIG. 5 shows an example of a system for identifying a small molecule configured to modulate a macromolecule.

[0021] FIG. 6 shows an example of a scheme for generating simplified molecular-input line-entry system (SMILES) structures using recurrent neural networks and reinforcement learning.

[0022] FIG. 7 shows an example of a scatter plot of docking scores and synthetic accessibility scores (SAscores) of small molecules associated with SMILES structures generated using recurrent neural networks and reinforcement learning.

[0023] FIG. 8 shows an example of histograms showing the distribution of small molecules before and after reinforcement learning optimization has shifted the generation of molecules toward enhanced docking and synthesizability.

[0024] FIG. 9 shows an example of small molecule structures generated without and with enforcement of high SAscores.

[0025] FIG. 10 shows an example of SMILES training data.

[0026] FIG. 11 shows an example of a SMILES structure generation procedure for enforcing chemical features on a small molecule.

[0027] FIG. 12 shows an example of a procedure for generating a three-dimensional (3D) structure of a small molecule by linking seed fragments and substituent fragments.

[0028] FIG. 13 shows an example of a reinforcement learning model for generating three-dimensional molecular structures.

[0029] FIG. 14 shows an example of a reinforcement learning model for generating three-dimensional molecular structures on an atom-by-atom basis.

[0030] FIG. 15 shows an example of optimization of a small molecule structure to enhance one or more chemical properties of the small molecule.

DETAILED DESCRIPTION

[0031] While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

[0032] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.

[0033] Whenever the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “at least,” “greater than,” or “greater than or equal to” applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.

[0034] Whenever the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “no more than,” “less than,” or “less than or equal to” applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.

[0035] Where values are described as ranges, it will be understood that such disclosure includes the disclosure of all possible sub-ranges within such ranges, as well as specific numerical values that fall within such ranges irrespective of whether a specific numerical value or specific sub-range is expressly stated.

[0036] As used herein, like characters refer to like elements.

[0037] As used herein, the terms “artificial intelligence,” “artificial intelligence procedure,” “artificial intelligence operation,” and “artificial intelligence algorithm” generally refer to any system or computational procedure that takes one or more actions to enhance or maximize a chance of successfully achieving a goal. The term “artificial intelligence” may include “generative modeling,” “machine learning” (ML), and/or “reinforcement learning” (RL).

[0038] As used herein, the terms “generative modeling,” “generative modeling procedure,” and “generative modeling artificial intelligence (AI) procedure” generally refer to any system or computational procedure that produces a representation or abstraction of observed phenomena or target variables that can be calculated from observations. Generative modeling may comprise one or more of generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models.

[0039] As used herein, the terms “machine learning,” “machine learning procedure,” “machine learning operation,” and “machine learning algorithm” generally refer to any system or analytical and/or statistical procedure that progressively improves computer performance of a task. Machine learning may include a machine learning algorithm. The machine learning algorithm may be a trained algorithm. Machine learning (ML) may comprise one or more supervised, semi-supervised, or unsupervised machine learning techniques. For example, an ML algorithm may be a trained algorithm that is trained through supervised learning (e.g., various parameters are determined as weights or scaling factors). ML may comprise one or more of regression analysis, regularization, classification, dimensionality reduction, ensemble learning, meta learning, association rule learning, cluster analysis, anomaly detection, deep learning, or ultra-deep learning. ML may comprise, but is not limited to: k-means, k-means clustering, k-nearest neighbors, learning vector quantization, linear regression, non-linear regression, least squares regression, partial least squares regression, logistic regression, stepwise regression, multivariate adaptive regression splines, ridge regression, principal component regression, least absolute shrinkage and selection operator, least angle regression, canonical correlation analysis, factor analysis, independent component analysis, linear discriminant analysis, multidimensional scaling, non-negative matrix factorization, principal components analysis, principal coordinates analysis, projection pursuit, Sammon mapping, t-distributed stochastic neighbor embedding, AdaBoosting, boosting, gradient boosting, bootstrap aggregation, ensemble averaging, decision trees, conditional decision trees, boosted decision trees, gradient boosted decision trees, tree search methods, Monte Carlo tree search, random forests, stacked generalization, Bayesian networks, Bayesian belief networks, naive Bayes, Gaussian naive Bayes, multinomial naive Bayes, hidden Markov models, hierarchical hidden Markov models, support vector machines, encoders, decoders, auto-encoders, stacked auto-encoders, perceptrons, multi-layer perceptrons, artificial neural networks, feedforward neural networks, convolutional neural networks, recurrent neural networks, long short-term memory, deep belief networks, deep Boltzmann machines, deep convolutional neural networks, deep recurrent neural networks, hill-climbing algorithms, grid searches, random searches, multi-objective searches, evolutionary algorithms, genetic algorithms, simulated annealing, or generative adversarial networks.

[0040] As used herein, the terms “reinforcement learning,” “reinforcement learning procedure,” “reinforcement learning operation,” and “reinforcement learning algorithm” generally refer to any system or computational procedure that takes one or more actions to enhance or maximize some notion of a cumulative reward through its interaction with an environment. The agent performing the reinforcement learning (RL) procedure may receive positive or negative reinforcements, called an “instantaneous reward,” from taking one or more actions in the environment and therefore placing itself and the environment in various new states.

[0041] A goal of the agent may be to enhance or maximize some notion of cumulative reward. For instance, the goal of the agent may be to enhance or maximize a “discounted reward function” or an “average reward function.” A “Q-function” may represent the maximum cumulative reward obtainable from a state and an action taken at that state. A “value function” and a “generalized advantage estimator” may represent the maximum cumulative reward obtainable from a state given an optimal or best choice of actions. RL may utilize any one or more of such notions of cumulative reward. As used herein, any such function may be referred to as a “cumulative reward function.” Therefore, computing a best or optimal cumulative reward function may be equivalent to finding a best or optimal policy for the agent.
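To make the two notions of cumulative reward concrete, the sketch below computes a discounted return and an average reward for a short list of instantaneous rewards. The function names and the discount factor symbol `gamma` are assumptions introduced for illustration; the source does not name them.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over an episode, accumulated from the last step."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def average_reward(rewards: Sequence[float]) -> float:
    """Mean instantaneous reward over the observed steps."""
    return sum(rewards) / len(rewards)


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
print(average_reward(rewards))                # 3 / 3 = 1.0
```

A smaller `gamma` weights near-term rewards more heavily, while `gamma = 1` reduces to the plain sum of rewards.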

[0042] The agent and its interaction with the environment may be formulated as one or more Markov Decision Processes (MDPs). The RL procedure may not assume knowledge of an exact mathematical model of the MDPs. The MDPs may be completely unknown, partially known, or completely known to the agent. The RL procedure may sit in a spectrum between the two extremes of “model-based” and “model-free” with respect to prior knowledge of the MDPs. As such, the RL procedure may target large MDPs where exact methods may be infeasible or unavailable due to an unknown or stochastic nature of the MDPs.

[0043] The RL procedure may be implemented using one or more computer processors described herein. The digital processing unit may utilize an agent that trains, stores, and later on deploys a “policy” to enhance or maximize the cumulative reward. The policy may be sought (for instance, searched for) for a period of time that is as long as possible or desired. Such an optimization problem may be solved by storing an approximation of an optimal policy, by storing an approximation of the cumulative reward function, or both. In some cases, RL procedures may store one or more tables of approximate values for such functions. In other cases, RL procedures may utilize one or more “function approximators.”
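One common way to store a table of approximate values is tabular Q-learning. The sketch below is an illustrative instance chosen here, not a method named by the source: the five-state chain environment, the reward of 1.0 at the goal, and all hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions.

```python
import random
from typing import List, Tuple

N_STATES, ACTIONS = 5, (0, 1)  # action 0 moves left, action 1 moves right
GOAL = N_STATES - 1


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy deterministic chain: reward 1.0 only on reaching the goal state."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
          epsilon: float = 0.2, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # the stored table of values
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])
                greedy = [a for a in ACTIONS if q[state][a] == best]
                action = rng.choice(greedy)   # exploit, random tie-break
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            # Temporal-difference update toward the bootstrapped target.
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q = train()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(policy)
```

For large state spaces the table `q` would typically be replaced by a function approximator, with the same update applied to its parameters.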

[0044] Examples of function approximators may include neural networks (such as deep neural networks) and probabilistic graphical models (e.g. Boltzmann machines, Helmholtz machines, and Hopfield networks). A function approximator may create a parameterization of an approximation of the cumulative reward function. Optimization of the function approximator with respect to its parameterization may consist of perturbing the parameters in a direction that enhances or maximizes the cumulative rewards and therefore enhances or optimizes the policy (such as in a policy gradient method), or of perturbing the function approximator to come closer to satisfying Bellman’s optimality criteria (such as in a temporal difference method).

[0045] During training, the agent may take actions in the environment to obtain more information about the environment and about good or best choices of policies for survival or better utility. The actions of the agent may be randomly generated (for instance, especially in early stages of training) or may be prescribed by another machine learning paradigm (such as supervised learning, imitation learning, or any other machine learning procedure described herein). The actions of the agent may be refined by selecting actions closer to the agent’s perception of what an enhanced or optimal policy is. Various training strategies may sit in a spectrum between the two extremes of off-policy and on-policy methods with respect to choices between exploration and exploitation.

[0046] As used herein, the terms “non-classical computation,” “non-classical procedure,” “non-classical operation,” and “non-classical computer” generally refer to any method or system for performing computational procedures outside of the paradigm of classical computing. A non-classical computation, non-classical procedure, non-classical operation, or non-classical computer may comprise a quantum computation, quantum procedure, quantum operation, or quantum computer.

[0047] As used herein, the terms “quantum computation,” “quantum procedure,” “quantum operation,” and “quantum computer” generally refer to any method or system for performing computations using quantum mechanical operations (such as unitary transformations or completely positive trace-preserving (CPTP) maps on quantum channels) on a Hilbert space represented by a quantum device. As such, quantum and classical (or digital) computation may be similar in the following aspect: both computations may comprise sequences of instructions performed on input information to then provide an output. Various paradigms of quantum computation may break the quantum operations down into sequences of basic quantum operations that affect a subset of qubits of the quantum device simultaneously. The quantum operations may be selected based on, for instance, their locality or their ease of physical implementation. A quantum procedure or computation may then consist of a sequence of such instructions that in various applications may represent different quantum evolutions on the quantum device. For example, procedures to simulate quantum chemistry may represent the quantum states and the annihilation and creation operators of electron spin-orbitals by using qubits (such as two-level quantum systems) and a universal quantum gate set (such as the Hadamard, controlled-not (CNOT), and π/8 rotation) through the so-called Jordan-Wigner transformation or Bravyi-Kitaev transformation.
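The idea of a computation as a sequence of basic gate operations can be shown with a small classical simulation of two qubits. This sketch only applies a Hadamard gate followed by a CNOT gate to a state vector; it does not implement the Jordan-Wigner or Bravyi-Kitaev transformations, and the helper names `kron` and `apply` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[complex]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker (tensor) product of two matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def apply(gate: Matrix, state: List[complex]) -> List[complex]:
    """Matrix-vector product: apply a gate to a state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


s = 1 / math.sqrt(2)
H: Matrix = [[s, s], [s, -s]]          # Hadamard gate
I2: Matrix = [[1, 0], [0, 1]]          # single-qubit identity
CNOT: Matrix = [[1, 0, 0, 0],          # control = first qubit
                [0, 1, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]]

state: List[complex] = [1, 0, 0, 0]    # |00>
state = apply(kron(H, I2), state)      # Hadamard on the first qubit
state = apply(CNOT, state)             # entangle the two qubits
probs = [abs(a) ** 2 for a in state]   # measurement probabilities
print(probs)
```

The resulting state places equal weight on |00> and |11>, which is the kind of entangled state that single-qubit rotations alone cannot produce.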

[0048] Additional examples of quantum procedures or computations may include procedures for optimization such as quantum approximate optimization algorithm (QAOA) or quantum minimum finding. QAOA may comprise performing rotations of single qubits and entangling gates of multiple qubits. In quantum adiabatic computation, the instructions may carry stochastic or non-stochastic paths of evolution of an initial quantum system to a final one.

[0049] Quantum-inspired procedures may include simulated annealing, parallel tempering, master equation solver, Monte Carlo procedures and the like. Quantum-classical or hybrid algorithms or procedures may comprise such procedures as variational quantum eigensolver (VQE) and the variational and adiabatically navigated quantum eigensolver (VanQver). Such hybrid algorithms may be especially suitable for near-term noisy quantum devices where there may be a restriction (and in some cases, a severe restriction) in the available quantum computational power due to short coherence times and/or a limitation in the number of available qubits.
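Simulated annealing, listed above among the quantum-inspired procedures, can be sketched in a few lines. The one-dimensional double-well `toy_energy`, the geometric cooling schedule, and every parameter value below are assumptions made for illustration and are not taken from the source.

```python
import math
import random
from typing import Callable, Tuple


def simulated_annealing(
    energy: Callable[[float], float],
    x0: float,
    steps: int = 2000,
    t_start: float = 1.0,
    t_end: float = 1e-3,
    step_size: float = 0.5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Minimize `energy` over one real variable; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(1, steps - 1))
        cand = x + rng.uniform(-step_size, step_size)
        e_cand = energy(cand)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x: float) -> float:
    # Hypothetical double-well landscape with two minima of different depth.
    return (x * x - 1.0) ** 2 + 0.3 * x


best_x, best_e = simulated_annealing(toy_energy, x0=2.0)
print(best_x, best_e)
```

The occasional uphill acceptance at high temperature is what lets the search leave a shallow well, which a purely greedy descent could not do.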

[0050] A quantum computer may comprise one or more adiabatic quantum computers, quantum gate arrays, one-way quantum computers, topological quantum computers, quantum Turing machines, superconductor-based quantum computers, trapped ion quantum computers, trapped atom quantum computers, optical lattices, quantum dot computers, spin-based quantum computers, spatial-based quantum computers, Loss-DiVincenzo quantum computers, nuclear magnetic resonance (NMR) based quantum computers, solution-state NMR quantum computers, solid-state NMR quantum computers, solid-state NMR Kane quantum computers, electrons-on-helium quantum computers, cavity-quantum-electrodynamics based quantum computers, molecular magnet quantum computers, fullerene-based quantum computers, linear optical quantum computers, diamond-based quantum computers, nitrogen vacancy (NV) diamond-based quantum computers, Bose-Einstein condensate-based quantum computers, transistor-based quantum computers, and rare-earth-metal-ion-doped inorganic crystal based quantum computers. A quantum computer may comprise one or more of: quantum annealers, Ising solvers, optical parametric oscillators (OPO), and gate models of quantum computing.

[0051] As used herein, the terms “alkyl” and “alkyl group” generally refer to substituted or unsubstituted saturated hydrocarbon groups, including straight-chain alkyl and branched-chain alkyl groups. An alkyl group may contain from one to twelve carbon atoms (e.g., C1-12 alkyl), such as one to eight carbon atoms (C1-8 alkyl) or one to six carbon atoms (C1-6 alkyl). Exemplary alkyl groups include methyl, ethyl, n-propyl, isopropyl, n-butyl, isobutyl, sec-butyl, tert-butyl, pentyl, isopentyl, neopentyl, hexyl, septyl, octyl, nonyl, and decyl. An alkyl group may be attached to the rest of the molecule by a single bond. Unless stated otherwise specifically in the specification, an alkyl group is optionally substituted by one or more substituents such as those substituents described herein.

[0052] As used herein, the terms “haloalkyl” and “haloalkyl group” generally refer to an alkyl group that is substituted by one or more halogens. Exemplary haloalkyl groups include trifluoromethyl, difluoromethyl, trichloromethyl, 2,2,2-trifluoroethyl, 1,2-difluoroethyl, 3-bromo-2-fluoropropyl, and 1,2-dibromoethyl.

[0053] As used herein, “alkenyl” or “alkenyl group” generally refers to substituted or unsubstituted hydrocarbon groups, including straight-chain or branched-chain alkenyl groups containing at least one double bond. An alkenyl group may contain from two to twelve carbon atoms (e.g., C2-12 alkenyl). Exemplary alkenyl groups include ethenyl (i.e., vinyl), prop-1-enyl, but-1-enyl, pent-1-enyl, penta-1,4-dienyl, and the like. Unless stated otherwise specifically in the specification, an alkenyl group is optionally substituted by one or more substituents such as those substituents described herein.

[0054] As used herein, the terms “alkynyl” and “alkynyl group” generally refer to substituted or unsubstituted hydrocarbon groups, including straight-chain or branched-chain alkynyl groups containing at least one triple bond. An alkynyl group may contain from two to twelve carbon atoms (e.g., C2-12 alkynyl). Exemplary alkynyl groups include ethynyl, propynyl, butynyl, pentynyl, hexynyl, and the like. Unless stated otherwise specifically in the specification, an alkynyl group is optionally substituted by one or more substituents such as those substituents described herein.

[0055] As used herein, the terms “alkylene,” “alkylene chain,” and “alkylene group” generally refer to substituted or unsubstituted divalent saturated hydrocarbon groups, including straight-chain alkylene and branched-chain alkylene groups that contain from one to twelve carbon atoms. Exemplary alkylene groups include methylene, ethylene, propylene, and n-butylene. Similarly, “alkenylene” and “alkynylene” refer to alkylene groups, as defined above, which comprise one or more carbon-carbon double or triple bonds, respectively. The points of attachment of the alkylene, alkenylene or alkynylene chain to the rest of the molecule may be through one carbon or any two carbons within the chain. Unless stated otherwise specifically in the specification, an alkylene, alkenylene, or alkynylene group is optionally substituted by one or more substituents such as those substituents described herein.

[0056] As used herein, the terms “heteroalkyl,” “heteroalkenyl,” “heteroalkynyl,” “heteroalkyl group,” “heteroalkenyl group,” and “heteroalkynyl group” generally refer to substituted or unsubstituted alkyl, alkenyl and alkynyl groups which respectively have one or more skeletal chain atoms selected from an atom other than carbon. Exemplary skeletal chain atoms selected from an atom other than carbon include, e.g., oxygen (O), nitrogen (N), phosphorus (P), silicon (Si), sulfur (S), or combinations thereof, wherein the nitrogen, phosphorus, and sulfur atoms may optionally be oxidized and the nitrogen heteroatom may optionally be quaternized. If given, a numerical range refers to the chain length in total. For example, a 3- to 8-membered heteroalkyl has a chain length of 3 to 8 atoms. Connection to the rest of the molecule may be through either a heteroatom or a carbon in the heteroalkyl, heteroalkenyl or heteroalkynyl chain. Unless stated otherwise specifically in the specification, a heteroalkyl, heteroalkenyl, or heteroalkynyl group is optionally substituted by one or more substituents such as those substituents described herein.

[0057] As used herein, the terms “carbocycle” and “carbocycle group” generally refer to a saturated, unsaturated or aromatic ring in which each atom of the ring is a carbon atom. Carbocycles may include 3- to 10-membered monocyclic rings, 6- to 12-membered bicyclic rings, and 6- to 12-membered bridged rings. Each ring of a bicyclic carbocycle may be selected from saturated, unsaturated, and aromatic rings. In some embodiments, the carbocycle is an aryl. In some embodiments, the carbocycle is a cycloalkyl. In some embodiments, the carbocycle is a cycloalkenyl. In an exemplary embodiment, an aromatic ring, e.g., phenyl, may be fused to a saturated or unsaturated ring, e.g., cyclohexane, cyclopentane, or cyclohexene. Any combination of saturated, unsaturated and aromatic bicyclic rings, as valence permits, are included in the definition of carbocyclic. Exemplary carbocycles include cyclopentyl, cyclohexyl, cyclohexenyl, adamantyl, phenyl, indanyl, and naphthyl. Unless stated otherwise specifically in the specification, a carbocycle is optionally substituted by one or more substituents such as those substituents described herein.

[0058] As used herein, the terms “heterocycle” and “heterocycle group” generally refer to a saturated, unsaturated or aromatic ring comprising one or more heteroatoms. Exemplary heteroatoms include N, O, Si, P, B, and S atoms. Heterocycles include 3- to 10-membered monocyclic rings, 6- to 12-membered bicyclic rings, and 6- to 12-membered bridged rings. Each ring of a bicyclic heterocycle may be selected from saturated, unsaturated, and aromatic rings. The heterocycle may be attached to the rest of the molecule through any atom of the heterocycle, valence permitting, such as a carbon or nitrogen atom of the heterocycle. In some embodiments, the heterocycle is a heteroaryl. In some embodiments, the heterocycle is a heterocycloalkyl. In an exemplary embodiment, a heterocycle, e.g., pyridyl, may be fused to a saturated or unsaturated ring, e.g., cyclohexane, cyclopentane, or cyclohexene. Exemplary heterocycles include pyrrolidinyl, pyrrolyl, imidazolyl, pyrazolyl, triazolyl, piperidinyl, pyridinyl, pyrimidinyl, pyridazinyl, pyrazinyl, thiophenyl, oxazolyl, thiazolyl, morpholinyl, indazolyl, indolyl, and quinolinyl. Unless stated otherwise specifically in the specification, a heterocycle is optionally substituted by one or more substituents such as those substituents described herein.

[0059] As used herein, “heteroaryl” and “heteroaryl group” generally refer to a 3- to 12-membered aromatic ring that comprises at least one heteroatom wherein each heteroatom may be independently selected from N, O, and S. As used herein, the heteroaryl ring may be selected from monocyclic or bicyclic and fused or bridged ring systems wherein at least one of the rings in the ring system is aromatic, i.e., it contains a cyclic, delocalized (4n+2) π-electron system in accordance with the Hückel theory. The heteroatom(s) in the heteroaryl may be optionally oxidized. One or more nitrogen atoms, if present, are optionally quaternized. The heteroaryl may be attached to the rest of the molecule through any atom of the heteroaryl, valence permitting, such as a carbon or nitrogen atom of the heteroaryl. Examples of heteroaryls include, but are not limited to, azepinyl, acridinyl, benzimidazolyl, benzindolyl, 1,3-benzodioxolyl, benzofuranyl, benzooxazolyl, benzo[d]thiazolyl, benzothiadiazolyl, benzo[b][1,4]dioxepinyl, benzo[b][1,4]oxazinyl, 1,4-benzodioxanyl, benzonaphthofuranyl, benzoxazolyl, benzodioxolyl, benzodioxinyl, benzopyranyl, benzopyranonyl, benzofuranyl, benzofuranonyl, benzothienyl (benzothiophenyl), benzothieno[3,2-d]pyrimidinyl, benzotriazolyl, benzo[4,6]imidazo[1,2-a]pyridinyl, carbazolyl, cinnolinyl, cyclopenta[d]pyrimidinyl, 6,7-dihydro-5H-cyclopenta[4,5]thieno[2,3-d]pyrimidinyl, 5,6-dihydrobenzo[h]quinazolinyl, 5,6-dihydrobenzo[h]cinnolinyl, 6,7-dihydro-5H-benzo[6,7]cyclohepta[1,2-c]pyridazinyl, dibenzofuranyl, dibenzothiophenyl, furanyl, furanonyl, furo[3,2-c]pyridinyl, 5,6,7,8,9,10-hexahydrocycloocta[d]pyrimidinyl, 5,6,7,8,9,10-hexahydrocycloocta[d]pyridazinyl, 5,6,7,8,9,10-hexahydrocycloocta[d]pyridinyl, isothiazolyl, imidazolyl, indazolyl, indolyl, indazolyl, isoindolyl, indolinyl, isoindolinyl, isoquinolyl, indolizinyl, isoxazolyl, 5,8-methano-5,6,7,8-tetrahydroquinazolinyl, naphthyridinyl, 1,6-naphthyridinonyl, oxadiazolyl, 2-oxoazepinyl, oxazolyl, oxiranyl, 5,6,6a,7,8,9,10,10a-octahydrobenzo[h]quinazolinyl, 1-phenyl-1H-pyrrolyl, phenazinyl, phenothiazinyl, phenoxazinyl, phthalazinyl, pteridinyl, purinyl, pyrrolyl, pyrazolyl, pyrazolo[3,4-d]pyrimidinyl, pyridinyl, pyrido[3,2-d]pyrimidinyl, pyrido[3,4-d]pyrimidinyl, pyrazinyl, pyrimidinyl, pyridazinyl, pyrrolyl, quinazolinyl, quinoxalinyl, quinolinyl, isoquinolinyl, tetrahydroquinolinyl, 5,6,7,8-tetrahydroquinazolinyl, 5,6,7,8-tetrahydrobenzo[4,5]thieno[2,3-d]pyrimidinyl, 6,7,8,9-tetrahydro-5H-cyclohepta[4,5]thieno[2,3-d]pyrimidinyl, 5,6,7,8-tetrahydropyrido[4,5-c]pyridazinyl, thiazolyl, thiadiazolyl, triazolyl, tetrazolyl, triazinyl, thieno[2,3-d]pyrimidinyl, thieno[3,2-d]pyrimidinyl, thieno[2,3-c]pyridinyl, and thiophenyl (i.e., thienyl). Unless stated otherwise specifically in the specification, a heteroaryl is optionally substituted by one or more substituents such as those substituents described herein.
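By way of a non-limiting illustration, the (4n+2) electron-count criterion referenced in the definition of heteroaryl may be checked with a short routine such as the following. The function name and the example table are hypothetical and are not part of the disclosure; the π-electron counts listed are the standard textbook values for the named rings.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Illustrative pi-electron counts (assumed inputs, supplied by the caller).
PI_ELECTRON_COUNTS = {
    "pyridine": 6,        # six-membered ring, one electron per ring atom
    "thiophene": 6,       # four from the C=C bonds, two from a sulfur lone pair
    "indole": 10,         # fused bicyclic system
    "cyclobutadiene": 4,  # 4n count, does not meet the criterion
}

huckel_results = {
    name: satisfies_huckel(count) for name, count in PI_ELECTRON_COUNTS.items()
}
```

The electron count alone is a necessary condition only; planarity and cyclic conjugation of the ring may also need to be established by other means.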

Methods and systems for molecular design or configuration

[0060] In an aspect, the present disclosure provides a method for identifying a small molecule configured to modulate a macromolecule. The method may comprise obtaining (i) a representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule. Next, a generative modeling artificial intelligence (AI) procedure may be performed to generate a plurality of structures of a plurality of small molecules. The plurality of small molecules may comprise the small molecule having a structure of the plurality of structures. An energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule may then be determined, to identify the small molecule having the structure. A report corresponding to the structure of the small molecule may be electronically outputted, such as on a user interface. The user interface may be a graphical user interface. Such method may be performed using one or more computer processors.

[0061] FIG. 2 shows a flowchart for an example of a method 200 for identifying a small molecule configured to modulate a macromolecule.

[0062] The small molecule may comprise a pharmaceutical or drug molecule. The small molecule may comprise a molecular weight of at least 10 Daltons (Da), 20 Da, 30 Da, 40 Da, 50 Da, 60 Da, 70 Da, 80 Da, 90 Da, 100 Da, 200 Da, 300 Da, 400 Da, 500 Da, 600 Da, 700 Da, 800 Da, 900 Da, 1,000 Da, or more. The small molecule may comprise a molecular weight of at most 1,000 Da, 900 Da, 800 Da, 700 Da, 600 Da, 500 Da, 400 Da, 300 Da, 200 Da, 100 Da, 90 Da, 80 Da, 70 Da, 60 Da, 50 Da, 40 Da, 30 Da, 20 Da, 10 Da, or less. The small molecule may comprise a molecular weight that is within a range defined by any two of the preceding values.

[0063] The macromolecule may comprise a molecular weight of at least 100 Da, 200 Da, 300 Da, 400 Da, 500 Da, 600 Da, 700 Da, 800 Da, 900 Da, 1 kiloDalton (kDa), 2 kDa, 3 kDa, 4 kDa, 5 kDa, 6 kDa, 7 kDa, 8 kDa, 9 kDa, 10 kDa, 20 kDa, 30 kDa, 40 kDa, 50 kDa, 60 kDa, 70 kDa, 80 kDa, 90 kDa, 100 kDa, 200 kDa, 300 kDa, 400 kDa, 500 kDa, 600 kDa, 700 kDa, 800 kDa, 900 kDa, 1,000 kDa, or more. The macromolecule may comprise a molecular weight of at most 1,000 kDa, 900 kDa, 800 kDa, 700 kDa, 600 kDa, 500 kDa, 400 kDa, 300 kDa, 200 kDa, 100 kDa, 90 kDa, 80 kDa, 70 kDa, 60 kDa, 50 kDa, 40 kDa, 30 kDa, 20 kDa, 10 kDa, 9 kDa, 8 kDa, 7 kDa, 6 kDa, 5 kDa, 4 kDa, 3 kDa, 2 kDa, 1 kDa, 900 Da, 800 Da, 700 Da, 600 Da, 500 Da, 400 Da, 300 Da, 200 Da, 100 Da, or less. The macromolecule may comprise a molecular weight that is within a range defined by any two of the preceding values.

[0064] The macromolecule may comprise one or more peptides, polypeptides, or proteins. The macromolecule may comprise one or more nucleic acids or nucleic acid complexes. The macromolecule may comprise deoxyribonucleic acid (DNA). The macromolecule may comprise ribonucleic acid (RNA). The macromolecule may comprise one or more interaction sites. For instance, a macromolecule comprising one or more peptides, polypeptides, or proteins may comprise one or more protein binding sites or protein binding pockets.

[0065] The small molecule may be configured to modulate the macromolecule. For instance, the small molecule may be configured to alter the function of a macromolecule comprising one or more peptides, polypeptides, or proteins, such as by inhibiting the macromolecule. The small molecule may be configured to alter the expression of a macromolecule comprising one or more nucleic acids or nucleic acid complexes, such as by upregulating or downregulating expression of the macromolecule.

[0066] With reference to FIG. 2, in an operation 210, the method 200 may comprise obtaining (i) a representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule. The representation of the macromolecule may comprise coordinates of one or more atoms of the macromolecule. For instance, for a macromolecule comprising one or more peptides, polypeptides, or proteins, the representation of the macromolecule may comprise Protein Data Bank (PDB) information comprising coordinates of one or more atoms of the one or more peptides, polypeptides, or proteins. The representation of the interaction site may comprise coordinates of one or more atoms of the interaction site.
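By way of a non-limiting illustration, the two representations obtained in operation 210 may be sketched as follows: atom coordinates are read from the fixed-width ATOM/HETATM records of PDB-format text, and an interaction site is represented as the subset of atoms belonging to chosen residues. The names `Atom`, `format_atom`, `parse_pdb_atoms`, and `select_site` are hypothetical, the sample coordinates are made-up values, and selecting the site by chain and residue number is an assumption rather than a requirement of the method.

```python
from typing import List, NamedTuple, Set


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def format_atom(serial, name, residue, chain, resnum, x, y, z) -> str:
    """Build a simplified fixed-width ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:>5} {name:<4} {residue:>3} {chain}{resnum:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_pdb_atoms(pdb_text: str) -> List[Atom]:
    """Extract atom records using the PDB fixed-column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append(Atom(
                name=line[12:16].strip(),        # columns 13-16
                residue=line[17:20].strip(),     # columns 18-20
                chain=line[21],                  # column 22
                residue_number=int(line[22:26]),  # columns 23-26
                x=float(line[30:38]),            # columns 31-38
                y=float(line[38:46]),            # columns 39-46
                z=float(line[46:54]),            # columns 47-54
            ))
    return atoms


def select_site(atoms: List[Atom], chain: str, residues: Set[int]) -> List[Atom]:
    """Represent an interaction site as the atoms of the listed residues."""
    return [a for a in atoms if a.chain == chain and a.residue_number in residues]


# Made-up records for illustration only.
sample_pdb = "\n".join([
    format_atom(1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504),
    format_atom(2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147),
    format_atom(3, "N", "GLY", "A", 2, 13.550, 5.001, -4.310),
])
macromolecule = parse_pdb_atoms(sample_pdb)
interaction_site = select_site(macromolecule, chain="A", residues={2})
```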

[0067] The operation 210 may be performed using one or more computer processors described herein. The one or more computer processors may be individually or collectively configured to perform the operation 210.

[0068] In an operation 220, the method 200 may comprise performing a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules. The plurality of small molecules may comprise the small molecule configured to modulate the macromolecule. The generative modeling AI procedure may comprise one or more ML algorithms, such as any ML algorithm described herein. The generative modeling AI procedure may comprise one or more ML training algorithms. The generative modeling AI procedure may comprise one or more ML inference algorithms. The generative modeling AI procedure may comprise one or more RL procedures, such as any RL procedures described herein.

[0069] The generative modeling AI procedure may generate at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000, 4,000,000, 5,000,000, 6,000,000, 7,000,000, 8,000,000, 9,000,000, 10,000,000, 20,000,000, 30,000,000, 40,000,000, 50,000,000, 60,000,000, 70,000,000, 80,000,000, 90,000,000, 100,000,000, 200,000,000, 300,000,000, 400,000,000, 500,000,000, 600,000,000, 700,000,000, 800,000,000, 900,000,000, 1,000,000,000, or more structures of small molecules. The generative modeling AI procedure may generate at most 1,000,000,000, 900,000,000, 800,000,000, 700,000,000, 600,000,000, 500,000,000, 400,000,000, 300,000,000, 200,000,000, 100,000,000, 90,000,000, 80,000,000, 70,000,000, 60,000,000, 50,000,000, 40,000,000, 30,000,000, 20,000,000, 10,000,000, 9,000,000, 8,000,000, 7,000,000, 6,000,000, 5,000,000, 4,000,000, 3,000,000, 2,000,000, 1,000,000, 900,000, 800,000, 700,000, 600,000, 500,000, 400,000, 300,000, 200,000, 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 structure(s) of small molecules.

[0070] The method 200 may comprise, prior to operation 220, training the generative modeling AI procedure to generate the plurality of structures. Training the generative modeling AI procedure may comprise using a training library. The training library may comprise a plurality of structures of small molecules. The training library may comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000, 4,000,000, 5,000,000, 6,000,000, 7,000,000, 8,000,000, 9,000,000, 10,000,000, 20,000,000, 30,000,000, 40,000,000, 50,000,000, 60,000,000, 70,000,000, 80,000,000, 90,000,000, 100,000,000, 200,000,000, 300,000,000, 400,000,000, 500,000,000, 600,000,000, 700,000,000, 800,000,000, 900,000,000, 1,000,000,000, or more structures of small molecules. The training library may comprise at most 1,000,000,000, 900,000,000, 800,000,000, 700,000,000, 600,000,000, 500,000,000, 400,000,000, 300,000,000, 200,000,000, 100,000,000, 90,000,000, 80,000,000, 70,000,000, 60,000,000, 50,000,000, 40,000,000, 30,000,000, 20,000,000, 10,000,000, 9,000,000, 8,000,000, 7,000,000, 6,000,000, 5,000,000, 4,000,000, 3,000,000, 2,000,000, 1,000,000, 900,000, 800,000, 700,000, 600,000, 500,000, 400,000, 300,000, 200,000, 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 structure(s) of small molecules.

[0071] The training library may comprise a plurality of representations of chemical structures. The representations of chemical structures may comprise one or more simplified molecular-input line-entry system (SMILES) representations, Wiswesser line notations, ROSDAL representations, SYBYL Line Notations (SLN), structural drawings, common names, trivial names, International Union of Pure and Applied Chemistry (IUPAC) names, three-dimensional (3D) structures, Chemical Abstracts Service (CAS) numbers, International Chemical Identifier (InChI) identifiers, molecule graphs, or molecular fingerprints. The plurality of representations may be independent of the plurality of structures generated by the generative modeling AI procedure. For instance, the plurality of structures generated by the generative modeling AI procedure may be different from the plurality of representations in the library. The plurality of representations may be independent of the plurality of representations generated by the generative modeling AI procedure. For instance, the plurality of representations generated by the generative modeling AI procedure may be different from the plurality of representations in the library.
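By way of a non-limiting illustration, the case in which generated structures differ from the representations in the training library may be checked as sketched below. The function name `novel_structures` and the sample SMILES strings are hypothetical. The sketch compares strings exactly, which is a simplifying assumption: a single molecule may be written as several different SMILES strings, so a practical check would likely canonicalize both collections with a cheminformatics toolkit before comparing.

```python
from typing import Iterable, List


def novel_structures(generated: Iterable[str], training_library: Iterable[str]) -> List[str]:
    """Return generated strings absent from the library, de-duplicated, in first-seen order."""
    library = set(training_library)
    seen = set()
    novel = []
    for structure in generated:
        if structure not in library and structure not in seen:
            seen.add(structure)
            novel.append(structure)
    return novel


# Illustrative inputs only (ethanol, benzene, and acetic acid in the library).
training_library = ["CCO", "c1ccccc1", "CC(=O)O"]
generated = ["CCO", "CCN", "c1ccncc1", "CCN"]
new_only = novel_structures(generated, training_library)
```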

[0072] The operation 220 may comprise one or more operations of method 300 described herein with respect to FIG. 3.

[0073] The operation 220 may be performed using one or more computer processors described herein, the one or more computer processors individually or collectively configured to perform the operation 220.

[0074] In an operation 230, the method 200 may comprise determining an energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule to identify the small molecule configured to modulate the macromolecule. The energy of interaction may comprise a free energy of interaction. The energy of interaction may be determined through a free energy calculation procedure, such as a free energy perturbation (FEP) procedure. The FEP procedure may be performed using a software platform such as FEP+, AMBER, BOSS, CHARMM, Desmond, GROMACS, MacroModel, MOLARIS, NAMD, Tinker, or Q. The energy of interaction may be determined through a molecular docking simulation.

[0075] The operation 230 may comprise processing the energy of interaction of each of the plurality of small molecules against a threshold value to identify the small molecule configured to modulate the macromolecule. For instance, the energy of interaction of each of the plurality of small molecules may be compared to the threshold value. Each of the plurality of small molecules may then be identified as being a candidate for modulating the macromolecule if the associated energy of interaction falls below the threshold value or may be identified as not being a candidate for modulating the macromolecule if the associated energy of interaction falls above the threshold value.

[0076] The operation 230 may comprise identifying the structure of the small molecule configured to modulate the macromolecule as having a minimum energy of interaction among structures of other small molecules of the plurality of small molecules. In such a case, the energy of interaction between the small molecule and the macromolecule may have a minimal value.
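By way of a non-limiting illustration, the two selection rules described for operation 230 (comparison against a threshold value, and selection of the minimum energy of interaction) may be sketched as follows. The function names, the energy values, and the kcal/mol unit are hypothetical; in practice the energies would come from a free energy calculation or docking procedure rather than a hard-coded table.

```python
from typing import Dict, List, Tuple


def candidates_below_threshold(energies: Dict[str, float], threshold: float) -> List[str]:
    """Keep structures whose energy of interaction falls below the threshold."""
    return [structure for structure, energy in energies.items() if energy < threshold]


def minimum_energy_structure(energies: Dict[str, float]) -> Tuple[str, float]:
    """Return the structure with the lowest energy of interaction."""
    best = min(energies, key=energies.get)
    return best, energies[best]


# Made-up energies of interaction (assumed to be in kcal/mol).
interaction_energies = {
    "c1ccc(O)cc1": -6.2,
    "CCN": -2.1,
    "c1ccc(C#N)cc1": -7.8,
}
candidates = candidates_below_threshold(interaction_energies, threshold=-5.0)
best_structure, best_energy = minimum_energy_structure(interaction_energies)
```

The threshold rule may return several candidates, whereas the minimum rule returns exactly one; the two may be combined by applying the minimum rule to the thresholded subset.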

[0077] The operation 230 may further comprise optimizing the structure of the small molecule configured to modulate the macromolecule.

[0078] The operation 230 may be performed using one or more computer processors described herein, the one or more computer processors individually or collectively configured to perform the operation 230. Alternatively or in combination, the operation 230 may comprise using one or more non-classical computers described herein (such as one or more quantum computers described herein) to individually or collectively determine the energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule.

[0079] In an operation 240, the method 200 may comprise outputting a report corresponding to the structure of the small molecule configured to modulate the macromolecule. The report may include the structure of the small molecule configured to modulate the macromolecule. The report may include an energy of interaction of the small molecule with the interaction site.

[0080] The operation 240 may be performed using one or more computer processors described herein, the one or more computer processors individually or collectively configured to perform the operation 240. For instance, the operation 240 may comprise electronically outputting the report.

[0081] The operation 230 or 240 may further comprise performing an alteration AI procedure to alter at least a subset of the plurality of structures based at least in part on one or more chemical properties of the subset. The chemical properties may comprise one or more members selected from the group consisting of: absorption of the plurality of small molecules, distribution of the plurality of small molecules, metabolism of the plurality of small molecules, excretion of the plurality of small molecules, toxicity of the plurality of small molecules, and ease of synthesis of the plurality of small molecules.

[0082] The method 200 may further comprise repeating any one, two, three, or all of operations 210, 220, 230, and 240 for each of a plurality of interaction sites of the macromolecule.

[0083] FIG. 3 shows a flowchart for an example of a method 300 for generating a plurality of small molecules using seed fragments and substituent fragments.

[0084] In an operation 310, the method 300 may comprise, for each small molecule of the plurality of small molecules: performing (i) a first artificial intelligence (AI) procedure to identify one or more seed fragments of the small molecule, and (ii) a second AI procedure to identify one or more substituent fragments of the small molecule. The first AI procedure or second AI procedure may comprise one or more ML algorithms, such as any ML algorithm described herein. The first AI procedure or second AI procedure may comprise one or more ML training algorithms. The first AI procedure or second AI procedure may comprise one or more ML inference algorithms. The first AI procedure or second AI procedure may comprise one or more RL procedures, such as any RL procedures described herein.

[0085] The operation 310 may comprise performing the first AI procedure to select the one or more seed fragments of the small molecule based at least in part on an energy of interaction of the one or more seed fragments with the interaction site. The energy of interaction may comprise a free energy of interaction. The energy of interaction may be determined through a free energy calculation procedure, such as a free energy perturbation (FEP) procedure. The FEP procedure may be performed using any software platform described herein. The operation 310 may comprise performing the first AI procedure to select the one or more seed fragments for the small molecule based at least in part on one or more chemical properties of the one or more seed fragments. The one or more chemical properties may comprise one or more members selected from the group consisting of: absorption of the one or more seed fragments, distribution of the one or more seed fragments, metabolism of the one or more seed fragments, excretion of the one or more seed fragments, toxicity of the one or more seed fragments, and ease of synthesis of the one or more seed fragments.

[0086] The operation 310 may be performed using one or more computer processors described herein, the one or more computer processors individually or collectively configured to perform the operation 310.

[0087] In an operation 320, the method 300 may comprise, for each small molecule of the plurality of small molecules: using the one or more seed fragments and the one or more substituent fragments to generate the structure of the molecule, by linking the one or more seed fragments to the one or more substituent fragments.

[0088] The operation 320 may comprise linking the one or more substituent fragments of the small molecule to the one or more seed fragments to reduce the energy of interaction of the small molecule with the interaction site. The operation 320 may comprise linking the one or more substituent fragments to the one or more seed fragments based at least in part on one or more chemical properties of the one or more substituent fragments. The one or more chemical properties may comprise one or more members selected from the group consisting of: absorption of the one or more substituent fragments, distribution of the one or more substituent fragments, metabolism of the one or more substituent fragments, excretion of the one or more substituent fragments, toxicity of the one or more substituent fragments, and ease of synthesis of the one or more substituent fragments.
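By way of a non-limiting illustration, the linking of seed fragments to substituent fragments in operation 320 may be sketched at the level of SMILES strings as follows. The use of a single `*` character as an attachment point, the function names, and the fragment lists are assumptions made for this sketch; the disclosure does not prescribe a particular encoding, and the scoring function passed to `best_link` is a stand-in for an energy of interaction or chemical property calculation.

```python
from itertools import product
from typing import Callable, Iterable, List

ATTACHMENT = "*"  # assumed marker for the position where a substituent is attached


def link(seed: str, substituent: str) -> str:
    """Attach one substituent fragment at the seed's single attachment point."""
    if seed.count(ATTACHMENT) != 1:
        raise ValueError("seed must contain exactly one attachment point")
    return seed.replace(ATTACHMENT, substituent)


def enumerate_structures(seeds: Iterable[str], substituents: Iterable[str]) -> List[str]:
    """Generate every seed/substituent combination."""
    return [link(seed, sub) for seed, sub in product(seeds, substituents)]


def best_link(seed: str, substituents: Iterable[str], score: Callable[[str], float]) -> str:
    """Pick the linked structure with the lowest score (lower is assumed better)."""
    return min((link(seed, sub) for sub in substituents), key=score)


seeds = ["c1ccc(*)cc1", "C1CCN(*)CC1"]   # phenyl and piperidin-1-yl seeds
substituents = ["O", "C(F)(F)F", "C#N"]  # hydroxyl, trifluoromethyl, cyano
structures = enumerate_structures(seeds, substituents)
shortest = best_link("c1ccc(*)cc1", substituents, score=len)  # toy score: string length
```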

[0089] The operation 320 may be performed using one or more computer processors described herein, the one or more computer processors individually or collectively configured to perform the operation 320.

[0090] The one or more seed fragments or one or more substituent fragments may comprise one or more members selected from the group consisting of: atoms, molecules, and molecular fragments. For instance, the one or more seed fragments or one or more substituent fragments may comprise one or more hydrogen (H), carbon (C), nitrogen (N), oxygen (O), fluorine (F), phosphorus (P), silicon (Si), sulfur (S), chlorine (Cl), bromine (Br), or iodine (I) atoms. The one or more seed fragments or one or more substituent fragments may comprise one or more alkyl groups, haloalkyl groups, alkenyl groups, alkynyl groups, alkylene groups, heteroalkyl groups, heteroalkenyl groups, heteroalkynyl groups, carbocycle groups, heterocycle groups, heteroaryl groups, hydroxyl (OH) groups, amine (NH or NH2) groups, nitro (NO2) groups, carbonyl (C=O) groups, carboxyl (OC=O) groups, or cyano (CN) groups.

[0091] The one or more seed fragments or one or more substituent fragments may be located a distance of at least 1 nanometer (nm), 2 nm, 3 nm, 4 nm, 5 nm, 6 nm, 7 nm, 8 nm, 9 nm, 10 nm, or more from the interaction site. The one or more seed fragments or one or more substituent fragments may be located a distance of at most 10 nm, 9 nm, 8 nm, 7 nm, 6 nm, 5 nm, 4 nm, 3 nm, 2 nm, 1 nm, or less from the interaction site.

[0092] FIG. 4 shows an example of a scheme for identifying a small molecule configured to modulate a macromolecule using the methods and systems described herein. As depicted in FIG. 4, the methods and systems described herein (such as methods 200 or 300 described herein) may generate a small molecule. The methods and systems may then calculate an energy of interaction (such as any energy of interaction described herein) of the small molecule with an interaction site of a macromolecule. Based on the energy of interaction, the methods and systems may generate additional small molecules. When one or more of the small molecules generated meets some criteria (such as falling below a threshold energy of interaction with the interaction site of the macromolecule, as described herein), the methods and systems may output candidate small molecules meeting the criteria. The candidate small molecules may then be subjected to further studies, such as synthesis or pharmacological studies. The small molecule may be identified within a time period from a request (such as a user request or a computer-initialized request) to identify the small molecule of at least about 0.001 seconds (s), 0.005 s, 0.01 s, 0.05 s, 0.1 s, 1 s, 5 s, 10 s, 30 s, 1 minute (m), 5 m, 10 m, 15 m, 30 m, 1 hour (h), 2 h, 3 h, 4 h, 5 h, 6 h, 7 h, 8 h, 9 h, 10 h, 12 h, 18 h, 24 h, 48 h, 72 h, 96 h, or more. The small molecule may be identified within a time period from a request of at most about 96 h, 72 h, 48 h, 24 h, 18 h, 12 h, 10 h, 9 h, 8 h, 7 h, 6 h, 5 h, 4 h, 3 h, 2 h, 1 h, 30 m, 15 m, 10 m, 5 m, 1 m, 30 s, 10 s, 5 s, 1 s, 0.1 s, 0.01 s, 0.001 s, or less. The small molecule may be identified within a time period from a request that is within a range defined by any two of the preceding values.
The small molecule may be identified within a number of floating point operations from a request of at least about 1 x 10¹ (1E1), 1E2, 1E3, 1E4, 1E5, 1E6, 1E7, 1E8, 1E9, 1E10, 1E11, 1E12, 1E13, 1E14, 1E15, 1E16, 1E17, 1E18, 1E19, 1E20, or more floating point operations of a computer system. The small molecule may be identified within a number of floating point operations from a request of at most about 1 x 10²⁰ (1E20), 1E19, 1E18, 1E17, 1E16, 1E15, 1E14, 1E13, 1E12, 1E11, 1E10, 1E9, 1E8, 1E7, 1E6, 1E5, 1E4, 1E3, 1E2, 1E1, or fewer floating point operations of a computer system. The small molecule may be identified within a number of floating point operations from a request that is within a range defined by any two of the preceding values.

[0093] In another aspect, the present disclosure provides a system for identifying a small molecule configured to modulate a macromolecule. The system may comprise a database comprising (i) a representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) a representation of the interaction site of the macromolecule. The system may also include one or more computer processors operatively coupled to the database. The one or more computer processors may be individually or collectively programmed to obtain (i) the representation of the macromolecule comprising an interaction site for interacting with the small molecule, and (ii) the representation of the interaction site of the macromolecule. The one or more computer processors may be individually or collectively programmed to perform a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules. The plurality of small molecules may comprise the small molecule having a structure of the plurality of structures. The one or more computer processors may be individually or collectively programmed to determine an energy of interaction of each of the plurality of small molecules with the interaction site of the macromolecule, to identify the small molecule having the structure, and electronically output a report corresponding to the structure of the small molecule.

[0094] FIG. 5 shows an example of a system for identifying a small molecule configured to modulate a macromolecule. The system may be configured to implement one or more of the methods described herein, such as methods 200 or 300 described herein. As depicted in FIG. 5, the system may take a target macromolecule (such as any macromolecule described herein) as input. The target macromolecule may be directed to a simulator, which may obtain information about the macromolecule, such as a representation of the macromolecule or a representation of an interaction site of the macromolecule. The information about the macromolecule may be directed to a small molecule generator. The target macromolecule may also be directed to the small molecule generator. The small molecule generator may generate one or more small molecules, as described herein. The small molecule generator may calculate one or more energies of interaction (such as any energy of interaction described herein) of the small molecules with an interaction site of the macromolecule. When one or more of the small molecules generated meets some criteria (such as falling below a threshold energy of interaction with the interaction site of the macromolecule, as described herein), the small molecule generator may send the small molecules that meet the criteria to the molecular synthesizer. The small molecule generator may also send information about the macromolecule to the molecular synthesizer. The molecular synthesizer may comprise an artificial intelligence (AI) unit configured to implement an artificial intelligence (AI) procedure to evaluate the synthesizability of the small molecules and to modify the structures of the small molecules to increase synthesizability. The candidate small molecules may then be subjected to further studies, such as synthesis or pharmacological studies.
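
By way of non-limiting illustration, the generate, score, and filter loop described with respect to FIG. 4 and FIG. 5 may be sketched as follows. This is a minimal sketch only. The function names `identify_candidates`, `toy_generator`, and `toy_energy`, the batch size, and the threshold value are hypothetical and are not taken from the present disclosure; the toy generator and toy energy function are stand-ins for a generative model and for an energy of interaction calculation (such as FEP or docking).

```python
import random
from typing import Callable, List, Tuple


def identify_candidates(
    generate: Callable[[random.Random], str],
    energy: Callable[[str], float],
    threshold: float,
    batch_size: int = 8,
    max_rounds: int = 100,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Generate batches until at least one molecule falls below the threshold.

    Returns the (molecule, energy) pairs of the first batch that contains a
    hit, sorted from lowest to highest energy, or an empty list if no hit is
    found within max_rounds.
    """
    rng = random.Random(seed)
    for _ in range(max_rounds):
        batch = [generate(rng) for _ in range(batch_size)]
        scored = [(m, energy(m)) for m in batch]
        hits = [(m, e) for m, e in scored if e < threshold]
        if hits:
            return sorted(hits, key=lambda pair: pair[1])
    return []


def toy_generator(rng: random.Random) -> str:
    """Hypothetical stand-in for a generative model: a random atom string."""
    return "".join(rng.choice("CNO") for _ in range(rng.randint(1, 8)))


def toy_energy(molecule: str) -> float:
    """Hypothetical stand-in for an energy of interaction (lower is better)."""
    return -0.5 * molecule.count("C") - 1.0 * molecule.count("N")


if __name__ == "__main__":
    for molecule, e in identify_candidates(toy_generator, toy_energy, threshold=-3.0):
        print(molecule, e)
```

In such a sketch, the candidates returned by the loop would correspond to the small molecules passed on for further evaluation, such as evaluation of synthesizability.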

Computer Systems

[0095] FIG. 1 shows a computer system 101 that is programmed or otherwise configured to operate any method or system described herein (such as any method or system for identifying a small molecule configured to modulate a macromolecule described herein). The computer system 101 can regulate various aspects of the present disclosure. The computer system 101 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

[0096] The computer system 101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 101 also includes memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters. The memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard. The storage unit 115 can be a data storage unit (or data repository) for storing data. The computer system 101 can be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120. The network 130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 130 in some cases is a telecommunication and/or data network. The network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 130, in some cases with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.

[0097] The CPU 105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 110. The instructions can be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.

[0098] The CPU 105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 101 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

[0099] The storage unit 115 can store files, such as drivers, libraries and saved programs. The storage unit 115 can store user data, e.g., user preferences and user programs. The computer system 101 in some cases can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.

[0100] The computer system 101 can communicate with one or more remote computer systems through the network 130. For instance, the computer system 101 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 130.

[0101] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 105. In some cases, the code can be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some situations, the electronic storage unit 115 can be precluded, and machine-executable instructions are stored on memory 110.

[0102] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

[0103] Aspects of the systems and methods provided herein, such as the computer system 101, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[0104] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

[0105] The computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.

[0106] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 105. The algorithm can, for example, identify a small molecule configured to modulate a macromolecule using the systems and methods described herein.

EXAMPLES

Example 1: Generation of SMILES Structures Using Recurrent Neural Networks and Reinforcement Learning

[0107] FIG. 6 shows an example of a scheme for generating simplified molecular-input line-entry system (SMILES) structures using recurrent neural networks and reinforcement learning. As depicted in FIG. 6, SMILES structures from a database (such as SMILES structures obtained from the ZINC database) may be directed to a prior recurrent neural network (RNN). The prior RNN may be trained with a library of SMILES structures using the methods and systems described herein (such as described herein with respect to FIG. 10) to generate novel SMILES structures. For instance, the prior RNN may be trained to generate SMILES structures that are independent of the library of SMILES structures. For a given SMILES structure, the SMILES structure may be used to determine an energy of interaction of a macromolecule with a small molecule associated with the SMILES structure. The energy of interaction may be determined through a free energy calculation procedure, such as a FEP procedure. The energy of interaction may be determined through a molecular docking simulation. A synthetic accessibility score (SAscore) corresponding to the small molecule associated with the SMILES structure may also be determined. Alternatively or in combination, a synthesizability of the small molecule may be determined using an artificial intelligence (AI) procedure to evaluate the synthesizability of the small molecule and to modify the structure of the small molecule to increase synthesizability. The AI procedure may yield an AI-derived synthesizability score. This procedure may be repeated for a plurality of SMILES structures to determine a plurality of energies of interaction, SAscores, or AI-derived synthesizability scores. The energies of interaction, the SAscores, or the AI-derived synthesizability scores may be subjected to a reinforcement learning (RL) procedure, such as an agent RNN, to generate one or more SMILES structures associated with a small molecule.

[0108] The prior RNN may be trained using SMILES sampled from the ZINC dataset in an unsupervised fashion. The prior RNN may comprise an RNN with gated recurrent units (GRUs). The prior RNN may only need to learn the grammar of the SMILES structures in order to generate valid SMILES structures. The generated SMILES may then be evaluated using energy of interaction (such as FEP or docking scores), SAscores, or AI-derived synthesizability scores. The evaluation results may be formulated according to a reward function: Reward =

[0109] The agent RNN may have a same or similar architecture to the prior RNN and may be in charge of generating candidate SMILES structures using the reward function. The agent RNN may leverage the reward function and the prior loss to achieve an optimal performance in terms of generating valid and diverse SMILES structures having favorable energies of interaction with an interaction site of a macromolecule.

[0110] FIG. 7 shows an example of a scatter plot of docking scores and SAscores of small molecules associated with SMILES structures generated using recurrent neural networks and reinforcement learning. As depicted in FIG. 7, there is a clear trend in the plot showing that the docking scores and the SAscores are simultaneously enhanced by the methods and systems described herein. The Pareto Front depicted in FIG. 7 shows the points that are best optimized considering the docking scores and the SAscores simultaneously.
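
By way of non-limiting illustration, a Pareto front over two objectives, such as the docking score and the SAscore of FIG. 7, may be computed as sketched below. The function name `pareto_front` is hypothetical, and the sketch assumes that both objectives have been expressed so that lower values are better; if a higher value of a score is considered more favorable, that score may be negated before the call.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both coordinates are minimized.

    A point p is dominated if some other point q is no worse than p in both
    coordinates and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    # Remove duplicates and order the front by the first objective.
    return sorted(set(front))


if __name__ == "__main__":
    # Hypothetical (docking score, synthesizability score) pairs.
    scores = [(-9.0, 3.0), (-8.0, 2.0), (-7.0, 2.5), (-9.5, 4.0), (-6.0, 1.5)]
    print(pareto_front(scores))
```

This quadratic-time comparison is chosen for clarity; for a large number of generated molecules, a sort-based approach may be preferable.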

[0111] FIG. 8 shows an example of histograms showing the distribution of small molecules before and after reinforcement learning optimization has shifted the generation of molecules toward enhanced docking and synthesizability.

[0112] FIG. 9 shows an example of small molecule structures generated without and with enforcement of high SAscores. As shown in the top panel of FIG. 9, without the enforcement of high SAscores, reinforcement learning procedures may generate small molecules that are difficult to synthesize. For instance, the small molecules may contain 7- or 8-membered rings. The bottom panel of FIG. 9 shows that enforcement of high SAscores may allow the reinforcement learning procedures to generate small molecules that are easier to synthesize.

Example 2: SMILES Structures Training Data Generation

[0113] FIG. 10 shows an example of SMILES training data. The training data were generated by the following procedure: 1) “Kekulize” the SMILES structures to replace aromatic atom types (represented as lower-case letters in a SMILES string) with aliphatic atom types (represented as upper-case letters in a SMILES string) and specified bond types. The “Kekulize” procedure is utilized to avoid limiting the expressiveness that may be associated with aromatic atoms. 2) Find the longest path in the molecular graph to use as a skeleton. Using the longest path may ensure that the entire molecular graph is reachable from the skeleton. Then, reorder the topology to put the rest of the parts in the middle of the SMILES string to express branches and rings. 3) Randomly choose branches to replace with sequences of blank characters (“_” in the SMILES strings) of the same length as the original branch sequences.
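
By way of non-limiting illustration, step 3 of the procedure above (replacing a randomly chosen branch with blank characters of the same length) may be sketched as follows. The function names `find_branches` and `mask_random_branch` are hypothetical. The sketch assumes that a branch is a parenthesized substring of an already Kekulized and reordered SMILES string and that the enclosing parentheses are blanked together with the branch contents; steps 1 and 2 are not shown.

```python
import random
from typing import List, Tuple


def find_branches(smiles: str) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) indices of every parenthesized branch."""
    stack: List[int] = []
    spans: List[Tuple[int, int]] = []
    for i, ch in enumerate(smiles):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            spans.append((stack.pop(), i))
    return spans


def mask_random_branch(smiles: str, rng: random.Random, blank: str = "_") -> str:
    """Replace one randomly chosen branch with blanks of the same length."""
    spans = find_branches(smiles)
    if not spans:
        return smiles  # nothing to mask
    start, end = rng.choice(spans)
    return smiles[:start] + blank * (end - start + 1) + smiles[end + 1:]


if __name__ == "__main__":
    source = "CC(=O)OC1=CC=CC=C1C(=O)O"  # illustrative Kekulized SMILES string
    print(mask_random_branch(source, random.Random(0)))
```

Because the blank sequence has the same length as the branch it replaces, the masked string and the source string remain aligned character by character, which may simplify pairing masked inputs with ground truth outputs during training.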

[0114] FIG. 10 shows an example of the procedure described above. The top panel of FIG. 10 depicts a source SMILES structure. The middle panel of FIG. 10 depicts a ground truth SMILES structure. The bottom panel of FIG. 10 depicts a generated SMILES structure. The procedure commenced with a SMILES structure Cc1cc(/C=C2/C(=N)N3N=C(c4ccccc4F)SC3=NC2=O)c(C)n1=c1ccccc1Cl taken from the ZINC database. The SMILES structure was Kekulized (step 1 above) as CC1=CC(/C=C2/C(=N)N3N=C(C4=CC=CC=C4F)SC3=NC2=O)=C(C)N1C1=CC=CC=C1Cl. The longest path was searched and the SMILES structure was reordered (step 2 above) as C(C=C1F)=CC=C1C(SC1=NC2=O)=NN1C(=N)C2=CC(C=C1C)=C(C)N1C(C(C1)=C1=CC=C1. Finally, branches that make fragments or rings were randomly deleted with the associated ring numbers to produce the SMILES structure C(C=C 1 F)=CC=C 1 C _ =NN_C(=N)C_=CC) _ =CC=C.

Example 3: Generation of SMILES Structures with Enforced Chemical Features

[0115] FIG. 11 shows an example of a SMILES structure generation procedure for enforcing chemical features on a small molecule. In addition to the generative modeling procedure described herein with respect to FIG. 10, the methods and systems described herein may be used to optimize features of small molecules generated by the methods and systems herein. For instance, the methods and systems herein may utilize additional generative modeling to optimize features of small molecules, such as their toxicity, synthesizability, or any other molecular feature described herein. The methods and systems described herein may allow different types of blank SMILES structures to be introduced to fit a local environment. For instance, if a part of an interaction site of a macromolecule is hydrophobic, a hydrophobic blank may be assigned, as depicted in FIG. 11. The SMILES structures were generated using the seq2seq model. The SMILES structure C1=CC=CC=C1CCOC(CCCC[NH3+])C(=O)C(N)CCC([O-])(=O) shown in the upper left panel of FIG. 11 is converted (using the procedure outlined herein with reference to FIG. 10) to a hydrophobic blank SMILES structure C1=CC=CC=C1CCOC(CC _ C _ C[NH3+])C(=O)C(N)CCC([O-])(=O) shown in the bottom panel of FIG. 11. The SMILES structure shown in the bottom panel of FIG. 11 is then converted to a SMILES structure C1=CC=CC=C1CCOC(CC(=CC=CC1)C=1C[NH3+])C(=O)C(N)CCC([O-])(=O) shown in the upper right panel of FIG. 11. The generated SMILES structure thus may have a hydrophobic property enforced by choosing the hydrophobic character of the blank SMILES structure.

Example 4: Three-Dimensional Molecular Structure Generation Using Seed Fragments and Substituent Fragments

[0116] FIG. 12 shows an example of a procedure for generating a three-dimensional (3D) structure of a small molecule by linking seed fragments and substituent fragments. As shown in the left panel of FIG. 12, a seed fragment, such as an aromatic fragment, may form a core of a small molecule. As shown in the middle panel of FIG. 12, a model may then choose a substituent fragment (which may be denoted as FragID) and choose a connection point (which may be denoted as CIdx) at which to connect the substituent fragment to the seed fragment by one or more chemical bonds. As shown in the right panel of FIG. 12, the model may choose an angle (which may be denoted as Θ) about which to rotate the one or more chemical bonds. This procedure may then be repeated for a plurality of substituent fragments.

[0117] In some cases, angle optimization at each step may change the conformation of a small molecule drastically and may thus confuse a model. Thus, a possible reinforcement learning model may be devised using the three operations identified above as an action space ([FragID, CIdx, Θ]). The input to the reinforcement learning model may be a macromolecular interaction site (which may be denoted as M). Such a model is depicted in FIG. 13.
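
By way of non-limiting illustration, the rotation of a substituent fragment about a connecting bond by a chosen angle, as described with respect to FIG. 12, may be sketched using Rodrigues' rotation formula. The function name `rotate_about_bond` and the representation of atoms as coordinate tuples are assumptions made for illustration and are not taken from the present disclosure.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def rotate_about_bond(
    points: Sequence[Vec], a: Vec, b: Vec, theta: float
) -> List[Vec]:
    """Rotate points by theta radians about the axis running from a to b."""
    axis = [b[i] - a[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0.0:
        raise ValueError("bond endpoints must be distinct")
    k = [c / norm for c in axis]  # unit vector along the bond
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated: List[Vec] = []
    for p in points:
        v = [p[i] - a[i] for i in range(3)]  # position relative to atom a
        k_cross_v = (
            k[1] * v[2] - k[2] * v[1],
            k[2] * v[0] - k[0] * v[2],
            k[0] * v[1] - k[1] * v[0],
        )
        k_dot_v = sum(k[i] * v[i] for i in range(3))
        # Rodrigues' formula: v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t))
        r = [
            v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
            for i in range(3)
        ]
        rotated.append((r[0] + a[0], r[1] + a[1], r[2] + a[2]))
    return rotated


if __name__ == "__main__":
    # Rotate one substituent atom a quarter turn about a bond along the z axis.
    print(rotate_about_bond([(1.0, 0.0, 0.0)], (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2))
```

Atoms lying on the bond axis are left unchanged by the rotation, so only the substituent side of the bond needs to be passed in.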

Example 5: Reinforcement Learning Model for Generating Three-Dimensional Molecular Structures

[0118] FIG. 13 shows an example of a reinforcement learning model for generating three-dimensional molecular structures. As shown in FIG. 13, the model may comprise three convolutional neural networks (CNNs): CNN_1, CNN_2, and CNN_3. Each of the three CNNs may produce a separate action. The input to the model may be the current state (S). S may comprise a voxel representation of a small molecule. The actions produced by the three CNNs may be concatenated into a single action tuple. Then a small molecule growth procedure may be performed to grow a new small molecule. A new state (S’) may be calculated based on the newly grown small molecule. Combined with information from the interaction site (M), a loss function may be calculated and fed back to each of the three CNNs through a backpropagation procedure. The three CNNs may share a same or similar loss function.
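
By way of non-limiting illustration, a voxel representation of a small molecule, such as may be used for the state S, may be sketched as a three-dimensional occupancy grid. The present disclosure does not specify the voxel encoding; the single-channel atom count used here, the function name `voxelize`, and the grid parameters are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def voxelize(
    atoms: Sequence[Vec],
    origin: Vec,
    voxel_size: float,
    shape: Tuple[int, int, int],
) -> List[List[List[int]]]:
    """Return grid[x][y][z] holding the number of atoms inside each voxel.

    Atoms that fall outside the grid are ignored.
    """
    nx, ny, nz = shape
    grid = [[[0 for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for x, y, z in atoms:
        i = math.floor((x - origin[0]) / voxel_size)
        j = math.floor((y - origin[1]) / voxel_size)
        k = math.floor((z - origin[2]) / voxel_size)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] += 1
    return grid


if __name__ == "__main__":
    # Hypothetical atom coordinates, in the same length unit as voxel_size.
    coords = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (1.6, 0.4, 0.2)]
    print(voxelize(coords, origin=(0.0, 0.0, 0.0), voxel_size=1.0, shape=(2, 2, 2)))
```

A multi-channel variant, with one grid per atom type, may be built by calling the same function once for each atom type.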

Example 6: Reinforcement Learning Model for Generating Three-Dimensional Molecular Structures On an Atom-by-Atom Basis

[0119] FIG. 14 shows an example of a reinforcement learning model for generating three-dimensional molecular structures on an atom-by-atom basis. In contrast to a fragment-based method, small molecules may be grown one atom at a time. A model for generating a small molecule atom-by-atom may choose an atom, place an atom, and add chemical bonds. This procedure may then be repeated for a plurality of atoms. Connections between different types of atoms may have different rules, which may allow engineering of a model. Such a model may be implemented using a model as depicted in FIG. 13.

[0120] Alternatively, the problem of growing a small molecule atom-by-atom may be reformulated as a segmentation problem. Given a macromolecular interaction site (M), a segmentation network may be employed to decide which atom should be employed at each spatial location within the interaction site. Such a segmentation network is depicted in FIG. 14. The voxel map M may be passed through a series of CNNs followed by a fully connected layer. The information summarized in the fully connected layer may be passed through a series of deconvolution networks. The resulting map M’ may be of a same or similar size to the input map M with each voxel representing a probability that a particular type of atom should be placed at a spatial location corresponding to the voxel.
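
By way of non-limiting illustration, converting the per-voxel outputs of such a segmentation network into atom placements may be sketched as follows. The network itself is not shown. The function names `softmax` and `assign_atoms`, the use of an "empty" class, the atom type list, and the probability cutoff are all assumptions made for illustration and are not taken from the present disclosure.

```python
import math
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_atoms(
    logit_map: Dict[Voxel, Sequence[float]],
    atom_types: Sequence[str],
    min_prob: float = 0.5,
) -> Dict[Voxel, Tuple[str, float]]:
    """Pick the most probable atom type for each voxel.

    Voxels whose most probable class is "empty", or whose best probability is
    below min_prob, are left unoccupied.
    """
    placements: Dict[Voxel, Tuple[str, float]] = {}
    for voxel, logits in logit_map.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if atom_types[best] != "empty" and probs[best] >= min_prob:
            placements[voxel] = (atom_types[best], probs[best])
    return placements


if __name__ == "__main__":
    types = ("empty", "C", "N", "O")
    # Hypothetical network outputs for three voxels of the map M'.
    outputs = {
        (0, 0, 0): [5.0, 0.0, 0.0, 0.0],
        (1, 0, 0): [0.0, 4.0, 0.0, 0.0],
        (0, 1, 0): [0.0, 1.0, 1.0, 0.0],
    }
    print(assign_atoms(outputs, types))
```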

Example 7: Optimization of Small Molecule Chemical Properties

[0121] FIG. 15 shows an example of optimization of a small molecule structure to enhance one or more chemical properties of the small molecule. As depicted in the left panel of FIG. 15, the methods and systems described herein may generate one or more small molecule structures that provide strong binding to an interaction site of a macromolecule through, for instance, charge, hydrophobic, hydrophilic, or hydrogen-bond (H-bond) interactions. Though such molecules may provide strong binding, the structures of such small molecules may be further optimized to enhance other chemical properties, such as absorption, distribution, metabolism, excretion, toxicity, or ease of synthesis. For instance, as shown in the left panel of FIG. 15, a small molecule structure that binds strongly to an interaction site of a macromolecule may comprise a great number of alkyl groups, which may be difficult to synthesize or may have non-optimal pharmacological properties. As shown in FIG. 15, such a small molecule may be optimized using the methods and systems described herein. For instance, certain alkyl groups may be replaced by aromatic or other cyclic groups, which may make the small molecule easier to synthesize or may impart enhanced pharmacological properties to the small molecule.

[0122] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

CLAIMS

WHAT IS CLAIMED IS:

1. A method for identifying a small molecule configured to modulate a macromolecule, comprising:

(a) obtaining (i) a representation of said macromolecule, wherein said macromolecule comprises an interaction site for interacting with said small molecule, and (ii) a representation of said interaction site of said macromolecule;

(b) using at least one computer processor to individually or collectively perform a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises said small molecule having a structure of said plurality of structures;

(c) using at least one computer processor to individually or collectively determine an energy of interaction of each of said plurality of small molecules with said interaction site of said macromolecule, to identify said small molecule having said structure; and

(d) electronically outputting a report corresponding to said structure of said small molecule.

2. The method of claim 1, wherein said generative modeling AI procedure comprises a machine learning (ML) algorithm.

3. The method of claim 2, wherein said ML algorithm comprises at least one ML training algorithm.

4. The method of claim 2, wherein said ML algorithm comprises at least one ML inference algorithm.

5. The method of claim 1, wherein said generative modeling AI procedure comprises at least one reinforcement learning (RL) procedure.

6. The method of claim 1, wherein said generative modeling AI procedure comprises at least one tree search method.

7. The method of claim 1, wherein said generative modeling AI procedure comprises at least one evolutionary algorithm.

8. The method of claim 1, wherein said generative modeling AI procedure comprises at least one genetic algorithm.

9. The method of claim 1, wherein said generative modeling AI procedure comprises at least one simulated annealing algorithm.

10. The method of claim 1, further comprising, prior to (b), training said generative modeling AI procedure to generate said plurality of structures.

11. The method of claim 10, wherein training said generative modeling AI procedure comprises using a training library.

12. The method of claim 11, wherein said training library comprises at least one representation selected from the group consisting of a simplified molecular-input line-entry system (SMILES) structure, a Wiswesser line notation, a ROSDAL representation, a SYBYL Line Notation (SLN), a structural drawing, a common name, a trivial name, an International Union of Pure and Applied Chemistry (IUPAC) name, a Chemical Abstracts Service (CAS) number, an International Chemical Identifier (InChI) identifier, a three-dimensional (3D) molecular structure, a molecule graph, and a molecular fingerprint.

13. The method of claim 12, wherein said at least one representation is independent of said plurality of structures.

14. The method of claim 1, wherein (c) comprises processing said energy of interaction of each of said plurality of small molecules against a threshold value to identify said small molecule.

15. The method of claim 1, further comprising using at least one computer processor to individually or collectively perform an alteration artificial intelligence (AI) procedure to alter at least a subset of said plurality of structures based at least in part on one or more chemical properties of at least a subset of said plurality of small molecules.

16. The method of claim 15, wherein said one or more chemical properties comprise one or more members selected from the group consisting of: physiological absorption of said plurality of small molecules, distribution of said plurality of small molecules, metabolism of said plurality of small molecules, excretion of said plurality of small molecules, toxicity of said plurality of small molecules, and ease of synthesis of said plurality of small molecules.

17. The method of claim 1, wherein said macromolecule is a protein.

18. The method of claim 17, wherein said interaction site is a protein binding site.

19. The method of claim 18, wherein said protein binding site comprises a protein binding pocket.

20. The method of claim 1, wherein (c) comprises using one or more non-classical computers to individually or collectively determine said energy of interaction of each of said plurality of small molecules with said interaction site of said macromolecule, to identify said small molecule having said structure.

21. The method of claim 20, wherein said one or more non-classical computers comprise at least one quantum computer.

22. The method of claim 1, wherein (c) comprises identifying said structure of said small molecule as having a minimum energy of interaction among structures of other small molecules of said plurality of small molecules.

23. The method of claim 1, wherein said report includes said structure.

24. The method of claim 1, wherein said report includes a calculated energy of interaction of said small molecule with said interaction site.

25. The method of claim 1, further comprising, subsequent to (c) and prior to (d), optimizing said structure of said small molecule.

26. The method of claim 1, further comprising repeating (a)-(d) for each of a plurality of interaction sites of said macromolecule.

27. The method of claim 1, wherein said report is generated within at most about 12 hours from a request to generate said report.

28. The method of claim 1, wherein (b) comprises, for each small molecule of said plurality of small molecules:

(1) using at least one computer to individually or collectively perform (i) a first artificial intelligence (AI) procedure to identify one or more seed fragments of said small molecule, and (ii) a second AI procedure to identify one or more substituent fragments of said small molecule; and

(2) using said one or more seed fragments and said one or more substituent fragments to generate said structure of said small molecule, which structure comprises at least one of said one or more seed fragments linked to at least one of said one or more substituent fragments.

29. The method of claim 28, wherein said first AI procedure or said second AI procedure comprises at least one machine learning (ML) algorithm.

30. The method of claim 29, wherein said ML algorithm comprises at least one ML training algorithm.

31. The method of claim 29, wherein said ML algorithm comprises at least one ML inference algorithm.

32. The method of claim 28, wherein said first AI procedure or said second AI procedure comprises at least one reinforcement learning (RL) algorithm.

33. The method of claim 28, wherein (1) comprises performing said first AI procedure to select said one or more seed fragments of said small molecule based at least in part on an energy of interaction of said one or more seed fragments with said interaction site.

34. The method of claim 28, wherein (2) comprises linking said one or more substituent fragments to said one or more seed fragments to reduce said energy of interaction of said small molecule with said interaction site.

35. The method of claim 28, wherein (1) comprises performing said first AI procedure to select said one or more seed fragments for said small molecule based at least in part on one or more chemical properties of said one or more seed fragments.

36. The method of claim 35, wherein said one or more chemical properties of said one or more seed fragments comprise one or more members selected from the group consisting of: physiological absorption of said one or more seed fragments, distribution of said one or more seed fragments, metabolism of said one or more seed fragments, excretion of said one or more seed fragments, toxicity of said one or more seed fragments, and ease of synthesis of said one or more seed fragments.

37. The method of claim 28, wherein (2) comprises linking said one or more substituent fragments to said one or more seed fragments based at least in part on one or more chemical properties of said one or more substituent fragments.

38. The method of claim 37, wherein said one or more chemical properties of said one or more substituent fragments comprise one or more members selected from the group consisting of: physiological absorption of said one or more substituent fragments, distribution of said one or more substituent fragments, metabolism of said one or more substituent fragments, excretion of said one or more substituent fragments, toxicity of said one or more substituent fragments, and ease of synthesis of said one or more substituent fragments.

39. The method of claim 28, wherein said one or more seed fragments or said one or more substituent fragments comprise one or more members selected from the group consisting of: atoms, molecules, and molecular fragments.

40. The method of claim 28, wherein one or more of said one or more seed fragments are located a distance of at least 1 nanometer (nm) from said interaction site.

41. The method of claim 28, wherein one or more of said one or more substituent fragments are located a distance of at most 1 nanometer (nm) from said interaction site.
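
The distance conditions of claims 40 and 41 can be checked with a short sketch. It assumes, purely for illustration, that each fragment and the interaction site are represented by a single point (for example, a centroid) with Cartesian coordinates in nanometers; the claims do not specify how the distance is measured, and the function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # assumed (x, y, z) coordinates in nanometers

THRESHOLD_NM = 1.0  # the 1 nm value recited in claims 40 and 41


def distance_nm(a: Point, b: Point) -> float:
    """Euclidean distance between two points, in nanometers."""
    return math.dist(a, b)


def seed_satisfies_claim_40(seed: Point, site: Point) -> bool:
    """True if the seed fragment lies at least 1 nm from the site."""
    return distance_nm(seed, site) >= THRESHOLD_NM


def substituent_satisfies_claim_41(substituent: Point, site: Point) -> bool:
    """True if the substituent fragment lies at most 1 nm from the site."""
    return distance_nm(substituent, site) <= THRESHOLD_NM


# Hypothetical coordinates, chosen only to exercise both checks.
site_position: Point = (0.0, 0.0, 0.0)
seed_position: Point = (1.2, 0.0, 0.0)
substituent_position: Point = (0.3, 0.4, 0.0)

seed_ok = seed_satisfies_claim_40(seed_position, site_position)
substituent_ok = substituent_satisfies_claim_41(substituent_position, site_position)
```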

42. A non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a small molecule configured to modulate a macromolecule, said method comprising:

(a) obtaining (i) a representation of said macromolecule comprising an interaction site for interacting with said small molecule, and (ii) a representation of said interaction site of said macromolecule;

(b) performing a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises said small molecule having a structure of said plurality of structures;

(c) determining an energy of interaction of each of said plurality of small molecules with said interaction site of said macromolecule, to identify said small molecule having said structure; and

(d) electronically outputting a report corresponding to said structure of said small molecule.

43. A system for identifying a small molecule configured to modulate a macromolecule, comprising:

a database comprising (i) a representation of said macromolecule comprising an interaction site for interacting with said small molecule, and (ii) a representation of said interaction site of said macromolecule; and

one or more computer processors operatively coupled to said database, wherein said one or more computer processors are individually or collectively programmed to:

obtain (i) said representation of said macromolecule comprising an interaction site for interacting with said small molecule, and (ii) said representation of said interaction site of said macromolecule;

perform a generative modeling artificial intelligence (AI) procedure to generate a plurality of structures of a plurality of small molecules, which plurality of small molecules comprises said small molecule having a structure of said plurality of structures;

determine an energy of interaction of each of said plurality of small molecules with said interaction site of said macromolecule, to identify said small molecule having said structure; and

electronically output a report corresponding to said structure of said small molecule.
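
The sequence of operations recited for the one or more computer processors (obtain, generate, determine energies, output a report) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the generator and the energy function are toy, hypothetical stand-ins, the structures are plain strings, the site identifier is made up, and selecting the lowest-energy candidate is one plausible reading of "identify".

```python
import json
from typing import Callable, Dict, Iterable


def identify_small_molecule(
    site: str,
    generate: Callable[[str], Iterable[str]],
    energy_of_interaction: Callable[[str, str], float],
) -> Dict[str, object]:
    """Generate candidate structures, score each against the site,
    and return a report for the lowest-energy candidate."""
    structures = list(generate(site))
    energies = {s: energy_of_interaction(s, site) for s in structures}
    best = min(energies, key=lambda s: energies[s])
    return {
        "interaction_site": site,
        "structure": best,
        "energy": energies[best],
        "candidates_evaluated": len(structures),
    }


def toy_generate(site: str) -> Iterable[str]:
    # Hypothetical stand-in for the generative modeling AI procedure.
    return ["CCO", "CCN", "c1ccccc1"]


def toy_energy(structure: str, site: str) -> float:
    # Hypothetical stand-in for an energy calculation; not physically meaningful.
    return -float(len(structure))


report = identify_small_molecule("site_A", toy_generate, toy_energy)
report_json = json.dumps(report, sort_keys=True)  # the electronically output report
```

Passing the generator and the energy function as parameters keeps the selection logic independent of any particular generative model or scoring method, so either can be swapped without changing the reporting step.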