WO2024161359A2 - Compound representation and property analysis at scale

Compound representation and property analysis at scale

Info

Publication number
WO2024161359A2
WO2024161359A2 PCT/IB2024/050953 IB2024050953W WO2024161359A2 WO 2024161359 A2 WO2024161359 A2 WO 2024161359A2 IB 2024050953 W IB2024050953 W IB 2024050953W WO 2024161359 A2 WO2024161359 A2 WO 2024161359A2
Authority
WO
WIPO (PCT)
Prior art keywords
compound
compounds
model
property
facility
Prior art date
Application number
PCT/IB2024/050953
Other languages
French (fr)
Other versions
WO2024161359A3 (en
Inventor
Dean PLUMBLEY
Liisi LAANISTE
Louwai MUHAMMED
Original Assignee
Cosyne Therapeutics Limited
Priority date
Filing date
Publication date
Application filed by Cosyne Therapeutics Limited
Publication of WO2024161359A2
Publication of WO2024161359A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N10/00 Quantum computing, i.e. information processing based on quantum-mechanical phenomena
    • G06N10/60 Quantum algorithms, e.g. based on quantum optimisation, quantum Fourier or Hadamard transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N10/00 Quantum computing, i.e. information processing based on quantum-mechanical phenomena
    • G06N10/40 Physical realisations or architectures of quantum processors or components for manipulating qubits, e.g. qubit coupling or qubit control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • Different chemical compounds may have different properties. Such properties may affect how a compound interacts with its environment, including its interactions with other compounds. The properties may therefore affect whether a compound is able to perform a desired function in interacting with an environment of the compound, the circumstances under which the compound is able to perform the function, or the effectiveness of the compound in performing the function, or otherwise affect the compound’s interactions.
  • a method comprising creating, using a first model, a second model for predicting information regarding a property of compounds input to the second model, wherein the first model was trained using compound information to generate at least one other output different from the information regarding the property.
  • Creating the second model comprises editing the first model to generate the second model and training the second model using training data for the property.
  • the first model comprises a first neural network
  • editing the first model to generate the second model comprises adding at least one layer to, removing at least one layer from, and/or adjusting at least one layer of the first neural network to generate a second neural network.
  • adjusting at least one layer of the first neural network comprises adjusting values of one or more parameters of the at least one layer of the first neural network.
  • the first model comprises a first neural network
  • editing the first model to generate the second model comprises adding a classifier to the first neural network
  • training the second model comprises training the classifier using the training data for the property.
  • the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
  • the method further comprises generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds, generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
  • the first neural network trained using compound information is trained to identify compounds that comply with at least one chemical rule.
  • the method further comprises receiving a request to analyze a library of compounds of interest, the request comprising input characterizing the library of compounds to be analyzed; determining, using the second model, for each compound in the library of compounds, a value for the compound with respect to the property, to generate a set of values of the property for compounds of the library of compounds; and outputting information regarding the set of values of the property for the compounds of the library of compounds.
  • a method comprising creating, using a first model, a second model for predicting information regarding a functional property of compounds input to the second model, wherein the first model was trained using a first amount of compound information to identify compounds that comply with at least one rule of physics and/or chemistry regarding compounds.
  • Creating the second model comprises editing the first model to generate the second model and training the second model using training data for the property, the training data being a second amount of training data that is less than the first amount of compound information.
  • the first model comprises a first neural network
  • editing the first model to generate the second model comprises adding a classifier to the first neural network
  • training the second model comprises training the classifier using the training data for the property.
  • the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
  • a number of compounds in the plurality of compounds of the training data used to train the second model is less than a number of compounds in the compound information used to train the first model.
  • the method further comprises generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds: generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
  • a method comprising generating a digital representation of a compound.
  • the generating comprises receiving an identification of a plurality of atoms and/or molecules of a compound, receiving information regarding interconnections of the plurality of atoms and/or molecules of the compound, receiving information regarding distances between the plurality of atoms and/or molecules of the compound, and generating the digital representation of the compound using the identification of the plurality of atoms and/or molecules, the information regarding the interconnections, and the information regarding the distances.
  • receiving the information regarding the distances comprises receiving information regarding a three-dimensional (3D) structure and/or arrangement of the plurality of atoms and/or molecules of the compound.
  • generating the digital representation of the compound comprises applying at least one transformer to the identification of the plurality of atoms and/or molecules of the compound.
  • the identification of the plurality of atoms and/or molecules of the compound comprises a graph representation of the compound.
  • the method further comprises generating the graph representation of the compound, wherein generating the graph representation of the compound includes: encoding the plurality of atoms and/or molecules of the compound as a plurality of nodes in the graph representation; encoding the interconnections of the plurality of atoms and/or molecules of the compound as a plurality of edges in the graph representation; and iteratively traversing nodes in the plurality of nodes along edges in the plurality of edges to update the graph representation.
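  • As a non-limiting illustration, the graph representation described above (atoms as nodes, interconnections as edges, distances taken from a 3D arrangement, and iterative traversal along edges) may be sketched as follows. The names `MolGraph`, `build_graph`, and `message_pass`, the use of atomic number as the initial node feature, the distance-weighted update rule, and the example coordinates are all assumptions made for illustration and are not taken from the source.

```python
import math
from dataclasses import dataclass, field

# Hypothetical lookup used only to give each node a numeric starting feature.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8}


@dataclass
class MolGraph:
    elements: list   # node labels, e.g. ["C", "C", "O"]
    bonds: list      # edges as (i, j) index pairs
    coords: list     # one (x, y, z) tuple per atom
    features: list = field(default_factory=list)


def build_graph(elements, bonds, coords):
    """Encode atoms as nodes, interconnections as edges, and keep 3D positions."""
    graph = MolGraph(list(elements), list(bonds), list(coords))
    graph.features = [float(ATOMIC_NUMBER[e]) for e in graph.elements]
    return graph


def distance(graph, i, j):
    """Euclidean distance between atoms i and j."""
    return math.dist(graph.coords[i], graph.coords[j])


def message_pass(graph, rounds=2):
    """Iteratively update node features from bonded neighbors.

    Assumed update rule: each node keeps half of its own feature and adds
    half of the distance-weighted sum of its neighbors' features.
    """
    neighbors = {i: [] for i in range(len(graph.elements))}
    for i, j in graph.bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    feats = list(graph.features)
    for _ in range(rounds):
        updated = []
        for i, own in enumerate(feats):
            msg = sum(feats[j] / distance(graph, i, j) for j in neighbors[i])
            updated.append(0.5 * own + 0.5 * msg)
        feats = updated
    return feats


# Heavy atoms of a three-atom chain with made-up coordinates.
g = build_graph(["C", "C", "O"], [(0, 1), (1, 2)],
                [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0)])
node_features = message_pass(g, rounds=1)
```

  • In this sketch each round lets information travel one bond further, so after several rounds a node's feature reflects both connectivity and inter-atom distances in its neighborhood.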
  • an apparatus comprising at least one processor and at least one computer-readable storage medium having encoded thereon executable instructions that, when executed by the at least one processor, cause the at least one processor to carry out any one or any combination of the foregoing methods.
  • At least one computer-readable storage medium encoded with computer-executable instructions that, when executed by a computer, cause the computer to carry out any one or any combination of the foregoing methods.
  • FIG. 1A is a schematic diagram of a compound analysis system using quantum computing for property analysis of compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 1B and FIG. 1C are diagrams of a superconducting flux qubit with which some embodiments may operate;
  • FIG. 2A is a flowchart of an illustrative method 200 of determining and/or analyzing properties of compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 2B is a diagram for encoding compounds to predict properties of the compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 2C is a diagram of using quantum computing for analysis of the predicted properties of the compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 3A is a flowchart of an illustrative method 300 of encoding properties of the compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 3B is a diagram of encoding compounds from a graph, in accordance with some embodiments of the technology described herein;
  • FIG. 3C is a diagram of using iterative passing to traverse nodes along the edges to update the graph, in accordance with some embodiments of the technology described herein;
  • FIG. 3D illustrates an example of a technique for encoding information regarding a compound, which may be implemented in some embodiments;
  • FIG. 4A is a flowchart of an illustrative method 400 of training one or more models to generate compound representations of compounds and identify property values of the compounds from the compound representations of the compounds to establish a trained library of compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 4B is a diagram of a model for encoding and decoding compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 4C is a diagram of training the model to predict properties of compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 5A is a flowchart of an illustrative method 500 of decoding a compound representation of a compound to identify properties of the compound, in accordance with some embodiments of the technology described herein;
  • FIG. 5B is a flowchart of an illustrative method 550 of decoding a compound representation of a compound to synthesize the compound, in accordance with some embodiments of the technology described herein;
  • FIG. 6A is a flowchart of an illustrative method 600 of training a model to predict properties of the compounds from compound representations of the compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 6B is a diagram of adjusting the model for predicting properties of compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 6C is a flowchart of an illustrative method 650 of training the model to use training data to predict new properties of the compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 6D is a diagram of the trained model outputting the predicted properties, in accordance with some embodiments of the technology described herein;
  • FIG. 7A is a flowchart of an illustrative method 700 of identifying compounds to analyze, in accordance with some embodiments of the technology described herein;
  • FIG. 7B is a diagram of obtained chemical structures with which some embodiments may operate;
  • FIG. 7C is a diagram of aligned chemical conformers with which some embodiments may operate.
  • FIG. 7D is a diagram of binding pockets in a protein target with which some embodiments may operate.
  • FIG. 7E and FIG. 7F are diagrams of docked ligands with which some embodiments may operate;
  • FIG. 7G is a diagram of obtaining known inhibitors with which some embodiments may operate.
  • FIG. 7H is a diagram of sanitizing and relaxing conformers with which some embodiments may operate.
  • FIG. 7I, FIG. 7J, and FIG. 7K are diagrams of pharmacophores for a ligand with which some embodiments may operate;
  • FIG. 7L and FIG. 7M are diagrams of a protein binding site with which some embodiments may operate;
  • FIG. 8A is a flowchart of an illustrative method 800 of determining and/or analyzing criteria, in accordance with some embodiments of the technology described herein;
  • FIG. 8B is a diagram of a set of compounds with which some embodiments may operate.
  • FIG. 8C is a diagram of one or more criteria with which some embodiments may operate.
  • FIG. 8D is a diagram of one or more criteria with which some embodiments may operate.
  • FIG. 9 is a flowchart of an illustrative method 900 of configuring the quantum annealer, in accordance with some embodiments of the technology described herein;
  • FIG. 10A is a flowchart of an illustrative method 1000 of refining outputted compounds, in accordance with some embodiments of the technology described herein;
  • FIG. 10B is a diagram of criteria with which some embodiments may operate.
  • FIG. 10C is a diagram of the quantum annealer with which some embodiments may operate.
  • FIG. 10D is a diagram of the compound refining with which some embodiments may operate.
  • FIG. 10E is a diagram of the identified compound with which some embodiments may operate.
  • FIG. 11 is a schematic diagram of an illustrative computing device with which aspects described herein may be implemented.
  • Described herein are embodiments of techniques for generative and/or predictive artificial intelligence (AI)-driven modeling for chemistry, across different compounds, types of compounds, uses for compounds, and industries.
  • Some techniques described herein may include and enable generating in silico novel compounds and generating structural and/or functional properties of unknown (or known) compounds, which compounds may in some cases be specific conformers of molecules.
  • systems may create, maintain, supplement, analyze, and otherwise interact with compound libraries extending into the billions of compounds, or potentially on the order of 10^60 compounds or beyond in some cases, and including information on an array of properties of such compounds.
  • Some embodiments may enable artificial intelligence (AI)-driven generative processing of information on previously-unknown or understudied properties of known or unknown compounds to build such a library.
  • some embodiments may enable identification of the compound(s) that represent the global maximum for performance of a desired task or function rather than, as is conventional, merely a local performance maximum or other compound that may perform well according to available data for a small set of known compounds.
  • Using some of the techniques described herein may enable a transformation of the drug discovery process (or compound discovery for other fields), including a transformation of analyzing previously-unknown and/or previously-understudied compounds, by bringing the analysis timeline from lifetimes or decades down to hours, days, or weeks and increasing the precision and reliability of in silico analysis.
  • Some embodiments described herein may include AI transformer models or other models for, at high speed and high reliability, generating previously unknown information regarding compounds, including for previously-unknown compounds.
  • Some such techniques may leverage one or more models that, through training, have learned rules of physics and/or chemistry that define possible or functional compounds (in general and/or in specific industries or use cases), and/or that define compounds that are more likely to be functional in a particular context.
  • Such rules may, in some embodiments, include those such as “Lipinski’s Rule of 5,” which defines a space of druglike molecules that have pharmacokinetic properties within the human body that make them more likely candidates for drugs than other molecules that do not meet the rule.
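  • As a non-limiting illustration of such a rule, Lipinski's Rule of 5 is commonly stated as four thresholds (molecular weight of at most 500 daltons, logP of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors), with a compound usually regarded as druglike when it violates no more than one of them. The sketch below applies those thresholds to precomputed descriptors. The `Descriptors` container, the function names, and the example values are hypothetical, and the source does not state that a trained model evaluates the rule in this explicit form.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float        # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Count how many of the four Rule of 5 thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_druglike(d, max_violations=1):
    """Apply the common reading of the rule: at most one violation."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values for two hypothetical compounds.
small_molecule = Descriptors(180.2, 1.2, 1, 4)
large_molecule = Descriptors(900.0, 7.0, 6, 12)
print(is_druglike(small_molecule), is_druglike(large_molecule))
```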
  • Some such trained models may define a continuous representation of chemical space that allows for fast analysis of compounds and fast, reliable generation of information regarding compounds, such as for fast, reliable generative AI for compounds.
  • such models trained with rules for chemistry may be used for reliable transfer learning, such as by editing parts of an existing trained model (e.g., by adding or adjusting output layers, in a case that a model is a neural network). This may include creating a new model for prediction of whether input compounds have a property or predicting a value of such a property for input compounds, by editing an existing model that is trained with information regarding rules of physics and/or chemistry and then training the edited model to operate with that property. This may enable generation of property predictions for previously-unknown or previously-understudied properties. This may also enable generation of property predictions for previously-unknown or previously-understudied compounds.
  • such a model created by editing a previously-existing model may be used to build (e.g., supplement) a library of information regarding compounds, which may include billions of compounds or even on the order of 10^60 compounds or beyond.
  • a library may include information on a large scale of compounds and potentially on a large scale of properties.
  • a model may be created using techniques described herein to make predictions regarding a property not previously defined in the library, after which the model may be used to generate predictions for that property for all compounds of the library (e.g., billions of compounds or up to 10^60 compounds or beyond), on the order of hours, days, or weeks.
  • Such a model may be created and applied even where there is limited available data for the property, which may have prevented training of a reliable model using conventional training and model creation techniques.
  • Some techniques described herein enable generation, training, and use of the model on a practical research timeline, rather than the years or decades that conventional techniques would require for attempting this processing.
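  • As a non-limiting illustration of the model editing described above, the sketch below keeps the layers of a first model fixed, adds a classifier as a new output layer, trains only that classifier on a small labeled set for a new property, and then scores every compound of a library. Everything concrete here is an assumption for illustration: `pretrained_encoder` is a stand-in with made-up fixed weights, the two-number compound inputs and labels are toy data, and logistic regression trained by gradient descent is merely one possible classifier.

```python
import math


def pretrained_encoder(x):
    """Stand-in for the frozen layers of a first model (hypothetical weights)."""
    weights = [[0.8, -0.3], [0.2, 0.9], [-0.5, 0.4]]
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=500, lr=0.5):
    """Train only the added classifier; the encoder is never updated."""
    dim = len(pretrained_encoder(examples[0][0]))
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, label in examples:
            h = pretrained_encoder(x)
            p = sigmoid(sum(w * hi for w, hi in zip(weights, h)) + bias)
            err = p - label  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * hi for w, hi in zip(weights, h)]
            bias -= lr * err
    return weights, bias


def predict(head, x):
    """Probability that a compound has the property, per the edited model."""
    weights, bias = head
    h = pretrained_encoder(x)
    return sigmoid(sum(w * hi for w, hi in zip(weights, h)) + bias)


# Small labeled set for the new property (toy data).
train = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
head = train_head(train)

# Apply the second model to every compound in a (tiny, hypothetical) library.
library = {"cmpd_a": [0.8, 0.1], "cmpd_b": [0.2, 0.9]}
scores = {name: predict(head, x) for name, x in library.items()}
```

  • Because only the small added layer is fitted in this sketch, far fewer labeled examples are needed than were used to train the first model, which mirrors the limited-data setting described above.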
  • Such a representation may include structural and/or functional information regarding a compound, such as information on locations of and/or interactions of atoms of a compound.
  • Some techniques described below also enable fast, reliable processing of information regarding a large-scale library of compounds. Such processing of information may enable an identification, from the library, of one or a set of compounds that may meet identified performance criteria for a task or function, such as the global best-performing compound(s) for a desired task or function or meeting other criteria.
  • large scale libraries of information regarding compounds and properties of compounds may be analyzed, including by (in some cases) analyzing information on every compound in the library, on a time scale on the order of hours, days, or weeks.
  • techniques described herein may operate with a variety of compounds and with a variety of properties of interest for those compounds, and are not limited to operating with any particular type of compound or industry.
  • compounds may be any molecule that includes two or more atoms.
  • compounds with which techniques described herein may operate may be or include drugs, which may include pharmaceuticals, biologics, medications, medicines, or other compounds that have a physiological or psychological effect.
  • drugs may include proteins (e.g., antibodies or other proteins) or parts of proteins such as fragments or peptides.
  • Such drugs may additionally or alternatively include nucleotides or nucleic acids, such as DNAs, RNAs, oligonucleotides, peptides, or others.
  • compounds may be included in antibodies, antisense oligonucleotides, mRNA vaccines, peptide drugs, Proteolysis-targeting chimeric molecules (PROTACs), small interfering RNA (siRNA), or drug delivery molecules.
  • properties that may be analyzed may include whether the compound is blood brain barrier penetrant, whether it is bioavailable, whether it can bind to a specified target or how specifically it binds, how volatile the compound is, how thermostable the compound is, whether it satisfies specified criteria for “ease” of synthesis or manufacturing or distribution at scale, or other properties.
  • While compounds may be drugs, embodiments are not so limited. Other molecules performing other functions or serving other purposes may be analyzed, including for functions or purposes that are not biological or pharmaceutical.
  • the compounds can be used in battery development, petrochemical industry, biodegradable plastics, veterinary medicine, organic light-emitting diodes (OLED), colorants, dyes, paints, agriculture, or pesticides. In such other embodiments and for other uses, any suitable compound property may be analyzed.
  • Drug discovery is a process by which to identify a drug that may perform a particular desired function with desired properties.
  • drug discovery was a manual trial-and-error process, with drugs being synthesized (e.g., manufactured, isolated, or otherwise generated) and then tested to determine their properties and how well they performed the function or whether they had the desired properties.
  • Such synthesizing and testing took a great deal of expense and effort and thus was limited in the drugs (e.g., the number and/or types of drugs) that could be and were analyzed.
  • the techniques were limited to drugs that already could be synthesized with existing equipment, which in many cases may have been previously-synthesized drugs, limiting the ability to discover new drugs.
  • with machine learning, an engine may be given a structure of a known drug, such as an identification of atoms within the drug and an identification of which atoms are bound to which other atoms, as well as properties that had been determined for that known drug using testing. The machine learning engine may then attempt to identify from the input data a relationship between the atoms of a drug or an arrangement of atoms within a drug, and a performance of the drug with respect to a property. Once it has inferred such relationships, the machine learning engine may be queried to determine properties for other drugs that have not yet been evaluated. Through repeated querying, a user may seek to leverage the trained machine learning engine to identify a drug that may perform better with respect to the property than the known drugs that were used as training data.
  • the technique relies on training a model using existing drug data.
  • the performance of the model is linked to that existing data, such that if the existing data includes errors or includes limited data, the wrong relationships may be learned by the engine, which can compromise the outputs from the engine.
  • Such limitations on data may stem from imprecision of existing data, such as imprecision of representations of compounds.
  • existing descriptors for compounds can take the form of identifiers for physicochemical properties of a compound.
  • Example descriptors include a number of atoms, a molecular weight, and a chemical fingerprint that uses bit vectors to indicate the presence or absence of particular chemical fragments.
  • Such conventional representations do not include two-dimensional (2D) and three-dimensional (3D) structural information. For example, a bit vector identifying that a fragment is present in a molecule does not describe occurrences of that fragment in the molecule, such as information about the amount of that fragment or its spatial positioning within a molecule.
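  • As a non-limiting illustration of this loss of information, the sketch below builds a presence/absence bit vector from fragment counts. The fragment vocabulary `FRAGMENTS`, the function name, and the two example molecules are hypothetical. The two molecules differ in how many times a fragment occurs yet receive the same fingerprint, and neither fingerprint says anything about where a fragment sits in the molecule.

```python
# Hypothetical fragment vocabulary; one bit per fragment.
FRAGMENTS = ["OH", "C=O", "NH2", "benzene ring"]


def presence_fingerprint(fragment_counts):
    """Return a bit vector: 1 if the fragment occurs at least once, else 0."""
    return [1 if fragment_counts.get(f, 0) > 0 else 0 for f in FRAGMENTS]


# Two made-up molecules that differ only in the number of OH groups.
mol_a = {"OH": 1, "C=O": 1}
mol_b = {"OH": 3, "C=O": 1}

fp_a = presence_fingerprint(mol_a)
fp_b = presence_fingerprint(mol_b)
print(fp_a, fp_b, fp_a == fp_b)
```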
  • SMILES can provide a character-based description of a molecule that is easy to manipulate, but it is known that multiple different SMILES strings can represent the same compound. As such, a SMILES string does not and cannot uniquely identify a compound, such as a particular conformer of a molecule, and thus an output of a SMILES string is not useful for identifying a particular compound.
  • Another difficulty with SMILES strings is that the strings correspond to a 2D representation of a molecule, but molecules do not exist in 2D.
  • the 3D structure of a molecule is not represented by and cannot be known from a SMILES string, meaning a conformer of a molecule cannot be identified with a SMILES string.
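  • As a non-limiting illustration of several SMILES strings describing one molecule, the sketch below parses a deliberately tiny SMILES subset (the atoms C, N, and O, implicit single bonds, and parenthesized branches) and compares the resulting connectivity. The parser and the `signature` function are assumptions written for this example. They are not a general SMILES implementation, and matching signatures are only a necessary condition for two graphs being the same, which suffices for the small molecules shown. The parsed result also contains no coordinates, consistent with the point above that a SMILES string carries no 3D structure.

```python
def parse_simple_smiles(smiles):
    """Parse atoms C/N/O with implicit single bonds and branches only."""
    atoms, bonds = [], []
    prev = None   # index of the atom the next atom will bond to
    stack = []    # saved `prev` values, one per open branch
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in "CNO":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def signature(smiles):
    """Order-independent summary: each atom with its sorted neighbor elements."""
    atoms, bonds = parse_simple_smiles(smiles)
    neigh = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neigh[i].append(atoms[j])
        neigh[j].append(atoms[i])
    return sorted((atoms[i], tuple(sorted(neigh[i]))) for i in neigh)


# Three different strings for ethanol, and one for dimethyl ether.
ethanol_strings = ["CCO", "OCC", "C(O)C"]
same = all(signature(s) == signature("CCO") for s in ethanol_strings)
different = signature("COC") != signature("CCO")
print(same, different)
```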
  • a conventional learning approach can also be limited by insufficient comprehensiveness of the input drug data, because existing chemical property prediction models suffer from extrapolation issues when tested on compounds dissimilar from the compounds found in the training dataset. Due to limitations of the training that was done, a trained machine learning model is only able to produce estimates for new drugs that are similar to the drugs it has already seen. Such limitation on similarity means these conventional techniques are limited to identifying properties for drugs that have only minor structural variances from the input compounds and the model cannot identify (or cannot reliably identify) properties of drugs with significant structural variances from that input.
  • the model may be limited to identifying only the best-performing similar drug, rather than a best-performing drug.
  • this conventional machine learning approach might be considered to be a determination of a local best performing drug candidate in that portion of the landscape, and not a determination of a global best performing drug candidate across the entirety of that landscape or a determination of a best performing drug candidate over more than just the one portion of the drug landscape.
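  • As a non-limiting illustration of the difference between a local and a global best-performing candidate, the sketch below compares a neighbor-by-neighbor search that starts near known compounds with an exhaustive evaluation of a toy landscape. The one-dimensional list of scores, the function names, and the neighbor definition are all hypothetical simplifications of a compound landscape.

```python
def hill_climb(scores, start):
    """Move to the better adjacent candidate until no neighbor improves."""
    i = start
    while True:
        best = i
        for j in (i - 1, i + 1):
            if 0 <= j < len(scores) and scores[j] > scores[best]:
                best = j
        if best == i:
            return i
        i = best


def global_best(scores):
    """Evaluate every candidate and return the index of the best one."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up performance values for nine candidates arranged by similarity.
scores = [0.1, 0.4, 0.6, 0.5, 0.2, 0.3, 0.7, 0.95, 0.8]

local_result = hill_climb(scores, start=1)   # search starting near known compounds
global_result = global_best(scores)          # search over the whole landscape
print(local_result, global_result)
```

  • In this toy landscape the search that starts at index 1 stops at index 2, a local maximum, while evaluating every candidate finds index 7.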
  • Insufficient comprehensiveness of data also creates limitations in conventional model training approaches, limiting the utility of such models and model training and inhibiting use.
  • To train a model to make predictions regarding compounds such as a prediction regarding whether a compound has a property or a numeric value for a property of a compound, there needs to be sufficient information for patterns to be identified and/or relationships to be learned. With insufficient data, the learning is insufficient and low-quality output is generated, which may include (and often does include) incorrect output.
  • Training of models is conventionally limited, then, to compounds and/or properties for which a large amount of data can be obtained and input to the model during a training phase. Often, this means the conventional training is limited to compounds and/or properties included in publicly available data sets.
  • the type of machine learning analysis described above includes two phases, a training phase where the input training data is initially processed and a production phase where the trained model is used to generate or analyze new data. Both take time, particularly when the model is queried repeatedly during the production phase to generate or determine properties of many different drugs or drugs are analyzed in an attempt to find a better- or best-performing candidate among options. If such an approach were to be used with an entire landscape of drug candidates, the analysis across both phases could take many lifetimes. Even if the analysis were distributed across a system of processors sharing resources and results, the analysis could take decades.
  • Such timelines are governed in part by the manner in which conventional machine learning is done, the manner in which conventional computing hardware operates, and the manner in which conventional transistor-based central processing units are provided with data, process the data, and output the data. The timelines therefore affect any large-scale computational processing using this hardware, beyond just drug property generation or drug analysis.
  • These timelines are impractical for any computational processing, and particularly so for research and development of drug candidates that are sought for treating or curing diseases in the near term or for commercialization. Accordingly, even if sufficient data were to be available to train a machine learning model for more types of drugs with different structures and different properties, the processing across a large-scale drug landscape still could not be performed with these conventional machine learning approaches.
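  • As a rough, non-limiting illustration of why such timelines become impractical, the arithmetic below converts a library size and an evaluation rate into years of sequential processing. The throughput figures are purely hypothetical and are not taken from the source. Only the library sizes (billions of compounds, or on the order of 10^60) correspond to scales mentioned in this description.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_score(num_compounds, compounds_per_second):
    """Years needed to evaluate every compound once at a fixed rate."""
    return num_compounds / compounds_per_second / SECONDS_PER_YEAR


# Hypothetical rates: 1,000 and 1,000,000,000 evaluations per second.
billion_at_1e3 = years_to_score(10**9, 10**3)
huge_at_1e9 = years_to_score(10**60, 10**9)
print(f"{billion_at_1e3:.4f} years, {huge_at_1e9:.2e} years")
```

  • Under these assumed rates, a billion compounds take on the order of days, while a library on the order of 10^60 compounds remains far out of reach even at a million times the throughput.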
  • the inventors have therefore recognized and appreciated the desirability of computational techniques for analysis of compound properties that do not depend on conventional learning techniques, do not depend on conventional molecule representations, and/or do not depend on conventional arrangements or uses of central processing units.
  • Some embodiments described herein include techniques for analyzing properties of compounds, such as to determine a compound or set of compounds that perform a desired function or meet other criteria.
  • techniques for generating compounds or generating property information for compounds, including in some embodiments using generative AI techniques. Some such embodiments may include processing a representation of a compound to predict a value for a property of the compound. In some such embodiments, techniques described herein can include training a model for predicting, for a compound, a value for a property of the compound. Some such models may be created by editing an existing model that was previously trained to predict another compound property, such as by adding, removing, or otherwise adjusting one or more layers of an existing model in a case in which a model is a neural network or other model including layers. For example, in some cases a first model may be trained with training data regarding compounds, such as using available property data on compounds.
  • the model may learn information regarding compounds in general and/or regarding one or more properties of compounds.
  • Where the model is a neural network, one or more layers of the neural network may learn general information on structure of compounds or structure of compounds that may be candidate compounds (e.g., pharmaceuticals).
  • Such general information could include a general understanding of possible chemical structures, or effective chemical structures for a function, and/or rules of physics or chemistry that define possible compounds or compounds that may be useful in particular contexts or for particular tasks/functions.
  • the model may also learn information regarding one or more properties, such as to determine or predict for input compounds values for the one or more properties, which may not have been previously known for the input compounds.
  • Such a trained model may be used to analyze a library of compounds to determine values for the property.
  • a data store of information regarding compounds may be generated, which may include for each compound of a large set of compounds a value for each property of a set of properties.
  • Such a data store of compound information may, in some cases, include binary, discrete, and/or continuous values for compounds.
  • the data store may be supplemented over time with values for additional properties, for each of the compounds of interest.
  • While the data store may include information on a large number of compounds that may be or will be analyzed (including using quantum computing hardware or other computing hardware), such as the 10^60 compounds satisfying “Lipinski’s Rule of 5,” information may not be available for all properties of all compounds. This may additionally be the case for conformers, where information may be available for one conformer of a molecule but not others. It may be advantageous in some embodiments, therefore, to be able to generate property information for compounds, for addition to a data store of information on compounds as properties and/or compounds are added to a data store.
  • a model for predicting a value of a property for an input compound may be created by editing an existing model that may have been trained to predict values for another property for input compounds. For example, if such an existing model is a neural network, one or more layers of the neural network may be edited through adding, removing, or adjusting layers to create a new neural network. Other layers may be unchanged in some cases, or changed only in how they connect to the edited layer(s). Types of models other than neural networks may be used.
  • An advantage of using such editing of existing models is to take advantage of the training that had already been done on that existing model, for the other property for which the existing model was trained to predict values.
  • Such prior training may have set parameters of the portions (e.g., layers) of the existing model, and the new model that is generated may include in unedited portions of the existing model some of those parameters.
  • a training burden may be reduced for creation of the new model. This may be advantageous in a variety of ways, including to mitigate the burden on conventional model training created by insufficiency of data or limitations on access to data. This is particularly the case for properties or for compounds for which limited data exists.
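  • As a minimal sketch of the model-editing idea above, the following Python keeps a previously trained layer fixed and fits only a new output layer for a second property. The layer sizes, the tanh layer, the random stand-in weights, and the synthetic data are all assumptions made for illustration; they are not taken from the embodiments described herein.

```python
import math
import random


def body(x, W):
    """Shared layer whose weights W are assumed to come from prior training."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def predict(x, W, head):
    """Full model: shared body followed by a property-specific linear head."""
    return sum(a * h for a, h in zip(head, body(x, W)))


def fit_new_head(data, W, lr=0.1, epochs=500):
    """Train only a fresh head for the new property; W is never modified."""
    head = [0.0] * len(W)
    for _ in range(epochs):
        for x, y in data:
            h = body(x, W)
            err = sum(a * hi for a, hi in zip(head, h)) - y
            head = [a - lr * err * hi for a, hi in zip(head, h)]
    return head


def sse(data, W, head):
    """Sum of squared prediction errors over a data set."""
    return sum((predict(x, W, head) - y) ** 2 for x, y in data)


random.seed(0)
# Hypothetical weights standing in for a body already trained on property A.
W_pretrained = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W_before = [row[:] for row in W_pretrained]

# Small synthetic data set for new property B (only eight labelled inputs).
true_head_b = [0.5, -1.0, 0.25, 2.0]
inputs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
data_b = [(x, predict(x, W_pretrained, true_head_b)) for x in inputs]

sse_before = sse(data_b, W_pretrained, [0.0] * 4)
head_b = fit_new_head(data_b, W_pretrained)
sse_after = sse(data_b, W_pretrained, head_b)
```

  • Only the four head parameters are fitted here, which is why a handful of labelled examples can suffice in this toy setting; the reused body supplies the rest.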
  • representations of compounds may capture structural and/or functional properties with higher precision than is available with existing representations. Such representations may, in some embodiments, be used as input to property value prediction models, or may in other embodiments be used in other ways. Conventional representations were imprecise or could be ambiguous as to the molecule being represented, as the same molecule may correspond to multiple representations or a representation may correspond to multiple different molecules. Conventional representations may also not include sufficient information to identify a conformer of a molecule with particularity. In some embodiments described herein, a representation of a compound may include sufficient information to identify a particular conformer of a molecule and include other structural and/or functional information regarding a compound.
  • Such a representation may be useful in a variety of contexts, including in some embodiments for processing a representation of a compound to determine one or more values for one or more properties of a compound, where such values may in some cases be subsequently processed in a property analysis or used in other ways.
  • such representations may be useful for analyzing a compound to determine information on properties of the compound, such as by input of the representation of the compound to a model trained to output a value of a property of the compound.
  • a representation that more precisely identifies a compound and more comprehensively includes information regarding a compound may aid in more accurately predicting properties of compounds.
  • some embodiments may include techniques for generating a graph representation of a compound, where the graph includes nodes that each correspond to an atom of a compound and where edges are defined in the graph that represent bonds between the atoms of the compound. Information regarding each atom of the compound and interactions of the atoms of the compound may be added to the representation, including being associated with nodes and/or edges of the graph.
  • nodes and/or edges of the graph may be associated with information regarding the compound, such as structural and/or functional properties of the compound.
  • Such structural and/or functional information may, in some embodiments, be information sufficient to uniquely identify a compound, such as uniquely identifying a conformer of a molecule.
  • the information regarding atoms or interactions of atoms of the compound may be refined through an iterative process by which information regarding an atom is updated based on other atoms of the graph, such as by updating a node based on information regarding other nodes to which the node is connected in the graph by an edge that represents a chemical bond between atoms. Through subsequent iterations, information regarding atoms may be distributed throughout nodes of the graph, even nodes to which a node is not connected by an edge, to reflect potential interactions between atoms that are not directly bonded to one another in a molecule.
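  • The iterative refinement described above can be sketched as a simple neighbour-averaging update. This is an illustrative assumption about the update rule (an equal-weight mix of a node's own vector and the mean of its bonded neighbours), with a made-up four-atom chain and one-hot element features; the embodiments do not prescribe this particular rule.

```python
def message_passing(features, bonds, rounds):
    """Mix each atom's vector with the mean of its bonded neighbours, repeatedly."""
    neighbours = {atom: [] for atom in features}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    state = {atom: list(vec) for atom, vec in features.items()}
    for _ in range(rounds):
        new_state = {}
        for atom, vec in state.items():
            nbrs = neighbours[atom]
            if not nbrs:
                new_state[atom] = list(vec)  # isolated atom: nothing to mix in
                continue
            mean = [sum(state[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(vec))]
            new_state[atom] = [0.5 * v + 0.5 * m for v, m in zip(vec, mean)]
        state = new_state
    return state


# Toy chain O-C-C-N (hydrogens omitted); features are one-hot flags [C, N, O].
features = {"O1": [0, 0, 1], "C1": [1, 0, 0], "C2": [1, 0, 0], "N1": [0, 1, 0]}
bonds = [("O1", "C1"), ("C1", "C2"), ("C2", "N1")]

after_one = message_passing(features, bonds, rounds=1)
after_three = message_passing(features, bonds, rounds=3)
```

  • After one round the oxygen node has seen only its bonded carbon, so its nitrogen component is still zero. After three rounds it is nonzero, even though O1 and N1 share no edge, which is the multi-hop propagation the passage describes.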
  • a representation may also be decoded to enable determination of a structure (e.g., a three-dimensional structure) of a compound corresponding to the representation, or to determine other information regarding the represented compound. This may be advantageous in some cases where a compound to which a representation relates is a compound that has been previously unknown and has not been synthesized before. For example, following analysis of compounds a representation may be output for a compound that has passed some criteria related to the analysis, such as to recommend the compound (or multiple compounds) for a particular function or task. By encoding detailed information regarding a compound, the representation may allow for high-reliability decoding of information regarding a compound, which may aid in subsequent synthesizing or other analysis of the compound.
  • Some embodiments may include training such an encoder to encode property information for a compound in a representation and/or training a decoder to identify from a compound representation the properties of the compound (or other information regarding the compound) that was encoded into the representation. In some cases, this may include training an encoder and/or decoder to perform a high-precision encoding of information regarding a compound that is a particular conformer and high-precision decoding a representation to yield an identification of that conformer as opposed to identifying a molecule without identifying a particular conformer of that molecule, or identifying a group of molecules. For example, some embodiments may include a representation that indicates a distribution of conformers across a conformer space for a compound. Decoding a compound representation for a compound may in some cases include outputting information regarding structural and/or functional properties of the compound, and/or may include outputting information useful in synthesizing the compound (e.g., synthesizing a particular conformer).
  • information regarding the interactions between atoms may include, for each atom, information regarding the interactions of the atom with the other atoms of the compound.
  • a graph representation may include a value for a node that is updated based on information regarding surrounding nodes, to indicate interactions between atoms.
  • the updates can be performed by an iterative process calculating values of nodes and updating nodes based on surrounding nodes, to traverse the graph and update values across iterations.
  • the graph representation for a compound may be converted to a non-graphical representation of the compound.
  • Examples of the non-graphical representation include an array and a vector of values.
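  • One way to picture the conversion is to flatten a graph into an adjacency array plus a fixed-length vector obtained by summing node features. The function name and the sum-pooling choice below are assumptions for illustration only. Sum pooling is lossy and would not, on its own, identify a conformer with the precision discussed above.

```python
def to_array_and_vector(node_features, bonds):
    """Return (adjacency array, pooled feature vector) for a small graph."""
    atoms = sorted(node_features)                 # fixed, reproducible ordering
    index = {atom: i for i, atom in enumerate(atoms)}
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for a, b in bonds:
        adjacency[index[a]][index[b]] = 1
        adjacency[index[b]][index[a]] = 1
    width = len(node_features[atoms[0]])
    pooled = [sum(node_features[a][i] for a in atoms) for i in range(width)]
    return adjacency, pooled


# Same toy chain O-C-C-N; features are one-hot flags [C, N, O].
features = {"O1": [0, 0, 1], "C1": [1, 0, 0], "C2": [1, 0, 0], "N1": [0, 1, 0]}
bonds = [("O1", "C1"), ("C1", "C2"), ("C2", "N1")]

adjacency, pooled = to_array_and_vector(features, bonds)
```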
  • An encoder and a decoder may be trained in some embodiments to encode a compound and/or a graphical representation of a compound and to decode the non-graphical representation until the decoder is able to recreate the compound with precision and accuracy (including, where the compound is a particular conformer of a molecule, that conformer) or is otherwise able to decode a non-graphical representation in a manner that satisfies one or more criteria.
  • embodiments are not limited to operating with any particular types of compounds or types of properties. And embodiments that operate with different types of compounds may operate with different properties or types of properties, such that embodiments that operate with drugs may analyze different properties than embodiments that operate with other compounds.
  • the properties may be chemical or biochemical properties of a compound that affect how it interacts with its surroundings, such as how it interacts with other compounds.
  • Some such properties may include structural properties that indicate a content or shape of a compound, including intra-compound dimensions.
  • a structural property may include whether a certain atom or molecule is available for binding or an amount of such atom/molecule that is available for binding, such as through being a donor site or acceptor site.
  • a structural property may also include a distance between parts of a compound, such as between two atoms, two fragments, or two other elements of a compound.
  • Other properties may include functional properties. Functional properties may include those that indicate whether and how a compound performs a function.
  • Such properties may be in connection with a particular other compound or target, such as binding affinity for or binding specificity for a target. Such properties may also be in connection with tissues, such as how effectively a compound crosses or does not cross a tissue, including blood-brain barrier permeability, intestinal permeability, or permeability for other tissues or materials. Such properties may also be in connection with how well a compound survives in its environment or under different environmental conditions, such as solubility, thermostability, or other factors. Accordingly, in some embodiments, properties may relate to physicochemical features of a compound.
  • building a large-scale library of compounds may include generating representations of a large number of compounds, which may include various permutations or combinations of atoms or molecules. Some such compounds may have been previously known or previously existed or synthesized, while other compounds may be previously unknown. In some such embodiments, compounds of various numbers of atoms may be generated, or various numbers of fragments, such that the library includes compounds of different sizes. Embodiments are not limited to generating representations of any particular compounds or type of compounds in generating a library. In some cases, a library may be defined that includes all compounds meeting some criteria, such as all compounds that satisfy “Lipinski’s Rule of 5” or another rule.
  • one or more rules may define a set of compounds that may be possible or may be useful for a particular task or function, which may include rules relating to which element(s) may be included in the compounds and/or which molecules (e.g., fragments) may be included in the compounds, how many atoms or molecules may be included, which types of bonds may be included or may be included between particular pairs of atoms and/or molecules, valence rules that may apply, how large the compound may be, how stable the compound must be, how soluble it may be, how it can be synthesized or manufactured, how it can be transported, or other rules that define structural and/or functional properties of a compound that may affect how it performs a particular task.
  • the rules may be associated with values or ranges of values for the rules that may be acceptable for a given task or function, which may include a given environment in which a task or function is to be performed.
  • a compound generation facility may iterate over these rules and permutations of various values for the rules to identify different compounds that may result from different permutations of the values for the rules.
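  • Iterating over permutations of rule values can be sketched with a Cartesian product. The rule names and allowed values below are hypothetical, and the sketch only enumerates combinations of rule settings. Turning each setting into actual structures would require a separate enumeration technique that is not shown.

```python
from itertools import product

# Hypothetical rule set: each rule maps to the values acceptable for a task.
rules = {
    "heavy_atoms": [10, 20, 30],
    "ring_count": [0, 1, 2],
    "allowed_elements": [("C", "H", "O"), ("C", "H", "N", "O")],
}


def enumerate_specs(rules):
    """Yield one dict per combination of rule values."""
    names = list(rules)
    for values in product(*(rules[name] for name in names)):
        yield dict(zip(names, values))


specs = list(enumerate_specs(rules))
```

  • With three, three, and two allowed values, this yields 18 rule combinations. The count grows multiplicatively with every added rule, which is one reason the resulting libraries become so large.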
  • known techniques for enumerating compounds may be used together with techniques described herein for representing and analyzing compounds, such as techniques described herein for determining properties of a compound.
  • the representations may be processed according to techniques described herein. For example, the compounds may be processed to determine representations in accordance with techniques described herein, and the representations may be processed to determine property information and the property information may be analyzed to determine suitability of compounds for a desired task or function.
  • a library of property information for compounds may be formed that includes, for multiple compounds, values for properties of the compounds.
  • the values may be determined from sources of property information and/or may be predicted, such as predicted using one or more models trained as described herein to output predicted values for properties and for compounds.
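  • A property library that combines sourced and predicted values might be organized as sketched below. The record layout, the compound identifiers, the descriptor, and the two lambda "models" are placeholders invented for this example; a real system would call trained property models instead.

```python
def fill_missing(library, properties, predictors):
    """Add a predicted value wherever a compound lacks a value for a property."""
    for record in library.values():
        for prop in properties:
            if prop not in record["properties"]:
                value = predictors[prop](record["descriptors"])
                record["properties"][prop] = {"value": value, "source": "predicted"}
    return library


# Hypothetical stand-ins for trained models: each maps descriptors to a value.
predictors = {
    "solubility": lambda d: round(1.0 / (1.0 + d["heavy_atoms"] / 10.0), 3),
    "bbb_permeable": lambda d: 1 if d["heavy_atoms"] <= 25 else 0,
}

library = {
    "cmpd-001": {
        "descriptors": {"heavy_atoms": 10},
        "properties": {"solubility": {"value": 0.62, "source": "measured"}},
    },
    "cmpd-002": {"descriptors": {"heavy_atoms": 30}, "properties": {}},
}

fill_missing(library, ["solubility", "bbb_permeable"], predictors)
```

  • Recording a source alongside each value keeps sourced entries from being overwritten and lets later analysis distinguish them from predictions.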
  • information from the library may be provided to one or more computing systems for use in determining and/or analyzing properties of compounds.
  • Such computing systems may, in some embodiments, be or include quantum computing systems, though other embodiments may not use quantum computing.
  • Quantum computers operate in a wholly different manner from conventional computer hardware and are not natively able to perform the same or even similar processing as conventional computer hardware. More particularly, quantum computers are not able to natively perform a computational analysis of properties of compounds even in the manner in which conventional computing hardware performs that analysis.
  • the determining and/or analyzing may include performing operations with one or more quantum computing systems, such as one or more quantum computers that may have been configured or otherwise arranged to perform computations using quantum annealing with one or more quantum annealers.
  • a collection of one or more compound properties of interest may be identified, such as based on analysis of other compounds and/or based on input from one or more users and/or compound property information provided from a library of property information.
  • the quantum annealer(s) may analyze a set of compounds in connection with properties of interest to identify a subset of the compounds that meet one or more criteria, which may be identified as the best performing compounds for a desired function or meet other criteria. However, in some embodiments, these techniques may be adapted for use by other computing hardware.
  • computing hardware including the quantum annealer(s) may identify the subset by determining the compound(s) from among the set that satisfy one or more criteria regarding statistical values resulting from evaluation of a function, such as identifying compound(s) from among the set that correspond to a maximization, minimization, or other optimization of a function or other statistical operation with respect to a function with which the quantum annealer(s) is/are configured.
  • the quantum annealer(s) may be configured with one or more weights or other values that affect operations of the annealer(s) and thereby affect evaluation of the function.
  • the weights may, in some cases, indicate relationships between variables to be analyzed by the annealer(s), such as relationships between variables relating to one or more of the properties of interest.
  • the quantum annealer(s) may receive as input values corresponding to each compound to be analyzed with respect to the function and for each property of interest to be analyzed for the compound.
  • the values may be a set of discrete values.
  • Such values may, in some embodiments, be binary values, such that the quantum annealer(s) may receive the values as a matrix of binary values where each value in the matrix indicates whether a compound has a particular property or whether that property for the compound satisfies one or more criteria for the properties or for that property (e.g., how a value for a property compares to a threshold).
  • the values can be a set of multiple continuous values.
  • Such values may, in some cases, be retrieved from a data store or library of property information for compounds, such as in some embodiments one including property values generated according to models generated and/or trained using techniques described herein.
  • the quantum annealer(s) may analyze binary values corresponding to compounds and properties of interest to identify, from among the compounds, one or more compounds that satisfy one or more criteria and so may have a desirable combination of properties of interest. Such compounds may then be synthesized and tested, or tested in silico using other techniques, to further identify a smaller set of compounds that may be candidates for use in a particular context, for further experimentation, or other purposes. Using such a process, one or more compounds may be identified that may advantageously perform a function.
  • a quantum computing-based analysis may serve as an initial filter on a large set of compounds to identify a candidate set of compounds that may perform a desired function well, after which a second lab-based or in silico analysis can further filter and refine the candidate set.
  • a quantum annealer may receive as input a matrix of values, where each row in the matrix corresponds to a drug candidate and the value for that row indicates a value for a property of that drug candidate.
  • the values can be discrete or continuous values.
  • the discrete values can be binary values (e.g., 0 or 1).
  • the continuous values can be any number from 0 to 1, such as 0.8.
  • the quantum annealer may analyze the matrix to identify a ranking of predicted performance of the drug candidates with respect to the properties and identify a number N of the drug candidates that have an overall best performance with respect to the properties, based on evaluation of a function with which the quantum annealer is configured for determining an optimal or otherwise desirable combination of properties. With such a process, a best, top five, top ten, top one hundred, or other top N drug candidates may be identified by the quantum annealer. The identified drug candidates may then be analyzed using other techniques to determine or confirm properties in the list or determine or confirm the performance of drug candidates in the subset. Such other techniques may include other computational techniques, such as using machine learning or other artificial intelligence techniques, or laboratory work that involves synthesizing and testing the drug candidates.
  • a library of compounds may be identified and analyzed to determine a subset of the compounds that meet one or more criteria, such as that they are predicted to have a desirable combination of properties.
  • the library of compounds that are analyzed may be all compounds that satisfy one or more criteria.
  • the criterion may be that molecules satisfy “Lipinski’s Rule of 5,” which defines a space of druglike molecules that have pharmacokinetic properties within the human body that make them more likely candidates for drugs than molecules that do not meet the rule.
  • There are 10^60 such compounds in that library, a number of molecules that cannot be practically evaluated using conventional techniques.
  • As another example, the set of all molecules up to 30 atoms in size contains 10^24 molecules.
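  • For reference, the Rule of 5 mentioned above is commonly stated as four limits: molecular weight of at most 500 daltons, a log P of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. A minimal membership check might look like the following. The descriptor names are assumptions, the descriptor values are taken as precomputed inputs, the aspirin values are approximate, and the second compound is entirely made up. The `max_violations` parameter is included because the rule is often applied while tolerating one violation.

```python
# Commonly cited Rule of 5 upper limits (descriptor names are hypothetical).
LIMITS = {"mol_weight": 500.0, "logp": 5.0, "h_donors": 5, "h_acceptors": 10}


def rule_of_five_violations(desc):
    """Return the names of the limits that the descriptors exceed."""
    return [name for name, limit in LIMITS.items() if desc[name] > limit]


def passes_rule_of_five(desc, max_violations=0):
    """True if the compound exceeds at most max_violations limits."""
    return len(rule_of_five_violations(desc)) <= max_violations


# Approximate literature descriptors for aspirin.
aspirin = {"mol_weight": 180.16, "logp": 1.2, "h_donors": 1, "h_acceptors": 4}
# Invented values for a large, lipophilic compound (illustration only).
bulky = {"mol_weight": 720.0, "logp": 6.3, "h_donors": 3, "h_acceptors": 9}

aspirin_ok = passes_rule_of_five(aspirin)
bulky_violations = rule_of_five_violations(bulky)
```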
  • these or other libraries of molecules may be evaluated in a practical timeline, such as within a matter of days or less than two weeks. Analyzing such a large library of compounds in a timeline that is a matter of days or otherwise practical for research and development may allow for a more reliable and comprehensive identification of well-performing compounds, such as a determination of a global “best” performing compound with respect to properties of interest or otherwise a determination of a compound that performs well or best across a large library of compounds.
  • techniques for analyzing compound properties may start with a user specifying properties of interest.
  • properties of interest may additionally or alternatively be determined by a system through analysis of input compounds.
  • the input compounds may be ones that are identified by a user as performing a function or performing a function in a manner that satisfies one or more criteria, such as performing the function with a desired effectiveness.
  • a process may include determining pharmacophores for the input compounds. These pharmacophores may be used to determine properties that are present in the compounds and may be related to performing the function or performing the function in the manner that satisfies the criteria.
  • Information from the pharmacophores, or other input from a user or information regarding compounds may be used to determine rules regarding properties for compounds that, when present in a compound, may lead to the compound performing a function or performing a function in a manner that satisfies one or more criteria.
  • rules related to the properties of interest and/or to values for those properties may be determined.
  • the rules enable a description of those properties with respect to a binary value, such as whether the property is present or not in a compound or whether a criterion with respect to the property (e.g., a value above or below a threshold) is satisfied for the compound.
  • a value may be determined with respect to each property and for each compound. These values may be arranged in a matrix of values, where each row represents a compound, and each column represents a property.
  • the rules can enable a description of those properties with respect to a set of discrete values or continuous values.
  • the matrix can include binary values (e.g., 0 or 1).
  • the matrix can include continuous values (e.g., 0 to 1).
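  • The compound-by-property matrix described above can be sketched as follows, with each rule expressed as a threshold test that yields 0 or 1. The property names, thresholds, and property values are invented for illustration.

```python
import operator

# Hypothetical rules: (property name, comparison, threshold).
criteria = [
    ("bbb_permeability", operator.ge, 0.6),
    ("solubility", operator.ge, 0.5),
    ("toxicity", operator.le, 0.2),
]

# Made-up continuous property values, one dict per compound.
compounds = [
    {"bbb_permeability": 0.8, "solubility": 0.7, "toxicity": 0.1},
    {"bbb_permeability": 0.4, "solubility": 0.9, "toxicity": 0.1},
    {"bbb_permeability": 0.7, "solubility": 0.3, "toxicity": 0.5},
]


def to_binary_matrix(compounds, criteria):
    """Rows are compounds, columns are properties; 1 means the rule is met."""
    return [
        [1 if compare(values[prop], threshold) else 0
         for prop, compare, threshold in criteria]
        for values in compounds
    ]


matrix = to_binary_matrix(compounds, criteria)
```

  • Keeping the continuous values and skipping the threshold step would give the continuous-valued variant of the same matrix.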
  • the quantum computing hardware may be configured with a function that identifies relationships between variables, where the variables relate to compound properties and relationships between them, such as relative priorities of different properties in a desirable or well-performing compound.
  • the variables may be set such that when the quantum computing hardware identifies, using input binary values relating to properties for compounds, a compound that relates to a maximum, minimum, optimum, or other statistical value for the function with which the quantum computing hardware is configured, that compound may be the best compound with respect to the properties of interest or otherwise satisfy one or more criteria with respect to those properties of interest.
  • the quantum computing hardware may be configured to perform quantum annealing.
  • the quantum annealing may be in the form of a QUBO in some such embodiments.
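  • To make the QUBO (quadratic unconstrained binary optimization) form concrete, the sketch below builds a small QUBO that selects exactly N compounds with the highest weighted property score, then minimizes it by exhaustive search. The objective, the penalty term, the weights, and the input matrix are assumptions chosen for illustration, not the function used in the embodiments. The brute-force search merely stands in for an annealer and is feasible only for a handful of variables.

```python
from itertools import product


def build_qubo(scores, n_select, penalty):
    """Upper-triangular Q for: minimize -sum(s_i x_i) + penalty*(sum(x_i) - N)^2.

    Expanding the square with x_i^2 = x_i (and dropping the constant) gives
    diagonal terms -s_i + penalty*(1 - 2N) and off-diagonal terms 2*penalty.
    """
    n = len(scores)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -scores[i] + penalty * (1 - 2 * n_select)
        for j in range(i + 1, n):
            Q[i][j] = 2 * penalty
    return Q


def energy(Q, x):
    """Evaluate x^T Q x for a binary vector x and upper-triangular Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def brute_force_minimum(Q):
    """Classical stand-in for an annealer: try every binary assignment."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: energy(Q, x))


# Made-up binary compound-by-property matrix and relative property priorities.
matrix = [[1, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
weights = [3, 2, 1]
scores = [sum(w * b for w, b in zip(weights, row)) for row in matrix]

Q = build_qubo(scores, n_select=2, penalty=10)
best = brute_force_minimum(Q)
```

  • The penalty is set above the largest single score so that selecting more or fewer than N compounds is never worthwhile. Here the minimum-energy assignment picks the first and fourth compounds, whose scores (6 and 5) are the two highest.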
  • While techniques leveraging quantum computers are described herein in connection with some embodiments, it should be appreciated that other embodiments may not include quantum computers. Such other embodiments that do not include quantum computers include embodiments that do not analyze properties of compounds or identify candidate sets of compounds, and embodiments that analyze properties of compounds and identify candidate sets but do not use quantum computing systems.
  • FIG. 1A is a block diagram of an example system 100 for determining and/or analyzing properties of compounds, in accordance with some embodiments of the technology described herein.
  • system 100 includes a network 105, a computing device 110 including a client interface 112 for interfacing with a client 115, a computing device 120 including a compound analysis facility 126, and a quantum annealer 130.
  • It should be appreciated that system 100 is illustrative and that a system may have one or more other components of any suitable type in addition to or instead of the components illustrated in FIG. 1A. For example, there may be additional remote systems (e.g., two or more) present within a system.
  • the system 100 may include classical computing hardware that is configured in any architecture.
  • the classical computing hardware may be in addition to or instead of the quantum annealer 130.
  • Such other hardware may include one or more central processing units (CPUs), graphics processing units (GPUs), and/or other hardware accelerators, such as a distributed array of CPUs, GPUs, and/or other hardware accelerators that are configured to interoperate and execute portions of a task in parallel.
  • the network 105 may be or include one or more local and/or wide-area, wired and/or wireless networks, including a local-area or wide-area enterprise network and/or the Internet. Accordingly, the network 105 may be, for example, a hard-wired network (e.g., a local area network within a biopharma research office), a wireless network (e.g., connected over Wi-Fi and/or cellular networks), a cloud-based computing network, or any combination thereof.
  • the computing device 110 and the computing device 120 may be located within the same building or building complex and connected directly to each other or connected to each other via the network 105, while the quantum annealer 130 may be located in a remote building and connected to the computing device 110 and the computing device 120 through the network 105.
  • the computing device 110 and the computing device 120 are integrated as one device.
  • the computing device 110 may be any suitable one or more electronic devices configured to send instructions and/or information to the computing device 120, to receive information from the computing device 120, and/or to process obtained data.
  • computing device 110 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device.
  • the computing device 110 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to send instructions and/or information to the computing device 120, to receive information from the computing device 120, and/or to process obtained data.
  • the computing device 110 can include the client interface 112 for interfacing with a client 115.
  • the client interface 112 includes graphical user interfaces.
  • the client interface 112 includes executable instructions.
  • the client 115 can interact with the client interface 112 to control or configure the computing device 110, the computing device 120, the quantum annealer 130, and/or classical computing hardware.
  • the client 115 can use the client interface 112 to view data generated by the computing device 120 or the quantum annealer 130.
  • the computing device 120 may be any suitable one or more electronic devices configured to send instructions and/or information to the computing device 110 and/or the quantum annealer 130, to receive information from the computing device 110 and/or the quantum annealer 130, and/or to process obtained data.
  • computing device 120 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device.
  • the computing device 120 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to send instructions and/or information to the computing device 110 and/or the quantum annealer 130, to receive information from the computing device 110 and/or the quantum annealer 130, and/or to process obtained data.
  • the computing device 120 may communicate with classical computing hardware that is configured in any architecture.
  • the computing device 120 can include a compound representation facility 122 for creating and/or managing representations of compounds.
  • the compound representation facility 122 can encode compounds into compound representations to create a library of compound representations.
  • the compound representation facility 122 may also use representations to determine property information encoded in the representation, and may in some cases be trained to identify the properties of a compound in the library of compounds by decoding the compound’s representation.
  • the compound representation facility 122 can decode the compound representation of a compound for output of its properties.
  • the compound representation facility 122 may also in some embodiments query the library of compounds (e.g., in response to a request from a user or other source) to identify a compound representation of a compound having certain properties of interest and may output the properties of that compound, which may in some cases include information sufficient or helpful to synthesize the compound.
  • the compound representation facility 122 may maintain the library of compounds for analysis and querying by the compound model facility 124.
  • the computing device 120 can include a compound model facility 124 for predicting the properties of the compounds in the library of compounds maintained by the compound representation facility 122.
  • the compound model facility 124 may include one or more models configured to predict values of one or more properties of an input compound. Such a model of the compound model facility 124 may in some cases be trained to predict values for properties of compounds.
  • the compound model facility 124 may provide compounds and predicted property values to the compound analysis facility 126.
  • a model of the compound model facility 124 may be trained to predict values for new properties of compounds based on an identification of a new property of interest and training data for the new properties of interest, and in some cases an existing model may be edited to generate a new model to predict values for compounds for a new property.
  • the computing device 120 can include a compound analysis facility 126 for managing analysis of compounds in accordance with techniques described herein.
  • the compound analysis facility 126 includes executable instructions that can be executed by the computing device 120.
  • the compound analysis facility 126 may receive input information from the interface 112 or the compound model facility 124, which may include data identifying one or more properties of interest, one or more known compounds that perform a function, and/or one or more criteria identifying a set of compounds to be analyzed.
  • the compounds are included in antibodies, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules.
  • the compounds can be used in battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
  • the facility 126 may identify criteria for compound properties of interest, which may be implemented as a set of rules to be used in analysis by the facility 126 and/or by the quantum annealer 130.
  • the facility 126 may determine the rules in part by determining pharmacophores for input known compounds and using those pharmacophores to derive the rules.
  • the facility 126 may also identify compounds to be analyzed by the quantum annealer 130, such as based on user input defining a landscape of compounds to be analyzed through identifying properties of the compounds or a definition of the compounds of interest (e.g., all compounds having up to 30 atoms).
  • the facility 126 may also receive user input that specifies a desired resolution for compounds to be analyzed. The resolution may relate to a number of compounds that are analyzed by the annealer from among an entirety of the compounds that may satisfy a definition or characterization of compounds of interest, such as all of the compounds, half of the compounds, one quarter of the compounds, or other suitable portion of the compounds.
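The resolution setting described above can be pictured as choosing a fraction of the candidate landscape to pass on for analysis. The following is a minimal sketch, assuming uniform random sampling; the source does not say how the portion is chosen, and the function name `select_at_resolution` and the integer compound identifiers are hypothetical.

```python
import random


def select_at_resolution(compound_ids, resolution, seed=0):
    """Return a uniformly sampled portion of compound_ids.

    resolution is the fraction of the landscape to analyze
    (1.0 = all compounds, 0.5 = half, 0.25 = one quarter).
    Uniform sampling is an assumption made for this sketch.
    """
    if not 0.0 < resolution <= 1.0:
        raise ValueError("resolution must be in (0, 1]")
    ids = list(compound_ids)
    if not ids:
        return []
    count = max(1, round(len(ids) * resolution))
    rng = random.Random(seed)  # seeded so a run can be repeated
    return sorted(rng.sample(ids, count))


landscape = list(range(1000))  # hypothetical compound identifiers
subset = select_at_resolution(landscape, 0.25)
print(len(subset))
```

A seeded generator is used so that the same portion can be re-selected when an analysis is repeated.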
  • the compound analysis facility 126 may also identify values for the rules for the properties of interest for the compounds, to be analyzed by the quantum annealer 130, including by retrieving information from one or more data stores of compound property values.
  • the compound analysis facility 126 may, in some embodiments, send instructions and/or configuration information to the quantum annealer 130 (e.g., via network 105) to control or configure the quantum annealer 130.
  • Such instructions and/or configuration information may include specifying a function to be used by the annealer 130 in analysis, such as by setting values for one or more variables of the analysis in accordance with some techniques described herein.
  • the compound analysis facility 126 can also transmit data to the quantum annealer 130 (e.g., via network 105) and trigger the quantum annealer 130 to analyze the data.
  • the data may, in some cases, be a matrix of values, such as values indicating how compounds score with respect to rules for properties of compounds, in accordance with techniques described herein.
  • the values can be discrete or continuous values.
  • the discrete values can be binary values (e.g., 0 or 1).
  • the continuous values can be any number from 0 to 1, such as 0.8.
  • the compound analysis facility 126 may also receive data analyzed by the quantum annealer 130.
  • the compound analysis facility 126 may be adapted to cause classical computing hardware to analyze the data.
  • the quantum annealer 130 can be a quantum computer (or more than one quantum computer) configured to perform quantum annealing. While the embodiment of FIG. 1A implements a quantum computer as a quantum annealer, it should be appreciated that some embodiments may operate with one or more quantum computers configured to perform a different analysis. In some embodiments, quantum annealer 130 may include additional computer hardware to interact with other computing devices (e.g., devices 110, 120) and to execute operations to configure the quantum computing hardware of the annealer 130.
  • the quantum annealer 130 may be configured in some embodiments to identify a solution to an objective function with which it has been configured by the facility 126, based on input provided to it (e.g., a set of candidate solutions) by facility 126.
  • the solution to the objective function may be a minimum or maximum value for the function from among the input data, or other statistical value.
  • the quantum annealer 130 may be configured to perform a QUBO analysis, may receive a binary table of values, and may identify from among the values of the binary table a row that provides a “global” (with respect to the input candidate solutions) minimum solution to the QUBO function.
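The row selection described above can be illustrated classically. The following is a minimal sketch, not the annealer's actual procedure: it evaluates a QUBO energy x^T Q x for each row of a small binary table and returns the row with the lowest energy. The matrix `Q`, the table contents, and the function names `qubo_energy` and `best_row` are hypothetical and chosen only for illustration.

```python
def qubo_energy(x, Q):
    """Evaluate x^T Q x for a binary vector x and a square matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def best_row(binary_table, Q):
    """Return (row index, energy) of the row with the lowest QUBO energy."""
    energies = [qubo_energy(row, Q) for row in binary_table]
    idx = min(range(len(energies)), key=energies.__getitem__)
    return idx, energies[idx]


# Hypothetical 3-rule example: negative diagonal terms reward satisfied
# rules; the off-diagonal term penalizes satisfying rules 0 and 1 together.
Q = [[-1.0, 2.0, 0.0],
     [0.0, -1.0, 0.0],
     [0.0, 0.0, -2.0]]
table = [[1, 1, 0],   # candidate compound A
         [1, 0, 1],   # candidate compound B
         [0, 0, 1]]   # candidate compound C
print(best_row(table, Q))
```

The minimum here is "global" only with respect to the rows supplied, which matches the qualification in the text.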
  • the quantum annealer 130 can be implemented as a D-Wave quantum computer that uses superconducting flux qubits.
  • the superconducting flux qubit may perform analysis using quantum mechanical spin.
  • a qubit loop may have current applied to it, and the circulating current in the qubit loop can give rise to a flux inside the loop.
  • that flux can encode two distinct quantum spin states that can exist in a superposition.
  • the quantum annealer 130 can include two superconducting loops for each qubit of the annealer 130, and the annealer 130 may have multiple qubits. In such a case, each loop can be subject to an external flux bias $\Phi_{1x}$ or $\Phi_{2x}$. When cooled to near absolute zero, the two superconducting loops can behave as a superposed state.
  • the ioi can be the Pauli spin matrix with eigenvectors ⁇
  • the D-Wave implementation can allow for both $J_{ij}$ and $h_i$ to be set independently when defining a QUBO, in some embodiments as described herein.
  • operation of the quantum computer can be influenced. This enables customizing of the computation to be done by the quantum computer, including per input from a user via interface 112 and/or processing by the compound analysis facility 126, which can be provided by the compound analysis facility 126 to the annealer 130 as configuration input.
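For reference, the objective functions commonly used with quantum annealers take the standard forms below. These are the textbook Ising and QUBO expressions, given here on the assumption that they match the notation intended by the text; $h_i$ denotes a per-qubit bias, $J_{ij}$ a coupling between qubits $i$ and $j$, and $Q$ the QUBO coefficient matrix.

```latex
E_{\mathrm{Ising}}(s) = \sum_{i} h_i s_i + \sum_{i<j} J_{ij}\, s_i s_j,
\qquad s_i \in \{-1, +1\}

E_{\mathrm{QUBO}}(x) = \sum_{i} Q_{ii}\, x_i + \sum_{i<j} Q_{ij}\, x_i x_j,
\qquad x_i \in \{0, 1\}

s_i = 2 x_i - 1
```

The last line is the change of variables that maps one form onto the other, up to a constant offset in the energy.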
  • the annealer 130 may have multiple qubits, each of which may include the loop shown in FIG. 1B.
  • the accumulation of many qubits may enable the quantum annealer 130 to perform computations (e.g., identify a minimum value of a function) for a large expanse of variable space.
  • the computing device 120 can provide a quantum mechanical super-positioned state of all possible solutions with equal weighting.
  • the quantum annealer 130 can receive the objective function from the computing device 110 and/or the computing device 120, which can define the objective function as a QUBO or Ising model.
  • the quantum annealer 130 can minimize or maximize an objective function, or otherwise calculate a statistical value as a solution to the objective function that meets one or more criteria.
  • a quantum waveform representing the super-positioned state of the qubits can collapse per the influence of the programmed weighting (bias) applied to the magnetic fields associated with the super-cooled and super-positioned currents in the chip.
  • the annealing process can therefore produce a sampled list of energetic states associated with each possible solution.
  • the minimal energy states can represent optimal solutions or solutions otherwise meeting one or more criteria.
  • a single result may be output from the annealer 130 and provided to the facility 126 in response to the configuring and the triggering of the computation by the annealer 130.
  • multiple results may be provided, such as a top five, top ten, top one hundred, or otherwise top N results that are the compounds that are predicted to perform best.
  • the annealer 130 may output all results that meet a criterion, such as by outputting all compounds that appear to have a combination of properties that satisfies one or more criteria, including being associated with a result of the objective function above a threshold.
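The output modes described above (a single result, the top N results, or all results above a threshold) can be sketched as simple post-processing of scored candidates. This is a minimal illustration, assuming each candidate carries a score for which higher is better; for a minimized objective, the score could be the negated energy. The compound names and scores are hypothetical.

```python
def top_n(scored, n):
    """Return the n (compound, score) pairs with the highest scores."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


def above_threshold(scored, threshold):
    """Return every (compound, score) pair whose score exceeds threshold."""
    return [pair for pair in scored if pair[1] > threshold]


# Hypothetical scored candidates
scored = [("cmpd-1", 0.91), ("cmpd-2", 0.35), ("cmpd-3", 0.78), ("cmpd-4", 0.60)]
print(top_n(scored, 1))            # single best result
print(top_n(scored, 2))            # top N results with N = 2
print(above_threshold(scored, 0.6))
```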
  • FIG. 2A illustrates a method 200 for determining and/or analyzing properties of compounds.
  • FIGs. 2B-10 illustrate processes that may be used in some embodiments to carry out some of the acts described in connection with FIG. 2A.
  • the method 200 may be performed by the compound representation facility 122, the compound model facility 124, and the compound analysis facility 126 executed by the computing device 120.
  • the computing device 120 can receive data for processing by the quantum annealer 130.
  • the quantum annealer 130 might not be able to execute standard executable instructions (e.g., standard computer code), so the computing device 120 can translate the input into a format that is compatible with the quantum annealer 130, configure the quantum annealer 130 with the manner in which the data is to be processed, and provide the data to the quantum annealer 130 for processing.
  • the computing device 120 may cause classical computing hardware to process the data.
  • the compound representation facility 122 can generate compound representations of compounds (sometimes referred to herein interchangeably as “compound representations” or “digital representations”). As shown in FIG. 2B, for a compound, the compound representation facility 122 may convert information regarding the compound into a graph. In some embodiments, the facility 122 may convert such a graph representation into a vector. In some embodiments, the compound representation facility 122 can generate representations of compounds using information regarding properties of the compounds. For example, the compound representation facility 122 can generate representations of 3D protein target structures, existing ligands, or property datasets.
  • the compound representation facility 122 can generate, for each respective compound of multiple compounds, a respective compound representation that represents multiple atoms of the respective compound and at least one property of the respective compound.
  • the compound representation facility 122 can receive the properties from the client interface 112. For example, the compound representation facility 122 can receive properties of new compounds to represent the new compounds and predict their properties.
  • the compound model facility 124 can train one or more property determination models to identify one or more properties of compounds.
  • the compound representation facility 122 may train a neural network 220 for predicting properties (though it should be appreciated that models may be implemented in ways other than neural networks).
  • the compound model facility 124 can train a model to predict chemical features that are related to activity and performance of the compounds.
  • the compound model facility 124 can train a property determination model based on the respective compound representation of each respective compound of the compounds.
  • the compound model facility 124 can generate predicted values for properties for a set of compounds. As shown in FIG.
  • the compound representation facility 122 can maintain graph embeddings (e.g., compound representations) 222 in an embedding space (e.g., solution library) 224.
  • the set of compounds can be compounds that have the properties of interest.
  • the compound model facility 124 can generate a library of compounds accessible to the property determination model, the library of compounds comprising the respective compound representation of each respective compound of the compounds.
  • the compound model facility 124 may generate a solution library of compounds (e.g., RNA or peptides) and their properties for the quantum annealer 130 or other computing hardware to analyze.
  • the compound model facility 124 can generate, using models, or retrieve from a data store, properties for compounds. As shown in FIG. 2B, the compound model facility 124 can generate property predictions 226 for compounds. The compound model facility 124 can identify a numeric value for each property of every compound. In some embodiments, the compound model facility 124 can determine, for one or more (or each) compounds in the library of compounds, a value for the compound with respect to each property of a set of properties, to generate a set of property values for each compound of the library of compounds. For example, for a property that is blood-brain barrier penetrance, a numeric value can indicate how blood-brain barrier penetrant a compound is. In some embodiments, the properties can be calculable, and the calculations can be performed at scale using techniques described herein.
  • Some embodiments that create a library of information regarding compounds may include techniques for analysis of compounds, such as to identify a compound that may perform a desired function or task or meet some performance criterion for the same. Other embodiments, however, may not include such an analysis and may end once the library is created or supplemented with property information.
  • the process includes analysis functionality.
  • the compound model facility 124 can configure criteria.
  • the criteria can determine the likelihood of success for a compound to perform well in a given application.
  • the criteria can be based on the compounds and properties of interest of those compounds.
  • the compound model facility 124 can receive criteria defining a set of compounds that have properties of interest.
  • the compound model facility 124 can extract criteria that define blood-brain barrier penetrant drugs for the quantum annealer 130 to optimize.
  • the compound model facility 124 can extract criteria for other computing hardware to optimize.
  • the criteria can be inputted into an objective function for comparing against a dataset of candidate solutions by the quantum annealer 130, or in some embodiments, by other computing hardware.
  • the compound model facility 124 can provide the predicted properties 230 from a large chemical property database 232 as candidate solutions 234 to a quantum computer 236 (e.g., quantum annealer 130).
  • the space of candidate solutions the quantum annealer 130 can check can be on the order of 10^20 and beyond.
  • the compound model facility 124 can receive criteria from the computing device 110, which can receive the criteria via the client interface 112 from the client 115.
  • a drug discovery scientist or medicinal chemist can define criteria for an optimal compound in a research study.
  • the compound model facility 124 can be provided with multi-objective optimization criteria 238.
  • the scientist or chemist can define a multi-objective optimization criterion for what the compound should do.
  • the medicinal chemist can derive criteria that define how existing or theoretical molecules interact with the protein.
  • the compound analysis facility 126 can operate the quantum computer 236 (e.g., quantum annealer 130) to trigger execution of criteria against the set of compounds.
  • the compound analysis facility 126 can trigger at least one quantum computer to analyze the set of property values for each compound of the library in connection with an objective function with which the at least one quantum computer is configured, to determine a compound for which corresponding property values generate a minimum value for the objective function.
  • the compound analysis facility 126 can execute the quantum annealer 130 with properties of interest of the compounds to analyze any property of any compound.
  • the compound analysis facility 126 can cause classical computing hardware to analyze the property.
  • the quantum annealer 130 can use the criteria defining the property information generated from the fast property determination of the compound model facility 124.
  • the pre-calculated properties output by the compound model facility 124 can be the candidate solutions that the quantum annealer 130 can cause the quantum computer to analyze.
  • the compound analysis facility 126 can use the quantum annealer 130 to search the data store of all drug-like compounds maintained by the compound representation facility 122 for compounds satisfying the criteria. For example, the search can result in the compiling of properties for all molecules that meet a criterion (e.g., all drug-like).
  • the compound analysis facility 126 can cause classical computing hardware to search for compounds.
  • the compound analysis facility 126 can convert the criteria to be processed by the quantum annealer 130. In some embodiments, the compound analysis facility 126 can use the pre-calculated properties as input to create the QUBO when searching chemical space. As shown in FIG. 2C, the compound analysis facility 126 can convert the multi-objective optimization criteria into a QUBO quantum formulation 240. For example, the compound analysis facility 126 can convert or reformulate the criteria in the values of a QUBO for the quantum annealer 130 to use the criteria. The compound analysis facility 126 can create the QUBO to include the predicted properties of the compounds predicted by the compound model facility 124. However, it should be appreciated that in some embodiments, the compound analysis facility 126 can generate an input for classical computing hardware to analyze for compounds.
  • the reformulated criteria as QUBO can be used by the quantum annealer 130 for optimization (e.g., minimization) on the quantum computer.
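One way to picture the reformulation of multi-objective criteria as a QUBO is sketched below. This is an illustrative encoding chosen for the example, not the formulation used in the source: weighted property values are collapsed into one cost per compound, a one-hot penalty term forces exactly one compound to be selected, and a brute-force search stands in for the annealer. The weights, predicted values, penalty, and all function names are hypothetical.

```python
from itertools import product


def weighted_cost(property_values, weights):
    """Scalarize multi-objective criteria; lower cost means a better compound."""
    return -sum(w * v for w, v in zip(weights, property_values))


def build_selection_qubo(costs, penalty):
    """Upper-triangular QUBO whose minimum selects the single lowest-cost item.

    Encodes sum(cost_i * x_i) + penalty * (sum(x) - 1)**2 with the constant
    term dropped. penalty must exceed the largest absolute cost.
    """
    n = len(costs)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = costs[i] - penalty
        for j in range(i + 1, n):
            Q[i][j] = 2.0 * penalty
    return Q


def brute_force_minimum(Q):
    """Classical stand-in for the annealer: try every binary assignment."""
    n = len(Q)

    def energy(x):
        return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

    return min(product((0, 1), repeat=n), key=energy)


# Hypothetical predicted values in [0, 1]: (BBB penetrance, target binding)
predicted = [(0.8, 0.3), (0.6, 0.9), (0.2, 0.4)]
weights = (0.5, 0.5)
costs = [weighted_cost(p, weights) for p in predicted]
Q = build_selection_qubo(costs, penalty=2.0)
print(brute_force_minimum(Q))  # one-hot vector marking the selected compound
```

Brute force is only feasible for a handful of variables; it is used here solely so the sketch runs without quantum hardware.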
  • the compound analysis facility 126 can cause the quantum annealer 130 to run optimization against the set of compounds.
  • the quantum annealer 130 can identify optimal compounds from the property predictions pre-calculated by the large-scale property models of the compound model facility 124.
  • the classical computing hardware can identify compounds.
  • the compound analysis facility 126 can receive a subset of the compounds.
  • the subset of the compounds can be molecules or compounds that best optimize and satisfy the defined criteria.
  • the quantum computer can provide ranked optimal results 242.
  • the subset of the compounds can be ranked to identify the most optimal compound.
  • the compound analysis facility 126 can receive a set of the top N most optimal compounds.
  • the subset of the compounds can be ranked based on how blood brain barrier penetrant they are or how they bind to a target.
  • the compound analysis facility 126 can receive from the at least one quantum computer, an identification of the compound.
  • the compound analysis facility 126 can use the quantum annealer 130 to apply quantum computing to identify an optimized compound having the properties of interest. It should also be appreciated that in some embodiments, the compound analysis facility 126 may be configured to receive the compounds from the classical computing hardware.
  • the compound analysis facility 126 can further analyze the subset of the compounds, and potentially refine the subset.
  • the compound analysis facility 126 can output the identification of the compound.
  • the client 115, via the client interface 112, can cause the compound analysis facility 126 to fine-tune the top compounds using slower computation or experimental methods.
  • FIG. 3A is a flowchart of an illustrative method 300 of encoding properties of compounds.
  • the method 300 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments.
  • the method 300 can include converting a compound into a graph representation, and then encoding the graph representation into a compound representation (e.g., digital representation).
  • performing step 202 of the method 200 includes performing the method 300.
  • the process 300 of FIG. 3A may operate on suitable input regarding compounds.
  • representations may be generated based on numerical vectors (physicochemical descriptors), fingerprints of binary or integer vectors containing a hashed or numerical count representation of the constituents of a compound, SMILES strings, graph representation of the molecule's 2D structure, 3D representation of the molecule's structure and conformer, and multiple 3D representations of compound conformers.
  • the compound representation facility 122 can generate multiple nodes to represent multiple atoms of a compound for which a representation is to be generated.
  • the compound representation facility 122 can represent chemicals, pharmaceutical compounds, drugs, or biologics using a graph in which nodes of the graph are atoms.
  • Each respective node of the plurality of nodes can represent each respective atom of the plurality of atoms of the compound in some such embodiments.
  • data regarding a compound (e.g., molecule) 310 can be converted into a molecular graph 312 (e.g., with data regarding atoms 314a, 314b of the compound 310 converted to nodes 316a, 316b of the molecular graph 312).
  • the compound representation facility 122 can generate edges of the graph to represent bonds between the atoms of the compound (e.g., interatomic bonds). Each respective edge of the edges can represent a respective association between a respective pair of atoms of the compound.
  • the molecular graph can represent the structure of the molecule with atoms as nodes and bonds 318 between atoms as edges 320.
  • the compound representation facility 122 can generate the graph for representing compounds and their protein binding sites.
  • the graph may include nodes representing atoms and edges representing bonds (e.g., interatomic bonds) as well as weak bonds between ligands and proteins.
  • the facility 122 may process attention inputs, process an adjacency matrix, and/or process an atomic distance matrix, examples of which are described below.
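The graph construction described above, with atoms as nodes and bonds as edges, can be sketched with plain Python data structures. This is a minimal illustration under assumed input conventions: atoms are given as element symbols and bonds as `(i, j, order)` tuples. The function name `build_molecular_graph` and the node dictionary layout are hypothetical.

```python
def build_molecular_graph(atoms, bonds):
    """Build node records and a symmetric adjacency matrix.

    atoms: list of element symbols, one per atom.
    bonds: list of (i, j, order) tuples indexing into atoms.
    """
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for i, j, order in bonds:
        adjacency[i][j] = order
        adjacency[j][i] = order  # bonds are undirected
    nodes = [{"index": i, "element": el} for i, el in enumerate(atoms)]
    return nodes, adjacency


# Ethanol heavy atoms (hydrogens omitted): C-C-O with two single bonds
nodes, adjacency = build_molecular_graph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
for row in adjacency:
    print(row)
```

Weak ligand-protein interactions could be added as further edges with their own edge type; that extension is not shown here.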
  • the compound representation facility 122 can iteratively traverse the nodes along the edges to update the graph representation of the compound. For example, as shown in FIG. 3C, iterative passing (e.g., iterative message passing) can be used to traverse nodes along the edges to update the graph representations.
  • the compound representation facility 122 can utilize an iterative message passing process to traverse the edges to identify which nodes are near other nodes.
  • the compound representation facility 122 can identify information about each node (e.g., atom) based on information regarding surrounding nodes (e.g., to indicate interactions between atoms). In some embodiments, based on the surrounding atoms, the compound representation facility 122 can update the graph representation of the compound for analysis.
  • the compound representation facility 122 can update the graph representation in accordance with a configuration to encapsulate (e.g., summarize) the information of all of the atoms in the neighborhood and the atoms in those atoms’ neighborhood. In some embodiments, the compound representation facility 122 can use machine learning to analyze and update the graph representation.
  • the “message passing” technique that may be used across iterations may in some embodiments include training multiple aspects of a model, such as multiple neural network layers, using adjacency and/or distance matrices (examples of which are described below), which define the bonds and 3D structure of a compound.
  • FIG. 3D illustrates an example of such a process.
  • the distance matrix 330 is formed as a “3D conformer encoding” matrix that indicates values representing or derived for distances between pairs of atoms (and/or molecules) of a compound, across different conformers of the compound.
  • the values for each distance may be an average distance, median distance, standard deviation of distance, and/or other calculated distance.
  • the adjacency matrix 332 may identify pairs of atoms (and/or molecules) between which a bond exists in the compound. In some cases, as in the example of FIG. 3D, as mentioned below, one or more parts of the compound may be masked during creation of these matrices. Attention inputs (in the form of query vectors Q 334, key vectors K 336, and/or value vectors V 338) may include information on atoms and bonds, but not on 3D structure, and may include information on nodes of a graph of the compound, such as information regarding atoms of the compound. Such atom information may identify an element for the atom, valence information, or other information defining that atom of the compound.
  • the matrices and attention inputs may be input to the training process 340 and neural networks may be trained with the values.
  • This process may yield a representation 342 of the information regarding the compound (e.g., matrix representation of information regarding the compound, compound representation, etc.).
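The "3D conformer encoding" distance matrix described above can be sketched as per-pair statistics of interatomic distances taken across conformers. This is a minimal illustration assuming each conformer is a list of (x, y, z) coordinates with atoms in the same order; the function names and coordinates are hypothetical, and only the mean and standard deviation are computed.

```python
import math
from statistics import mean, pstdev


def distance_matrix(coords):
    """Pairwise Euclidean distances for one conformer."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def conformer_encoding(conformers):
    """Per atom pair, return (mean, std) of distance across conformers."""
    per_conf = [distance_matrix(c) for c in conformers]
    n = len(conformers[0])
    mean_d = [[mean([m[i][j] for m in per_conf]) for j in range(n)]
              for i in range(n)]
    std_d = [[pstdev([m[i][j] for m in per_conf]) for j in range(n)]
             for i in range(n)]
    return mean_d, std_d


# Two hypothetical conformers of a 3-atom chain: extended and bent
conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
]
mean_d, std_d = conformer_encoding(conformers)
print(mean_d[0][2], std_d[0][2])
```

Bonded pairs keep a fixed distance across conformers (zero spread), while the end-to-end distance varies, which is the kind of flexibility information such a matrix can carry.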
  • the compound representation facility 122 can utilize a weighted algorithm that iteratively updates features of nodes (e.g., atoms) with those of the surrounding atoms. For example, compound representation facility 122 can utilize the weighted algorithm to update the graph representation from neighboring nodes to get a weighted sum status of the neighbors. In some embodiments, the compound representation facility 122 can update the graph representation by using nonlinear weighting. For example, the compound representation facility 122 can assign a larger influence or weight on the representation of the atom based on the atoms that are local, close, or immediate neighbors. Meanwhile, updates from more distant atoms, such as atoms farther away from the atom that is being analyzed, can have less influence on the representation of that atom.
  • Atoms that are distant, such as not immediate neighbors, can have some influence on the atom based on weighted information acquired during subsequent rounds of message passing.
  • the compound representation facility 122 can update the graph representation based on the interatomic bond distance between the atoms.
  • the compound representation facility 122 can use the updated graph representation to generate (e.g., learn) a compound representation of the atoms in the network based on the local atom and bond structure.
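The weighted, iterative update described above can be sketched as follows. This is a simplified illustration with fixed, hand-chosen weights rather than a trained network: each atom's features are mixed with a distance-weighted average of its bonded neighbors and passed through a tanh nonlinearity. The exponential distance weighting, the `self_weight` parameter, and the example values are assumptions made for this sketch.

```python
import math


def message_passing_round(features, adjacency, distances, self_weight=0.5):
    """One update round over all atoms; returns new feature vectors."""
    updated = []
    for i, feat in enumerate(features):
        nbrs = [j for j, bonded in enumerate(adjacency[i]) if bonded]
        # Shorter bonds give larger weights (assumed weighting scheme)
        weights = [math.exp(-distances[i][j]) for j in nbrs]
        total = sum(weights)
        new_feat = []
        for k in range(len(feat)):
            if total:
                msg = sum(w * features[j][k] for w, j in zip(weights, nbrs)) / total
            else:
                msg = 0.0  # isolated atom: no incoming message
            new_feat.append(math.tanh(self_weight * feat[k] + (1 - self_weight) * msg))
        updated.append(new_feat)
    return updated


def run_message_passing(features, adjacency, distances, rounds=2):
    for _ in range(rounds):
        features = message_passing_round(features, adjacency, distances)
    return features


# Hypothetical 3-atom chain; only atom 0 starts with a nonzero feature
features = [[1.0], [0.0], [0.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
distances = [[0.0, 1.5, 2.5], [1.5, 0.0, 1.4], [2.5, 1.4, 0.0]]
print(run_message_passing(features, adjacency, distances, rounds=1))
print(run_message_passing(features, adjacency, distances, rounds=2))
```

After one round, atom 2 is still unaffected by atom 0; after two rounds it carries a small contribution, illustrating how non-neighboring atoms gain influence over subsequent rounds.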
  • the compound representation facility 122 can generate a compound representation of the compound.
  • the compound representation facility 122 can reduce or convert the graph representation of a compound into a compound representation of the compound.
  • the compound representation can comprise the nodes, the edges, and one or more properties of the compound.
  • the property can be any desired property for any compound.
  • Properties can include information associated with the structure describing properties of the structure.
  • the compound representation can include structural features as well as functional features. Structural features might be indicative of functional features if structure informs function, and the structural information is captured. This may include information associated with an atom describing interactions with other atoms.
  • the properties can identify the atoms (e.g., carbon) and atomic weights of the compound.
  • Examples of functional properties include whether the compound is blood brain barrier penetrant, bioavailable, and can bind to a particular target.
  • property information stored in a compound representation may be limited to structural property information.
  • the compound representation facility 122 can store the compound representation of the compound in an array.
  • the compound representation can be in an array of information describing values for properties of interactions.
  • the array can be a vector.
  • the compound representation facility 122 can store values (e.g., numbers) in the array or vector for that compound.
  • the compound representation facility 122 can identify an atom (e.g., alpha carbon within the molecule) of the compound and assign the value of the atom in the compound representation.
  • the values represent the compound and its atoms, bond, and properties.
  • values can indicate whether a certain compound is blood brain barrier penetrant, how well it binds a target, or how volatile it is.
  • the values can be default values for each type of atom, bond, and its properties.
  • the compound representation facility 122 can store random values in the vectors. For example, the compound representation facility 122 can generate an unlearned compound representation for each compound.
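Reducing a graph representation to a single vector, as described above, is often done with a pooling step over per-atom features. The sketch below uses mean pooling and then appends property values; both choices, along with the function names and example numbers, are assumptions for illustration rather than the method of the source.

```python
def readout(node_features):
    """Mean-pool per-atom feature vectors into one compound vector."""
    n = len(node_features)
    dim = len(node_features[0])
    return [sum(f[k] for f in node_features) / n for k in range(dim)]


def append_properties(vector, property_values):
    """Concatenate property values onto the pooled structural vector."""
    return list(vector) + list(property_values)


# Hypothetical per-atom features for a 4-atom compound
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
structure_vector = readout(node_features)
# Hypothetical property value, e.g. a blood-brain barrier penetrance score
compound_vector = append_properties(structure_vector, [0.8])
print(compound_vector)
```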
  • FIG. 4A is a flowchart of an illustrative method 400 of training a model to generate compound representations (e.g., digital representations) of compounds and identify property values of the compounds from the compound representations of the compounds to establish a library (e.g., trained library) of compounds.
  • the method 400 may be performed by the compound representation facility 122 and/or the compound model facility 124 executed by the computing device 120, in some embodiments.
  • the compound representation facility 122 can train an encoder network to encode properties of compounds and a decoder network to decode those properties of the compounds. If a library of compounds can include encoded compounds and their property values that can be accurately decoded, then the library of compounds can be searchable and useful for compound related research.
  • performing step 202 of the method 200 includes performing at least some steps (e.g., steps 402-410) of the method 400.
  • performing step 206 of the method 200 includes performing at least some steps (e.g., step 412) of the method 400.
  • the compound representation facility 122 can identify known property values of compounds in a library of compounds.
  • the known property values can include structural properties of the compounds.
  • the known property values can indicate whether a particular compound is actually blood-brain barrier penetrant.
  • the compound representation facility 122 can receive the known property values from a scientist or data store.
  • the training data for training the compound representation facility 122 can include a large collection of chemical molecules.
  • examples include databases of drug-like molecules such as ChEMBL (3.7 million molecules), ZINC (970 million molecules), or GDB-17 (166 billion molecules).
  • the compound representation facility 122 can generate a drug-like subsection of the training data by enumerating theoretical chemical graphs up to a certain number of atoms.
  • the compound representation facility 122 can filter out the molecules that are not drug-like (“un-drug”).
  • the compound representation facility 122 can identify un-drug molecules by defining a set of rules that determine the similarity of a compound to those which are known to be successful. Examples of rules include atom valency, solubility, molecular weight, number of hydrogen bond donors.
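  • The rule-based filtering described above can be sketched as follows. This is a minimal illustration only: the `Candidate` fields, the threshold constants, and the example candidates are hypothetical placeholders, not values stated in this document, and a real pipeline would compute the descriptors with a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical pre-computed descriptors for one enumerated molecule."""
    name: str
    molecular_weight: float  # daltons
    h_bond_donors: int
    log_solubility: float    # log10 of aqueous solubility, illustrative scale
    valency_ok: bool         # True if every atom respects its allowed valence


# Illustrative thresholds (assumptions, not taken from the source text).
MAX_MOLECULAR_WEIGHT = 500.0
MAX_H_BOND_DONORS = 5
MIN_LOG_SOLUBILITY = -6.0


def is_drug_like(candidate: Candidate) -> bool:
    """Apply every rule; a molecule failing any rule is treated as "un-drug"."""
    return (
        candidate.valency_ok
        and candidate.molecular_weight <= MAX_MOLECULAR_WEIGHT
        and candidate.h_bond_donors <= MAX_H_BOND_DONORS
        and candidate.log_solubility >= MIN_LOG_SOLUBILITY
    )


def filter_drug_like(candidates):
    """Keep only the candidates that pass all rules."""
    return [c for c in candidates if is_drug_like(c)]


# Made-up candidates used purely to exercise the filter.
candidates = [
    Candidate("cand_a", 320.4, 2, -3.1, True),
    Candidate("cand_b", 812.9, 7, -4.0, True),   # too heavy, too many donors
    Candidate("cand_c", 250.2, 1, -2.5, False),  # fails the valency rule
]
kept = filter_drug_like(candidates)
```

  • Expressing each rule as an independent predicate keeps the rule set easy to extend with further criteria.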
  • the compound representation facility 122 can train an encoder network to encode the compounds into compound representations from graph representations of the compound.
  • the compound representation facility 122 can train the graph neural network layers (e.g., encoder network) 246 to convert the graph representations 248 into a vector representation (e.g., compound representation) 250 that can be converted to graph embeddings 222 to be maintained in a library of compounds (e.g., searchable embedding space 224).
  • the graph representations are generated by the compound representation facility 122 as discussed in method 300.
  • the compound representation facility 122 can receive graph representations 420 of the compounds.
  • the compound representation facility 122 can include the encoder network 422 and train the encoder network to encode the graph representations 420 into respective compound representations 424.
  • the encoder network of the compound representation facility 122 can include block transformers to process the input regarding compounds.
  • the training of the encoder in step 404 may be done in some embodiments using attention information 426, an adjacency matrix 428, and/or an atomic distance matrix 430, examples of which are discussed below.
  • the intermediate graph representation 420 could be omitted, and the encoder network can be trained to create a compound representation 424 from an input chemical structure of the compound.
  • the compound representation facility 122 may receive, generate, and/or calculate an adjacency matrix 428 from input information regarding an input compound.
  • the adjacency matrix may indicate interconnections of atoms in a compound, or interconnections between molecules (e.g., fragments) in a compound.
  • the facility 122 may generate the adjacency matrix with atoms (or molecules) in respective rows and columns and assign values to indicate whether atoms in a cell (the atom for the row and the atom for a column) share a bond in the compound.
  • the compound representation facility 122 can assign a value of 1 to indicate a bond between a pair of atoms and 0 to indicate no connection between a pair of atoms.
  • the compound representation facility 122 can assign values to indicate information regarding the bond between atoms. For example, the compound representation facility 122 can assign 0 to indicate no bond, 1 to indicate a single bond, or 2 to indicate a double bond. As another example, values may indicate whether a bond is ionic, covalent, hydrogen, metallic, or van der Waals, or other information regarding a bond.
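  • As a minimal sketch of the adjacency matrix described above, the following builds a symmetric matrix whose entries hold 0 for no bond, 1 for a single bond, and 2 for a double bond, and then derives the simpler 0/1 form. The function name `build_adjacency_matrix`, the bond-list format, and the formaldehyde atom ordering are assumptions made for illustration.

```python
def build_adjacency_matrix(num_atoms, bonds):
    """Return a symmetric matrix; entry [i][j] is the bond order between atoms i and j.

    `bonds` is a list of (atom_index_a, atom_index_b, bond_order) tuples.
    """
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j, order in bonds:
        if i == j:
            raise ValueError("an atom cannot be bonded to itself")
        matrix[i][j] = order
        matrix[j][i] = order  # bonds are undirected, so mirror the entry
    return matrix


# Formaldehyde (H2C=O) with an assumed atom ordering: C=0, O=1, H=2, H=3.
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]
adjacency = build_adjacency_matrix(4, bonds)

# Binary variant: 1 where any bond exists, 0 otherwise.
binary_adjacency = [[1 if value else 0 for value in row] for row in adjacency]
```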
  • the compound representation facility 122 may also receive, generate, and/or calculate an atomic distance matrix 430 for a compound.
  • the distance matrix may encode information about a 3D structure of the compound by indicating distances between atoms (or molecules) in a compound. The distances may be in angstroms or other suitable unit of measure.
  • the compound representation facility 122 may use in the matrix a measured distance between atoms for a particular conformer of a compound, while in other embodiments the facility 122 may calculate an average/mean, median, mode, standard deviation, or other calculation of atomic distance between atoms (or molecules), which may be calculated based on distances across different conformers of a molecule.
  • atom positions that are used in calculating distance may be determined using a position determination process such as the “ETKDG” method.
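  • The atomic distance matrix can be sketched as below, both for a single conformer and as a mean across several conformers. The coordinates are hypothetical placeholders in angstroms; in practice they would come from a position determination process such as the ETKDG method mentioned above, which this sketch does not implement.

```python
import math
from statistics import mean


def distance_matrix(coords):
    """Pairwise Euclidean distances between atom positions of one conformer."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def mean_distance_matrix(conformers):
    """Element-wise mean of the distance matrices of several conformers."""
    matrices = [distance_matrix(coords) for coords in conformers]
    n = len(matrices[0])
    return [
        [mean(m[i][j] for m in matrices) for j in range(n)]
        for i in range(n)
    ]


# Two made-up three-atom conformers (angstroms), for illustration only.
conformer_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
conformer_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

single = distance_matrix(conformer_a)
averaged = mean_distance_matrix([conformer_a, conformer_b])
```

  • Replacing `mean` with `median` or `stdev` from the same standard library module would give the other summary statistics the text mentions.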
  • the compound representation facility 122 may also receive, generate, and/or calculate attention information 426 for a compound. This may include contextualized query vectors, key vectors, and value vectors for use in attention analysis.
  • the compound representation facility 122 can be trained (e.g., learn) to generate the compound representations while retaining their structural information and information about the atoms and bonds of the compounds.
  • Each compound representation can be unique for each compound conformer or indicate distance distribution of conformers of a molecule.
  • the compound representation facility 122 can be trained for representation of pharmaceutical chemical space but can also be trained on representation of any chemical space.
  • the compound representation facility 122 can be trained on a dataset of proteins and ligands.
  • the compound representation facility 122 can train a decoder 432 to generate decoded property values of the compounds from the compound representation of the compounds, such as structural properties. For example, as shown in FIG. 4B, the compound representation facility 122 can generate graph un-embeddings 434 of the encoded graph embeddings that were input into the compound representation facility 122.
  • the decoder can be a decoder network or a representation layer for predicting or identifying properties of the encoded compounds. By training the decoder, the compound representation facility 122 can establish a searchable chemical space to optimize and identify molecules based on their properties. The compound representation facility 122 can train the decoder to extract the property values of the compounds from the compound representations.
  • the compound representation facility 122 can train a generative neural network to reconstruct (e.g., decode) the property values of the property representations of the compounds.
  • the compound representation facility 122 can train the decoder to reconstruct chemical structures from vector representations of the compounds.
  • the compound representation facility 122 can identify the decoded property values of the compounds.
  • the compound representation facility 122 can identify the decoded property values from the compound representations.
  • the compound representation facility 122 can identify the decoded property values in the vector representation of the compounds.
  • the decoded property values can indicate structural and/or functional properties of the compound.
  • the compound representation facility 122 can determine whether the decoded property values of the compound correspond to the known property values of the compound.
  • the compound representation facility 122 can compare whether both the known property values and the decoded property values indicate that the compound is blood brain penetrant, or indicate a same degree or amount of penetrance.
  • the compound representation facility 122 can train the decoder to reconstruct chemical structures based on a loss function that can be conditioned by the decoder network or the properties (e.g., property values).
  • the loss can be the difference between the known property values (e.g., input) and the decoded property values (e.g., output).
  • the method 400 can proceed to step 404 for the compound representation facility 122 to continue training the encoder and decoder, so that the encoder stores information in the representation in a manner that may enable more accurate decoding and/or the decoder more precisely generates the decoded property values from the compound representation of the compound.
  • the compound representation facility 122 can continue retraining the encoder and/or decoder until the decoder is able to accurately identify the property values of any particular conformation.
  • the compound representation facility 122 can train the encoder and the decoder until they can accurately generate property representations of the compounds from the graph representations of the compounds, and then recreate the chemical structure of the compounds from the graph representations. For example, the training can be based on the fact that similar compounds might have similar structures and properties.
  • the method 400 can proceed to step 412 for the compound representation facility 122 to output a trained library of compounds (e.g., trained chemical representation space or trained embedding space).
  • the library of compounds can include encoded compounds and their property values that can be accurately decoded, which allows the library of compounds to be searchable and usable for compound related search.
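  • The encode, decode, and compare loop of method 400 can be illustrated with a deliberately tiny stand-in model. In this sketch the "encoder" and "decoder" are each a single scalar weight, the loss is the mean squared difference between known and decoded property values, and training repeats until the loss falls below a tolerance. The function names, the tolerance, the learning rate, and the property values are all assumptions; the networks described in this document are far richer.

```python
def reconstruction_loss(known, decoded):
    """Mean squared difference between known (input) and decoded (output) values."""
    return sum((k - d) ** 2 for k, d in zip(known, decoded)) / len(known)


def train_autoencoder(known, tolerance=1e-6, learning_rate=0.5, max_rounds=1000):
    """Toy encoder/decoder: representation = w_enc * value, decoded = w_dec * representation."""
    w_enc, w_dec = 0.5, 0.5  # arbitrary untrained starting weights
    n = len(known)
    for round_index in range(max_rounds):
        encoded = [w_enc * k for k in known]
        decoded = [w_dec * z for z in encoded]
        loss = reconstruction_loss(known, decoded)
        if loss <= tolerance:
            # Decoded values match the known values: training can stop.
            return w_enc, w_dec, loss, round_index
        # Otherwise keep training: gradient of the loss for each weight.
        grad_dec = sum(2 * (d - k) * z for k, d, z in zip(known, decoded, encoded)) / n
        grad_enc = sum(2 * (d - k) * w_dec * k for k, d in zip(known, decoded)) / n
        w_enc -= learning_rate * grad_enc
        w_dec -= learning_rate * grad_dec
    raise RuntimeError("loss did not reach the tolerance")


# Hypothetical known property values, for illustration only.
known_values = [0.2, 0.5, 0.9]
w_enc, w_dec, final_loss, rounds = train_autoencoder(known_values)
```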
  • the facility 122 may conduct partial masking 436 of compounds.
  • one or more parts of the compound (e.g., one or more atoms or molecules, such molecules being portions of the compound such as a fragment, and bonds of those atoms/molecules) may be removed from the input such that information regarding the atoms and bonds of the masked part(s) is not input to the encoding process.
  • the input representation may include a placeholder “dummy” token in place of the masked atom/molecule
  • the model may be evaluated by how well the model may decode an encoded representation of the masked compound to recreate the original unmasked compound.
  • an input compound may have a portion between 5% and 25% of the overall compound (e.g., between 10% and 20%, such as 15%) masked.
  • the information may not be used to determine the adjacency matrix, such that the model does not receive information about the interconnections of the masked atoms/molecules, including interconnections to the unmasked parts of the compound.
  • the information on the masked portion may also not be used in determining atomic distance.
  • the information may be used in determining the attention vectors (query, key, and value).
  • the model is thus trained with information indicating that there are atoms/molecules included in the compound, but not indicating the arrangement (e.g., position and/or interconnection) of those atoms/molecules.
  • the facility 122 determines whether the model accurately determined the arrangement of the atoms/molecules. The model is trained until a performance criterion is met for recreation of the original, unmasked compound. In doing so, the model is trained to reliably determine such information for additional molecules that have not yet been synthesized or for which arrangement information is unknown.
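  • A minimal sketch of partial masking follows, using the placeholder-token variant: a fraction of atoms (15% here) is replaced by a dummy token, and every adjacency entry touching a masked atom is zeroed so that no interconnection information for the masked part reaches the encoder. The function names, the `"*"` token, the fixed seed, and the 20-carbon chain are assumptions for illustration.

```python
import random

MASK_TOKEN = "*"  # hypothetical dummy token standing in for a masked atom


def mask_compound(atoms, adjacency, fraction=0.15, seed=0):
    """Mask a fraction of atoms and withhold their bonds from the adjacency matrix."""
    rng = random.Random(seed)
    n = len(atoms)
    num_masked = max(1, round(fraction * n))
    masked = set(rng.sample(range(n), num_masked))
    masked_atoms = [MASK_TOKEN if i in masked else atom for i, atom in enumerate(atoms)]
    masked_adjacency = [
        [0 if (i in masked or j in masked) else adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]
    return masked_atoms, masked_adjacency, sorted(masked)


def recovery_rate(original_atoms, predicted_atoms, masked_indices):
    """Fraction of masked positions at which a prediction matches the original atom."""
    hits = sum(1 for i in masked_indices if predicted_atoms[i] == original_atoms[i])
    return hits / len(masked_indices)


# A made-up linear chain of 20 carbon atoms, each bonded to its neighbours.
atoms = ["C"] * 20
chain = [[1 if abs(i - j) == 1 else 0 for j in range(20)] for i in range(20)]
masked_atoms, masked_adjacency, masked_indices = mask_compound(atoms, chain)
```

  • A performance criterion such as a minimum `recovery_rate` could then decide whether training continues.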
  • FIG. 5A is a flowchart of an illustrative method 500 of decoding a compound representation (e.g., digital representation) of a compound to recreate a chemical structure of the compound.
  • the method 500 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments.
  • the compound representation facility 122 can identify a compound representation of a compound. For example, the compound representation facility 122 can retrieve a vector that represents the compound, other representation described herein, or other representation. In some embodiments, the compound representation facility 122 can receive a compound (e.g., input molecule) for which to identify properties.
  • the compound representation facility 122 can extract property values of the compound from the compound representation of the compound. For example, the compound representation facility 122 can query or identify the property values stored in the vector. In some embodiments, the compound representation facility 122 can reconstruct (e.g., decode) the property values of the property representations of the compounds.
  • the compound representation facility 122 can identify, from the property values, properties of the compound. For example, the compound representation facility 122 can identify that decoded property values indicate structural properties for a compound, and/or may indicate functional properties for a compound such as whether a particular compound is blood brain penetrant.
  • the compound representation facility 122 can output the properties of the compound. For example, compound representation facility 122 can reconstruct the compound from its compound representation. The reconstructed compound can be output to a user, such as a scientist researching the compound.
  • FIG. 5B is a flowchart of an illustrative method 550 of decoding a compound representation (e.g., digital representation) of a compound to synthesize the compound.
  • the method 550 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments.
  • the compound representation facility 122 can query a library of compounds for a compound having properties of interest.
  • the compound representation facility 122 can receive a desirable set of properties, and then query all the compounds in the library of compounds for compounds having the desired set of properties.
  • the library of compounds can be a searchable data store of compounds.
  • the library of compounds can be a searchable embedding space of molecules and/or an information-rich embedding space capturing information relevant to chemistry (e.g., rules of chemistry, relationships between structural and functional properties of compounds, etc.).
  • the compound representation facility 122 can use algorithms to find compounds in the library of compounds.
  • the compound representation facility 122 can use Bayesian optimization to find molecules in the chemical space maintained by the library of compounds.
  • the library of compounds can be stored on the computing device 120 and searchable via the computing device 110.
  • the library of compounds can be queried by the client 115 via the computing device 110 to look up the properties of any compound.
  • the compound representation facility 122 can identify, in the library of compounds, a compound representation for the compound having the properties of interest.
  • the library of compounds can be generated and maintained as discussed in reference to FIG.
  • the library of compounds can include a respective compound representation for each compound (or one or more (e.g., all) compound conformers of each compound) in the library of compounds.
  • the respective compound representation enables the compound representation facility 122 to query the library of compounds for any compound and reconstruct any compound molecule from its compound representation.
  • the library of compounds can maintain a one-to-one mapping between each respective compound representation and each compound conformer.
  • the compound representation facility 122 can reconstruct the compound from the one-to-one mapping to the compound representation.
  • the Bayesian optimization algorithm can search the embedding space if it maps one-to-one to chemistry space.
  • the compound representation facility 122 can extract, from the compound representation of the compound, properties of the compound, wherein the properties include the properties of interest. For example, the compound representation facility 122 can query or identify the property values stored in the vector. In some embodiments, the compound representation facility 122 can reconstruct (e.g., decode) the property values of the property representations of the compounds.
  • the compound representation facility 122 can output the properties of the compound for synthesis.
  • compound representation facility 122 can reconstruct the compound from its compound representation.
  • the reconstructed compound can include the chemical structure of the compound and the properties of the compound.
  • the reconstructed compound can be output to a user, such as a scientist that wants to synthesize the compound.
  • the compound representation facility 122 can synthesize the compound to generate at least one synthesized compound for testing.
  • a synthesized compound may be used in a variety of ways, including in research to confirm properties of a compound or confirm whether a compound is suitable for a function (e.g., binding to a target, addressing a medical condition, or other function). Any suitable synthesis techniques can be used to synthesize the compound.
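  • The query side of method 550 can be sketched as a lookup over a small in-memory library. Each entry pairs a compound identifier with a representation vector and property values, and a reverse index enforces the one-to-one mapping between representations and compounds. The identifiers, vectors, and property values are made up for illustration, and the exhaustive filter stands in for the Bayesian optimization search, which is not implemented here.

```python
# Hypothetical library: identifier -> representation vector and property values.
library = {
    "compound_a": {
        "representation": (0.12, 0.80),
        "properties": {"bbb_penetrant": True, "solubility": 0.9},
    },
    "compound_b": {
        "representation": (0.55, 0.10),
        "properties": {"bbb_penetrant": False, "solubility": 0.7},
    },
    "compound_c": {
        "representation": (0.31, 0.42),
        "properties": {"bbb_penetrant": True, "solubility": 0.2},
    },
}


def build_reverse_index(library):
    """Map each representation back to its compound, insisting on a one-to-one mapping."""
    index = {}
    for compound_id, entry in library.items():
        representation = entry["representation"]
        if representation in index:
            raise ValueError("two compounds share one representation")
        index[representation] = compound_id
    return index


def query_library(library, criteria):
    """Return identifiers whose properties satisfy every predicate in `criteria`."""
    return [
        compound_id
        for compound_id, entry in library.items()
        if all(test(entry["properties"][name]) for name, test in criteria.items())
    ]


reverse_index = build_reverse_index(library)
matches = query_library(
    library,
    {"bbb_penetrant": lambda value: value is True, "solubility": lambda value: value >= 0.5},
)
```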
  • FIG. 6A is a flowchart of an illustrative method 600 of training a model to predict a value for a property of one or more compounds, which may be done using compound representations (e.g., digital representations) of the compounds such as representations discussed above.
  • the method 600 may be performed by the compound model facility 124 executed by the computing device 120, in some embodiments.
  • method 600 can include the compound model facility 124 predicting values for a property for an entire library of compounds.
  • performing step 204 of the method 200 includes performing the method 600.
  • Some aspects of the method 600 of training a model are illustrated in FIG. 6B.
  • the compound model facility 124 can identify a plurality of compound representations. Each of the plurality of compound representations can represent a respective compound in the library of compounds. For example, the compound model facility 124 can access the library of compounds generated and maintained by the compound representation facility 122 as discussed in reference to FIGS. 3A and 4A. In some embodiments, the compound model facility 124 can be trained on a graph (e.g., instead of a vector generated from the graph) or with some other data structures that represent compounds. For example, the compound model facility 124 can train on any representation of compounds 620, such as molecule patterns of a protein and its bound ligand form or whether the molecule is blood brain barrier penetrable.
  • the compound model facility 124 can train 622 a machine learning model on the plurality of compound representations for predicting values for properties of each respective compound in the library of compounds. For example, the compound model facility 124 can predict properties of compounds from the compound representation of the compounds. In some embodiments, the compound model facility 124 can leverage the graph representation to train the model to predict properties of each respective compound. For example, the compound model facility 124 can use the graph representation to pre-calculate property predictions of the compounds in the library of compounds (e.g., a large-scale chemical library). In some embodiments, the compound model facility 124 can train the model (e.g., graph transformer models) such that the model learns a general understanding of the chemical properties (e.g., structure and function) of the compounds.
  • the compound model facility 124 can utilize the machine learning model to generate predicted property values indicative of the properties of the respective compound.
  • the generated predicted property values can be numeric values in a vector or an array.
  • the compound model facility 124 can determine 624 whether the predicted property values of the respective compound correspond to known property values of the respective compound.
  • the predictions can be output to assess the accuracy of the model. By assessing the accuracy of the model, its parameters can be improved and updated to improve the predictive performance of the model.
  • the compound model facility 124 can compare the predicted property values (e.g., numerical vector) to the known property values (e.g., input label). For example, the compound model facility 124 can condition the model based on a joint loss function to compare the predicted property values and known property values.
  • the compound model facility 124 can use a comparison between the predicted property values and the known property values to evaluate the performance of the model. For example, a well performing model would not have a significant difference between the predicted property values and the known property values, while a model that does not perform well would generate predicted property values that are significantly different from the known property values.
  • the method 600 can proceed to step 604 for the compound model facility 124 to re-train the machine learning model to generate the predicted property values of the respective compound. As the compound model facility 124 is trained on more data, the predictions may become more accurate.
  • the method 600 can proceed to step 610.
  • the compound model facility 124 can output the trained model, which has been trained to predict one or more known properties of compounds.
  • the compound analysis facility can use the machine learning model to predict the properties of the library of compounds.
  • the compound model facility 124 can include fine-tuned property models (e.g., trained models) that the compound analysis facility 126 can use to output the predicted properties for compounds in a large chemical database (e.g., library of compounds) to establish a large chemical property database.
  • the compound analysis facility 126 can provide the predicted properties to the quantum annealer 130 as input for the optimization criteria to identify and rank compounds having the predicted properties.
  • the compound analysis facility 126 may be adapted to provide the predicted properties to conventional computing hardware to analyze the compounds.
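  • On conventional computing hardware, identifying and ranking compounds from predicted properties can be as simple as a weighted score, sketched below. The property names, weights, and predicted values are hypothetical, and this scoring scheme is a stand-in chosen for illustration rather than the optimization performed by the quantum annealer 130.

```python
def rank_compounds(predicted, weights):
    """Sort compound identifiers by a weighted sum of predicted property values.

    Positive weights reward a property; negative weights penalise it.
    """
    def score(properties):
        return sum(weight * properties[name] for name, weight in weights.items())

    return sorted(predicted, key=lambda compound_id: score(predicted[compound_id]), reverse=True)


# Made-up predicted property values in the range 0..1.
predicted = {
    "compound_a": {"binding": 0.90, "toxicity": 0.20},
    "compound_b": {"binding": 0.60, "toxicity": 0.10},
    "compound_c": {"binding": 0.95, "toxicity": 0.80},
}
weights = {"binding": 1.0, "toxicity": -1.0}
ranking = rank_compounds(predicted, weights)
```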
  • FIG. 6C is a flowchart of an illustrative method 650 of training a model to predict new (e.g., previously unknown) properties of the compounds.
  • performing step 204 of the method 200 includes performing the method 650.
  • Some aspects of the method 650 of training a model to predict new properties of compounds are illustrated in FIG. 6D.
  • the method 650 may be performed by the compound model facility 124 executed by the computing device 120, in some embodiments.
  • the model may be trained to predict new properties of compounds using training data.
  • the training data can include either no or few labeled training examples of compounds having the new properties, but the compound model facility 124 can leverage an existing model to train a new model to predict values for a new property, such as for compounds in a library of compounds.
  • the compound model facility 124 can fine-tune an existing model (e.g., with minor adjustments) to predict new properties of the compounds.
  • the compound model facility 124 can receive identification of a new property of interest and training data for the new property of interest.
  • the compound model facility 124 can identify property values for the new property of interest in the training data.
  • the property values can be numeric values in a vector or an array, which may include data that has been retrieved from a source of data regarding compounds (e.g., a public source of data), obtained through testing of compounds, or otherwise obtained.
  • the compound model facility 124 can modify, based on the new property of interest and the training data, a machine learning model trained to predict one or more properties of compounds.
  • the training data can be used to train the compound model facility 124 to predict a value for the new property of interest for each compound of the library.
  • to train a new, wholly untrained model to predict a value for a property with acceptable accuracy or reliability may require a large amount of training data for compounds that have or do not have the property, or have a range of values of the property. For some properties, such as previously unstudied or understudied properties, such an amount of data may not be available. Or there may be other hurdles to obtaining sufficient data. In some embodiments, a smaller amount of training data may be used to reliably train a model, by leveraging an existing model that had been trained to predict a value for a different property of compounds.
  • an existing model may be a neural network and may include layers that were previously trained to predict a value for another property.
  • the facility 124 may use these layers as a backbone for a new model that is created by editing the existing model.
  • the compound model facility 124 can learn or generate a non-linear representation of the latent layer(s) of the existing model to produce learned representations of the data that become more complex with the addition of a new layer to the neural network.
  • the existing layers of the model may be layers that have learned information on structure of compounds, or other functional properties of compounds, or general information on classes of compounds.
  • those models may feed one or more classifier layers or other layers that output information on a value of a property for an input compound.
  • Those later layers of the model may be specific to the one property for which that model is designed, but those earlier layers and the parameters with which they are configured as a result of earlier training may be reusable in models that predict a value for a different property for an input compound. Such information may be useful in a network that is to predict a value for another property.
  • the model may be edited in a way that allows for retaining some parts of the model (e.g., one or more layers of a neural network) while editing the model.
  • Such editing may include adding, removing, or adjusting a part of the model, such as adding, removing, or adjusting layers of a neural network.
  • a new output layer can be added to a model following removal of a prior output layer, where the output layer may be a layer that predicts a value for a property of input compounds.
  • the added layer of the model may predict whether an input compound is blood brain barrier penetrant or not, or an amount or degree of penetrance.
  • the compound model facility 124 can add a new layer for the new property to an existing model to retrain the model for predicting the new property.
  • the compound model facility 124 can add one or more new layers for predicting the new property.
  • the compound model facility 124 can add or append a new layer for the new property to the network.
  • the compound model facility 124 can train the model, including the new layer, on the new property.
  • the compound model facility 124 can copy the existing layers for predicting the existing properties and combine the copied layers with the one or more new layers for predicting the new property.
  • the compound model facility 124 can remove one or more layers related to the existing properties and add one or more new layers for predicting the new properties.
  • the layer being removed can be the layer responsible for decoding the chemical representation.
  • the prediction layer can be removed, and the remaining representation network is copied over and added to a new untrained layer.
  • the initial network can be removed while copying all the weights in the model, and the model can then be fine-tuned on a dataset (e.g., a blood brain barrier dataset). The fine-tuning can be for the new property.
  • Embodiments are not limited to operating with any particular property or type of property. Examples of properties include solubility, blood brain barrier, toxicity, synthesizability, protein ligand binding.
  • the compound model facility 124 can then pre-calculate these categories on a large dataset such as GDB-17.
  • the compound model facility 124 can adjust an existing layer to incorporate the training data.
  • the compound model facility 124 can train the machine learning model to identify the new property of interest in a set of compounds of the library of compounds. For example, the compound model facility 124 can train the machine learning model by running the training data of step 652 through the model. The compound model facility 124 can be trained to predict the new property with one or more layers at the output side of the network. For example, only the new layers need to be trained but the machine learning model as a whole is trained to predict whether any of the compounds include the new property of interest.
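  • The idea of reusing earlier layers as a backbone and training only a new output layer can be sketched with a toy network. Here the backbone is a fixed two-unit tanh layer whose weights stand in for previously trained parameters, and the new head is a logistic output layer trained by gradient descent on a handful of labeled examples. All weights, features, and labels are hypothetical; the models described in this document (e.g., graph transformers) are far larger.

```python
import math

# Stand-in for layers trained earlier on another property; kept frozen below.
BACKBONE_WEIGHTS = [[1.0, -0.5], [0.3, 0.8]]


def backbone(features):
    """Frozen earlier layers: map input features to a hidden representation."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in BACKBONE_WEIGHTS]


def predict(head, features):
    """New output layer: probability that the compound has the new property."""
    hidden = backbone(features)
    logit = sum(w * h for w, h in zip(head["weights"], hidden)) + head["bias"]
    return 1.0 / (1.0 + math.exp(-logit))


def mean_loss(head, examples):
    """Mean binary cross-entropy between predictions and labels."""
    total = 0.0
    for features, label in examples:
        p = predict(head, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(examples)


def train_new_head(examples, learning_rate=0.5, steps=200):
    """Train only the new head; the backbone weights are never updated."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    n = len(examples)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in examples:
            hidden = backbone(features)
            error = predict(head, features) - label
            for i, h in enumerate(hidden):
                grad_w[i] += error * h / n
            grad_b += error / n
        head["weights"] = [w - learning_rate * g for w, g in zip(head["weights"], grad_w)]
        head["bias"] -= learning_rate * grad_b
    return head


# A few made-up labeled examples for the new property (1 = has the property).
examples = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((1.0, 1.0), 1), ((-1.0, 0.0), 0)]
untrained_head = {"weights": [0.0, 0.0], "bias": 0.0}
initial_loss = mean_loss(untrained_head, examples)
trained_head = train_new_head(examples)
final_loss = mean_loss(trained_head, examples)
```

  • Because only the head's three parameters are fitted, a small labeled dataset can suffice, which is the motivation given above for leveraging an existing model.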
  • the compound model facility 124 can use the machine learning model to identify or generate property values for the new property of interest in the compounds.
  • the property values can be numeric values in a vector or an array.
  • the compound model facility 124 can be trained without any labeled training data. By not needing labelled training data, the compound model facility 124 can enumerate chemical structures to a desired size. If the compound model facility 124 is trained without labeled training data or validation, the method 650 can proceed to step 660. If the compound model facility 124 is trained with a validation step, the method 650 can proceed to step 658.
  • the compound model facility 124 can determine whether the compounds predicted to have the new property of interest correspond to the compounds expected to have the new properties of the set of compounds.
  • the predictions can be the output to assess the accuracy of the model. By assessing the accuracy of the model, its parameters can be improved and updated to improve the predictive performance of the model. For example, a scientist or subject matter expert can provide information about which compounds are expected to have the new properties.
  • the compound model facility 124 can compare the predicted new property values (e.g., numerical vector) to the expected property values (e.g., input label) for those compounds. For example, the compound model facility 124 can condition the model based on a joint loss function to compare the predicted property values and expected property values.
  • the compound model facility 124 can use a comparison between the predicted new property values and the expected property values to evaluate the performance of the model. For example, a well performing model would not have a significant difference between the predicted property values and the expected property values, while a model that does not perform well would generate predicted property values that are significantly different from the expected property values.
  • the method 650 can proceed to step 656 for the compound model facility 124 to re-train the machine learning model to identify the new property of interest in the library of compounds.
  • the method 650 can proceed to step 660 for the compound analysis facility 126 to use the trained machine learning model to predict the new property of interest in the library of compounds.
  • the compound analysis facility 126 can use the machine learning model to provide the predicted properties to the quantum annealer 130 as input for the optimization criteria to identify and rank compounds having the new property.
  • the compound analysis facility 126 may be adapted to provide the input to classical computing hardware.
  • FIG. 7A illustrates a method 700 for identifying compounds to analyze.
  • the method 700 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments.
  • the compound analysis facility 126 can identify compounds having properties of interest.
  • the compound analysis facility 126 can receive the properties of interest from the client 115 via the client interface 112 executing on the computing device 110.
  • properties of interest may additionally or alternatively be determined by the compound analysis facility 126 through analysis of input compounds.
  • the input compounds may be ones that are identified by the client 115 as performing a function or performing a function in a manner that satisfies one or more criteria, such as performing the function with a desired effectiveness.
  • the compound analysis facility 126 can obtain or compute data of existing molecules that have known favorable properties of interest.
  • the compound analysis facility 126 can extract properties of interest for those compounds.
  • the compound analysis facility 126 can identify properties of interest that define or increase the likelihood of interaction with a protein target.
  • the compound analysis facility 126 can identify features that contribute specific desirable properties.
  • the compound analysis facility 126 can obtain 3D structures of compounds (e.g., proteins, ligands, etc.) with known favorable properties for a particular function of interest, from which to identify properties of interest. If the 3D protein structure is unavailable or does not exist, the compound analysis facility 126 can predict the structure of the known compounds. In some embodiments, the compound analysis facility 126 can identify properties of interest for other therapy modalities, for example DNA, RNA, or peptide-based therapies. The compound analysis facility 126 can identify features and properties specific to the performance in those therapy modalities.
  • the compound analysis facility 126 can generate compound conformers.
  • the compound analysis facility 126 can identify compounds having low energy states.
  • the compound analysis facility 126 can generate a plurality of 3D shapes of a compound, which may be 3D shapes of the compound that would be present in low energy states.
  • the compound analysis facility 126 can obtain known inhibitors and, as shown in FIG. 7H, sanitize and relax the conformers.
  • the compound analysis facility 126 can analyze the target(s) for compounds of interest.
  • the target(s) may be analyzed in connection with a 3D structure of the target(s).
  • the target may be a protein or other molecule that a compound of interest is to interact with, such as binding with.
  • the compound analysis facility 126 can detect one or more binding sites of the protein target.
  • the binding site may be a binding pocket.
  • a composition and/or structure of the binding site may be detected.
  • a molecular composition and structure may be determined.
  • the compound analysis facility 126 can analyze the known compounds in connection with how the known compounds dock with the protein target, for example, how they dock with the identified binding site(s).
  • the compound analysis facility 126 can in silico dock compounds into a protein target at the binding site(s) and analyze how the binding is executed.
  • the analysis may identify properties of the known compounds that are related to the docking with the binding site(s), such as a composition and/or structure of the compounds that relates to the docking.
  • the compound analysis facility 126 can align the known compounds. For example, the compound analysis facility 126 can align the generated 3D conformers together in 3D space.
  • the compound analysis facility 126 can identify, using the aligned conformers and the analysis of the docking, one or more compound properties of interest in the aligned compounds.
  • the compound properties of interest can be pharmacophore features in aligned compounds.
  • the compound analysis facility 126 can determine pharmacophores for the input compounds.
  • the pharmacophore properties can take the form of abstract molecular features related to a ligand’s interaction with a biological macromolecule (e.g., protein).
  • the compound analysis facility 126 can use the pharmacophores to determine properties that are present in the compounds and may be related to performing the function or performing the function in the manner that satisfies the criteria.
  • the compound analysis facility 126 can abstract the aligned functional groups and features into numerical features (e.g., number of hydrogen bond donors/acceptors). As shown in FIG. 7I, FIG. 7J, and FIG. 7K, the compound analysis facility 126 can generate a visual representation of relevant pharmacophores for a given ligand. As shown in FIG. 7I, the hydrogen bond donors are visualized. As shown in FIG. 7J, the hydrogen bond acceptors are visualized. As shown in FIG. 7K, the hydrophobic features are visualized.
  • as shown in FIG. 7L and FIG. 7M, the compound analysis facility 126 can generate a visual representation of a ligand’s interaction with amino acid residues in a given protein binding site.
  • the compound analysis facility 126 can transmit the visual presentations (e.g., as shown in FIG. 7I-7M) to the computing device 110 for display in the client interface 112 to the client 115.
  • FIG. 8A illustrates a method 800 for determining and/or analyzing criteria.
  • the method 800 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments.
  • the criterion may cover all molecules that satisfy “Lipinski’s Rule of 5”, which defines a space of druglike molecules whose pharmacokinetic properties within the human body make them more likely candidates for drugs than molecules that do not meet the rule.
  • the property of interest can be that the compound has one hydrogen bond donor.
  • the quantum annealer 130 might not be able to execute standard executable instructions (e.g., standard computer code). The computing device 120 can therefore translate the inputted properties of interest into a format that is compatible with the quantum annealer 130 by adapting a compound property analysis to a formulation (e.g., a QUBO) in which the computation may be performed by quantum computing hardware. The computing device 120 can then configure the quantum annealer 130 with the manner in which the data is to be processed and provide the data to the quantum annealer 130 for processing.
  • the compound analysis facility 126 can identify a set of compounds to be analyzed.
  • the compound analysis facility 126 can identify a set of compounds to analyze based on the identified properties of interest.
  • Examples of the size and scope of the set of compounds include a small library of molecules in similar chemical space or a larger space defined by molecular criteria set by a user.
  • molecular criteria may be broad, such as all molecules with a number of atoms less than or equal to a number, or all molecules that satisfy the Rule of 5 mentioned above, or similar criteria.
  • a user may specify other criteria, such as to focus the analysis on chemical space of particular interest to a user, such as all compounds that include a certain atom or molecule.
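A broad molecular criterion such as the Rule of 5 can be expressed as a simple filter over precomputed descriptors. The sketch below assumes the commonly cited thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors) and assumes that a compound must meet all four; the `Descriptors` type, the compound identifiers, and the descriptor values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Descriptors:
    mol_weight: float        # molecular weight in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def passes_rule_of_five(d: Descriptors) -> bool:
    """Strict reading (assumption): all four thresholds must hold."""
    return (d.mol_weight <= 500
            and d.log_p <= 5
            and d.h_bond_donors <= 5
            and d.h_bond_acceptors <= 10)


def filter_library(library: Dict[str, Descriptors]) -> List[str]:
    """Return identifiers of compounds that satisfy the rule."""
    return [name for name, d in library.items() if passes_rule_of_five(d)]


# Hypothetical library with made-up descriptor values.
library = {
    "cmpd-1": Descriptors(320.4, 2.1, 1, 4),
    "cmpd-2": Descriptors(612.7, 3.3, 2, 8),   # too heavy
    "cmpd-3": Descriptors(450.0, 5.6, 0, 6),   # logP too high
}
druglike = filter_library(library)
```

A user-specified criterion, such as requiring a particular atom or substructure, could be added as a further predicate in the same style.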
  • the compound analysis facility 126 can identify the core substructure of active compounds from compounds used to generate the compound properties of interest (e.g., pharmacophores).
  • FIG. 8B shows examples of pharmacophores.
  • FIG. 8C shows examples of properties of interest that may have been derived from pharmacophores.
  • the compound analysis facility 126 can append this substructure by using a library of chemical fragments to produce a set of compounds that include chemicals enumerated around the core structure. While this approach may be advantageous in some cases, it can also risk biasing analysis to a similar chemical space to existing compounds and so may not be an appropriate technique in all embodiments.
  • the compound analysis facility 126 can generate the set of compounds to search for optimal or desirable compounds, such as those predicted to have a desirable combination of properties for drugs.
  • the set of compounds can be a library of compounds to be analyzed by the quantum annealer 130 to determine a subset of the compounds that meet one or more criteria, such as those predicted to have a desirable combination of properties for drugs.
  • the compound analysis facility 126 can generate the set of compounds by enumerating all possible theoretical molecules up to a certain number of atoms.
  • when the compound analysis facility 126 analyzes other therapy modalities, the compound analysis facility 126 can generate a set of compounds by enumerating over peptide, RNA, or DNA sequences up to a certain number of residues.
  • the compound analysis facility 126 can enumerate by generating an algorithm that generates possible combinations of compounds up to a certain number of atoms.
  • the compound analysis facility 126 can use the algorithm to start from a single carbon atom and add possible atoms in a bond with it up to and including the desired number of atoms.
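Enumeration up to a size limit is easiest to illustrate for the sequence-based modalities mentioned above. The sketch below lists every sequence over a residue alphabet up to a maximum length; the function name and the four-letter RNA alphabet are illustrative assumptions, and enumerating small molecules atom by atom would additionally require valence and duplicate checks that are not shown here.

```python
from itertools import product
from typing import Iterator


def enumerate_sequences(alphabet: str, max_len: int) -> Iterator[str]:
    """Yield every sequence over `alphabet` with 1..max_len residues."""
    for length in range(1, max_len + 1):
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)


# Illustrative alphabet: the four RNA bases.
sequences = list(enumerate_sequences("ACGU", 2))
```

The number of sequences grows as the sum of the alphabet size raised to each length (here 4 + 16 = 20), which is why the size limit matters.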
  • the compound analysis facility 126 can generate one or more criteria.
  • the compound analysis facility 126 can use information from the pharmacophores, or other input from a user or information regarding compounds, to determine the criteria regarding properties for compounds that, when present in a compound, may lead to the compound performing a function or performing a function in a manner that satisfies one or more criteria.
  • the compound analysis facility 126 can derive the features that are relevant to a set of physicochemical properties (e.g., solubility, Blood Brain Barrier (BBB) penetration).
  • the physicochemical properties can be in the form of machine learning model predictions, presence/absence of chemical fragments, or other calculable features which relate to activity.
  • the compound analysis facility 126 can generate the Quantitative Structure Activity Relationships (QSAR).
  • the compound analysis facility 126 can generate the one or more criteria against which the set of compounds is to be searched.
  • the compound analysis facility 126 can determine the one or more criteria related to the properties of interest and/or to values for those properties.
  • the one or more criteria enable a description of those properties with respect to a binary value, such as whether the property is present or not in a compound or whether a criterion with respect to the property (e.g., a value above or below a threshold) is satisfied for the compound.
  • the compound analysis facility 126 can generate the one or more criteria in a format that is processable by the quantum annealer 130. For example, for a list of compounds, the compound analysis facility 126 can generate a binary value with respect to each property and for each compound. These values may be arranged in a matrix of values, where each row represents a compound, and each column represents a property. In some embodiments, the values can be discrete or continuous values. For example, the discrete values can be binary values (e.g., 0 or 1). In another example, the continuous values can be any number from 0 to 1, such as 0.8.
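The matrix described above can be sketched as follows, with one row per compound and one column per property. The rule names, thresholds, and compound records are hypothetical; each rule maps a compound to a binary value.

```python
from typing import Callable, Dict, List

Compound = Dict[str, float]
Rule = Callable[[Compound], bool]


def build_property_matrix(compounds: List[Compound],
                          rules: Dict[str, Rule]) -> List[List[int]]:
    """Rows are compounds, columns are rules; entries are 0 or 1."""
    return [[1 if rule(c) else 0 for rule in rules.values()]
            for c in compounds]


# Hypothetical rules: each reduces a property to a binary status.
RULES: Dict[str, Rule] = {
    "one_h_bond_donor": lambda c: c["h_bond_donors"] == 1,
    "mol_weight_at_most_500": lambda c: c["mol_weight"] <= 500,
}

# Hypothetical compound records.
COMPOUNDS: List[Compound] = [
    {"h_bond_donors": 1, "mol_weight": 320.0},
    {"h_bond_donors": 3, "mol_weight": 410.0},
    {"h_bond_donors": 1, "mol_weight": 612.0},
]

matrix = build_property_matrix(COMPOUNDS, RULES)
```

For continuous values, a rule could instead return a score between 0 and 1 and the matrix entries would hold that score.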
  • FIG. 9 illustrates a method 900 for configuring the quantum annealer 130.
  • the method 900 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments.
  • the compound analysis facility 126 can generate a function that identifies relationships between variables, where the variables relate to compound properties and relationships between them, such as relative priorities of different properties in a desirable or well-performing compound.
  • the compound analysis facility 126 can select weighting values for the magnetic field and provide them to the quantum annealer 130 to configure the quantum annealer 130 to perform the analysis.
  • the compound analysis facility 126 may additionally receive input values regarding how each compound relates to a property of interest, such as whether the compound has the property or the compound’s status with respect to a rule for the property.
  • the values may be binary values, and the input values may be received as an array or matrix of values where each row corresponds to a compound and each column relates to a property.
  • the facility 126 may also provide the values to the quantum annealer 130 for analysis, as part of configuring the annealer 130.
  • the facility 126 may in some embodiments also trigger the analysis by the annealer 130, following configuration.
  • the values can be discrete or continuous values.
  • the values can be binary values or a set of a plurality of continuous values.
  • the weights selected in step 902 may, in some cases, indicate relationships between variables to be analyzed by the annealer(s), such as relationships between variables relating to one or more of the properties of interest.
  • the compound analysis facility 126 can configure the quantum annealer 130 with one or more weights or other values that affect operations of the quantum annealer 130 and thereby affect evaluation of the function.
  • the compound analysis facility 126 can map the function to the bias strengths of the values by setting the variables and the strengths of the couplers in the quantum annealer 130.
  • the quantum annealer 130 can expect to process a minimization of an objective function, and the quantum annealer 130 can process values formatted based on QUBO.
  • the quantum annealer 130 can execute a search modeled after the minimum energy of the Ising Hamiltonian energy function, where $s_i \in \{-1, +1\}$ are the spin values that are subject to local fields $h_i$ and to the nearest neighbor interactions with coupling strength $J_{ij}$.
  • the compound analysis facility 126 can form the QUBO expression:
  • the compound analysis facility 126 can set the linear bias “a” of each variable and the quadratic bias “b” between variables. In some embodiments, the compound analysis facility 126 can set these biases to convert a scientific question into values based on a QUBO.
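One common way to write the two objective forms referred to above is given here as a reference sketch. The sign convention (plus signs on both sums) and the variable name $x_i$ are assumptions; some references place a minus sign in front of both Ising sums. The symbols $h_i$, $J_{ij}$, $a_i$, and $b_{ij}$ correspond to the local fields, coupling strengths, linear biases, and quadratic biases named in the text.

```latex
E_{\mathrm{Ising}}(s) = \sum_{i} h_i s_i + \sum_{i<j} J_{ij}\, s_i s_j,
\qquad s_i \in \{-1, +1\}

E_{\mathrm{QUBO}}(x) = \sum_{i} a_i x_i + \sum_{i<j} b_{ij}\, x_i x_j,
\qquad x_i \in \{0, 1\}
```

Under this convention the two forms are related by the substitution $s_i = 2x_i - 1$, so a problem stated in one form can be restated in the other.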
  • the variables may be set such that, when the quantum annealer 130 identifies, using input binary values relating to properties for compounds, a compound that corresponds to a maximum, minimum, optimum, or other statistical value for the function with which the quantum computing hardware is configured, that compound may be the best compound with respect to the properties of interest or may otherwise satisfy one or more criteria with respect to those properties of interest.
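The relationship between linear and quadratic biases on binary variables and the local fields and couplings of an Ising model can be sketched in code. This is a minimal illustration, assuming the convention E = sum(h_i s_i) + sum(J_ij s_i s_j) for spins and E = sum(a_i x_i) + sum(b_ij x_i x_j) for binary variables, linked by s = 2x - 1; the function names and bias values are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Linear = Dict[int, float]
Quadratic = Dict[Tuple[int, int], float]


def qubo_energy(x: Sequence[int], a: Linear, b: Quadratic) -> float:
    """Energy of a 0/1 assignment under linear biases a and quadratic biases b."""
    return (sum(a_i * x[i] for i, a_i in a.items())
            + sum(w * x[i] * x[j] for (i, j), w in b.items()))


def qubo_to_ising(a: Linear, b: Quadratic) -> Tuple[Linear, Quadratic, float]:
    """Substitute x = (s + 1) / 2 and collect terms into h, J, and a constant."""
    h: Linear = {i: a_i / 2 for i, a_i in a.items()}
    J: Quadratic = {}
    offset = sum(a.values()) / 2
    for (i, j), w in b.items():
        J[(i, j)] = w / 4
        h[i] = h.get(i, 0.0) + w / 4
        h[j] = h.get(j, 0.0) + w / 4
        offset += w / 4
    return h, J, offset


def ising_energy(s: Sequence[int], h: Linear, J: Quadratic,
                 offset: float) -> float:
    """Energy of a -1/+1 assignment, including the constant offset."""
    return (sum(h_i * s[i] for i, h_i in h.items())
            + sum(w * s[i] * s[j] for (i, j), w in J.items())
            + offset)


# Hypothetical biases on three binary variables.
a = {0: -1.0, 1: -1.0, 2: -1.0}
b = {(0, 1): 2.0, (0, 2): 2.0, (1, 2): 2.0}
h, J, offset = qubo_to_ising(a, b)
```

Because the conversion only adds a constant, the assignment that minimizes one form also minimizes the other.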
  • the values can be discrete or continuous values.
  • the values can be binary values or a set of a plurality of continuous values. Such values may, in some embodiments, be arranged in a matrix of values where each value in the matrix indicates whether a compound has a particular property or whether that property for the compound satisfies one or more criteria (e.g., how a value for a property compares to a threshold).
  • the discrete values can be binary values (e.g., 0 or 1).
  • the continuous values can be any number from 0 to 1, such as 0.8.
  • each row in the matrix corresponds to a drug candidate and the value for that row indicates a value for a property of that drug candidate.
  • the solution landscape can be a list of all possible compounds and representation of how many hydrogen bond donors such compounds have:
  • Table 2 Possible solutions to the one hydrogen bond donor optimization problem. Each compound can have three fragments with the number of hydrogen bond donors labelled. F can be a fragment in the compound.
  • the compound analysis facility 126 can search for compounds that have a certain number of bonds.
  • the optimization objective can be compounds where only one fragment has a hydrogen bond donor. This can be expressed algebraically as:
  • the minimization objective function can be:
  • the compound analysis facility 126 can factor out the above minimization function:
  • the compound analysis facility 126 can simplify the above function:
  • the compound analysis facility 126 can simplify $F_A^2$, $F_B^2$, $F_C^2$ to $F_A$, $F_B$, $F_C$, because each variable is binary (0 or 1).
  • the compound analysis facility 126 can include these variables into the QUBO:
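The steps above can be worked through as follows. This is a sketch under stated assumptions: $F_A$, $F_B$, $F_C$ are taken to be binary variables equal to 1 when the fragment carries a hydrogen bond donor, and the minimization objective is taken to be the squared deviation of their sum from one.

```latex
\begin{align*}
F_A + F_B + F_C &= 1 \\
\min \; E &= (F_A + F_B + F_C - 1)^2 \\
E &= F_A^2 + F_B^2 + F_C^2 + 2F_AF_B + 2F_AF_C + 2F_BF_C
     - 2F_A - 2F_B - 2F_C + 1 \\
E &= -F_A - F_B - F_C + 2F_AF_B + 2F_AF_C + 2F_BF_C + 1
     \quad (\text{using } F^2 = F \text{ for binary } F)
\end{align*}
```

Read against a QUBO, the last line gives a linear bias of $-1$ on each variable, a quadratic bias of $+2$ on each pair, and a constant of $+1$ that shifts every energy equally and does not change which assignments are minimal.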
  • the quantum annealer 130 may receive as input values corresponding to each compound to be analyzed with respect to the function and for each property of interest to be analyzed for the compound. Accordingly, in some embodiments, the quantum annealer 130 may analyze binary values corresponding to compounds and properties of interest to identify, from among the compounds, one or more compounds that satisfy one or more criteria and so may have a desirable combination of properties of interest. The quantum annealer 130 may analyze the set of compounds in connection with the properties of interest to identify a subset of the set of compounds that meet one or more criteria.
  • the quantum annealer 130 may identify the subset by determining the compounds from among the set that satisfy one or more criteria regarding statistical values resulting from evaluation of a function, such as identifying compounds from among the set that correspond to a maximization, minimization, or other optimization of a function, or to another statistical operation with respect to a function, with which the quantum annealer 130 is configured.
  • the compound analysis facility 126 can transmit the function and the set of compounds to the quantum annealer 130 for analysis.
  • the compound analysis facility 126 can transmit the function and the set of compounds to the quantum annealer 130 via the network 105.
  • the compound analysis facility 126 can receive a ranked set of compounds of interest.
  • the compound analysis facility 126 can receive the ranked set of compounds from the quantum annealer 130.
  • the ranked list can be based on weighting values at which the quantum annealer 130 was configured.
  • the quantum annealer 130 can analyze the input (e.g., matrix values) received from the compound analysis facility 126 to identify a ranking of predicted performance of the compounds.
  • the compound analysis facility 126 can receive the ranked set of compounds as a series of ‘energy states’ that are a proxy for how well each candidate solution to the function performs, where each candidate solution corresponds to a compound. For example, the compounds that satisfy the one or more criteria can have the lowest energy state.
  • the ranked set of compounds can include compounds that satisfy the one or more criteria.
  • a best, top five, top ten, top one hundred, or other top N drug candidates may be received by the compound analysis facility 126 from the quantum annealer 130.
  • the quantum annealer 130 can identify drug candidates with respect to the properties and identify a number N of the drug candidates that have an overall best performance with respect to the properties, based on evaluation of a function with which the quantum annealer 130 is configured for determining an optimal or otherwise desirable combination of properties.
  • the compounds that satisfy the one or more criteria for the presence of only 1 hydrogen bond donor can have the lowest energy state, while the compounds that violate the criteria can have higher energy states:
  • the compound analysis facility 126 can select compounds from the ranked list.
  • the compound analysis facility 126 can select compounds that satisfy the one or more criteria. For example, the compound analysis facility 126 can select compound B, compound C, and compound E because they have an ‘energy state’ of 0 and thus satisfy the criteria of having one hydrogen bond donor.
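The ranking and selection described above can be reproduced classically for the three-fragment example by brute-force enumeration, used here as a stand-in for the annealer. The sketch assumes the energy is the squared deviation of the donor count from one and assumes that the eight possible compounds are labelled A to H in binary counting order; both are assumptions for illustration. Under that labelling, the zero-energy compounds are B, C, and E, which agrees with the selection described above.

```python
from itertools import product
from string import ascii_uppercase
from typing import List, Tuple

Row = Tuple[int, str, Tuple[int, int, int]]


def energy(f_a: int, f_b: int, f_c: int) -> int:
    """Assumed objective: zero when exactly one fragment has a donor."""
    return (f_a + f_b + f_c - 1) ** 2


def rank_compounds() -> List[Row]:
    """Return (energy, label, fragments) rows sorted by ascending energy."""
    rows: List[Row] = []
    # Labels A..H follow binary counting order 000, 001, ..., 111 (assumption).
    for label, bits in zip(ascii_uppercase, product((0, 1), repeat=3)):
        rows.append((energy(*bits), label, bits))
    return sorted(rows)


ranked = rank_compounds()
selected = [label for e, label, _ in ranked if e == 0]
```

Restricting the output to the first N rows of `ranked` gives a top-N list in the sense used above.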
  • the compound analysis facility 126 can generate an output of selected compounds.
  • the compound analysis facility 126 can transmit the output of selected compounds to the computing device 110 for display in the client interface 112 to the client 115.
  • FIG. 10A illustrates a method 1000 for refining the outputted compounds.
  • the method 1000 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments.
  • the compound analysis facility 126 can refine the compounds received after the analysis by the quantum annealer 130 as discussed in reference to FIG. 5A and 5B. For example, as shown in FIG. 10B, one or more criteria can be identified for the compounds to be analyzed against.
  • the quantum annealer 130 can analyze the compounds against the one or more criteria and provide a list of compounds to the compound analysis facility 126.
  • the compound analysis facility 126 can receive selections of compounds to refine from the list of compounds received from the quantum annealer 130.
  • one or more of the compounds can be selected for additional testing (e.g., fine tuning). For example, the number of selected compounds can be based on the scale or cost of the testing to be performed.
  • the selections can be received from the client 115 via the client interface 112 executing on the computing device 110.
  • the compounds may then be synthesized and tested, or tested in silico using other techniques, to further identify a smaller set of compounds that may be candidates for use in a particular context, for further experimentation, or other purposes.
  • the compounds can be identified drug candidates that can then be analyzed using other techniques to determine or confirm properties in the list or determine or confirm the performance of drug candidates in the subset.
  • the compounds can be used for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules.
  • the compounds can be used in battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
  • the compound analysis facility 126 can generate a refined set of compounds.
  • the compound analysis facility 126 can apply computational techniques, such as using machine learning or other artificial intelligence techniques, or laboratory work that involves synthesizing and testing the drug candidates. In some cases, such techniques may be assisted with rule-based algorithms, randomized algorithms, brute force algorithms, or any other computerized process.
  • the compound analysis facility 126 can receive test results, simulations, or measurements, or any other information about the selected compounds.
  • the compound analysis facility 126 can modify the selected list of compounds based on the test results, simulations, or measurements, or any other information.
  • one or more compounds may be identified that may advantageously perform a function.
  • the compounds can be advantageous or optimal for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules.
  • the compounds can be advantageous or optimal as chemical molecules.
  • the compounds can be advantageous or optimal for battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
  • the compounds can be optimal for therapies.
  • features that describe the likelihood of success of those biological molecules can be extracted and optimized against.
  • the features can map the relationship between the sequence/tertiary structure of DNA/RNA and peptide-based molecules to their performance in the clinic. The performance can be based on both the ability of the molecule to undertake its manipulation of a biological network through its mechanism of action and also its ability to perform well when taken by patients (e.g., non-toxic, orally bio-available, optimal clearance).
  • the compound analysis facility 126 can generate an output of the refined set of compounds. Based on the additional experimentation, the compound analysis facility 126 can identify the refined compounds from the set of compounds. In some embodiments, the compound analysis facility 126 can transmit the refined compounds to the computing device 110 for display in the client interface 112 to the client 115. For example, as shown in FIG. 10E, the compound analysis facility 126 can identify and output a lead compound.
  • the lead compound can be a compound for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, drug delivery molecules, battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
  • a method comprising analyzing, using at least one quantum computer, information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, synthesizing at least a portion of the one or more compounds of the subset to generate at least one synthesized compound, and testing the at least one synthesized compound.
  • a method comprising analyzing, using at least one quantum computer, information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, analyzing the subset of the one or more compounds using at least one trained machine learning engine to determine one or more properties of the one or more compounds and/or analyze predicted performance of each of the one or more compounds with respect to a function, and outputting from the analyzing using the at least one trained machine learning engine a ranked list of one or more candidate compounds for performing the function.
  • a method comprising triggering analysis by at least one quantum computer of information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, receiving, as a result of the analysis, an identification of the one or more compounds of the subset, and outputting the identification of the one or more compounds as a result of the analysis.
  • a method comprising receiving a request for at least one quantum computer to analyze a library of compounds in connection with one or more criteria, the request comprising input characterizing the library of compounds to be analyzed by the at least one quantum computer, triggering analysis by at least one quantum computer of information regarding properties of the library of compounds characterized by the input to identify, from among the compounds of the library, a subset of one or more compounds that analysis indicates satisfy the one or more criteria, receiving, as a result of the analysis, an identification of the one or more compounds of the subset, and outputting the identification of the one or more compounds as a result of the analysis.
  • a method comprising receiving a request for at least one quantum computer to analyze a library of compounds of interest, the request comprising input characterizing the library of compounds to be analyzed by the at least one quantum computer, determining, for each compound in the library of compounds, a value for the compound with respect to each property of at least one property of interest, to generate a set of property values for compounds of the library of compounds of interest, triggering the at least one quantum computer to analyze the set of property values for the compounds of the library, receiving from the at least one quantum computer an identification of one or more compounds of a subset of the library that analysis by the at least one quantum computer indicates satisfy one or more criteria, and outputting information regarding the one or more compounds of the subset as a result of the analysis requested in the request.
  • a method comprising receiving a request for at least one quantum computer to analyze a library of compounds of interest, the request comprising first input characterizing the library of compounds to be analyzed by the at least one quantum computer and second input identifying a set of properties of interest, identifying, for each property of the set of properties of interest, a rule reflecting a binary status of a compound with respect to the property, and determining, for each compound in the library of compounds, a binary value for the compound with respect to each property of the set of properties of interest, to generate a set of binary property values for compounds of the library of compounds of interest.
  • the method further comprises triggering the at least one quantum computer to analyze the set of binary property values for the compounds of the library in connection with an objective function with which the at least one quantum computer is configured, to determine a compound for which corresponding binary property values generate a minimum value for the objective function, receiving from the at least one quantum computer an identification of the compound, and outputting information regarding the compound as a result of the analysis requested in the request.
  • a method comprising receiving a request for at least one computer to analyze a library of compounds of interest, the request comprising first input characterizing the library of compounds to be analyzed by the at least one computer and second input identifying a set of properties of interest, identifying, for each property of the set of properties of interest, a rule reflecting a binary status of a compound with respect to the property, and determining, for each compound in the library of compounds, a binary value for the compound with respect to each property of the set of properties of interest, to generate a set of binary property values for compounds of the library of compounds of interest.
  • the method further comprises triggering the at least one computer to analyze the set of binary property values for the compounds of the library in connection with an objective function with which the at least one computer is configured, to determine a compound for which corresponding binary property values generate a minimum value for the objective function, receiving from the at least one computer an identification of the compound, and outputting information regarding the compound as a result of the analysis requested in the request.
  • the techniques described herein may be embodied in computer-executable instructions implemented as software, including as application software, system software, firmware, middleware, embedded code, or any other suitable type of computer code.
  • Such computer-executable instructions may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
  • these computer-executable instructions may be implemented in any suitable manner, including as a number of functional facilities, each providing one or more operations to complete execution of algorithms operating according to these techniques.
  • a “functional facility,” however instantiated, is a structural component of a computer system that, when integrated with and executed by one or more computers, causes the one or more computers to perform a specific operational role.
  • a functional facility may be a portion of or an entire software element.
  • a functional facility may be implemented as a function of a process, or as a discrete process, or as any other suitable unit of processing.
  • each functional facility may be implemented in its own way; all need not be implemented the same way.
  • these functional facilities may be executed in parallel and/or serially, as appropriate, and may pass information between one another using a shared memory on the computer(s) on which they are executing, using a message passing protocol, or in any other suitable way.
  • functional facilities include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • functionality of the functional facilities may be combined or distributed as desired in the systems in which they operate.
  • one or more functional facilities carrying out techniques herein may together form a complete software package.
  • These functional facilities may, in alternative embodiments, be adapted to interact with other, unrelated functional facilities and/or processes, to implement a software program application.
  • Some exemplary functional facilities have been described herein for carrying out one or more tasks. It should be appreciated, though, that the functional facilities and division of tasks described is merely illustrative of the type of functional facilities that may implement the exemplary techniques described herein, and that embodiments are not limited to being implemented in any specific number, division, or type of functional facilities. In some implementations, all functionalities may be implemented in a single functional facility. It should also be appreciated that, in some implementations, some of the functional facilities described herein may be implemented together with or separately from others (i.e., as a single unit or separate units), or some of these functional facilities may not be implemented.
  • Computer-executable instructions implementing the techniques described herein may, in some embodiments, be encoded on one or more computer-readable media to provide functionality to the media.
  • Computer-readable media include magnetic media such as a hard disk drive, optical media such as a Compact Disk (CD) or a Digital Versatile Disk (DVD), a persistent or non-persistent solid-state memory (e.g., Flash memory, Magnetic RAM, etc.), or any other suitable storage media.
  • Such a computer-readable medium may be implemented in any suitable manner, including as computer-readable storage media 1106 of FIG. 11 described below (i.e., as a portion of a computing device 1100) or as a stand-alone, separate storage medium.
  • “computer-readable media” refers to tangible storage media. Tangible storage media are non-transitory and have at least one physical, structural component.
  • at least one physical, structural component has at least one physical property that may be altered in some way during a process of creating the medium with embedded information, a process of recording information thereon, or any other process of encoding the medium with information. For example, a magnetization state of a portion of a physical structure of a computer-readable medium may be altered during a recording process.
  • these instructions may be executed on one or more suitable computing device(s) operating in any suitable computer system, including the exemplary computer system of FIG. 1A, or one or more computing devices (or one or more processors of one or more computing devices) may be programmed to execute the computer-executable instructions.
  • a computing device or processor may be programmed to execute instructions when the instructions are stored in a manner accessible to the computing device or processor, such as in a data store (e.g., an on-chip cache or instruction register, a computer-readable storage medium accessible via a bus, a computer-readable storage medium accessible via one or more networks and accessible by the device/processor, etc.).
  • Functional facilities comprising these computer-executable instructions may be integrated with and direct the operation of a single multi-purpose programmable digital computing device, a coordinated system of two or more multi-purpose computing devices sharing processing power and jointly carrying out the techniques described herein, a single computing device or coordinated system of computing devices (co-located or geographically distributed) dedicated to executing the techniques described herein, one or more Field-Programmable Gate Arrays (FPGAs) for carrying out the techniques described herein, or any other suitable system.
  • FIG. 11 illustrates one exemplary implementation of a computing device in the form of a computing device 1100 that may be used in a system implementing techniques described herein, although others are possible. It should be appreciated that FIG. 11 is intended neither to be a depiction of necessary components for a computing device to execute a compound representation facility 122, a compound model facility 124, and/or a compound analysis facility 126 in accordance with the principles described herein, nor a comprehensive depiction.
  • Computing device 1100 may comprise at least one processor 1102, a network adapter 1104, and computer-readable storage media 1106.
  • Computing device 1100 may be, for example, a desktop or laptop personal computer, a personal digital assistant (PDA), a smart mobile phone, a server, a wireless access point or other networking element, or any other suitable computing device.
  • Network adapter 1104 may be any suitable hardware and/or software to enable the computing device 1100 to communicate wired and/or wirelessly with any other suitable computing device over any suitable computing network.
  • the computing network may include wireless access points, switches, routers, gateways, and/or other networking equipment as well as any suitable wired and/or wireless communication medium or media for exchanging data between two or more computers, including the Internet.
  • Computer-readable media 1106 may be adapted to store data to be processed and/or instructions to be executed by processor 1102.
  • Processor 1102 enables processing of data and execution of instructions.
  • the data and instructions may be stored on the computer-readable storage media 1106.
  • the data and instructions stored on computer-readable storage media 1106 may comprise computer-executable instructions implementing techniques which operate according to the principles described herein.
  • computer-readable storage media 1106 stores computer-executable instructions implementing various facilities and storing various information as described above.
  • Computer-readable storage media 1106 may store a compound representation facility 122, a compound model facility 124, and/or a compound analysis facility 126.
  • a computing device may additionally have one or more components and peripherals, including input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computing device may receive input information through speech recognition or in other audible format.
  • Embodiments have been described where the techniques are implemented in circuitry and/or computer-executable instructions. It should be appreciated that some embodiments may be in the form of a method, of which at least one example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
  • exemplary is used herein to mean serving as an example, instance, or illustration. Any embodiment, implementation, process, feature, etc. described herein as exemplary should therefore be understood to be an illustrative example and should not be understood to be a preferred or advantageous example unless otherwise indicated.
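The method summarized in the first two items of the list above (a binary rule per property of interest, then an objective function minimized over the resulting binary property values) can be sketched as follows. This is a minimal classical illustration under stated assumptions: the property names, thresholds, weights, and the linear form of the objective are hypothetical, and an exhaustive scan stands in for whatever computer (for example, a quantum annealer) an embodiment is configured to use.

```python
from typing import Callable, Dict, List

# Hypothetical rules: each maps a compound record to a binary status (0 or 1).
RULES: Dict[str, Callable[[dict], int]] = {
    "low_weight": lambda c: int(c["mol_weight"] <= 500.0),
    "soluble": lambda c: int(c["log_s"] >= -4.0),
}

# Hypothetical weights: a negative weight rewards having the property.
WEIGHTS: Dict[str, float] = {"low_weight": -1.0, "soluble": -2.0}


def binary_property_values(library: List[dict]) -> Dict[str, Dict[str, int]]:
    """Apply every rule to every compound of the library."""
    return {c["name"]: {p: rule(c) for p, rule in RULES.items()} for c in library}


def objective(bits: Dict[str, int]) -> float:
    """Linear objective over binary property values (an assumed form)."""
    return sum(WEIGHTS[p] * b for p, b in bits.items())


def best_compound(library: List[dict]) -> str:
    """Return the name of the compound whose binary values minimize the objective."""
    values = binary_property_values(library)
    return min(values, key=lambda name: objective(values[name]))


# Hypothetical three-compound library with made-up descriptor values.
library = [
    {"name": "A", "mol_weight": 320.0, "log_s": -5.2},
    {"name": "B", "mol_weight": 610.0, "log_s": -3.1},
    {"name": "C", "mol_weight": 450.0, "log_s": -2.0},
]
print(best_compound(library))
```

An exhaustive scan of this kind is only practical for small libraries; it is shown here solely to make the roles of the rules, the binary values, and the objective function concrete.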

Abstract

Described herein are embodiments of systems and methods for determining and/or analyzing properties of compounds. In some embodiments described herein, the systems and methods can include encoding properties of the compounds. In some embodiments, the systems and methods can include training a decoder to identify properties from the compound representations of the compounds. In some embodiments, the systems and methods can include decoding a compound representation of a compound to output properties of the compound. In some embodiments, the systems and methods can include decoding a compound representation of a compound to synthesize the compound. In some embodiments, the systems and methods can include training a model to predict properties of the compounds from compound representations of the compounds. In some embodiments, the systems and methods can include training the model to use training data to predict new properties of the compounds.

Description

COMPOUND REPRESENTATION AND PROPERTY ANALYSIS AT SCALE
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 63/442,618, filed February 1, 2023, and titled “Compound representation and property analysis at scale,” the contents of which are incorporated by reference herein in their entirety.
BACKGROUND
Different chemical compounds may have different properties. Such properties may affect how a compound interacts with its environment, including its interactions with other compounds. The properties may therefore affect whether a compound is able to perform a desired function in interacting with an environment of the compound, the circumstances under which the compound is able to perform the function, or the effectiveness of the compound in performing the function, or otherwise affect the compound’s interactions.
SUMMARY
In one embodiment, there is provided a method comprising creating, using a first model, a second model for predicting information regarding a property of compounds input to the second model, wherein the first model was trained using compound information to generate at least one other output different from the information regarding the property. Creating the second model comprises editing the first model to generate the second model and training the second model using training data for the property.
In some embodiments, the first model comprises a first neural network, and editing the first model to generate the second model comprises adding at least one layer to, removing at least one layer from, and/or adjusting at least one layer of the first neural network to generate a second neural network.
In some embodiments, adjusting at least one layer of the first neural network comprises adjusting values of one or more parameters of the at least one layer of the first neural network.
In some embodiments, the first model comprises a first neural network, editing the first model to generate the second model comprises adding a classifier to the first neural network, and training the second model comprises training the classifier using the training data for the property. In some embodiments, the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
In some embodiments, the method further comprises generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds, generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
In some embodiments, the first neural network trained using compound information is trained to identify compounds that comply with at least one chemical rule.
In some embodiments, the method further comprises receiving a request to analyze a library of compounds of interest, the request comprising input characterizing the library of compounds to be analyzed; determining, using the second model, for each compound in the library of compounds, a value for the compound with respect to the property, to generate a set of values of the property for compounds of the library of compounds; and outputting information regarding the set of values of the property for the compounds of the library of compounds.
In another embodiment, there is provided a method comprising creating, using a first model, a second model for predicting information regarding a functional property of compounds input to the second model, wherein the first model was trained using a first amount of compound information to identify compounds that comply with at least one rule of physics and/or chemistry regarding compounds. Creating the second model comprises editing the first model to generate the second model and training the second model using training data for the property, the training data being a second amount of training data that is less than the first amount of compound information.
In some embodiments, the first model comprises a first neural network; editing the first model to generate the second model comprises adding a classifier to the first neural network; and training the second model comprises training the classifier using the training data for the property. In some embodiments, the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
In some embodiments, a number of compounds in the plurality of compounds of the training data used to train the second model is less than a number of compounds in the compound information used to train the first model.
In some embodiments, the method further comprises generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds: generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
In a further embodiment, there is provided a method comprising generating a digital representation of a compound. The generating comprises receiving an identification of a plurality of atoms and/or molecules of a compound, receiving information regarding interconnections of the plurality of atoms and/or molecules of the compound, receiving information regarding distances between the plurality of atoms and/or molecules of the compound, and generating the digital representation of the compound using the identification of the plurality of atoms and/or molecules, the information regarding the interconnections, and the information regarding the distances.
In some embodiments, receiving the information regarding the distances comprises receiving information regarding a three-dimensional (3D) structure and/or arrangement of the plurality of atoms and/or molecules of the compound.
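As an illustration of how distance information might be derived from a three-dimensional arrangement of atoms, the sketch below computes a pairwise Euclidean distance matrix from atomic coordinates. The function name and the coordinates are hypothetical, and the source does not state that distances are obtained this way.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def distance_matrix(coords: List[Coord]) -> List[List[float]]:
    """Return the matrix of Euclidean distances between every pair of atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Hypothetical coordinates (in angstroms) for a three-atom arrangement.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
d = distance_matrix(coords)
print(d[0][1], d[0][2])
```

The resulting matrix is symmetric with a zero diagonal, so an implementation working at scale would likely store only one triangle of it.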
In some embodiments, generating the digital representation of the compound comprises applying at least one transformer to the identification of the plurality of atoms and/or molecules of the compound.
In some embodiments, the identification of the plurality of atoms and/or molecules of the compound comprises a graph representation of the compound.
In some embodiments, the method further comprises generating the graph representation of the compound, wherein generating the graph representation of the compound includes: encoding the plurality of atoms and/or molecules of the compound as a plurality of nodes in the graph representation; encoding the interconnections of the plurality of atoms and/or molecules of the compound as a plurality of edges in the graph representation; and iteratively traversing nodes in the plurality of nodes along edges in the plurality of edges to update the graph representation.
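The graph encoding and iterative traversal described above can be sketched as a simple message-passing loop. This is an illustrative stand-in rather than the method of any embodiment: the scalar node feature (atomic number), the equal-weight update rule, and all function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical scalar node feature: the atomic number of each element.
ATOMIC_NUMBER: Dict[str, float] = {"H": 1.0, "C": 6.0, "N": 7.0, "O": 8.0}


def build_graph(atoms: List[str], bonds: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Encode atoms as nodes and bonds as undirected edges (adjacency lists)."""
    neighbors: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def message_passing(atoms: List[str], bonds: List[Tuple[int, int]], rounds: int = 2) -> List[float]:
    """Iteratively update each node from its neighbors along the edges."""
    neighbors = build_graph(atoms, bonds)
    state = [ATOMIC_NUMBER[symbol] for symbol in atoms]
    for _ in range(rounds):
        new_state = []
        for i, value in enumerate(state):
            messages = [state[j] for j in neighbors[i]]
            mean_message = sum(messages) / len(messages) if messages else 0.0
            # Assumed update rule: equal mix of own state and mean neighbor state.
            new_state.append(0.5 * value + 0.5 * mean_message)
        state = new_state
    return state


# Heavy atoms of a C-C-O chain, with bonds given as index pairs.
print(message_passing(["C", "C", "O"], [(0, 1), (1, 2)], rounds=1))
```

After each round, a node's state reflects atoms one more bond away, which is why repeated traversal lets a fixed-size per-node value summarize a growing neighborhood of the compound.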
In a further embodiment, there is provided a method comprising a combination of any two or more of the foregoing methods.
In another embodiment, there is provided an apparatus comprising at least one processor and at least one computer-readable storage medium having encoded thereon executable instructions that, when executed by the at least one processor, cause the at least one processor to carry out any one or any combination of the foregoing methods.
In a further embodiment, there is provided at least one computer-readable storage medium encoded with computer-executable instructions that, when executed by a computer, cause the computer to carry out any one or any combination of the foregoing methods.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:
FIG. 1A is a schematic diagram of a compound analysis system using quantum computing for property analysis of compounds, in accordance with some embodiments of the technology described herein;
FIG. 1B and FIG. 1C are diagrams of a superconducting flux qubit with which some embodiments may operate;
FIG. 2A is a flowchart of an illustrative method 200 of determining and/or analyzing properties of compounds, in accordance with some embodiments of the technology described herein;
FIG. 2B is a diagram for encoding compounds to predict properties of the compounds, in accordance with some embodiments of the technology described herein;
FIG. 2C is a diagram of using quantum computing for analysis of the predicted properties of the compounds, in accordance with some embodiments of the technology described herein;
FIG. 3A is a flowchart of an illustrative method 300 of encoding properties of the compounds, in accordance with some embodiments of the technology described herein;
FIG. 3B is a diagram of encoding compounds from a graph, in accordance with some embodiments of the technology described herein;
FIG. 3C is a diagram of using iterative passing to traverse nodes along the edges to update the graph, in accordance with some embodiments of the technology described herein;
FIG. 3D illustrates an example of a technique for encoding information regarding a compound, which may be implemented in some embodiments;
FIG. 4A is a flowchart of an illustrative method 400 of training one or more models to generate compound representations of compounds and identify property values of the compounds from the compound representations of the compounds to establish a trained library of compounds, in accordance with some embodiments of the technology described herein;
FIG. 4B is a diagram of a model for encoding and decoding compounds, in accordance with some embodiments of the technology described herein;
FIG. 4C is a diagram of training the model to predict properties of compounds, in accordance with some embodiments of the technology described herein;
FIG. 5A is a flowchart of an illustrative method 500 of decoding a compound representation of a compound to identify properties of the compound, in accordance with some embodiments of the technology described herein;
FIG. 5B is a flowchart of an illustrative method 550 of decoding a compound representation of a compound to synthesize the compound, in accordance with some embodiments of the technology described herein;
FIG. 6A is a flowchart of an illustrative method 600 of training a model to predict properties of the compounds from compound representations of the compounds, in accordance with some embodiments of the technology described herein;
FIG. 6B is a diagram of adjusting the model for predicting properties of compounds, in accordance with some embodiments of the technology described herein;
FIG. 6C is a flowchart of an illustrative method 650 of training the model to use training data to predict new properties of the compounds, in accordance with some embodiments of the technology described herein;
FIG. 6D is a diagram of the trained model outputting the predicted properties, in accordance with some embodiments of the technology described herein;
FIG. 7A is a flowchart of an illustrative method 700 of identifying compounds to analyze, in accordance with some embodiments of the technology described herein;
FIG. 7B is a diagram of obtained chemical structures with which some embodiments may operate;
FIG. 7C is a diagram of aligned chemical conformers with which some embodiments may operate;
FIG. 7D is a diagram of binding pockets in a protein target with which some embodiments may operate;
FIG. 7E and FIG. 7F are diagrams of docked ligands with which some embodiments may operate;
FIG. 7G is a diagram of obtaining known inhibitors with which some embodiments may operate;
FIG. 7H is a diagram of sanitizing and relaxing conformers with which some embodiments may operate;
FIG. 7I, FIG. 7J, and FIG. 7K are diagrams of pharmacophores for a ligand with which some embodiments may operate;
FIG. 7L and FIG. 7M are diagrams of a protein binding site with which some embodiments may operate;
FIG. 8A is a flowchart of an illustrative method 800 of determining and/or analyzing criteria, in accordance with some embodiments of the technology described herein;
FIG. 8B is a diagram of a set of compounds with which some embodiments may operate;
FIG. 8C is a diagram of one or more criteria with which some embodiments may operate;
FIG. 8D is a diagram of one or more criteria with which some embodiments may operate;
FIG. 9 is a flowchart of an illustrative method 900 of configuring the quantum annealer, in accordance with some embodiments of the technology described herein;
FIG. 10A is a flowchart of an illustrative method 1000 of refining outputted compounds, in accordance with some embodiments of the technology described herein;
FIG. 10B is a diagram of criteria with which some embodiments may operate;
FIG. 10C is a diagram of the quantum annealer with which some embodiments may operate;
FIG. 10D is a diagram of the compound refining with which some embodiments may operate;
FIG. 10E is a diagram of the identified compound with which some embodiments may operate; and
FIG. 11 is a schematic diagram of an illustrative computing device with which aspects described herein may be implemented.
DETAILED DESCRIPTION
Described herein are embodiments of techniques for generative and/or predictive artificial intelligence (AI)-driven modeling for chemistry, across different compounds, types of compounds, uses for compounds, and industries. Some techniques described herein may include and enable generating in silico novel compounds and generating structural and/or functional properties of unknown (or known) compounds, which compounds may in some cases be specific conformers of molecules. Through some techniques described herein, systems may create, maintain, supplement, analyze, and otherwise interact with compound libraries extending into the billions of compounds, or potentially on the order of 10^60 compounds or beyond in some cases, and including information on an array of properties of such compounds. Some embodiments may enable artificial intelligence (AI)-driven generative processing of information on previously-unknown or understudied properties of known or unknown compounds to build such a library. And some embodiments may enable identification of the compound(s) that represent the global maximum for performance of a desired task or function rather than, as is conventional, merely a local performance maximum or other compound that may perform well according to available data for a small set of known compounds. Using some of the techniques described herein may enable a transformation of the drug discovery process (or compound discovery for other fields), including a transformation of analyzing previously-unknown and/or previously-understudied compounds, by bringing the analysis timeline from lifetimes or decades down to hours, days, or weeks and increasing the precision and reliability of in silico analysis.
Some embodiments described herein may include AI transformer models or other models for, at high speed and high reliability, generating previously unknown information regarding compounds, including for previously-unknown compounds. Some such techniques may leverage one or more models that, through training, have learned rules of physics and/or chemistry that define possible or functional compounds (in general and/or in specific industries or use cases), and/or that define compounds that are more likely to be functional in a particular context. Such rules may, in some embodiments, include those such as “Lipinski’s Rule of 5,” which defines a space of druglike molecules that have pharmacokinetic properties within the human body that make them more likely candidates for drugs than other molecules that do not meet the rule. Some such trained models may define a continuous representation of chemical space that allows for fast analysis of compounds and fast, reliable generation of information regarding compounds, such as for fast, reliable generative AI for compounds.
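For concreteness, Lipinski's rule of five can be written as an explicit check on four molecular descriptors: molecular weight of at most 500 daltons, an octanol-water partition coefficient (log P) of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The sketch below is illustrative only. The class and function names and the descriptor values are hypothetical, the common formulation that tolerates at most one violation is assumed, and the models described above are said to learn such rules through training rather than apply them as hand-written code.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float      # molecular weight in daltons
    log_p: float           # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria a compound violates."""
    return sum([
        d.mol_weight > 500.0,
        d.log_p > 5.0,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: druglike if at most `max_violations` criteria fail."""
    return lipinski_violations(d) <= max_violations


# Hypothetical descriptor values, not measurements of real compounds.
small = Descriptors(mol_weight=350.0, log_p=2.1, h_bond_donors=2, h_bond_acceptors=5)
large = Descriptors(mol_weight=720.0, log_p=6.3, h_bond_donors=2, h_bond_acceptors=12)
print(passes_rule_of_five(small), passes_rule_of_five(large))
```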
In some embodiments, such models trained with rules for chemistry may be used for reliable transfer learning, such as by editing parts of an existing trained model (e.g., by adding or adjusting output layers, in a case that a model is a neural network). This may include creating a new model for prediction of whether input compounds have a property or predicting a value of such a property for input compounds, by editing an existing model that is trained with information regarding rules of physics and/or chemistry and then training the edited model to operate with that property. This may enable generation of property predictions for previously-unknown or previously-understudied properties. This may also enable generation of property predictions for previously-unknown or previously-understudied compounds.
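The editing-then-training pattern described above can be sketched in a few lines. In this minimal, framework-free illustration, a fixed function stands in for the retained layers of an existing trained model, and a newly added logistic-regression classifier is the only part trained on a small labeled set for the new property. The encoder weights, the input vectors, the labels, and the hyperparameters are all hypothetical; a practical system would presumably use a neural-network library and a far larger pretrained model.

```python
import math
from typing import List, Sequence, Tuple


def pretrained_encoder(x: Sequence[float]) -> List[float]:
    """Stand-in for the frozen layers of the first, already-trained model."""
    frozen = [[0.5, -0.2], [0.1, 0.8]]  # hypothetical weights, never updated
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in frozen]


def train_head(features: List[List[float]], labels: List[int],
               epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Train only the newly added classifier (logistic regression) by SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * grad * v for w, v in zip(weights, f)]
            bias -= lr * grad
    return weights, bias


def predict(x: Sequence[float], weights: List[float], bias: float) -> bool:
    """Second model: frozen encoder followed by the trained classifier."""
    f = pretrained_encoder(x)
    z = sum(w * v for w, v in zip(weights, f)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5


# Small hypothetical training set: compound vectors and has-property labels.
compounds = [(1.0, 0.0), (2.0, 0.5), (-1.0, 0.0), (-2.0, -0.5)]
labels = [1, 1, 0, 0]
weights, bias = train_head([pretrained_encoder(x) for x in compounds], labels)
print(predict((1.5, 0.2), weights, bias))
```

Because only the small classifier is fitted, the amount of labeled data needed for the new property can be much smaller than the data used to train the original model, which is the motivation the surrounding text gives for this approach.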
In some embodiments, such a model created by editing a previously-existing model may be used to build (e.g., supplement) a library of information regarding compounds, which may include billions of compounds or even on the order of 10^60 compounds or beyond. For example, such a library may include information on a large scale of compounds and potentially on a large scale of properties. A model may be created using techniques described herein to make predictions regarding a property not previously defined in the library, after which the model may be used to generate predictions for that property for all compounds of the library (e.g., billions of compounds or up to 10^60 compounds or beyond), on the order of hours, days, or weeks. Such a model may be created and applied even where there is limited available data for the property, which may have prevented training of a reliable model using conventional training and model creation techniques. Some techniques described herein enable generation, training, and use of the model on a practical research timeline, rather than the years or decades that conventional techniques would require for attempting this processing.
Also described herein are techniques for creating reliable representations of compounds, which may be processed using techniques herein. Such a representation may include structural and/or functional information regarding a compound, such as information on locations of and/or interactions of atoms of a compound. Some techniques described below also enable fast, reliable processing of information regarding a large-scale library of compounds. Such processing of information may enable an identification, from the library, of one or a set of compounds that may meet identified performance criteria for a task or function, such as the global best-performing compound(s) for a desired task or function or meeting other criteria. In this way, large scale libraries of information regarding compounds and properties of compounds may be analyzed, including by (in some cases) analyzing information on every compound in the library, on a time scale on the order of hours, days, or weeks.
Compounds
It should be appreciated that embodiments described herein may operate with a variety of compounds and with a variety of properties of interest for those compounds, and are not limited to operating with any particular type of compound or industry. In some embodiments, compounds may be any molecule that includes two or more atoms.
In some embodiments, compounds with which techniques described herein may operate may be or include drugs, which may include pharmaceuticals, biologics, medications, medicines, or other compounds that have a physiological or psychological effect. In some cases, such drugs may include proteins (e.g., antibodies or other proteins) or parts of proteins such as fragments or peptides. Such drugs may additionally or alternatively include nucleotides or nucleic acids, such as DNAs, RNAs, oligonucleotides, peptides, or others. In some embodiments, compounds may be included in antibodies, antisense oligonucleotides, mRNA vaccines, peptide drugs, Proteolysis-targeting chimeric molecules (PROTACs), small interfering RNA (siRNA), or drug delivery molecules. In cases in which compounds are drugs, properties that may be analyzed may include whether the compound is blood brain barrier penetrant, whether it is bioavailable, whether it can bind to a specified target or how specifically it binds, how volatile the compound is, how thermostable the compound is, whether it satisfies specified criteria for “ease” of synthesis or manufacturing or distribution at scale, or other properties.
While in some embodiments, compounds may be drugs, embodiments are not so limited. Other molecules performing other functions or serving other functions may be analyzed, including for functions or purposes that are not biological or pharmaceutical. In some embodiments, the compounds can be used in battery development, petrochemical industry, biodegradable plastics, veterinary medicine, organic light-emitting diodes (OLED), colorants, dyes, paints, agriculture, or pesticides. In such other embodiments and for other uses, any suitable compound property may be analyzed.
Conventional Drug Discovery
Drug discovery is a process by which to identify a drug that may perform a particular desired function with desired properties. Traditionally, drug discovery was a manual trial-and-error process, with drugs being synthesized (e.g., manufactured, isolated, or otherwise generated) and then tested to determine their properties and how well they performed the function or whether they had the desired properties. Such synthesizing and testing took a great deal of expense and effort and thus was limited in the drugs (e.g., the number and/or types of drugs) that could be and were analyzed. Moreover, the techniques were limited to drugs that already could be synthesized with existing equipment, which in many cases may have been previously-synthesized drugs, limiting the ability to discover new drugs.
Attempts have been made at performing in silico drug discovery using computerized processes. In some cases, such processes may have been assisted with machine learning engines. With machine learning, an engine may be given a structure of a known drug, such as an identification of atoms within the drug and an identification of which atoms are bound to which other atoms, as well as properties that had been determined for that known drug using testing. The machine learning engine may then attempt to identify from the input data a relationship between the atoms of a drug or an arrangement of atoms within a drug, and a performance of the drug with respect to a property. Once it has inferred such relationships, the machine learning engine may be queried to determine properties for other drugs that have not yet been evaluated. Through repeated querying, a user may seek to leverage the trained machine learning engine to identify a drug that may perform better with respect to the property than the known drugs that were used as training.
While this computerized machine learning approach may be faster than the manual trial-and-error approach, it still has limitations. The technique relies on training a model using existing drug data. The performance of the model is linked to that existing data, such that if the existing data includes errors or includes limited data, the wrong relationships may be learned by the engine, which can compromise the outputs from the engine.
Such limitations on data may stem from imprecision of existing data, such as imprecision of representations of compounds. For example, existing descriptors for compounds can take the form of identifiers for physiochemical properties of a compound. Example descriptors are a number of atoms and molecular weight, and a chemical fingerprint indicating with bit vectors presence or absence of particular chemical fragments. Such conventional representations do not include two-dimensional (2D) and three-dimensional (3D) structural information. For example, a bit vector identifying that a fragment is present in a molecule does not describe occurrences of that fragment in the molecule, such as information about the amount of that fragment or its spatial positioning within a molecule. Existing representation by simplified molecular-input line-entry system (SMILES) also has drawbacks. SMILES can provide a character-based description of a molecule that is easy to manipulate, but it is known that multiple different SMILES strings represent the same compound. As such, a SMILES string does not and cannot uniquely identify a compound, such as a particular conformer of a molecule, and thus an output of a SMILES string is not useful for identifying a particular compound. Another difficulty with SMILES strings is that the strings correspond to a 2D representation of a molecule, but molecules do not exist in 2D. The 3D structure of a molecule is not represented by and cannot be known from a SMILES string, meaning a conformer of a molecule cannot be identified with a SMILES string.
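The loss of occurrence information in a presence/absence fingerprint can be illustrated with a minimal sketch. The fragment vocabulary and the two molecules below are hypothetical and chosen only for illustration; they are not taken from any fingerprint scheme described herein.

```python
# Hypothetical fragment vocabulary; a real fingerprint would use many more bits.
FRAGMENTS = ["hydroxyl", "carbonyl", "amine", "phenyl"]

def presence_bits(fragment_list):
    """Bit vector: 1 if the fragment occurs at least once, else 0."""
    present = set(fragment_list)
    return [1 if f in present else 0 for f in FRAGMENTS]

def occurrence_counts(fragment_list):
    """Count vector: how many times each fragment occurs."""
    return [fragment_list.count(f) for f in FRAGMENTS]

# Two made-up molecules that differ only in how many hydroxyl groups they carry.
mol_a = ["hydroxyl", "phenyl"]
mol_b = ["hydroxyl", "hydroxyl", "hydroxyl", "phenyl"]

bits_a, bits_b = presence_bits(mol_a), presence_bits(mol_b)
counts_a, counts_b = occurrence_counts(mol_a), occurrence_counts(mol_b)

print(bits_a == bits_b)      # True: the bit vectors collide
print(counts_a == counts_b)  # False: the counts differ
```

In this sketch the two molecules map to the same bit vector even though their fragment counts differ, and neither vector carries any information on where a fragment sits within the molecule.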
A conventional learning approach can also be limited by insufficient comprehensiveness of the input drug data, because existing chemical property prediction models suffer from extrapolation issues when tested on compounds dissimilar from the compounds found in the training dataset. Due to limitations of the training that was done, a trained machine learning model is only able to produce estimates for new drugs that are similar to the drugs it has already seen. Such limitation on similarity means these conventional techniques are limited to identifying properties for drugs that have only minor structural variances from the input compounds and the model cannot identify (or cannot reliably identify) properties of drugs with significant structural variances from that input. Since input data may not be available for a large number of drugs across a large diversity of structures, even in cases where a model is repeatedly queried in an attempt to determine a best-performing drug, the model may be limited to identifying only the best-performing similar drug, rather than a best-performing drug. When considering a landscape of all drug candidates, where the input drugs represent only one portion of that drug landscape, this conventional machine learning approach might be considered to be a determination of a local best performing drug candidate in that portion of the landscape, and not a determination of a global best performing drug candidate across the entirety of that landscape or a determination of a best performing drug candidate over more than just the one portion of the drug landscape.
Insufficient comprehensiveness of data also creates limitations in conventional model training approaches, limiting the utility of such models and model training and inhibiting use. To train a model to make predictions regarding compounds, such as a prediction regarding whether a compound has a property or a numeric value for a property of a compound, there needs to be sufficient information for patterns to be identified and/or relationships to be learned. With insufficient data, the learning is insufficient and low-quality output is generated, which may include (and often does include) incorrect output. Training of models is conventionally limited, then, to compounds and/or properties for which a large amount of data can be obtained and input to the model during a training phase. Often, this means the conventional training is limited to compounds and/or properties included in publicly available data sets. When research calls for analysis of a compound or a property not previously researched and thus not included in publicly available data sets, the lack of available volumes of data means that models cannot be trained for such compounds/properties. This limits the availability of model training for research and, by limiting the available models, limits the ability to predict property values for compounds using models.
Conventional techniques for computational processing are limited by data availability, as discussed above, but are also limited by practical considerations on the time of their processing. The type of machine learning analysis described above includes two phases, a training phase where the input training data is initially processed and a production phase where the trained model is used to generate or analyze new data. Both take time, particularly when the model is queried repeatedly during the production phase to generate or determine properties of many different drugs or drugs are analyzed in an attempt to find a better- or best-performing candidate among options. If such an approach were to be used with an entire landscape of drug candidates, the analysis across both phases could take many lifetimes. Even if the analysis were distributed across a system of processors sharing resources and results, the analysis could take decades. These timelines are governed in part by the manner in which conventional machine learning is done and the manner in which conventional computing hardware operates and the manner in which conventional transistor-based central processing units are provided with data, process the data, and output the data, and affect any large-scale computational processing using this hardware, beyond just drug property generation or drug analysis. These timelines are impractical for any computational processing, and particularly so for research and development of drug candidates that are sought for treating or curing diseases in the near term or for commercialization. Accordingly, even if sufficient data were to be available to train a machine learning model for more types of drugs with different structures and different properties, the processing across a large-scale drug landscape still could not be performed with these conventional machine learning approaches. 
The inventors have also recognized and appreciated that conventional computational processes, including conventional machine learning processes, for analyzing drug candidates are limited in their effectiveness and accuracy or comprehensiveness of their output. Moreover, the inventors have recognized and appreciated that these limitations are inherent in the architecture and design of these conventional computational techniques and cannot be mitigated even with more data or more computational resources.
The inventors have therefore recognized and appreciated the desirability of computational techniques for analysis of compound properties that do not depend on conventional learning techniques, do not depend on conventional molecule representations, and/or do not depend on conventional arrangements or uses of central processing units.
Compound Representation and Property Analysis at Scale
As discussed above, conventional approaches to drug discovery or other computational analysis of compounds have been limited.
Some embodiments described herein include techniques for analyzing properties of compounds, such as to determine a compound or set of compounds that perform a desired function or meet other criteria. In embodiments that analyze compound property values, it may be advantageous to generate a library of values for properties of compounds, including for known or unknown compounds and for known properties or previously-unknown or previously-understudied properties as well as for known or well-studied properties. It may be advantageous in some embodiments to create such a library of billions of compounds or numbers of compounds up to 10^60 or beyond, with any number of properties, potentially including dozens of properties.
Accordingly, described herein are some embodiments of techniques for generating compounds or generating property information for compounds, including in some embodiments using generative AI techniques. Some such embodiments may include processing a representation of a compound to predict a value for a property of the compound. In some such embodiments, techniques described herein can include training a model for predicting, for a compound, a value for a property of the compound. Some such models may be created by editing an existing model that was previously trained to predict another compound property, such as by adding, removing, or otherwise adjusting one or more layers of an existing model in a case in which a model is a neural network or other model including layers. For example, in some cases a first model may be trained with training data regarding compounds, such as using available property data on compounds. Through training, the model may learn information regarding compounds in general and/or regarding one or more properties of compounds. For example, if the model is a neural network, one or more layers of the neural network may learn general information on structure of compounds or structure of compounds that may be candidate compounds (e.g., pharmaceuticals). Such general information could include a general understanding of possible chemical structures, or effective chemical structures for a function, and/or rules of physics or chemistry that define possible compounds or compounds that may be useful in particular contexts or for particular tasks/functions. Through training, the model may also learn information regarding one or more properties, such as to determine or predict for input compounds values for the one or more properties, which may not have been previously known for the input compounds. Such a trained model may be used to analyze a library of compounds to determine values for the property.
Accordingly, in some embodiments, a data store of information regarding compounds may be generated, which may include for each compound of a large set of compounds a value for each property of a set of properties. Such a data store of compound information may, in some cases, include binary, discrete, and/or continuous values for compounds. In some such embodiments that include such a data store of compound information, the data store may be supplemented over time with values for additional properties, for each of the compounds of interest.
In cases in which the data store may include information on a large number of compounds that may be or will be analyzed (including using quantum computing hardware or other computing hardware), such as the 10^60 compounds satisfying the “Lipinski’s Rule of 5,” information may not be available for all properties of all compounds. This may additionally be the case for conformers, where information may be available for one conformer of a molecule but not others. It may be advantageous in some embodiments, therefore, to be able to generate property information for compounds, for addition to a data store of information on compounds as properties and/or compounds are added to a data store.
Accordingly, described herein are techniques for generating values for properties for compounds using models. In some such embodiments, property values are generated using models that have been trained to predict values for a property. In some such embodiments, a model for predicting a value of a property for an input compound may be created by editing an existing model that may have been trained to predict values for another property for input compounds. For example, if such an existing model is a neural network, one or more layers of the neural network may be edited through adding, removing, or adjusting layers to create a new neural network. Other layers may be unchanged in some cases, or changed only in how they connect to the edited layer(s). Types of models other than neural networks may be used. An advantage of using such editing of existing models is to take advantage of the training that had already been done on that existing model, for the other property for which the existing model was trained to predict values. Such prior training may have set parameters of the portions (e.g., layers) of the existing model, and the new model that is generated may include in unedited portions of the existing model some of those parameters. By leveraging these parameters gained from the previously performed training, a training burden may be reduced for creation of the new model. This may be advantageous in a variety of ways, including to mitigate the burden on conventional model training created by insufficiency of data or limitations on access to data. This is particularly the case for properties or for compounds for which limited data exists. As discussed above, when there is limited data available for a property, it can be difficult to train a model to reliably predict a value for a property. 
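As a minimal sketch of such editing, the example below reuses all but the last layer of an existing layered model and attaches a fresh output layer for a new property. The layer sizes, the helper names (`make_layer`, `forward`, `edit_model`), and the random stand-in weights are assumptions made for illustration; no particular architecture or training procedure is implied.

```python
import random

def make_layer(n_in, n_out, rng):
    """A dense layer stored as a weight matrix (list of rows) plus a trainable flag."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return {"weights": weights, "trainable": True}

def forward(model, x):
    """Apply each layer in turn with a ReLU between layers (none after the last)."""
    for i, layer in enumerate(model):
        x = [sum(w * v for w, v in zip(row, x)) for row in layer["weights"]]
        if i < len(model) - 1:
            x = [max(0.0, v) for v in x]
    return x

def edit_model(existing, n_new_outputs, rng):
    """Reuse all but the last layer of `existing` (frozen) and attach a fresh output layer."""
    reused = [{"weights": layer["weights"], "trainable": False} for layer in existing[:-1]]
    n_hidden = len(existing[-2]["weights"])  # output width of the last reused layer
    return reused + [make_layer(n_hidden, n_new_outputs, rng)]

rng = random.Random(0)
# Stand-in for a model already trained on a first property (weights are random here).
existing_model = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 1, rng)]
new_model = edit_model(existing_model, n_new_outputs=1, rng=rng)

trainable = sum(len(row) for layer in new_model if layer["trainable"] for row in layer["weights"])
total = sum(len(row) for layer in new_model for row in layer["weights"])
print(trainable, "of", total, "weights remain to be trained")
```

Only the weights of the new output layer are marked trainable in this sketch, which is one way the reduced training burden described above could arise; other embodiments may adjust or retrain additional layers.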
In practice, this has meant that values cannot be generated for properties with little available data, curbing the ability to research new properties or use new properties in computational drug discovery. In some embodiments described herein, by editing existing models and thereby reusing some of the training previously done for those models, the amount of data needed for training of a model to predict values for a new property (rather than the property/properties for which the existing model(s) were trained) can be reduced, meaning that reliable value prediction can be achieved in some cases for new properties with little available data.
Also described herein are techniques for creating representations of compounds that may capture structural and/or functional properties with higher precision than is available with existing representations. Such representations may, in some embodiments, be used as input to property value prediction models, or may in other embodiments be used in other ways. Conventional representations were imprecise or could be ambiguous as to the molecule being represented, as the same molecule may correspond to multiple representations or a representation may correspond to multiple different molecules. Conventional representations may also not include sufficient information to identify a conformer of a molecule with particularity. In some embodiments described herein, a representation of a compound may include sufficient information to identify a particular conformer of a molecule and include other structural and/or functional information regarding a compound. Such a representation may be useful in a variety of contexts, including in some embodiments for processing a representation of a compound to determine one or more values for one or more properties of a compound, where such values may in some cases be subsequently processed in a property analysis or used in other ways.
In some embodiments, such representations may be useful for analyzing a compound to determine information on properties of the compound, such as by input of the representation of the compound to a model trained to output a value of a property of the compound. A representation that more precisely identifies a compound and more comprehensively includes information regarding a compound, may aid in more accurately predicting properties of compounds. Accordingly, some embodiments may include techniques for generating a graph representation of a compound, where the graph includes nodes that each correspond to an atom of a compound and where edges are defined in the graph that represent bonds between the atoms of the compound. Information regarding each atom of the compound and interactions of the atoms of the compound may be added to the representation, including being associated with nodes and/or edges of the graph. In some embodiments described herein, nodes and/or edges of the graph may be associated with information regarding the compound, such as structural and/or functional properties of the compound. Such structural and/or functional information may, in some embodiments, be information sufficient to uniquely identify a compound, such as uniquely identifying a conformer of a molecule. In some embodiments, the information regarding atoms or interactions of atoms of the compound may be refined through an iterative process by which information regarding an atom is updated based on other atoms of the graph, such as by updating a node based on information regarding other nodes to which the node is connected in the graph by an edge that represents a chemical bond between atoms. Through subsequent iterations, information regarding atoms may be distributed throughout nodes of the graph, even nodes to which a node is not connected by an edge, to reflect potential interactions between atoms that are not directly bonded to one another in a molecule.
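The iterative refinement described above can be sketched with a toy graph. The four-atom chain, the single scalar feature per node, and the mean-based update rule are assumptions chosen for brevity; an actual embodiment may use richer node and edge information and a learned update function.

```python
# Toy graph for a four-atom chain A-B-C-D; atom names and features are made up.
bonds = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
# One scalar feature per node; only atom A starts with a non-zero value.
features = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

def message_pass(features, bonds):
    """One iteration: each node becomes the mean of itself and its bonded neighbours."""
    updated = {}
    for node, neighbours in bonds.items():
        values = [features[node]] + [features[n] for n in neighbours]
        updated[node] = sum(values) / len(values)
    return updated

history = [features]
for _ in range(3):
    history.append(message_pass(history[-1], bonds))

for step, state in enumerate(history):
    print(step, {k: round(v, 3) for k, v in state.items()})
```

In this sketch, information that starts at atom A reaches atom C only on the second iteration and atom D only on the third, even though neither is directly bonded to A.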
In some embodiments, a representation may also be decoded to enable determination of a structure (e.g., a three-dimensional structure) of a compound corresponding to the representation, or to determine other information regarding the represented compound. This may be advantageous in some cases where a compound to which a representation relates is a compound that has been previously unknown and has not been synthesized before. For example, following analysis of compounds a representation may be output for a compound that has passed some criteria related to the analysis, such as to recommend the compound (or multiple compounds) for a particular function or task. By encoding detailed information regarding a compound, the representation may allow for high-reliability decoding of information regarding a compound, which may aid in subsequent synthesizing or other analysis of the compound.
Some embodiments may include training such an encoder to encode property information for a compound in a representation and/or training a decoder to identify from a compound representation the properties of the compound (or other information regarding the compound) that was encoded into the representation. In some cases, this may include training an encoder and/or decoder to perform a high-precision encoding of information regarding a compound that is a particular conformer and high-precision decoding a representation to yield an identification of that conformer as opposed to identifying a molecule without identifying a particular conformer of that molecule, or identifying a group of molecules. For example, some embodiments may include a representation that indicates a distribution of conformers across a conformer space for a compound. Decoding a compound representation for a compound may in some cases include outputting information regarding structural and/or functional properties of the compound, and/or may include outputting information useful in synthesizing the compound (e.g., synthesizing a particular conformer).
In some embodiments, as part of determining a graph representation for a compound, information regarding the interactions between atoms may include, for each atom, information regarding the interactions of the atom with the other atoms of the compound. For example, a graph representation may include a value for a node that is updated based on information regarding surrounding nodes, to indicate interactions between atoms. In some such embodiments, the updates can be performed by an iterative process calculating values of nodes and updating nodes based on surrounding nodes, to traverse the graph and update values across iterations.
In some embodiments that include such a graph representation, the graph representation for a compound may be converted to a non-graphical representation of the compound. Examples of the non-graphical representation include an array and a vector of values. An encoder and a decoder may be trained in some embodiments to encode a compound and/or a graphical representation of a compound and to decode the non-graphical representation until the decoder is able to recreate, with precision and accuracy, the compound (including, where the compound is a particular conformer of a molecule, that conformer) or is otherwise able to decode a non-graphical representation in a manner that satisfies one or more criteria. Various examples of ways in which these techniques and systems can be implemented are described below. It should be appreciated, however, that embodiments are not limited to operating in accordance with these examples. Other embodiments are possible.
It should be appreciated that while examples of compounds and properties are described herein, embodiments are not limited to operating with any particular types of compounds or types of properties. And embodiments that operate with different types of compounds may operate with different properties or types of properties, such that embodiments that operate with drugs may analyze different properties than embodiments that operate with other compounds.
Embodiments are not limited to operating with any particular properties or types of properties. In some embodiments, the properties may be chemical or biochemical properties of a compound that affect how it interacts with its surroundings, such as how it interacts with other compounds. Some such properties may include structural properties that indicate a content or shape of a compound, including intra-compound dimensions. A structural property may include whether a certain atom or molecule is available for binding or an amount of such atom/molecule that is available for binding, such as through being a donor site or acceptor site. A structural property may also include a distance between parts of a compound, such as between two atoms, two fragments, or two other elements of a compound. Other properties may include functional properties. Functional properties may include those that indicate whether and how a compound performs a function. Such properties may be in connection with a particular other compound or target, such as binding affinity for or binding specificity for a target. Such properties may also be in connection with tissues, such as how effectively a compound crosses or does not cross a tissue, including blood-brain barrier permeability, intestinal permeability, or permeability for other tissues or materials. Such properties may also be in connection with how well a compound survives in its environment or under different environmental conditions, such as solubility, thermostability, or other factors. Accordingly, in some embodiments, properties may relate to physiochemical features of a compound.
In some embodiments, building a large-scale library of compounds may include generating representations of a large number of compounds, which may include various permutations or combinations of atoms or molecules. Some such compounds may have been previously known or previously existed or synthesized, while other compounds may be previously-unknown. In some such embodiments, compounds of various numbers of atoms may be generated, or various numbers of fragments, such that the library includes compounds of different sizes. Embodiments are not limited to generating representations of any particular compounds or type of compounds in generating a library. In some cases, a library may be defined that includes all compounds meeting some criteria, such as all compounds that satisfy the “Lipinski’s Rule of 5” or other rule. For example, one or more rules may define a set of compounds that may be possible or may be useful for a particular task or function, which may include rules relating to which element(s) may be included in the compounds and/or which molecules (e.g., fragments) may be included in the compounds, how many atoms or molecules may be included, which types of bonds may be included or may be included between particular pairs of atoms and/or molecules, valence rules that may apply, how large the compound may be, how stable the compound must be, how soluble it may be, how it can be synthesized or manufactured, how it can be transported, or other rules that define structural and/or functional properties of a compound that may affect how it performs a particular task. The rules may be associated with values or ranges of values for the rules that may be acceptable for a given task or function, which may include a given environment in which a task or function is to be performed.
A compound generation facility may iterate over these rules and permutations of various values for the rules to identify different compounds that may result from different permutations of the values for the rules. In some embodiments, known techniques for enumerating compounds may be used together with techniques described herein for representing and analyzing compounds, such as techniques described herein for determining properties of a compound. Following generation of such compounds in one or more representations, in some embodiments the representations may be processed according to techniques described herein. For example, the compounds may be processed to determine representations in accordance with techniques described herein, and the representations may be processed to determine property information and the property information may be analyzed to determine suitability of compounds for a desired task or function.
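A rule-driven enumeration of this kind can be sketched as follows. The fragment names, their descriptor contributions, and the additive way descriptors are combined are all hypothetical simplifications; only the four Rule of 5 limits (molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) are standard, and they are applied here strictly, with no violations allowed.

```python
from itertools import combinations_with_replacement

# Hypothetical fragments with made-up additive descriptor contributions:
# (molecular weight in Da, logP, H-bond donors, H-bond acceptors).
FRAGMENTS = {
    "frag_a": (120.0, 1.5, 1, 2),
    "frag_b": (200.0, 2.5, 0, 3),
    "frag_c": (90.0, -0.5, 3, 4),
}

def descriptors(combo):
    """Sum per-fragment contributions (a crude additive assumption)."""
    return tuple(sum(FRAGMENTS[f][i] for f in combo) for i in range(4))

def passes_rule_of_five(mw, logp, donors, acceptors):
    """Strict Rule of 5 check, with no violations allowed."""
    return mw <= 500 and logp <= 5 and donors <= 5 and acceptors <= 10

def enumerate_library(max_fragments):
    """Yield every fragment combination up to max_fragments that passes the rule."""
    for size in range(1, max_fragments + 1):
        for combo in combinations_with_replacement(sorted(FRAGMENTS), size):
            if passes_rule_of_five(*descriptors(combo)):
                yield combo

library = list(enumerate_library(max_fragments=3))
print(len(library), "enumerated combinations pass the rule")
```

A generator is used so that candidates can be streamed to later processing without holding the whole enumeration in memory, which matters as the number of permutations grows.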
Compound and Property Analysis, Including Using Quantum Computing
As discussed above, in some embodiments, a library of property information for compounds may be formed that includes, for multiple compounds, values for properties of the compounds. In some such embodiments, the values may be determined from sources of property information and/or may be predicted, such as predicted using one or more models trained as described herein to output predicted values for properties and for compounds. In some embodiments, information from the library may be provided to one or more computing systems for use in determining and/or analyzing properties of compounds. Such computing systems may, in some embodiments, be or include quantum computing systems, though other embodiments may not use quantum computing.
Accordingly, described herein in connection with some embodiments are techniques that leverage quantum computing hardware for computational analysis of properties of compounds. Quantum computers operate in a wholly different manner from conventional computer hardware and are not natively able to perform the same or even similar processing as conventional computer hardware. More particularly, quantum computers are not able to natively perform a computational analysis of properties of compounds in even the manner in which conventional computing hardware performed that analysis.
Also described herein are techniques for adapting a compound property analysis to a form that the computation may be performed quickly and reliably by computing hardware, including quantum computing hardware. For example, in some embodiments described herein, the determining and/or analyzing may include performing operations with one or more quantum computing systems, such as one or more quantum computers that may have been configured or otherwise arranged to perform computations using quantum annealing with one or more quantum annealers. In some such embodiments that operate using such a quantum annealer (or one or more quantum annealers), a collection of one or more compound properties of interest may be identified, such as based on analysis of other compounds and/or based on input from one or more users and/or compound property information provided from a library of property information. The quantum annealer(s) may analyze a set of compounds in connection with properties of interest to identify a subset of the compounds that meet one or more criteria, which may be identified as the best performing compounds for a desired function or meet other criteria. However, in some embodiments, these techniques may be adapted for use by other computing hardware.
In some such embodiments, computing hardware including the quantum annealer(s) may identify the subset by determining the compound(s) from among the set that satisfy one or more criteria regarding statistical values resulting from evaluation of a function, such as identifying compound(s) from among the set that correspond to a maximization, minimization, or other optimization of a function or other statistical operation with respect to a function with which the quantum annealer(s) is/are configured. The quantum annealer(s) may be configured with one or more weights or other values that affect operations of the annealer(s) and thereby affect evaluation of the function. The weights may, in some cases, indicate relationships between variables to be analyzed by the annealer(s), such as relationships between variables relating to one or more of the properties of interest. In some cases, the quantum annealer(s) may receive as input values corresponding to each compound to be analyzed with respect to the function and for each property of interest to be analyzed for the compound. For example, the values may be a set of discrete values. Such values may, in some embodiments, be binary values, such that the quantum annealer(s) may receive the values as a matrix of binary values where each value in the matrix indicates whether a compound has a particular property or whether that property for the compound satisfies one or more criteria for the properties or for that property (e.g., how a value for a property compares to a threshold). In yet another example, the values can be a set of multiple continuous values. Such values may, in some cases, be retrieved from a data store or library of property information for compounds, such as in some embodiments one including property values generated according to models generated and/or trained using techniques described herein.
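One way a matrix of binary values could be derived from continuous property values is by comparison to per-property thresholds, as in the minimal sketch below. The compound names, property names, values, and thresholds are all hypothetical.

```python
# Hypothetical property values for three compounds; names and numbers are illustrative.
PROPERTIES = ["solubility", "thermostability", "bbb_permeability"]
values = {
    "cmpd_1": [0.82, 0.40, 0.91],
    "cmpd_2": [0.55, 0.75, 0.30],
    "cmpd_3": [0.90, 0.88, 0.67],
}
# Assumed per-property thresholds: a value at or above the threshold counts as satisfied.
thresholds = [0.60, 0.70, 0.50]

def to_binary_matrix(values, thresholds):
    """Return one row of 0/1 entries per compound, one column per property."""
    return {
        name: [1 if v >= t else 0 for v, t in zip(row, thresholds)]
        for name, row in values.items()
    }

binary = to_binary_matrix(values, thresholds)
for name, row in binary.items():
    print(name, row)
```

Each row of the result corresponds to a compound and each column to a property of interest, matching the matrix layout described above.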
Accordingly, in some embodiments, the quantum annealer(s) may analyze binary values corresponding to compounds and properties of interest to identify, from among the compounds, one or more compounds that satisfy one or more criteria and so may have a desirable combination of properties of interest. Such compounds may then be synthesized and tested, or tested in silico using other techniques, to further identify a smaller set of compounds that may be candidates for use in a particular context, for further experimentation, or other purposes. Using such a process, one or more compounds may be identified that may advantageously perform a function. As such, in some embodiments, a quantum-computing-based analysis may serve as an initial filter on a large set of compounds to identify a candidate set of compounds that may perform a desired function well, after which a second lab-based or in silico analysis can further filter and refine the candidate set.
Accordingly, in some embodiments, techniques described herein may be used to identify one or more drugs that may have a desirable combination of properties, such as one or more drugs that may bind to a target as well as have a desirable solubility, thermostability, blood-brain barrier permeability, and/or other properties. In some such embodiments, a quantum annealer may receive as input a matrix of values, where each row in the matrix corresponds to a drug candidate and the value for that row indicates a value for a property of that drug candidate. In some embodiments, the values can be discrete or continuous values. For example, the discrete values can be binary values (e.g., 0 or 1). In another example, the continuous values can be any number from 0 to 1, such as 0.8. The quantum annealer may analyze the matrix to identify a ranking of predicted performance of the drug candidates with respect to the properties and identify a number N of the drug candidates that have an overall best performance with respect to the properties, based on evaluation of a function with which the quantum annealer is configured for determining an optimal or otherwise desirable combination of properties. With such a process, a best, top five, top ten, top one hundred, or other top N drug candidates may be identified by the quantum annealer. The identified drug candidates may then be analyzed using other techniques to determine or confirm properties in the list or determine or confirm the performance of drug candidates in the subset. Such other techniques may include other computational techniques, such as using machine learning or other artificial intelligence techniques, or laboratory work that involves synthesizing and testing the drug candidates.
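As a minimal classical sketch of the top-N selection described above, the following Python ranks drug candidates by a weighted sum of their property values. The weighted-sum scoring, the candidate names, and the weights are illustrative assumptions; the description does not specify the function with which the annealer is configured.

```python
from typing import Dict, List, Tuple


def top_n_candidates(
    matrix: Dict[str, List[float]], weights: List[float], n: int
) -> List[Tuple[str, float]]:
    """Rank candidates by a weighted sum of property values (assumed scoring)."""
    scored = [
        (name, sum(w * v for w, v in zip(weights, row)))
        for name, row in matrix.items()
    ]
    # Highest score first; ties broken by name for a deterministic order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


# Hypothetical rows: [binds target, solubility, blood-brain barrier permeability]
matrix = {
    "cand_A": [1.0, 0.8, 0.0],
    "cand_B": [1.0, 0.2, 1.0],
    "cand_C": [0.0, 0.9, 1.0],
}
weights = [2.0, 1.0, 1.0]  # assumed relative priorities of the properties
top2 = top_n_candidates(matrix, weights, 2)
print([name for name, _ in top2])
```

Requesting more candidates than exist simply returns the full ranked list, which mirrors the "top N" output described above for small libraries.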
In some embodiments, a library of compounds may be identified and analyzed to determine a subset of the compounds that meet one or more criteria, such as that they are predicted to have a desirable combination of properties. In some embodiments, the library of compounds that are analyzed may be all compounds that satisfy one or more criteria. For example, in some embodiments in which the compounds are drugs, the library may be all molecules that satisfy “Lipinski’s Rule of 5,” which defines a space of drug-like molecules whose pharmacokinetic properties within the human body make them more likely drug candidates than molecules that do not meet the rule. There are 10^60 such compounds in that library, a number of molecules that cannot be practically evaluated using conventional techniques. As another example, the set of all molecules up to 30 atoms in size contains 10^24 molecules. In some embodiments using techniques described herein, these or other libraries of molecules may be evaluated in a practical timeline, such as within a matter of days or less than two weeks. Analyzing such a large library of compounds in a timeline that is a matter of days or otherwise practical for research and development may allow for a more reliable and comprehensive identification of well-performing compounds, such as a determination of a global “best” performing compound with respect to properties of interest or otherwise a determination of a compound that performs well or best across a large library of compounds.
In some embodiments (including some that use quantum computing or others that use other computing hardware), techniques for analyzing compound properties may start with a user specifying properties of interest. In other embodiments, properties of interest may additionally or alternatively be determined by a system through analysis of input compounds. The input compounds may be ones that are identified by a user as performing a function or performing a function in a manner that satisfies one or more criteria, such as performing the function with a desired effectiveness. In some such embodiments that receive input compounds, a process may include determining pharmacophores for the input compounds. These pharmacophores may be used to determine properties that are present in the compounds and may be related to performing the function or performing the function in the manner that satisfies the criteria. Information from the pharmacophores, or other input from a user or information regarding compounds, may be used to determine rules regarding properties for compounds that, when present in a compound, may lead to the compound performing a function or performing a function in a manner that satisfies one or more criteria.
In some embodiments described herein (including some that use quantum computing or others that use other computing hardware), once properties of interest are determined, rules related to the properties of interest and/or to values for those properties may be determined. In some such embodiments, the rules enable a description of those properties with respect to a binary value, such as whether the property is present or not in a compound or whether a criterion with respect to the property (e.g., a value above or below a threshold) is satisfied for the compound. For a list of compounds, a value may be determined with respect to each property and for each compound. These values may be arranged in a matrix of values, where each row represents a compound, and each column represents a property. The rules can enable a description of those properties with respect to a set of discrete values or continuous values. In some embodiments, the matrix can include binary values (e.g., 0 or 1). In some embodiments, the matrix can include continuous values (e.g., 0 to 1).
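The rule-to-matrix step described above can be sketched as follows. This is a minimal illustration only: the property names, thresholds, and compound values are hypothetical, and each rule is assumed to be a simple threshold test that yields a binary value.

```python
from typing import Callable, Dict, List

Rule = Callable[[Dict[str, float]], bool]

# Hypothetical rules: property names and thresholds are illustrative only.
RULES: Dict[str, Rule] = {
    "soluble": lambda p: p["solubility"] >= 0.5,
    "bbb_penetrant": lambda p: p["bbb_permeability"] >= 0.7,
    "thermostable": lambda p: p["melting_point_c"] >= 60.0,
}


def binary_matrix(compounds: Dict[str, Dict[str, float]]) -> Dict[str, List[int]]:
    """One row per compound, one column per rule; 1 if the rule is satisfied."""
    return {
        name: [int(rule(props)) for rule in RULES.values()]
        for name, props in compounds.items()
    }


compounds = {
    "cmpd_1": {"solubility": 0.9, "bbb_permeability": 0.4, "melting_point_c": 75.0},
    "cmpd_2": {"solubility": 0.3, "bbb_permeability": 0.8, "melting_point_c": 55.0},
}
matrix = binary_matrix(compounds)
print(matrix)
```

A continuous-valued variant would return a normalized score in the range 0 to 1 from each rule instead of casting a boolean to an integer.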
In some embodiments that use quantum computing hardware, the quantum computing hardware may be configured with a function that identifies relationships between variables, where the variables relate to compound properties and relationships between them, such as relative priorities of different properties in a desirable or well-performing compound. The variables may be set such that when the quantum computing hardware identifies, using input binary values relating to properties for compounds, a compound that relates to a maximum, minimum, optimum, or other statistical value for the function with which the quantum computing hardware is configured, that compound may be the best compound with respect to the properties of interest or otherwise satisfy one or more criteria with respect to those properties of interest. In some embodiments described herein, the quantum computing hardware may be configured to perform quantum annealing. The quantum annealing may be in the form of a QUBO in some such embodiments. It should be appreciated, however, that other forms of quantum annealing, or other forms of quantum analysis or other ways of configuring a quantum computer, may be used in other embodiments. It should also be appreciated that in some embodiments, these techniques may be adapted for use by classical computing hardware that is configured in any architecture.
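To make the role of the configured function concrete, the sketch below evaluates a QUBO energy for each compound's binary property row on classical hardware and returns the row with the minimum energy. The coefficient values, the upper-triangular dictionary layout for Q, and the compound names are assumptions chosen for illustration; they are not taken from the description.

```python
from typing import Dict, List, Tuple

Qubo = Dict[Tuple[int, int], float]


def qubo_energy(x: List[int], Q: Qubo) -> float:
    """E(x) = sum over (i, j) of Q[i, j] * x[i] * x[j], with x[i] in {0, 1}."""
    return sum(coeff * x[i] * x[j] for (i, j), coeff in Q.items())


def best_row(rows: Dict[str, List[int]], Q: Qubo) -> Tuple[str, float]:
    """Return the compound whose binary property row has the lowest energy."""
    return min(
        ((name, qubo_energy(x, Q)) for name, x in rows.items()),
        key=lambda pair: (pair[1], pair[0]),
    )


# Assumed property order: 0 = binds target, 1 = soluble, 2 = BBB penetrant.
# Diagonal terms reward single properties; (0, 2) rewards having both 0 and 2.
Q: Qubo = {(0, 0): -3.0, (1, 1): -1.0, (2, 2): -1.0, (0, 2): -0.5}
rows = {"cmpd_A": [1, 1, 0], "cmpd_B": [1, 0, 1], "cmpd_C": [0, 1, 1]}
name, energy = best_row(rows, Q)
print(name, energy)
```

More negative diagonal coefficients express higher-priority properties, and off-diagonal coefficients express relationships between pairs of properties.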
While techniques leveraging quantum computers are described herein in connection with some embodiments, it should be appreciated that other embodiments may not include quantum computers. Such other embodiments that do not include quantum computers include embodiments that do not analyze properties of compounds or identify candidate sets of compounds, and embodiments that analyze properties of compounds and identify candidate sets but do not use quantum computing systems.
Hardware Components of Some Embodiments
FIG. 1A is a block diagram of an example system 100 for determining and/or analyzing properties of compounds, in accordance with some embodiments of the technology described herein. In the illustrative example of FIG. 1A, system 100 includes a network 105, a computing device 110 including a client interface 112 for interfacing with a client 115, a computing device 120 including a compound analysis facility 126, and a quantum annealer 130. It should be appreciated that system 100 is illustrative and that a system may have one or more other components of any suitable type in addition to or instead of the components illustrated in FIG. 1A. For example, there may be additional remote systems (e.g., two or more) present within a system. It should also be appreciated that in some embodiments, the system 100 may include classical computing hardware that is configured in any architecture. For example, the classical computing hardware may be in addition to or instead of the quantum annealer 130. Such other hardware may include one or more central processing units (CPUs), graphics processing units (GPUs), and/or other hardware accelerators, such as a distributed array of CPUs, GPUs, and/or other hardware accelerators that are configured to interoperate and execute portions of a task in parallel.
The network 105 may be or include one or more local and/or wide-area, wired and/or wireless networks, including a local-area or wide-area enterprise network and/or the Internet. Accordingly, the network 105 may be, for example, a hard-wired network (e.g., a local area network within a biopharma research office), a wireless network (e.g., connected over Wi-Fi and/or cellular networks), a cloud-based computing network, or any combination thereof. For example, in some embodiments, the computing device 110 and the computing device 120 may be located within the same building or building complex and connected directly to each other or connected to each other via the network 105, while the quantum annealer 130 may be located in a remote building and connected to the computing device 110 and the computing device 120 through the network 105. In some embodiments, the computing device 110 and the computing device 120 are integrated as one device.
The computing device 110 may be any suitable one or more electronic devices configured to send instructions and/or information to the computing device 120, to receive information from the computing device 120, and/or to process obtained data. In some embodiments, computing device 110 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device. Alternatively, the computing device 110 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to send instructions and/or information to the computing device 120, to receive information from the computing device 120, and/or to process obtained data.
The computing device 110 can include the client interface 112 for interfacing with a client 115. In some embodiments, the client interface 112 includes graphical user interfaces. In some embodiments, the client interface 112 includes executable instructions. The client 115 can interact with the client interface 112 to control or configure the computing device 110, the computing device 120, the quantum annealer 130, and/or classical computing hardware. The client 115 can use the client interface 112 to view data generated by the computing device 120 or the quantum annealer 130.
The computing device 120 may be any suitable one or more electronic devices configured to send instructions and/or information to the computing device 110 and/or the quantum annealer 130, to receive information from the computing device 110 and/or the quantum annealer 130, and/or to process obtained data. In some embodiments, computing device 120 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device. Alternatively, the computing device 120 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to send instructions and/or information to the computing device 110 and/or the quantum annealer 130, to receive information from the computing device 110 and/or the quantum annealer 130, and/or to process obtained data. It should also be appreciated that in some embodiments, the computing device 120 may communicate with classical computing hardware that is configured in any architecture.
The computing device 120 can include a compound representation facility 122 for creating and/or managing representations of compounds. The compound representation facility 122 can encode compounds into compound representations to create a library of compound representations. The compound representation facility 122 may also use representations to determine property information encoded in the representation, and may in some cases be trained to identify the properties of a compound in the library of compounds by decoding the compound’s representation. The compound representation facility 122 can decode the compound representation of a compound for output of its properties. The compound representation facility 122 may also in some embodiments query the library of compounds (e.g., in response to a request from a user or other source) to identify a compound representation of a compound having certain properties of interest and may output the properties of that compound, which may in some cases include information sufficient or helpful to synthesize the compound. The compound representation facility 122 may maintain the library of compounds for analysis and querying by the compound model facility 124.
The computing device 120 can include a compound model facility 124 for predicting the properties of the compounds in the library of compounds maintained by the compound representation facility 122. The compound model facility 124 may include one or more models configured to predict values of one or more properties of an input compound. Such a model of the compound model facility 124 may in some cases be trained to predict values for properties of compounds. The compound model facility 124 may provide compounds and predicted property values to the compound analysis facility 126. A model of the compound model facility 124 may be trained to predict values for new properties of compounds based on an identification of a new property of interest and training data for the new properties of interest, and in some cases an existing model may be edited to generate a new model to predict values for compounds for a new property.
The computing device 120 can include a compound analysis facility 126 for managing analysis of compounds in accordance with techniques described herein. In some embodiments, the compound analysis facility 126 includes executable instructions that can be executed by the computing device 120. The compound analysis facility 126 may receive input information from the interface 112 or the compound model facility 124, which may include data identifying one or more properties of interest, one or more known compounds that perform a function, and/or one or more criteria identifying a set of compounds to be analyzed. In some embodiments, the compounds are included in antibodies, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules. In some embodiments, the compounds can be used in battery development, the petrochemical industry, biodegradable plastics, veterinary medicine, OLEDs, colorants, dyes, paints, agriculture, or pesticides. In some embodiments, the facility 126 may identify criteria for compound properties of interest, which may be implemented as a set of rules to be used in analysis by the facility 126 and/or by the quantum annealer 130. In some embodiments, the facility 126 may determine the rules in part by determining pharmacophores for input known compounds and using the pharmacophores in determining the rules. In some embodiments, in addition to determining properties and rules, the facility 126 may also identify compounds to be analyzed by the quantum annealer 130, such as based on user input defining a landscape of compounds to be analyzed through identifying properties of the compounds or a definition of the compounds of interest (e.g., all compounds having up to 30 atoms). In some embodiments, the facility 126 may also receive user input that specifies a desired resolution for compounds to be analyzed.
The resolution may relate to a number of compounds that are analyzed by the annealer from among an entirety of the compounds that may satisfy a definition or characterization of compounds of interest, such as all of the compounds, half of the compounds, one quarter of the compounds, or other suitable portion of the compounds. The compound analysis facility 126 may also identify values for the rules for the properties of interest for the compounds, to be analyzed by the quantum annealer 130, including by retrieving information from one or more data stores of compound property values. The compound analysis facility 126 may, in some embodiments, send instructions and/or configuration information to the quantum annealer 130 (e.g., via network 105) to control or configure the quantum annealer 130. Such instructions and/or configuration information may include specifying a function to be used by the annealer 130 in analysis, such as by setting values for one or more variables of the analysis in accordance with some techniques described herein. The compound analysis facility 126 can also transmit data to the quantum annealer 130 (e.g., via network 105) and trigger the quantum annealer 130 to analyze the data. Such data may, in some cases, be a matrix of values, such as values indicating values with respect to rules for properties of compounds, in accordance with techniques described herein. In some embodiments, the values can be discrete or continuous values. For example, the discrete values can be binary values (e.g., 0 or 1). In another example, the continuous values can be any number from 0 to 1, such as 0.8. Following the analysis by the annealer 130, the compound analysis facility 126 may also receive data analyzed by the quantum annealer 130. However, it should be appreciated that in some embodiments, the compound analysis facility 126 may be adapted to cause classical computing hardware to analyze the data. 
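One way to read the "resolution" setting is as a sampling fraction applied to the set of compounds of interest. The sketch below assumes uniform random sampling with a fixed seed; the description does not state how the portion is chosen, so the selection strategy, function name, and parameters are hypothetical.

```python
import random
from typing import List, Sequence


def sample_at_resolution(
    compounds: Sequence[str], fraction: float, seed: int = 0
) -> List[str]:
    """Pick a fraction of the compounds (e.g., 1.0, 0.5, 0.25) for analysis."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not compounds:
        return []
    # Always analyze at least one compound when the library is non-empty.
    k = max(1, round(len(compounds) * fraction))
    return random.Random(seed).sample(list(compounds), k)


library = [f"cmpd_{i}" for i in range(8)]
half = sample_at_resolution(library, 0.5)
print(len(half))
```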
The quantum annealer 130 can be a quantum computer (or more than one quantum computer) configured to perform quantum annealing. While the embodiment of FIG. 1A implements a quantum computer as a quantum annealer, it should be appreciated that some embodiments may operate with one or more quantum computers configured to perform a different analysis. In some embodiments, quantum annealer 130 may include additional computer hardware to interact with other computing devices (e.g., devices 110, 120) and to execute operations to configure the quantum computing hardware of the annealer 130. The quantum annealer 130 may be configured in some embodiments to identify a solution to an objective function with which it has been configured by the facility 126, based on input provided to it (e.g., a set of candidate solutions) by facility 126. In some embodiments, the solution to the objective function may be a minimum or maximum value for the function from among the input data, or other statistical value. As one example, the quantum annealer 130 may be configured to perform a QUBO analysis, may receive a binary table of values, and may identify from among the values of the binary table a row that provides a “global” (with respect to the input candidate solutions) minimum solution to the QUBO function.
In some embodiments, the quantum annealer 130 can be implemented as a D-Wave quantum computer that uses superconducting flux qubits. The superconducting flux qubit may perform analysis using quantum mechanical spin. As illustrated in FIG. 1B, a qubit loop may have current applied to it, and the circulating current in the qubit loop can give rise to a flux inside the loop. As shown in FIG. 1C, that flux can encode two distinct quantum spin states that can exist in a superposition. In some embodiments, the quantum annealer 130 can include two superconducting loops for each qubit of the annealer 130, and the annealer 130 may have multiple qubits. In such a case, each loop can be subject to an external flux bias Φ1x or Φ2x. When cooled to near absolute zero, the two superconducting loops can behave as a superposition state.
Setting of the bias in each loop and setting of programmable coupling elements provide a spin-spin coupling energy that can be tunable, as described using the known Ising quantum model:
[Equation image imgf000030_0001: Ising model Hamiltonian]
Here, σ_i^z can be the Pauli spin matrix with eigenvectors {|↑⟩, |↓⟩}. The D-Wave implementation can allow both the coupling strengths and the biases h_i to be set independently when defining a QUBO, in some embodiments as described herein. Using the bias and the coupling, then, operation of the quantum computer can be influenced. This enables customizing of the computation to be done by the quantum computer, including per input from a user via interface 112 and/or processing by the compound analysis facility 126, which can be provided by the compound analysis facility 126 to the annealer 130 as configuration input.
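For readers unfamiliar with the Ising model referenced above, the following sketch computes a classical Ising energy and finds the lowest-energy spin configuration by exhaustive search. It assumes the commonly used form E(s) = Σ_i h_i·s_i + Σ_{i<j} J_ij·s_i·s_j with spins s_i in {-1, +1}; the symbol J for the couplings, the sign convention, and the numeric values are assumptions, not values from the description.

```python
from itertools import product
from typing import Dict, Tuple

Spins = Tuple[int, ...]


def ising_energy(
    s: Spins, h: Dict[int, float], J: Dict[Tuple[int, int], float]
) -> float:
    """Assumed form: E(s) = sum_i h[i]*s[i] + sum_(i<j) J[i, j]*s[i]*s[j]."""
    linear = sum(h_i * s[i] for i, h_i in h.items())
    pairwise = sum(j_ij * s[i] * s[j] for (i, j), j_ij in J.items())
    return linear + pairwise


def ground_state(
    n: int, h: Dict[int, float], J: Dict[Tuple[int, int], float]
) -> Tuple[Spins, float]:
    """Exhaustive search over all 2**n spin configurations (tiny n only)."""
    return min(
        ((s, ising_energy(s, h, J)) for s in product((-1, 1), repeat=n)),
        key=lambda pair: pair[1],
    )


h = {0: 0.5, 1: -1.0}   # per-qubit biases (illustrative)
J = {(0, 1): 1.0}       # pairwise coupling (illustrative)
spins, energy = ground_state(2, h, J)
print(spins, energy)
```

Changing a bias or a coupling changes which configuration has the lowest energy, which is the sense in which the biases and couplings customize the computation.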
As mentioned above, the annealer 130 may have multiple qubits, each of which may include the loop shown in FIG. 1B. The accumulation of many qubits may enable the quantum annealer 130 to perform computations (e.g., identify a minimum value of a function) for a large expanse of variable space. For example, the computing device 120 can provide a quantum mechanical superposition state of all possible solutions with equal weighting. The quantum annealer 130 can receive the objective function from the computing device 110 and/or the computing device 120, which can define the objective function as a QUBO or Ising model. For example, the quantum annealer 130 can minimize or maximize an objective function, or otherwise calculate a statistical value as a solution to the objective function that meets one or more criteria. During the computation (e.g., the minimization computation), a quantum waveform representing the superposition state of the qubits can collapse per the influence of the programmed weighting (bias) applied to the magnetic fields associated with the super-cooled and superposed currents in the chip. The annealing process can therefore produce a sampled list of energetic states associated with each possible solution. The minimal energy states can represent optimal solutions or solutions otherwise meeting one or more criteria. In some embodiments, a single result may be output from the annealer 130 and provided to the facility 126 in response to the configuring and the triggering of the computation by the annealer 130. In other embodiments, multiple results may be provided, such as a top five, top ten, top one hundred, or otherwise top N results that are the compounds that are predicted to perform best.
In some embodiments, rather than a fixed number of outputs, the annealer 130 may output all results that meet a criterion, such as by outputting all compounds that appear to have a combination of properties that satisfies one or more criteria, including being associated with a result of the objective function above a threshold.
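The two output modes described above (a fixed top N, or every result meeting a criterion) can be sketched as post-processing of a sampled list of (compound, energy) pairs. The sample format and names are hypothetical, and the sketch assumes a minimization framing in which lower energy is better, so the criterion is expressed as a maximum allowed energy.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (compound name, energy); hypothetical format


def _best_energy_per_compound(samples: List[Sample]) -> Dict[str, float]:
    """Collapse repeated samples, keeping the lowest energy seen per compound."""
    best: Dict[str, float] = {}
    for name, energy in samples:
        if name not in best or energy < best[name]:
            best[name] = energy
    return best


def top_n(samples: List[Sample], n: int) -> List[Sample]:
    """Fixed-size output: the n lowest-energy distinct compounds."""
    best = _best_energy_per_compound(samples)
    return sorted(best.items(), key=lambda kv: (kv[1], kv[0]))[:n]


def meeting_threshold(samples: List[Sample], max_energy: float) -> List[str]:
    """Variable-size output: every compound at or below the energy threshold."""
    best = _best_energy_per_compound(samples)
    kept = [(name, e) for name, e in best.items() if e <= max_energy]
    return [name for name, _ in sorted(kept, key=lambda kv: (kv[1], kv[0]))]


samples = [
    ("cmpd_B", -4.5), ("cmpd_A", -4.0), ("cmpd_B", -4.5),
    ("cmpd_C", -2.0), ("cmpd_A", -3.5),
]
print(top_n(samples, 2))
print(meeting_threshold(samples, -3.0))
```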
Representing Compounds and Representation Analysis
FIG. 2A illustrates a method 200 for determining and/or analyzing properties of compounds. FIGS. 2B-10 illustrate processes that may be used in some embodiments to carry out some of the acts described in connection with FIG. 2A. The method 200 may be performed by the compound representation facility 122, the compound model facility 124, and the compound analysis facility 126 executed by the computing device 120. The computing device 120 can receive data for processing by the quantum annealer 130. However, the quantum annealer 130 might not be able to execute standard executable instructions (e.g., standard computer code), so the computing device 120 can translate the input into a format that is compatible with the quantum annealer 130, configure the quantum annealer 130 with the manner in which the data is to be processed, and provide the data to the quantum annealer 130 for processing. However, it should be appreciated that in some embodiments, the computing device 120 may cause classical computing hardware to process the data.
At step 202, the compound representation facility 122 can generate compound representations of compounds (sometimes referred to herein interchangeably as “compound representations” or “digital representations”). As shown in FIG. 2B, for a compound, the compound representation facility 122 may convert information regarding the compound into a graph. In some embodiments, the facility 122 may convert such a graph representation into a vector. In some embodiments, the compound representation facility 122 can generate representations of compounds using information regarding properties of the compounds. For example, the compound representation facility 122 can generate representations of 3D protein target structures, existing ligands, or property datasets. In some embodiments, the compound representation facility 122 can generate, for each respective compound of multiple compounds, a respective compound representation that represents multiple atoms of the respective compound and at least one property of the respective compound. In some embodiments, the compound representation facility 122 can receive the properties from the client interface 112. For example, the compound representation facility 122 can receive properties of new compounds to represent the new compounds and predict their properties.
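As a toy illustration of the compound-to-graph-to-vector conversion described for step 202, the sketch below builds an adjacency list from atoms and bonds and then derives a small fixed-length vector from it. The hand-written features (element counts, bond count, maximum degree) and the element vocabulary are assumptions standing in for whatever learned embedding an actual embodiment would use.

```python
from typing import Dict, List, Tuple

ELEMENTS = ["C", "N", "O"]  # assumed fixed element vocabulary


def to_graph(atoms: List[str], bonds: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Undirected molecular graph as an adjacency list keyed by atom index."""
    adjacency: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def to_vector(atoms: List[str], adjacency: Dict[int, List[int]]) -> List[int]:
    """Toy vector: element counts, then bond count, then maximum atom degree."""
    counts = [atoms.count(element) for element in ELEMENTS]
    n_bonds = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    max_degree = max((len(nbrs) for nbrs in adjacency.values()), default=0)
    return counts + [n_bonds, max_degree]


# Ethanol heavy atoms only (hydrogens omitted): C-C-O
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
graph = to_graph(atoms, bonds)
vector = to_vector(atoms, graph)
print(graph, vector)
```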
At step 204, the compound model facility 124 can train one or more property determination models to identify one or more properties of compounds. As shown in FIG. 2B, in some embodiments, the compound representation facility 122 may train a neural network 220 for predicting properties (though it should be appreciated that models may be implemented in ways other than neural networks). For example, the compound model facility 124 can train a model to predict chemical features that are related to activity and performance of the compounds. In some embodiments, the compound model facility 124 can train a property determination model based on the respective compound representation of each respective compound of the compounds.
At step 206, the compound model facility 124 can generate predicted values for properties for a set of compounds. As shown in FIG. 2B, the compound representation facility 122 can maintain graph embeddings (e.g., compound representations) 222 in an embedding space (e.g., solution library) 224. The set of compounds can be compounds that have the properties of interest. In some embodiments, the compound model facility 124 can generate a library of compounds accessible to the property determination model, the library of compounds comprising the respective compound representation of each respective compound of the compounds. For example, the compound model facility 124 may generate a solution library of compounds (e.g., RNA or peptides) and their properties for the quantum annealer 130 or other computing hardware to analyze.
In some embodiments, the compound model facility 124 can generate using models, or retrieve from a data store, properties for compounds. As shown in FIG. 2B, the compound model facility 124 can generate property predictions 226 for compounds. The compound model facility 124 can identify a numeric value for every compound and its properties. In some embodiments, the compound model facility 124 can determine, for one or more (or each) compounds in the library of compounds, a value for the compound with respect to each property of a set of properties, to generate a set of property values for each compound of the library of compounds. For example, for a property that is blood-brain barrier penetrance, a numeric value can indicate how blood brain barrier penetrant a compound is. In some embodiments, the properties can be calculable and can be performed at scale using techniques described herein.
Some embodiments that create a library of information regarding compounds may include techniques for analysis of compounds, such as to identify a compound that may perform a desired function or task or meet some performance criterion for the same. Other embodiments, however, may not include such an analysis and may end once the library is created or supplemented with property information. In the example of FIG. 2, the process includes analysis functionality.
Accordingly, at step 208, the compound model facility 124 can configure criteria. The criteria can determine the likelihood of success for a compound to perform well in a given application. The criteria can be based on the compounds and properties of interest of those compounds. For example, the compound model facility 124 can receive criteria defining a set of compounds that have properties of interest. For example, the compound model facility 124 can extract criteria that define blood-brain barrier penetrant drugs for the quantum annealer 130 to optimize. However, in some embodiments, the compound model facility 124 can extract criteria for other computing hardware to optimize.
The criteria can be input into an objective function for comparing against a dataset of candidate solutions by the quantum annealer 130, or in some embodiments, by other computing hardware. As shown in FIG. 2C, the compound model facility 124 can provide the predicted properties 230 from a large chemical property database 232 as candidate solutions 234 to a quantum computer 236 (e.g., quantum annealer 130). For example, the space of candidate solutions the quantum annealer 130 can check can be on the order of 10^20 and beyond.
In some embodiments, the compound model facility 124 can receive criteria from the computing device 110, which can receive the criteria via the client interface 112 from the client 115. For example, a drug discovery scientist or medicinal chemist can define criteria for an optimal compound in a research study. As shown in FIG. 2C, the compound model facility 124 can be provided with multi-objective optimization criteria 238. For example, the scientist or chemist can define a multi-objective optimization criterion for what the compound should do. In particular, if a medicinal chemist attempts to design a compound to interact with a particular protein target of interest, the medicinal chemist can derive criteria that define how existing or theoretical molecules interact with the protein.
At step 210, the compound analysis facility 126 can operate the quantum computer 236 (e.g., quantum annealer 130) to trigger execution of criteria against the set of compounds. In some embodiments, the compound analysis facility 126 can trigger at least one quantum computer to analyze the set of property values for each compound of the library in connection with an objective function with which the at least one quantum computer is configured, to determine a compound for which corresponding property values generate a minimum value for the objective function. For example, the compound analysis facility 126 can execute the quantum annealer 130 with properties of interest of the compounds to analyze any property of any compound. In another example, the compound analysis facility 126 can cause classical computing hardware to analyze the property.
In some embodiments, the quantum annealer 130 can use the criteria defining the property information generated from the fast property determination of the compound model facility 124. For example, the pre-calculated properties output by the compound model facility 124 can be the candidate solutions that the quantum annealer 130 can cause the quantum computer to analyze. In some embodiments, the compound analysis facility 126 can use the quantum annealer 130 to search the data store of all drug-like compounds maintained by the compound representation facility 122 for compounds satisfying the criteria. For example, the search can result in the compiling of properties for all molecules that meet a criterion (e.g., all drug-like). However, it should be appreciated that in some embodiments, the compound analysis facility 126 can cause classical computing hardware to search for compounds.
The compound analysis facility 126 can convert the criteria to be processed by the quantum annealer 130. In some embodiments, the compound analysis facility 126 can use the pre-calculated properties as input to create the QUBO when searching chemical space. As shown in FIG. 2C, the compound analysis facility 126 can convert the multi-objective optimization criteria into a QUBO quantum formulation 240. For example, the compound analysis facility 126 can convert or reformulate the criteria in the values of a QUBO for the quantum annealer 130 to use the criteria. The compound analysis facility 126 can create the QUBO to include the predicted properties of the compounds predicted by the compound model facility 124. However, it should be appreciated that in some embodiments, the compound analysis facility 126 can generate an input for classical computing hardware to analyze for compounds.
The reformulated criteria as QUBO can be used by the quantum annealer 130 for optimization (e.g., minimization) on the quantum computer. For example, the compound analysis facility 126 can cause the quantum annealer 130 to run optimization against the set of compounds. For example, the quantum annealer 130 can identify optimal compounds from the property predictions pre-calculated by the large-scale property models of the compound model facility 124. However, it should be appreciated that in some embodiments, the classical computing hardware can identify compounds.
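By way of non-limiting illustration, the reformulation of multi-objective criteria into a QUBO may be sketched as follows. The sketch assumes one binary variable per candidate compound, a cost formed as a negated weighted sum of two predicted properties, and a quadratic penalty that enforces selection of exactly one compound. The compound names, scores, weights, and penalty value are hypothetical, and the exhaustive search is a classical stand-in for the quantum annealer 130 that is practical only for very small candidate sets.

```python
from itertools import product

def build_qubo(costs, penalty):
    """Build QUBO coefficients for choosing exactly one candidate of minimal cost.

    Energy(x) = sum_i costs[i] * x_i + penalty * (sum_i x_i - 1) ** 2,
    with the constant term dropped and x_i in {0, 1}.
    """
    n = len(costs)
    qubo = {}
    for i in range(n):
        # Linear terms sit on the diagonal because x_i * x_i == x_i.
        qubo[(i, i)] = costs[i] - penalty
        for j in range(i + 1, n):
            # Selecting two candidates at once is penalised.
            qubo[(i, j)] = 2.0 * penalty
    return qubo

def energy(qubo, x):
    return sum(w * x[i] * x[j] for (i, j), w in qubo.items())

def brute_force_minimum(qubo, n):
    # Classical stand-in for the annealer; only feasible for tiny n.
    return min(product((0, 1), repeat=n), key=lambda x: energy(qubo, x))

# Hypothetical predicted properties: (BBB penetration score, binding score).
predicted = {"cmpd_a": (0.2, 0.9), "cmpd_b": (0.8, 0.7), "cmpd_c": (0.5, 0.4)}
weights = (1.0, 1.0)  # assumed relative importance of each criterion
names = list(predicted)
# Higher scores are better, so the cost is the negated weighted sum.
costs = [-sum(w * p for w, p in zip(weights, predicted[name])) for name in names]
qubo = build_qubo(costs, penalty=10.0)
best = brute_force_minimum(qubo, len(names))
selected = [name for name, bit in zip(names, best) if bit]
```

In this sketch the penalty is chosen larger than the magnitude of any single cost so that selecting zero or several compounds is never preferable to selecting one.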
At step 212, the compound analysis facility 126 can receive a subset of the compounds. The subset of the compounds can be molecules or compounds that best optimize and satisfy the defined criteria. As shown in FIG. 2C, the quantum computer (e.g., quantum annealer 130) can provide ranked optimal results 242. In some embodiments, the subset of the compounds can be ranked to identify the most optimal compound. As shown in FIG. 2C, the quantum computer (e.g., quantum annealer 130) can provide ranked solutions 244. For example, the compound analysis facility 126 can receive a set of the top N most optimal compounds. In another example, the subset of the compounds can be ranked based on how blood brain barrier penetrant they are or how they bind to a target. In some embodiments, the compound analysis facility 126 can receive from the at least one quantum computer, an identification of the compound. For example, the compound analysis facility 126 can use the quantum annealer 130 to identify an optimized compound having the properties of interest. It should also be appreciated that in some embodiments, the compound analysis facility 126 may be configured to receive the compounds from the classical computing hardware.
At step 214, the compound analysis facility 126 can further analyze the subset of the compounds, and potentially refine the subset. In some embodiments, the compound analysis facility 126 can output the identification of the compound. For example, the client 115, via the client interface 112, can cause the compound analysis facility 126 to fine tune the top compounds using slower computation or experimental methods.
It should be appreciated that, while examples of quantum computing techniques have been described, embodiments that include analysis of compounds (e.g., to determine a compound that performs a function or task) are not limited to use of quantum computing technology.
Encoding Compounds
FIG. 3A is a flowchart of an illustrative method 300 of encoding properties of compounds. The method 300 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments. For example, the method 300 can include converting a compound into a graph representation, and then encoding the graph representation into a compound representation (e.g., digital representation). In some embodiments, performing step 202 of the method 200 includes performing the method 300.
The process 300 of FIG. 3A may operate on suitable input regarding compounds. Embodiments are not limited in this respect. For example, representations may be generated based on numerical vectors (physicochemical descriptors), fingerprints of binary or integer vectors containing a hashed or numerical count representation of the constituents of a compound, SMILES strings, graph representation of the molecule's 2D structure, 3D representation of the molecule's structure and conformer, and multiple 3D representations of compound conformers.
At step 302, the compound representation facility 122 can generate multiple nodes to represent multiple atoms of a compound for which a representation is to be generated. For example, the compound representation facility 122 can represent chemicals, pharmaceutical compounds, drugs, or biologics using a graph in which nodes of the graph are atoms. Each respective node of the plurality of nodes can represent each respective atom of the plurality of atoms of the compound in some such embodiments. As shown in FIG. 3B, data regarding a compound (e.g., molecule) 310 can be converted into a molecular graph 312 (e.g., with data regarding atoms 314a, 314b of the compound 310 converted to nodes 316a, 316b of the molecular graph 312).
At step 304, the compound representation facility 122 can generate edges of the graph to represent bonds between the atoms of the compound (e.g., interatomic bonds). Each respective edge of the edges can represent a respective association between a respective pair of atoms of the compound. For example, the molecular graph can represent the structure of the molecule with atoms as nodes and bonds 318 between atoms as edges 320. In some embodiments, the compound representation facility 122 can generate the graph for representing compounds and their protein binding sites. The graph may include nodes representing atoms and edges representing bonds (e.g., interatomic bonds) as well as weak bonds between ligands and proteins.
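By way of non-limiting illustration, steps 302 and 304 may be sketched as building a node for each atom and an edge for each bond. The function name, the dictionary layout, and the use of bond order as the only edge attribute are assumptions made for this sketch.

```python
def build_molecular_graph(atoms, bonds):
    """Return (nodes, edges, neighbors) for a compound.

    atoms: list of element symbols, one per atom.
    bonds: list of (atom_index_a, atom_index_b, bond_order) tuples.
    """
    # Step 302: one node per atom.
    nodes = {i: {"element": element} for i, element in enumerate(atoms)}
    # Step 304: one undirected edge per bond.
    edges = {}
    neighbors = {i: [] for i in nodes}
    for a, b, order in bonds:
        edges[frozenset((a, b))] = {"order": order}
        neighbors[a].append(b)
        neighbors[b].append(a)
    return nodes, edges, neighbors

# Heavy atoms of ethanol (hydrogens omitted): C-C-O.
nodes, edges, neighbors = build_molecular_graph(
    ["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]
)
```

The neighbor lists produced here are the structure that an iterative traversal of the graph would walk.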
In some embodiments, in connection with the generation of the nodes in step 302 and/or the edges of step 304, the facility 122 may process attention inputs, process an adjacency matrix, and/or process an atomic distance matrix, examples of which are described below.
At step 306, the compound representation facility 122 can iteratively traverse the nodes along the edges to update the graph representation of the compound. For example, as shown in FIG. 3C, iterative passing (e.g., iterative message passing) can be used to traverse nodes along the edges to update the graph representations. In some embodiments, the compound representation facility 122 can utilize an iterative message passing process to traverse the edges to identify which nodes are near other nodes. In some embodiments, the compound representation facility 122 can identify information about each node (e.g., atom) based on information regarding surrounding nodes (e.g., to indicate interactions between atoms). In some embodiments, based on the surrounding atoms, the compound representation facility 122 can update the graph representation of the compound for analysis. In some embodiments, the compound representation facility 122 can update the graph representation in accordance with a configuration to encapsulate (e.g., summarize) the information of all of the atoms in the neighborhood and the atoms in those atoms’ neighborhood. In some embodiments, the compound representation facility 122 can use machine learning to analyze and update the graph representation.
The “message passing” technique that may be used across iterations may in some embodiments include training multiple aspects of a model, such as multiple neural network layers, using adjacency and/or distance matrices (examples of which are described below), which define the bonds and 3D structure of a compound. FIG. 3D illustrates an example of such a process. In the example of FIG. 3D, the distance matrix 330 is formed as a “3D conformer encoding” matrix that indicates values representing or derived for distances between pairs of atoms (and/or molecules) of a compound, across different conformers of the compound. The values for each distance may be an average distance, median distance, standard deviation of distance, and/or other calculated distance. The adjacency matrix 332 (e.g., “edge/bond encoding”) may identify pairs of atoms (and/or molecules) between which a bond exists in the compound. In some cases, as in the example of FIG. 3D, as mentioned below, one or more parts of the compound may be masked during creation of these matrices. Attention inputs (in the form of query vectors Q 334, key vectors K 336, and/or value vectors V 338) may include information on atoms and bonds, but not on 3D structure, and may include information on nodes of a graph of the compound, such as information regarding atoms of the compound. Such atom information may identify an element for the atom, valence information, or other information defining that atom of the compound. During training, the matrices and attention inputs may be input to the training process 340 and neural networks may be trained with the values. This process may yield a representation 342 of the information regarding the compound (e.g., matrix representation of information regarding the compound, compound representation, etc.).
The compound representation facility 122 can utilize a weighted algorithm that iteratively updates features of nodes (e.g., atoms) with those of the surrounding atoms. For example, the compound representation facility 122 can utilize the weighted algorithm to update the graph representation from neighboring nodes to get a weighted sum status of the neighbors. In some embodiments, the compound representation facility 122 can update the graph representation by using nonlinear weighting. For example, the compound representation facility 122 can assign a larger influence or weight on the representation of the atom based on the atoms that are local, close, or immediate neighbors. Meanwhile, updates from more distant atoms, such as atoms farther away from the atom that is being analyzed, can have less influence on the representation of that atom. Atoms that are distant, such as not immediate neighbors, can have some influence on the atom based on weighted information acquired during subsequent rounds of message passing. In another example, the compound representation facility 122 can update the graph representation based on the interatomic bond distance between the atoms. The compound representation facility 122 can use the updated graph representation to generate (e.g., learn) a compound representation of the atoms in the network based on the local atom and bond structure.
At step 308, the compound representation facility 122 can generate a compound representation of the compound. For example, the compound representation facility 122 can reduce or convert the graph representation of a compound into a compound representation of the compound. In some embodiments, the compound representation can comprise the nodes, the edges, and one or more properties of the compound. The property can be any desired property for any compound. Properties can include information associated with the structure describing properties of the structure.
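By way of non-limiting illustration, the weighted neighbor update described for step 306 may be sketched with scalar atom features. The fixed weights and the tanh nonlinearity are assumptions made for this sketch; in a trained model such weights would be learned rather than fixed.

```python
import math

def message_passing_round(features, neighbors, self_weight=0.5, neighbor_weight=0.25):
    """One update: each atom mixes its own feature with the sum of its neighbors'."""
    updated = []
    for i, h in enumerate(features):
        message = sum(features[j] for j in neighbors[i])
        # tanh is an assumed nonlinearity; the weights are fixed here, not learned.
        updated.append(math.tanh(self_weight * h + neighbor_weight * message))
    return updated

# Linear chain of three atoms: 0 - 1 - 2. Only atom 0 starts with a nonzero feature.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [1.0, 0.0, 0.0]
round_1 = message_passing_round(features, neighbors)
round_2 = message_passing_round(round_1, neighbors)
```

After the first round, atom 2 is still unaffected by atom 0 because the two are not immediate neighbors. After the second round, information from atom 0 has reached atom 2 through atom 1, with a smaller magnitude than it had at atom 1.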
In some embodiments, the compound representation can include structural features as well as functional features. Structural features might be indicative of functional features if structure informs function, and the structural information is captured. This may include information associated with an atom describing interactions with other atoms. For example, the properties can identify the atoms (e.g., carbon) and atomic weights of the compound. Examples of functional properties include whether the compound is blood brain barrier penetrant, bioavailable, and can bind to a particular target. In some embodiments, property information stored in a compound representation may be limited to structural property information.
In some embodiments, the compound representation facility 122 can store the compound representation of the compound in an array. For example, the compound representation can be in an array of information describing values for properties of interactions. In another example, the array can be a vector. The compound representation facility 122 can store values (e.g., numbers) in the array or vector for that compound. In some embodiments, to generate a compound representation of the compound, the compound representation facility 122 can identify an atom (e.g., alpha carbon within the molecule) of the compound and assign the value of the atom in the compound representation. In some embodiments, the values represent the compound and its atoms, bond, and properties. For example, values can indicate whether a certain compound is blood brain barrier penetrant, how well it binds a target, or how volatile it is. The values can be default values for each type of atom, bond, and its properties. In some embodiments, the compound representation facility 122 can store random values in the vectors. For example, the compound representation facility 122 can generate an unlearned compound representation for each compound.
Training Models For Property Determination, Including Using Representations
FIG. 4A is a flowchart of an illustrative method 400 of training a model to generate compound representations (e.g., digital representations) of compounds and identify property values of the compounds from the compound representations of the compounds to establish a library (e.g., trained library) of compounds. The method 400 may be performed by the compound representation facility 122 and/or the compound model facility 124 executed by the computing device 120, in some embodiments. For example, the compound representation facility 122 can train an encoder network to encode properties of compounds and a decoder network to decode those properties of the compounds. If a library of compounds can include encoded compounds and their property values that can be accurately decoded, then the library of compounds can be searchable and useful for compound related research. In some embodiments, performing step 202 of the method 200 includes performing at least some steps (e.g., steps 402-410) of the method 400. In some embodiments, performing step 206 of the method 200 includes performing at least some steps (e.g., step 412) of the method 400.
At step 402, the compound representation facility 122 can identify known property values of compounds in a library of compounds. The known property values can include structural properties of the compounds. For example, the known property values can indicate whether a particular compound is actually blood brain penetrant. In another example, the compound representation facility 122 can receive the known property values from a scientist or data store. For example, the training data for training the compound representation facility 122 can include a large collection of chemical molecules, such as databases of drug-like molecules including ChEMBL (3.7 million molecules), ZINC (970 million molecules), or GDB-17 (166 billion molecules). The compound representation facility 122 can generate a drug-like subset of the training data by enumerating theoretical chemical graphs up to a certain number of atoms. For example, the compound representation facility 122 can filter out the molecules that are not drug-like (“un-drug”). The compound representation facility 122 can identify un-drug molecules by defining a set of rules that determine the similarity of a compound to those which are known to be successful. Examples of rules include atom valency, solubility, molecular weight, and number of hydrogen bond donors.
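By way of non-limiting illustration, rule-based filtering of molecules that are not drug-like may be sketched as follows. The descriptor names, the example library, and the thresholds are assumptions made for this sketch; the thresholds follow commonly cited rule-of-five style limits and are not specified by this description.

```python
# Assumed rules: each maps a descriptor name to a pass/fail check.
RULES = {
    "mol_weight": lambda v: v <= 500.0,
    "h_bond_donors": lambda v: v <= 5,
    "h_bond_acceptors": lambda v: v <= 10,
    "log_p": lambda v: v <= 5.0,
}

def is_drug_like(descriptors, rules=RULES):
    """A molecule passes only if every rule accepts its descriptor value."""
    return all(check(descriptors[name]) for name, check in rules.items())

def filter_drug_like(library):
    return [entry for entry in library if is_drug_like(entry["descriptors"])]

# Hypothetical library entries with precomputed descriptors.
library = [
    {"name": "cmpd_a", "descriptors": {"mol_weight": 320.4, "h_bond_donors": 2,
                                       "h_bond_acceptors": 5, "log_p": 2.1}},
    {"name": "cmpd_b", "descriptors": {"mol_weight": 812.9, "h_bond_donors": 7,
                                       "h_bond_acceptors": 14, "log_p": 6.3}},
]
kept = [entry["name"] for entry in filter_drug_like(library)]
```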
At step 404, the compound representation facility 122 can train an encoder network to encode the compounds into compound representations from graph representations of the compound. In some embodiments, and as shown in FIG. 2B, the compound representation facility 122 can train the graph neural network layers (e.g., encoder network) 246 to convert the graph representations 248 into a vector representation (e.g., compound representation) 250 that can be converted to graph embeddings 222 to be maintained in a library of compounds (e.g., searchable embedding space 224). In some embodiments, the graph representations are generated by the compound representation facility 122 as discussed in method 300. For example, as shown in FIG. 4B, the compound representation facility 122 can receive graph representations 420 of the compounds.
In some embodiments, the compound representation facility 122 can include the encoder network 422 and train the encoder network to encode the graph representations 420 into respective compound representations 424. For example, as shown in FIG. 4B, the encoder network of the compound representation facility 122 can include block transformers to process the input regarding compounds. The training of the encoder in step 404 may be done in some embodiments using attention information 426, an adjacency matrix 428, and/or an atomic distance matrix 430, examples of which are discussed below.
In some embodiments, the intermediate graph representation 420 could be omitted, and the encoder network can be trained to create a compound representation 424 from an input chemical structure of the compound.
The compound representation facility 122 may receive, generate, and/or calculate an adjacency matrix 428 from input information regarding an input compound. The adjacency matrix may indicate interconnections of atoms in a compound, or interconnections between molecules (e.g., fragments) in a compound. For example, the facility 122 may generate the adjacency matrix with atoms (or molecules) in respective rows and columns and assign values to indicate whether the atoms for a cell (the atom for the row and the atom for the column) share a bond in the compound. For example, the compound representation facility 122 can assign a value of 1 to indicate a bond between a pair of atoms and 0 to indicate no connection between a pair of atoms. In some embodiments, the compound representation facility 122 can assign values to indicate information regarding the bond between atoms. For example, the compound representation facility 122 can assign 0 to indicate no bond, 1 to indicate a single bond, or 2 to indicate a double bond. As another example, values may indicate whether a bond is ionic, covalent, hydrogen, metallic, or van der Waals, or other information regarding a bond.
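By way of non-limiting illustration, an adjacency matrix that encodes bond order (0 for no bond, 1 for a single bond, 2 for a double bond) may be sketched as follows. The function name and the input format are assumptions made for this sketch.

```python
def adjacency_matrix(num_atoms, bonds):
    """Symmetric matrix whose cell (a, b) holds the bond order between atoms a and b."""
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = order
        matrix[b][a] = order  # bonds are undirected
    return matrix

# Heavy atoms of acetic acid (hydrogens omitted): C0-C1, C1=O2, C1-O3.
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
adjacency = adjacency_matrix(4, bonds)
```

For this input, the row for atom 1 is `[1, 0, 2, 1]`, indicating single bonds to atoms 0 and 3 and a double bond to atom 2.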
The compound representation facility 122 may also receive, generate, and/or calculate an atomic distance matrix 430 for a compound. The distance matrix may encode information about a 3D structure of the compound by indicating distances between atoms (or molecules) in a compound. The distances may be in angstroms or other suitable unit of measure. In some embodiments, the compound representation facility 122 may use in the matrix a measured distance between atoms for a particular conformer of a compound, while in other embodiments the facility 122 may calculate an average/mean, median, mode, standard deviation, or other calculation of atomic distance between atoms (or molecules), which may be calculated based on distances across different conformers of a molecule. In some embodiments, atom positions that are used in calculating distance may be determined using a position determination process such as the “ETKDG” method.
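By way of non-limiting illustration, an atomic distance matrix summarized across conformers may be sketched as follows. The coordinates are hypothetical and are supplied directly rather than generated by a conformer-generation method, and the choice of mean and population standard deviation as the summary statistics is an assumption made for this sketch.

```python
import math
from statistics import mean, pstdev

def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms of one conformer."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def conformer_distance_stats(conformers):
    """Per-pair mean and standard deviation of distance across conformers."""
    per_conformer = [distance_matrix(coords) for coords in conformers]
    n = len(per_conformer[0])
    mean_matrix = [[mean([m[i][j] for m in per_conformer]) for j in range(n)]
                   for i in range(n)]
    std_matrix = [[pstdev([m[i][j] for m in per_conformer]) for j in range(n)]
                  for i in range(n)]
    return mean_matrix, std_matrix

# Two hypothetical conformers of a three-atom chain (arbitrary length units).
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mean_matrix, std_matrix = conformer_distance_stats([bent, straight])
```

In this sketch, the distance between bonded atoms 0 and 1 is the same in both conformers, so its standard deviation is zero, while the distance between the end atoms 0 and 2 varies with the conformer.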
The compound representation facility 122 may also receive, generate, and/or calculate attention information 426 for a compound. This may include contextualized query vectors, key vectors, and value vectors for use in attention analysis.
Accordingly, in some embodiments, the compound representation facility 122 can be trained (e.g., learn) to generate the compound representations while retaining their structural information and information about the atoms and bonds of the compounds. Each compound representation can be unique for each compound conformer or indicate distance distribution of conformers of a molecule. For example, the compound representation facility 122 can be trained for representation of pharmaceutical chemical space but can also be trained on representation of any chemical space. In another example, the compound representation facility 122 can be trained on a dataset of proteins and ligands.
At step 406, the compound representation facility 122 can train a decoder 432 to generate decoded property values of the compounds from the compound representation of the compounds, such as structural properties. For example, as shown in FIG. 4B, the compound representation facility 122 can generate graph un-embeddings 434 of the encoded graph embeddings that were input into the compound representation facility 122. In some embodiments, the decoder can be a decoder network or a representation layer for predicting or identifying properties of the encoded compounds. By training the decoder, the compound representation facility 122 can establish a searchable chemical space to optimize and identify molecules based on their properties. The compound representation facility 122 can train the decoder to extract the property values of the compounds from the compound representations. For example, the compound representation facility 122 can train a generative neural network to reconstruct (e.g., decode) the property values of the property representations of the compounds. In some embodiments, the compound representation facility 122 can train the decoder to reconstruct chemical structures from vector representations of the compounds.
At step 408, the compound representation facility 122 can identify the decoded property values of the compounds. The compound representation facility 122 can identify the decoded property values from the compound representations. For example, the compound representation facility 122 can identify the decoded property values in the vector representation of the compounds. The decoded property values can indicate structural and/or functional properties of the compound.
At step 410, the compound representation facility 122 can determine whether the decoded property values of the compound correspond to the known property values of the compound. For example, the compound representation facility 122 can compare whether both the known property values and the decoded property values indicate that the compound is blood brain penetrant, or indicate a same degree or amount of penetrance. In some embodiments, the compound representation facility 122 can train the decoder to reconstruct chemical structures based on a loss function that can be conditioned by the decoder network or the properties (e.g., property values). In some embodiments, the loss can be the difference between the known property values (e.g., input) and the decoded property values (e.g., output).
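By way of non-limiting illustration, the comparison of step 410 may be sketched as a loss computed between known and decoded property values, followed by a threshold check. The use of mean squared error, the tolerance value, and the example property vectors are assumptions made for this sketch.

```python
def reconstruction_loss(known, decoded):
    """Mean squared difference between known and decoded property values."""
    return sum((k - d) ** 2 for k, d in zip(known, decoded)) / len(known)

def values_correspond(known, decoded, tolerance=1e-2):
    """Step 410 check: the decoded values are close enough to the known values."""
    return reconstruction_loss(known, decoded) <= tolerance

# Hypothetical property vectors, e.g. (BBB penetrant flag, binding score).
known = [1.0, 0.0]
good_decode = [1.0, 0.0]
poor_decode = [0.0, 0.0]
```

In this sketch, a decode that fails the check would correspond to returning to step 404 for further training, and a decode that passes would correspond to proceeding to step 412.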
If the decoded property values of the compound do not correspond to the known property values of the compound, the method 400 can return to step 404 for the compound representation facility 122 to continue training the encoder and decoder, so that the encoder stores information in the representation in a manner that may enable more accurate decoding and/or the decoder more precisely generates the decoded property values from the compound representation of the compound. For example, the compound representation facility 122 can continue retraining the encoder and/or decoder until the decoder is able to accurately identify the property values of any particular conformation. In some embodiments, the compound representation facility 122 can train the encoder and the decoder until they can accurately generate property representations of the compounds from the graph representations of the compounds, and then recreate the chemical structure of the compounds from the graph representations. For example, the training can be based on the fact that similar compounds might have similar structures and properties.
If the decoded property values of the compound correspond to the known property values of the compound, the method 400 can proceed to step 412 for the compound representation facility 122 to output a trained library of compounds (e.g., trained chemical representation space or trained embedding space). For example, if the decoded and known values match, then the library of compounds can include encoded compounds and their property values that can be accurately decoded, which allows the library of compounds to be searchable and usable for compound related research.
In some embodiments, as mentioned in FIG. 4B and as shown in FIG. 4C, as part of the training of method 400 of FIG. 4A, the facility 122 may conduct partial masking 436 of compounds. In the masking, one or more parts of the compound (e.g., one or more atoms or molecules, such molecules being portions of the compound such as a fragment, and bonds of those atoms/molecules) may be removed from the input such that information regarding the atoms and bonds of the masked part(s) is not input to the encoding process. The input representation may include a placeholder “dummy” token in place of the masked atom/molecule. During training, the model may be evaluated by how well the model decodes an encoded representation of the masked compound to recreate the original unmasked compound. For example, an input compound may have a portion between 5% and 25% of the overall compound (e.g., between 10% and 20%, such as 15%) masked. In embodiments that use such masking, the information may not be used to determine the adjacency matrix, such that the model does not receive information about the interconnections of the masked atoms/molecules, including interconnections to the unmasked parts of the compound. In some embodiments, the information on the masked portion may also not be used in determining atomic distance. However, the information may be used in determining the attention vectors (query, key, and value). In these embodiments, the model is thus trained with information indicating that there are atoms/molecules included in the compound, but not indicating the arrangement (e.g., position and/or interconnection) of those atoms/molecules. During decoding, the facility 122 determines whether the model accurately determined the arrangement of the atoms/molecules. The model is trained until a performance criterion is met for recreation of the original, unmasked compound. In doing so, the model is trained to reliably determine such information for additional molecules that have not yet been synthesized or for which arrangement information is unknown.
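By way of non-limiting illustration, the partial masking described above may be sketched for the variant in which a placeholder token replaces each masked atom and the bonds of masked atoms are withheld from the adjacency matrix. The token string, the random selection of atoms, the seed, and the 20-atom carbon chain are assumptions made for this sketch.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder "dummy" token

def mask_compound(atoms, adjacency, fraction=0.15, seed=0):
    """Mask a fraction of atoms and hide every bond that touches a masked atom."""
    rng = random.Random(seed)
    n = len(atoms)
    count = max(1, round(fraction * n))
    masked = set(rng.sample(range(n), count))
    tokens = [MASK_TOKEN if i in masked else atom for i, atom in enumerate(atoms)]
    masked_adjacency = [
        [0 if (i in masked or j in masked) else adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]
    return tokens, masked_adjacency, masked

# Hypothetical 20-atom carbon chain with single bonds between consecutive atoms.
atoms = ["C"] * 20
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(20)] for i in range(20)]
tokens, masked_adjacency, masked = mask_compound(atoms, adjacency)
```

The original `atoms` and `adjacency` are left unchanged so that they can serve as the reconstruction target against which a decoded output is evaluated.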
Property Determination for Compounds
FIG. 5A is a flowchart of an illustrative method 500 of decoding a compound representation (e.g., digital representation) of a compound to recreate a chemical structure of the compound. The method 500 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments.
At step 502, the compound representation facility 122 can identify a compound representation of a compound. For example, the compound representation facility 122 can retrieve a vector that represents the compound, other representation described herein, or other representation. In some embodiments, the compound representation facility 122 can receive a compound (e.g., input molecule) for which to identify properties.
At step 504, the compound representation facility 122 can extract property values of the compound from the compound representation of the compound. For example, the compound representation facility 122 can query or identify the property values stored in the vector. In some embodiments, the compound representation facility 122 can reconstruct (e.g., decode) the property values of the property representations of the compounds.
At step 506, the compound representation facility 122 can identify, from the property values, properties of the compound. For example, the compound representation facility 122 can identify that decoded property values indicate structural properties for a compound, and/or may indicate functional properties for a compound such as whether a particular compound is blood brain penetrant.
At step 508, the compound representation facility 122 can output the properties of the compound. For example, the compound representation facility 122 can reconstruct the compound from its compound representation. The reconstructed compound can be output to a user, such as a scientist researching the compound.
Identifying Compound Having Desirable Properties
FIG. 5B is a flowchart of an illustrative method 550 of decoding a compound representation (e.g., digital representation) of a compound to synthesize the compound. The method 550 may be performed by the compound representation facility 122 executed by the computing device 120, in some embodiments.
At step 552, the compound representation facility 122 can query a library of compounds for a compound having properties of interest. For example, the compound representation facility 122 can receive a desirable set of properties, and then query all the compounds in the library of compounds for compounds having the desired set of properties. The library of compounds can be a searchable data store of compounds. For example, the library of compounds can be a searchable embedding space of molecules and/or an information-rich embedding space capturing information relevant to chemistry (e.g., rules of chemistry, relationships between structural and functional properties of compounds, etc.). In some embodiments, the compound representation facility 122 can use algorithms to find compounds in the library of compounds. For example, the compound representation facility 122 can use Bayesian optimization to find molecules in the chemical space maintained by the library of compounds. In some embodiments, the library of compounds can be stored on the computing device 120 and searchable via the computing device 110. In some embodiments, the library of compounds can be queried by the client 115 via the computing device 110 to look up the properties of any compound.
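By way of non-limiting illustration, querying a library for compounds whose stored property values are closest to a desired set of properties may be sketched as follows. This sketch uses an exhaustive nearest-neighbor ranking rather than Bayesian optimization, and the library contents, property ordering, and function name are hypothetical.

```python
import math

def query_library(library, desired, top_n=2):
    """Rank compounds by Euclidean distance between stored and desired properties."""
    ranked = sorted(library.items(), key=lambda item: math.dist(item[1], desired))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical stored property values: (binding score, BBB penetration score).
library = {
    "cmpd_a": (0.9, 0.2),
    "cmpd_b": (0.7, 0.8),
    "cmpd_c": (0.1, 0.1),
}
top = query_library(library, desired=(0.8, 0.9))
```

An exhaustive ranking of this kind is practical only for small libraries; for a large embedding space a guided search would be used instead.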
At step 554, the compound representation facility 122 can identify, in the library of compounds, a compound representation for the compound having the properties of interest. The library of compounds can be generated and maintained as discussed in reference to FIGS. 3A, 3B, 3C and 4A. The library of compounds can include a respective compound representation for each compound (or one or more (e.g., all) compound conformers of each compound) in the library of compounds. The respective compound representation enables the compound representation facility 122 to query the library of compounds for any compound and reconstruct any compound molecule from its compound representation. For example, the library of compounds can maintain a one-to-one mapping between each respective compound representation and each compound conformer. The compound representation facility 122 can reconstruct the compound from the one-to-one mapping to the compound representation. For example, the Bayesian optimization algorithm can search the embedding space if it maps one-to-one to chemistry space.
At step 556, the compound representation facility 122 can extract, from the compound representation of the compound, properties of the compound, wherein the properties include the properties of interest. For example, the compound representation facility 122 can query or identify the property values stored in the vector. In some embodiments, the compound representation facility 122 can reconstruct (e.g., decode) the property values of the property representations of the compounds.
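Steps 552 through 556 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each compound representation is a fixed-length vector in which named slots hold property values. The library contents, the `PROPERTY_SLOTS` layout, and the function names are hypothetical and are not taken from the facility described here.

```python
# Hypothetical layout: positions in the representation vector that hold property values.
PROPERTY_SLOTS = {"solubility": 2, "bbb_penetrant": 3}

# Toy library: a one-to-one mapping from a compound identifier (here a SMILES string)
# to its representation vector. All numbers are made up for illustration.
LIBRARY = {
    "CCO": [0.12, 0.80, 0.9, 1.0],
    "c1ccccc1": [0.55, 0.10, 0.2, 1.0],
    "CC(=O)O": [0.31, 0.47, 0.95, 0.0],
}

def extract_properties(representation):
    """Step 556: read the property values stored in the representation vector."""
    return {name: representation[idx] for name, idx in PROPERTY_SLOTS.items()}

def query_library(library, criteria):
    """Steps 552-554: return compounds whose properties satisfy every criterion.

    `criteria` maps a property name to a predicate over that property's value.
    """
    hits = {}
    for compound, representation in library.items():
        props = extract_properties(representation)
        if all(test(props[name]) for name, test in criteria.items()):
            hits[compound] = props
    return hits

# Desired set of properties: reasonably soluble and predicted to be BBB penetrant.
desired = {"solubility": lambda v: v >= 0.5, "bbb_penetrant": lambda v: v == 1.0}
matches = query_library(LIBRARY, desired)
```

A linear scan like this is only practical for small libraries. The text above mentions Bayesian optimization as one way to search a large embedding space instead.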
At step 558, the compound representation facility 122 can output the properties of the compound for synthesis. For example, the compound representation facility 122 can reconstruct the compound from its compound representation. The reconstructed compound can include the chemical structure of the compound and the properties of the compound. The reconstructed compound can be output to a user, such as a scientist that wants to synthesize the compound.
At step 560, the compound representation facility 122 can synthesize the compound to generate at least one synthesized compound for testing. A synthesized compound may be used in a variety of ways, including in research to confirm properties of a compound or confirm whether a compound is suitable for a function (e.g., binding to a target, addressing a medical condition, or other function). Any suitable synthesis techniques can be used to synthesize the compound.
Property Prediction for Compounds
FIG. 6A is a flowchart of an illustrative method 600 of training a model to predict a value for a property of one or more compounds, which may be done using compound representations (e.g., digital representations) of the compounds such as representations discussed above. The method 600 may be performed by the compound model facility 124 executed by the computing device 120, in some embodiments. For example, method 600 can include the compound model facility 124 predicting values for a property for an entire library of compounds. In some embodiments, performing step 204 of the method 200 includes performing the method 600. Some aspects of the method 600 of training a model are illustrated in FIG. 6B.
At step 602, the compound model facility 124 can identify a plurality of compound representations. Each of the plurality of compound representations can represent a respective compound in the library of compounds. For example, the compound model facility 124 can access the library of compounds generated and maintained by the compound representation facility 122 as discussed in reference to FIGS. 3A and 4A. In some embodiments, the compound model facility 124 can be trained on a graph (e.g., instead of a vector generated from the graph) or with some other data structures that represent compounds. For example, the compound model facility 124 can train on any representation of compounds 620, such as molecule patterns of a protein and its bound ligand form or whether the molecule is blood brain barrier penetrable.
At step 604, the compound model facility 124 can train 622 a machine learning model on the plurality of compound representations for predicting values for properties of each respective compound in the library of compounds. For example, the compound model facility 124 can predict properties of compounds from the compound representation of the compounds. In some embodiments, the compound model facility 124 can leverage the graph representation to train the model to predict properties of each respective compound. For example, the compound model facility 124 can use the graph representation to pre-calculate property predictions of the compounds in the library of compounds (e.g., a large-scale chemical library). In some embodiments, the compound model facility 124 can train the model (e.g., graph transformer models) such that the model learns a general understanding of the chemical properties (e.g., structure and function) of the compounds.
At step 606, the compound model facility 124 can utilize the machine learning model to generate predicted property values indicative of the properties of the respective compound. For example, the generated predicted property values can be numeric values in a vector or an array.
At step 608, the compound model facility 124 can determine 624 whether the predicted property values of the respective compound correspond to known property values of the respective compound. The predictions can be output to assess the accuracy of the model. By assessing the accuracy of the model, its parameters can be improved and updated to improve the predictive performance of the model. In some embodiments, the compound model facility 124 can compare the predicted property values (e.g., numerical vector) to the known property values (e.g., input label). For example, the compound model facility 124 can condition the model based on a joint loss function to compare the predicted property values and known property values. In some embodiments, the compound model facility 124 can use a comparison between the predicted property values and the known property values to evaluate the performance of the model. For example, a well performing model would not have a significant difference between the predicted property values and the known property values, while a model that does not perform well would generate predicted property values that are significantly different from the known property values.
If the predicted property values do not correspond to the known property values of the respective compound, then the method 600 can proceed to step 604 for the compound model facility 124 to re-train the machine learning model to generate the predicted property values of the respective compound. As the compound model facility 124 is trained on more data, the predictions may become more accurate.
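The loop through steps 604, 606 and 608 can be sketched as follows. This is a minimal illustration that swaps in a linear model trained by stochastic gradient descent for the graph transformer models mentioned above. The toy data, the `TOLERANCE` threshold and the `MAX_ROUNDS` guard are assumptions made for the example.

```python
def predict(weights, bias, x):
    """Linear model: weighted sum of the representation entries plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mean_abs_error(weights, bias, data):
    """Average absolute gap between predicted and known property values."""
    return sum(abs(predict(weights, bias, x) - y) for x, y in data) / len(data)

def train_epoch(weights, bias, data, lr=0.1):
    """One pass of stochastic gradient descent on squared error (step 604)."""
    for x, y in data:
        err = predict(weights, bias, x) - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err
    return weights, bias

# Toy (representation, known property value) pairs; values are illustrative only.
data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 2.0), ([1.0, 1.0], 3.0), ([0.0, 0.0], 0.0)]

weights, bias = [0.0, 0.0], 0.0
TOLERANCE = 0.05    # assumed threshold for "predictions correspond to known values"
MAX_ROUNDS = 1000   # guard so the retraining loop always terminates

for round_idx in range(MAX_ROUNDS):
    weights, bias = train_epoch(weights, bias, data)   # step 604: (re-)train
    error = mean_abs_error(weights, bias, data)        # steps 606-608: predict, compare
    if error <= TOLERANCE:                             # correspondence reached: step 610
        break
```

For brevity the sketch evaluates on the same examples it trains on. In practice the comparison at step 608 would more likely use held-out compounds with known property values.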
If the predicted property values correspond to the known property values of the respective compound, then the method 600 can proceed to step 610. At step 610, the compound model facility 124 can output the trained model, which has been trained to predict one or more known properties of compounds. In some embodiments, the compound analysis facility 126 can use the machine learning model to predict the properties of the library of compounds. The compound model facility 124 can include fine-tuned property models (e.g., trained models) that the compound analysis facility 126 can use to output the predicted properties for compounds in a large chemical database (e.g., library of compounds) to establish a large chemical property database. For example, the compound analysis facility 126 can provide the predicted properties to the quantum annealer 130 as input for the optimization criteria to identify and rank compounds having the predicted properties. In another example, the compound analysis facility 126 may be adapted to provide the predicted properties to conventional computing hardware to analyze the compounds.
Model Training
FIG. 6C is a flowchart of an illustrative method 650 of training a model to predict new (e.g., previously unknown) properties of the compounds. In some embodiments, performing step 204 of the method 200 includes performing the method 650. Some aspects of the method 650 of training a model to predict new properties of compounds are illustrated in FIG. 6D.
The method 650 may be performed by the compound model facility 124 executed by the computing device 120, in some embodiments. The model may be trained to predict new properties of compounds using training data. The training data can include either no or few labeled training examples of compounds having the new properties, but the compound model facility 124 can leverage an existing model to train a new model to predict values for a new property, such as for compounds in a library of compounds. In some embodiments, as shown in FIG. 6D, the compound model facility 124 can fine-tune an existing model (e.g., with minor adjustments) to predict new properties of the compounds.
At step 652, the compound model facility 124 can receive identification of a new property of interest and training data for the new property of interest. In some embodiments, the compound model facility 124 can identify property values for the new property of interest in the training data. For example, the property values can be numeric values in a vector or an array, which may include data that has been retrieved from a source of data regarding compounds (e.g., a public source of data), obtained through testing of compounds, or otherwise obtained.
At step 654, the compound model facility 124 can modify, based on the new property of interest and the training data, a machine learning model trained to predict one or more properties of compounds. The training data can be used to train the compound model facility 124 to predict a value for the new property of interest for each compound of the library.
As discussed above, training a new, wholly untrained model to predict a value for a property with acceptable accuracy or reliability may require a large amount of training data for compounds that have or do not have the property, or that have a range of values of the property. For some properties, such as previously unstudied or understudied properties, such an amount of data may not be available, or there may be other hurdles to obtaining sufficient data. In some embodiments, a smaller amount of training data may be used to reliably train a model, by leveraging an existing model that had been trained to predict a value for a different property of compounds.
In some embodiments, an existing model may be a neural network and may include layers that were previously trained to predict a value for another property. In some embodiments, the facility 124 may use these layers as a backbone for a new model that is created by editing the existing model. The compound model facility 124 can learn or generate a non-linear representation of the latent layer(s) of the existing model to produce learned representations of the data that become more complex with the addition of a new layer to the neural network.
In some cases, the existing layers of the model may be layers that have learned information on structure of compounds, or other functional properties of compounds, or general information on classes of compounds. In the existing model, those layers may feed one or more classifier layers or other layers that output information on a value of a property for an input compound. Those later layers of the model may be specific to the one property for which that model is designed, but those earlier layers and the parameters with which they are configured as a result of earlier training may be reusable in models that predict a value for a different property for an input compound. Such information may be useful in a network that is to predict a value for another property. Accordingly, through techniques described herein, in some embodiments the model may be edited in a way that allows for retaining some parts of the model (e.g., one or more layers of a neural network) while editing the model. Such editing may include adding, removing, or adjusting a part of the model, such as adding, removing, or adjusting layers of a neural network. In some such embodiments, a new output layer can be added to a model following removal of a prior output layer, where the output layer may be a layer that predicts a value for a property of input compounds. For example, the added layer of the model may predict whether an input compound is blood brain barrier penetrant or not, or an amount or degree of penetrance.
In some embodiments, the compound model facility 124 can add a new layer for the new property to an existing model to retrain the model for predicting the new property. In some embodiments, the compound model facility 124 can add one or more new layers for predicting the new property. In some embodiments, the compound model facility 124 can add or append a new layer for the new property to the network. The compound model facility 124 can then train the model, including the new layer, on the new property. In some embodiments, the compound model facility 124 can copy the existing layers for predicting the existing properties and combine the copied layers with the one or more new layers for predicting the new property.
In some embodiments, the compound model facility 124 can remove one or more layers related to the existing properties and add one or more new layers for predicting the new properties. In some embodiments, the layer being removed can be the layer responsible for decoding the chemical representation. In some embodiments, the prediction layer can be removed, and the remaining representation network can be copied over and attached to a new untrained layer. For example, the initial network can be removed while all the weights in the model are copied, and the model can then be fine-tuned on a data set (e.g., a blood brain barrier data set). The fine-tuning can be for the new property. Embodiments are not limited to operating with any particular property or type of property. Examples of properties include solubility, blood brain barrier penetration, toxicity, synthesizability, and protein ligand binding. The compound model facility 124 can then pre-calculate these categories on a large dataset such as GDB-17. In some embodiments, the compound model facility 124 can adjust an existing layer to incorporate the training data.
At step 656, the compound model facility 124 can train the machine learning model to identify the new property of interest in a set of compounds of the library of compounds. For example, the compound model facility 124 can train the machine learning model by running the training data of step 652 through the model. The compound model facility 124 can be trained to predict the new property with one or more layers at the output side of the network. For example, only the new layers need to be trained but the machine learning model as a whole is trained to predict whether any of the compounds include the new property of interest.
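The idea of retaining earlier layers while training only a newly added output layer can be sketched as below. This is a toy illustration rather than the facility's implementation. The `backbone` function stands in for previously trained layers and its weights are hypothetical, the new output layer is a single logistic unit, and the four labeled examples are made up.

```python
import math

def backbone(x):
    """Stand-in for previously trained layers; its weights are frozen (never updated)."""
    W = [[1.0, -1.0], [0.5, 0.5]]   # hypothetical pretrained weights
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]   # ReLU units

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_new_head(data, epochs=500, lr=0.5):
    """Train only the new output layer on the small labeled set for the new property."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in data:
            feats = backbone(x)                       # features from the frozen layers
            p = sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)
            grad = p - label                          # gradient of log-loss w.r.t. the logit
            w = [wi - lr * grad * f for wi, f in zip(w, feats)]
            b -= lr * grad
    return w, b

def predict_new_property(x, w, b):
    """Probability that the input compound has the new property."""
    feats = backbone(x)
    return sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)

# Few labeled examples for the new property (1 = penetrant, 0 = not); toy values.
train = [([2.0, 0.0], 1), ([1.5, 0.2], 1), ([0.0, 2.0], 0), ([0.2, 1.5], 0)]
w, b = train_new_head(train)
```

Because the backbone parameters are never updated, only three numbers are fitted here, which is why a small labeled set can suffice in this kind of setup.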
In some embodiments, the compound model facility 124 can use the machine learning model to identify or generate property values for the new property of interest in the compounds. For example, the property values can be numeric values in a vector or an array. By training the compound model facility 124 with existing layers and then adding the new layers, the compound model facility 124 can be trained without any labeled training data. By not needing labelled training data, the compound model facility 124 can enumerate chemical structures to a desired size. If the compound model facility 124 is trained without labeled training data or validation, the method 650 can proceed to step 660. If the compound model facility 124 is trained with a validation step, the method 650 can proceed to step 658.
At step 658, the compound model facility 124 can determine whether the compounds predicted to have the new property of interest correspond to the compounds expected to have the new properties of the set of compounds. The predictions can be output to assess the accuracy of the model. By assessing the accuracy of the model, its parameters can be improved and updated to improve the predictive performance of the model. For example, a scientist or subject matter expert can provide information about which compounds are expected to have the new properties. In some embodiments, the compound model facility 124 can compare the predicted new property values (e.g., numerical vector) to the expected property values (e.g., input label) for those compounds. For example, the compound model facility 124 can condition the model based on a joint loss function to compare the predicted property values and expected property values. In some embodiments, the compound model facility 124 can use a comparison between the predicted new property values and the expected property values to evaluate the performance of the model. For example, a well performing model would not have a significant difference between the predicted property values and the expected property values, while a model that does not perform well would generate predicted property values that are significantly different from the expected property values.
If the predictions for which compounds have the new property do not correspond to the compounds expected to have the new property, the method 650 can proceed to step 656 for the compound model facility 124 to re-train the machine learning model to identify the new property of interest in the library of compounds.
If the predictions for which compounds have the new property correspond to the compounds expected to have the new property, the method 650 can proceed to step 660 for the compound analysis facility 126 to use the trained machine learning model to predict the new property of interest in the library of compounds. For example, the compound analysis facility 126 can use the machine learning model to provide the predicted properties to the quantum annealer 130 as input for the optimization criteria to identify and rank compounds having the new property. In another example, the compound analysis facility 126 may be adapted to provide the input to classical computing hardware.
Quantum Computing Techniques
FIG. 7A illustrates a method 700 for identifying compounds to analyze. The method 700 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments. At step 702, the compound analysis facility 126 can identify compounds having properties of interest. In some embodiments, the compound analysis facility 126 can receive the properties of interest from the client 115 via the client interface 112 executing on the computing device 110. In other embodiments, properties of interest may additionally or alternatively be determined by the compound analysis facility 126 through analysis of input compounds. The input compounds may be ones that are identified by the client 115 as performing a function or performing a function in a manner that satisfies one or more criteria, such as performing the function with a desired effectiveness.
For example, to identify properties of interest that define how well a drug molecule, DNA, RNA, or peptide would perform, the compound analysis facility 126 can obtain or compute data of existing molecules that have known favorable properties of interest. In some embodiments, to optimize compounds (e.g., DNA, RNA, or peptides), the compound analysis facility 126 can extract properties of interest for those compounds. In some embodiments, to optimize drug compounds, the compound analysis facility 126 can identify properties of interest that define or increase the likelihood of interaction with a protein target. In some embodiments, the compound analysis facility 126 can identify features that contribute specific desirable properties.
In some embodiments, as shown in FIG. 7B, the compound analysis facility 126 can obtain 3D structures of compounds (e.g., proteins, ligands, etc.) with known favorable properties for a particular function of interest, from which to identify properties of interest. If the 3D protein structure is unavailable or does not exist, the compound analysis facility 126 can predict the structure of the known compounds. In some embodiments, the compound analysis facility 126 can identify properties of interest for other therapy modalities, for example, DNA, RNA, or peptide-based therapies. The compound analysis facility 126 can identify features and properties specific to the performance in those therapy modalities.
At step 704, the compound analysis facility 126 can generate compound conformers. In some embodiments, the compound analysis facility 126 can identify compounds having low energy states. In some embodiments, as shown in FIG. 7C, the compound analysis facility 126 can generate a plurality of 3D shapes of a compound, which may be 3D shapes of the compound that would be present in low energy states. In some embodiments, as shown in FIG. 7G, the compound analysis facility 126 can obtain known inhibitors and, as shown in FIG. 7H, sanitize and relax the conformers.
Referring back to FIG. 7A, at step 706, the compound analysis facility 126 can analyze the target(s) for compounds of interest. The target(s) may be analyzed in connection with a 3D structure of the target(s). The target may be a protein or other molecule that a compound of interest is to interact with, such as binding with. In some embodiments, as shown in FIG. 7D, the compound analysis facility 126 can detect one or more binding sites of the protein target. The binding site may be a binding pocket. For a binding site, a composition and/or structure of the binding site may be detected. For example, a molecular composition and structure may be determined.
Referring back to FIG. 7A, at step 708, the compound analysis facility 126 can analyze the known compounds in connection with how the known compounds dock with the protein target, for example, how they dock with the identified binding site(s). In some embodiments, as shown in FIG. 7E and FIG. 7F, the compound analysis facility 126 can in silico dock compounds into a protein target at the binding site(s) and analyze how the binding is executed. The analysis may identify properties of the known compounds that are related to the docking with the binding site(s), such as a composition and/or structure of the compounds that relates to the docking.
Referring back to FIG. 7A, at step 710, the compound analysis facility 126 can align the known compounds. For example, the compound analysis facility 126 can align the generated 3D conformers together in 3D space.
At step 712, the compound analysis facility 126 can identify, using the aligned conformers and the analysis of the docking, one or more compound properties of interest in the aligned compounds. The compound properties of interest can be pharmacophore features in aligned compounds. For example, the compound analysis facility 126 can determine pharmacophores for the input compounds. The pharmacophore properties can take the form of abstract molecular features related to a ligand’s interaction with a biological macromolecule (e.g., protein). The compound analysis facility 126 can use the pharmacophores to determine properties that are present in the compounds and may be related to performing the function or performing the function in the manner that satisfies the criteria.
In some embodiments, as shown in FIGS. 7I-7M, the compound analysis facility 126 can abstract the aligned functional groups and features into numerical features (e.g., number of hydrogen bond donors/acceptors). As shown in FIG. 7I, FIG. 7J, and FIG. 7K, the compound analysis facility 126 can generate a visual representation of relevant pharmacophores for a given ligand. As shown in FIG. 7I, the hydrogen bond donors are visualized. As shown in FIG. 7J, the hydrogen bond acceptors are visualized. As shown in FIG. 7K, the hydrophobics are visualized. As shown in FIG. 7L and FIG. 7M, the compound analysis facility 126 can generate a visual representation of a ligand’s interaction with amino acid residues in a given protein binding site. The compound analysis facility 126 can transmit the visual representations (e.g., as shown in FIGS. 7I-7M) to the computing device 110 for display in the client interface 112 to the client 115.
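Abstracting pharmacophore features into numerical features can be as simple as counting features by type. The sketch below assumes that an upstream step has already annotated each ligand with typed pharmacophore points. The ligand names, feature labels, coordinates and `FEATURE_ORDER` are hypothetical.

```python
from collections import Counter

# Hypothetical annotations per ligand: (feature type, (x, y, z)) tuples, as might be
# produced by an upstream pharmacophore perception step.
ligand_features = {
    "ligand_A": [("donor", (0.0, 1.2, 0.3)), ("acceptor", (1.1, 0.4, 0.0)),
                 ("acceptor", (2.3, 0.1, 0.9)), ("hydrophobic", (3.0, 1.5, 0.2))],
    "ligand_B": [("donor", (0.1, 1.1, 0.4)), ("donor", (1.9, 0.2, 0.8)),
                 ("hydrophobic", (3.1, 1.4, 0.1))],
}

# Fixed column order so every ligand yields a vector with the same layout.
FEATURE_ORDER = ("donor", "acceptor", "hydrophobic")

def to_numeric_features(features):
    """Collapse annotated pharmacophore points into per-type counts."""
    counts = Counter(kind for kind, _ in features)
    return [counts.get(kind, 0) for kind in FEATURE_ORDER]

numeric = {name: to_numeric_features(feats) for name, feats in ligand_features.items()}
```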
FIG. 8A illustrates a method 800 for determining and/or analyzing criteria. The method 800 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments. In an example in which the compounds are drugs, the criterion may be that molecules satisfy “Lipinski’s Rule of 5,” which defines a space of druglike molecules that have pharmacokinetic properties within the human body that make them more likely candidates for drugs than molecules that do not meet the rule. In this example, when designing a drug, it might be important to identify molecules that have one hydrogen bond donor, so the property of interest can be that the compound has one hydrogen bond donor. However, the quantum annealer 130 might not be able to execute standard executable instructions (e.g., standard computer code). The computing device 120 can therefore translate the inputted properties of interest into a format that is compatible with the quantum annealer 130 by adapting a compound property analysis to values (e.g., values based on a QUBO) with which the computation may be performed by quantum computing hardware, configure the quantum annealer 130 with the manner in which the data is to be processed, and provide the data to the quantum annealer 130 for processing.
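As an illustration of such a criterion, the sketch below checks the commonly stated Rule of 5 thresholds: molecular weight of at most 500 daltons, logP of at most 5, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. It also derives the binary "one hydrogen bond donor" property from the example. The `Descriptors` class, the tolerance of one violation, and the descriptor values are assumptions for the example. In practice the descriptors would be computed by a cheminformatics toolkit.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Descriptor values for one compound; assumed to be computed upstream."""
    mol_weight: float        # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int

def rule_of_five_violations(d):
    """Count how many of the four commonly stated thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])

def satisfies_rule_of_five(d, max_violations=1):
    # Assumption: at most one violation is tolerated, a common reading of the rule.
    return rule_of_five_violations(d) <= max_violations

def has_one_h_bond_donor(d):
    """Binary value for the property of interest used in the example above."""
    return int(d.h_bond_donors == 1)

# Illustrative descriptor values, not measurements of real compounds.
candidate = Descriptors(mol_weight=310.4, log_p=2.7, h_bond_donors=1, h_bond_acceptors=4)
oversized = Descriptors(mol_weight=720.9, log_p=6.3, h_bond_donors=7, h_bond_acceptors=12)
```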
At step 802, the compound analysis facility 126 can identify a set of compounds to be analyzed. In some embodiments, the compound analysis facility 126 can identify a set of compounds to analyze based on the identified properties of interest. Examples of the size and scope of the set of compounds include a small library of molecules in similar chemical space or a larger space defined by molecular criteria set by a user. Such molecular criteria may be broad, such as all molecules with a number of atoms less than or equal to a number, or all molecules that satisfy the Rule of 5 mentioned above, or similar criteria. A user may specify other criteria, such as to focus the analysis on chemical space of particular interest to a user, such as all compounds that include a certain atom or molecule.
When generating libraries of similar molecules, the compound analysis facility 126 can identify the core substructure of active compounds from compounds used to generate the compound properties of interest (e.g., pharmacophores). FIG. 8B shows examples of pharmacophores, and FIG. 8C shows examples of properties of interest that may have been derived from pharmacophores. The compound analysis facility 126 can append this substructure by using a library of chemical fragments to produce a set of compounds that include chemicals enumerated around the core structure. While this approach may be advantageous in some cases, it can also risk biasing analysis to a similar chemical space to existing compounds and so may not be an appropriate technique in all embodiments.
FIG. 8D shows an example set of compounds. The compound analysis facility 126 can generate the set of compounds to search for optimal or desirable compounds, such as those predicted to have a desirable combination of properties for drugs. For example, the set of compounds can be a library of compounds to be analyzed by the quantum annealer 130 to determine a subset of the compounds that meet one or more criteria, such as those predicted to have a desirable combination of properties for drugs. The compound analysis facility 126 can generate the set of compounds by enumerating all possible theoretical molecules up to a certain number of atoms.
Referring back to FIG. 8A, in some embodiments, when the compound analysis facility 126 analyzes other therapy modalities, the compound analysis facility 126 can generate a set of compounds by enumerating over peptide, RNA, or DNA sequences up to a certain number of residues. The compound analysis facility 126 can enumerate by generating an algorithm that generates possible combinations of compounds up to a certain number of atoms. In some embodiments, as shown in Table 1, to generate the set of compounds by enumerating, the compound analysis facility 126 can use the algorithm to start from a single carbon atom and add possible atoms in a bond with it up to and including the number of desired compounds.
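A heavily simplified sketch of enumeration from a single carbon atom is shown below. It is a toy stand-in, not the algorithm of Table 1. It only builds unbranched, singly bonded chains grown from one end, over an assumed alphabet of carbon, nitrogen and oxygen, and performs no chemical validity or stability checks. A realistic enumerator would also handle branching, rings, bond orders and valence rules.

```python
from itertools import product

ALPHABET = ("C", "N", "O")   # assumed heavy-atom types; hydrogens are implicit

def enumerate_chains(max_atoms):
    """Enumerate unbranched heavy-atom chains that start at a carbon atom.

    A chain and its reverse describe the same molecule, so each chain is stored
    under the lexicographically smaller of its two spellings to avoid duplicates.
    """
    seen = set()
    for n in range(1, max_atoms + 1):
        # Start from a single carbon and append every combination of n - 1 atoms.
        for tail in product(ALPHABET, repeat=n - 1):
            chain = "C" + "".join(tail)
            seen.add(min(chain, chain[::-1]))
    return sorted(seen, key=lambda s: (len(s), s))

small_set = enumerate_chains(2)
larger_set = enumerate_chains(3)
```

Even this toy version grows exponentially with the atom limit, which is why exhaustive enumeration is normally capped at a modest number of atoms.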
Figure imgf000056_0001
Table 1: Example enumeration of compounds up to a number of atoms
At step 804, the compound analysis facility 126 can generate one or more criteria. In some embodiments, the compound analysis facility 126 can use information from the pharmacophores, or other input from a user or information regarding compounds, to determine the criteria regarding properties for compounds that, when present in a compound, may lead to the compound performing a function or performing a function in a manner that satisfies one or more criteria. For example, the generation of properties of interest (e.g., pharmacophores) can enable the compound analysis facility 126 to identify a list of features that a compound can have to increase the likelihood of activity. The compound analysis facility 126 can derive the features that are relevant to a set of physiochemical properties (e.g., solubility, Blood Brain Barrier (BBB) penetration). The physiochemical properties can be in the form of machine learning model predictions, presence/absence of chemical fragments, or other calculable features which relate to activity. For example, the compound analysis facility 126 can generate the Quantitative Structure Activity Relationships (QSAR).
The compound analysis facility 126 can generate the one or more criteria to search the set of compounds against the one or more criteria. In some embodiments, once properties of interest are determined, the compound analysis facility 126 can determine the one or more criteria related to the properties of interest and/or to values for those properties. In some such embodiments, the one or more criteria enable a description of those properties with respect to a binary value, such as whether the property is present or not in a compound or whether a criterion with respect to the property (e.g., a value above or below a threshold) is satisfied for the compound.
The compound analysis facility 126 can generate the one or more criteria in a format that is processable by the quantum annealer 130. For example, for a list of compounds, the compound analysis facility 126 can generate a binary value with respect to each property and for each compound. These values may be arranged in a matrix of values, where each row represents a compound, and each column represents a property. In some embodiments, the values can be discrete or continuous values. For example, the discrete values can be binary values (e.g., 0 or 1). In another example, the continuous values can be any number from 0 to 1, such as 0.8.
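The matrix described above can be sketched as follows, with one row per compound and one column per criterion. The compound names, property values and thresholds are hypothetical.

```python
# Hypothetical property values per compound (e.g., computed or model-predicted).
compounds = {
    "cmpd_1": {"h_bond_donors": 1, "solubility": 0.8, "bbb_score": 0.3},
    "cmpd_2": {"h_bond_donors": 3, "solubility": 0.6, "bbb_score": 0.9},
    "cmpd_3": {"h_bond_donors": 1, "solubility": 0.2, "bbb_score": 0.7},
}

# Each criterion turns a compound's properties into True/False; thresholds are assumed.
criteria = [
    ("one_h_bond_donor", lambda p: p["h_bond_donors"] == 1),
    ("soluble", lambda p: p["solubility"] >= 0.5),
    ("bbb_penetrant", lambda p: p["bbb_score"] >= 0.5),
]

def build_matrix(compounds, criteria):
    """Return row labels and a binary matrix: rows are compounds, columns are criteria."""
    names = list(compounds)
    matrix = [[int(test(compounds[name])) for _, test in criteria] for name in names]
    return names, matrix

names, matrix = build_matrix(compounds, criteria)
```

For continuous values, the thresholding step can be skipped and a score between 0 and 1 stored in each cell instead.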
FIG. 9 illustrates a method 900 for configuring the quantum annealer 130. The method 900 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments. In some embodiments, the compound analysis facility 126 can generate a function that identifies relationships between variables, where the variables relate to compound properties and relationships between them, such as relative priorities of different properties in a desirable or well-performing compound.
At step 902, the compound analysis facility 126 can select weighting values for the magnetic field and provide them to the quantum annealer 130 to configure the quantum annealer 130 to perform the analysis. At step 904, the compound analysis facility 126 may additionally receive input values regarding how each compound relates to a property of interest, such as whether the compound has the property or the compound’s status with respect to a rule for the property. In some embodiments, the values may be binary values, and the input values may be received as an array or matrix of values where each row corresponds to a compound and each column relates to a property. The facility 126 may also provide the values to the quantum annealer 130 for analysis, as part of configuring the annealer 130. The facility 126 may in some embodiments also trigger the analysis by the annealer 130, following configuration. In some embodiments, the values can be discrete or continuous values. For example, the values can be binary values or a set of a plurality of continuous values.
The weights selected in step 902 may, in some cases, indicate relationships between variables to be analyzed by the annealer(s), such as relationships between variables relating to one or more of the properties of interest. In some embodiments, the compound analysis facility 126 can configure the quantum annealer 130 with one or more weights or other values that affect operations of the quantum annealer 130 and thereby affect evaluation of the function.
In some embodiments, the compound analysis facility 126 can map the function to the bias strengths of the values by setting the variables and the strengths of the couplers in the quantum annealer 130. For example, the quantum annealer 130 can expect to process a minimization of an objective function, and the quantum annealer 130 can process values formatted based on QUBO. The quantum annealer 130 can execute a search modeled after the minimum energy of the Ising Hamiltonian energy function:
[Equation image imgf000058_0001: Ising Hamiltonian energy function]
where $s_i \in \{-1, 1\}$ are the spin values that are subject to local fields $h_i$ and to the nearest neighbor interactions with coupling strength $J_{ij}$. In some embodiments, the compound analysis facility 126 can generate values as a Boolean QUBO equivalent using the transform $s = 2x - 1$, where $x \in \{0, 1\}$ and $\mathbf{1}$ can be a vector of ones. The compound analysis facility 126 can form the QUBO expression:
[Equation image imgf000058_0002: QUBO expression]
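Because the two expressions above are available only as figures, the following is a sketch of one commonly used convention for the Ising energy and its Boolean equivalent. The sign convention, the restriction of the pair sum to $i < j$, and the symbols $a_i$, $b_{ij}$, and $c$ are assumptions made for illustration and may differ from the figures.

```latex
% Assumed Ising convention (signs and index ranges may differ from the figure)
E(s) = \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j, \qquad s_i \in \{-1, 1\}

% Substituting s_i = 2x_i - 1 with x_i \in \{0, 1\}:
%   h_i s_i          = 2 h_i x_i - h_i
%   J_{ij} s_i s_j   = J_{ij} (4 x_i x_j - 2 x_i - 2 x_j + 1)
E(x) = \sum_i a_i x_i + \sum_{i<j} b_{ij} x_i x_j + c

% with J_{ij} = J_{ji} understood in the sum over j \neq i
a_i = 2 h_i - 2 \sum_{j \neq i} J_{ij}, \qquad
b_{ij} = 4 J_{ij}, \qquad
c = \sum_{i<j} J_{ij} - \sum_i h_i
```

Under these assumptions, the constant $c$ shifts every energy by the same amount and so does not change which assignment is the minimum.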
In some embodiments, the compound analysis facility 126 can set the linear bias $a$ and the quadratic bias $b$ between variables to convert a scientific question into values based on a QUBO. The variables may be set such that, when the quantum annealer 130 identifies, using input binary values relating to properties for compounds, a compound that relates to a maximum, minimum, optimum, or other statistical value for the function with which the quantum computing hardware is configured, that compound may be the best compound with respect to the properties of interest or may otherwise satisfy one or more criteria with respect to those properties of interest.
In some embodiments, the values can be discrete or continuous values. For example, the discrete values can be binary values (e.g., 0 or 1), and the continuous values can be any number from 0 to 1, such as 0.8. Such values may, in some embodiments, be arranged in a matrix of values where each value in the matrix indicates whether a compound has a particular property or whether that property for the compound satisfies one or more criteria (e.g., how a value for a property compares to a threshold).
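As a minimal sketch of how such a matrix of binary values might be assembled, the example below applies one threshold rule per property to each compound. The property names, thresholds, and numeric values are hypothetical and chosen only for illustration; they are not taken from this disclosure.

```python
from typing import Callable, Dict, List

# Hypothetical rules: each maps a raw property value to a binary status
# (1 = the rule for that property is satisfied, 0 = it is not).
RULES: Dict[str, Callable[[float], int]] = {
    "molecular_weight": lambda value: int(value <= 500.0),
    "h_bond_donors": lambda value: int(value <= 5),
}


def binarize(compounds: List[Dict[str, float]]) -> List[List[int]]:
    """Return a matrix with one row per compound and one column per property."""
    columns = list(RULES)
    return [[RULES[name](compound[name]) for name in columns] for compound in compounds]


# Made-up property values, for illustration only.
library = [
    {"molecular_weight": 320.4, "h_bond_donors": 2},
    {"molecular_weight": 612.7, "h_bond_donors": 1},
    {"molecular_weight": 455.0, "h_bond_donors": 7},
]

matrix = binarize(library)
for row in matrix:
    print(row)
```

Each row here corresponds to a compound and each column to a property, matching the array layout described for the input values received at step 904.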
In some embodiments, each row in the matrix corresponds to a drug candidate and the value for that row indicates a value for a property of that drug candidate. For example, when designing a drug, it might be important to identify molecules that have one hydrogen bond donor. The solution landscape can be a list of all possible compounds and representation of how many hydrogen bond donors such compounds have:
[Table image imgf000059_0001]
Table 2: Possible solutions to the one hydrogen bond donor optimization problem. Each compound can have three fragments with the number of hydrogen bond donors labelled. F can be a fragment in the compound.
The compound analysis facility 126 can search for compounds that have a certain number of bonds. For example, the optimization objective can be compounds where only one fragment has a hydrogen bond donor. This can be expressed algebraically as:
$F_A + F_B + F_C = 1$
The minimization objective function can be:
$E(F_A, F_B, F_C) = (F_A + F_B + F_C - 1)^2$
When generating the values based on the QUBO, the compound analysis facility 126 can expand the above minimization function:
$E(F_A, F_B, F_C) = F_A^2 + F_A F_B + F_A F_C - F_A + F_B F_A + F_B^2 + F_B F_C - F_B + F_C F_A + F_C F_B + F_C^2 - F_C - F_A - F_B - F_C + 1$
The compound analysis facility 126 can simplify the above function:
$E(F_A, F_B, F_C) = F_A^2 + F_B^2 + F_C^2 + 2F_A F_B + 2F_A F_C + 2F_B F_C - 2F_A - 2F_B - 2F_C + 1$
Given that these variables $F_A$, $F_B$, $F_C$ are binary variables, the compound analysis facility 126 can simplify $F_A^2$, $F_B^2$, $F_C^2$ to $F_A$, $F_B$, $F_C$, which gives:
$E(F_A, F_B, F_C) = 2F_A F_B + 2F_A F_C + 2F_B F_C - F_A - F_B - F_C + 1$
The compound analysis facility 126 can include these variables into the QUBO:
[Equation image imgf000060_0001: QUBO form of the objective function $E(F_A, F_B, F_C)$]
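The expansion above can be checked numerically. The sketch below assumes that the QUBO for this example has a linear bias of -1 on each fragment variable, a quadratic bias of 2 on each pair, and a constant offset of 1, which is what the algebra gives when $F^2 = F$ for binary $F$. The function names `one_hot_qubo` and `qubo_energy` are hypothetical helpers, not part of any annealer API.

```python
from itertools import combinations, product
from typing import Dict, Sequence, Tuple


def one_hot_qubo(names: Sequence[str]) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int], int]:
    """Biases and offset of (sum(x) - 1)**2 for binary variables x."""
    # x**2 - 2*x collapses to -x because x**2 == x for x in {0, 1}.
    linear = {name: -1 for name in names}
    # Each cross term x_i * x_j appears twice in the square.
    quadratic = {pair: 2 for pair in combinations(names, 2)}
    offset = 1
    return linear, quadratic, offset


def qubo_energy(assignment, linear, quadratic, offset) -> int:
    """Evaluate offset + sum(a_i * x_i) + sum(b_ij * x_i * x_j)."""
    total = offset
    total += sum(bias * assignment[name] for name, bias in linear.items())
    total += sum(bias * assignment[u] * assignment[v] for (u, v), bias in quadratic.items())
    return total


names = ("FA", "FB", "FC")
linear, quadratic, offset = one_hot_qubo(names)

# Compare the QUBO form with the original squared objective on all 8 assignments.
for bits in product((0, 1), repeat=len(names)):
    assignment = dict(zip(names, bits))
    assert qubo_energy(assignment, linear, quadratic, offset) == (sum(bits) - 1) ** 2
```

Checking all eight assignments is feasible only because the example has three variables; it is meant as a sanity check on the algebra rather than as a model of how an annealer evaluates the function.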
Referring again to FIG. 9, it is these values that may be received by the compound analysis facility 126 as input in step 904, so as to provide them to the quantum annealer 130 for analysis. The quantum annealer 130 may receive as input values corresponding to each compound to be analyzed with respect to the function and for each property of interest to be analyzed for the compound. Accordingly, in some embodiments, the quantum annealer 130 may analyze binary values corresponding to compounds and properties of interest to identify, from among the compounds, one or more compounds that satisfy one or more criteria and so may have a desirable combination of properties of interest. The quantum annealer 130 may analyze the set of compounds in connection with the properties of interest to identify a subset of the set of compounds that meet one or more criteria. In some embodiments, the quantum annealer 130 may identify the subset by determining the compounds from among the set that satisfy one or more criteria regarding statistical values resulting from evaluation of a function, such as identifying compounds from among the set that correspond to a maximization, minimization, or other optimization of a function or other statistical operation with respect to a function with which the quantum annealer 130 is configured.
Accordingly, the compound analysis facility 126 can transmit the function and the set of compounds to the quantum annealer 130 for analysis. In some embodiments, the compound analysis facility 126 can transmit the function and the set of compounds to the quantum annealer 130 via the network 105.
At step 906, the compound analysis facility 126 can receive a ranked set of compounds of interest. In some embodiments, after the compound analysis facility 126 provides the function to the quantum annealer 130, the compound analysis facility 126 can receive the ranked set of compounds from the quantum annealer 130. In some embodiments, the ranked set can be based on the weighting values with which the quantum annealer 130 was configured. To generate the ranked set of compounds, the quantum annealer 130 can analyze the input (e.g., matrix values) received from the compound analysis facility 126 to identify a ranking of predicted performance of the compounds. The compound analysis facility 126 can receive the ranked set of compounds as a series of ‘energy states’ that are a proxy for how well each candidate solution to the function performs, where each candidate solution corresponds to a compound. For example, the compounds that satisfy the one or more criteria can have the lowest energy state.
The ranked set of compounds can include compounds that satisfy the one or more criteria. With such a process, a best, top five, top ten, top one hundred, or other top N drug candidates may be received by the compound analysis facility 126 from the quantum annealer 130. For example, the quantum annealer 130 can identify drug candidates with respect to the properties and identify a number N of the drug candidates that have an overall best performance with respect to the properties, based on evaluation of a function with which the quantum annealer 130 is configured for determining an optimal or otherwise desirable combination of properties. In the drug design example of seeking to identify compounds with one hydrogen bond donor, the compounds that satisfy the one or more criteria for the presence of only one hydrogen bond donor can have the lowest energy state, while the compounds that violate the criteria can have higher energy states:
[Table image imgf000061_0001]
Table 3: Candidate solutions to the hydrogen bond donor optimization problem with their respective ‘energy’ states. F is a fragment in the compound.
At step 908, the compound analysis facility 126 can select compounds from the ranked list. In some embodiments, the compound analysis facility 126 can select compounds that satisfy the one or more criteria. For example, the compound analysis facility 126 can select compound B, compound C, and compound E because they have an ‘energy state’ of 0 and thus satisfy the criteria of having one hydrogen bond donor.
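The ranking and selection of steps 906 and 908 can be illustrated with a classical brute-force stand-in for the annealer. The sketch assumes that compounds A through H enumerate the three fragment indicators in binary counting order; Table 3 is available only as an image, so this labelling is an assumption, although it agrees with the statement that compounds B, C, and E have an energy state of 0.

```python
from itertools import product
from string import ascii_uppercase


def energy(fa: int, fb: int, fc: int) -> int:
    """Objective from the example: zero only when exactly one fragment is a donor."""
    return (fa + fb + fc - 1) ** 2


# Assumed labelling: A = (0, 0, 0), B = (0, 0, 1), ..., H = (1, 1, 1).
candidates = dict(zip(ascii_uppercase, product((0, 1), repeat=3)))

# Rank by energy state, lowest first; ties are broken by label for a stable order.
ranked = sorted(candidates, key=lambda label: (energy(*candidates[label]), label))

# Select the compounds whose energy state is 0, i.e. that satisfy the criterion.
selected = [label for label in ranked if energy(*candidates[label]) == 0]

for label in ranked:
    print(label, candidates[label], energy(*candidates[label]))
print("selected:", selected)
```

Exhaustive enumeration scales as $2^n$ in the number of binary variables, so it is practical only for toy cases such as this one.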
At step 910, the compound analysis facility 126 can generate an output of selected compounds. In some embodiments, the compound analysis facility 126 can transmit the output of selected compounds to the computing device 110 for display in the client interface 112 to the client 115.
FIG. 10A illustrates a method 1000 for refining the outputted compounds. The method 1000 may be performed by the compound analysis facility 126 executed by the computing device 120, in some embodiments. The compound analysis facility 126 can refine the compounds received after the analysis by the quantum annealer 130 as discussed in reference to FIG. 5A and 5B. For example, as shown in FIG. 10B, one or more criteria can be identified for the compounds to be analyzed against. As shown in FIG. 10C, the quantum annealer 130 can analyze the compounds against the one or more criteria and provide a list of compounds to the compound analysis facility 126.
Referring back to FIG. 10A, at step 1002, the compound analysis facility 126 can receive selections of compounds to refine from the list of compounds received from the quantum annealer 130. After the compounds have been ranked by the quantum annealer 130, one or more of the compounds can be selected for additional testing (e.g., fine tuning). For example, the number of selected compounds can be based on the scale or cost of the testing to be performed. In some embodiments, the selections can be received from the client 115 via the client interface 112 executing on the computing device 110.
The compounds may then be synthesized and tested, or tested in silico using other techniques, to further identify a smaller set of compounds that may be candidates for use in a particular context, for further experimentation, or other purposes. For example, the compounds can be identified drug candidates that can then be analyzed using other techniques to determine or confirm properties in the list or determine or confirm the performance of drug candidates in the subset. In some embodiments, the compounds can be used for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules. In some embodiments, the compounds can be used in battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
At step 1004, the compound analysis facility 126 can generate a refined set of compounds. In some embodiments, as shown in FIG. 10D, the compound analysis facility 126 can apply computational techniques, such as using machine learning or other artificial intelligence techniques, or laboratory work that involves synthesizing and testing the drug candidates. In some cases, such techniques may be assisted with rule-based algorithms, randomized algorithms, brute force algorithms, or any other computerized process.
Referring back to FIG. 10A, in some embodiments, the compound analysis facility 126 can receive test results, simulations, or measurements, or any other information about the selected compounds. The compound analysis facility 126 can modify the selected list of compounds based on the test results, simulations, or measurements, or any other information. Using such a process, one or more compounds may be identified that may advantageously perform a function. For example, the compounds can be advantageous or optimal for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, or drug delivery molecules. In some embodiments, the compounds can be advantageous or optimal as chemical molecules. For example, the compounds can be advantageous or optimal for battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides. In some embodiments, the compounds can be optimal for therapies. For example, when optimizing biotherapeutics, features that describe the likelihood of success of those biological molecules can be extracted and optimized against. The features can map the relationship between the sequence/tertiary structure of DNA/RNA and peptide-based molecules to their performance in the clinic. The performance can be based on both the ability of the molecule to undertake its manipulation of a biological network through its mechanism of action and also its ability to perform well when taken by patients (e.g., non-toxic, orally bio-available, optimal clearance).
At step 1006, the compound analysis facility 126 can generate an output of the refined set of compounds. Based on the additional experimentation, the compound analysis facility 126 can identify the refined compounds from the set of compounds. In some embodiments, the compound analysis facility 126 can transmit the refined compounds to the computing device 110 for display in the client interface 112 to the client 115. For example, as shown in FIG. 10E, the compound analysis facility 126 can identify and output a lead compound. For example, the lead compound can be a compound for antibody development, antisense oligonucleotides, mRNA vaccines, peptide drugs, PROTACs, siRNA, drug delivery molecules, battery development, petrochemical industry, biodegradable plastics, veterinary medicine, OLED, colorants, dyes, paints, agriculture, or pesticides.
In another embodiment, there is provided a method comprising analyzing, using at least one quantum computer, information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, synthesizing at least a portion of the one or more compounds of the subset to generate at least one synthesized compound, and testing the at least one synthesized compound.
In a further embodiment, there is provided a method comprising analyzing, using at least one quantum computer, information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, analyzing the subset of the one or more compounds using at least one trained machine learning engine to determine one or more properties of the one or more compounds and/or analyze predicted performance of each of the one or more compounds with respect to a function, and outputting from the analyzing using the at least one trained machine learning engine a ranked list of one or more candidate compounds for performing the function.
In another embodiment, there is provided a method comprising triggering analysis by at least one quantum computer of information regarding properties of compounds of interest to identify, from among the compounds of interest, a subset of one or more compounds that analysis indicates satisfy one or more criteria, receiving, as a result of the analysis, an identification of the one or more compounds of the subset, and outputting the identification of the one or more compounds as a result of the analysis.
In a further embodiment, there is provided a method comprising receiving a request for at least one quantum computer to analyze a library of compounds in connection with one or more criteria, the request comprising input characterizing the library of compounds to be analyzed by the at least one quantum computer, triggering analysis by at least one quantum computer of information regarding properties of the library of compounds characterized by the input to identify, from among the compounds of the library, a subset of one or more compounds that analysis indicates satisfy the one or more criteria, receiving, as a result of the analysis, an identification of the one or more compounds of the subset, and outputting the identification of the one or more compounds as a result of the analysis.
In another embodiment, there is provided a method comprising receiving a request for at least one quantum computer to analyze a library of compounds of interest, the request comprising input characterizing the library of compounds to be analyzed by the at least one quantum computer, determining, for each compound in the library of compounds, a value for the compound with respect to each property of at least one property of interest, to generate a set of property values for compounds of the library of compounds of interest, triggering the at least one quantum computer to analyze the set of property values for the compounds of the library, receiving from the at least one quantum computer an identification of one or more compounds of a subset of the library that analysis by the at least one quantum computer indicates satisfy one or more criteria, and outputting information regarding the one or more compounds of the subset as a result of the analysis requested in the request.
In a further embodiment, there is provided a method comprising receiving a request for at least one quantum computer to analyze a library of compounds of interest, the request comprising first input characterizing the library of compounds to be analyzed by the at least one quantum computer and second input identifying a set of properties of interest, identifying, for each property of the set of properties of interest, a rule reflecting a binary status of a compound with respect to the property, and determining, for each compound in the library of compounds, a binary value for the compound with respect to each property of the set of properties of interest, to generate a set of binary property values for compounds of the library of compounds of interest. The method further comprises triggering the at least one quantum computer to analyze the set of binary property values for the compounds of the library in connection with an objective function with which the at least one quantum computer is configured, to determine a compound for which corresponding binary property values generate a minimum value for the objective function, receiving from the at least one quantum computer an identification of the compound, and outputting information regarding the compound as a result of the analysis requested in the request.
In another embodiment, there is provided a method comprising receiving a request for at least one computer to analyze a library of compounds of interest, the request comprising first input characterizing the library of compounds to be analyzed by the at least one computer and second input identifying a set of properties of interest, identifying, for each property of the set of properties of interest, a rule reflecting a binary status of a compound with respect to the property, and determining, for each compound in the library of compounds, a binary value for the compound with respect to each property of the set of properties of interest, to generate a set of binary property values for compounds of the library of compounds of interest. The method further comprises triggering the at least one computer to analyze the set of binary property values for the compounds of the library in connection with an objective function with which the at least one computer is configured, to determine a compound for which corresponding binary property values generate a minimum value for the objective function, receiving from the at least one computer an identification of the compound, and outputting information regarding the compound as a result of the analysis requested in the request.
Example Computer Implementations
Techniques operating according to the principles described herein may be implemented in any suitable manner. Included in the discussion above are a series of flow charts showing the steps and acts of various processes for property analysis for compounds using quantum computing. The processing and decision blocks of the flow charts above represent steps and acts that may be included in algorithms that carry out these various processes. Algorithms derived from these processes may be implemented as software integrated with and directing the operation of one or more single- or multi-purpose processors, may be implemented as functionally equivalent circuits such as a Digital Signal Processing (DSP) circuit or an Application-Specific Integrated Circuit (ASIC), or may be implemented in any other suitable manner. It should be appreciated that the flow charts included herein do not depict the syntax or operation of any particular circuit or of any particular programming language or type of programming language. Rather, the flow charts illustrate the functional information one skilled in the art may use to fabricate circuits or to implement computer software algorithms to perform the processing of a particular apparatus carrying out the types of techniques described herein. It should also be appreciated that, unless otherwise indicated herein, the particular sequence of steps and/or acts described in each flow chart is merely illustrative of the algorithms that may be implemented and can be varied in implementations and embodiments of the principles described herein.
Accordingly, in some embodiments, the techniques described herein may be embodied in computer-executable instructions implemented as software, including as application software, system software, firmware, middleware, embedded code, or any other suitable type of computer code. Such computer-executable instructions may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
When techniques described herein are embodied as computer-executable instructions, these computer-executable instructions may be implemented in any suitable manner, including as a number of functional facilities, each providing one or more operations to complete execution of algorithms operating according to these techniques. A “functional facility,” however instantiated, is a structural component of a computer system that, when integrated with and executed by one or more computers, causes the one or more computers to perform a specific operational role. A functional facility may be a portion of or an entire software element. For example, a functional facility may be implemented as a function of a process, or as a discrete process, or as any other suitable unit of processing. If techniques described herein are implemented as multiple functional facilities, each functional facility may be implemented in its own way; all need not be implemented the same way. Additionally, these functional facilities may be executed in parallel and/or serially, as appropriate, and may pass information between one another using a shared memory on the computer(s) on which they are executing, using a message passing protocol, or in any other suitable way.
Generally, functional facilities include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the functional facilities may be combined or distributed as desired in the systems in which they operate. In some implementations, one or more functional facilities carrying out techniques herein may together form a complete software package. These functional facilities may, in alternative embodiments, be adapted to interact with other, unrelated functional facilities and/or processes, to implement a software program application.
Some exemplary functional facilities have been described herein for carrying out one or more tasks. It should be appreciated, though, that the functional facilities and division of tasks described is merely illustrative of the type of functional facilities that may implement the exemplary techniques described herein, and that embodiments are not limited to being implemented in any specific number, division, or type of functional facilities. In some implementations, all functionalities may be implemented in a single functional facility. It should also be appreciated that, in some implementations, some of the functional facilities described herein may be implemented together with or separately from others (i.e., as a single unit or separate units), or some of these functional facilities may not be implemented.
Computer-executable instructions implementing the techniques described herein (when implemented as one or more functional facilities or in any other manner) may, in some embodiments, be encoded on one or more computer-readable media to provide functionality to the media. Computer-readable media include magnetic media such as a hard disk drive, optical media such as a Compact Disk (CD) or a Digital Versatile Disk (DVD), a persistent or non-persistent solid-state memory (e.g., Flash memory, Magnetic RAM, etc.), or any other suitable storage media. Such a computer-readable medium may be implemented in any suitable manner, including as computer-readable storage media 1106 of FIG. 11 described below (i.e., as a portion of a computing device 1100) or as a stand-alone, separate storage medium. As used herein, “computer-readable media” (also called “computer-readable storage media”) refers to tangible storage media. Tangible storage media are non-transitory and have at least one physical, structural component. In a “computer-readable medium,” as used herein, at least one physical, structural component has at least one physical property that may be altered in some way during a process of creating the medium with embedded information, a process of recording information thereon, or any other process of encoding the medium with information. For example, a magnetization state of a portion of a physical structure of a computer-readable medium may be altered during a recording process.
In some, but not all, implementations in which the techniques may be embodied as computer-executable instructions, these instructions may be executed on one or more suitable computing device(s) operating in any suitable computer system, including the exemplary computer system of FIG. 1A, or one or more computing devices (or one or more processors of one or more computing devices) may be programmed to execute the computer-executable instructions. A computing device or processor may be programmed to execute instructions when the instructions are stored in a manner accessible to the computing device or processor, such as in a data store (e.g., an on-chip cache or instruction register, a computer-readable storage medium accessible via a bus, a computer-readable storage medium accessible via one or more networks and accessible by the device/processor, etc.). Functional facilities comprising these computer-executable instructions may be integrated with and direct the operation of a single multi-purpose programmable digital computing device, a coordinated system of two or more multi-purpose computing devices sharing processing power and jointly carrying out the techniques described herein, a single computing device or coordinated system of computing devices (co-located or geographically distributed) dedicated to executing the techniques described herein, one or more Field-Programmable Gate Arrays (FPGAs) for carrying out the techniques described herein, or any other suitable system.
FIG. 11 illustrates one exemplary implementation of a computing device in the form of a computing device 1100 that may be used in a system implementing techniques described herein, although others are possible. It should be appreciated that FIG. 11 is intended neither to be a depiction of necessary components for a computing device to execute a compound representation facility 122, a compound model facility 124, and/or a compound analysis facility 126 in accordance with the principles described herein, nor a comprehensive depiction.
Computing device 1100 may comprise at least one processor 1102, a network adapter 1104, and computer-readable storage media 1106. Computing device 1100 may be, for example, a desktop or laptop personal computer, a personal digital assistant (PDA), a smart mobile phone, a server, a wireless access point or other networking element, or any other suitable computing device. Network adapter 1104 may be any suitable hardware and/or software to enable the computing device 1100 to communicate wired and/or wirelessly with any other suitable computing device over any suitable computing network. The computing network may include wireless access points, switches, routers, gateways, and/or other networking equipment as well as any suitable wired and/or wireless communication medium or media for exchanging data between two or more computers, including the Internet. Computer-readable storage media 1106 may be adapted to store data to be processed and/or instructions to be executed by processor 1102. Processor 1102 enables processing of data and execution of instructions. The data and instructions may be stored on the computer-readable storage media 1106.
The data and instructions stored on computer-readable storage media 1106 may comprise computer-executable instructions implementing techniques which operate according to the principles described herein. In the example of FIG. 11, computer-readable storage media 1106 stores computer-executable instructions implementing various facilities and storing various information as described above. Computer-readable storage media 1106 may store a compound representation facility 122, a compound model facility 124, and/or a compound analysis facility 126.
While not illustrated in FIG. 11, a computing device may additionally have one or more components and peripherals, including input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computing device may receive input information through speech recognition or in other audible format.
Embodiments have been described where the techniques are implemented in circuitry and/or computer-executable instructions. It should be appreciated that some embodiments may be in the form of a method, of which at least one example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Various aspects of the embodiments described above may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
Also, the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any embodiment, implementation, process, feature, etc. described herein as exemplary should therefore be understood to be an illustrative example and should not be understood to be a preferred or advantageous example unless otherwise indicated.
Having thus described several aspects of at least one embodiment, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure and are intended to be within the spirit and scope of the principles described herein. Accordingly, the foregoing description and drawings are by way of example only.

Claims

What is claimed is:
1. A method comprising: creating, using a first model, a second model for predicting information regarding a property of compounds input to the second model, wherein the first model was trained using compound information to generate at least one other output different from the information regarding the property, wherein creating the second model comprises: editing the first model to generate the second model; and training the second model using training data for the property.
2. The method of claim 1, wherein: the first model comprises a first neural network; and editing the first model to generate the second model comprises adding at least one layer to, removing at least one layer from, and/or adjusting at least one layer of the first neural network to generate a second neural network.
3. The method of claim 2, wherein adjusting at least one layer of the first neural network comprises adjusting values of one or more parameters of the at least one layer of the first neural network.
4. The method of claim 1, wherein: the first model comprises a first neural network; editing the first model to generate the second model comprises adding a classifier to the first neural network; and training the second model comprises training the classifier using the training data for the property.
5. The method of claim 4, wherein the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
6. The method of claim 5, further comprising generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds: generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
7. The method of claim 4, wherein the first neural network trained using compound information is trained to identify compounds that comply with at least one chemical rule.
8. The method of claim 1, further comprising: receiving a request to analyze a library of compounds of interest, the request comprising input characterizing the library of compounds to be analyzed; determining, using the second model, for each compound in the library of compounds, a value for the compound with respect to the property, to generate a set of values of the property for compounds of the library of compounds; and outputting information regarding the set of values of the property for the compounds of the library of compounds.
9. A method comprising: creating, using a first model, a second model for predicting information regarding a functional property of compounds input to the second model, wherein the first model was trained using a first amount of compound information to identify compounds that comply with at least one rule of physics and/or chemistry regarding compounds, wherein creating the second model comprises: editing the first model to generate the second model; and training the second model using training data for the property, the training data being a second amount of training data that is less than the first amount of compound information.
10. The method of claim 9, wherein: the first model comprises a first neural network; editing the first model to generate the second model comprises adding a classifier to the first neural network; and training the second model comprises training the classifier using the training data for the property.
11. The method of claim 10, wherein the training data for the property include digital representations of a plurality of compounds and property data indicating whether each compound in the plurality of compounds has the property.
12. The method of claim 11, wherein a number of compounds in the plurality of compounds of the training data used to train the second model is less than a number of compounds in the compound information used to train the first model.
13. The method of claim 10, further comprising generating the digital representations of the plurality of compounds, wherein generating the digital representations of the plurality of compounds includes, for each respective compound in the plurality of compounds: generating the digital representation of the respective compound using an identification of a plurality of atoms and/or molecules of the respective compound, information regarding interconnections of the plurality of atoms and/or molecules of the respective compound, and information regarding distances between the plurality of atoms and/or molecules of the compound.
14. A method comprising: generating a digital representation of a compound, the generating comprising: receiving an identification of a plurality of atoms and/or molecules of a compound; receiving information regarding interconnections of the plurality of atoms and/or molecules of the compound; receiving information regarding distances between the plurality of atoms and/or molecules of the compound; and generating the digital representation of the compound using the identification of the plurality of atoms and/or molecules, the information regarding the interconnections, and the information regarding the distances.
15. The method of claim 14, wherein receiving the information regarding the distances comprises receiving information regarding a three-dimensional (3D) structure and/or arrangement of the plurality of atoms and/or molecules of the compound.
16. The method of claim 14, wherein generating the digital representation of the compound comprises applying at least one transformer to the identification of the plurality of atoms and/or molecules of the compound.
17. The method of claim 16, wherein the identification of the plurality of atoms and/or molecules of the compound comprises a graph representation of the compound.
18. The method of claim 17, further comprising generating the graph representation of the compound, wherein generating the graph representation of the compound includes: encoding the plurality of atoms and/or molecules of the compound as a plurality of nodes in the graph representation; encoding the interconnections of the plurality of atoms and/or molecules of the compound as a plurality of edges in the graph representation; and iteratively traversing nodes in the plurality of nodes along edges in the plurality of edges to update the graph representation.
19. An apparatus comprising: at least one processor; and at least one computer-readable storage medium having encoded thereon executable instructions that, when executed by the at least one processor, cause the at least one processor to carry out the method of any one or more of claims 1-18.
20. At least one computer-readable storage medium encoded with computer-executable instructions that, when executed by a computer, cause the computer to carry out the method of any one or more of claims 1-18.
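The representation steps recited in claims 14 and 18 (atoms as nodes, interconnections as edges, distances as additional input, and iterative traversal along edges to update the representation) can be illustrated with a minimal Python sketch. This is not the applicant's implementation. The names `CompoundGraph`, `message_passing`, and `represent`, the one-hot node features, the inverse-distance weighting, and the sum pooling are all assumptions chosen for illustration, and a simple neighbour-sum update stands in for the transformer mentioned in claim 16.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class CompoundGraph:
    """Hypothetical container for the three inputs named in claim 14."""
    elements: list[str]                       # identification of the atoms (nodes)
    bonds: list[tuple[int, int]]              # interconnections (undirected edges)
    coords: list[tuple[float, float, float]]  # 3D positions, used only for distances


def pairwise_distances(coords):
    """Euclidean distance between every pair of atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def initial_features(elements, vocabulary):
    """One-hot encode each atom against a fixed element vocabulary (an assumption)."""
    return [[1.0 if e == v else 0.0 for v in vocabulary] for e in elements]


def message_passing(features, bonds, distances, rounds=2):
    """Iteratively update each node from its bonded neighbours.

    Each neighbour's features are weighted by 1 / (1 + distance), an
    illustrative choice that lets closer atoms contribute more.
    """
    neighbours = {i: [] for i in range(len(features))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    for _ in range(rounds):
        updated = []
        for i, feat in enumerate(features):
            agg = list(feat)
            for j in neighbours[i]:
                w = 1.0 / (1.0 + distances[i][j])
                for k, value in enumerate(features[j]):
                    agg[k] += w * value
            updated.append(agg)
        features = updated
    return features


def represent(graph, vocabulary, rounds=2):
    """Return a fixed-length vector for the compound by sum-pooling node features."""
    distances = pairwise_distances(graph.coords)
    feats = message_passing(
        initial_features(graph.elements, vocabulary), graph.bonds, distances, rounds
    )
    return [sum(f[k] for f in feats) for k in range(len(vocabulary))]


if __name__ == "__main__":
    # Water, with approximate illustrative coordinates in angstroms.
    water = CompoundGraph(
        elements=["O", "H", "H"],
        bonds=[(0, 1), (0, 2)],
        coords=[(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)],
    )
    print(represent(water, vocabulary=["O", "H"]))
```

Because the node features are summed at the end, the resulting vector does not depend on the order in which atoms are listed, which is one reason a pooled graph encoding is a convenient input for a downstream property model of the kind recited in claims 1 to 13.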

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363442618P 2023-02-01 2023-02-01
US63/442,618 2023-02-01

Publications (2)

Publication Number Publication Date
WO2024161359A2 (en) 2024-08-08
WO2024161359A3 (en) 2024-10-03

Family

ID=90368037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2024/050953 WO2024161359A2 (en) 2023-02-01 2024-02-01 Compound representation and property analysis at scale

Country Status (1)

Country Link
WO (1) WO2024161359A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118824355A (en) * 2024-09-20 2024-10-22 北京望石智慧科技有限公司 Training methods for molecular prediction models and processing methods for protein pockets

Also Published As

Publication number Publication date
WO2024161359A3 (en) 2024-10-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24712940

Country of ref document: EP

Kind code of ref document: A2