CN110730822A

CN110730822A - Method for identifying compounds

Info

Publication number: CN110730822A
Application number: CN201880040438.9A
Authority: CN
Inventors: E.A.西格尔; L.薛; C.J.马尔赫恩; D.J.莫西亚
Original assignee: X-Chemical Co Ltd
Current assignee: X-Chemical Co Ltd
Priority date: 2017-04-18
Filing date: 2018-04-18
Publication date: 2020-01-24
Anticipated expiration: 2038-04-18
Also published as: WO2018195134A1; BR112019021786A2; CN110730822B; JP2020518898A; JP2023113620A; JP7277378B2; EP3612545A1; AU2023206117A1; AU2018256367A1; EP3612545A4; US20200143903A1; MA51864A; EA201992476A1

Abstract

The present disclosure provides virtual screening methods that utilize data sets from nucleotide-encoded libraries (e.g., DNA-encoded libraries). These methods allow for high-confidence prediction of binding interactions between candidate compounds and proteins of interest for the development of therapeutic agents.

Description

Method for identifying compounds

Background

Virtual screening methods can expand the screening options available for a given target and can increase the likelihood of successful optimization. Virtual screening can be a fast and inexpensive way to identify multiple scaffolds for use as starting points for optimization. The capability of virtual screening is generally limited by the size of the experimentally determined data set used, because it relies on comparison with known experimental data to generate virtual data. Therefore, there is a need for methods that combine robust computational methods with extremely large data sets to produce sufficient confidence in the computational predictions to replace traditional high-throughput screening methods.

Summary of The Invention

The present disclosure provides methods for identifying compounds that are useful as therapeutic agents and/or that can be used as starting points for optimization in the development of therapeutic agents. These methods combine computational methods for predicting binding between a compound and a protein with large data sets of experimental data obtained using nucleotide-encoded libraries (e.g., DNA-encoded libraries). The combination of data generated with nucleotide-encoded libraries and computational methods allows for high-confidence prediction of binding interactions between candidate compounds and proteins of interest.

Accordingly, in one aspect, the present disclosure provides a method comprising the steps of: (a) providing a plurality of binding interaction findings (e.g., at least 250,000 findings) for a target protein in a physical computing device having a representation of a set of candidate compounds (e.g., small molecule compounds), wherein at least 50% (e.g., at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%) of the plurality of binding interaction findings represent a binding interaction between the target protein and a compound (e.g., a member of a DNA-encoded library) comprising a nucleotide tag encoding the identity of the compound; (b) using the plurality of binding interaction findings to generate, using the computing device, an estimated binding interaction for each of the candidate compounds; and (c) outputting a list of candidate compounds that can be displayed and ranked by highest estimated binding interaction.
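As an illustration only, steps (a) through (c) can be sketched as follows. The sketch assumes that each compound is represented by the set of its "on" fingerprint bits, that each finding carries a numeric binding signal (for example, an enrichment value), and that the estimate is a Tanimoto-weighted average over the k most similar library compounds. None of these choices is specified by the disclosure, and all names and values are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # indices of bits set to 1 (assumed representation)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def estimate_binding(candidate: Fingerprint,
                     findings: List[Tuple[Fingerprint, float]],
                     k: int = 3) -> float:
    """Step (b): similarity-weighted mean signal of the k nearest library compounds."""
    scored = sorted(((tanimoto(candidate, fp), signal) for fp, signal in findings),
                    key=lambda pair: pair[0], reverse=True)[:k]
    total_weight = sum(sim for sim, _ in scored)
    if total_weight == 0.0:
        return 0.0  # nothing similar in the data set, so no evidence of binding
    return sum(sim * signal for sim, signal in scored) / total_weight


def rank_candidates(candidates: Dict[str, Fingerprint],
                    findings: List[Tuple[Fingerprint, float]],
                    k: int = 3) -> List[Tuple[str, float]]:
    """Step (c): list of candidates ranked by highest estimated binding interaction."""
    estimates = {name: estimate_binding(fp, findings, k)
                 for name, fp in candidates.items()}
    return sorted(estimates.items(), key=lambda item: item[1], reverse=True)


# Step (a): toy binding interaction findings (fingerprint, measured signal).
FINDINGS = [
    (frozenset({1, 2, 3, 4}), 9.0),
    (frozenset({1, 2, 3, 5}), 7.0),
    (frozenset({10, 11, 12}), 0.0),
    (frozenset({10, 11, 13}), 0.5),
]
CANDIDATES = {
    "cand_A": frozenset({1, 2, 3, 6}),
    "cand_B": frozenset({10, 11, 14}),
}
RANKED = rank_candidates(CANDIDATES, FINDINGS, k=2)
```

In this toy data set, the candidate that resembles the strongly binding library members is ranked first. A real data set of the size contemplated here would call for an indexed similarity search or a trained model rather than this exhaustive comparison.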

In some embodiments, the plurality of binding interaction findings comprises at least 250,000 (e.g., at least 500,000, at least one million, at least two million, at least five million, at least ten million, at least twenty-five million) binding interaction findings.

In some embodiments, at least 50% of the plurality of binding interaction findings are determined by contacting a plurality (e.g., at least 250,000, at least 500,000, at least one million, at least two million, at least five million, at least ten million) of compounds comprising a nucleotide tag encoding the identity of the compound with the target protein simultaneously (e.g., simultaneously in the same reaction vessel). For example, in some embodiments, at least 50% of the binding interaction findings for DNA-encoded library members used to generate the estimated binding interaction are determined in a single experiment in a single reaction vessel.

In some embodiments, the method further comprises providing one or more additional pluralities of binding interaction findings for one or more additional target proteins, wherein at least 50% of the one or more additional pluralities of binding interaction findings represent binding interactions between the additional target protein and compounds from the plurality of binding interaction findings with the target protein of step (a). In some embodiments, the method further comprises providing one or more additional pluralities of binding interaction findings of one or more negative control experiments, wherein at least 50% of the plurality of binding interaction findings represent a negative control for a compound from the plurality of binding interaction findings with the target protein of step (a). In some embodiments, the method further comprises providing one or more additional pluralities of binding interaction findings of one or more control experiments, wherein the plurality of binding interaction findings comprises binding interaction findings of a compound (e.g., a known inhibitor or natural ligand) having a known binding interaction with the target protein of step (a). In some embodiments, the method comprises generating a selectivity score by comparing the binding or estimated binding of the compound or candidate compound to the target protein with the binding or estimated binding of the compound or candidate compound to the one or more additional target proteins and/or a negative control. In some embodiments, the list of candidate compounds can be displayed and ranked by selectivity score. In some embodiments, the one or more additional target proteins comprise a mutant of the target protein.
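One possible form of such a selectivity score is sketched below. The formula is an assumption made for illustration (the disclosure does not prescribe one): the score is the log2 ratio of the signal against the target protein to the strongest signal observed against any additional target protein or negative control, with a pseudocount so that zero signals are handled.

```python
import math
from typing import Iterable


def selectivity_score(target_signal: float,
                      comparison_signals: Iterable[float],
                      pseudocount: float = 1.0) -> float:
    """log2 ratio of the target signal to the strongest comparison signal.

    comparison_signals holds signals for additional target proteins and/or
    negative controls; the pseudocount is a hypothetical smoothing choice.
    """
    strongest = max(comparison_signals, default=0.0)
    return math.log2((target_signal + pseudocount) / (strongest + pseudocount))


# Strong signal on the target, weak signal elsewhere: positive score.
SELECTIVE = selectivity_score(15.0, [1.0, 0.0])
# Equally strong signal on an additional target protein: score of zero.
PROMISCUOUS = selectivity_score(15.0, [15.0, 3.0])
```

A candidate list can then be sorted by this value in the same way as by estimated binding interaction.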

In some embodiments, the estimated binding interaction is generated using chemical structure comparison, e.g., using a molecular representation. Molecular representations include, but are not limited to, topological representations based on atoms, features, or functional groups and their connectivity (e.g., fingerprints, connection tables, molecular connectivity, and/or molecular graph representations), electrostatic representations (e.g., surface electrons), geometric representations (e.g., pharmacophores, pharmacophore fingerprints, shape-based fingerprints, and/or 3D molecular coordinates using atoms, features, or functional groups), and quantum chemical representations. In some embodiments, the estimated binding interaction is generated using a topological representation based on atoms, features, or functional groups and their connectivity (e.g., fingerprints, connection tables, molecular connectivity, and/or molecular graph representations). In some embodiments, the estimated binding interaction is generated using an electrostatic representation (e.g., surface electrons). In some embodiments, the estimated binding interaction is generated using a geometric representation (e.g., pharmacophore fingerprints, shape-based fingerprints, and/or 3D molecular coordinates using atoms, features, or functional groups). In some embodiments, the estimated binding interaction is generated using a quantum chemical representation. In some embodiments, the estimated binding interaction is generated using a chemical fingerprint.

Chemical fingerprints may be used to aggregate structural information and binding interaction data of compounds in order to identify structural patterns indicative of binding to a target protein. Thus, in some embodiments, the method further comprises (i) providing a plurality of chemical fingerprints (e.g., chemical fingerprints such as ECFP6, FCFP6, ECFP4, MACCS, or Morgan/circular fingerprints having different numbers of bits (e.g., 166, 512, 1024)) for a plurality of compounds; and (ii) utilizing the plurality of chemical fingerprints in the generation of the estimated binding interactions. In some embodiments, such as in a training set, the plurality of chemical fingerprints includes chemical fingerprints of one or more of the compounds comprising nucleotide tags encoding the identity of the compounds, where each such chemical fingerprint represents the structure of the compound without its nucleotide tag. In some embodiments, for example in a prediction set, the plurality of chemical fingerprints includes chemical fingerprints of one or more candidate compounds. In some embodiments, the chemical fingerprint is an ECFP6 fingerprint.
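The idea of a fixed-length bit-string fingerprint can be illustrated with a toy example. The sketch below is not ECFP, FCFP, MACCS, or any other named scheme: it simply hashes hypothetical structural feature labels into a bit vector of a chosen length, so that library members (described without their nucleotide tags) and candidate compounds share one representation. The feature labels and function name are assumptions.

```python
import hashlib
from typing import Iterable, List


def fold_features(features: Iterable[str], n_bits: int = 1024) -> List[int]:
    """Hash each structural feature label to one position in a fixed-length bit vector."""
    bits = [0] * n_bits
    for feature in features:
        # sha256 keeps the mapping stable across runs (built-in hash() is salted).
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % n_bits] = 1
    return bits


# Hypothetical feature labels for the small-molecule portion only (no nucleotide tag).
LIBRARY_MEMBER_FEATURES = ["triazine", "aryl-amine", "carboxamide"]
CANDIDATE_FEATURES = ["triazine", "aryl-amine", "sulfonamide"]

TRAINING_FP = fold_features(LIBRARY_MEMBER_FEATURES, n_bits=1024)    # training set entry
PREDICTION_FP = fold_features(CANDIDATE_FEATURES, n_bits=1024)       # prediction set entry
SHARED_BITS = sum(a & b for a, b in zip(TRAINING_FP, PREDICTION_FP))
```

In practice, a cheminformatics toolkit would derive the features from the chemical structure itself; the hashing-and-folding step shown here is only the final stage of such a scheme.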

In some embodiments, the method further comprises providing one or more property findings (e.g., molecular weight and/or clogP) for the set of candidate compounds. In some embodiments, the one or more property findings are used to generate an estimated binding interaction. In some embodiments, the list of candidate compounds is capable of being displayed and ranked by the one or more property findings.

In some embodiments, the method further comprises sending the list of candidate compounds over the internet or to a display device. In some embodiments, the physical computing devices are accessed and operated over the internet.

In some embodiments, the method further comprises generating a confidence score for each estimated binding interaction of a candidate compound, wherein the confidence score is generated using a chemical structure comparison (e.g., a principal component analysis) between the candidate compound and one or more compounds from the plurality of binding interaction findings for the target protein of step (a). For example, in some embodiments, the confidence score is generated by comparing the candidate compound to a chemical space defined by the compounds of the plurality of binding interaction findings from step (a), by determining the distance of the candidate compound from that chemical space as the Euclidean distance in the dimensions defined by the principal component analysis. In some embodiments, the list of candidate compounds can be displayed and ranked by the confidence score of the estimated binding interaction of the candidate compound.
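A minimal sketch of one way to compute such a confidence score is given below, using only the Python standard library. It assumes that compounds are described by numeric descriptor vectors, that the principal components are fitted on the library compounds, that "distance from the chemical space" means the Euclidean distance to the nearest library compound in principal component coordinates, and that the distance is mapped to a score as 1 / (1 + distance). These are illustrative assumptions, not details taken from the disclosure.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def _matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _power_iteration(cov: Matrix, iterations: int = 200) -> Vector:
    """Approximate the dominant eigenvector of a symmetric matrix."""
    d = len(cov)
    start = [float(i + 1) for i in range(d)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        w = _matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    return v


def fit_pca(rows: Matrix, n_components: int = 2) -> Tuple[Vector, Matrix]:
    """Return the column means and the leading principal components."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components: Matrix = []
    for _ in range(n_components):
        v = _power_iteration(cov)
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        components.append(v)
        # Deflate so that the next pass finds the next component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components


def project(row: Vector, mean: Vector, components: Matrix) -> Vector:
    centered = [x - m for x, m in zip(row, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


def confidence_score(candidate: Vector, library: Matrix,
                     mean: Vector, components: Matrix) -> float:
    """1 / (1 + Euclidean distance to the nearest library compound in PC space)."""
    cand = project(candidate, mean, components)
    nearest = min(math.dist(cand, project(row, mean, components)) for row in library)
    return 1.0 / (1.0 + nearest)


# Toy descriptor vectors standing in for library compounds.
LIBRARY = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
           [1.0, 1.0, 0.0], [2.0, 1.0, 0.0], [3.0, 2.0, 0.0]]
MEAN, COMPONENTS = fit_pca(LIBRARY, n_components=2)
NEAR = confidence_score([1.0, 1.0, 0.0], LIBRARY, MEAN, COMPONENTS)
FAR = confidence_score([10.0, 10.0, 0.0], LIBRARY, MEAN, COMPONENTS)
```

A candidate that coincides with a library compound receives the maximum score of 1, and the score falls toward 0 as the candidate moves away from the region of chemical space covered by the data set.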

In some embodiments, the method further comprises (d) synthesizing one or more of the candidate compounds from a list of candidate compounds.

In some embodiments, the method further comprises (e) contacting one or more synthetic candidate compounds with the target protein to determine one or more experimental binding interactions.

In one aspect, the present disclosure provides a computer-readable medium having stored thereon executable instructions for directing a physical computing device to implement a method comprising:

(a) providing a plurality of binding interaction findings for a target protein in a physical computing device, the physical computing device having representations of a set of candidate compounds,

wherein at least 90% of the plurality of binding interaction findings represent binding interactions between the target protein and a compound comprising a nucleotide tag encoding the identity of the compound;

(b) using the plurality of binding interaction findings to generate, using the computing device, an estimated binding interaction for the candidate compounds; and

(c) outputting a list of candidate compounds that can be displayed and ranked by highest estimated binding interaction.

In one aspect, the present disclosure provides a physical computing device having a representation of a set of candidate compounds and programmed with executable instructions to direct the device to perform a method comprising:

Definitions

As used herein, a "confidence score" refers to a calculation that indicates a confidence in an estimated binding interaction of a candidate compound based on the structural similarity between the candidate compound and one or more compounds in the dataset used to prepare the estimate.

The term "binding interaction" as used herein refers to an association (e.g., non-covalent or covalent) between two or more entities. "Direct" binding involves physical contact between entities or moieties; "indirect" binding involves physical interaction by way of physical contact with one or more intermediate entities. Binding between two or more entities can generally be assessed in any of a variety of contexts, including where interacting entities or moieties are studied in isolation or in the context of more complex systems (e.g., while covalently or otherwise associated with a carrier entity and/or in a biological system or cell).

The affinity of a molecule X for its partner Y can generally be represented by the dissociation constant (K_D). Affinity can be measured by conventional methods known in the art, including those described herein. The term "K_D" as used herein means the dissociation equilibrium constant for a particular compound-protein or complex-protein interaction. Generally, the compounds of the present invention bind to a presenter protein with a dissociation equilibrium constant (K_D) of less than about 10⁻⁶ M, e.g., less than about 10⁻⁷ M, 10⁻⁸ M, 10⁻⁹ M, or 10⁻¹⁰ M, or even lower, for example as determined by surface plasmon resonance (SPR) technology using the presenter protein as the analyte and the compound as the ligand. In some embodiments, the compounds of the present invention bind to a target protein (e.g., a eukaryotic target protein such as a mammalian target protein or a fungal target protein, or a prokaryotic target protein such as a bacterial target protein) with a dissociation equilibrium constant (K_D) of less than about 10⁻⁶ M, e.g., less than about 10⁻⁷ M, 10⁻⁸ M, 10⁻⁹ M, or 10⁻¹⁰ M, or even lower, for example as determined by surface plasmon resonance (SPR) technology using the target protein as the analyte and the compound as the ligand.
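For reference, for a simple 1:1 binding equilibrium between X and Y, the dissociation constant has the standard textbook form shown below, where the bracketed quantities are equilibrium concentrations and k_on and k_off are the association and dissociation rate constants. This expression is general background rather than language from the present disclosure; a smaller K_D corresponds to tighter binding.

```latex
X + Y \rightleftharpoons XY, \qquad
K_D = \frac{[X]\,[Y]}{[XY]} = \frac{k_{\mathrm{off}}}{k_{\mathrm{on}}}
```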

As used herein, a "binding interaction finding" refers to a binding interaction between a compound and a protein (e.g., a target protein), or the lack thereof, that has been experimentally determined, for example, by SPR. For example, in some embodiments, a binding interaction finding is a determination that a compound does not interact with a protein (e.g., a target protein).

The term "molecular representation" refers, for example, to a topological, electrostatic, geometric, or quantum chemical representation of a compound. Molecular representations include, for example, chemical fingerprints.

The term "electrostatic representation" refers to a type of molecular representation that includes information such as surface electrons.

As used herein, "estimated binding interaction" refers to a binding interaction that has been predicted using computational analysis. In some embodiments, the estimated binding interaction of the candidate compound with the target protein is generated by comparing the chemical structure of the candidate compound to the chemical structure of one or more compounds for which binding interaction with the target protein has been experimentally determined.

The term "chemical fingerprint" as used herein refers to a machine-readable representation of a compound, such as a bit string, i.e., a list of binary values (0 or 1), that characterizes the two- and/or three-dimensional structure of the molecule. Exemplary methods of generating chemical fingerprints are known in the art and include, but are not limited to, MACCS, Extended Connectivity Fingerprints (ECFP), Functional Class Fingerprints (FCFP), Morgan/circular fingerprints, and chemical hashed fingerprints.

The term "clogP" as used herein refers to the calculated partition coefficient of a molecule or portion of a molecule. The partition coefficient is the ratio of the concentrations of a compound in the two phases of a mixture of two immiscible phases (e.g., octanol and water) at equilibrium, and it measures the hydrophobicity or hydrophilicity of the compound. A variety of methods for determining clogP are available in the art. For example, in some embodiments, clogP can be determined using quantitative structure-property relationship algorithms known in the art (e.g., fragment-based prediction methods that predict the logP of a compound by summing the contributions of its non-overlapping molecular fragments). Several algorithms for calculating clogP are known in the art, including those used by molecular editing software such as CHEMDRAW Pro, Version 12.0.2.1092 (CambridgeSoft, Cambridge, MA) and MARVINSKETCH (ChemAxon, Budapest, Hungary).
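The two quantities in this definition can be illustrated with a short sketch: the measured logP as the base-10 logarithm of the concentration ratio, and a fragment-based clogP as a sum of per-fragment contributions. The fragment table below is entirely hypothetical and is not taken from any published scheme or from the software named above.

```python
import math
from typing import Dict, List

# Hypothetical fragment contributions, for illustration only.
FRAGMENT_CONTRIBUTIONS: Dict[str, float] = {
    "aromatic_ring": 1.5,
    "methyl": 0.5,
    "hydroxyl": -1.0,
    "amide": -1.5,
}


def log_p(conc_octanol: float, conc_water: float) -> float:
    """logP from equilibrium concentrations in the octanol and water phases."""
    return math.log10(conc_octanol / conc_water)


def clogp_from_fragments(fragments: List[str],
                         table: Dict[str, float] = FRAGMENT_CONTRIBUTIONS) -> float:
    """Fragment-based estimate: sum the contributions of non-overlapping fragments."""
    return sum(table[fragment] for fragment in fragments)


MEASURED = log_p(100.0, 1.0)   # compound is 100-fold more concentrated in octanol
ESTIMATED = clogp_from_fragments(["aromatic_ring", "methyl", "hydroxyl"])
```

Positive values indicate a preference for the octanol phase (more hydrophobic); negative values indicate a preference for water.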

The term "comparable" as used herein refers to two or more compounds, entities, situations, sets of conditions, etc., that may not be identical to one another but that are sufficiently similar to permit comparison between them, so that conclusions can reasonably be drawn based on the differences or similarities observed. In some embodiments, comparable sets of conditions, circumstances, individuals, or populations are characterized by a plurality of substantially identical features and one or a small number of varied features. One of ordinary skill in the art will understand, in context, what degree of identity is required in any given circumstance for two or more such compounds, entities, situations, sets of conditions, etc., to be considered comparable. For example, one of ordinary skill in the art will appreciate that sets of circumstances, individuals, or populations are comparable to one another when they are characterized by a sufficient number and type of substantially identical features to warrant a reasonable conclusion that differences in the results obtained or phenomena observed under or with the different sets of circumstances, individuals, or populations are caused by, or indicative of, the variation in those features that are varied.

Many of the methods described herein include a "determining" step. One of ordinary skill in the art will understand upon reading this specification that such "determining" can be accomplished using any of a variety of techniques available to those of skill in the art, including, for example, the specific techniques explicitly mentioned herein. In some embodiments, determining involves manipulation of a physical sample. In some embodiments, determining involves consideration and/or manipulation of data or information, for example, using a computer or other processing unit adapted to perform a relevant analysis. In some embodiments, determining comprises receiving relevant information and/or materials from a source. In some embodiments, determining comprises comparing one or more characteristics of a sample or entity to a comparable reference.

The term "geometric representation" refers to one type of molecular representation. The geometric representation may include information about, for example, pharmacophores, pharmacophore fingerprints, shape-based fingerprints, and/or 3D molecular coordinates using atoms, features, or functional groups.

The term "library" as used herein refers to a collection of 2, 5, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or more different molecules. In some embodiments, at least 10% (e.g., at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100%) of the compounds in the library are compounds that include a nucleotide tag that encodes their identity, such as DNA-encoded compounds.

The term "negative control" as used herein refers to an experiment for determining binding interactions in which the target protein is not present.

The term "polar surface area" refers to the sum of the surface areas of all polar atoms of a molecule or portion of a molecule, including their attached hydrogens. Polar surface area is determined computationally using a program such as CHEMDRAW Pro, Version 12.0.2.1092 (CambridgeSoft, Cambridge, MA).

The term "positive control" as used herein refers to an experiment in which the binding interaction is determined, wherein the binding affinity of a compound in contact with a target protein is known.

As used herein, a "property finding" refers to a calculated or experimentally determined property (e.g., clogP, polar surface area, molecular weight) of a particular compound.

The term "selective", when used in reference to a compound having activity, is understood by those skilled in the art to mean that the compound distinguishes between potential target entities or states. For example, in some embodiments, a compound is said to "selectively" bind to a target if it preferentially binds to the target in the presence of one or more competing candidate targets. In many embodiments, selective interactions depend on the presence of specific structural features (e.g., epitopes, clefts, binding sites) of the target entity. It should be understood that selectivity need not be absolute. In some embodiments, selectivity can be assessed relative to the selectivity of a binding agent for one or more other potential target entities (e.g., competitors). In some embodiments, selectivity is assessed relative to selectivity for a reference selective binding agent. In some embodiments, selectivity is assessed relative to selectivity for a reference non-selective binding agent. In some embodiments, the agent or entity detectably does not bind to a competing candidate target under conditions for binding to its target entity. In some embodiments, a binding agent binds its target entity with a higher on-rate, a lower off-rate, increased affinity, decreased dissociation, and/or increased stability compared to a competing candidate target.

As used herein, a "selectivity score" refers to a calculation of the specificity of a compound for a target protein. In some embodiments, the selectivity score can be calculated by comparing the binding of a compound to the target protein with the binding of the compound to another protein (e.g., a mutant of the target protein or an unrelated protein). In other embodiments, the selectivity score can be calculated by comparing the binding of the compound to the target protein with a negative control.

The term "small molecule" refers to a low molecular weight organic and/or inorganic compound. Typically, a "small molecule" is a molecule that is less than about 5 kilodaltons (kD) in size. In some embodiments, the small molecule is less than about 4 kD, 3 kD, about 2 kD, or about 1 kD. In some embodiments, the small molecule is less than about 800 daltons (D), about 600D, about 500D, about 400D, about 300D, about 200D, or about 100D. In some embodiments, the small molecule is less than about 2000g/mol, less than about 1500g/mol, less than about 1000g/mol, less than about 800g/mol, or less than about 500 g/mol. In some embodiments, the small molecule is not a polymer. In some embodiments, the small molecule does not include a polymeric moiety. In some embodiments, the small molecule is not a protein or polypeptide (e.g., is not an oligopeptide or peptide). In some embodiments, the small molecule is not a polynucleotide (e.g., is not an oligonucleotide). In some embodiments, the small molecule is not a polysaccharide. In some embodiments, the small molecule does not include a polysaccharide (e.g., is not a glycoprotein, proteoglycan, glycolipid, etc.). In some embodiments, the small molecule is not a lipid. In some embodiments, the small molecule is a modulatory compound. In some embodiments, the small molecule is biologically active. In some embodiments, the small molecule is detectable (e.g., comprises at least one detectable moiety). In some embodiments, the small molecule is a therapeutic agent.

One of ordinary skill in the art, upon reading this disclosure, will appreciate that certain small molecule compounds described herein can be provided and/or utilized in any of a variety of forms, such as salt forms, protected forms, prodrug forms, ester forms, isomeric forms (e.g., optical and/or structural isomers), isotopic forms, and the like. In some embodiments, reference to a particular compound may relate to a particular form of the compound. In some embodiments, reference to a particular compound may relate to any form of that compound. In some embodiments, when a compound is one that is present or found in nature, the compound may be provided and/or utilized in accordance with the present invention in a form that is different from the form in which it is present or found in nature. One of ordinary skill in the art will appreciate that a preparation of a compound that includes one or more individual forms at different levels, amounts, or ratios from a reference preparation or source (e.g., a natural source) of the compound can be considered to be different forms of the compound described herein. Thus, in some embodiments, for example, a preparation of a single stereoisomer of a compound can be considered to be a different form of the compound than the racemic mixture of the compound; a particular salt of a compound may be considered to be a different form from another salt form of the compound; a formulation comprising one conformer of a double bond ((Z) or (E)) may be considered to be in a different form to a formulation comprising the other conformer of a double bond ((E) or (Z)); preparations in which one or more atoms is an isotope other than that present in the reference preparation can be considered to be in a different form; and so on.

The term "specific binding," "specific for," or "specific to," as used herein, refers to an interaction between a binding agent and a target entity. As one of ordinary skill will appreciate, an interaction is considered "specific" if it is favored in the presence of alternative interactions, for example, binding with a K_D of less than 10 μM (e.g., less than 5 μM, less than 1 μM, less than 500 nM, less than 200 nM, less than 100 nM, less than 75 nM, less than 50 nM, less than 25 nM, less than 10 nM, or 10 nM to 100 nM, 50 nM to 250 nM, 100 nM to 500 nM, 250 nM to 1 μM, 500 nM to 2 μM, 1 μM to 5 μM). In many embodiments, the specific interaction depends on the presence of a particular structural feature (e.g., epitope, cleft, binding site) of the target entity. It is to be understood that specificity need not be absolute. In some embodiments, specificity can be assessed relative to the specificity of a binding agent for one or more other potential target entities (e.g., competitors). In some embodiments, specificity is assessed relative to the specificity of a reference specific binding agent. In some embodiments, specificity is assessed relative to the specificity of a reference non-specific binding agent.

The term "structural similarity" refers to the similarity in the two- or three-dimensional arrangement and/or orientation of atoms or moieties relative to each other (e.g., the distance and/or angle between an agent of interest and a reference agent) in one or more different compounds.

The term "substantially" refers to the qualitative condition of exhibiting a total or near-total extent or degree of a characteristic or property of interest. One of ordinary skill in the biological arts will understand that biological and chemical phenomena rarely, if ever, go to completion and/or proceed to completeness or achieve or avoid an absolute result. The term "substantially" is therefore used herein to capture the potential lack of completeness inherent in many biological and chemical phenomena.

The term "does not substantially bind" to a particular protein, as used herein, can be exhibited, for example, by a molecule or portion of a molecule having a K_D for the target of 10⁻⁴ M or greater, or 10⁻⁵ M or greater, or 10⁻⁶ M or greater, or 10⁻⁷ M or greater, or 10⁻⁸ M or greater, or 10⁻⁹ M or greater, or 10⁻¹⁰ M or greater, or 10⁻¹¹ M or greater, or 10⁻¹² M or greater, or a K_D in the range of 10⁻⁴ M to 10⁻¹² M, or 10⁻⁶ M to 10⁻¹⁰ M, or 10⁻⁷ M to 10⁻⁹ M.

The term "target protein" refers to a protein that binds to a small molecule. In some embodiments, the target protein is involved in a biological pathway associated with a disease, disorder, or condition. In some embodiments, the target protein is a naturally occurring protein; in some such embodiments, the target protein is naturally present in certain mammalian cells (e.g., a mammalian target protein), fungal cells (e.g., a fungal target protein), bacterial cells (e.g., a bacterial target protein), or plant cells (e.g., a plant target protein). In some embodiments, the target protein is characterized by a natural interaction with one or more naturally occurring presenter protein/naturally occurring small molecule complexes. In some embodiments, the target protein is characterized by natural interactions with a plurality of different naturally occurring presenter protein/naturally occurring small molecule complexes; in some such embodiments, some or all of the complexes utilize the same presenter protein (and different small molecules). The target protein may be naturally occurring, e.g., wild-type. Alternatively, the target protein may differ from the wild-type protein but still retain biological function, e.g., as an allelic variant, splice variant, or biologically active fragment. Exemplary mammalian target proteins are GTPases, GTPase activating proteins, guanine nucleotide exchange factors, heat shock proteins, ion channels, coiled-coil proteins, kinases, phosphatases, ubiquitin ligases, transcription factors, chromatin modifying/remodeling factors, proteins with classical protein-protein interaction domains and motifs, or any other protein involved in a biological pathway associated with a disease, disorder, or condition.

The term "topological representation" refers to a type of molecular representation that depends on the topology of the molecule and that indicates the position of individual atoms and the bonding connections between them. The topological representation can be based on atoms, features, or functional groups and their connectivity (e.g., fingerprints, connection tables, molecular connectivity, and/or molecular graph representations). The topological representation may be computed from the molecular graph representation.

The term "quantum chemical representation" refers to a type of molecular representation. A quantum chemical representation may include information about, for example, the energy or electronic properties of a compound.

Brief Description of Drawings

FIG. 1 is a graph illustrating the prediction of binding interactions as the number of libraries increases.

FIG. 2 is a graph illustrating multiple prediction trials over time due to improvements in the prediction model.

Detailed Description

The present disclosure provides virtual screening methods for identifying compounds that are useful as therapeutic agents and/or that can be used as starting points for optimization in the development of therapeutic agents. These methods utilize large data sets of experimental data obtained using DNA-encoded libraries to generate high-confidence predictions of binding interactions between candidate compounds and proteins of interest.

Encoded compounds

The invention features methods of using encoded chemical entities that include a chemical entity, one or more tags, and a headpiece operatively associating the chemical entity and the one or more tags. The chemical entities, headpieces, tags, linkages, and bifunctional spacers are further described below.

Chemical entities

The encoded compounds (e.g., small molecules) utilized in the methods of the invention can include one or more building blocks and optionally one or more scaffolds.

The scaffold S may be a single atom or a molecular scaffold. Exemplary single-atom scaffolds include a carbon, boron, nitrogen, or phosphorus atom, among others. Exemplary polyatomic scaffolds include cycloalkyl, cycloalkenyl, heterocycloalkyl, heterocycloalkenyl, aryl, or heteroaryl groups. Specific embodiments of heteroaryl scaffolds include a triazine, such as 1,3,5-triazine, 1,2,3-triazine, or 1,2,4-triazine; pyrimidine; pyrazine; pyridazine; furan; pyrrole; pyrroline; pyrrolidine; oxazole; pyrazole; isoxazole; pyran; pyridine; indole; indazole; or purine.

The scaffold S can be operably linked to the tag by any available method. In one example, S is a triazine directly attached to the headpiece. To obtain this exemplary scaffold, trichlorotriazine (i.e., a chlorinated triazine precursor having three chlorines) is reacted with a nucleophilic group of the headpiece. Using this approach, S has three chloride-bearing sites available for substitution, two of which are available diversity nodes and one of which is linked to the headpiece. Next, building block A_n is added to a diversity node of the scaffold, and tag A_n, which encodes building block A_n ("tag A_n"), is ligated to the headpiece, wherein these two steps can be performed in any order. Then, building block B_n is added to the remaining diversity node, and tag B_n, which encodes building block B_n, is ligated to the end of tag A_n. In another example, S is a triazine operably linked to a nucleophilic group (e.g., an amino group) of a tag, wherein trichlorotriazine is reacted with a nucleophilic group of a PEG, aliphatic, or aromatic linker of the tag. Building blocks and associated tags may then be added as described above.
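The two-step build described above (a building block and its tag added in each step, with the second tag joined to the end of the first) can be sketched in a few lines of Python. This is only an illustrative sketch: the headpiece and tag sequences, the building-block names, and the fixed tag lengths are hypothetical and are not taken from this disclosure.

```python
from itertools import product

# Hypothetical sequences and building-block names, chosen only for illustration.
HEADPIECE = "GCTAGC"
TAGS_A = {"A1": "AACCGG", "A2": "TTGGCC"}                   # tag A_n encodes building block A_n
TAGS_B = {"B1": "ACACAC", "B2": "GTGTGT", "B3": "CATCAT"}   # tag B_n encodes building block B_n


def enumerate_library(headpiece, tags_a, tags_b, scaffold="triazine"):
    """Return {encoding sequence: (scaffold, A_n, B_n)} for every A_n/B_n pair."""
    library = {}
    for (a_name, a_tag), (b_name, b_tag) in product(tags_a.items(), tags_b.items()):
        # Tag A_n is joined to the headpiece; tag B_n is joined to the end of tag A_n.
        library[headpiece + a_tag + b_tag] = (scaffold, a_name, b_name)
    return library


def decode(sequence, headpiece, tags_a, tags_b):
    """Recover (A_n, B_n) from an encoding sequence, assuming fixed-length tags."""
    if not sequence.startswith(headpiece):
        raise ValueError("sequence does not start with the headpiece")
    body = sequence[len(headpiece):]
    len_a = len(next(iter(tags_a.values())))
    a_lookup = {seq: name for name, seq in tags_a.items()}
    b_lookup = {seq: name for name, seq in tags_b.items()}
    return a_lookup[body[:len_a]], b_lookup[body[len_a:]]


library = enumerate_library(HEADPIECE, TAGS_A, TAGS_B)
```

With two A_n and three B_n building blocks the sketch yields six encoded members, and `decode` maps any member's sequence back to its synthetic history.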

In another example, S is a triazine operably linked to building block A_n. To obtain such a scaffold, building block A_n having two diversity nodes (e.g., an electrophilic group and a nucleophilic group, such as an Fmoc-amino acid) is reacted with a nucleophilic group of a linker (e.g., the terminal group of a PEG, aliphatic, or aromatic linker attached to the headpiece). Then, trichlorotriazine is reacted with a nucleophilic group of building block A_n. Using this approach, all three chlorine sites of S are used as diversity nodes for building blocks. As described herein, additional building blocks and tags may be added, and additional scaffolds S_n may be added.

Exemplary building blocks A_n include, for example, amino acids (e.g., alpha-, beta-, gamma-, delta-, and epsilon-amino acids, as well as derivatives of natural and unnatural amino acids), chemically reactive reactants (e.g., azide or alkyne chains) with an amine, or thiol reactants, or combinations thereof. The choice of building block A_n depends on, for example, the nature of the reactive group used in the linker, the nature of the scaffold moiety, and the solvent used for the chemical synthesis.

Exemplary building blocks B_n and C_n include any useful structural unit of a chemical entity, such as an optionally substituted aromatic group (e.g., optionally substituted phenyl or benzyl), an optionally substituted heterocyclic group (e.g., optionally substituted quinolinyl, isoquinolinyl, indolyl, isoindolyl, azaindolyl, benzimidazolyl, azabenzimidazolyl, benzisoxazolyl, pyridinyl, piperidyl, or pyrrolidinyl), an optionally substituted alkyl group (e.g., optionally substituted straight or branched C_1-6 alkyl or optionally substituted C_1-6 aminoalkyl), or an optionally substituted carbocyclic group (e.g., optionally substituted cyclopropyl, cyclohexyl, or cyclohexenyl). Particularly useful building blocks B_n and C_n include those having one or more reactive groups, such as an optionally substituted group (e.g., any described herein) having one or more substituents that are reactive groups or that can be chemically modified to form reactive groups. Exemplary reactive groups include an amine (-NR₂, wherein each R is independently H or optionally substituted C_1-6 alkyl), hydroxy, alkoxy (-OR, wherein R is optionally substituted C_1-6 alkyl, such as methoxy), carboxyl (-COOH), amide, or chemically reactive substituents. For example, a restriction site can be introduced into tag B_n or C_n, wherein the complex can be identified by performing PCR and restriction digestion with one of the corresponding restriction enzymes.

Headpiece

In an encoded chemical entity, the headpiece operably links each chemical entity to its encoding oligonucleotide tag. Generally, the headpiece is a starting oligonucleotide having at least two functional groups that can be further derivatized, wherein a first functional group operably links the chemical entity (or a component thereof) to the headpiece and a second functional group operably links one or more tags to the headpiece. A bifunctional spacer may optionally be used as a spacing moiety between the headpiece and the chemical entity.

The functional groups of the headpiece can be used to form a covalent bond with a component of the chemical entity and another covalent bond with a tag. The component may be any part of the small molecule, such as a scaffold having diversity nodes or a building block. Alternatively, the headpiece can be derivatized to provide a spacer (e.g., a spacing moiety that separates the headpiece from the small molecule to be formed in the library) that terminates in a functional group (e.g., a hydroxyl, amine, carboxyl, thiol, alkynyl, azido, or phosphate group) that is used to form a covalent bond with a component of the chemical entity. The spacer may be attached to the 5'-end, to one of the internal positions, or to the 3'-end of the headpiece. When the spacer is attached to one of the internal positions, the spacer can be operably linked to a derivatized base (e.g., the C5 position of uridine) or placed internally within the oligonucleotide using standard techniques known in the art. Exemplary spacers are described herein.

The headpiece can have any useful configuration. The headpiece may be, for example, 1 to 100 nucleotides in length, preferably 5 to 20 nucleotides in length, and most preferably 5 to 15 nucleotides in length. As described herein, the headpiece can be single-stranded or double-stranded, and can be composed of natural or modified nucleotides. For example, a chemical moiety is operably linked to the 3'-terminus or the 5'-terminus of the headpiece. In particular embodiments, the headpiece includes a hairpin structure formed by complementary bases within the sequence. For example, a chemical moiety may be operably linked to an internal site, the 3'-terminus, or the 5'-terminus of the headpiece.

Generally, the headpiece includes a non-self-complementary sequence on the 5'- or 3'-end that allows an oligonucleotide tag to be attached by polymerization, enzymatic ligation, or chemical reaction. The headpiece may allow for ligation of oligonucleotide tags and for optional purification and phosphorylation steps. After the addition of the last tag, an additional adaptor sequence may be added to the 5'-end of the last tag. Exemplary adaptor sequences include a primer binding sequence or a sequence bearing a label (e.g., biotin). In cases where many building blocks and corresponding tags are used (e.g., 100 tags), a split-and-mix strategy can be employed during the oligonucleotide synthesis step to form the required number of tags. Such split-and-mix strategies for DNA synthesis are known in the art. The resulting library members may be amplified by PCR following selection of binding entities against the target of interest.

The headpiece or complex may optionally include one or more primer binding sequences. For example, the headpiece has a sequence in the loop region of a hairpin that serves as a primer binding region for amplification, where the primer binding region has a higher melting temperature for its complementary primer (e.g., which may include a flanking identifier region) than the sequence in the headpiece. In other embodiments, the complex includes two primer binding sequences on either side of one or more tags that encode one or more building blocks (e.g., to enable a PCR reaction). Alternatively, the headpiece may contain one primer binding sequence on the 5'- or 3'-end. In other embodiments, the headpiece is a hairpin, and the loop region forms a primer binding site, or the primer binding site is introduced by hybridization of an oligonucleotide to the headpiece on the 3' side of the loop. A primer oligonucleotide containing a region homologous to the 3'-end of the headpiece and carrying a primer binding region on its 5'-end (e.g., to enable a PCR reaction) may be hybridized to the headpiece and may contain a tag that encodes a building block or the addition of a building block. The primer oligonucleotide may contain additional information, such as a region of randomized nucleotides, e.g., 2 to 16 nucleotides in length, which is included for bioinformatic analysis.

The headpiece may optionally include a hairpin structure, where such a structure can be achieved by any useful method. For example, the headpiece can include complementary bases that form intermolecular base-pairing partners, e.g., by Watson-Crick base pairing (e.g., adenine-thymine and guanine-cytosine) and/or by wobble base pairing (e.g., guanine-uracil, inosine-adenine, and inosine-cytosine). In another example, the headpiece may include modified or substituted nucleotides that can form higher-affinity duplexes than unmodified nucleotides, such modified or substituted nucleotides being known in the art. In another example, the headpiece includes one or more bases that are cross-linked to form the hairpin structure. For example, bases within a single strand or bases in different duplexes may be cross-linked, e.g., by using psoralen.
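As a rough illustration of the complementary-base route to a hairpin, the sketch below measures how many nucleotides at the two ends of a candidate headpiece can pair with each other. It is a simplified check under stated assumptions: only Watson-Crick pairs in a plain A/C/G/T alphabet are considered (wobble pairs, modified nucleotides, and cross-links are ignored), and the function names, the minimum loop size, and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_stem_length(seq, min_loop=3):
    """Length of the longest Watson-Crick stem formed by the two ends of seq.

    The 5'-most base must pair with the 3'-most base, the next base inward
    with the next base inward, and so on, leaving at least min_loop unpaired
    nucleotides in the middle.
    """
    max_stem = (len(seq) - min_loop) // 2
    stem = 0
    for k in range(1, max_stem + 1):
        if seq[:k] == reverse_complement(seq[-k:]):
            stem = k
        else:
            break
    return stem


# Hypothetical 12-nucleotide headpiece: a 4-base stem closing a 4-base loop.
example_stem = terminal_stem_length("GCGCTTTTGCGC")
```

A stem length of zero means the ends cannot pair at all, so a hairpin would have to come from one of the other routes described above, such as cross-linking.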

The headpiece or complex may optionally include one or more labels for detection. For example, the headpiece, one or more oligonucleotide tags, and/or one or more primer sequences can include an isotope, a radioimaging agent, a marker, a tracer, a fluorescent tag (e.g., rhodamine or fluorescein), a chemiluminescent tag, a quantum dot, or a reporter molecule (e.g., biotin or histidine tag).

In other embodiments, the headpiece or tag may be modified to support solubility under semi-aqueous, reduced-aqueous, or non-aqueous (e.g., organic) conditions. Nucleotide bases of the headpiece or tag can be made more hydrophobic by modifying, for example, the C5 position of T or C bases with an aliphatic chain, without significantly disrupting their ability to form hydrogen bonds with their complementary bases. Exemplary modified or substituted nucleotides are 5'-dimethoxytrityl-N4-diisobutylaminomethylidene-5-(1-propynyl)-2'-deoxycytidine, 3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; 5'-dimethoxytrityl-5-(1-propynyl)-2'-deoxyuridine, 3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; 5'-dimethoxytrityl-5-fluoro-2'-deoxyuridine, 3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; and 5'-dimethoxytrityl-5-(pyren-1-yl-ethynyl)-2'-deoxyuridine, 3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite.

In addition, the headpiece oligonucleotide may be interspersed with modifications that increase solubility in organic solvents. For example, an azobenzene phosphoramidite can introduce a hydrophobic moiety into the headpiece design. Such insertion of a hydrophobic amidite into the headpiece may occur anywhere in the molecule. However, the insertion must not interfere with subsequent tagging using additional DNA tags during library synthesis, with PCR once selection is complete, or with microarray analysis, if used for tag deconvolution. Such additions to the headpiece design described herein may render the headpiece soluble in, for example, 15%, 25%, 30%, 50%, 75%, 90%, 95%, 98%, 99%, or 100% organic solvent. Thus, the addition of hydrophobic residues to the headpiece design improves solubility under semi-aqueous or non-aqueous (e.g., organic) conditions while keeping the headpiece usable for oligonucleotide tagging. In addition, DNA tags subsequently introduced into the library may also be modified at the C5 position of T or C bases, so that they also render the library more hydrophobic and soluble in organic solvents for subsequent steps of library synthesis.

In particular embodiments, the headpiece and the first tag may be the same entity, i.e., multiple headpiece-tag entities may be constructed, all sharing a common portion (e.g., a primer binding region) and all differing on another portion (e.g., a coding region). They can be used in the "split" step and assembled after the events they encode have occurred.

In particular embodiments, the headpiece may encode information, for example by including a sequence that encodes the first split step or a sequence that encodes the identity of the library, such as a particular sequence associated with a particular library.

Oligonucleotide tags

The oligonucleotide tags described herein (e.g., a tag, a portion of a headpiece, or a portion of a tailpiece) can be used to encode any useful information, such as a molecule, a portion of a chemical entity, the addition of a component (e.g., a scaffold or a building block), a headpiece in the library, the identity of the library, the use of one or more library members (e.g., the use of the members of an aliquot of a library), and/or the source of a library member (e.g., by use of a source sequence).

Any sequence in an oligonucleotide may be used to encode any information. Thus, one oligonucleotide sequence may serve more than one purpose, for example to encode two or more types of information or to provide a starting oligonucleotide that also encodes one or more types of information. For example, the first tag may encode both the addition of a first building block and the identity of the library. In another example, a headpiece can be used to provide a starting oligonucleotide that operably links a chemical entity to a tag, wherein the headpiece additionally includes a sequence that encodes the identity of the library (i.e., a library-identifying sequence). Thus, any of the information described herein can be encoded in separate oligonucleotide tags or can be combined and encoded in the same oligonucleotide sequence (e.g., in an oligonucleotide such as a tag or a headpiece).

The building block sequence encodes the identity of the building block and/or the type of binding reaction to be performed using the building block. Such building block sequences are included in a tag, wherein the tag may optionally include one or more types of sequences (e.g., library-identifying sequences, use sequences, and/or source sequences) as described below.

The library-identifying sequence encodes the identity of a particular library. To allow two or more libraries to be mixed, a library member may contain one or more library-identifying sequences, such as in a library-identifying tag (i.e., an oligonucleotide that includes a library-identifying sequence), in a ligated tag, in a portion of the headpiece sequence, or in the tailpiece sequence. These library-identifying sequences can be used to derive the encoding relationship by which the tag sequences are translated and correlated with chemical (synthetic) history information. Thus, these library-identifying sequences allow two or more libraries to be mixed together for selection, amplification, purification, sequencing, and the like.

The use sequence encodes the history (i.e., use) of one or more library members in an individual aliquot of the library. For example, separate aliquots can be treated with different reaction conditions, components, and/or selection steps. In particular, such sequences can be used to identify such aliquots and infer their history (use), and thus allow aliquots of the same library having different histories (uses) (e.g., different selection experiments) to be mixed together for selection, amplification, purification, sequencing, and the like. These use sequences can be included in the headpiece, the tailpiece, a tag, a use tag (i.e., an oligonucleotide that includes the use sequence), or any other tag described herein (e.g., a library-identifying tag or a source tag).
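The pooling described above implies a sorting step after sequencing: each read is assigned back to its library and to its aliquot history before the building-block tags are decoded. A minimal sketch follows, assuming a hypothetical read layout in which a fixed-length library-identifying sequence comes first, a fixed-length use sequence second, and the building-block tags last. The identifier sequences, labels, and layout are illustrative assumptions, not values from this disclosure.

```python
from collections import defaultdict

# Hypothetical identifier sequences and labels, for illustration only.
LIBRARY_IDS = {"ACGT": "library_1", "TGCA": "library_2"}
USE_IDS = {"AAA": "selection_with_target", "CCC": "no_target_control"}


def demultiplex(reads, library_ids, use_ids):
    """Group pooled reads by (library, use) and return the leftover reads.

    Each read is assumed to be: library-identifying sequence, then use
    sequence, then the remaining encoding region.
    """
    id_len = len(next(iter(library_ids)))
    use_len = len(next(iter(use_ids)))
    bins = defaultdict(list)
    unassigned = []
    for read in reads:
        library = library_ids.get(read[:id_len])
        use = use_ids.get(read[id_len:id_len + use_len])
        if library is None or use is None:
            unassigned.append(read)  # unknown identifier, e.g. a sequencing error
        else:
            bins[(library, use)].append(read[id_len + use_len:])
    return dict(bins), unassigned


pooled_reads = ["ACGTAAAGGTT", "TGCACCCGGTT", "ACGTCCCAATT", "GGGGAAAGGTT"]
bins, unassigned = demultiplex(pooled_reads, LIBRARY_IDS, USE_IDS)
```

Exact matching is used for brevity; a practical pipeline would likely tolerate a small number of mismatches in the identifier regions.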

The source sequence is a degenerate (randomly generated) oligonucleotide sequence of any useful length (e.g., about six nucleotides) that encodes the source of a library member. Such sequences are used to randomly subdivide library members that are otherwise identical in all respects into entities that are distinguishable by sequence information, such that the observation of amplification products derived from a unique progenitor template (e.g., a selected library member) can be distinguished from the observation of multiple amplification products derived from the same progenitor template (e.g., a selected library member). For example, after library formation and prior to the selection step, each library member may include a different source sequence, for example in a source tag. After selection, selected library members can be amplified to produce amplification products, and the portion of each library member expected to include the source sequence (e.g., in the source tag) can be observed and compared with the source sequence in each of the other library members. Since the source sequences are degenerate, amplification products derived from different library members should have different source sequences. Observing the same source sequence in several amplification products, however, may indicate multiple amplicons derived from the same template molecule. Source tags may be used when it is desired to determine the statistics and demographics of the population of encoding tags before amplification rather than after amplification. These source sequences can be included in the headpiece, the tailpiece, a tag, a source tag (i.e., an oligonucleotide that includes the source sequence), or any other tag described herein (e.g., a library-identifying tag or a use tag).
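The counting logic in this paragraph can be sketched as follows: reads that share both the same encoding sequence and the same source sequence are treated as amplicons of one template, so the number of distinct source sequences per encoding sequence estimates the pre-amplification count. The function name, the input format (pairs of encoding sequence and source sequence), and the example reads are hypothetical.

```python
from collections import defaultdict


def count_unique_templates(reads):
    """Count distinct source sequences observed for each encoding sequence.

    reads is an iterable of (encoding_sequence, source_sequence) pairs.
    Repeated pairs are collapsed, since they are taken to be amplification
    products of the same progenitor template.
    """
    sources_seen = defaultdict(set)
    for encoding_seq, source_seq in reads:
        sources_seen[encoding_seq].add(source_seq)
    return {encoding: len(sources) for encoding, sources in sources_seen.items()}


# Hypothetical post-amplification reads: (encoding sequence, six-nucleotide source sequence).
example_reads = [
    ("AACCGGACACAC", "GATTAC"),
    ("AACCGGACACAC", "GATTAC"),  # same template amplified twice
    ("AACCGGACACAC", "CCATGG"),  # a second, independent template
    ("TTGGCCGTGTGT", "TTAGGC"),
]
template_counts = count_unique_templates(example_reads)
```

With short source sequences, two independent templates can carry the same source sequence by chance, so the counts from this sketch are a lower bound rather than an exact figure.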

Any type of sequence described herein may be included in the headpiece. For example, the headpiece can include one or more of a building block sequence, a library-identifying sequence, a use sequence, or a source sequence.

Any of these sequences described herein may be included in the tailpiece. For example, the tailpiece can include one or more of a library-identifying sequence, a use sequence, or a source sequence.

Any of the tags described herein may include a linker having a fixed sequence at or near the 5'- or 3'-end. The linker facilitates the formation of a bond (e.g., a chemical bond) by providing a reactive group (e.g., a chemically reactive group or a photoreactive group) or by providing a site for an agent that allows a bond to form (e.g., an agent such as an intercalating moiety or a reversibly reactive group in the linker or in a cross-linking oligonucleotide). Each 5'-linker may be the same or different, and each 3'-linker may be the same or different. In an exemplary non-limiting complex with more than one tag, each tag can include a 5'-linker and a 3'-linker, where each 5'-linker has the same sequence and each 3'-linker has the same sequence (e.g., where the sequence of the 5'-linker can be the same as or different from the sequence of the 3'-linker). The linker provides a sequence that can be used for one or more bonds. To allow binding of a transfer primer or hybridization of a cross-linking oligonucleotide, the linker may include one or more functional groups that allow bond formation (e.g., a bond, such as a chemical bond, through which a polymerase has reduced ability to read or translocate).

These sequences may include any modification described herein for an oligonucleotide, such as one or more modifications that promote solubility in organic solvents (e.g., any described herein, such as for the headpiece), that provide an analog of the natural phosphodiester bond (e.g., a phosphorothioate analog), or that provide one or more non-natural nucleotides (e.g., 2'-substituted nucleotides, such as 2'-O-methylated nucleotides and 2'-fluoro nucleotides, or any of the nucleotides described herein).

These sequences may include any of the features described herein for an oligonucleotide. For example, these sequences may be included in a tag of less than 20 nucleotides (e.g., a tag as described herein). In other examples, tags that include one or more of these sequences have about the same mass (e.g., each tag has a mass within about +/-10% of the average mass of the particular set of tags that encodes a particular variable); lack a primer binding (e.g., constant) region; lack a constant region; or have a constant region of reduced length (e.g., less than 30 nucleotides, less than 25 nucleotides, less than 20 nucleotides, less than 19 nucleotides, less than 18 nucleotides, less than 17 nucleotides, less than 16 nucleotides, less than 15 nucleotides, less than 14 nucleotides, less than 13 nucleotides, less than 12 nucleotides, less than 11 nucleotides, less than 10 nucleotides, less than 9 nucleotides, less than 8 nucleotides, or less than 7 nucleotides in length).
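The "about the same mass" criterion can be checked numerically for a set of tags that encodes one variable. The sketch below is illustrative only. It sums approximate average residue masses for unmodified DNA nucleotides (the values are commonly quoted approximations and are assumptions here, not figures from this disclosure), ignores end groups and any modifications, and the function name and example tags are hypothetical.

```python
# Approximate average residue masses (g/mol) for unmodified DNA nucleotides.
# These are assumed values for illustration and ignore terminal groups.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def tags_within_mass_tolerance(tags, tolerance=0.10, residue_mass=RESIDUE_MASS):
    """Return one boolean per tag: True if its mass is within the given
    fractional tolerance of the average mass of the tag set."""
    masses = [sum(residue_mass[base] for base in tag) for tag in tags]
    mean_mass = sum(masses) / len(masses)
    return [abs(mass - mean_mass) <= tolerance * mean_mass for mass in masses]


# Equal-length tags differ only slightly in mass; a tag of twice the length does not fit.
equal_length_ok = tags_within_mass_tolerance(["ACGT", "AAAA", "GGGG"])
mixed_length_ok = tags_within_mass_tolerance(["AAAA", "AAAAAAAA"])
```

Because the residue masses differ by only a few percent, the check is driven mostly by tag length in this simplified model.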

Sequencing strategies for libraries and oligonucleotides of this length may optionally include concatenation or linkage strategies to increase read fidelity or sequencing depth, respectively. In particular, the selection of encoded libraries lacking primer binding regions has been described in the literature for SELEX, such as in Jarosch et al., Nucleic Acids Res. 34:e86 (2006), which is incorporated herein by reference. For example, library members can be modified (e.g., after the selection step) to include a first adaptor sequence on the 5'-end of the complex and a second adaptor sequence on the 3'-end of the complex, wherein the first sequence is substantially complementary to the second sequence and results in duplex formation. To further improve yield, two fixed dangling nucleotides (e.g., CC) are added to the 5'-end.

Linkages

The bonds of the invention are present between oligonucleotides that encode information (e.g., between the headpiece and a tag, between two tags, or between a tag and a tailpiece). Exemplary bonds include phosphodiester linkages, phosphonate linkages, and phosphorothioate linkages. In some embodiments, a polymerase has reduced ability to read or translocate through one or more of the bonds. In certain embodiments, a chemical bond includes one or more chemically reactive groups (such as a monophosphate and/or a hydroxyl group), a photoreactive group, an intercalating moiety, a cross-linking oligonucleotide, or a reversible co-reactive group.

A bond can be tested to determine whether a polymerase has reduced ability to read through or translocate through the bond. This ability can be tested by any useful method, such as liquid chromatography-mass spectrometry, RT-PCR analysis, sequence population statistics, and/or PCR analysis. In some embodiments, chemical ligation includes the use of one or more chemically reactive pairs to provide the bond, such as a monophosphate and a hydroxyl group. As described herein, a readable bond can be synthesized by chemical ligation, for example by reaction of a monophosphate, monothiophosphate, or monophosphonate at the 5'- or 3'-terminus with a hydroxyl group at the 5'- or 3'-terminus in the presence of cyanoimidazole and a divalent metal source (e.g., ZnCl₂).

Other exemplary chemically reactive pairs are the following: an optionally substituted alkynyl group and an optionally substituted azido group, to form a triazole via a Huisgen 1,3-dipolar cycloaddition reaction; an optionally substituted diene having a 4π-electron system (e.g., an optionally substituted 1,3-unsaturated compound, such as optionally substituted 1,3-butadiene, 1-methoxy-3-trimethylsilyl-1,3-butadiene, cyclopentadiene, cyclohexadiene, or furan) and an optionally substituted dienophile or optionally substituted heterodienophile having a 2π-electron system (e.g., an optionally substituted alkenyl group or an optionally substituted alkynyl group), to form a cycloalkene via a Diels-Alder reaction; a nucleophile (e.g., an optionally substituted amine or an optionally substituted thiol) and a strained heterocyclic electrophile (e.g., an optionally substituted epoxide, aziridinium ion, or episulfonium ion), to form a heteroalkyl group via a ring-opening reaction; a phosphorothioate group and an iodo group, as in a splinted ligation of an oligonucleotide containing 5'-iodo dT with a 3'-phosphorothioate oligonucleotide; an optionally substituted amino group and an aldehyde group or a ketone group, such as the reaction of a 3'-aldehyde-modified oligonucleotide (which may optionally be obtained by oxidation of a commercially available 3'-glyceryl-modified oligonucleotide) with a 5'-amino oligonucleotide (i.e., in a reductive amination reaction) or a 5'-hydrazine oligonucleotide; an optionally substituted amino group and a carboxylic acid group or a thiol group (e.g., with or without the use of succinimidyl trans-4-(maleimidomethyl)cyclohexane-1-carboxylate (SMCC) or 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide (EDAC)); an optionally substituted hydrazine and an aldehyde or ketone group; an optionally substituted hydroxylamine and an aldehyde or ketone group; or a nucleophile and an optionally substituted alkyl halide.

Platinum complexes, alkylating agents, or furan-modified nucleotides may also be used as chemically reactive groups to form interchain or intrachain bonds. Such an agent may be used between two oligonucleotides, and it may optionally be present in a cross-linking oligonucleotide.

Exemplary non-limiting platinum complexes include cisplatin (cis-diamminedichloroplatinum(II), e.g., to form GG intrachain bonds), transplatin (trans-diamminedichloroplatinum(II), e.g., to form GXG interchain bonds, where X may be any nucleotide), carboplatin, picoplatin (ZD0473), ormaplatin, or oxaliplatin to form, e.g., GC, CG, AG, or GG bonds. Any of these bonds may be interchain or intrachain bonds.

Exemplary non-limiting alkylating agents include nitrogen mustards (e.g., mechlorethamine (e.g., to form GG linkages), chlorambucil, melphalan, cyclophosphamide, or prodrug forms of cyclophosphamide (e.g., 4-hydroperoxycyclophosphamide and ifosfamide)), 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU, carmustine), aziridines (e.g., mitomycin C, triethylenemelamine, or triethylenethiophosphoramide (thiotepa) to form GG or AG linkages), hexamethylmelamine, alkylsulfonates (e.g., busulfan to form GG linkages), or nitrosoureas (e.g., 2-chloroethylnitrosoureas to form GG or CG linkages, such as carmustine (BCNU), chlorozotocin, lomustine (CCNU), and semustine (methyl-CCNU)). Any of these bonds may be interchain or intrachain bonds.

Furan-modified nucleotides may also be used to form the bond. Upon in situ oxidation (e.g., with N-bromosuccinimide (NBS)), the furan moiety forms a reactive oxo-enal derivative that reacts with the complementary base to form an interchain bond. In some embodiments, the furan-modified nucleotide forms a bond with a complementary A or C nucleotide. Exemplary non-limiting furan-modified nucleotides include any 2'-(furan-2-yl)propionylamino-modified nucleotide, or an acyclic modified nucleotide of a 2-(furan-2-yl)ethyl glycol nucleic acid.

Photoreactive groups may also be used as reactive groups. Exemplary non-limiting photoreactive groups include an intercalating moiety, a psoralen derivative (e.g., psoralen, HMT-psoralen, or 8-methoxypsoralen), an optionally substituted cyanovinylcarbazole group, an optionally substituted vinylcarbazole group, an optionally substituted cyanovinyl group, an optionally substituted acrylamide group, an optionally substituted diazirine group, an optionally substituted benzophenone (e.g., succinimidyl ester of 4-benzoylbenzoic acid or benzophenone isothiocyanate), an optionally substituted 5-(carboxy)vinyluridine group (e.g., 5-(carboxy)vinyl-2'-deoxyuridine), or an optionally substituted azide group (e.g., an aryl azide or haloaryl azide, such as succinimidyl ester of 4-azido-2,3,5,6-tetrafluorobenzoic acid (ATFB)).

The intercalating moiety may also serve as a reactive group. Exemplary non-limiting intercalating moieties include psoralen derivatives, alkaloid derivatives (e.g., berberine, palmatine, coralyne, sanguinarine (e.g., an iminium or alkanolamine form thereof), or aristololactam-β-D-glucoside), ethidium cations (e.g., ethidium bromide), acridine derivatives (e.g., proflavine, acridine yellow, or amsacrine), anthracycline derivatives (e.g., doxorubicin, epirubicin, daunorubicin (daunomycin), idarubicin, and aclarubicin), or thalidomide.

For cross-linking oligonucleotides, any available reactive group (e.g., a group described herein) can be used to form interchain or intrachain bonds. Exemplary reactive groups include chemically reactive groups, photoreactive groups, intercalating moieties, and reversible co-reactive groups. Cross-linking agents for use with the cross-linking oligonucleotide include, but are not limited to, alkylating agents (e.g., as described herein), cisplatin (cis-diamminedichloroplatinum(II)), trans-diamminedichloroplatinum(II), psoralen, HMT-psoralen, 8-methoxypsoralen, furan-modified nucleotides, 2-fluoro-deoxyinosine (2-F-dI), 5-bromo-deoxycytidine (5-Br-dC), 5-bromo-deoxyuridine (5-Br-dU), 5-iodo-deoxycytidine (5-I-dC), 5-iodo-deoxyuridine (5-I-dU), succinimidyl trans-4-(maleimidomethyl)cyclohexane-1-carboxylate (SMCC), EDAC, or succinimidyl acetylthioacetate (SATA).

Oligonucleotides may also be modified to contain thiol moieties, which can react with various thiol-reactive groups, such as maleimide, halogen, and iodoacetamide, and can thus be used to cross-link two oligonucleotides. The thiol group may be attached to the 5'- or 3'-terminus of the oligonucleotide.

For interchain cross-linking between double-stranded oligonucleotides at pyrimidine (e.g., thymidine) positions, the intercalating photoreactive moiety psoralen may be selected. Upon irradiation with ultraviolet light (about 254 nm), psoralen intercalates into the duplex and forms covalent interchain cross-links with pyrimidines, preferably at 5'-TpA sites. The psoralen moiety may be covalently linked to a modified oligonucleotide (e.g., via an alkane chain, such as C_1-10 alkyl, or a polyethylene glycol group, such as -(CH₂CH₂O)_nCH₂CH₂-, where n is an integer of 1 to 50). Psoralen derivatives may also be used, with non-limiting derivatives including 4'-(hydroxyethoxymethyl)-4,5',8-trimethylpsoralen (HMT-psoralen) and 8-methoxypsoralen.

The various portions of the cross-linking oligonucleotide may be modified to introduce bonds. For example, a terminal phosphorothioate in an oligonucleotide may also be used to ligate two adjacent oligonucleotides. Halogenated uracils/cytosines may also be used as cross-linker modifications in oligonucleotides. For example, a 2-fluoro-deoxyinosine (2-F-dI) modified oligonucleotide may be reacted with a disulfide containing diamine or thiopropylamine to form a disulfide bond.

As described below, reversible co-reactive groups include those selected from a cyanovinylcarbazole group, a cyanovinyl group, an acrylamide group, a thiol group, or a sulfonylethyl sulfide. Optionally substituted cyanovinylcarbazole (CNV) groups may also be used in oligonucleotides to cross-link to pyrimidine bases (e.g., cytosine, thymine, and uracil, and modified bases thereof) in the complementary strand. Upon irradiation at 366 nm, the CNV group promotes [2+2] cycloaddition with the adjacent pyrimidine base, which results in interchain cross-linking. Irradiation at 312 nm reverses the cross-linking and thus provides a means for reversible cross-linking of oligonucleotide strands. A non-limiting CNV group is 3-cyanovinylcarbazole, which may be included as a carboxyvinylcarbazole nucleotide (e.g., as 3-carboxyvinylcarbazole-1'-β-deoxynucleoside-5'-triphosphate).

The CNV group can be modified by replacing the reactive cyano group with another reactive group to provide an optionally substituted vinylcarbazole group. Exemplary non-limiting reactive groups for the vinylcarbazole group include an amide group of -CONR_N1R_N2, wherein each of R_N1 and R_N2 may be the same or different and is independently H or C_1-6 alkyl, such as -CONH₂; a carboxyl group of -CO₂H; or a C_2-7 alkoxycarbonyl group (e.g., methoxycarbonyl). Further, the reactive group may be located on the alpha or beta carbon of the vinyl group. Exemplary vinylcarbazole groups include cyanovinylcarbazole groups as described herein; amidovinylcarbazole groups (e.g., amidovinylcarbazole nucleotides, such as 3-amidovinylcarbazole-1'-β-deoxynucleoside-5'-triphosphate); carboxyvinylcarbazole groups (e.g., carboxyvinylcarbazole nucleotides, such as 3-carboxyvinylcarbazole-1'-β-deoxynucleoside-5'-triphosphate); and C_2-7 alkoxycarbonylvinylcarbazole groups (e.g., alkoxycarbonylvinylcarbazole nucleotides, such as 3-methoxycarbonylvinylcarbazole-1'-β-deoxynucleoside-5'-triphosphate). Additional optionally substituted vinylcarbazole groups and nucleotides having such groups are provided in U.S. Patent No. 7,972,792 and in Yoshimura and Fujimoto, Org. Lett. 10:3227-3230 (2008), which are hereby incorporated by reference in their entirety.

Other reversibly reactive groups include a thiol group and another thiol group to form a disulfide, and a thiol group and a vinyl sulfone group to form a sulfonylethyl sulfide. The thiol-thiol group may optionally include a bond formed by reaction with bis- ((N-iodoacetyl) piperazinyl) sulforhodamine. Other reversibly reactive groups (e.g., such as certain photoreactive groups) include optionally substituted benzophenone groups. A non-limiting example is benzophenone uracil (BPU), which can be used for site-selective formation and sequence-selective formation of interchain crosslinks of BPU-containing oligonucleotide duplexes. This crosslinking can be reversed upon heating, providing a means for reversible crosslinking of the two oligonucleotide strands.

In other embodiments, chemical ligation includes the introduction of analogs of phosphodiester bonds, e.g., for post-selection PCR analysis and sequencing. Exemplary analogs of phosphodiesters include a phosphorothioate linkage (e.g., a linkage as introduced by use of a phosphorothioate group and a leaving group such as an iodo group), a phosphoramide linkage, or a phosphorodithioate linkage (e.g., a linkage as introduced by use of a phosphorodithioate group and a leaving group such as an iodo group).

For any group described herein (e.g., a chemically reactive group, a photoreactive group, an intercalating moiety, a cross-linked oligonucleotide, or a reversible co-reactive group), the group can be incorporated at or near the end of the oligonucleotide or between the 5 '-and 3' -ends. In addition, one or more groups may be present in each oligonucleotide. When a reactive group pair is desired, the oligonucleotide can be designed to facilitate the reaction between the group pair. In a non-limiting example of a cyanovinylcarbazole group co-reactive with a pyrimidine base, the first oligonucleotide may be designed to include a cyanovinylcarbazole group at or near the 5' -terminus. In this example, the second oligonucleotide may be designed to be complementary to the first oligonucleotide and include a co-reactive pyrimidine base at a site that aligns with the cyanovinylcarbazole group when the first and second oligonucleotides hybridize. Any of the groups herein and any oligonucleotide having one or more groups can be designed to facilitate a reaction between the groups to form one or more bonds.

Bifunctional spacer

The bifunctional spacer between the headpiece and the chemical entity may be varied to provide an appropriate spacing moiety and/or to increase the solubility of the headpiece in organic solvents. A variety of spacers are commercially available that can couple the headpiece to a library of small molecules. The spacer is generally composed of straight or branched chains and may include C_1-10 alkyl, 1- to 10-atom heteroalkyl, C_2-10 alkenyl, C_2-10 alkynyl, C_5-10 aryl, 3- to 20-atom ring or polycyclic systems, phosphodiesters, peptides, oligosaccharides, oligonucleotides, oligomers, polymers, or polyalkylene glycols (e.g., polyethylene glycols, such as -(CH₂CH₂O)_nCH₂CH₂-, where n is an integer of 1 to 50), or a combination thereof.

Bifunctional spacers can provide an appropriate spacing moiety between the headpiece and the chemical entity of the library. In certain embodiments, the bifunctional spacer comprises three moieties. Moiety 1 may be a reactive group that forms a covalent bond with DNA, such as a carboxylic acid, preferably activated as an N-hydroxysuccinimide (NHS) ester, to react with an amino group (e.g., amino-modified dT) on the DNA; an amidite for modifying the 5'- or 3'-end of a single-stranded headpiece (by standard oligonucleotide chemistry); a chemically reactive pair (e.g., azido-alkyne cycloaddition in the presence of a Cu(I) catalyst, or any described herein); or a thiol-reactive group. Moiety 2 may also be a reactive group, one that forms a covalent bond with the chemical entity, either building block A_n or the scaffold. Such reactive groups may be, for example, amines, thiols, azides, or alkynes. Moiety 3 may be a chemically inert spacing moiety of variable length, introduced between moieties 1 and 2. Such spacing moieties can be chains of ethylene glycol units (e.g., PEGs of different lengths), alkanes, alkenes, polyene chains, or peptide chains. The spacer may contain branches or inserts with hydrophobic moieties (e.g., benzene rings) to improve the solubility of the headpiece in organic solvents, as well as fluorescent moieties (e.g., fluorescein or Cy-3) for library detection purposes. Hydrophobic residues in the headpiece design may be varied together with the spacer design to facilitate library synthesis in organic solvents. For example, the headpiece and spacer combination is designed to have appropriate residues, where the octanol:water coefficient (P_oct) is, for example, 1.0 to 2.5. Spacers can be empirically selected for a given small molecule library design such that the library can be synthesized in organic solvent, e.g., 15%, 25%, 30%, 50%, 75%, 90%, 95%, 98%, 99%, or 100% organic solvent. A model reaction can be carried out prior to library synthesis, varying the spacer to select the chain length that solubilizes the headpiece in organic solvent. Exemplary spacers include those with increased alkyl chain length, increased polyethylene glycol units, positive charges (to neutralize the negative phosphate charges on the headpiece), or increased hydrophobicity (e.g., by addition of benzene ring structures).

Examples of commercially available spacers include amino-carboxylic acid spacers, such as those that are peptides (e.g., Z-Gly-Gly-Gly-OSu (N-α-benzyloxycarbonyl-(glycine)₃-N-succinimidyl ester) or Z-Gly-Gly-Gly-Gly-Gly-Gly-OSu (N-α-benzyloxycarbonyl-(glycine)₆-N-succinimidyl ester, SEQ ID NO:13)), PEG (e.g., Fmoc-amino PEG2000-NHS or amino-PEG(12-24)-NHS), or an alkane acid chain (e.g., Boc-ε-aminocaproic acid-OSu); chemically reactive pair spacers, such as those chemically reactive pairs described herein combined with a peptide moiety (e.g., azidohomoalanine-Gly-Gly-Gly-OSu (SEQ ID NO:2) or propargylglycine-Gly-Gly-Gly-OSu (SEQ ID NO:3)), PEG (e.g., azido-PEG-NHS), or an alkane acid chain moiety (e.g., 5-azidopentanoic acid, (S)-2-(azidomethyl)-1-Boc-pyrrolidine, 4-azidoaniline, or 4-azido-butan-1-oic acid N-hydroxysuccinimide ester); thiol-reactive spacers, such as those that are PEG (e.g., SM(PEG)n NHS-PEG-maleimide) or alkane chains (e.g., 3-(pyridin-2-yldisulfanyl)-propionic acid-OSu or 6-(3'-[2-pyridyldithio]-propionamido)hexanoic acid sulfosuccinimidyl ester); and amidites used in oligonucleotide synthesis, such as amino modifiers (e.g., 6-(trifluoroacetylamino)-hexyl-(2-cyanoethyl)-(N,N-diisopropyl)-phosphoramidite), thiol modifiers (e.g., S-trityl-6-mercaptohexyl-1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite), or chemically reactive pair modifiers (e.g., 6-hexyn-1-yl-(2-cyanoethyl)-(N,N-diisopropyl)-phosphoramidite, 3-dimethoxytrityloxy-2-(3-(3-propargyloxypropionylamino)propionylamino)propyl-1-O-succinyl, long chain alkylamino CPG, or 4-azido-but-1-oic acid N-hydroxysuccinimide ester). Additional spacers are known in the art, and those that can be used during library synthesis include, but are not limited to, 5'-O-dimethoxytrityl-1',2'-dideoxyribose-3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; 9-O-dimethoxytrityl-triethylene glycol, 1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; 3-(4,4'-dimethoxytrityloxy)propyl-1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite; and 18-O-dimethoxytrityl hexaethylene glycol, 1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite. Any of the spacers herein may be added in tandem to one another in different combinations to produce spacers of different desired lengths.

The spacers may also be branched. Branched spacers are well known in the art, and examples may consist of symmetric or asymmetric doublets or symmetric triplets. See, e.g., Newcome et al., Dendritic Molecules: Concepts, Synthesis, Perspectives, VCH Publishers (1996); Boussif et al., Proc. Natl. Acad. Sci. USA 92:7297-; and Jansen et al., Science 266:1226 (1994).

Method for determining the nucleotide sequence of a complex

The invention features methods that include determining the nucleotide sequence of a complex, such that an encoding relationship can be established between the sequence of the assembled tag sequence and the structural units (or building blocks) of the chemical entity. In particular, the identity and/or history of the chemical entity may be inferred from the base sequence of the oligonucleotide. Using this approach, libraries comprising different chemical entities or members (e.g., small molecules or peptides) can be addressed with specific tag sequences.

Any of the linkages described herein may be reversible or irreversible. Reversible linkages include photoreactive linkages (e.g., between a cyanovinylcarbazole group and thymidine) and redox linkages. Additional linkages are described herein.

In an alternative embodiment, the "unreadable" linkage may be enzymatically repaired to produce a readable or at least displaceable linkage. Enzyme repair processes are well known to those skilled in the art and include, but are not limited to, pyrimidine (e.g., thymidine) dimer repair mechanisms (e.g., using a photolyase or a glycosylase (e.g., T4 pyrimidine dimer glycosylase (PDG))); base excision repair mechanisms (e.g., using a glycosylase, an apurinic/apyrimidinic (AP) endonuclease, a Flap endonuclease, or a poly ADP ribose polymerase (e.g., human apurinic/apyrimidinic (AP) endonuclease, APE 1; endonuclease III (Nth) protein; endonuclease IV; endonuclease V; formamidopyrimidine [fapy]-DNA glycosylase (Fpg); human 8-oxoguanine glycosylase 1 (α isoform) (hOGG1); human endonuclease VIII-like 1 (hNEIL1); uracil-DNA glycosylase (UDG); human single-strand-selective monofunctional uracil DNA glycosylase (SMUG1); and human alkyladenine DNA glycosylase (hAAG)), which may optionally be combined with one or more endonucleases, DNA or RNA polymerases, and/or ligases for repair); methylation repair mechanisms (e.g., using methylguanine methyltransferase); AP repair mechanisms (e.g., using an apurinic/apyrimidinic (AP) endonuclease (e.g., APE 1; endonuclease III; endonuclease IV; endonuclease V; Fpg; hOGG1; and hNEIL1), which may optionally be combined with one or more endonucleases, DNA or RNA polymerases, and/or ligases for repair); nucleotide excision repair mechanisms (e.g., using an excision repair cross-complementing protein or an excision nuclease, which may optionally be combined with one or more endonucleases, DNA or RNA polymerases, and/or ligases for repair); and mismatch repair mechanisms (e.g., using an endonuclease (e.g., T7 endonuclease I; MutS, MutH, and/or MutL), which may optionally be combined with one or more exonucleases, endonucleases, helicases, DNA or RNA polymerases, and/or ligases for repair). Commercial enzyme mixtures can be used to readily provide these types of repair mechanisms, for example, PreCR® Repair Mix (New England Biolabs Inc., Ipswich MA), which includes Taq DNA ligase, endonuclease IV, Bst DNA polymerase, Fpg, uracil-DNA glycosylase (UDG), T4 PDG (T4 endonuclease V), and endonuclease VIII.

Method for coding chemical entities within a library

The methods of the invention may utilize libraries having varying numbers of chemical entities encoded by oligonucleotide tags. Examples of building blocks and encoding DNA tags can be found in U.S. patent application publication 2007/0224607, which is incorporated herein by reference.

Each chemical entity is formed by one or more building blocks and optionally a scaffold. The scaffold is used to provide one or more diverse nodes in a particular geometry (e.g., a triazine providing three nodes that are spatially disposed around a heteroaryl ring or linear geometry).

Building blocks and their encoding tags can be added to the headpiece directly or indirectly (e.g., via a spacer) to form a complex. When the headpiece includes a spacer, the building block or scaffold is added to the end of the spacer. When a spacer is not present, the building block may be added directly to the headpiece, or the building block itself may include a spacer that reacts with a functional group of the headpiece. Exemplary spacers and headpieces are described herein.

The scaffold may be added in any useful manner. For example, a scaffold may be added to the end of a spacer or headpiece, and successive building blocks may be added to the available diversity nodes of the scaffold. In another example, building block A_n is first added to the spacer or headpiece, and then a diversity node of scaffold S is reacted with a functional group in building block A_n. Oligonucleotide tags encoding a particular scaffold may optionally be added to the headpiece or complex. For example, S_n is added to the complex in n reaction vessels, where n is an integer greater than 1, and tag S_n (i.e., tag S₁, S₂, … S_n-1, S_n) is bound to a functional group of the complex.

Building blocks may be added in multiple synthetic steps. For example, an aliquot of the headpiece, optionally with a spacer attached, is divided into n reaction vessels, where n is an integer of 2 or greater. In a first step, building block A_n is added to each of the n reaction vessels (i.e., building block A₁, A₂, … A_n-1, A_n is added to reaction vessel 1, 2, … n-1, n), where n is an integer and each building block A_n is unique. In a second step, scaffold S is added to each reaction vessel to form an A_n-S complex. Optionally, scaffold S_n may be added to each reaction vessel to form an A_n-S_n complex, where n is an integer greater than two and each scaffold S_n may be unique. In a third step, building block B_n is added to each of the n reaction vessels containing the A_n-S complex (i.e., building block B₁, B₂, … B_n-1, B_n is added to reaction vessel 1, 2, … n-1, n containing the A₁-S, A₂-S, … A_n-1-S, A_n-S complex), where each building block B_n is unique. In a further step, building block C_n may be added to each of the n reaction vessels containing the B_n-A_n-S complex (i.e., building block C₁, C₂, … C_n-1, C_n is added to reaction vessel 1, 2, … n-1, n containing the B₁-A₁-S … B_n-A_n-S complex), where each building block C_n is unique. The resulting library will have n³ complexes bearing n³ tags. In this way, additional synthetic steps can be used to incorporate additional building blocks and further diversify the library.
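The size of such a library can be illustrated with a minimal sketch. It assumes that complexes are pooled and re-split between steps, so that every A, B, and C building block is combined with every other; the building block names, the value n = 3, and the function `enumerate_library` are hypothetical and chosen only for illustration.

```python
from itertools import product

def enumerate_library(a_blocks, b_blocks, c_blocks, scaffold="S"):
    """Return every C-B-A-scaffold combination as a tuple of names."""
    return [(c, b, a, scaffold)
            for a, b, c in product(a_blocks, b_blocks, c_blocks)]

n = 3  # hypothetical number of building blocks per synthetic step
a_blocks = [f"A{i}" for i in range(1, n + 1)]
b_blocks = [f"B{i}" for i in range(1, n + 1)]
c_blocks = [f"C{i}" for i in range(1, n + 1)]

library = enumerate_library(a_blocks, b_blocks, c_blocks)
print(len(library))  # n**3 distinct complexes
```

With three steps of n building blocks each, the enumeration yields n³ distinct complexes, matching the library size stated in the text.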

After formation of the library, the resulting complex may optionally be purified and subjected to a polymerization or ligation reaction, e.g., to a headpiece. This general strategy can be extended to include additional diversity nodes and components (e.g., D, E, F, etc.). For example, the first diversity node reacts with the building block and/or S and is encoded by the oligonucleotide tag. Additional building blocks are then reacted with the resulting complex and subsequent diversity nodes are derived from the additional building blocks, which are encoded by the primers used in the polymerization or ligation reaction.

To form the encoded library, oligonucleotide tags are added to the complexes after or before each synthetic step. For example, before or after building block A_n is added to each reaction vessel, tag A_n is bound to a functional group of the headpiece (i.e., tag A₁, A₂, … A_n-1, A_n is added to reaction vessel 1, 2, … n-1, n containing the headpiece). Each tag A_n has a distinct sequence that is associated with each unique building block A_n, and determining the sequence of tag A_n provides the chemical structure of building block A_n. In this way, additional tags are used to encode additional building blocks or additional scaffolds.
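The encoding relationship between tags and building blocks can be sketched as a pair of lookup tables, one per synthetic step. This is an illustrative assumption, not the tag design used in practice: the four-base tag length, the sequential assignment of sequences, and the function names `make_tags`, `encode`, and `decode` are all hypothetical.

```python
from itertools import product

TAG_LENGTH = 4  # hypothetical tag length; chosen small for readability

def make_tags(names, length=TAG_LENGTH):
    """Assign each building block name a distinct DNA sequence of fixed length."""
    if len(names) > 4 ** length:
        raise ValueError("not enough distinct sequences for this tag length")
    sequences = ("".join(bases) for bases in product("ACGT", repeat=length))
    return dict(zip(names, sequences))

def encode(blocks, tag_tables):
    """Concatenate the tag for each building block, in order of addition."""
    return "".join(table[block] for block, table in zip(blocks, tag_tables))

def decode(sequence, tag_tables, length=TAG_LENGTH):
    """Recover the building block names from an assembled tag sequence."""
    chunks = [sequence[i:i + length] for i in range(0, len(sequence), length)]
    decoded = []
    for chunk, table in zip(chunks, tag_tables):
        reverse = {seq: name for name, seq in table.items()}
        decoded.append(reverse[chunk])
    return decoded

tags_a = make_tags(["A1", "A2", "A3"])
tags_b = make_tags(["B1", "B2", "B3"])
barcode = encode(["A2", "B3"], [tags_a, tags_b])
print(barcode, decode(barcode, [tags_a, tags_b]))
```

Because each position in the assembled sequence is read against the table for its own step, the same short sequence can be reused for an A tag and a B tag without ambiguity in this sketch.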

In addition, the last label added to the complex may also include a primer binding sequence or provide a functional group that allows binding (e.g., by ligation) of a primer binding sequence. The primer binding sequences can be used to amplify and/or sequence the oligonucleotide tags of the complexes. Exemplary methods for amplification and for sequencing include Polymerase Chain Reaction (PCR), linear amplification (LCR), Rolling Circle Amplification (RCA), or any other method known in the art for amplifying or determining nucleic acid sequences.

Using these methods, large libraries can be formed having a large number of encoded chemical entities. For example, a headpiece is reacted with a spacer and building block A_n, where the building block comprises 1,000 different variants (i.e., n = 1,000). For each building block A_n, DNA tag A_n is ligated or primer-extended to the headpiece. These reactions can be performed in a 1,000-well plate or in 10 x 100-well plates. All reactions can be pooled, optionally purified, and split into a second set of plates. Next, the same procedure can be performed with building block B_n, which also comprises 1,000 different variants. DNA tag B_n can be ligated to the A_n-headpiece complex, and all reactions can be pooled. The resulting library comprises 1,000 x 1,000 combinations of A_n x B_n (i.e., 1,000,000 compounds) tagged with 1,000,000 different combinations of tags. The same approach can be extended to add building blocks C_n, D_n, E_n, etc. The resulting library can then be used to identify compounds that bind to a target. The structures of the chemical entities that bind to the target can optionally be assessed by PCR and sequencing of the DNA tags to identify the enriched compounds.

This method may be modified to avoid tagging after the addition of each building block or to avoid pooling (or mixing). For example, the method may be modified by adding building block A_n to n reaction vessels, where n is an integer greater than 1, and adding the same building block B₁ to each reaction well. Here, B₁ is identical for each chemical entity, and, therefore, an oligonucleotide tag encoding this building block is not needed. After a building block is added, the complexes may or may not be pooled. For example, the library is not pooled after the final step of building block addition, and the pools are screened separately to identify compounds that bind to the target. To avoid pooling all of the reactions after synthesis, binding on a sensor surface can be monitored in a high-throughput format (e.g., 384-well plates and 1,536-well plates) using, for example, ELISA, SPR, ITC, Tm shift, SEC, or a similar assay. For example, building block A_n can be encoded with DNA tag A_n, and building block B_n can be encoded by its position within the well plate. Candidate compounds can then be identified by using a binding assay (e.g., ELISA, SPR, ITC, Tm shift, SEC, or a similar assay) and by analyzing the A_n tags by sequencing, microarray analysis, and/or restriction digest analysis. This analysis allows identification of the combinations of building blocks A_n and B_n that produce the desired molecules.
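The mixed encoding described above, with building block A_n read from its DNA tag and building block B_n read from the well position, can be sketched as follows. The well identifiers, tag sequences, read counts, and the `min_count` threshold are hypothetical values used only to show the lookup logic.

```python
def identify_hits(well_results, well_to_b, tag_to_a, min_count=10):
    """Combine positional (B) and tag-based (A) encoding to name candidate hits.

    well_results maps each well to {tag_sequence: read_count} for that well.
    Returns (A block, B block, count) tuples, highest count first.
    """
    hits = []
    for well, tag_counts in well_results.items():
        b_block = well_to_b[well]          # B is encoded by well position
        for tag, count in tag_counts.items():
            if count >= min_count:         # simple illustrative cutoff
                hits.append((tag_to_a[tag], b_block, count))
    return sorted(hits, key=lambda hit: -hit[2])

well_to_b = {"A01": "B1", "A02": "B2"}        # hypothetical plate map
tag_to_a = {"AAAA": "A1", "AAAC": "A2"}       # hypothetical tag table
well_results = {
    "A01": {"AAAA": 3, "AAAC": 42},
    "A02": {"AAAA": 15, "AAAC": 0},
}
hits = identify_hits(well_results, well_to_b, tag_to_a)
print(hits)
```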

The amplification method can optionally include forming a water-in-oil emulsion to form a plurality of aqueous microreactors. Reaction conditions (e.g., concentration of complexes and size of microreactors) can be adjusted to provide (on average) microreactors having at least one member of a library of compounds. Each microreactor may also comprise a target, a single bead capable of binding to a complex or a portion of a complex (e.g., one or more labels) and/or binding to a target, and an amplification reaction solution having one or more necessary reagents for nucleic acid amplification. After amplification of the label in the microreactor, the amplified copy of the label will bind to the bead in the microreactor and the coated bead can be identified by any available method.
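How the average loading of microreactors relates to their occupancy can be estimated with a short calculation. The sketch below assumes that complexes distribute among droplets according to a Poisson distribution, which is a common modeling assumption for emulsions and is not stated in the text; the function names and the example mean are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k complexes in a microreactor (Poisson model)."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def occupancy(mean):
    """Fractions of microreactors that are empty, singly, or multiply occupied."""
    empty = poisson_pmf(0, mean)
    single = poisson_pmf(1, mean)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}

# Hypothetical average of one complex per microreactor
print(occupancy(1.0))
```

Under this assumption, raising the complex concentration or the droplet size increases the mean and reduces the empty fraction, at the cost of more droplets holding several library members.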

Once the building blocks from the first library that bind to the target of interest are identified, a second library can be prepared in an iterative manner. For example, one or two additional diversity nodes can be added and a second library formed and sampled, as described herein. This process can be repeated as many times as necessary to form a molecule having the desired molecular and pharmaceutical properties.

Various ligation techniques may be used to add scaffolds, building blocks, spacers, linkages, and tags. Thus, any of the binding steps described herein may include any useful ligation technique or techniques. Exemplary ligation techniques include enzymatic ligation, e.g., enzymatic ligation using one or more RNA ligases and/or DNA ligases, as described herein; and chemical ligation, e.g., using a chemically reactive pair, as described herein.

Screening method

There are a number of established techniques for determining the binding of a compound to a protein, e.g., by determining K_d. Methods for detecting or quantifying binding of a compound to a target protein include, for example, absorbance, fluorescence, Raman scattering, phosphorescence, luminescence, luciferase assays, and radioactivity. Exemplary techniques include surface plasmon resonance (SPR) and fluorescence polarization (FP). SPR measures the change in refractive index at a metal surface when a compound binds to a protein immobilized on that surface, and FP measures the change in tumbling rate caused by the binding of a compound to a protein, using the loss of polarization of incident light. In some embodiments, these methods can be used to experimentally confirm the binding of a candidate compound to a target protein predicted using the methods of the invention.
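As a reminder of what a dissociation constant expresses, the following sketch evaluates the standard single-site binding isotherm, in which the fraction of protein bound equals [L] / (K_d + [L]). It assumes simple 1:1 binding at equilibrium with the compound in excess over the protein; the concentrations shown are hypothetical.

```python
def fraction_bound(ligand_conc, kd):
    """Fraction of protein bound for 1:1 binding with ligand in excess.

    ligand_conc and kd must be in the same concentration units.
    """
    if ligand_conc < 0 or kd <= 0:
        raise ValueError("ligand_conc must be >= 0 and kd must be > 0")
    return ligand_conc / (kd + ligand_conc)

kd = 1e-6  # hypothetical K_d of 1 micromolar
for conc in (1e-7, 1e-6, 1e-5):
    print(conc, round(fraction_bound(conc, kd), 3))
```

When the compound concentration equals K_d, half of the protein is bound, which is why a lower K_d corresponds to tighter binding.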

Alternatively, compounds that bind to the target protein can be identified using affinity-based methods. For example, a target protein with an affinity tag (e.g., a poly-His tag) can be pre-incubated with saturating concentrations of one or more candidate compounds. Subsequent affinity purification and compound identification (e.g., by using an identity tag) will allow identification of compounds that bind to the target protein.

Target protein

A target protein (e.g., a eukaryotic target protein such as a mammalian target protein or a fungal target protein or a prokaryotic target protein such as a bacterial target protein) is a protein that mediates a disease condition or a symptom of a disease condition. Thus, a desired therapeutic effect can be obtained by modulating (inhibiting or increasing) its activity.

The target protein may be naturally occurring, e.g., wild-type. Alternatively, the target protein may differ from the wild-type protein but still retain biological function, e.g., as an allelic variant, splice variant, or biologically active fragment.

In some embodiments, the target protein is an enzyme (e.g., a kinase). In some embodiments, the target protein is a transmembrane protein. In some embodiments, the target protein has a coiled coil structure. In certain embodiments, the target protein is a dimeric complex protein.

In some embodiments, the target protein is a GTPase, such as DIRAS1, DIRAS2, ERAS, GEM, HRAS, KRAS, MRAS, NKIRAS2, NRAS, RALA, RALB, RAP 12, RAP 22, RASD 102, RASL11 2, RASL 2, REM2, rerp, rgl, RRAD, RRAS2, RASL10 2, RASL11 RAB2, RASL 2, rasp 7, rasp 2, rasp 7 RAB 72, rasp 2, RASL 2, rasp 7 RAB7, rasp 7 RAB 72, rasp 7, rasp 7 RAB 72, rasp 7, rasp 72, rasp 7, rasp 72, rasp 7 RAB 72, rasp 7, rasp 72, rasp 7, rasp 72, rasp 7 RAB 72, rasp 7, rasp 72, rasp 7 rabb, rasp 7 RAB 72, rasp 7, rasp 72, RAP2, ARF, ARL5, ARL10, ARL13, ARL, TRIM, ARL4, ARFRP, ARL13, RAN, RHEB, RHEBL, RRAD, GEM, REM, RIT, RHOT, or RHOT. In some embodiments, the target protein is a GTPase activating protein, such as NF1, IQGAP1, PLEXIN-B1, RASAL1, RASAL2, ARHGAP5, ARHGAP8, ARHGAP12, ARHGAP22, ARHGAP25, BCR, DLC1, DLC2, DLC3, GRAF, RALBP1, RAP1GAP, SIPA1, TSC2, AGAP2, ASAP1, or ASAP 3. In some embodiments, the target protein is a guanylate exchanger, such as CNRASGEF, RASGEF1A, RASGRF2, RASGRP1, RASGRP4, SOS1, RALGDS, RGL1, RGL2, RGR, ARHGEF 2, ASEF/ARHGEF 2, ASEF2, DBS, ECT2, GEF-H2, LARG, NET 2, OBSCURIN, P-REX2, PDZ-RHOGEF, TEM 72, TIAM 2, TRIO, VAV2, DOCK2, C3 2, BIDODG 2/AREF 2, A EF3672, FG 2, or FBP 100. In certain embodiments, the target protein is a protein having a protein-protein interaction domain, such as ARM; a BAR; BEACH; BH; BIR; BRCT; BROMO; BTB; c1; c2; a CARD; CC; CALM; CH (CH); CHROMO; CUE; DEATH; DED; DEP; DH; EF-hand; an EH; ENTH; EVH 1; f-box; FERM; FF; FH 2; FHA; FYVE; GAT; GEL; GLUE; GRAM; GRIP; GYF, respectively; HEAT; HECT; IQ; an LRR; MBT; MH 1; MH 2; MIU; NZF; PAS; PB 1; PDZ; PH value; POLO-Box; PTB; a PUF; PWWP; PX; RGS; RING; SAM; SC; SH 2; SH 3; SOCS; SPRY; START; SWIRM; TIR; TPR; TRAF; SNARE; TUBBY; TUDOR; UBA; UEV; UIM; VHL; VHS; WD 40; WW; SH 2; SH 3; TRAF; a bromodomain; or TPR. In some embodiments, the target protein is a heat shock protein, such as Hsp20, Hsp27, Hsp70, Hsp84, α B crystals, TRAP-1, hsf1, or Hsp 90. 
In certain embodiments, the target protein is an ion channel, such as Cav2.2, Cav3.2, IKACh, Kv1.5, TRPA1, Nav1.7, Nav1.8, Nav1.9, P2X3, or P2X4. In some embodiments, the target protein is a coiled-coil protein such as geminin, SPAG4, VAV1, MAD1, ROCK1, RNF31, NEDP1, HCCM, EEA1, Vimentin, ATF4, Nemo, SNAP25, Syntaxin 1a, FYCO1, or CEP250. In certain embodiments, the target protein is a kinase, such as ABL, ALK, AXL, BTK, EGFR, FMS, FAK, FGFR, 2,3, 4, FLT, HER/ErbB, IGF1, INSR, JAK, KIT, MET, pdgf, PDGFRB, RET RON, ROR, ROS, SRC, SYK, TIE, TRKA, TRKB, KDR, AKT, PDK, PKC, RHO, ROCK, RSK, RKS, ATM, ATR, CDK, ik, rkk, GSK3, JNK, ARuB, PLK, pkk, raf, PKN, fack, etc. In some embodiments, the target protein is a phosphatase, such as WIP1, SHP2, SHP1, PRL-3, PTP1B, or STEP. In certain embodiments, the target protein is a ubiquitin ligase, such as BMI-1, MDM2, NEDD4-1, β-TRCP, SKP2, E6AP, or APC/C. In some embodiments, the target protein is a chromatin modifying/remodeling factor, such as that encoded by genes BRG1, BRM, ATRX, PRDM3, ASH1L, CBP, KAT6A, KAT6B, MLL, NSD1, SETD2, EP300, KAT2A, or CREBBP. 
In some embodiments, the target protein is a transcription factor, such as a transcription factor encoded by: EHF, ELF1, ELF3, ELF4, ELF5, ELK1, ELK3, ELK4, ERF, ERG, ETS 4, ETV4, FEV, FLI 4, GAVPA, SPDEF, SPI 4, SPIC, SPIB, E2F4, ARNTL, BHLHA 4, BHLHB 4, BHLBHB 4, BHE 4, BHLHE4, CLOCK, FIGLA 4, HES 4, HEY 4, HEHEHEY 4, HESABL 4, CALN 4, CALNF 4, CALN 4, CALNF 4, CALN 4, CALNF 4, CALN 4, CALNF 4, CALN, HOXA, HOXAB, HOXB, HOXC, HOXD, IRX, ISL, ISX, LBX, LHX, LMX1, MEIS, MEOX, MIXL, MNX, MSX, NKX-3, NKX-8, NKX-1, NKX-2, NOTO, ONECUT, ONECOX, OTX, PDX, PHOX2, PITX, PINOX, PROP, PRRX, RAX, RAXL, RHOXF, SHIF, SHOX, TGIF, POTGIF 2, VACX, HOXB, HOXCX, HOXB, LBX, MSX, LBX, MSX, NKX-2, NKX-2, NKX, POXU, PFX, SMAD3, CENPB, PAX1, BCL6 1, EGR1, GLIS1, GLI 1, GLIS1, HIC 1, HINFP1, KLF1, MTF1, PRDM1, SCRT1, SNAI 1, SP1, YY1, ZBZBZBN 1, ZBTB 71, FONXNFX 1, FONZNFX 1, FONXNFX 1, FONXN 1, FONX 1, FONXN 1, FONX 1, GATA3, GATA4, or GATA 5; or C-Myc, Max, Stat3, androgen receptor, C-Jun, C-Fox, N-Myc, L-Myc, MITF, Hif-1 alpha, Hif-2 alpha, Bcl6, E2F1, NF-kappa B, Stat5, or ER (contact). In certain embodiments, the target protein is TrkA, P2Y14, mPEGS, ASK1, ALK, Bcl-2, BCL-XL, mSIN1, ROR γ t, IL17RA, eIF4E, TLR 7R, PCSK9, IgE R, CD40, CD40L, Shn-3, TNFR1, TNFR2, IL31RA, OSMR, IL12 β 1,2, Tau, FASN, KCTD 6, KCTD 9, Raptor, Rictor, RALGPA, Membrane connexin family members, BCOR, NCOR, β catenin, AAC 11, PLD 12, Frizzled 12, RaplP, MLL-1, Myb, Ezh 12, RhoGD12, EGFR, CTLA4 (12), GCGC coact), Adiconin R72, GPRp 12, GPR-12, or Nrl 12-12, or NGPR 12-12.

Virtual screening method

Data collection and statistical result generation

In some embodiments, the steps in the virtual screening methods of the invention include obtaining data derived from a DNA-encoding library selection experiment (e.g., an affinity-based experiment) directed to a target protein. The selection data are read out as DNA sequences, which are then aggregated into statistical readouts, such as sequence counts. Aggregation into statistical readouts is based on grouping commonly encoded compounds, e.g., the putative chemical structure encoded by the DNA (instance level) or partial substructures of the encoded chemical structure (mono-, di-, or tri-synthon level). A cutoff applied to the statistical readouts obtained from sequencing of one or more selection conditions is used to determine whether a compound, or a portion of a compound, binds to the target (i.e., is a binder). Millions (or even billions) of sequences are used per selection condition in order to collect statistics significant enough to reflect true underlying small molecule/protein binding.
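The aggregation of sequence reads into counts at the instance and synthon levels, followed by a cutoff, might be sketched as below. The sketch assumes reads have already been decoded into (A, B, C) building block triples; the read data, the cutoff values, and the function names `aggregate_counts` and `binders` are hypothetical.

```python
from collections import Counter

def aggregate_counts(reads):
    """Aggregate decoded reads into instance-, mono-, and di-synthon counts.

    reads is an iterable of (A, B, C) building block tuples, one per read.
    """
    instance, mono, di = Counter(), Counter(), Counter()
    for a, b, c in reads:
        instance[(a, b, c)] += 1                 # full encoded structure
        for position, block in enumerate((a, b, c)):
            mono[(position, block)] += 1         # single building block
        di[("AB", a, b)] += 1                    # pairs of building blocks
        di[("AC", a, c)] += 1
        di[("BC", b, c)] += 1
    return instance, mono, di

def binders(counts, cutoff):
    """Return the keys whose count meets or exceeds the cutoff."""
    return {key for key, count in counts.items() if count >= cutoff}

# Hypothetical decoded reads from one selection condition
reads = ([("A1", "B1", "C1")] * 5
         + [("A1", "B1", "C2")] * 3
         + [("A2", "B3", "C1")])
instance, mono, di = aggregate_counts(reads)
print(binders(instance, cutoff=4), binders(di, cutoff=6))
```

In this toy data set the A1-B1 di-synthon passes the cutoff even though only one full structure does, which illustrates why aggregation at the synthon level can reveal binding substructures that individual instances do not.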

Machine learning

Machine learning methods are known in the art. Non-limiting machine learning methods include Naïve Bayes, Random Forest, Decision Tree, Support Vector Machine, Neural Net, and Deep Learning.

In some embodiments, each data point from the data collection step is used to train a machine learning algorithm. Each data point includes information derived from the molecular structure (in whole or in part) of a compound from the DNA-encoding library and the associated statistics from one or more selection experiments. The structure is used to generate numerical inputs (calculated chemical properties such as molecular weight and cLogP) and binary strings (e.g., chemical fingerprints that reflect atoms, groups of atoms, and connectivity within the structure). The readouts of these molecular calculations are used as input columns for training and prediction by the machine learning algorithm. In some embodiments, the model is constructed such that the only inputs required are those directly derived from the molecular structure. In some embodiments, a prediction can be produced for any structure from which these fingerprints and properties can be calculated.
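A minimal sketch of how the numerical properties and the fingerprint bits could be combined into a single input row is given below. It assumes the properties and fingerprint have already been calculated by a cheminformatics tool; the property values, the 8-bit string, and the function `build_input_row` are hypothetical.

```python
def build_input_row(properties, fingerprint,
                    property_order=("molecular_weight", "clogp")):
    """Concatenate calculated properties and fingerprint bits into one row."""
    numeric = [float(properties[name]) for name in property_order]
    bits = [int(char) for char in fingerprint]   # "1"/"0" string to 1/0 ints
    return numeric + bits

# Hypothetical precomputed values for one library compound
row = build_input_row({"molecular_weight": 350.4, "clogp": 2.1}, "10110000")
print(row)
```

Fixing the column order in `property_order` keeps training rows and prediction rows aligned, so that any structure for which the same properties and fingerprint can be computed yields a compatible input.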

In some embodiments, further structural derivatives of the compounds (e.g., core analysis with side chains removed) may be used to generate further fingerprints and property calculations, or alternative structural fingerprints for training and prediction.

In some embodiments, data from one or more DNA-encoding library selections is used to assess whether a molecule is considered to represent an instance of a binding agent (positive), a non-binding agent (negative), or a non-specific binding agent (negative). Although the assessment (positive or negative) is based on the behavior of the encoding molecule in at least one DNA encoding library selection, additional information from other sources can be used to assess the positive and negative classifications used for training. It is also noteworthy that structures known to be synthesized in the library but not showing any counts from sequencing are considered negative examples in training. In some embodiments, a positive control is included within the dataset. For example, binding interaction data from compounds with known binding affinity for the target protein (e.g., known inhibitors or natural ligands) can be included.

In one embodiment, the assessment of binding of the input molecule is determined by detecting a statistically significant enrichment (elevated sequence count) in the selection comprising the target protein. Enrichment under control conditions that did not include the target protein was also used to assess the specificity of binding. Such conditions typically include a resin for capturing the protein during selection, but no addition of the protein. Additional information can be used to determine whether a particular molecule or portion of a molecule is labeled positive, e.g., enriched or not enriched under additional conditions or when selecting for a protein of interest. Information derived from selection for a number of non-target proteins may also be used, for example, a count of the total number of proteins for which a given molecule or portion of molecules has been shown to be enriched in selection. For example, detecting enrichment for a given molecule for several additional targets in a database may result in a negative indication due to lack of specificity.
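The labeling logic described above can be sketched as a simple rule set. In place of a formal test of statistical significance, this illustration uses fixed thresholds; the thresholds, the pseudocount, and the function `label_compound` are assumptions made for the example and are not taken from the text.

```python
def label_compound(target_count, control_count, n_other_targets_enriched,
                   min_count=10, min_fold=5.0, max_other_targets=2):
    """Label a compound 'positive' or 'negative' for training.

    target_count: sequence count in the selection containing the target.
    control_count: sequence count in the no-target (resin only) control.
    n_other_targets_enriched: number of unrelated targets showing enrichment.
    """
    if target_count < min_count:
        return "negative"      # not enriched, including zero-count structures
    fold = target_count / (control_count + 1)  # pseudocount avoids divide by 0
    if fold < min_fold:
        return "negative"      # also enriched without target: nonspecific
    if n_other_targets_enriched > max_other_targets:
        return "negative"      # enriched against many targets: promiscuous
    return "positive"

print(label_compound(50, 0, 0))    # enriched only with the target
print(label_compound(50, 40, 0))   # enriched in the no-target control too
print(label_compound(50, 0, 5))    # enriched against several other targets
```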

Molecular representation

In some embodiments of the invention, molecular representations are used to generate estimated binding calculations. Molecular representations include, for example, topological representations, electrostatic representations, geometric representations, or quantum chemical representations. Topological representations can be based on atoms, features, or functional groups and their connectivity (e.g., fingerprints, connection tables, molecular connectivity, and/or molecular graph representations). Electrostatic representations include, for example, surface electrons. Geometric representations are, for example, pharmacophores, pharmacophore fingerprints, shape-based fingerprints, and/or 3D molecular coordinates using atoms, features, or functional groups. In some embodiments, a quantum chemical representation is used. In some embodiments, the electronic molecular representation is a chemical fingerprint.

In some embodiments, a step in the virtual screening method of the invention comprises generating a chemical fingerprint both for the compounds for which binding interaction data has been generated and for the candidate compounds. Chemical fingerprints may be generated using any method known in the art, such as ECFP6, FCFP6, ECFP4, MACCS, or Morgan/circular fingerprints. The chemical fingerprints are then analyzed to identify patterns, e.g., to identify structural features that increase or decrease binding to the target protein. Information generated from chemical fingerprint comparisons of a large number of compounds (e.g., at least 250,000 molecules) may be used to increase the accuracy of the estimated binding interactions generated, as compared to chemical fingerprint comparisons of a smaller number of compounds, e.g., fewer than 100,000 compounds. In some embodiments, chemical fingerprints are used as the primary information for machine learning in the method.
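One simple way to look for fingerprint bits associated with binding is to compare how often each bit is set among positive and negative examples. The sketch below assumes fingerprints are already available as lists of 0/1 values; the four 4-bit fingerprints and their labels are hypothetical, and the difference-of-rates statistic is an illustrative choice rather than the analysis used in this disclosure.

```python
def bit_enrichment(fingerprints, labels, n_bits):
    """For each bit, return (fraction set among positives) - (fraction set among negatives)."""
    pos = [fp for fp, y in zip(fingerprints, labels) if y]
    neg = [fp for fp, y in zip(fingerprints, labels) if not y]
    diffs = []
    for b in range(n_bits):
        pos_rate = sum(fp[b] for fp in pos) / len(pos)
        neg_rate = sum(fp[b] for fp in neg) / len(neg)
        diffs.append(pos_rate - neg_rate)
    return diffs

# Hypothetical fingerprints: two binders followed by two non-binders.
example_fps = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
example_labels = [True, True, False, False]

diffs = bit_enrichment(example_fps, example_labels, n_bits=4)
# Positive values mark bits over-represented among binders,
# negative values mark bits over-represented among non-binders.
```

With real data, fingerprints such as ECFP4 would be computed by a cheminformatics toolkit, and both classes must be non-empty for the rates to be defined.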

For example, an exemplary training set input for an 8-bit fingerprint may include:

Fingerprints are representations of chemical entities. Machine learning is performed by inputting the training rows, i.e., for each compound, the fingerprint-bit columns plus a training column indicating whether the compound is a positive or negative example.

Various algorithms (random forest (RF), naive Bayes, deep learning, neural networks, etc.) operate by finding patterns that are related to true or false indications. These patterns may involve one or more bits. They can be found by explicitly analyzing statistical results (e.g., naive Bayes, random forests) or by empirical feedback from varying model parameters (e.g., neural networks).
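As a minimal illustration of one of the named algorithm families, the following sketch trains a Bernoulli naive Bayes classifier on 8-bit fingerprints with a Boolean training column. The six training rows are hypothetical and are not the exemplary training set of this disclosure; the Laplace smoothing is likewise an assumption made so that unseen bit values do not produce zero probabilities.

```python
from math import exp, log

def train_naive_bayes(rows, labels):
    """Estimate a class prior and per-bit probabilities for each class."""
    n_bits = len(rows[0])
    model = {}
    for cls in (True, False):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        # Laplace smoothing: add one pseudo-count for "bit set" and one for "bit unset".
        bit_probs = [(sum(r[b] for r in members) + 1) / (len(members) + 2)
                     for b in range(n_bits)]
        model[cls] = (prior, bit_probs)
    return model

def predict_proba(model, fingerprint):
    """Return the model's probability that the fingerprint belongs to a binder."""
    log_scores = {}
    for cls, (prior, bit_probs) in model.items():
        score = log(prior)
        for bit, p in zip(fingerprint, bit_probs):
            score += log(p if bit else 1.0 - p)
        log_scores[cls] = score
    top = max(log_scores.values())           # subtract the maximum for numerical stability
    weights = {cls: exp(s - top) for cls, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])

# Hypothetical 8-bit training rows; the label list is the training column.
TRAIN_ROWS = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 1],
]
TRAIN_LABELS = [True, True, True, False, False, False]

model = train_naive_bayes(TRAIN_ROWS, TRAIN_LABELS)
p_binder_like = predict_proba(model, [1, 0, 1, 0, 0, 0, 1, 0])
p_nonbinder_like = predict_proba(model, [0, 1, 0, 0, 1, 0, 0, 1])
```

The sketch requires at least one example of each class. At the data set sizes contemplated here, an established machine learning library would normally be used instead of hand-written code.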

Another method that may be used is to add columns of computed properties (e.g., MW, cLogP, tPSA) in addition to the fingerprint. In this case, the machine learning algorithm may utilize these other columns in its statistical analysis or its model parameter search. Using the properties in the analysis may improve the accuracy of the prediction compared to a prediction performed without using the properties.
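Extending a training or prediction row with property columns can be sketched as a simple concatenation. The helper name `build_feature_row`, the column order, and the property values below are hypothetical.

```python
def build_feature_row(fingerprint_bits, properties,
                      property_order=("MW", "cLogP", "tPSA")):
    """Append computed-property columns, in a fixed order, to the fingerprint bits."""
    return list(fingerprint_bits) + [float(properties[name]) for name in property_order]

# Hypothetical compound: 8 fingerprint bits plus three computed properties.
row = build_feature_row(
    [1, 0, 1, 1, 0, 0, 1, 0],
    {"MW": 312.4, "cLogP": 2.1, "tPSA": 78.5},
)
```

The same column order has to be used for training and for prediction. Some learners (e.g., neural networks) typically also benefit from rescaling numeric columns so that they are on a range comparable to the 0/1 fingerprint bits.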

The molecules subsequently predicted in this approach are represented in exactly the same way as those in the training set, the key difference being that the training column seen above is now unknown. The model generates prediction values to be filled into a binding characteristic column (e.g., a binding prediction column). In some embodiments, the column is Boolean (T/F), categorical (e.g., non-binding agent, competitive binding agent, non-competitive binding agent), or numeric (e.g., reflecting a probability score for a binding agent).

Molecules to be predicted that comprise only the fingerprint columns may be used with the model generated by the first approach described above.

The following is an exemplary prediction with input information extended to include properties that can be used with the model created by the second embodiment above.

Output

In some embodiments, the generated model will produce a binary score that indicates whether the candidate compound is positive or negative, or a probability score (e.g., from 0 to 1) that indicates the likelihood, as assigned by the model, that the candidate compound is active or binds. This value can then be used to make a go/no-go decision for a given molecule (binary case) or to inform the prioritization of candidate compounds (probability score).
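The two uses of the model output can be sketched as follows. The compound identifiers, the scores, and the 0.5 decision threshold are hypothetical.

```python
def go_no_go(score, threshold=0.5):
    """Binary decision: True means go, False means no-go (threshold is an assumption)."""
    return score >= threshold

def prioritize(scores):
    """Return compound identifiers ordered from highest to lowest probability score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical probability scores produced by a trained model.
scores = {"cmpd_A": 0.91, "cmpd_B": 0.12, "cmpd_C": 0.67}

decisions = {name: go_no_go(s) for name, s in scores.items()}
ranking = prioritize(scores)
```

In practice the threshold would be tuned against held-out data, since it trades the number of compounds advanced against the expected active rate among them.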

Examples

Example 1

Selection data for soluble epoxide hydrolase (sEH) from a set of libraries were used to train one of several machine learning models (random forest, naive Bayes, or neural networks), which was then used to predict the selection behavior, against the same target, of molecules from libraries not included in the training set. The libraries used in the training set included a linear peptide library with 25,844,065 compounds, a 3-cycle pyrazole library with 3,976,320 compounds, a 2-cycle pyridine library with 5,079,459 compounds, and a 4-cycle macrocycle library with 1,511,399,304 compounds. Libraries for the prediction set included a 3-cycle linear peptide library of 221,580,000 compounds, a 3-cycle pyridine library of 285,917,292 compounds, and a 2-cycle benzimidazole library of 1,622,820 compounds.

As shown in fig. 1, an enrichment of binders was seen in the prediction set. The 4 quadrants in the figure represent the prediction of positive disynthons using increasing numbers of libraries (left to right, top to bottom). The Y-axis represents the enrichment of positives in the prediction set compared to random selection from the original population. The Y-axis shows the percentage of positives found in the prediction set in the original set. The results show that for the training and test sets (the test set being disynthons held out of the training set, but from the same libraries), the enrichment of the prediction set was always 2-2.5 times that of the original population. The prediction set consists of disynthons from libraries not used for training. In this case, increasing the number of libraries used for training shows an increased positive rate in the predicted population compared to the original population.

Example 2

Selection data for sEH from the same libraries as in Example 1 were used with machine learning algorithms (RF, MLP, deep learning) to train and generate models for predicting the activity of molecules not found in DNA-encoding libraries. For example, data is input and a model is generated that predicts the activity of the tested molecules in a conventional High Throughput Screening (HTS) assay (i.e., automated testing of tens of thousands to millions of molecules). Predictions by the model are used as a filter to generate a short list (e.g., 100 compounds) from an initial list of 10,000 to 100,000 or more molecules. The goal is to select the molecules in this short list such that the final list is greatly enriched (10X to 100X) in active molecules relative to the underlying rate of active molecules in the initial set.
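The enrichment referred to here is the ratio of the active rate in the short list to the active rate in the initial set. The sketch below works through that arithmetic with hypothetical numbers (10 actives in a 100-compound short list, 100 actives in a 10,000-compound initial set); none of these counts are results from this disclosure.

```python
def enrichment_factor(actives_in_short_list, short_list_size,
                      actives_in_initial_set, initial_set_size):
    """Active rate of the short list divided by the active rate of the initial set."""
    # Cross-multiplied form of (a / n) / (A / N) to limit rounding error.
    return (actives_in_short_list * initial_set_size) / (
        short_list_size * actives_in_initial_set)

# Hypothetical counts: 10% actives in the short list versus 1% in the initial set.
fold = enrichment_factor(10, 100, 100, 10_000)
```

With these assumed counts the short list is 10-fold enriched, the low end of the 10X to 100X range stated above.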

As shown in fig. 2, enrichment of predicted molecules greater than 40X has been observed compared to random selection. Fig. 2 shows a number of trials over time as the predictive model was improved. The trend shows an increase in enrichment of primary HTS hits and of more rigorously confirmed actives in the prediction set compared to random selection. The confirmed actives were subjected to a second, confirmatory biochemical assay in which activity was demonstrated. The best results show a >40-fold improvement in the resulting prediction set compared to molecules randomly selected from the original population.

Example 3 optimization of prediction

For a given target or targets, there is a known set of HTS data. Multiple parameter settings are tested in order to achieve a high prediction rate. In fact, a high prediction rate is the result of fine-tuning the prediction based on HTS results. HTS is used to demonstrate suitability, and the model can then be used to make predictions for new or existing compounds (e.g., commercially available compounds or compounds from a pre-existing proprietary compound library). These molecules can then be tested with the expectation of a higher active rate within the prediction set (e.g., greater than 1% or 10% active molecules), regardless of the underlying active rate of a random sample.

Example 4 optimization of prediction

Data from selections directed against a given target but under different conditions (e.g., using different protein fragments, mutants, isoforms, using closely related targets, using known small molecule competitors, etc.) are used to further refine the definition of positive data in the training set used to train the model.

Example 5 optimization of prediction

Data from selections for 10 to 100 protein targets, mutants, isoforms, etc. were used as a series of additional data columns to define positive or negative examples for training machine learning models.

Other embodiments

Various modifications and variations of the described methods and systems of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. While the invention has been described in connection with specific desired embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in the medical, pharmacological or related fields are intended to be within the scope of the present invention.

The following is claimed in the present application.

Claims

1. A method, comprising the steps of:

2. The method of claim 1, wherein the plurality of binding interaction findings comprises at least one million binding interaction findings.

3. The method of claim 1 or 2, wherein at least 95% of the plurality of binding interaction findings represent binding interactions between the target protein and a compound comprising a nucleotide tag encoding the identity of the compound.

4. The method of any one of claims 1 to 3, wherein at least 99% of the plurality of binding interaction findings represent a binding interaction between the target protein and a compound comprising a nucleotide tag encoding the identity of the compound.

5. The method of any one of claims 1 to 4, wherein at least 50% of the plurality of binding interaction findings are determined by simultaneously contacting a plurality of compounds comprising a nucleotide tag encoding the identity of the compound with the target protein.

6. The method according to any one of claims 1 to 5, wherein the method further comprises providing one or more additional plurality of binding interaction findings for one or more additional target proteins, wherein at least 50% of the plurality of binding interaction findings represent binding interactions between the additional target protein and compounds from the plurality of binding interactions with the target protein.

7. The method of claim 6, wherein the list of candidate compounds can be displayed and ranked by the selectivity of a candidate compound for the target protein relative to the one or more additional target proteins.

8. The method of claim 6 or 7, wherein the one or more additional target proteins comprise a mutant of the target protein.

9. The method of any one of claims 1 to 8, wherein the method further comprises providing one or more additional plurality of binding interaction findings of one or more negative control experiments, wherein at least 50% of the plurality of binding interaction findings represent negative control experiments with compounds from the plurality of binding interactions with the target protein.

10. The method of any one of claims 1 to 9, wherein the method further comprises transmitting the list of candidate compounds over the internet or to a display device.

11. The method of any of claims 1-10, wherein the physical computing device is accessed and operated over the internet.

12. The method of any one of claims 1 to 11, wherein the estimated binding interaction is generated using chemical structure comparison.

13. The method of claim 12, wherein the chemical structure comparison utilizes molecular representation.

14. The method of claim 13, wherein the molecular representation comprises a chemical fingerprint.

15. The method of claim 14, wherein the chemical fingerprint is an ECFP6, FCFP6, ECFP4, MACCS, or Morgan/circular fingerprint.

16. The method of any one of claims 1-15, wherein the method further comprises generating a confidence score for each estimated binding interaction of a candidate compound, wherein the confidence score is generated using a chemical structure comparison of the candidate compound to one or more compounds from the plurality of binding interactions with the target protein.

17. The method of claim 16, wherein the chemical structure comparison is a principal component analysis.

18. The method of claim 16 or 17, wherein the list of candidate compounds is capable of being displayed and ranked by a confidence score of the estimated binding interaction of the candidate compounds.

19. The method of any one of claims 1 to 18, wherein the method further comprises providing one or more property findings for the set of candidate compounds.

20. The method of claim 19, wherein the one or more property findings comprise molecular weight and/or cLogP.

21. The method of claim 19 or 20, wherein the one or more property findings are utilized to generate the estimated binding interaction.

22. The method of any one of claims 19 to 21, wherein the list of candidate compounds is capable of being displayed and ranked by the one or more property findings.

23. The method of any one of claims 1 to 22, wherein the method further comprises (d) synthesizing one or more of the candidate compounds from the list of candidate compounds.

24. The method of claim 23, wherein the method further comprises contacting the one or more synthetic candidate compounds with the target protein to determine one or more experimental binding interactions.

25. A computer readable medium having stored thereon executable instructions for directing a physical computing device to perform a method comprising:

26. A physical computing device having a representation of a set of candidate compounds and programmed with executable instructions to direct the device to perform a method comprising: