EP1807783A4 - Methods for describing a group of chemical structures - Google Patents
Methods for describing a group of chemical structures
- Publication number
- EP1807783A4
- Authority
- EP
- European Patent Office
- Prior art keywords
- common core
- substituent
- attachment
- database
- graphical representation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- C—CHEMISTRY; METALLURGY
- C07—ORGANIC CHEMISTRY
- C07K—PEPTIDES
- C07K1/00—General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
- C07K1/04—General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length on carriers
- C07K1/047—Simultaneous synthesis of different peptide species; Peptide libraries
-
- C—CHEMISTRY; METALLURGY
- C07—ORGANIC CHEMISTRY
- C07K—PEPTIDES
- C07K5/00—Peptides containing up to four amino acids in a fully defined sequence; Derivatives thereof
- C07K5/04—Peptides containing up to four amino acids in a fully defined sequence; Derivatives thereof containing only normal peptide links
- C07K5/08—Tripeptides
- C07K5/0802—Tripeptides with the first amino acid being neutral
- C07K5/0804—Tripeptides with the first amino acid being neutral and aliphatic
- C07K5/0808—Tripeptides with the first amino acid being neutral and aliphatic the side chain containing 2 to 4 carbon atoms, e.g. Val, Ile, Leu
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/80—Data visualisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Definitions
- This invention relates to methods for describing groups of chemical libraries.
- compound libraries can be analyzed by using an algorithm to identify, link, and group chemical substructures.
- a graphical representation of these substructure relationships in a hierarchical, branching format can be generated. This representation can highlight the structural variation explored in developing a library. Activity or property data can be overlaid onto this branched representation. Significant information can be found by pinpointing the effect on these values of substructural changes at key regions in the molecule.
- the libraries can be a collection of compounds assembled only in a database, and need not be a physical collection of compounds.
- a library can be a group of compounds described in a database (such as a medicinal chemistry database) or a patent.
- a method of organizing a group of compounds includes (a) providing a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identifying a common core from at least a portion of the plurality of structures, (c) binning the substituents present on the core according to site of attachment to the core, and (d) repeating steps (a)-(c) using a bin of substituents as the plurality of chemical structures, thereby identifying a subcore from the bin of substituents, if present.
- Step (d) can be repeated for each site of attachment of the identified common core.
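Steps (a)-(d) can be sketched as a short recursion. This is a minimal illustration under stated assumptions, not the claimed implementation: each structure is a hypothetical `(label, {site: child})` tuple in which a text label stands in for a real substructure, and identifying a common core is reduced to grouping structures that share the same root label instead of running a substructure search.

```python
from collections import defaultdict

# Hypothetical toy model: a structure is (fragment_label, {site: child_structure}).
# Labels stand in for real substructures; no chemistry toolkit is involved.

def organize(structures):
    """Group structures by shared root fragment, bin children by site, recurse."""
    by_core = defaultdict(list)
    for label, children in structures:          # step (b): find shared cores
        by_core[label].append(children)

    result = {}
    for core, child_maps in by_core.items():
        bins = defaultdict(list)                # step (c): bin by attachment site
        for children in child_maps:
            for site, child in children.items():
                bins[site].append(child)
        # step (d): each bin becomes the new plurality of structures
        result[core] = {site: organize(members)
                        for site, members in sorted(bins.items())}
    return result


library = [
    ("indole", {"R1": ("phenyl", {"R1": ("Cl", {})}), "R2": ("Me", {})}),
    ("indole", {"R1": ("phenyl", {"R1": ("F", {})}), "R2": ("Et", {})}),
    ("indole", {"R1": ("Me", {}), "R2": ("Me", {})}),
]
hierarchy = organize(library)
```

In this toy run, `phenyl` surfaces as a subcore at R1 because two substituents share it and vary further, while `Me` and `Et` end the recursion as leaves.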
- Binning can include clustering the substituents based on a measure of similarity of the substituents.
- Binning can include comparing the substituents to a database including chemical structures.
- Binning can include fragmenting the substituents and comparing the fragments.
- Identifying a common core for a bin of substituents can include manually selecting a common core for the bin of substituents.
- the method can include aligning two or more substituents.
- the method can include identifying a distinct second common core from at least a portion of the plurality of structures.
- the method can be implemented automatically by a computer.
- a common core can be identified by a user or by the computer.
- the identified common core can be a maximal common substructure.
- the measure of similarity can include a measure of structural similarity.
- the measure of structural similarity can include the number of ring bonds.
- the method can include generating a database including a structure of the identified common core.
- the database can include a structure of a substituent attached at a site of attachment to the identified common core.
- the database can include a structure of each substituent attached at the site of attachment to the identified common core.
- the database can include a structure of a substituent present at each site of attachment on the common core.
- the database can include a structure of a subcore.
- the structure of a substituent in the database can be associated with an identifier indicating a common core and a site of attachment on the common core.
- Each structure of a substituent in the database can be associated with an identifier indicating a compound of the group having the substituent.
- the database can include a structure of each core and of each substituent in the plurality of chemical structures.
- the database can include a structure of a compound of the group.
- the structure of a compound of the group can be associated with information about a property of the compound of the group.
- the database can include a structure of each compound of the group.
- a list of compounds having a single site of variation, and the associated information for each compound in the list can be extracted from the database. Every possible list of compounds having a single site of variation, and the associated information for each compound in each list, can be extracted from the database.
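Extracting lists of compounds that differ at a single site can be sketched as a grouping problem. The example below is an assumption-laden sketch: compounds are plain dictionaries of site to substituent label on one shared core, every compound lists every site (a null substituent such as `"H"` is written out), and the names `single_site_tables` and `properties` are hypothetical.

```python
from collections import defaultdict

def single_site_tables(compounds, properties):
    """Return {(varied_site, fixed_substituents): [(id, substituent, property)]}.

    compounds:  {compound_id: {site: substituent_label}} on one shared core
    properties: {compound_id: value}, e.g. a measured activity
    """
    tables = defaultdict(list)
    for compound_id, substituents in compounds.items():
        for site, substituent in substituents.items():
            # Everything except the varied site must match within a table.
            fixed = tuple(sorted((s, v) for s, v in substituents.items() if s != site))
            tables[(site, fixed)].append(
                (compound_id, substituent, properties.get(compound_id)))
    # Keep only tables in which the varied site really takes several values.
    return {key: sorted(rows) for key, rows in tables.items()
            if len({substituent for _, substituent, _ in rows}) > 1}


compounds = {
    "c1": {"R1": "H", "R2": "Me"},
    "c2": {"R1": "Cl", "R2": "Me"},
    "c3": {"R1": "Cl", "R2": "Et"},
}
properties = {"c1": 6.1, "c2": 7.4, "c3": 5.0}   # made-up illustrative values
sar_tables = single_site_tables(compounds, properties)
```

Each resulting table holds one site variable and the rest of the molecule constant, which is the shape of a conventional SAR table.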
- a method of visualizing a group of compounds includes creating a first representation of a plurality of chemical structures representative of at least a portion of the group of compounds, the first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node.
- the root corresponds to a common core of at least a portion of the plurality
- the primary branch point corresponds to a site of attachment of a substituent on the common core
- the leaf node corresponds to a substituent attached to the common core.
- the representation can include an intermediate node connected to a primary branch point, wherein the intermediate node corresponds to a subcore belonging to at least two substituents attached to the corresponding site of attachment.
- the connection between the leaf node and the primary branch point can be free of an intermediate node.
- the intermediate node can be connected to a secondary branch point, the secondary branch point corresponding to a site of attachment of a substituent on the subcore.
- Each site of attachment of a substituent to the common core can be represented by a primary branch point connected to the root.
- Each site of attachment of a substituent to the subcore can be represented by a secondary branch point connected to the intermediate node.
- Each substituent in the plurality can be represented by a leaf node.
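The node roles described above (root, primary and secondary branch points, intermediate nodes, leaf nodes) map naturally onto a small tree type. The following is a minimal sketch; the `Node` class, its `kind` strings, and the `render` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # "root", "branch", "subcore", or "leaf"
    label: str
    children: list = field(default_factory=list)

    def add(self, kind, label):
        """Attach a child node and return it so calls can be chained."""
        child = Node(kind, label)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    lines = ["  " * depth + f"{node.kind}: {node.label}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


root = Node("root", "core")
r1 = root.add("branch", "R1")                 # primary branch point
r1.add("leaf", "Me")                          # leaf with no intermediate node
phenyl = r1.add("subcore", "phenyl")          # intermediate node (subcore)
phenyl.add("branch", "R1").add("leaf", "Cl")  # secondary branch point and leaf
print("\n".join(render(root)))
```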
- the first representation can be a graphical representation.
- a feature of the graphical representation can illustrate a property of a compound of the group.
- the feature can be color.
- the property can be a biological activity.
- the method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to the same common core structure as the root of the first graphical representation.
- the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the second graphical representation can correspond to the same sites of attachment, subcores and substituents as the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the first graphical representation.
- a feature of the second graphical representation can illustrate a second property of a compound of the group.
- the method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to a common core structure of at least two members of the group and is different from the common core structure corresponding to the root of the first graphical representation.
- the method can include combining a portion of the first graphical representation with a portion of the second graphical representation thereby creating a third graphical representation, wherein the third graphical representation represents at least one compound not represented by either the first or second graphical representation.
- the method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the chemical structures of the second graphical representation are a subset of the chemical structures of the first graphical representation.
- the method can include accessing a database including a structure of a common core, the common core having a site of attachment of a substituent, and a structure of a substituent present at the site of attachment on the common core.
- the database can include a structure of a subcore attached to the common core and having a site of attachment of a substituent. Accessing the database can include counting the number of sites of attachment on the common core, counting the number of subcores attached to each site of attachment on the common core, counting the number of sites of attachment on each subcore, or counting the number of substituents attached to the common core and the number of substituents attached to each subcore.
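The counts listed above can be read directly off a stored hierarchy. The sketch below assumes a hypothetical in-memory form, `(label, {site: [child, ...]})`, in which a child that has sites of its own is a subcore and a child without sites is a plain substituent.

```python
def site_counts(node):
    """Count subcores and plain substituents at each site of a core or subcore."""
    _label, sites = node
    counts = {}
    for site, attached in sites.items():
        # A child that carries its own sites is treated as a subcore.
        n_subcores = sum(1 for _child_label, child_sites in attached if child_sites)
        counts[site] = {"subcores": n_subcores,
                        "leaves": len(attached) - n_subcores}
    return counts


phenyl = ("phenyl", {"R1": [("Cl", {}), ("F", {})]})
core = ("core", {"R1": [("Me", {}), phenyl],
                 "R2": [("H", {}), ("Et", {})]})

n_sites_on_core = len(core[1])      # number of sites of attachment on the core
per_site = site_counts(core)        # subcores and substituents at each site
per_site_on_subcore = site_counts(phenyl)
```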
- a method of visualizing a group of compounds includes (a) providing a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identifying a common core from at least a portion of the plurality of structures, (c) clustering the substituents present at a site of attachment of the identified common core based on a measure of similarity of the substituents, (d) repeating steps (a)-(c) using a cluster of substituents as the plurality of chemical structures, thereby identifying a subcore from the cluster of substituents, if present, and (e) creating a first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node, where the root corresponds to the common core, the primary branch point corresponds to a site of attachment of a substituent on the common core, and the leaf node corresponds to a substituent attached to the common core.
- a computer program for describing a group of compounds includes instructions for causing a computer system to (a) read a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identify a common core from at least a portion of the plurality of structures, and (c) cluster the substituents present at a site of attachment of the identified common core based on a measure of similarity of the substituents.
- the computer program can include instructions for causing the computer system to (d) repeat steps (a)-(c) using a cluster of substituents as the plurality of chemical structures, thereby identifying a subcore for the cluster of substituents, if present.
- the computer program can generate a database including a structure of the identified common core.
- a computer program for accessing a database includes instructions for causing a computer system to retrieve information in response to a user input from a database describing a plurality of chemical structures representative of at least a portion of a group of compounds, the database including a structure of a common core of at least a portion of the plurality of structures, and a structure of a substituent present at a site of attachment of the common core, the structure of the substituent associated with an identifier identifying a compound having the substituent.
- the computer program can also include instructions for causing a computer system to create a representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node, wherein the root corresponds to the common core; the branch point corresponds to the site of attachment; and the leaf node corresponds to the substituent.
- the computer program can include instructions for causing the computer system to display a chemical structure of a core, a subcore, a substituent, or a chemical structure of the plurality. Displaying the chemical structure can include displaying information about a property of a compound of the group.
- the user input can include a structure of a substituent, and the computer can retrieve from the database a structure of a compound including the substituent.
- the computer can retrieve from the database the structure of each compound described by the database including the substituent, and information about a property of each compound described by the database including the substituent.
- the user input can include a structure of a substituent, subcore, core, or compound of the group; or a partial structure of a substituent, subcore, core, or compound of the group.
- the computer can retrieve from the database a structure of a compound described by the database including the structure of the user input.
- FIG. 1 is a depiction of a SARTree.
- FIG. 2 is a depiction of a SARTree shown with the structures of compounds represented by the SARTree.
- FIGS. 3A and 3B are flowcharts describing how a SARTree can be generated.
- FIG. 4 is a depiction of a SARTree shown with SAR data.
- FIG. 5 is a depiction of a SARTree shown with SAR data.
- FIG. 6 is a depiction of a forest of SARTrees.
- FIG. 7 is a depiction of a SARTree.
- FIG. 8 is a depiction of a SARTree shown with SAR data.
- FIG. 9 is a depiction of a SARTree and subtrees.
- FIGS. 10a and 10b are depictions of a SARTree with colors depicting activity data.
- FIG. 11 schematically depicts the use of SARTrees in library design.
- FIG. 12 is a depiction of a SARTree and the compounds described by the SARTree.
- FIG. 13 depicts cores identified in a SARTree analysis of a library; the identifiers (1-15) correspond to the SARTrees in FIG. 14.
- FIG. 14 depicts the SARTrees identified in a SARTree analysis of a library; the identifiers (1-15) indicate the core associated with each SARTree.
- FIG. 15 depicts a single SARTree with four different properties overlaid on the SARTree.
- FIG. 16 depicts the 10 SAR tables extracted from a SARTree (at center) describing 106 compounds.
- a connectivity-based method organizes and visualizes the structural variation and properties of a group of compounds.
- the SARTree method produces a specialized connected graph (a SARTree) of nodes representing substructures of the compounds and locations on the substructures where variation occurs (i.e., locations on the substructure where an R-group is attached). Connections between the nodes indicate attachments of substructures in the group of compounds.
- Complete molecules are represented in a
- the DIVA software can find and label R-group substituents around a specified core. It then displays these fragments in a SAR table, with columns for associated activity values. This table can be useful for finding the effect of variations on the basic core structure.
- DIVA SAR analysis
- Leadscope breaks down chemical libraries based on predefined structural fragments. The frequencies of these fragments are correlated with biological activity, and presented in charts and graphs. This analysis highlights the presence or absence of fragments in biologically interesting compounds. Leadscope's approach to SAR fails to retain any of the connectivity associated with these substructures; no information regarding their position within the molecule is conveyed in the graphs. See Roberts, G. M., et al. (2000) J. Chem. Inf. Comput. Sci. 40(6): 1302-1314, which is incorporated by reference in its entirety. DrugPharmer implements a phylogenetic-like tree (PGLT) algorithm as a method for studying compound datasets. This algorithm clusters the compounds and uses MCSS to determine a common substructure for each cluster.
- PGLT phylogenetic-like tree
- substructures are used to define a series of chemical classes.
- the compounds within a given class are then re-clustered, and used to find child classes.
- the resultant data structure is a PGLT of parent and children chemical classes, each defined by a substructure.
- DrugPharmer's approach creates chemical classes of molecular features, and then correlates these with specified biological values.
- the substructure associated with these classes can be used to find SAR. This method searches for increasingly larger common substructures to define classes and does not study the SAR effects of smaller structural changes within specific regions of the compounds. See, for example, Nicolaou, C. A. T., et al. (2002) J. Chem. Inf. Comput. Sci. 42(5): 1069-1079, which is incorporated by reference in its entirety.
- Distill, from Tripos, uses common substructures to classify compounds. These classes can then be visualized in the SYBYL interface through an interactive dendrogram.
- the SARTree method can maintain the connectivity information for the structures it describes. In other words, a user can easily determine both the substructures present in the compounds of a library and the relative positions of those substructures to other substructures from a SARTree that describes the library. When the SARTree also conveys information relating to properties of the compounds, it can help a chemist understand the influence of structure on the properties of the compounds.
- a chemical library is a collection of compounds.
- the library can be a physical collection of compounds, or can be a conceptual collection of compounds.
- a conceptual library can be a collection of structures of compounds, for example, the set of compounds that are expected as the result of carrying out a combinatorial or parallel synthesis.
- a chemical library can include known compounds (i.e. the compound has been synthesized and characterized by a chemist) and unknown compounds, such as, for instance, compounds that a chemist intends to synthesize and characterize.
- a chemical library can be described by a connectivity-based method.
- the connectivity-based method maintains information about the locations of compound fragments on a core structure.
- the structure of a compound can be described as a core having substituents at one or more locations.
- the substituent can be described as including a subcore.
- the subcore in turn can have substituents at one or more locations. This recursive description can be extended as far as is necessary or convenient.
- the method can be implemented on a computer.
- the structures of the compounds belonging to the library can be automatically examined to determine what, if any, common core structure is shared by the members of the library. Alternatively, a user can specify a core structure.
- a chemical structure can be a structural formula (e.g., a drawn structural formula for toluene) or a computer-readable structure.
- a computer-readable structure can convey the same information to a chemist as a structural formula. In general, the chemist will find a structural formula more convenient to read than a computer-readable structure.
- a computer executing appropriate software can read a computer-readable structure and display a structural formula to the user. Examples of computer-readable structures include SMILES (e.g., c1ccccc1(C) being a SMILES representation of toluene), and connection tables, such as the connection tables used in the .mol and .sdf file formats.
- the library can be described by a database.
- the database can be stored in one or more files (e.g., on a magnetic disk, optical disk, or other storage medium), in computer memory, or can be a collection of written entries.
- the database can include information about each compound in the chemical library or a subset of compounds of the library.
- One file can include structural descriptions of the compounds of the library. The structural description describes each compound by its constituent atoms and the bonds between the atoms. The structural descriptions can be in a standard computer-readable format, such as, for example, the .sdf format or SMILES.
- Each compound can be identified with a unique identifier such as a serial number or name. The identifier uniquely refers to both a particular member of the chemical library and to the structure of that compound.
- Each compound can also be associated with one or more pieces of information related to the compound.
- Information relating to the compounds can include information about the properties of a compound, such as physical, chemical, or biological properties.
- the database can include biological activity data for the compound.
- the activity data can be experimental data or the result of theoretical modeling.
- Experimental activity data can include, for example, the value of a Ki or IC50 measured in vitro, or an in vivo result.
- Theoretical modeling can include, for example, the results of docking a compound to a protein structure and calculating whether the compound is likely to bind to the protein or not. Any other pertinent information related to the compounds in the library can be stored in the file.
- a second file can include information about the cores and subcores of the library.
- a structural description of each core or subcore can be in the file.
- the structural description can indicate the location of sites of variation, i.e., where on the core an R-group is attached.
- Each core (or subcore) is identified by a core (or subcore) identifier.
- the core or subcore can also be described by its depth in the tree. For example, a core common to all compounds in the library can have a depth of zero (i.e., the root of the tree).
- a subcore attached directly to the core can have a depth of one; another subcore attached to the first subcore can have a depth of two, and so on.
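The depth bookkeeping can be sketched as a simple traversal. The nested form `(label, {site: [child, ...]})` and the path-style keys are assumptions made for this illustration; a child that has its own sites is treated as a subcore.

```python
def subcore_depths(node, depth=0, path=()):
    """Map each core/subcore, identified by its path from the root, to its depth."""
    label, sites = node
    here = path + (label,)
    depths = {here: depth}                      # the root core has depth zero
    for site, attached in sites.items():
        for child in attached:
            if child[1]:                        # child has its own sites: a subcore
                depths.update(subcore_depths(child, depth + 1, here + (site,)))
    return depths


ether = ("OCH2", {"R1": [("Me", {}), ("Et", {})]})
phenyl = ("phenyl", {"R1": [("Cl", {}), ether]})
core = ("core", {"R1": [("Me", {}), phenyl]})
depths = subcore_depths(core)
```

Keying by path rather than by label keeps two occurrences of the same subcore at different positions distinct.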
- a third file can include information about the leaves of the library. There can be a structural description of each leaf of the library in the file.
- the structural description of a leaf can include information about its site of attachment to a core or subcore, that is, which atom or atoms of the leaf is bonded to the core or subcore.
- the structural description of a leaf can also include a list of the cores or subcores with which it is associated in the library. For example, the structural description can refer to the cores by the core identifiers.
- the file can also indicate which compounds of the library have the leaf, by indicating the compound identifiers belonging to compounds having the leaf.
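The three kinds of records described above (compounds, cores and subcores, leaves) and the identifiers that link them might be modeled as follows. All class names, field names, and property values here are hypothetical; the SMILES strings are ordinary SMILES for toluene, chlorobenzene, and benzene.

```python
from dataclasses import dataclass, field

@dataclass
class Compound:
    compound_id: str
    structure: str                       # e.g. a SMILES string
    properties: dict = field(default_factory=dict)

@dataclass
class Core:
    core_id: str
    structure: str
    depth: int                           # 0 for the root core
    sites: tuple                         # names of the sites of variation

@dataclass
class Leaf:
    leaf_id: str
    structure: str
    core_id: str                         # core or subcore the leaf attaches to
    site: str                            # site of attachment on that core
    compound_ids: tuple                  # compounds that have this leaf


def compounds_with_leaf(leaf, compounds):
    """Resolve a leaf's compound identifiers to compound records."""
    by_id = {c.compound_id: c for c in compounds}
    return [by_id[cid] for cid in leaf.compound_ids if cid in by_id]


compounds = [
    Compound("cpd-1", "Cc1ccccc1", {"pIC50": 5.2}),    # made-up property value
    Compound("cpd-2", "Clc1ccccc1", {"pIC50": 6.0}),   # made-up property value
]
cores = [Core("core-1", "c1ccccc1", depth=0, sites=("R1",))]
leaves = [
    Leaf("leaf-1", "C", "core-1", "R1", ("cpd-1",)),
    Leaf("leaf-2", "Cl", "core-1", "R1", ("cpd-2",)),
]
hits = compounds_with_leaf(leaves[1], compounds)
```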
- the information in the three files described above can be combined in a single file, or divided among more than three files. When a chemical library is described by the method, a user can visualize the structural variation present in the entire library.
- the visualization can be chemically intuitive; in other words, a chemist looking at the visualization can readily understand the structure of the compounds and the degree of structural variation in the library.
- the overlay can visually relate changes in compound structure to changes in activity.
- the visualization can be part of an interactive computer program.
- the interactive computer program can allow a user to examine structures of compounds or groups of compounds or to highlight structures or structural features that are associated with properties.
- the program can display different sets of compounds of the library as directed by the user.
- the program can display lists of compounds by name, structure or structural fragment, and present the activity of the compounds as well. In some cases, it can be useful to display an aggregate measure of activity for a group of compounds, such as an average, maximum, minimum, or standard deviation of activity.
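The aggregate measures mentioned above are available in Python's standard library. This sketch assumes activities are plain floats and uses the sample standard deviation; the function name `aggregate_activity` is hypothetical.

```python
import statistics

def aggregate_activity(values):
    """Summarize the activities of a group of compounds."""
    values = list(values)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        # Sample standard deviation is undefined for one value; report 0.0 then.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


summary = aggregate_activity([6.0, 7.0, 8.0])   # made-up activity values
```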
- a connectivity-based method can allow the structural variation and physical, chemical or biological properties present across an entire chemical library to be represented in a hierarchical fashion.
- a group of structures is represented by a core that members of the group share.
- a subset of those structures can be represented by the core and a subcore attached to the core.
- a further subset can be represented by a substituent attached to the subcore.
- Other subsets can include subcores and substituents attached at different locations on the core (or subcore). The subsets can overlap; in other words, an individual compound can belong to more than one subset.
- the hierarchical organization can be represented in a variety of ways.
- One representation is a list or table.
- the list can include names or structures of the core, subcores, and substituents present in the group of compounds.
- the list can include information about the location of the subcores and substituents on the core (or subcore).
- the list can be an interactive list. For example, in response to user input, the computer can display or hide names or structures of subcores or substituents attached at a particular location on the core.
- a computer can display a core structure showing sites of variation (which can be indicated by, for example, "R1", "R2", etc.). When a user clicks on a site of variation, the computer can display names or structures of the subcores or substituents present at that site of variation, or the compounds of the group which use that site of variation.
- the structural information can be graphically displayed, for example, as a SARTree graph.
- a SARTree graph is a connected graph, with nodes representing substructural fragments within the chemical library represented by the graph (see FIG. 1).
- the graph contains four classes of nodes: a root core node (double circle), subcore nodes (single circle), R-group attachment nodes (diamonds), and leaf nodes (squares).
- Core nodes represent a core substructure common to all of the compounds depicted in that SARTree. Each SARTree will have one core.
- a library of compounds that lacks a single common substructure can be represented by a forest of SARTrees. Each SARTree in the forest will have one core node at the root of each tree.
- the core node of a SARTree has connectivity paths radiating from it through a set of R-group nodes.
- R-group nodes symbolize the different attachment points for fragments along the core/subcore.
- Each R-group node is connected to a set of subcore and/or leaf nodes.
- Subcore nodes represent a substructure, attached to a specific point in the molecule, shared by multiple compounds and with further variation off of that substructure. Subcores are connected to their own R-group nodes.
- the subcore → R-group → subcore/leaf connectivity of the graph repeats until each path terminates in one or more leaf nodes.
- Leaf nodes correspond to a whole fragment connected to the core or subcore at the respective attachment point. Leaves may represent one or more compounds which have that fragment at the same point of attachment. A null leaf can be present in each group of leaves. A null leaf can represent a 'standard' substituent, such as hydrogen.
- Whole molecules are represented in a SARTree by combinations of structural variation paths emanating from the core, each terminating in a leaf node. These paths may directly travel from an R-group node to a leaf, or may travel through one or more subcores.
- a single compound might contain fragments from all of the R-group paths attached to the core, or may only utilize some.
- a single compound can be described using all possible R-group paths, some of which can terminate with a null leaf.
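The idea that a whole molecule is a combination of paths, one per R-group, each ending in a leaf (possibly a null leaf), can be sketched as below. The nested form `(label, {site: [child, ...]})`, the helper names, and the choice of `"H"` as the null leaf are assumptions for illustration.

```python
NULL_LEAF = "H"   # assumed 'standard' substituent used as the null leaf

def leaf_paths(node, prefix=()):
    """List every path from the core to a leaf, passing through any subcores."""
    label, sites = node
    here = prefix + (label,)
    if not sites:                       # no further sites: this node is a leaf
        return [here]
    paths = []
    for site, attached in sites.items():
        for child in attached:
            paths.extend(leaf_paths(child, here + (site,)))
    return paths


def complete(substituents, core_sites):
    """Describe a compound on every site, filling unused sites with the null leaf."""
    return {site: substituents.get(site, NULL_LEAF) for site in core_sites}


tree = ("core", {"R1": [("Me", {}), ("phenyl", {"R1": [("Cl", {})]})],
                 "R2": [("H", {})]})
paths = leaf_paths(tree)
described = complete({"R1": "Me"}, ["R1", "R2"])
```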
- the SARTree generated from this set of structures is shown in FIG. 2 along with the complete structures of compounds associated with each node.
- the core fragment in each structure is colored red, subcore fragments are colored blue, and leaf fragments are colored green.
- FIG. 3A shows the recursive flow of compound data through the first stage of the algorithm, creating a connectivity profile for the library.
- the algorithm depicted in FIG. 3B takes this connectivity information and generates a SARTree graph for visualization.
- the first stage of the algorithm explores the structural variation within a compound library, by a combination of R-group fragmentation, maximal common substructure (MCSS) searching and compound clustering. Utilizing these methods, the algorithm recursively fragments the compound library, discovers common substructural features, and takes this structural variation information to create a connectivity profile, i.e., a list of all of the core, subcore, and leaf fragments within the dataset. The second stage takes this connectivity profile and creates the visual representation of the SARTree.
- MCSS maximal common substructure
- the first step of the algorithm is to determine a core structure for a set of compounds.
- the core structure can be manually specified, or can be generated automatically by a MCSS search of all of the compounds.
- the core need not be a maximal common substructure. In some cases it can be advantageous to have a core structure that is smaller than the maximal common substructure, such as, for example, if the library is expected to add members that will share the smaller core structure but not the maximal common substructure of the initial set of compounds.
- the core can also be a Murcko structure.
- a list of all the fragments decorating the central core substructure is generated.
- the algorithm is designed to preserve information describing the location of the fragments on the core. Fragments attached to the same point on the core are grouped together in R-group bins, which can be noted as Rl, R2, R3, and so on. If a single core structure cannot be determined for the entire set, more than one core can be chosen. Each core can then serve as the root of a SARTree.
- the fragments belonging to each R-group bin are analyzed further to identify subcore(s), if present, using one or a combination of methods.
- Methods to identify subcores can, without limitation, include the following: • Similarity clustering - MCSS method: Substituents are clustered based on a measure of similarity, and a maximal common substructure search is performed on each cluster.
- Substituents in the bin are compared with the database of chemical structures to identify subcore(s).
- the compared structures can also be Murcko structures.
- Substituents can be aligned prior to subcore identification. Aligning the substituents can help to prevent artifacts from appearing in the final SARTree. For example, a chemist recognizes that two phenyl groups, each having a meta-chloro substitution, are equivalent. However, a computer does not have the chemical understanding that a chemist does, and so might identify the two m-chlorophenyl groups as being distinct if drawn such that the chloro groups appear in different locations. Aligning the substituents (i.e., so that when drawn, both have the meta-chloro group in the same place) helps the computer to correctly identify equivalent substituents.
- fragments are clustered according to similarity as determined by a set of chemical descriptors. Any structure-based descriptor can be used in clustering. In some circumstances, it can be desirable to use connectivity-based fingerprints or the number of ring bonds as descriptors.
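
A minimal sketch of similarity clustering on fragment descriptors is given below. The source does not name a clustering method, so the leader-style clustering, the Tanimoto threshold of 0.5, and the string "fingerprint" features are all assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of descriptor features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(fingerprints, threshold=0.5):
    """Assign each fragment to the first cluster whose leader is similar enough.

    Leader clustering is an assumed stand-in; any clustering method that
    works on structure-based descriptors could be substituted.
    """
    clusters = []  # list of (leader_fingerprint, member_names)
    for name, fp in fingerprints.items():
        for leader_fp, members in clusters:
            if tanimoto(fp, leader_fp) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]

# Hypothetical connectivity features for four fragments.
fingerprints = {
    "phenyl": {"aromatic_ring", "c:c"},
    "m-chlorophenyl": {"aromatic_ring", "c:c", "c-Cl"},
    "ethyl": {"C-C"},
    "propyl": {"C-C", "C-C-C"},
}
clusters = leader_cluster(fingerprints, threshold=0.5)
```

Each resulting cluster would then be passed to an MCSS search to look for a candidate subcore.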
- Each cluster is then searched for a MCSS.
- a maximal common substructure is the substructure containing the greatest number of atoms and bonds in common among a group of structures.
- An MCSS can include a substructure that is not chemically intuitive, such as a fragment of a ring. Clustering of fragments prior to MCSS searching helps to produce chemically meaningful substructures, and to avoid awkward substructures, such as a partial ring. If a significant substructure is found, it is used as a subcore for that R-group. A substructure can be deemed significant if it contains more than a specified number of atoms.
- the set of subcores is sorted by size, then tested against the initial set of R-group fragments. Those fragments that contain a subcore are further fragmented into a new set of R-groups attached to that subcore. Those fragments that contain none of the subcores are noted as leaves.
- the algorithm continues recursively, within the newly formed R-group lists, until no significant subcores are found, and each path of variation ends in a set of leaf nodes. See FIG. 3A.
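
The recursion can be illustrated with a deliberately simplified toy model. Here each fragment is a tuple of tokens read outward from the attachment point, clustering is replaced by grouping on the first token, and the MCSS search is replaced by a longest-common-prefix search. These substitutions, and the names `common_prefix` and `build_tree`, are assumptions for the sketch; a real implementation would operate on molecular graphs.

```python
def common_prefix(sequences):
    """Longest run of leading tokens shared by every sequence."""
    prefix = []
    for tokens in zip(*sequences):
        if len(set(tokens)) != 1:
            break
        prefix.append(tokens[0])
    return tuple(prefix)

def build_tree(fragments, min_size=1):
    """Recursively split fragments into subcores and leaves."""
    clusters = {}
    # Group by first token (a stand-in for similarity clustering).
    for frag in sorted(set(fragments)):
        if frag:  # an empty remainder means nothing further is attached
            clusters.setdefault(frag[0], []).append(frag)
    node = {"subcores": {}, "leaves": []}
    for members in clusters.values():
        subcore = common_prefix(members)
        if len(members) > 1 and len(subcore) >= min_size:
            # Strip the subcore and recurse on what remains attached to it.
            rest = [m[len(subcore):] for m in members]
            node["subcores"][subcore] = build_tree(rest, min_size)
        else:
            # No significant subcore: these fragments end the path as leaves.
            node["leaves"].extend(members)
    return node

fragments = [
    ("phenyl", "Cl"),
    ("phenyl", "OMe"),
    ("phenyl", "NH", "acetyl"),
    ("methyl",),
]
tree = build_tree(fragments)
```

The recursion stops on its own because every remainder is strictly shorter than the fragment it came from.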
- the recursive nature of the method allows automatic analysis of structural variation many levels deep within a compound library.
- the method can be implemented automatically (e.g., on a computer). If the user desires, the method can be constrained by manually inputting parts of the subcore backbone to be used in fragmentation. For any given R-group, instead of using clustering and MCSS for that set of fragments, a set of subcores can be specified and the fragments will only be tested against those substructures. The user can further decide whether the recursive fragmentation algorithm is continued on the resultant R-groups, or those lists are all marked as leaves. This regulation of the backbone allows a controlled investigation of the library.
- the compounds can be clustered by structural similarity, then searched for a MCSS.
- a diverse subgraph MCSS search can be run to find a set of common substructures within the library. These substructures can represent cores of potential 'series' within the dataset. After merging redundant and overly small series, a 'forest' of SARTrees can be produced, with each tree representing a series and rooted at its respective core.
- Pipeline Pilot provides a 'chemically-aware' object-oriented programming environment which understands molecular structures. Through built-in components, it is capable of performing complex algorithms, such as R-group fragmentation, chemical clustering, and MCSS searching, on chemical libraries in numerous data formats. It is designed to rapidly push compounds through pipelines of these computational processes.
- the SARTree dataset produced by the first stage of the algorithm contains all of the chemical and activity data related to the tree, and is completely independent from the second stage user interface. As such, the dataset might be visualized in any other visualization scheme, such as web-based reports.
- An application can allow a user to interact with the SARTree database.
- the application can display a SARTree and can provide a graphical interface for a user.
- the graphical interface can allow a node of the SARTree to be queried for information regarding the fragment it represents, the corresponding compounds that contain the fragment, or other information.
- the application can also produce the SARTree in a manner that reveals other information in the database, such as properties of the compounds in the library. For any such property, each node and edge can be assigned a value from the mean, median, maximum, minimum, or standard deviation of the property values of the associated compounds. Mean or median overlays can be especially useful when trying to find the general contributions of fragments.
- Minimum overlays can highlight the compound of the tree with the lowest assay value; likewise, a maximum overlay highlights the compound with the highest assay value.
- a standard deviation overlay can illustrate which compounds show variation in activity, and thus which fragments are important for determining the activity of a compound.
- the nodes and edges can be adjusted (e.g. size, color, thickness, or other feature) according to the value associated with the feature. Nodes and edges can be adjusted with a simple gradient scale or through a discrete set of bins, specified for that particular library.
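
As a rough sketch of how such overlays might be computed, the example below derives a statistic per node and maps it onto a 0-to-1 gradient position. The node names, the dictionary layout, and the use of the population standard deviation (so that a single-compound node yields 0.0) are assumptions for illustration.

```python
import statistics

# Statistic names are illustrative. Population standard deviation is an
# assumed choice so that a node with one compound gives 0.0, not an error.
OVERLAYS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "stdev": statistics.pstdev,
}

def overlay_values(node_values, stat="mean"):
    """Compute one overlay value per node from its compounds' property values."""
    return {node: OVERLAYS[stat](values) for node, values in node_values.items()}

def gradient_positions(overlay):
    """Scale overlay values linearly to [0, 1] for a simple color gradient."""
    low, high = min(overlay.values()), max(overlay.values())
    span = high - low
    return {
        node: 0.5 if span == 0 else (value - low) / span
        for node, value in overlay.items()
    }

node_values = {"subcore_A": [1.0, 2.0, 3.0], "subcore_B": [4.0]}
means = overlay_values(node_values, "mean")
positions = gradient_positions(means)
```

A discrete set of bins could replace `gradient_positions` where a library calls for fixed classes instead of a continuous scale.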
- each leaf node can represent multiple compounds having structural variation at other positions.
- specific subtrees of a SARTree can be studied.
- one leaf node is chosen as a fixed node, and only those branches of the SARTree that correspond to compounds including the chosen leaf node are displayed.
- multiple leaf nodes are chosen as fixed nodes, and one or more R group nodes are allowed to be variable.
- the resulting subtree represents a group of related compounds.
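
Selecting a subtree by fixing one or more leaf nodes amounts to filtering the compound set. The sketch below assumes each compound is stored as a mapping from position to leaf label; that layout and the name `select_subtree` are hypothetical.

```python
def select_subtree(compounds, fixed):
    """Keep only compounds whose fragments match every fixed (position, leaf) pair."""
    return {
        compound_id: fragments
        for compound_id, fragments in compounds.items()
        if all(fragments.get(pos) == leaf for pos, leaf in fixed.items())
    }

compounds = {
    "cpd1": {"R1": "Cl", "R2": "Me"},
    "cpd2": {"R1": "Cl", "R2": "Et"},
    "cpd3": {"R1": "F", "R2": "Me"},
}
# Fix the R1 leaf; R2 is left variable.
subtree = select_subtree(compounds, {"R1": "Cl"})
```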
- a similar tree can be created by specifying a more extensive core structure during the implementation of the SARTree algorithm.
- every possible SAR table can be generated by selecting an R group node and combinatorially creating each SAR table that varies only at that point. The process is repeated for each R group node in the SARTree.
- These SAR tables can each be represented by a distinct subtree.
- a user can review only those tables for which the variation in activity exceeds a predetermined threshold, thus indicating that the R group in question has an effect on activity.
- a user can review particular SAR tables based on the number of compounds in the table, the range or standard deviation of activity values in the table, or based on the presence or absence of particular structural fragments in the table.
- this tool computes all possible subsets of the SARtree dataset that only vary at that point. This procedure is done by combinatorially fixing every combination of nodes throughout the rest of the connectivity profile. The activity or property changes for these point-specific changes can be mapped, generating all of the SAR tables associated with that SARtree. These tables can be filtered by number of molecules present, activity range and standard deviation, or node membership to provide a more focused study of the library. This analysis is particularly important because it automatically quantifies the SAR data represented by the SARtree graph.
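
One way to picture the combinatorial SAR table generation is to group compounds by everything except the chosen position. The following sketch assumes every compound has a fragment at every position and uses hypothetical names (`sar_tables`, `min_size`); filtering by activity range or standard deviation could be added in the same way as the size filter.

```python
from collections import defaultdict

def sar_tables(compounds, activities, position, min_size=2):
    """Build SAR tables that vary only at `position`.

    Compounds sharing identical fragments at all other positions fall into
    the same table. Tables with fewer than `min_size` rows are dropped.
    """
    tables = defaultdict(list)
    for compound_id, fragments in compounds.items():
        # The key fixes every node except the one allowed to vary.
        key = tuple(sorted((p, f) for p, f in fragments.items() if p != position))
        tables[key].append((fragments[position], activities[compound_id]))
    return {key: rows for key, rows in tables.items() if len(rows) >= min_size}

compounds = {
    "cpd1": {"R1": "Cl", "R2": "Me"},
    "cpd2": {"R1": "Cl", "R2": "Et"},
    "cpd3": {"R1": "F", "R2": "Me"},
}
activities = {"cpd1": 1.0, "cpd2": 5.0, "cpd3": 2.0}
tables_r2 = sar_tables(compounds, activities, "R2")
```

Repeating the call for each R-group position yields the full set of tables associated with the tree.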
- the application can be useful in comparing two libraries of compounds.
- Two libraries can be compared with respect to the diversity of compounds in each library, the similarity of compounds in the two libraries, or other measures.
- the libraries can be compared visually, for example by creating a SARTree for each library and displaying the two SARTrees together.
- the library can be described by a system of identifiers or coordinates.
- the coordinates can be a numerical system, where a number or string of numbers can be used to refer to a core, subcore, or leaf node.
- the coordinates can be used to refer to a single compound of the library, or a group of compounds of the library.
- a generic structure having a single variable position can be referred to by coordinates that designate a core, subcores and leaves shared by members of the group.
- Compounds that have the generic structure can be referred to by the same set of coordinates plus an additional coordinate specifying the substituent at the variable position.
- the coordinates can be useful in comparing compounds, groups of compounds, or libraries of compounds.
- a SARTree can be used to define the chemical groups in a library and their relative arrangements to each other.
- the groups can be described in terms of a coordinate in which a core, R group, subcore, and leaf are represented by a point of the coordinate.
- R groups R1, R2 and R3
- each R group is associated with two subcores, and for each subcore there are many leaves
- the structures of the compounds can be indicated by the selection of R group, subcore and leaf: compound #1: core, R1, subcore 2, leaf 4; compound #2: core, R2, subcore 1, leaf 5; compound #3: core, R3, subcore 1, leaf 8.
- the coordinate can be expressed in a compressed form, e.g., as a string of numbers or vector denoting R group, subcore and leaf: compound #1: (1, 2, 4); compound #2: (2, 1, 5); compound #3: (3, 1, 8).
- a group of compounds can be described by one core with two R groups (R1 and R2), where each R group is associated with two subcores, and for each subcore there are many leaves.
- Compounds from this group might include: compound #4: core, R1, subcore 1, leaf 3, R2, subcore 2, leaf 6; compound #5: core, R1, subcore 3, leaf 2, R2, subcore 1, leaf 4; compound #6: core, R1, subcore 2, leaf 4, R2, subcore 3, leaf 2.
- the coordinates for these compounds could be expressed as: compound #4: (1, 1, 3, 2, 2, 6); compound #5: (1, 3, 2, 2, 1, 4); compound #6: (1, 2, 4, 2, 3, 2).
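
The compressed coordinate notation can be sketched as a flat tuple of (R group, subcore, leaf) index triples. The helper names `to_coordinate` and `from_coordinate` are assumptions for this example; the index values are those given for compound #4 above.

```python
def to_coordinate(substituents):
    """Flatten (r_group, subcore, leaf) triples into one coordinate tuple."""
    return tuple(index for triple in substituents for index in triple)

def from_coordinate(coordinate):
    """Split a flat coordinate back into (r_group, subcore, leaf) triples."""
    if len(coordinate) % 3 != 0:
        raise ValueError("coordinate length must be a multiple of three")
    return [tuple(coordinate[i:i + 3]) for i in range(0, len(coordinate), 3)]

# Compound #4: R1, subcore 1, leaf 3 and R2, subcore 2, leaf 6.
compound_4 = to_coordinate([(1, 1, 3), (2, 2, 6)])
triples_4 = from_coordinate(compound_4)
```

Deeper branching would need longer coordinates, consistent with the coordinate size varying with the degree of branching in the library.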
- the size of the coordinate can vary depending on the degree of branching in the library.
- the coordinate defines the identity of a compound by providing an index to the core, R groups, subcores and leaves in each compound.
- the coordinate can also include additional information regarding the compounds.
- the coordinate could include an indicator that describes a property of a leaf (such as, for example, aliphatic, aromatic, hydrogen bond donor, hydrogen bond acceptor, electron withdrawing, or electron releasing).
- the indicator can include information about a property of a compound, such as a biological activity.
- the coordinate can provide a shorthand notation for referring to a compound or series of compounds in a library. This information could be used to define the complexity of a library.
- the complexity of a library can be determined by counting how many R groups, subcores and leaves exist in the library.
- the degree of branching can be included in a measure of complexity. This information could also be used to differentiate a series of libraries, or to compare two libraries.
- the extent of overlap in the coordinates defined between two libraries can be useful in determining the similarity of two libraries, and of the chemical functionality present in the libraries.
- With a measure of similarity of libraries (e.g., based on the similarity of coordinates of the libraries), the libraries can be clustered by similarity. The clustering can provide a means of identifying libraries that are more or less similar to a given library.
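
A hedged sketch of coordinate-based complexity and library similarity follows. The source does not specify a formula, so the counting scheme and the Jaccard-style overlap are assumptions, as is the premise that both libraries index their R groups, subcores and leaves against the same fragment dictionary.

```python
def complexity(coordinates):
    """Count distinct R groups, subcores and leaves referenced by a library."""
    triples = {
        triple
        for coord in coordinates
        for triple in zip(coord[0::3], coord[1::3], coord[2::3])
    }
    return {
        "r_groups": len({r for r, _, _ in triples}),
        "subcores": len({(r, s) for r, s, _ in triples}),
        "leaves": len(triples),
    }

def coordinate_overlap(library_a, library_b):
    """Jaccard overlap of two coordinate sets (an assumed similarity measure)."""
    a, b = set(library_a), set(library_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

library_a = [(1, 2, 4), (2, 1, 5), (3, 1, 8)]
library_b = [(1, 2, 4), (2, 1, 6)]
stats_a = complexity(library_a)
overlap = coordinate_overlap(library_a, library_b)
```

A pairwise matrix of such overlap values could then feed any standard clustering routine to group libraries by similarity.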
- the SARTrees of those libraries can be joined to create a single SARTree representing the two libraries.
- Such a display could be useful, for example, for providing a visual representation of two series of compounds having distinct cores, a common subcore structure, and a variety of leaves, some of which can be common to both series.
- the joined SARTree can visually display which of the leaves are present in each series or in both series.
- the application can be useful for highlighting compounds that include a particular substructure.
- the user can supply a substructure of interest, and the application can identify the compounds of the library having the substructure.
- Corresponding branches of a SARTree can be indicated, for example with a different color, or by hiding branches of the SARTree that are not associated with the substructure.
- the substructure of interest need not be a core, subcore or leaf that was identified by the SARTree algorithm. As described above, properties of the selected compounds can be overlaid on the SARTree.
- the application can display the library at different stages of development.
- Subtrees can be generated from the subset of compounds present within a library at those stages of structural exploration. These compound subsets can have dramatically different activity profiles. An early library may have few compounds having low activity, while a later version of the library can include more compounds having higher activity.
- a representation of library progress can be created. Such a display can be useful in tracking the development of a library, both in terms of the structures of the compounds in the library, and in tracking the development of more highly active compounds.
- SARtree can be used to design new chemical libraries. Many libraries are designed without regard for synthetic limitations. As a result, these libraries often include compounds that are impractical to synthesize, either individually or as part of a parallel or combinatorial library synthesis. Using SARtree in library design can aid in the design of synthetically feasible compounds.
- One or more libraries of existing compounds are processed by the SARtree method to identify cores, subcores, and leaves in the library. A subcore and its associated leaves can then be grafted on to a distinct core.
- the distinct core is found in a library, where it is attached to a subcore similar to the grafted subcore. Grafting a similar subcore can favor the synthetic feasibility of the new library, because the substitutions on any given subcore are likely to be transferable to a similar subcore.
- each library is analyzed by the SARTree method.
- the core identified in one library is designated core A, the other library's core as core B. See FIG. 11.
- Each library has a substituted phenyl subcore, but the substitutions (i.e., the leaves on the phenyl subcore) are different in the two libraries.
- Two new SARtrees can be created by grafting the phenyl subcore (and associated leaves) from one core to the other.
- the two new SARtrees represent two new chemical libraries.
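
The grafting step can be pictured with a small dictionary-based tree. The tree layout, the leaf labels, and the function name `graft` are hypothetical; the sketch only shows that the donor's subcore and leaves are copied onto the recipient core without altering either original tree.

```python
import copy

def graft(donor, recipient, r_group, subcore):
    """Return a new tree: the recipient core carrying the donor's subcore leaves."""
    new_tree = copy.deepcopy(recipient)
    new_tree["r_groups"].setdefault(r_group, {})[subcore] = list(
        donor["r_groups"][r_group][subcore]
    )
    return new_tree

# Two analyzed libraries, each with a substituted phenyl subcore on R1.
tree_a = {"core": "A", "r_groups": {"R1": {"phenyl": ["m-Cl", "p-OMe"]}}}
tree_b = {"core": "B", "r_groups": {"R1": {"phenyl": ["o-F"]}}}

new_b = graft(tree_a, tree_b, "R1", "phenyl")  # core B with library A's leaves
new_a = graft(tree_b, tree_a, "R1", "phenyl")  # core A with library B's leaves
```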
- the grafting approach can be used to quickly design synthetically feasible compounds for purposes including exploration of SAR space, designing around proprietary compounds, and enriching screening libraries.
- Scripts written in MATLAB can define nodes and edges of the SARTree graph. Cytoscape can be used to lay out the graph. Finally, the SARTree is visualized in an interactive chemistry platform called Pinpoint, which is written in MATLAB.
- the various techniques, methods, and aspects described above can be implemented in part or in whole using computer-based systems and methods. Additionally, computer-based systems and methods can be used to augment or enhance the functionality described above, increase the speed at which the functions can be performed, and provide additional features and aspects as a part of or in addition to those described elsewhere in this document. Various computer-based systems, methods and implementations in accordance with the above-described technology are presented below.
- a general-purpose computer may have an internal or external memory for storing data and programs such as an operating system (e.g., DOS, Windows 2000™, Windows XP™, Windows NT™, OS/2, UNIX or Linux) and one or more application programs.
- Examples of application programs include computer programs implementing the techniques described herein, authoring applications (e.g., word processing programs, database programs, spreadsheet programs, or graphics programs) capable of generating documents or other electronic content; client applications (e.g., an Internet Service Provider (ISP) client, an e-mail client, or an instant messaging (IM) client) capable of communicating with other computer users, accessing various computer resources, and viewing, creating, or otherwise manipulating electronic content; and browser applications (e.g., Microsoft's Internet Explorer) capable of rendering standard Internet content and other content formatted according to standard protocols such as the Hypertext Transfer Protocol (HTTP).
- the general-purpose computer includes a central processing unit (CPU) for executing instructions in response to commands, and a communication device for sending and receiving data.
- the communication device is a modem.
- Other examples include a transceiver, a communication card, a satellite dish, an antenna, a network adapter, or some other mechanism capable of transmitting and receiving data over a communications link through a wired or wireless data pathway.
- the general-purpose computer may include an input/output interface that enables wired or wireless connection to various peripheral devices.
- peripheral devices include, but are not limited to, a mouse, a mobile phone, a personal digital assistant (PDA), a keyboard, a display monitor with or without a touch screen input, and an audiovisual input device.
- the peripheral devices may themselves include the functionality of the general-purpose computer.
- the mobile phone or the PDA may include computing and networking capabilities and function as a general purpose computer by accessing the delivery network and communicating with other computer systems.
- Examples of a delivery network include the Internet, the World Wide Web, WANs, LANs, analog or digital wired and wireless telephone networks (e.g., Public Switched Telephone Network (PSTN), Integrated Services Digital Network (ISDN), and Digital Subscriber Line (xDSL)), radio, television, cable, or satellite systems, and other delivery mechanisms for carrying data.
- a communications link may include communication pathways that enable communications through one or more delivery networks.
- a processor-based system can include a main memory, preferably random access memory (RAM), and can also include a secondary memory.
- the secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc.
- the removable storage drive reads from and/or writes to a removable storage medium.
- a removable storage medium can include a floppy disk, magnetic tape, optical disk, etc., which can be removed from the storage drive used to perform read and write operations.
- the removable storage medium can include computer software and/or data.
- the secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system.
- Such means can include, for example, a removable storage unit and an interface. Examples of such can include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units and interfaces, which allow software and data to be transferred from the removable storage unit to the computer system.
- the computer system can also include a communications interface that allows software and data to be transferred between the computer system and external devices.
- communications interfaces can include a modem, a network interface (such as, for example, an Ethernet card), a communications port, and a PCMCIA slot and card.
- Software and data transferred via a communications interface are in the form of signals, which can be electronic, electromagnetic, optical or other signals capable of being received by a communications interface. These signals are provided to communications interface via a channel capable of carrying signals and can be implemented using a wireless medium, wire or cable, fiber optics or other communications medium.
- a channel can include a phone line, a cellular phone link, an RF link, a network interface, and other suitable communications channels.
- the terms “computer program medium” and “computer usable medium” are generally used to refer to media such as a removable storage device, a disk capable of installation in a disk drive, and signals on a channel.
- Computer program products provide software or program instructions to a computer system.
- Computer programs are also called computer control logic.
- Computer programs are stored in the main memory and/or secondary memory. Computer programs can also be received via a communications interface. Such computer programs, when executed, enable the computer system to perform the features as discussed herein. In particular, the computer programs, when executed, enable the processor to perform the described techniques. Accordingly, such computer programs represent controllers of the computer system.
- the software may be stored in, or transmitted via, a computer program product and loaded into a computer system using, for example, a removable storage drive, hard drive or communications interface.
- the control logic when executed by the processor, causes the processor to perform the functions of the techniques described herein.
- the elements are implemented primarily in hardware using, for example, hardware components such as PAL (Programmable Array Logic) devices, application specific integrated circuits (ASICs), or other suitable hardware components. Implementation of a hardware state machine so as to perform the functions described herein will be apparent to a person skilled in the relevant art(s).
- elements are implemented using a combination of both hardware and software.
- the computer-based methods can be accessed or implemented over the World Wide Web by providing access via a Web Page to the methods described herein. Accordingly, the Web Page is identified by a Uniform Resource Locator (URL).
- the URL denotes both the server and the particular file or page on the server.
- a client computer system interacts with a browser to select a particular URL, which in turn causes the browser to send a request for that URL or page to the server identified in the URL.
- the server responds to the request by retrieving the requested page and transmitting the data for that page back to the requesting client computer system (the client/server interaction is typically performed in accordance with the Hypertext Transfer Protocol (HTTP)).
- the selected page is then displayed to the user on the client's display screen.
- the client may then cause the server containing a computer program to launch an application to, for example, perform an analysis according to the described techniques.
- the server may download an application to be run on the client to perform an analysis according to the described techniques.
- the VLA-4 library contains 13 compounds created from variations off of a common core. These compounds were assayed for IC50 values in VLA-4 inhibition (see Singh, J. et al. (2002) J. Med. Chem. 45(14): 2988-2993, which is incorporated by reference in its entirety). See Table 1 and Figure 2. Every structure in the VLA-4 library shares a common PUPA core and varies by the fragment attached to the carbonyl carbon. When analyzed in SARTree, the algorithm finds two common substructures attached to R1: eight compounds include a phenylamine linker (shown on the left side of FIG. 2) and two compounds include a thiazoleamine linker (at the right side of FIG. 2). Both thiazoleamine compounds have high IC50 values, while the IC50 values of the phenylamine compounds vary depending on the variation off of the subcore.
- FIG. 4 presents a SARTree of the VLA-4 library with activity information overlaid as different colors.
- the nodes in the graph are colored in a gradient by the mean log(IC50) values of their associated compounds. Low values of log(IC50) are red, while blue represents high values of log(IC50).
- the two subcore nodes are circled on the SARTree, and shown in the structural diagram (inset). The number of associated compounds (N) and mean IC50 values for each subcore are listed in the inset.
- the dopamine β-hydroxylase (DβH) library presents a small, familiar combinatorial chemistry dataset used in molecular shape and QSAR analysis (see Burke, B. J. and A. J. Hopfinger (1990) J. Med.
- This SARTree provides a visual SAR table of the DβH dataset.
- the graph can be colored by the results from the inhibition assay to show the effects of different decorations on inhibition.
- the SARTree highlights the consistency of inhibition results for molecules sharing a fragment at a certain point on the ring.
- nodes are colored by classes of standard deviation in IC50.
- the tables show the leaves of each R-group, with the number of associated compounds (N) and standard deviation in IC50 values (STDDEV).
- the nodes in FIG. 5 are colored in three classes: red for a standard deviation less than 0.5, green for 0.5 to 1.0 and blue for greater than 1.0.
- the actual deviation values are shown in tables adjacent to the R-groups.
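
The three-class coloring used for FIG. 5 can be written as a small binning function. The thresholds come from the text; assigning the boundary values 0.5 and 1.0 to the green class is an assumption, since the source does not say which class owns the exact boundaries.

```python
def stdev_class(standard_deviation):
    """Map a standard deviation in IC50 to one of three color classes."""
    if standard_deviation < 0.5:
        return "red"
    if standard_deviation <= 1.0:  # boundary handling is an assumption
        return "green"
    return "blue"

# Hypothetical leaf labels and standard deviations for illustration.
leaf_stdevs = {"leaf_a": 0.2, "leaf_b": 0.7, "leaf_c": 1.4}
leaf_colors = {leaf: stdev_class(sd) for leaf, sd in leaf_stdevs.items()}
```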
- the high standard deviation of the multi-compound fragments in the R2 and R5 positions could signify that these positions on the ring are not involved in the binding interaction; therefore, structural variation in this region would have little effect on the molecule's inhibition.
- the SARTree algorithm was applied to a library used in screening against cyclin-dependent kinase 2 (CDK-2).
- CDK-2 inhibition assay results were available for these compounds. IC50 values were also known for a subset of the compounds.
- the CDK-2 library represents a large screening dataset composed of many sub-libraries, each containing its own SAR characteristics. A forest of SARTrees was produced in this scenario, each branching from a core representing an individual scaffold (FIG. 6). This allows the entire CDK-2 library to be visualized simultaneously, even though the compounds do not share a single common core.
- the forest can be investigated as a whole to find the relative performances of each scaffold.
- the forest in FIG. 6 is colored by the presence of hit compounds associated with each node. Nodes containing any hits are colored blue. A hit was defined as a compound having greater than 50% CDK-2 inhibition and an IC50 less than 25 μM. The remaining nodes are colored red. This view can be used to highlight which scaffolds were successful in producing hits, and which of their structural variation paths contributed to those hits.
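
The hit definition and the resulting node coloring can be sketched as below. The thresholds are those stated above; treating a compound with no measured IC50 as a non-hit is an assumption, made because IC50 values were known only for a subset of the compounds.

```python
def is_hit(percent_inhibition, ic50_um=None):
    """A hit has >50% CDK-2 inhibition and an IC50 below 25 micromolar.

    Compounds without a measured IC50 are treated as non-hits here,
    which is an assumption of this sketch.
    """
    if ic50_um is None:
        return False
    return percent_inhibition > 50.0 and ic50_um < 25.0

def node_color(compounds):
    """Blue if any associated compound is a hit, red otherwise."""
    return "blue" if any(is_hit(inh, ic50) for inh, ic50 in compounds) else "red"

# Each compound is a (percent inhibition, IC50 in micromolar or None) pair.
node_with_hit = [(80.0, 10.0), (30.0, None)]
node_without_hit = [(60.0, 40.0), (90.0, None)]
colors = (node_color(node_with_hit), node_color(node_without_hit))
```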
- the individual trees within the SARTree forest can then be studied separately, to view the properties of compounds within each series.
- the SARTree can also be used to find subseries belonging to a particular scaffold.
- in scaffold 5, for example, there were three subcores attached to R1: a 1,2-dimethoxybenzene fragment, a methoxybenzene fragment, and a phenyl fragment.
- These subcores varied in both average inhibition and hits produced per total associated compounds (see Table 2).
- This analysis shows the higher levels of success within the 1,2-dimethoxybenzene subseries of R1 (see FIG. 7).
- FIG. 7 shows the SARTree for scaffold 6 of the CDK-2 library colored by hit presence (blue indicates nodes containing hit compounds).
- the R1 region of structural variation contains three subcores (circled and numbered in orange). These subcores define three separate subseries of variation within the library.
- in scaffold 9 of the CDK-2 dataset, trends in the average inhibition and hit compounds produced by different fragments were observed when looking at the different structures attached at position R1.1 of a benzene ring subcore on R1 (see FIG. 8).
- the nodes are colored by mean inhibition values, in a red-to-blue gradient highlighting the range of values; red nodes represent low inhibition, blue nodes represent high inhibition.
- the outlined region shows the variation on the R1.1 attachment to the R1 benzene subcore.
- the structural diagram shows the different fragments attached with the average inhibitions and hits from their associated compounds. N is the number of compounds having the leaf shown.
- Table 3 presents example molecules of each MicroSAR fragment from FIG. 8.
- FIG. 9 shows the full SARTree of scaffold 8 (center) and the nine subtrees (labeled (a)-(i)). Each subtree is shown with its associated structure. The single site of variation on the structure is indicated by R.
- the SAR table for subtree (d) is shown in Table 4.
- a SAR table can show the structure-activity relationship for a narrow series of compounds belonging to a large group of compounds. Multiple SAR tables can help identify preferred substituents at each site of variation on a scaffold.
- results of multiple assays can be overlaid on otherwise similar SARTrees to reveal specificity of certain fragments.
- the SARTree for CDK-2 scaffold 9 is shown twice with different overlays.
- the first overlay (a) colors the nodes containing hits blue; the second overlay (b) shows any nodes with an average Chemscore less than -30 as green.
- the overlay comparison can be used to evaluate the specificity of the Chemscore docking score function on this particular set of compounds.
- the structural variation paths with good Chemscores but lacking hit compounds, and vice-versa, can be studied to find the classes of fragments which cause inaccuracies in the docking evaluation.
- any two (or more) properties can be compared for the same group of compounds by preparing similar SARTrees overlaid with different property values.
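
Comparing two overlays reduces to set operations on the nodes each overlay flags. In the sketch below, the node records and field names (`has_hit`, `mean_chemscore`) are hypothetical; the cutoff of -30 is the Chemscore threshold mentioned above.

```python
def compare_overlays(nodes, score_cutoff=-30.0):
    """Split nodes by agreement between the hit overlay and the docking overlay."""
    hit_nodes = {name for name, data in nodes.items() if data["has_hit"]}
    good_score = {
        name for name, data in nodes.items()
        if data["mean_chemscore"] < score_cutoff
    }
    return {
        "both": hit_nodes & good_score,
        "score_only": good_score - hit_nodes,  # good docking score, no hits
        "hit_only": hit_nodes - good_score,    # hits despite a poor score
    }

nodes = {
    "n1": {"has_hit": True, "mean_chemscore": -35.0},
    "n2": {"has_hit": False, "mean_chemscore": -32.0},
    "n3": {"has_hit": True, "mean_chemscore": -20.0},
    "n4": {"has_hit": False, "mean_chemscore": -10.0},
}
comparison = compare_overlays(nodes)
```

The `score_only` and `hit_only` sets point to the structural variation paths where the docking evaluation and the assay disagree.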
- Figure 12 shows the SARtree network for a small library of twenty CDK2 inhibitors.
- the individual compounds are shown for easier understanding of the SARtree.
- the SARtree quickly captured the true underlying chemical variation in this library.
- the core structure is shown in red.
- the core structure has one R-group, and two subcores (phenyl and ethyl) at that R- group (blue).
- SARtree has grouped the arrangement of leaves about the phenyl subcore according to ortho, meta, and para substitution patterns. This type of structural variation summary was not as immediately obvious from just examining a list of the 20 molecules. The difficulty of summarizing structural variation is magnified when the number of molecules and the structural complexity within the compounds increases.
- SARtree emphasizes structural variation within the molecules, and is therefore insensitive to the artifacts typical of descriptor-based similarity methods. For example, the Tanimoto coefficients (Tc) for 1 vs. 2 and 4 vs. 8 are indistinguishable using common methods, even though the compounds have differing activities and, from a medicinal chemistry view, are essentially different.
- the SARtree clearly separates these molecules, but keeps their core, subcore, leaf, and R-group relationships. Similarity metrics are widely debated; these metrics fail to distinguish these molecules from each other because they are typically whole-molecule comparisons, in which subtle but important local structural variations are lost in the noise of the rest of the molecule.
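To illustrate why a whole-molecule comparison can hide a local change, here is a minimal sketch of the Tanimoto coefficient computed on sets of features. The feature names are hypothetical placeholders, not real fingerprint bits, and the molecules are not compounds 1, 2, 4, or 8 from the text.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two hypothetical molecules sharing 18 features and differing in one leaf each.
shared = {f"core_{i}" for i in range(18)}
mol_a = shared | {"leaf_ortho_Cl"}
mol_b = shared | {"leaf_para_Cl"}

tc = tanimoto(mol_a, mol_b)  # 18 shared features out of 20 in the union
```

Because the shared features dominate the union, the coefficient stays high even though the two feature sets differ exactly at the position that a medicinal chemist would care about.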
- FIG. 15 illustrates the same tree displayed four times with four different properties overlaid: activity, PSA, aLogP, and absence of negative atoms.
- the Lipinski intestinal permeability metric was determined for each compound by assigning to each molecule the number of Lipinski property thresholds it satisfies (from zero to four), together with polar surface area (see, e.g., CA Lipinski, Adv. Drug Del. Rev. 1997, 23, 3).
- the property SARtrees in Figure 15 suggest that the substitution patterns off of the phenyl ring shown tend to contain the most active compounds, but that these substitution patterns also show modest variability in the number of Lipinski parameters that are passed.
- in addition, AlogP seemed to increase in magnitude in the same region of SARtree space as the active compounds. PSA, however, seemed unaffected.
- SARtree can quickly identify such compensatory effects in drug optimization phases and allow the medicinal chemist to guide chemical design towards more favorable drug space.
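The zero-to-four Lipinski count mentioned above can be sketched as follows. The thresholds used here are the commonly cited rule-of-five values (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors); the text does not state which exact thresholds were applied, so treat them as an assumption. The `Props` record and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed molecular properties (hypothetical record layout)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def lipinski_count(p):
    """Return how many of the four rule-of-five thresholds are satisfied (0-4)."""
    checks = (
        p.mol_weight <= 500,   # assumed threshold
        p.logp <= 5,           # assumed threshold
        p.h_donors <= 5,       # assumed threshold
        p.h_acceptors <= 10,   # assumed threshold
    )
    return sum(checks)


# Toy property values for illustration only.
drug_like = lipinski_count(Props(350.4, 2.1, 2, 5))
poor = lipinski_count(Props(612.7, 6.3, 2, 11))
```

The resulting integer can then be overlaid on the tree nodes in the same way as activity, PSA, or AlogP.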
- the SARtree data structure contains descriptions of how many compounds have variations in one leaf position. All of the collections of compounds where only the leaf is changing or where the attachment of the leaf to the subcore is changing (regioisomers) can be extracted along with activity data. In other words, the SARtree data structure can be exploited to produce every possible SAR table for the compounds in the data set.
- This automatic SAR extraction method was applied to one of the sub screening libraries for CDK2 that contained 106 compounds. Ten SAR tables were found within this library (Figure 16), in which each of the tables was more or less evenly represented in the number of compounds. The hits are shown in Tables 2, 3 and 5.
- This approach of extracting SAR tables from SARtrees has three advantages. First, it allows us to quickly and automatically extract SAR tables from one or more chemical libraries (as we will show). Second, it allows us to quantify, per library, the number of SAR tables it contains, which would be useful for quickly assessing the amount of potential SAR information. Third, it allows us to quickly identify and assess any missing or unfilled structure-variation points that would lead to missing or incomplete SAR.
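The extraction idea described above, collecting compounds in which only one leaf position varies, can be sketched as a grouping step. This is an assumed, simplified data model (each compound is a dictionary with a core, a subcore, and a mapping of attachment position to leaf); it is not the SARtree data structure itself, and it covers only changes of leaf identity at a fixed position, not the regioisomer case.

```python
from collections import defaultdict


def extract_sar_tables(compounds):
    """Group compounds that differ only in the leaf at a single position."""
    groups = defaultdict(list)
    for c in compounds:
        for pos, leaf in c["leaves"].items():
            # Everything except the leaf at `pos` must match within a table.
            rest = frozenset((p, l) for p, l in c["leaves"].items() if p != pos)
            key = (c["core"], c["subcore"], pos, rest)
            groups[key].append((c["id"], leaf, c["activity"]))
    tables = {}
    for key, rows in groups.items():
        # A table needs at least two different leaves at the varying position.
        if len({leaf for _, leaf, _ in rows}) >= 2:
            tables[key] = sorted(rows)
    return tables


# Toy compounds for illustration only.
compounds = [
    {"id": "c1", "core": "A", "subcore": "phenyl",
     "leaves": {"ortho": "Cl", "para": "H"}, "activity": 0.5},
    {"id": "c2", "core": "A", "subcore": "phenyl",
     "leaves": {"ortho": "F", "para": "H"}, "activity": 1.2},
    {"id": "c3", "core": "A", "subcore": "phenyl",
     "leaves": {"ortho": "Cl", "para": "OMe"}, "activity": 8.0},
    {"id": "c4", "core": "A", "subcore": "ethyl",
     "leaves": {"1": "OH"}, "activity": 3.3},
]
tables = extract_sar_tables(compounds)
```

With this representation, `len(tables)` gives a per-library count of SAR tables, and keys with few rows point to sparsely filled variation points.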
- SARtree permits rapid identification of active scaffolds and structure variation-activity relationships within large screening libraries.
- Screening libraries often contain multiple scaffolds that are each combinatorially expanded to varying degrees and magnitude. Frequently the question of which scaffolds are represented in the actives and inactives arises. It can therefore be useful to arrange the SARtrees of each sub-library contained within the screening library for comparisons — a SARtree forest — shown in Figure 14.
- An important point is that the SARtree forest in Figure 14 represents over 17,000 compounds, which highlights the efficiency of SARtrees to display large numbers of compounds and the general structural variation within the libraries.
- the SARtree forest allows a user to quickly identify putative scaffolds, and compare the structure-variation and activity among the scaffolds.
- Figure 14 quickly shows the active/inactive distribution for the various libraries, and visually separates libraries that had no actives (series 3, 9, 10, 12, 13, and 14) from those that contain a variety of actives distributed over the remaining SARtrees. The following sections will explore the relationships between the statistics of the SARtrees and the actives.
- R-points describe the degree of variation (DOV) within a SARtree.
- R-points are the total connections involving core and subcores, and are a simple way of describing the variation at the sub-core level.
- SAR tables select groups of molecules where only one leaf attachment point is varying, as described earlier.
- the number of SAR tables contained within a library can be a useful way to describe the library. For example, knowing the number of SAR tables for a given library would help quantify the utility of a chemical library for first pass screening or for focused optimization efforts.
- the R-points metric had a high correlation value of 0.73 with the number of hits in the screening libraries (Table 6).
- a plausible explanation is that as the DOV increases, so does the likelihood of finding actives.
- the types of shared chemical substructures found in screening libraries are limited (as we will show), and the DOV seems to be a key metric that describes how these substructures are dispersed within a library.
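As a rough sketch of the R-points statistic and its correlation with hit counts, the code below counts tree connections whose two endpoints are both core or subcore nodes, and computes a Pearson correlation. Both choices are assumptions: the text defines R-points only as "the total connections involving core and subcores", and it does not say which correlation measure produced the 0.73 value. The node names and numbers are toy values, not data from Table 6.

```python
import math


def r_points(kinds, edges):
    """Count edges joining core/subcore nodes (one reading of 'R-points')."""
    inner = {"core", "subcore"}
    return sum(1 for a, b in edges if kinds[a] in inner and kinds[b] in inner)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical tree: one core, two subcores, three leaves.
kinds = {"C": "core", "S1": "subcore", "S2": "subcore",
         "L1": "leaf", "L2": "leaf", "L3": "leaf"}
edges = [("C", "S1"), ("C", "S2"), ("S1", "L1"), ("S1", "L2"), ("S2", "L3")]
n_r_points = r_points(kinds, edges)

# Toy per-library values, chosen to be perfectly linear for checking only.
toy_r_points = [1, 2, 3, 4]
toy_hits = [2, 4, 6, 8]
r = pearson(toy_r_points, toy_hits)
```

In practice one `r_points` value would be computed per library SARtree and paired with that library's hit count before computing the correlation.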
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US60264304P | 2004-08-19 | 2004-08-19 | |
PCT/US2005/029276 WO2006023574A2 (en) | 2004-08-19 | 2005-08-18 | Methods for describing a group of chemical structures |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1807783A2 (en) | 2007-07-18 |
EP1807783A4 (en) | 2009-01-28 |
Family
ID=35968139
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05788642A Withdrawn EP1807783A4 (en) | 2004-08-19 | 2005-08-18 | Methods for describing a group of chemical structures |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1807783A4 (en) |
WO (1) | WO2006023574A2 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6131792B2 (en) * | 2013-09-09 | 2017-05-24 | 富士通株式会社 | Information providing apparatus, method, and program |
US11450410B2 (en) | 2018-05-18 | 2022-09-20 | Samsung Electronics Co., Ltd. | Apparatus and method for generating molecular structure |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU1847997A (en) * | 1996-01-26 | 1997-08-20 | Robert D. Clark | Method of creating and searching a molecular virtual library using validated molecular structure descriptors |
2005
- 2005-08-18: WO PCT/US2005/029276 (WO2006023574A2), active, Application Filing
- 2005-08-18: EP EP05788642A (EP1807783A4), not active, Withdrawn
Non-Patent Citations (1)
Title |
---|
WILD D J ET AL: "VisualiSAR a Web-based application for clustering, structure browsing and structure-activity relationship study", JOURNAL OF MOLECULAR GRAPHICS & MODELLING ELSEVIER USA, vol. 17, no. 2, 1999, pages 85 - 89, XP002385313, ISSN: 1093-3263 * |
Also Published As
Publication number | Publication date |
---|---|
WO2006023574A2 (en) | 2006-03-02 |
EP1807783A2 (en) | 2007-07-18 |
WO2006023574A3 (en) | 2007-04-19 |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20070319 |
AK | Designated contracting states | Kind code of ref document: A2; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
AX | Request for extension of the european patent | Extension state: AL BA HR MK YU |
A4 | Supplementary search report drawn up and despatched | Effective date: 20081230 |
RIC1 | Information provided on ipc code assigned before grant | Ipc: G01N 33/48 20060101ALI20081219BHEP; Ipc: G06F 17/50 20060101ALI20081219BHEP; Ipc: G06F 19/00 20060101AFI20081219BHEP |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20090327 |