EP1807783A4

EP1807783A4 - Methods for describing a group of chemical structures

Info

Publication number: EP1807783A4
Application number: EP05788642A
Authority: EP
Inventors: Anuj Patel; Donovan N Chin; Juswinder Singh; R Aldrin Denny
Original assignee: Biogen Idec Inc; Biogen Idec MA Inc
Current assignee: Biogen Inc; Biogen MA Inc
Priority date: 2004-08-19
Filing date: 2005-08-18
Publication date: 2009-01-28
Also published as: WO2006023574A2; EP1807783A2; WO2006023574A3

Abstract

A group of compounds, such as a chemical library, can be described by a connectivity based method. The method allows a chemist to visualize the structural variation present in the group of compounds. The method also allows a chemist to visualize activity and property changes associated with the structural variation.

Description

METHODS FOR DESCRIBING A GROUP OF CHEMICAL STRUCTURES

CLAIM OF PRIORITY

This application claims priority to U.S. Patent Application No. 60/602,643, filed August 19, 2004, which is incorporated by reference in its entirety.

TECHNICAL FIELD

This invention relates to methods for describing groups of chemical libraries.

BACKGROUND

The discovery of structure-activity relationships (SAR) is an important method in medicinal chemistry for determining which structural features in a chemical library are important for a given property. The most basic combinatorial libraries can be defined as variable fragments in R-groups attached to a common core substructure. Larger compound collections can contain multiple levels of variation embedded within these R-group lists. The immense scope of structural variation in these libraries complicates the generation and interpretation of SAR. The discovery of SAR is often laborious, and existing tools to facilitate the process are limited in scope and use.

SUMMARY

Current drug development programs screen vast numbers of compounds in search of lead molecules. Chemical libraries can contain thousands of compounds, due to technologies such as combinatorial and virtual chemistry. High-throughput screening (HTS) and virtual screening (VS) can quickly generate activity or molecular property data for these compounds. Gleaning useful knowledge from these massive datasets in a timely and efficient fashion can be difficult.

In general, compound libraries can be analyzed by using an algorithm to identify, link, and group chemical substructures. A graphical representation of these substructure relationships in a hierarchical, branching format can be generated. This representation can highlight the structural variation explored in developing a library. Activity or property data can be overlaid onto this branched representation. Significant information can be found by pinpointing the effect on these values of substructural changes at key regions in the molecule. By first exploring all of the structural variation within a library and then studying the activity of the compounds in relation to this variation, the complete SAR profile of a chemical library can be studied. Two or more libraries can be compared by comparing their respective graphical representations. The libraries can be a collection of compounds assembled only in a database, and need not be a physical collection of compounds. For example, a library can be a group of compounds described in a database (such as a medicinal chemistry database) or a patent.

In one aspect, a method of organizing a group of compounds includes (a) providing a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identifying a common core from at least a portion of the plurality of structures, (c) binning the substituents present on the core according to site of attachment to the core, and (d) repeating steps (a)-(c) using a bin of substituents as the plurality of chemical structures, thereby identifying a subcore from the bin of substituents, if present.
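Steps (c) and (d) above can be pictured with a small sketch. It is a toy under stated assumptions, not the method's actual implementation: each substituent is written as a tuple of fragment names running outward from the attachment point, and a shared leading fragment stands in for a common substructure found by a real search. The names `bin_by_site`, `split_bin` and `NULL_LEAF`, and the example compounds, are hypothetical.

```python
from collections import defaultdict

NULL_LEAF = ("H",)  # assumed placeholder for "nothing further attached"

def bin_by_site(compounds):
    """Step (c): group substituents by their site of attachment on the core."""
    bins = defaultdict(set)
    for sites in compounds.values():
        for site, substituent in sites.items():
            bins[site].add(substituent)
    return dict(bins)

def split_bin(substituents):
    """Step (d): treat one bin as a new set of structures and look for subcores."""
    by_head = defaultdict(set)
    for sub in substituents:
        by_head[sub[0]].add(sub[1:])
    subcores, leaves = {}, []
    for head, tails in by_head.items():
        if len(tails) >= 2:
            # The leading fragment is shared and varies beyond it: a subcore.
            # A bare occurrence of the fragment is recorded as a null leaf.
            subcores[head] = split_bin({t or NULL_LEAF for t in tails})
        else:
            # No shared fragment: the whole substituent is a leaf.
            leaves.extend("-".join((head,) + t) for t in tails)
    return {"subcores": subcores, "leaves": sorted(leaves)}

# Hypothetical three-compound library on one common core with sites R1 and R2.
compounds = {
    "cpd-1": {"R1": ("phenyl", "Cl"), "R2": ("methyl",)},
    "cpd-2": {"R1": ("phenyl", "OMe"), "R2": ("ethyl",)},
    "cpd-3": {"R1": ("methyl",), "R2": ("methyl",)},
}
profile = {site: split_bin(subs) for site, subs in bin_by_site(compounds).items()}
```

In this toy library, the R1 bin yields a phenyl subcore with two leaves plus a separate methyl leaf, while the R2 bin contains only leaves.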

Step (d) can be repeated for each site of attachment of the identified common core. Binning can include clustering the substituents based on a measure of similarity of the substituents. Binning can include comparing the substituents to a database including chemical structures. Binning can include fragmenting the substituents and comparing the fragments. Identifying a common core for a bin of substituents can include manually selecting a common core for the bin of substituents. The method can include aligning two or more substituents. The method can include identifying a distinct second common core from at least a portion of the plurality of structures.

The method can be implemented automatically by a computer. A common core can be identified by a user or by the computer. The identified common core can be a maximal common substructure. The measure of similarity can include a measure of structural similarity. The measure of structural similarity can include the number of ring bonds.

The method can include generating a database including a structure of the identified common core. The database can include a structure of a substituent attached at a site of attachment to the identified common core. The database can include a structure of each substituent attached at the site of attachment to the identified common core. The database can include a structure of a substituent present at each site of attachment on the common core. The database can include a structure of a subcore. The structure of a substituent in the database can be associated with an identifier indicating a common core and a site of attachment on the common core. Each structure of a substituent in the database can be associated with an identifier indicating a compound of the group having the substituent. The database can include a structure of each core and of each substituent in the plurality of chemical structures. The database can include a structure of a compound of the group. The structure of a compound of the group can be associated with information about a property of the compound of the group. The database can include a structure of each compound of the group. A list of compounds having a single site of variation, and the associated information for each compound in the list, can be extracted from the database. Every possible list of compounds having a single site of variation, and the associated information for each compound in each list, can be extracted from the database.
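The extraction of single-site-variation lists described above can be sketched as follows. This is a minimal illustration that assumes each compound is stored as a mapping from attachment site to substituent name; the function name `single_site_tables`, the null-leaf default and the activity values are hypothetical and are not taken from the source.

```python
from collections import defaultdict

def single_site_tables(compounds, activity, null_leaf="H"):
    """Return every list of compounds that differ at exactly one site."""
    sites = sorted({s for subs in compounds.values() for s in subs})
    tables = []
    for varied in sites:
        groups = defaultdict(list)
        for cid, subs in sorted(compounds.items()):
            # Key each compound by what it carries at all the other sites.
            fixed = tuple((s, subs.get(s, null_leaf)) for s in sites if s != varied)
            groups[fixed].append((cid, subs.get(varied, null_leaf), activity.get(cid)))
        for fixed, rows in groups.items():
            # Keep only groups in which the varied site really varies.
            if len({sub for _, sub, _ in rows}) >= 2:
                tables.append({"varied": varied, "fixed": dict(fixed), "rows": rows})
    return tables

# Hypothetical data: three compounds, two sites, made-up activity values.
compounds = {
    "cpd-1": {"R1": "Cl", "R2": "methyl"},
    "cpd-2": {"R1": "Br", "R2": "methyl"},
    "cpd-3": {"R1": "Cl", "R2": "ethyl"},
}
activity = {"cpd-1": 12.0, "cpd-2": 3.5, "cpd-3": 40.0}
tables = single_site_tables(compounds, activity)
```

Each returned table holds one site fixed pattern and the compounds that vary only at the named site, together with their associated property values.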

In another aspect, a method of visualizing a group of compounds includes creating a first representation of a plurality of chemical structures representative of at least a portion of the group of compounds, the first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node. The root corresponds to a common core of at least a portion of the plurality, the primary branch point corresponds to a site of attachment of a substituent on the common core, and the leaf node corresponds to a substituent attached to the common core. The representation can include an intermediate node connected to a primary branch point, wherein the intermediate node corresponds to a subcore belonging to at least two substituents attached to the corresponding site of attachment. The connection between the leaf node and the primary branch point can be free of an intermediate node. The intermediate node can be connected to a secondary branch point, the secondary branch point corresponding to a site of attachment of a substituent on the subcore. Each site of attachment of a substituent to the common core can be represented by a primary branch point connected to the root. Each site of attachment of a substituent to the subcore can be represented by a secondary branch point connected to the intermediate node. Each substituent in the plurality can be represented by a leaf node. The first representation can be a graphical representation. A feature of the graphical representation can illustrate a property of a compound of the group. The feature can be color. The property can be a biological activity.
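One way to hold the representation described above in memory is a small tree of typed nodes. The sketch below is illustrative only: the class name `Node`, the kind labels ("core", "site", "subcore", "leaf") and the example fragments are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str    # "core" (root), "site" (branch point), "subcore" or "leaf"
    label: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

def count_kinds(node, counts=None):
    """Count how many nodes of each kind the tree contains."""
    counts = {} if counts is None else counts
    counts[node.kind] = counts.get(node.kind, 0) + 1
    for child in node.children:
        count_kinds(child, counts)
    return counts

def leaf_paths(node, prefix=()):
    """List every path from the root down to a leaf node."""
    path = prefix + (node.label,)
    if node.kind == "leaf":
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths

# Hypothetical tree: one core, two primary branch points, one subcore.
root = Node("core", "core")
r1 = root.add(Node("site", "R1"))
r1.add(Node("leaf", "methyl"))            # leaf with no intermediate node
phenyl = r1.add(Node("subcore", "phenyl"))  # intermediate node
r1a = phenyl.add(Node("site", "R1a"))     # secondary branch point
r1a.add(Node("leaf", "Cl"))
r1a.add(Node("leaf", "OMe"))
r2 = root.add(Node("site", "R2"))
r2.add(Node("leaf", "H"))
```

Counting nodes by kind gives the numbers of attachment sites, subcores and substituents, and the leaf paths list each structural variation path from the root.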

The method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to the same common core structure as the root of the first graphical representation. The primary branch points, intermediate nodes, secondary branch points and leaf nodes of the second graphical representation can correspond to the same sites of attachment, subcores and substituents as the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the first graphical representation. A feature of the second graphical representation can illustrate a second property of a compound of the group.

The method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to a common core structure of at least two members of the group and is different from the common core structure corresponding to the root of the first graphical representation.

The method can include combining a portion of the first graphical representation with a portion of the second graphical representation thereby creating a third graphical representation, wherein the third graphical representation represents at least one compound not represented by either the first or second graphical representation.

The method can include creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the chemical structures of the second graphical representation are a subset of the chemical structures of the first graphical representation.

The method can include accessing a database including a structure of a common core, the common core having a site of attachment of a substituent, and a structure of a substituent present at the site of attachment on the common core. The database can include a structure of a subcore attached to the common core and having a site of attachment of a substituent. Accessing the database can include counting the number of sites of attachment on the common core, counting the number of subcores attached to each site of attachment on the common core, counting the number of sites of attachment on each subcore, or counting the number of substituents attached to the common core and the number of substituents attached to each subcore.

In another aspect, a method of visualizing a group of compounds includes (a) providing a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identifying a common core from at least a portion of the plurality of structures, (c) clustering the substituents present at a site of attachment of the identified common core based on a measure of similarity of the substituents, (d) repeating steps (a)-(c) using a cluster of substituents as the plurality of chemical structures, thereby identifying a subcore from the cluster of substituents, if present, and (e) creating a first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node, where the root corresponds to the common core, the primary branch point corresponds to a site of attachment of a substituent on the common core, and the leaf node corresponds to a substituent attached to the common core.

In another aspect, a computer program for describing a group of compounds includes instructions for causing a computer system to (a) read a plurality of chemical structures representative of at least a portion of the group of compounds, (b) identify a common core from at least a portion of the plurality of structures, and (c) cluster the substituents present at a site of attachment of the identified common core based on a measure of similarity of the substituents.

The computer program can include instructions for causing the computer system to (d) repeat steps (a)-(c) using a cluster of substituents as the plurality of chemical structures, thereby identifying a subcore for the cluster of substituents, if present. The computer program can generate a database including a structure of the identified common core.

In another aspect, a computer program for accessing a database includes instructions for causing a computer system to retrieve information in response to a user input from a database describing a plurality of chemical structures representative of at least a portion of a group of compounds, the database including a structure of a common core of at least a portion of the plurality of structures, and a structure of a substituent present at a site of attachment of the common core, the structure of the substituent associated with an identifier identifying a compound having the substituent. The computer program also includes instructions for causing a computer system to create a representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node, wherein the root corresponds to the common core; the branch point corresponds to the site of attachment; and the leaf node corresponds to the substituent.

The computer program can include instructions for causing the computer system to display a chemical structure of a core, a subcore, a substituent, or a chemical structure of the plurality. Displaying the chemical structure can include displaying information about a property of a compound of the group. The user input can include a structure of a substituent, and the computer can retrieve from the database a structure of a compound including the substituent. The computer can retrieve from the database the structure of each compound described by the database including the substituent, and information about a property of each compound described by the database including the substituent. The user input can include a structure of a substituent, subcore, core, or compound of the group; or a partial structure of a substituent, subcore, core, or compound of the group. The computer can retrieve from the database a structure of a compound described by the database including the structure of the user input.

Other features or advantages of the present invention will be apparent from the following detailed description of several embodiments, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a depiction of a SARTree.

FIG. 2 is a depiction of a SARTree shown with the structures of compounds represented by the SARTree.

FIGS. 3A and 3B are flowcharts describing how a SARTree can be generated.

FIG. 4 is a depiction of a SARTree shown with SAR data.

FIG. 5 is a depiction of a SARTree shown with SAR data.

FIG. 6 is a depiction of a forest of SARTrees.

FIG. 7 is a depiction of a SARTree.

FIG. 8 is a depiction of a SARTree shown with SAR data.

FIG. 9 is a depiction of a SARTree and subtrees.

FIGS. 10a and 10b are depictions of a SARTree with colors depicting activity data.

FIG. 11 schematically depicts the use of SARTrees in library design.

FIG. 12 is a depiction of a SARTree and the compounds described by the SARTree.

FIG. 13 depicts cores identified in a SARTree analysis of a library; the identifiers (1-15) correspond to the SARTrees in FIG. 14.

FIG. 14 depicts the SARTrees identified in a SARTree analysis of a library; the identifiers (1-15) indicate the core associated with each SARTree.

FIG. 15 depicts a single SARTree with four different properties overlaid on the SARTree.

FIG. 16 depicts the 10 SAR tables extracted from a SARTree (at center) describing 106 compounds.

DETAILED DESCRIPTION

A connectivity-based method organizes and visualizes the structural variation and properties of a group of compounds. The SARTree method produces a specialized connected graph (a SARTree) of nodes representing substructures of the compounds and locations on the substructures where variation occurs (i.e., locations on the substructure where an R-group is attached). Connections between the nodes indicate attachments of substructures in the group of compounds. Complete molecules are represented in a SARTree by combinations of structural variation paths emanating from a core node, each terminating in a leaf node.

The DIVA software can find and label R-group substituents around a specified core. It then displays these fragments in a SAR table, with columns for associated activity values. This table can be useful for finding the effect of variations on the basic core structure.

There are two main limitations of DIVA's SAR analysis. First, this table can quickly become overwhelming with larger datasets. Second, DIVA cannot extend this method beyond the basic core-variants model. In complex libraries, there are often common substructures within R-groups. DIVA cannot continue this analysis to find SAR within these fragment lists.

Leadscope breaks down chemical libraries based on predefined structural fragments. The frequencies of these fragments are correlated with biological activity, and presented in charts and graphs. This analysis highlights the presence or absence of fragments in biologically interesting compounds. Leadscope's approach to SAR fails to retain any of the connectivity associated with these substructures; no information regarding their position within the molecule is related in the graphs. See Roberts, G. M., et al. (2000) J. Chem. Inf. Comput. Sci. 40(6): 1302-1314, which is incorporated by reference in its entirety.

DrugPharmer implements a phylogenetic-like tree (PGLT) algorithm as a method for studying compound datasets. This algorithm clusters the compounds and uses MCSS to determine a common substructure for each cluster. These substructures are used to define a series of chemical classes. The compounds within a given class are then re-clustered, and used to find child classes. The resultant data structure is a PGLT of parent and children chemical classes, each defined by a substructure. DrugPharmer's approach creates chemical classes of molecular features, and then correlates these with specified biological values. The substructure associated with these classes can be used to find SAR. This method searches for increasingly larger common substructures to define classes and does not study the SAR effects of smaller structural changes within specific regions of the compounds. See, for example, Nicolaou, C. A. T., et al. (2002) J. Chem. Inf. Comput. Sci. 42(5): 1069-1079, which is incorporated by reference in its entirety.

Distill, from Tripos, uses common substructures to classify compounds. These classes can then be visualized in the SYBYL interface through an interactive dendrogram.

The SARTree method can maintain the connectivity information for the structures it describes. In other words, a user can easily determine both the substructures present in the compounds of a library and the relative positions of those substructures to other substructures from a SARTree that describes the library. When the SARTree also conveys information relating to properties of the compounds, it can help a chemist understand the influence of structure on the properties of the compounds.

A chemical library is a collection of compounds. The library can be a physical collection of compounds, or can be a conceptual collection of compounds. A conceptual library can be a collection of structures of compounds, for example, the set of compounds that are expected as the result of carrying out a combinatorial or parallel synthesis. A chemical library can include known compounds (i.e., compounds that have been synthesized and characterized by a chemist) and unknown compounds, such as, for instance, compounds that a chemist intends to synthesize and characterize.

A chemical library can be described by a connectivity-based method. The connectivity-based method maintains information about the locations of compound fragments on a core structure. The structure of a compound can be described as a core having substituents at one or more locations. When compounds of the library share a structural fragment in a substituent, the substituent can be described as including a subcore. The subcore in turn can have substituents at one or more locations. This recursive description can be extended as far as is necessary or convenient.

The method can be implemented on a computer. The structures of the compounds belonging to the library can be automatically examined to determine what, if any, common core structure is shared by the members of the library. Alternatively, a user can specify a core structure.

In the methods, a chemical structure can be a structural formula (e.g., a drawn structural formula for toluene) or a computer-readable structure. A computer-readable structure can convey the same information to a chemist as a structural formula. In general, the chemist will find a structural formula more convenient to read than a computer-readable structure. A computer executing appropriate software can read a computer-readable structure and display a structural formula to the user. Examples of computer-readable structures include SMILES (e.g., c1ccccc1(C) being a SMILES representation of toluene), and connection tables, such as the connection tables used in the .mol and .sdf file formats.

The library can be described by a database. The database can be stored in one or more files (e.g., on a magnetic disk, optical disk, or other storage medium), in computer memory, or can be a collection of written entries. The database can include information about each compound in the chemical library or a subset of compounds of the library. One file can include structural descriptions of the compounds of the library. The structural description describes each compound by its constituent atoms and the bonds between the atoms. The structural descriptions can be in a standard computer-readable format, such as, for example, the .sdf format or SMILES. Each compound can be identified with a unique identifier such as a serial number or name. The identifier uniquely refers to both a particular member of the chemical library and to the structure of that compound. Each compound can also be associated with one or more pieces of information related to the compound. Information relating to the compounds can include information about the properties of a compound, such as physical, chemical, or biological properties. For example, the database can include biological activity data for the compound. The activity data can be experimental data or the result of theoretical modeling. Experimental activity data can include, for example, the value of a Ki or IC50 measured in vitro, or an in vivo result. Theoretical modeling can include, for example, the results of docking a compound to a protein structure and calculating whether the compound is likely to bind to the protein or not. Any other pertinent information related to the compounds in the library can be stored in the file.

A second file can include information about the cores and subcores of the library.

A structural description of each core or subcore can be in the file. The structural description can indicate the location of sites of variation, i.e., where on the core an R-group is attached. Each core (or subcore) is identified by a core (or subcore) identifier. The core or subcore can also be described by its depth in the tree. For example, a core common to all compounds in the library can have a depth of zero (i.e., the root of the tree). A subcore attached directly to the core can have a depth of one; another subcore attached to the first subcore can have a depth of two, and so on.

A third file can include information about the leaves of the library. There can be a structural description of each leaf of the library in the file. The structural description of a leaf can include information about its site of attachment to a core or subcore, that is, which atom or atoms of the leaf are bonded to the core or subcore. The structural description of a leaf can also include a list of the cores or subcores with which it is associated in the library. For example, the structural description can refer to the cores by the core identifiers. The file can also indicate which compounds of the library have the leaf, by indicating the compound identifiers belonging to compounds having the leaf. In certain circumstances, the information in the three files described above can be combined in a single file, or divided among more than three files.

When a chemical library is described by the method, a user can visualize the structural variation present in the entire library. Properties of the compounds of the library (such as the activity of a compound in an assay) can be visualized as well, and relationships between the structure and activity readily seen. The visualization can be chemically intuitive; in other words, a chemist looking at the visualization can readily understand the structure of the compounds and the degree of structural variation in the library. An overlay of property data can visually relate changes in compound structure to changes in activity. The visualization can be part of an interactive computer program. The interactive computer program can allow a user to examine structures of compounds or groups of compounds or to highlight structures or structural features that are associated with properties.

The program can display different sets of compounds of the library as directed by the user. The program can display lists of compounds by name, structure or structural fragment, and present the activity of the compounds as well. In some cases, it can be useful to display an aggregate measure of activity for a group of compounds, such as an average, maximum, minimum, or standard deviation of activity.
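As an illustration of how core and leaf records and an aggregate activity measure might fit together, the sketch below uses two small record types and a summary function. The field names, the record classes `CoreRecord` and `LeafRecord`, the function `summarize` and all values are hypothetical; the choice of a population standard deviation (`statistics.pstdev`) is also an assumption, since the text does not say which form is intended.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Tuple

@dataclass
class CoreRecord:
    core_id: str
    name: str
    depth: int          # 0 for the root core, 1 for a subcore on it, and so on

@dataclass
class LeafRecord:
    name: str
    core_id: str        # core or subcore the leaf is attached to
    site: str           # site of attachment on that core or subcore
    compound_ids: Tuple[str, ...]

def summarize(leaf, activity):
    """Aggregate activity over the compounds that carry a given leaf."""
    values = [activity[c] for c in leaf.compound_ids if c in activity]
    if not values:
        return None
    return {
        "n": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),
    }

# Hypothetical records and made-up activity values.
cores = [CoreRecord("core-0", "core", 0), CoreRecord("sub-1", "phenyl", 1)]
leaf = LeafRecord("Cl", "sub-1", "R1a", ("cpd-1", "cpd-2"))
activity = {"cpd-1": 10.0, "cpd-2": 30.0}
summary = summarize(leaf, activity)
```

Because each leaf record lists the identifiers of the compounds that carry it, an aggregate for any node can be computed without re-reading the compound structures.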

A connectivity-based method can allow the structural variation and physical, chemical or biological properties present across an entire chemical library to be represented in a hierarchical fashion. In one example, a group of structures is represented by a core that members of the group share. At the next level, a subset of those structures can be represented by the core and a subcore attached to the core. At another level of the hierarchy, a further subset can be represented by a substituent attached to the subcore. Other subsets can include subcores and substituents attached at different locations on the core (or subcore). The subsets can overlap; in other words, an individual compound can belong to more than one subset.

The hierarchical organization can be represented in a variety of ways. One representation is a list or table. The list can include names or structures of the core, subcores, and substituents present in the group of compounds. The list can include information about the location of the subcores and substituents on the core (or subcore). When the list is displayed by a computer, it can be an interactive list. For example, in response to user input, the computer can display or hide names or structures of subcores or substituents attached at a particular location on the core. In another representation, a computer can display a core structure showing sites of variation (which can be indicated by, for example, "R¹", "R²", etc.). When a user clicks on a site of variation, the computer can display names or structures of the subcores or substituents present at that site of variation, or the compounds of the group which use that site of variation.

The structural information can be graphically displayed, for example, as a SARTree graph. A SARTree graph is a connected graph, with nodes representing substructural fragments within the chemical library represented by the graph (see FIG. 1). The graph contains four classes of nodes: a root core node (double circle), subcore nodes (single circle), R-group attachment nodes (diamonds), and leaf nodes (squares). In FIG. 1, the connectivity profile of the indicated compound is highlighted, and its constituent fragments are shown.

Core nodes represent a core substructure common to all of the compounds depicted in that SARTree. Each SARTree will have one core. A library of compounds that lacks a single common substructure can be represented by a forest of SARTrees. Each SARTree in the forest will have one core node at the root of each tree.

The core node of a SARTree has connectivity paths radiating from it through a set of R-group nodes. R-group nodes symbolize the different attachment points for fragments along the core/subcore. Each R-group node is connected to a set of subcore and/or leaf nodes.

Subcore nodes represent a substructure, attached to a specific point in the molecule, that is shared by multiple compounds and has further variation off of that substructure. Subcores are connected to their own R-group nodes. The subcore - R-group - subcore/leaf connectivity of the graph repeats until each path terminates in one or more leaf nodes.

Leaf nodes correspond to a whole fragment connected to the core or subcore at the respective attachment point. Leaves may represent one or more compounds which have that fragment at the same point of attachment. A null leaf can be present in each group of leaves. A null leaf can represent a 'standard' substituent, such as hydrogen.

Whole molecules are represented in a SARTree by combinations of structural variation paths emanating from the core, each terminating in a leaf node. These paths may directly travel from an R-group node to a leaf, or may travel through one or more subcores. A single compound might contain fragments from all of the R-group paths attached to the core, or may only utilize some. When the library is described with the use of null leaves, a single compound can be described using all possible R-group paths, some of which can terminate with a null leaf.
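The use of null leaves described above can be sketched in a few lines. This assumes, purely for illustration, that a compound's substituent at each site is already written as a chain of node labels; the function name `compound_paths` and the labels are hypothetical.

```python
def compound_paths(substituents, sites, null_leaf="H"):
    """Describe one compound as one path per R-group site on the core.

    Sites that the compound does not use are closed with a null leaf,
    so every compound is described over the same set of paths.
    """
    paths = []
    for site in sites:
        chain = substituents.get(site, (null_leaf,))
        paths.append(("core", site) + tuple(chain))
    return paths

# Hypothetical compound that uses R1 (through a phenyl subcore) but not R2.
paths = compound_paths({"R1": ("phenyl", "R1a", "Cl")}, ["R1", "R2"])
```

Here the R1 path travels through a subcore before reaching its leaf, while the unused R2 path terminates directly in the null leaf.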

Table 1 shows the compounds in a VLA-4 library (see Singh, J. et al. (2002) J. Med. Chem. 45(14): 2988-2993, which is incorporated by reference in its entirety). A SARTree generated from this set of structures is shown in FIG. 2 along with the complete structures of compounds associated with each node. The core fragment in each structure is colored red, subcore fragments are colored blue, and leaf fragments are colored green.

By examining the compound breakdown of the VLA-4 SARTree in FIG. 2, the relationship between the various classes of nodes in a SARTree and the compounds described in the library can be seen.

Table 1

A SARTree can be generated using a two-stage algorithm. FIG. 3A shows the recursive flow of compound data through the first stage of the algorithm, creating a connectivity profile for the library. The algorithm depicted in FIG. 3B takes this connectivity information and generates a SARTree graph for visualization.

In general, the first stage of the algorithm explores the structural variation within a compound library by a combination of R-group fragmentation, maximal common substructure (MCSS) searching and compound clustering. Utilizing these methods, the algorithm recursively fragments the compound library, discovers common substructural features, and uses this structural variation information to create a connectivity profile, i.e., a list of all of the core, subcore, and leaf fragments within the dataset. The second stage takes this connectivity profile and creates the visual representation of the SARTree.

More specifically, the first step of the algorithm is to determine a core structure for a set of compounds. The core structure can be manually specified, or can be generated automatically by a MCSS search of all of the compounds. The core need not be a maximal common substructure. In some cases it can be advantageous to have a core structure that is smaller than the maximal common substructure, such as, for example, if the library is expected to add members that will share the smaller core structure but not the maximal common substructure of the initial set of compounds. The core can also be a Murcko structure.

After a core structure has been determined, a list of all the fragments decorating the central core substructure is generated. The algorithm is designed to preserve information describing the location of the fragments on the core. Fragments attached to the same point on the core are grouped together in R-group bins, which can be noted as R1, R2, R3, and so on. If a single core structure cannot be determined for the entire set, more than one core can be chosen. Each core can then serve as the root of a SARTree.

The fragments belonging to each R-group bin are analyzed further to identify subcore(s), if present, using one or a combination of methods. Methods to identify subcores can, without limitation, include the following:

• Similarity clustering - MCSS method: Substituents are clustered based on a measure of similarity, and a maximal common substructure search is performed on each cluster.

• Chemical fragmentation method: Substituents are analyzed to generate small structural assemblies or fragments in a chemically meaningful way. The smaller structures are compared within the bin for structural identity, and frequency of occurrence can be calculated. Subcore(s) can be selected automatically using frequency criteria.

• Database searching method: Substituents in the bin are compared with a database of chemical structures to identify subcore(s). The compared structures can also be Murcko structures.

• Manual method: A user manually draws or selects regions of structures from the substituents in the bins.

Substituents can be aligned prior to subcore identification. Aligning the substituents can help to prevent artifacts from appearing in the final SARTree. For example, a chemist recognizes that two phenyl groups, each having a meta-chloro substitution, are equivalent. However, a computer does not have the chemical understanding that a chemist does, and so might identify the two m-chlorophenyl groups as being distinct if drawn such that the chloro groups appear in different locations. Aligning the substituents (i.e., so that when drawn, both have the meta-chloro group in the same place) helps the computer to correctly identify equivalent substituents.

In the clustering method, fragments are clustered according to similarity as determined by a set of chemical descriptors. Any structure-based descriptor can be used in clustering. In some circumstances, it can be desirable to use connectivity-based fingerprints or the number of ring bonds as descriptors. Each cluster is then searched for a MCSS. A maximal common substructure is the largest set of atoms and bonds in common among a group of structures. An MCSS can include a substructure that is not chemically intuitive, such as a fragment of a ring. Clustering of fragments prior to MCSS searching helps to produce chemically meaningful substructures, and to avoid awkward substructures, such as a partial ring. If a significant substructure is found, it is used as a subcore for that R-group. A substructure can be deemed significant if it contains more than a specified number of atoms.

The set of subcores is sorted by size, then tested against the initial set of R-group fragments. Those fragments that contain a subcore are further fragmented into a new set of R-groups attached to that subcore. Those fragments that contain none of the subcores are noted as leaves. The algorithm continues recursively, within the newly formed R-group lists, until no significant subcores are found, and each path of variation ends in a set of leaf nodes. See FIG. 3A.
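The recursion described above can be illustrated with a minimal sketch. It is not the Pipeline Pilot implementation: real structures are replaced by tuples of text tokens read outward from the attachment point, and the clustering and MCSS search are replaced by a much simpler stand-in (a leading token shared by at least `min_count` fragments is treated as a subcore). The function name `build_profile`, the `min_count` parameter, and the fragment names are hypothetical.

```python
from collections import Counter


def build_profile(fragments, min_count=2):
    """Recursively split one R-group bin into subcores and leaves.

    Each fragment is a non-empty tuple of tokens, ordered outward from the
    attachment point. Toy assumption: a leading token shared by at least
    `min_count` fragments that extend beyond it counts as a "significant"
    subcore (a stand-in for clustering plus MCSS searching).
    """
    counts = Counter(frag[0] for frag in fragments if len(frag) > 1)
    subcores = {tok for tok, n in counts.items() if n >= min_count}

    remainders = {}  # subcore token -> fragments left after removing it
    leaves = []      # fragments that contain none of the subcores
    for frag in fragments:
        if len(frag) > 1 and frag[0] in subcores:
            remainders.setdefault(frag[0], []).append(frag[1:])
        else:
            leaves.append("-".join(frag))

    # Recurse into the newly formed R-group lists until no subcore is found.
    return {
        "subcores": {
            tok: build_profile(rest, min_count)
            for tok, rest in sorted(remainders.items())
        },
        "leaves": sorted(leaves),
    }


# Hypothetical R1 bin, loosely modeled on a library with two linker subcores.
r1_bin = [
    ("phenylamine", "Cl"),
    ("phenylamine", "OMe"),
    ("phenylamine", "amide", "Me"),
    ("phenylamine", "amide", "Et"),
    ("thiazoleamine", "Me"),
    ("thiazoleamine", "Et"),
    ("H",),
    ("benzyl", "F"),
]
profile = build_profile(r1_bin)
```

Each call shortens the fragments it passes down by one token, so the recursion ends when a bin holds only leaves, mirroring the stopping condition in the text.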

The recursive nature of the method allows automatic analysis of structural variation many levels deep within a compound library. The method can be implemented automatically (e.g., on a computer). If the user desires, the method can be constrained by manually inputting parts of the subcore backbone to be used in fragmentation. For any given R-group, instead of using clustering and MCSS for that set of fragments, a set of subcores can be specified and the fragments will only be tested against those substructures. The user can further decide whether the recursive fragmentation algorithm is continued on the resultant R-groups, or those lists are all marked as leaves. This regulation of the backbone allows a controlled investigation of the library. If the process is too tightly restricted, however, hidden SAR trends which might have been found in an automatic analysis might be lost. In combinatorial libraries, if the core is known, it can be exploited semi-automatically by simply supplying library and core files; otherwise a MCSS search can be used to determine a core and the library can be processed automatically, but at greater computational expense. The resultant MCSS core substructure may not be the actual scaffold originally used to create the library.

For highly diverse libraries, the compounds can be clustered by structural similarity, then searched for a MCSS. Alternatively, a diverse subgraph MCSS search can be run, to find a set of common substructures within the library. These substructures can represent cores of potential 'series' within the dataset. After merging redundant and overly small series, a 'forest' of SARTrees can be produced, with each tree representing a series and rooted at its respective core.

The algorithm can be implemented in the Pipeline Pilot application (Scitegic). Pipeline Pilot provides a 'chemically-aware' object-oriented programming environment which understands molecular structures. Through built-in components, it is capable of performing complex algorithms, such as R-group fragmentation, chemical clustering, and MCSS searching, on chemical libraries in numerous data formats. It is designed to rapidly push compounds through pipelines of these computational processes. Other environments, such as OEChem by OpenEye, may provide similar functionality for SARTree.

The SARTree dataset produced by the first stage of the algorithm contains all of the chemical and activity data related to the tree, and is completely independent from the second stage user interface. As such, the dataset might be visualized in any other visualization scheme, such as web-based reports.

An application can allow a user to interact with the SARTree database. The application can display a SARTree and can provide a graphical interface for a user. The graphical interface can allow a node of the SARTree to be queried for information regarding the fragment it represents, the corresponding compounds that contain the fragment, or other information. The application can also produce the SARTree in a manner that reveals other information in the database, such as properties of the compounds in the library. For any such property, each node and edge can be assigned a value from the mean, maximum, minimum, or standard deviation of the property values of the associated compounds. Mean or median overlays can be especially useful when trying to find the general contributions of fragments. Minimum overlays can highlight the compound of the tree with the lowest assay value; likewise, a maximum overlay highlights the compound with the highest assay value. A standard deviation overlay can illustrate which compounds show variation in activity, and thus which fragments are important for determining the activity of a compound. The nodes and edges can be adjusted (e.g., in size, color, thickness, or other feature) according to the value associated with the feature. Nodes and edges can be adjusted with a simple gradient scale or through a discrete set of bins, specified for that particular library.
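The overlay computation can be sketched as follows, under stated assumptions. The helper names `node_overlay` and `to_bin`, the use of the population standard deviation, and the example values are illustrative choices rather than details of the application. The bin edges follow the three standard deviation classes described for FIG. 5 (below 0.5, 0.5 to 1.0, above 1.0); placing a value of exactly 1.0 in the highest class is an assumption.

```python
import statistics


def node_overlay(values, stat="mean"):
    """Summarize the property values of the compounds under one node."""
    funcs = {
        "mean": statistics.mean,
        "median": statistics.median,
        "min": min,
        "max": max,
        # Assumption: population standard deviation, so that a node with a
        # single compound yields 0.0 instead of raising an error.
        "stdev": statistics.pstdev,
    }
    return funcs[stat](values)


def to_bin(value, edges, labels):
    """Map a value to a discrete class; `labels` has one more entry than `edges`."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Hypothetical log(IC50) values for the compounds under three nodes.
node_values = {
    "R1/a": [0.5, 1.0, 1.5],
    "R1/b": [1.0, 2.5],
    "R1/c": [0.0, 2.0, 4.0],
}
colors = {
    node: to_bin(node_overlay(vals, "stdev"), (0.5, 1.0), ("red", "green", "blue"))
    for node, vals in node_values.items()
}
```

A gradient scale would replace `to_bin` with a continuous color map over the same per-node values.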

It can sometimes be advantageous to display a constrained subset of a SARTree. In a SARTree, each leaf node can represent multiple compounds having structural variation at other positions. In order to more easily find the effects of specific structural variations, specific subtrees of a SARTree can be studied. In one example, one leaf node is chosen as a fixed node, and only those branches of the SARTree that correspond to compounds including the chosen leaf node are displayed. In another example, multiple leaf nodes are chosen as fixed nodes, and one or more R group nodes are allowed to be variable. The resulting subtree represents a group of related compounds. A similar tree can be created by specifying a more extensive core structure during the implementation of the SARTree algorithm. However, it can be more convenient to generate the initial SARTree with a smaller core having a larger number of branches, and later generate subtrees, than to create a larger number of less branched SARTrees initially. In this manner, relationships between different subtrees of a common SARTree can be explored.

An algorithm can be used to generate all subtrees containing only a single point of variation. Once the group of subtrees has been organized by the algorithm, a series of traditional SAR tables can be generated. In a SAR table, a series of compounds that vary at a single position are listed along with a measure of their respective activities. From the SARTree data, every possible SAR table can be generated by selecting an R group node and combinatorially creating each SAR table that varies only at that point. The process is repeated for each R group node in the SARTree. These SAR tables can each be represented by a distinct subtree. A user can review only those tables for which the variation in activity exceeds a predetermined threshold, thus indicating that the R group in question has an effect on activity. A user can also review particular SAR tables based on the number of compounds in the table, the range or standard deviation of activity values in the table, or the presence or absence of particular structural fragments in the table.

To aid in discovering the SAR present within a SARtree, we have implemented an automated method of SAR table generation. For every R-group within the tree, this tool computes all possible subsets of the SARtree dataset that only vary at that point. This procedure is done by combinatorially fixing every combination of nodes throughout the rest of the connectivity profile. The activity or property changes for these point-specific changes can be mapped, generating all of the SAR tables associated with that SARtree. These tables can be filtered by number of molecules present, activity range and standard deviation, or node membership to provide a more focused study of the library. This analysis is particularly important because it automatically quantifies the SAR data represented by the SARtree graph.
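A minimal sketch of this kind of single-point SAR table generation is given below. The data layout (a mapping from compound name to its R-group fragments and one activity value), the function name `sar_tables`, and the filter parameters `min_size` and `min_range` are assumptions made for illustration. Compounds are grouped by everything except the varying position, so each group is one SAR table.

```python
from collections import defaultdict


def sar_tables(compounds, vary, min_size=2, min_range=0.0):
    """Build every SAR table that varies only at the R-group `vary`.

    `compounds` maps a name to (fragments, activity), where `fragments`
    maps each R-group position to the fragment found there. Every compound
    is assumed to have an entry for `vary`.
    """
    groups = defaultdict(list)
    for name, (frags, activity) in compounds.items():
        # Key: the fixed fragments at every other position.
        fixed = tuple(sorted((pos, f) for pos, f in frags.items() if pos != vary))
        groups[fixed].append((frags[vary], activity, name))

    tables = {}
    for fixed, rows in groups.items():
        activities = [a for _, a, _ in rows]
        # Filter by number of molecules and by activity range.
        if len(rows) >= min_size and max(activities) - min(activities) >= min_range:
            tables[fixed] = sorted(rows)
    return tables


# Hypothetical two-position library with made-up activity values.
compounds = {
    "c1": ({"R1": "Cl", "R2": "H"}, 5.0),
    "c2": ({"R1": "OMe", "R2": "H"}, 7.5),
    "c3": ({"R1": "Cl", "R2": "Me"}, 5.5),
    "c4": ({"R1": "OMe", "R2": "Me"}, 5.0),
    "c5": ({"R1": "F", "R2": "Et"}, 6.0),
}
all_r1_tables = sar_tables(compounds, "R1")
focused = sar_tables(compounds, "R1", min_range=1.0)
```

Calling the function once per R-group position would yield the full set of tables; filtering by standard deviation or node membership could be added in the same loop.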

It is important to note that the SARtree dataset produced by the first stage of the algorithm contains the entire collection of chemical and activity data related to the tree, and is completely independent from the second stage user interface. As such, the dataset might be visualized in any other visualization scheme, such as web-based reports.

The application can be useful in comparing two libraries of compounds. Two libraries can be compared with respect to the diversity of compounds in each library, the similarity of compounds in the two libraries, or other measures. The libraries can be compared visually, for example by creating a SARTree for each library and displaying the two SARTrees together. Alternatively, the library can be described by a system of identifiers or coordinates. In one example, the coordinates can be a numerical system, where a number or string of numbers can be used to refer to a core, subcore, or leaf node. The coordinates can be used to refer to a single compound of the library, or a group of compounds of the library. For example, a generic structure having a single variable position can be referred to by coordinates that designate a core, subcores and leaves shared by members of the group. Compounds that have the generic structure can be referred to by the same set of coordinates plus an additional coordinate specifying the substituent at the variable position. The coordinates can be useful in comparing compounds, groups of compounds, or libraries of compounds.

A SARTree can be used to define the chemical groups in a library and their relative arrangements to each other. The groups can be described in terms of a coordinate in which a core, R group, subcore, and leaf are represented by a point of the coordinate. For example, for a group of compounds that can be described by one core with three R groups (R1, R2 and R3), where each R group is associated with two subcores, and for each subcore there are many leaves, the structures of the compounds can be indicated by the selection of R group, subcore and leaf:

compound #1: core, R1, subcore 2, leaf 4
compound #2: core, R2, subcore 1, leaf 5
compound #3: core, R3, subcore 1, leaf 8

The coordinate can be expressed in a compressed form, e.g., as a string of numbers or vector denoting R group, subcore and leaf:

compound #1: (1, 2, 4)
compound #2: (2, 1, 5)
compound #3: (3, 1, 8)

In another example, a group of compounds can be described by one core with two R groups (R1 and R2), where each R group is associated with two subcores, and for each subcore there are many leaves. Compounds from this group might include:

compound #4: core, R1, subcore 1, leaf 3, R2, subcore 2, leaf 6
compound #5: core, R1, subcore 3, leaf 2, R2, subcore 1, leaf 4
compound #6: core, R1, subcore 2, leaf 4, R2, subcore 3, leaf 2

The coordinates for these compounds could be expressed as:

compound #4: (1, 1, 3, 2, 2, 6)
compound #5: (1, 3, 2, 2, 1, 4)
compound #6: (1, 2, 4, 2, 3, 2)
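The compressed coordinates above can be produced and read back with a short sketch like the following. The function names `encode` and `decode` are hypothetical, and the sketch assumes that every substituent is described by exactly one (R group, subcore, leaf) triple of integers, as in the examples above.

```python
def encode(substituents):
    """Flatten (r_group, subcore, leaf) triples into one coordinate tuple.

    Triples are sorted by R group so that the same compound always yields
    the same coordinate, regardless of input order.
    """
    coordinate = []
    for triple in sorted(substituents):
        coordinate.extend(triple)
    return tuple(coordinate)


def decode(coordinate):
    """Split a flat coordinate back into (r_group, subcore, leaf) triples."""
    if len(coordinate) % 3 != 0:
        raise ValueError("coordinate length must be a multiple of 3")
    return [tuple(coordinate[i:i + 3]) for i in range(0, len(coordinate), 3)]


# Compound #1 and compound #4 from the examples above.
compound_1 = encode([(1, 2, 4)])
compound_4 = encode([(2, 2, 6), (1, 1, 3)])
```

Deeper trees (subcores nested inside subcores) would need longer or variable-length entries per R group, which this fixed-width sketch does not cover.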

The size of the coordinate can vary depending on the degree of branching in the library. In the examples above, the coordinate defines the identity of a compound by providing an index to the core, R groups, subcores and leaves in each compound. The coordinate can also include additional information regarding the compounds. For example, the coordinate could include an indicator that describes a property of a leaf (such as, for example, aliphatic, aromatic, hydrogen bond donor, hydrogen bond acceptor, electron withdrawing, or electron releasing). The indicator can include information about a property of a compound, such as a biological activity. The coordinate can provide a shorthand notation for referring to a compound or series of compounds in a library. This information could be used to define the complexity of a library. In one example, the complexity of a library can be determined by counting how many R groups, subcores and leaves exist in the library. The degree of branching can be included in a measure of complexity. This information could also be used to differentiate a series of libraries, or to compare two libraries. The extent of overlap in the coordinates defined between two libraries can be useful in determining the similarity of two libraries, and of the chemical functionality present in the libraries. With a measure of similarity of libraries (e.g., based on the similarity of coordinates of the libraries), the libraries can be clustered by similarity. The clustering can provide a means of identifying libraries that are more or less similar to a given library.
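As one possible reading of the comparison described above, the sketch below counts R groups, subcores, and leaves as a complexity measure and scores the overlap of two libraries with the Jaccard index. The Jaccard index is a swapped-in choice; the text does not name a specific similarity measure. The sketch also assumes that both libraries share one core and one fragment numbering, so that equal coordinates mean equal fragments, and the library contents are made up.

```python
def complexity(coords):
    """Count distinct R groups, subcores, and leaves in a library.

    Each coordinate is an (r_group, subcore, leaf) triple. A subcore is
    identified by its R group plus its index, and a leaf by all three.
    """
    r_groups = {c[0] for c in coords}
    subcores = {c[:2] for c in coords}
    leaves = {c[:3] for c in coords}
    return len(r_groups), len(subcores), len(leaves)


def jaccard(coords_a, coords_b):
    """Overlap of two coordinate sets: |A and B| / |A or B|."""
    a, b = set(coords_a), set(coords_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty libraries are identical
    return len(a & b) / len(a | b)


# Hypothetical libraries expressed in a shared coordinate system.
library_a = {(1, 2, 4), (2, 1, 5), (3, 1, 8)}
library_b = {(1, 2, 4), (2, 1, 5), (2, 1, 7)}
similarity = jaccard(library_a, library_b)
```

A matrix of such pairwise scores could then be passed to any standard clustering routine to group libraries by similarity.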

Two or more SARTrees can be compared using concepts derived from graph theory. Several potential metrics for comparing graphs are described at http://people.hofstra.edu/geotrans/eng/ch2en/meth2en/ch2m2en.html, which is incorporated by reference in its entirety.

When two libraries include similar or identical substructures (e.g., a subcore), the SARTrees of those libraries can be joined to create a single SARTree representing the two libraries. Such a display could be useful, for example, for providing a visual representation of two series of compounds having distinct cores, a common subcore structure, and a variety of leaves, some of which can be common to both series. The joined SARTree can visually display which of the leaves are present in each series or in both series.

The application can be useful for highlighting compounds that include a particular substructure. For example, the user can supply a substructure of interest, and the application can identify the compounds of the library having the substructure. Corresponding branches of a SARTree can be indicated, for example with a different color, or by hiding branches of the SARTree that are not associated with the substructure. The substructure of interest need not be a core, subcore or leaf that was identified by the SARTree algorithm. As described above, properties of the selected compounds can be overlaid on the SARTree.

The application can display the library at different stages of development. Subtrees can be generated from the subset of compounds present within a library at those stages of structural exploration. These compound subsets can have dramatically different activity profiles. An early library may have few compounds having low activity, while a later version of the library can include more compounds having higher activity. By viewing a sequence of subtrees corresponding to the progression of compound creation or testing, a representation of library progress can be created. Such a display can be useful in tracking the development of a library, both in terms of the structures of the compounds in the library, and in tracking the development of more highly active compounds.

SARtree can be used to design new chemical libraries. Many libraries are designed without regard for synthetic limitations. As a result, these libraries often include compounds that are impractical to synthesize, either individually or as part of a parallel or combinatorial library synthesis. Using SARtree in library design can aid in the design of synthetically feasible compounds. One or more libraries of existing compounds are processed by the SARtree method to identify cores, subcores, and leaves in the library. A subcore and its associated leaves can then be grafted on to a distinct core. Preferably, the distinct core is found in a library, where it is attached to a subcore similar to the grafted subcore. Grafting a similar subcore can favor the synthetic feasibility of the new library, because the substitutions on any given subcore are likely to be transferable to a similar subcore.

For example, consider two libraries, each analyzed by the SARTree method. The core identified in one library is designated core A, the other library's core as core B. See FIG. 11. Each library has a substituted phenyl subcore, but the substitutions (i.e., the leaves on the phenyl subcore) are different in the two libraries. Two new SARtrees can be created by grafting the phenyl subcore (and associated leaves) from one core to the other. The two new SARtrees represent two new chemical libraries.
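The grafting step in this example can be sketched with nested dictionaries standing in for SARTrees. The tree layout, the function name `graft`, and the leaf labels are hypothetical; a real implementation would also need to check that the attachment chemistry is compatible, which this sketch does not attempt.

```python
import copy


def graft(recipient, donor, r_group, subcore):
    """Return a new tree: the recipient core carrying the donor's subcore.

    The donor's subcore and its leaves replace (or are added at) the same
    R-group position on a copy of the recipient; neither input is modified.
    """
    new_tree = copy.deepcopy(recipient)
    donor_leaves = donor["r_groups"][r_group][subcore]
    new_tree["r_groups"].setdefault(r_group, {})[subcore] = list(donor_leaves)
    return new_tree


# Two hypothetical libraries, each with a substituted phenyl subcore on R1.
library_a = {"core": "A", "r_groups": {"R1": {"phenyl": ["3-Cl", "4-F"]}}}
library_b = {"core": "B", "r_groups": {"R1": {"phenyl": ["4-OMe", "2-Me"]}}}

# Swap the phenyl subcores in both directions to obtain two new libraries.
new_a = graft(library_a, library_b, "R1", "phenyl")
new_b = graft(library_b, library_a, "R1", "phenyl")
```

Enumerating the compounds of `new_a` and `new_b` would give the members of the two new libraries.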

The grafting approach can be used to quickly design synthetically feasible compounds for purposes including exploration of SAR space, designing around proprietary compounds, and enriching screening libraries.

Scripts written in MATLAB can define nodes and edges of the SARTree graph. Cytoscape can be used to lay out the graph. Finally, the SARTree is visualized in an interactive chemistry platform called Pinpoint, which is written in MATLAB.

The various techniques, methods, and aspects described above can be implemented in part or in whole using computer-based systems and methods. Additionally, computer-based systems and methods can be used to augment or enhance the functionality described above, increase the speed at which the functions can be performed, and provide additional features and aspects as a part of or in addition to those described elsewhere in this document. Various computer-based systems, methods and implementations in accordance with the above-described technology are presented below.

In one implementation, a general-purpose computer may have an internal or external memory for storing data and programs such as an operating system (e.g., DOS, Windows 2000™, Windows XP™, Windows NT™, OS/2, UNIX or Linux) and one or more application programs. Examples of application programs include computer programs implementing the techniques described herein; authoring applications (e.g., word processing programs, database programs, spreadsheet programs, or graphics programs) capable of generating documents or other electronic content; client applications (e.g., an Internet Service Provider (ISP) client, an e-mail client, or an instant messaging (IM) client) capable of communicating with other computer users, accessing various computer resources, and viewing, creating, or otherwise manipulating electronic content; and browser applications (e.g., Microsoft's Internet Explorer) capable of rendering standard Internet content and other content formatted according to standard protocols such as the Hypertext Transfer Protocol (HTTP).

One or more of the application programs may be installed on the internal or external storage of the general-purpose computer. Alternatively, in another implementation, application programs may be externally stored in and/or performed by one or more device(s) external to the general-purpose computer. The general-purpose computer includes a central processing unit (CPU) for executing instructions in response to commands, and a communication device for sending and receiving data. One example of the communication device is a modem. Other examples include a transceiver, a communication card, a satellite dish, an antenna, a network adapter, or some other mechanism capable of transmitting and receiving data over a communications link through a wired or wireless data pathway.

The general-purpose computer may include an input/output interface that enables wired or wireless connection to various peripheral devices. Examples of peripheral devices include, but are not limited to, a mouse, a mobile phone, a personal digital assistant (PDA), a keyboard, a display monitor with or without a touch screen input, and an audiovisual input device. In another implementation, the peripheral devices may themselves include the functionality of the general-purpose computer. For example, the mobile phone or the PDA may include computing and networking capabilities and function as a general purpose computer by accessing the delivery network and communicating with other computer systems. Examples of a delivery network include the Internet, the World Wide Web, WANs, LANs, analog or digital wired and wireless telephone networks (e.g., Public Switched Telephone Network (PSTN), Integrated Services Digital Network (ISDN), and Digital Subscriber Line (xDSL)), radio, television, cable, or satellite systems, and other delivery mechanisms for carrying data. A communications link may include communication pathways that enable communications through one or more delivery networks.

In one implementation, a processor-based system (e.g., a general-purpose computer) can include a main memory, preferably random access memory (RAM), and can also include a secondary memory. The secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive reads from and/or writes to a removable storage medium. A removable storage medium can include a floppy disk, magnetic tape, optical disk, etc., which can be removed from the storage drive used to perform read and write operations. As will be appreciated, the removable storage medium can include computer software and/or data.

In alternative embodiments, the secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system. Such means can include, for example, a removable storage unit and an interface. Examples of such can include a program cartridge and cartridge interface (such as those found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units and interfaces, which allow software and data to be transferred from the removable storage unit to the computer system.

In one embodiment, the computer system can also include a communications interface that allows software and data to be transferred between computer system and external devices. Examples of communications interfaces can include a modem, a network interface (such as, for example, an Ethernet card), a communications port, and a PCMCIA slot and card. Software and data transferred via a communications interface are in the form of signals, which can be electronic, electromagnetic, optical or other signals capable of being received by a communications interface. These signals are provided to communications interface via a channel capable of carrying signals and can be implemented using a wireless medium, wire or cable, fiber optics or other communications medium. Some examples of a channel can include a phone line, a cellular phone link, an RF link, a network interface, and other suitable communications channels.

In this document, the terms "computer program medium" and "computer usable medium" are generally used to refer to media such as a removable storage device, a disk capable of installation in a disk drive, and signals on a channel. These computer program products provide software or program instructions to a computer system. Computer programs (also called computer control logic) are stored in the main memory and/or secondary memory. Computer programs can also be received via a communications interface. Such computer programs, when executed, enable the computer system to perform the features as discussed herein. In particular, the computer programs, when executed, enable the processor to perform the described techniques. Accordingly, such computer programs represent controllers of the computer system.

In an embodiment where the elements are implemented using software, the software may be stored in, or transmitted via, a computer program product and loaded into a computer system using, for example, a removable storage drive, hard drive or communications interface. The control logic (software), when executed by the processor, causes the processor to perform the functions of the techniques described herein.

In another embodiment, the elements are implemented primarily in hardware using, for example, hardware components such as PAL (Programmable Array Logic) devices, application specific integrated circuits (ASICs), or other suitable hardware components. Implementation of a hardware state machine so as to perform the functions described herein will be apparent to a person skilled in the relevant art(s). In yet another embodiment, elements are implemented using a combination of both hardware and software.

In another embodiment, the computer-based methods can be accessed or implemented over the World Wide Web by providing access via a Web Page to the methods described herein. Accordingly, the Web Page is identified by a Uniform Resource Locator (URL). The URL denotes both the server and the particular file or page on the server. In this embodiment, it is envisioned that a client computer system interacts with a browser to select a particular URL, which in turn causes the browser to send a request for that URL or page to the server identified in the URL. Typically the server responds to the request by retrieving the requested page and transmitting the data for that page back to the requesting client computer system (the client/server interaction is typically performed in accordance with the Hypertext Transfer Protocol (HTTP)). The selected page is then displayed to the user on the client's display screen. The client may then cause the server containing a computer program to launch an application to, for example, perform an analysis according to the described techniques. In another implementation, the server may download an application to be run on the client to perform an analysis according to the described techniques.

EXAMPLES

The VLA-4 library contains 13 compounds created from variations off of a common core. These compounds were assayed for IC₅₀ values in VLA-4 inhibition (see Singh, J. et al. (2002) J. Med. Chem. 45(14): 2988-2993, which is incorporated by reference in its entirety). See Table 1 and FIG. 2. Every structure in the VLA-4 library shares a common PUPA core and varies by the fragment attached to the carbonyl carbon. When analyzed in SARTree, the algorithm finds two common substructures attached to R1: eight compounds include a phenylamine linker (shown on the left side of FIG. 2) and two compounds include a thiazoleamine linker (at the right side of FIG. 2). Both thiazoleamine compounds have high IC₅₀ values, while the IC₅₀ values of the phenylamine compounds vary depending on the variation off of the subcore.

FIG. 4 presents a SARTree of the VLA-4 library with activity information overlaid as different colors. The nodes in the graph are colored in a gradient by the mean log(IC₅₀) values of their associated compounds. Low values of log(IC₅₀) are red, while blue represents high values of log(IC₅₀). The two subcore nodes are circled on the SARTree, and shown in the structural diagram (inset). The number of associated compounds (N) and mean IC₅₀ values for each subcore are listed in the inset.

The dopamine β-hydroxylase (DβH) library presents a small, familiar combinatorial chemistry dataset used in molecular shape and QSAR analysis (see Burke, B. J. and A. J. Hopfinger (1990) J. Med. Chem. 33(1): 274-81, which is incorporated by reference in its entirety). The dataset consisted of 52 molecules, which were generated by decorating the benzene ring in a 1-(substituted-benzyl)imidazole-2(3H)-thione core. The DβH inhibition activity of these compounds was reported in log(IC₅₀) values. See FIG. 5. The SARTree generated from this library shows the positional substituents on the ring. Because of the simplicity of the substitutions, the graph generated from the library is represented simply by leaf fragments attached to five points on the core.

This SARTree provides a visual SAR table of the DβH dataset. The graph can be colored by the results from the inhibition assay to show the effects of different decorations on inhibition. When overlaid with standard deviation of log(IC₅₀) values (see FIG. 5), the SARTree highlights the consistency of inhibition results for molecules sharing a fragment at a certain point on the ring. Specifically, nodes are colored by classes in standard deviation in IC₅₀. The tables show the leaves of each R-group, with the number of associated compounds (N) and standard deviation in IC₅₀ values (STDDEV).

The nodes in FIG. 5 are colored in three classes: red for a standard deviation less than 0.5, green for 0.5 to 1.0 and blue for greater than 1.0. The actual deviation values are shown in tables adjacent to the R-groups. The high standard deviation of the multi-compound fragments in the R2 and R5 positions could signify that these positions on the ring are not involved in the binding interaction; therefore, structural variation in this region would have little effect on the molecule's inhibition.

The SARTree algorithm was applied to a library used in screening against cyclin-dependent kinase 2 (CDK-2). The library included approximately 3000 compounds that were generated combinatorially from a set of 15 scaffolds. See, for example, Evensen, E., et al. (2003) J. Med. Chem. 46, 5125-5128; and Sielecki, T.M., et al. (2000) J. Med. Chem. 43, 1-18, each of which is incorporated by reference in its entirety. The number of compounds associated with each scaffold ranges from 10 to nearly 550 molecules.

CDK-2 inhibition assay results were available for these compounds. IC₅₀ values were also known for a subset of the compounds. The CDK-2 library represents a large screening dataset composed of many sub-libraries, each containing its own SAR characteristics. A forest of SARTrees was produced in this scenario, each branching from a core representing an individual scaffold (FIG. 6). This allows the entire CDK-2 library to be visualized simultaneously, even though the compounds do not share a single common core.

The forest can be investigated as a whole to find the relative performances of each scaffold. The forest in FIG. 6 is colored by the presence of hit compounds associated with each node. Nodes containing any hits are colored blue. A hit was defined as a compound having greater than 50% CDK-2 inhibition and an IC₅₀ less than 25 μM. The remaining nodes are colored red. This view can be used to highlight which scaffolds were successful in producing hits, and which of their structural variation paths contributed to those hits.
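The hit definition and the two-color overlay described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used to prepare FIG. 6: the `Compound` record, its field names, and the choice to treat a compound with no measured IC₅₀ as a non-hit are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Compound:
    # Hypothetical record; field names are illustrative only.
    name: str
    inhibition_pct: float          # percent CDK-2 inhibition
    ic50_um: Optional[float]       # IC50 in micromolar, None if not measured


def is_hit(c: Compound) -> bool:
    """Hit as defined in the text: >50% inhibition and IC50 < 25 uM.

    Assumption: a compound without an IC50 value is not counted as a hit.
    """
    return c.inhibition_pct > 50.0 and c.ic50_um is not None and c.ic50_um < 25.0


def node_color(compounds: List[Compound]) -> str:
    """Blue if any compound associated with the node is a hit, otherwise red."""
    return "blue" if any(is_hit(c) for c in compounds) else "red"
```

Applying `node_color` to the list of compounds associated with each node of each tree gives the kind of overlay described for the forest.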

The individual trees within the SARTree forest can then be studied separately, to view the properties of compounds within each series. The SARTree can also be used to find subseries belonging to a particular scaffold. In scaffold 5, for example, there were three subcores attached to R1: a 1,2-dimethoxybenzene fragment, a methoxybenzene fragment, and a phenyl fragment. These subcores varied in both average inhibition and hits produced per total associated compounds (see Table 2). This analysis shows the higher levels of success within the 1,2-dimethoxybenzene subseries of R1 (see FIG. 7). FIG. 7 shows the SARTree for scaffold 6 of the CDK-2 library colored by hit presence (blue indicates nodes containing hit compounds). The R1 region of structural variation (enclosed in black outline) contains three subcores (circled and numbered in orange). These subcores define three separate subseries of variation within the library.

Table 2

Subcore N Avg. inhibition Hits

By examining the nested variations within each R-group, we can find the subtle structural changes that affect expressed activity. In scaffold 9 of the CDK-2 dataset, trends in the average inhibition and hit compounds produced by different fragments were observed when looking at the different structures attached at the R1.1 position of a benzene ring subcore on R1 (see FIG. 8). The nodes are colored by mean inhibition values, in a red-to-blue gradient highlighting the range of values; red nodes represent low inhibition, blue nodes represent high inhibition. The outlined region shows the variation on the R1.1 attachment to the R1 benzene subcore. The structural diagram shows the different fragments attached, with the average inhibitions and hits from their associated compounds. N is the number of compounds having the leaf shown.

Table 3 presents example molecules of each MicroSAR fragment from FIG. 8.

Table 3

Series of subtrees can be generated from a particular tree. For scaffold 8, nine subtrees were generated, each having only a single site of variation and including four or more compounds. FIG. 9 shows the full SARTree of scaffold 8 (center) and the nine subtrees (labeled (a)-(i)). Each subtree is shown with its associated structure. The single site of variation on the structure is indicated by R. The SAR table for subtree (d) is shown in Table 4. A SAR table can show the structure-activity relationship for a narrow series of compounds belonging to a large group of compounds. Multiple SAR tables can help identify preferred substituents at each site of variation on a scaffold.

Table 4

Results of multiple assays can be overlaid on otherwise similar SARTrees to reveal specificity of certain fragments. In FIG. 9, the SARTree for CDK-2 scaffold 9 is shown twice with different overlays. The first overlay (a) colors the nodes containing hits blue; the second overlay (b) shows any nodes with an average Chemscore less than -30 as green.

In this example, the overlay comparison can be used to evaluate the specificity of the Chemscore docking score function on this particular set of compounds. The structural variation paths with good Chemscores but lacking hit compounds, and vice-versa, can be studied to find the classes of fragments which cause inaccuracies in the docking evaluation. Generally, any two (or more) properties can be compared for the same group of compounds by preparing similar SARTrees overlaid with different property values.
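The overlay comparison described above amounts to comparing two sets of nodes. The following sketch assumes each node is identified by a string key and that the per-node hit flag and mean Chemscore have already been computed; the function name, the dictionary layout and the node keys are hypothetical. The cutoff of -30 is the one used in the overlay described above.

```python
from typing import Dict, List


def compare_overlays(
    node_has_hit: Dict[str, bool],
    node_mean_chemscore: Dict[str, float],
    cutoff: float = -30.0,
) -> Dict[str, List[str]]:
    """Split nodes by agreement between the hit overlay and the docking overlay."""
    good_score = {n for n, s in node_mean_chemscore.items() if s < cutoff}
    hit_nodes = {n for n, has_hit in node_has_hit.items() if has_hit}
    return {
        "agree": sorted(good_score & hit_nodes),       # good score and hits
        "score_only": sorted(good_score - hit_nodes),  # good score, no hits
        "hit_only": sorted(hit_nodes - good_score),    # hits, poor score
    }


# Made-up node keys and values, for illustration only.
result = compare_overlays(
    {"n1": True, "n2": False, "n3": True},
    {"n1": -35.0, "n2": -32.0, "n3": -20.0},
)
```

The `score_only` and `hit_only` lists correspond to the structural variation paths where the docking score and the assay disagree, which are the ones worth inspecting by hand.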

Figure 12 shows the SARtree network for a small library of twenty CDK2 inhibitors. The individual compounds are shown for easier understanding of the SARtree. First, the SARtree quickly captured the true underlying chemical variation in this library. The core structure (red) has one R-group, and two subcores (phenyl and ethyl) at that R-group (blue). There were three compounds that did not have a common subcore, and they were displayed as leaves (green) directly attached to the core. There was more variation shown about the phenyl subcore than about the ethyl subcore. SARtree has grouped the arrangement of leaves about the phenyl subcore according to ortho, meta, and para substitution patterns. This type of structural variation summary was not as immediately obvious from just examining a list of the 20 molecules. The difficulty of summarizing structural variation is magnified when the number of molecules and the structural complexity within the compounds increase.
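The grouping of substituents into subcores and leaves can be illustrated with a deliberately simplified sketch. It does not perform substructure matching: each substituent is assumed to be already written as a tuple of fragment labels read outward from the attachment point, and a shared leading label stands in for a common subcore. The function name, the tuple encoding, the "H" placeholder for an undecorated subcore, and the example data are all assumptions made for illustration and are not the twenty compounds of Figure 12.

```python
from collections import defaultdict


def build_sartree(substituents):
    """Group (compound_id, fragment_path) pairs into subcores and leaves.

    A leading fragment shared by at least two different substituents becomes
    a subcore and is decomposed recursively; anything else is a leaf.
    """
    groups = defaultdict(list)
    for cid, path in substituents:
        groups[path[0]].append((cid, path[1:]))

    node = {"subcores": {}, "leaves": {}}
    for first, members in groups.items():
        distinct_rest = {rest for _, rest in members}
        if len(distinct_rest) >= 2:
            # Two or more different decorations on the same fragment: a subcore.
            node["subcores"][first] = build_sartree(
                [(cid, rest or ("H",)) for cid, rest in members]
            )
        else:
            # One substituent, possibly carried by several compounds: a leaf.
            rest = next(iter(distinct_rest))
            label = "-".join((first,) + rest)
            node["leaves"][label] = sorted(cid for cid, _ in members)
    return node


# Made-up substituents at a single R-group.
r1 = [
    (1, ("phenyl", "4-Cl")), (2, ("phenyl", "4-OH")), (3, ("phenyl", "3-Cl")),
    (4, ("ethyl", "OH")), (5, ("ethyl", "NH2")),
    (6, ("methyl",)), (7, ("cyclopropyl",)), (8, ("tert-butyl",)),
]
tree = build_sartree(r1)
```

With this input the result has two subcores (phenyl and ethyl) and three leaves attached directly to the core, mirroring the shape of the tree described above. A real implementation would replace the shared-prefix test with a common-substructure search on the chemical structures.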

Second, the SARtree quickly revealed SAR relationships. For example, overlaying hit information (blue = hit, red = miss) quickly revealed regiochemical effects for -Cl and -F (3 vs. 7 and 5 vs. 6) on activity, while polar -OH (2) results in a hit when replacing hydrophobic moieties in the para position. Third, SARtree emphasizes structural variation within the molecules, and therefore is insensitive to artifacts due to typical descriptor-based similarity methods. For example, the Tanimoto coefficients (Tc) for 1 vs. 2 and 4 vs. 8 are indistinguishable using common methods, even though the compounds have differing activities and, from a medicinal chemistry view, are essentially different. The SARtree clearly separates these molecules, but keeps their core, subcore, leaf, and R-group relationships. Similarity metrics are widely debated, and the inability of these metrics to distinguish these molecules from each other arises because they are typically whole-molecule comparisons, in which subtle but important local structural variations are lost in the noise of the rest of the molecule.
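The whole-molecule effect described above can be seen directly from the definition of the Tanimoto coefficient on fingerprint bit sets, Tc = |A ∩ B| / |A ∪ B|. The sketch below uses made-up bit sets rather than fingerprints of the compounds discussed; returning 1.0 for two empty fingerprints is a convention chosen for the example.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = a | b
    if not union:
        return 1.0  # convention for two empty fingerprints (an assumption)
    return len(a & b) / len(union)


# Two hypothetical molecules sharing 40 bits and differing in one bit each,
# e.g. a single substituent change on an otherwise identical structure.
shared = set(range(40))
mol_a = shared | {100}
mol_b = shared | {101}
tc = tanimoto(mol_a, mol_b)  # 40 shared bits out of 42 in the union
```

Because the shared part of the molecule dominates the union, a single local change barely moves the coefficient, which is why such pairs look nearly identical to a whole-molecule metric.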

Analyzing patterns in chemical structure space can be difficult for large datasets. The addition of multiple assay results or property results magnifies this problem. A SARtree analysis of multiple properties for the CDK2 series was generated. Figure 13 illustrates the fifteen core structures identified and illustrated by the fifteen SARtrees in Figure 14. Multiple properties can be compared interactively with the method, or the data can be viewed side-by-side. FIG. 15 illustrates the same tree displayed four times with four different properties overlaid: activity, PSA, aLogP, and absence of negative atoms. A Lipinski intestinal permeability metric was assigned to each compound as the number of Lipinski property thresholds that the molecule satisfies (from zero to four), together with polar surface area (see, e.g., CA Lipinski, Adv. Drug Del. Rev. 1997, 23, 3). The property SARtrees in Figure 15 suggest that the substitution patterns off of the phenyl ring shown tend to have the most active compounds, but that these substitution patterns also show a modest variability in the number of Lipinski parameters that are passed. aLogP, in addition, seemed to increase in magnitude in the same region of SARtree space as the active compounds. PSA seemed unaffected, however. This simple comparison highlights a typical dilemma in medicinal chemistry: the favorable biochemical potency of a molecule tends to be offset by poor drug-like properties. SARtree can quickly identify such compensatory effects in drug optimization phases and allow the medicinal chemist to guide chemical design towards more favorable drug space.

The SARtree data structure contains descriptions of how many compounds have variations in one leaf position. All of the collections of compounds where only the leaf is changing or where the attachment of the leaf to the subcore is changing (regioisomers) can be extracted along with activity data. In other words, the SARtree data structure can be exploited to produce every possible SAR table for the compounds in the data set. This automatic SAR extraction method was applied to one of the sub screening libraries for CDK2 that contained 106 compounds. Ten SAR tables were found within this library (Figure 16), in which each of the tables was more or less evenly represented in the number of compounds. The hits are shown in Tables 2, 3 and 5.
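One way to picture the extraction of every SAR table is as a grouping problem: for each site of variation, collect the compounds that are identical at all other sites. The sketch below assumes each compound has already been reduced to a mapping from attachment point to leaf label; the function name, the mapping layout, the "H" default for an unlisted site, and the example compounds are hypothetical. The minimum of four compounds per table follows the threshold used elsewhere in the text.

```python
from collections import defaultdict


def extract_sar_tables(compounds, min_size=4):
    """Return every group of compounds that differ at exactly one site.

    compounds: dict mapping compound id -> dict of site name -> leaf label.
    """
    sites = sorted({s for subs in compounds.values() for s in subs})
    tables = []
    for varied in sites:
        groups = defaultdict(list)
        for cid, subs in compounds.items():
            # Key on everything except the site allowed to vary.
            fixed = tuple((s, subs.get(s, "H")) for s in sites if s != varied)
            groups[fixed].append((cid, subs.get(varied, "H")))
        for fixed, members in groups.items():
            leaves = {leaf for _, leaf in members}
            if len(members) >= min_size and len(leaves) > 1:
                tables.append(
                    {"varied": varied, "fixed": dict(fixed), "rows": sorted(members)}
                )
    return tables


# Made-up library: four compounds vary only at R1, a fifth also changes R2.
library = {
    "c1": {"R1": "Cl", "R2": "H"},
    "c2": {"R1": "F", "R2": "H"},
    "c3": {"R1": "OH", "R2": "H"},
    "c4": {"R1": "Me", "R2": "H"},
    "c5": {"R1": "Cl", "R2": "OMe"},
}
tables = extract_sar_tables(library)
```

Here a single table is found, with R1 as the varying site and R2 fixed at H. Joining each row with its assay value would give a SAR table in the usual sense, and counting the returned tables gives the per-library statistic discussed below.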

This approach of extracting SAR tables from SARtrees has three advantages. First, it allows us to quickly and automatically extract SAR tables from one or more chemical libraries (as we will show). Second, it allows us to quantify, per library, the number of SAR tables it contains, which would be useful for quickly assessing the amount of potential SAR information. Third, it allows us to quickly identify and assess any missing or unfilled structure-variation points that would lead to missing or incomplete SAR.

SARtree permits rapid identification of active scaffolds and structure variation-activity relationships within large screening libraries. Screening libraries often contain multiple scaffolds that are each combinatorially expanded to varying degrees and magnitude. Frequently the question arises of which scaffolds are represented in the actives and inactives. It can therefore be useful to arrange the SARtrees of each sub-library contained within the screening library for comparison, as a SARtree forest, shown in Figure 14. An important point is that the SARtree forest in Figure 14 represents over 17,000 compounds, which highlights the efficiency of SARtrees in displaying large numbers of compounds and the general structural variation within the libraries. The SARtree forest allows a user to quickly identify putative scaffolds, and compare the structure-variation and activity among the scaffolds. Figure 14 quickly shows the active/inactive distribution for the various libraries, visually separating out libraries that had no actives (series 3, 9, 10, 12, 13, and 14) from those that contain a variety of actives distributed over the remaining SARtrees. The following sections will explore the relationships between the statistics of the SARtrees and the actives.

Several descriptive statistics for the SARtrees in Figure 14 are shown in Table 5.

Table 5

Series  Total  Hits  SAR tables^a  Sub Cores  Leaves  R Points^b  H Frag^c
1       204    1     38            3          50      8           5
2       106    3     10            4          79      13          13
3       72     0     3             4          65      13          11
4       254    34    7             4          245     15          13
5       272    20    19            4          131     17          16
6       74     6     7             5          56      15          11
7       342    10    5             6          127     22          22
8       427    26    36            5          85      12          12
9       48     0     1             2          42      9           8
10      17     0     0             2          34      6           4
11      12     6     1             2          24      7           7
12      269    0     98            1          46      6           5
13      157    0     21            2          48      8           7
14      247    0     44            1          38      5           4
15      332    39    12            11         458     34          26

^a Tables with four or more compounds.
^b Total number of R-group attachment points at the core and subcores.
^c Number of hydrogen fragments as leaves.

There are two key metrics specific to SARtree. First, R-points describes the degree of variation (DOV) within a SARtree. R-points are the total connections involving core and subcores, and are a simple way of describing the variation at the subcore level. Second, the number of SAR tables (select groups of molecules where only one leaf attachment point is varying, as described earlier) can be extracted for each SARtree automatically. The number of SAR tables contained within a library can be a useful way to describe the library. For example, knowing the number of SAR tables for a given library would help quantify the utility of a chemical library for first pass screening or for focused optimization efforts.
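The R-points metric can be read as a count over the tree: every attachment point on the core and on each subcore contributes one. The sketch below follows that reading, which is an interpretation of the description above rather than a confirmed definition; the `Node` class, its field names, and the example tree are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    # attachment point name -> child nodes (subcores or leaves)
    sites: Dict[str, List["Node"]] = field(default_factory=dict)


def tree_stats(root: Node) -> Dict[str, int]:
    """Count R-points, subcores and leaves in a tree rooted at a core."""
    stats = {"r_points": 0, "subcores": 0, "leaves": 0}

    def visit(node: Node, is_root: bool) -> None:
        if not node.sites:              # no attachment points: a leaf
            stats["leaves"] += 1
            return
        if not is_root:                 # non-root node with sites: a subcore
            stats["subcores"] += 1
        stats["r_points"] += len(node.sites)
        for children in node.sites.values():
            for child in children:
                visit(child, False)

    visit(root, True)
    return stats


# Made-up tree: a core with two sites, one of which carries a phenyl subcore.
core = Node("core", {
    "R1": [
        Node("phenyl", {"ortho": [Node("Cl")], "para": [Node("F"), Node("OH")]}),
        Node("methyl"),
    ],
    "R2": [Node("H")],
})
stats = tree_stats(core)
```

For this example the core contributes two R-points and the phenyl subcore two more, giving four R-points, one subcore and five leaves.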

The R-points metric had a high correlation value of 0.73 with the number of hits in the screening libraries (Table 6). A plausible explanation is that as the DOV increases, so does the likelihood of finding actives. The types of shared chemical substructures found in screening libraries are limited (as we will show), and the DOV seems to be a key metric that describes how these substructures are dispersed within a library.

Table 6

           Total   Hits    SARtable  Sub Cores  Leaves  R Points
Total      1.00    0.62    0.45      0.45       0.49    0.45
Hits       0.62    1.00    -0.15     0.74       0.86    0.73
SARtable   0.45    -0.15   1.00      -0.32      -0.17   -0.32
Sub Cores  0.45    0.74    -0.32     1.00       0.85    0.96
Leaves     0.49    0.86    -0.17     0.85       1.00    0.88
R Points   0.45    0.73    -0.32     0.96       0.88    1.00
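As a check on how an entry of Table 6 relates to Table 5, the correlation between hits and R-points can be recomputed from the per-series values. The sketch below assumes the entries are ordinary Pearson correlation coefficients, which the text does not state explicitly, and reads the Hits and R Points columns of Table 5 for series 1 to 15 in order.

```python
from math import sqrt

# Hits and R Points columns of Table 5, series 1-15.
hits = [1, 3, 0, 34, 20, 6, 10, 26, 0, 0, 6, 0, 0, 0, 39]
r_points = [8, 13, 13, 15, 17, 15, 22, 12, 9, 6, 7, 6, 8, 5, 34]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson(hits, r_points)
print(round(r, 2))  # 0.73
```

Under the Pearson assumption the computed value agrees with the 0.73 reported for hits versus R-points.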

The number of SAR tables (defined as those with four or more molecules with only one leaf variation point) found within each library differed strikingly from library to library.

Other embodiments are within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:

1. A method of organizing a group of compounds comprising:

(a) providing a plurality of chemical structures representative of at least a portion of the group of compounds;

(b) identifying a common core from at least a portion of the plurality of structures;

(c) binning substituents present on the core according to site of attachment to the core;

(d) repeating steps (a)-(c) using a bin of substituents as the plurality of chemical structures, thereby identifying a subcore for the bin of substituents, if present.

2. The method of claim 1, further comprising repeating step (d) for each site of attachment of the identified common core.

3. The method of claim 1, wherein binning further includes clustering the substituents based on a measure of similarity of the substituents.

4. The method of claim 1, wherein binning further includes comparing the substituents to a database including chemical structures.

5. The method of claim 1, wherein binning further includes fragmenting the substituents and comparing the fragments.

6. The method of claim 1, wherein identifying a common core for a bin of substituents includes manually selecting a common core for the bin of substituents.

7. The method of claim 1, further comprising aligning two or more substituents.

8. The method of claim 1, further comprising identifying a distinct second common core from at least a portion of the plurality of structures.

9. The method of claim 1, wherein the method is implemented automatically by a computer.

10. The method of claim 9, wherein a common core is identified by a user.

11. The method of claim 9, wherein a common core is identified by the computer.

12. The method of claim 1, wherein the identified common core is a maximal common substructure.

13. The method of claim 3, wherein the measure of similarity includes a measure of structural similarity.

14. The method of claim 13, wherein the measure of structural similarity includes the number of ring bonds.

15. The method of claim 1, further comprising generating a database including a structure of the identified common core.

16. The method of claim 15, wherein the database further includes a structure of a substituent attached at a site of attachment to the identified common core.

17. The method of claim 16, wherein the database further includes a structure of each substituent attached at the site of attachment to the identified common core.

18. The method of claim 15, wherein the database includes a structure of a substituent present at each site of attachment on the common core.

19. The method of claim 15, wherein the database includes a structure of a subcore.

20. The method of claim 15, wherein a structure of a substituent in the database is associated with an identifier indicating a common core and a site of attachment on the common core.

21. The method of claim 15, wherein each structure of a substituent in the database is associated with an identifier indicating a compound of the group having the substituent.

22. The method of claim 15, wherein the database includes a structure of each core and of each substituent in the plurality of chemical structures.

23. The method of claim 15, wherein the database includes a structure of a compound of the group.

24. The method of claim 23, wherein the structure of a compound of the group is associated with information about a property of the compound of the group.

25. The method of claim 23, wherein the database includes a structure of each compound of the group.

26. The method of claim 24, further comprising extracting from the database a list of compounds having a single site of variation, and the associated information for each compound in the list.

27. The method of claim 24, further comprising extracting from the database every possible list of compounds having a single site of variation, and the associated information for each compound in each list.

28. A method of visualizing a group of compounds comprising creating a first representation of a plurality of chemical structures representative of at least a portion of the group of compounds, the first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root corresponds to a common core of at least a portion of the plurality; the primary branch point corresponds to a site of attachment of a substituent on the common core; and the leaf node corresponds to a substituent attached to the common core.

29. The method of claim 28, wherein the representation further includes an intermediate node connected to a primary branch point, wherein the intermediate node corresponds to a subcore belonging to at least two substituents attached to the corresponding site of attachment.

30. The method of claim 28, wherein the connection between the leaf node and the primary branch point is free of an intermediate node.

31. The method of claim 29, wherein the intermediate node is connected to a secondary branch point, the secondary branch point corresponding to a site of attachment of a substituent on the subcore.

32. The method of claim 28, wherein each site of attachment of a substituent to the common core is represented by a primary branch point connected to the root.

33. The method of claim 31, wherein each site of attachment of a substituent to the subcore is represented by a secondary branch point connected to the intermediate node.

34. The method of claim 28, wherein each substituent in the plurality is represented by a leaf node.

35. The method of claim 28, wherein the first representation is a graphical representation.

36. The method of claim 35, wherein a feature of the graphical representation illustrates a property of a compound of the group.

37. The method of claim 35, wherein a feature of the graphical representation illustrates a property of a plurality of compounds of the group.

38. The method of claim 36, wherein the property is a biological activity.

39. The method of claim 35, further comprising creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to the same common core structure as the root of the first graphical representation.

40. The method of claim 39, wherein the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the second graphical representation correspond to the same sites of attachment, subcores and substituents as the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the first graphical representation.

41. The method of claim 40, wherein a feature of the second graphical representation illustrates a second property of a compound of the group.

42. The method of claim 35, further comprising creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to a common core structure of at least two members of the group and is different from the common core structure corresponding to the root of the first graphical representation.

43. The method of claim 42, further comprising combining a portion of the first graphical representation with a portion of the second graphical representation thereby creating a third graphical representation, wherein the third graphical representation represents at least one compound not represented by either the first or second graphical representation.

44. The method of claim 35, further comprising creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the chemical structures of the second graphical representation are a subset of the chemical structures of the first graphical representation.

45. The method of claim 28, further comprising accessing a database including: a structure of a common core, the common core having a site of attachment of a substituent; and a structure of a substituent present at the site of attachment on the common core.

46. The method of claim 45, wherein the database further includes a structure of a subcore attached to the common core and having a site of attachment of a substituent.

47. The method of claim 45, wherein accessing includes counting the number of sites of attachment on the common core.

48. The method of claim 45, wherein accessing includes counting the number of subcores attached to each site of attachment on the common core.

49. The method of claim 48, wherein accessing includes counting the number of sites of attachment on each subcore.

50. The method of claim 48, wherein accessing includes counting the number of substituents attached to the common core and the number of substituents attached to each subcore.

51. The method of any of claims 1-27, further comprising:

(e) creating a first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root corresponds to the common core; the primary branch point corresponds to a site of attachment of a substituent on the common core; and the leaf node corresponds to a substituent attached to the common core.

52. The method of claim 51, wherein the representation further includes an intermediate node connected to a primary branch point, wherein the intermediate node corresponds to an identified subcore.

53. The method of claim 52, wherein the connection between the leaf node and the primary branch point is free of an intermediate node.

54. The method of claim 52, wherein the intermediate node is connected to a secondary branch point, the secondary branch point corresponding to a site of attachment of a substituent on the subcore.

55. The method of claim 51, wherein each site of attachment of a substituent to the common core is represented by a primary branch point connected to the root.

56. The method of claim 54, wherein each site of attachment of a substituent to the subcore is represented by a secondary branch point connected to the intermediate node.

57. The method of claim 51, wherein each substituent in the plurality is represented by a leaf node.

58. The method of claim 51, wherein the first representation is a graphical representation.

59. The method of claim 58, wherein a feature of the graphical representation illustrates a property of a compound of the group.

60. The method of claim 58, wherein a feature of the graphical representation illustrates a property of a plurality of compounds of the group.

61. The method of claim 59, wherein the property is a biological activity.

62. The method of claim 54, further comprising creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to the same common core structure as the root of the first graphical representation.

63. The method of claim 62, wherein the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the second graphical representation correspond to the same sites of attachment, subcores and substituents as the primary branch points, intermediate nodes, secondary branch points and leaf nodes of the first graphical representation.

64. The method of claim 63, wherein a feature of the second graphical representation illustrates a second property of a compound of the group.

65. The method of claim 59, further comprising creating a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to a common core structure of at least two members of the group and is different from the common core structure corresponding to the root of the first graphical representation.

66. The method of claim 51, further comprising accessing a database including: a structure of a common core present in at least two members of the group, the common core having a site of attachment of a substituent; and a structure of a substituent present at the site of attachment on the common core.

67. The method of claim 66, wherein the database further includes a structure of a subcore attached to the common core and having a site of attachment of a substituent.

68. The method of claim 67, wherein accessing includes counting the number of sites of attachment on the common core.

69. The method of claim 67, wherein accessing includes counting the number of subcores attached to each site of attachment on the common core.

70. The method of claim 69, wherein accessing includes counting the number of sites of attachment on each subcore.

71. The method of claim 69, wherein accessing includes counting the number of substituents attached to the common core and the number of substituents attached to each subcore.

72. A computer program for describing a group of compounds, comprising instructions for causing a computer system to:

(a) read a plurality of chemical structures representative of at least a portion of the group of compounds;

(b) identify a common core from at least a portion of the plurality of structures; and

(c) bin the substituents present on the core according to site of attachment to the core.

73. The computer program of claim 72, further comprising instructions for causing the computer system to:

(d) repeat steps (a)-(c) using a bin of substituents as the plurality of chemical structures, thereby identifying a subcore for the bin of substituents, if present.

74. The computer program of claim 73, further comprising repeating step (d) for each site of attachment of the identified common core.

75. The computer program of claim 73, wherein binning further includes clustering the substituents based on a measure of similarity of the substituents.

76. The computer program of claim 73, wherein binning further includes comparing the substituents to a database including chemical structures.

77. The computer program of claim 73, wherein binning further includes fragmenting the substituents and comparing the fragments.

78. The computer program of claim 73, wherein identifying a common core for a bin of substituents includes manually selecting a common core for the bin of substituents.

79. The computer program of claim 73, further comprising aligning two or more substituents.

80. The computer program of claim 73, further comprising generating a database including a structure of the identified common core.

81. The computer program of claim 80, wherein the database includes a structure of a substituent attached at a site of attachment to the identified common core.

82. The computer program of claim 81, wherein the database includes a structure of each substituent attached at the site of attachment to the identified common core.

83. The computer program of claim 80, wherein the database includes a structure of a substituent present at each site of attachment on the common core.

84. The computer program of claim 80, wherein the database includes a structure of a subcore.

85. The computer program of claim 80, wherein a structure of a substituent in the database is associated with an identifier identifying a common core and a site of attachment on the common core.

86. The computer program of claim 81, wherein each structure of a substituent in the database is associated with an identifier identifying a compound of the group having the substituent.

87. The computer program of claim 81, wherein the database includes a structure of each core and of each substituent in the plurality of chemical structures.

88. The computer program of claim 81, wherein the database includes a structure of a compound of the group.

89. The computer program of claim 88, wherein the structure of a compound of the group is associated with information about a property of the compound of the group.

90. The computer program of claim 88, wherein the database includes a structure of each compound of the group.

91. A computer program for accessing a database, comprising instructions for causing a computer system to: retrieve information in response to a user input from a database describing a plurality of chemical structures representative of at least a portion of a group of compounds, the database including: a structure of a common core of at least a portion of the plurality of structures; and a structure of a substituent present at a site of attachment of the common core, the structure of the substituent associated with an identifier identifying a compound having the substituent.

92. The computer program of claim 91, wherein the database further includes information about a property of a compound of the group.

93. The computer program of claim 92, wherein the database further includes a structure of a subcore attached to the common core and having a site of attachment of a substituent.

94. The computer program of claim 91, further comprising instructions for causing the computer system to create a first representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node, wherein the root corresponds to the common core; the branch point corresponds to the site of attachment; and the leaf node corresponds to the substituent.

95. The computer program of claim 94, wherein the first representation further includes an intermediate node connected to a primary branch point, wherein the intermediate node corresponds to the subcore.

96. The computer program of claim 95, wherein the first representation is a graphical representation.

97. The computer program of claim 96, wherein a feature of the first representation illustrates a property of a compound of the group.

98. The computer program of claim 97, wherein the computer program includes instructions for causing a computer system to retrieve from the database a chemical structure of a core, a subcore, a substituent, or a chemical structure of the plurality, and to display the chemical structure.

99. The computer program of claim 98, wherein displaying the chemical structure includes displaying information about a property of a compound of the group.

100. The computer program of claim 99, further comprising instructions for causing a computer system to create a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to the same common core structure as the root of the first graphical representation.

101. The computer program of claim 99, further comprising instructions for causing a computer system to create a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the root of the second graphical representation corresponds to a common core structure of at least two members of the group and is different from the common core structure corresponding to the root of the first graphical representation.

102. The computer program of claim 99, further comprising instructions for causing a computer system to create a second graphical representation including a root connected to a primary branch point, the primary branch point being connected to a leaf node; wherein the chemical structures of the second graphical representation are a subset of the chemical structures of the first graphical representation.

103. The computer program of claim 92, wherein the user input includes a structure of a substituent and the computer program includes instructions for causing a computer system to retrieve from the database a structure of a compound including the substituent.

104. The computer program of claim 103, wherein the computer program further includes instructions for causing a computer system to retrieve from the database information about a property of the compound.

105. The computer program of claim 104, wherein the computer program includes instructions for causing a computer system to retrieve from the database the structure of each compound described by the database including the substituent, and information about a property of each compound described by the database including the substituent.

106. The computer program of claim 93, wherein the user input includes a structure of a substituent, subcore, core, or compound of the group; or a partial structure of a substituent, subcore, core, or compound of the group; and the computer program further includes instructions for causing a computer system to retrieve from the database a structure of a compound described by the database including the structure of the user input.