WO2001049244A2

WO2001049244A2 - Database system and method useful for predicting putative ligand binding sites

Info

Publication number: WO2001049244A2
Application number: PCT/IL2001/000009
Authority: WO
Inventors: Marvin Edelman; Yosef Kuttner; Vladimir Sobolev
Original assignee: Yeda Research And Development Co. Ltd.
Priority date: 2000-01-03
Filing date: 2001-01-02
Publication date: 2001-07-12
Also published as: WO2001049244A3; IL133866A0; AU2392601A

Abstract

A method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest is provided. The method is effected by: (a) obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest; and (b) extracting from the structural data at least one consensus structural characteristic characterizing an interaction between the ligand of interest and at least one of the plurality of proteins, thereby identifying at least one consensus structural characteristic of binding sites of the ligand of interest.

Description

DATABASE SYSTEM AND METHOD USEFUL FOR PREDICTING PUTATIVE LIGAND BINDING SITES

FIELD AND BACKGROUND OF THE INVENTION

The present invention relates to a database, and to a system and method utilizing same, for predicting ligand binding capabilities of proteins.

Proteins are polymers which perform countless tasks within living organisms including catalyzing chemical reactions, transporting material within and between cells and forming part of their structure. The function of a protein is intimately related to its three dimensional structure. Deciphering protein structure is oftentimes essential to fully understanding protein activity.

Thus, one continuing objective of structural biology is to enable structural characterization of whole proteins or specific regions thereof which are responsible for an activity, such as, for example ligand binding activity.

Ligand binding is effected via a ligand binding site of a protein which is capable of specifically binding to a specific ligand.

A ligand binding site is a three-dimensional region of the protein whose ability to bind a ligand is a function generally associated with the sequence of amino acids that form the ligand binding site of the protein and the proper folding of this amino acid sequence into a three-dimensional structure.

Determining the location of a ligand binding site of a protein is, at times, a difficult undertaking. Typically, to determine the location of a binding site one would express and test deletion mutants of cloned genes for loss of binding activity and after broadly localizing a binding site to a particular domain, chemically synthesize individual peptides from that protein domain, and demonstrate binding of a synthesized peptide to the antibody or ligand of interest.

Other methods employ cleavage or expression products encompassing various regions of a protein which are used to determine ligand binding activity via immunological or binding assays. However, these approaches are disadvantageous in that they are both time consuming and oftentimes provide inaccurate results. In addition, mere binding of a short peptide to ligand does not guarantee that the naturally-occurring epitope is identical or even related to the short peptide. As a result, detailed peptide analysis of entire coding regions has been limited to a few proteins of major economic importance, such as, for example, insulin.

Additional methods such as x-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy enable determination of an atomic structure of a protein or a protein-ligand complex. While these methods can provide detailed and accurate results, they are time consuming and oftentimes difficult to carry out. Moreover, since such methods require structural stability to enable determination, proteins are oftentimes complexed with ligand analogues (mostly inhibitors) rather than the ligand itself.

Due to the limitations described above, there has been a surge of interest in theoretical methods for determining ligand binding pocket structures. Advances in theoretical structural biology and computational power have led to the emergence of theoretical approaches designed for predicting binding site regions of proteins based on sequence and/or structural data thereof.

For example, sequence data has been analyzed to uncover preferences for specific amino acids at protein binding sites (Villar and Kauvar, 1994), to predict active sites from amino acid sequence data (Numav and Kidokoros, 1993), and to generate theoretical models from primary sequences of a particular class of proteins (Williamson, 1995). In addition, evolutionary information has been utilized to identify patterns of residue conservation that correlate with functional divergence within a protein family (Lichtarge et al., 1996).

Furthermore, structural data has been used to uncover similar spatial arrangements of particular atoms in different proteins. For example, in serine proteases, three amino acid residues, Asp, His and Ser, are arranged similarly in three dimensional space to perform a specific function despite different folding motifs. This particular amino acid residue arrangement has also been found in lipase (Brady et al., 1990), while a similar arrangement containing Glu instead of Asp was found in acetylcholinesterase (Sussman et al., 1991). These similarities can be extended even further if the location and orientation of bound water molecules are included, as in beta-lactamase (Glu 166 and Asp 170) and aspartic proteases (Asp 215 and Asp 32 in endothiapepsin) (Pearl, 1993). Nucleotide base recognition by proteins also shows remarkable similarities despite considerable differences in primary sequence and chain folding (Kobayashi and Go, 1997). Denessiouk and Johnson (2000) found similarities in the relative positions of different binding motifs along polypeptide chains from related proteins, but not in their three dimensional configuration, while Moodie et al. (1996) uncovered similarities in shape and polarity properties at ligand/protein interfaces, although a specific recognition motif in terms of particular residue/ligand interactions was not uncovered.

These theoretical approaches generally provide predictions based upon consensus sequence data or a crude analysis of three dimensional data. Although such approaches can provide predictions as to the binding site regions of proteins of interest, such predictions are not accurate enough to be considered reliable in most cases. While reducing the present invention to practice, the present inventors have created a novel approach useful for generating consensus structural characteristics of binding domains. As further detailed hereinunder, such consensus structural characteristics enable accurate and rapid detection of putative ligand binding sites.

SUMMARY OF THE INVENTION

According to one aspect of the present invention there is provided a method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest, the method comprising: (a) obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest; and (b) extracting from the structural data at least one consensus structural characteristic characterizing an interaction between the ligand of interest and at least one of the plurality of proteins, thereby identifying at least one consensus structural characteristic of binding sites of the ligand of interest.

According to another aspect of the present invention there is provided a method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest, the method comprising: (a) obtaining first structural data pertaining to a plurality of proteins while complexed with the ligand of interest and identifying a binding site of at least one of the plurality of proteins interacting with the ligand of interest; (b) obtaining second structural data pertaining to the binding sites of the plurality of proteins while free of the ligand of interest; and (c) extracting from the second structural data at least one consensus structural characteristic of a binding site being for binding the ligand of interest, thereby identifying at least one consensus structural characteristic of binding sites of the ligand of interest.

According to yet another aspect of the present invention there is provided a method of screening a database of structural data of proteins for individual proteins potentially binding a ligand of interest, the method comprising: (a) defining at least one consensus structural characteristic of structurally characterized binding sites of the ligand of interest; and (b) identifying individual proteins in which the at least one consensus structural characteristic exists.

According to further features in preferred embodiments of the invention described below, step (a) is effected by: (i) obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest; and (ii) extracting from the structural data at least one consensus structural characteristic characterizing an interaction between the ligand of interest and at least one of the plurality of proteins, thereby identifying at least one consensus structural characteristic of binding sites of the ligand of interest.

According to still further features in the described preferred embodiments step (a) is effected by: (i) obtaining first structural data pertaining to a plurality of proteins while complexed with the ligand of interest and identifying structures of the plurality of proteins interacting with the ligand of interest; (ii) obtaining second structural data pertaining to the structures of the plurality of proteins while free of the ligand of interest; and (iii) extracting from the second structural data at least one consensus structural characteristic characterizing a structure being for binding the ligand of interest, thereby identifying at least one consensus structural characteristic of binding sites of the ligand of interest.

According to still another aspect of the present invention there is provided a method of generating a polypeptide capable of binding a ligand of interest comprising: (a) defining at least one consensus structural characteristic of structurally characterized binding sites of the ligand of interest; and (b) synthesizing or modifying the polypeptide so as to include an amino acid region having the at least one consensus structural characteristic, thus generating the polypeptide capable of binding the ligand of interest.

According to an additional aspect of the present invention there is provided a method of predicting the effect of a binding site modification on binding between a protein and a specific ligand thereof, the method comprising: (a) defining at least one consensus structural characteristic of structurally characterized binding sites of the specific ligand; and (b) determining the effect of the binding site modification on the binding between the protein and the specific ligand according to the at least one consensus structural characteristic.

According to yet an additional aspect of the present invention there is provided a system useful for screening a database of structural data of proteins for individual proteins potentially binding a ligand of interest, the system including: (a) a data storage media storing, as records, at least one consensus characteristic data defining a binding site of a ligand; and (b) a data processor for executing a software application being for comparing structural data of the individual proteins to the at least one consensus characteristic defining a binding site of a ligand to thereby enable detection of a putative binding site of an individual protein.

According to still further features in the described preferred embodiments the system further comprises a server communicating with, or forming a part of, the computing platform, the server being capable of communicating with a user client being operated by a user for providing the user access to the software application.

According to still further features in the described preferred embodiments the server forms a part of a communication network.

According to still further features in the described preferred embodiments the communication network is the World Wide Web.

According to still further features in the described preferred embodiments the user client is a computer operating a Web browser application.

According to still an additional aspect of the present invention there is provided a data storage media comprising, as retrievable records, at least one consensus characteristic data defining a binding site of a ligand.

According to still further features in the described preferred embodiments the media is selected from the group consisting of a magnetic media, an optical media and an optical-magnetic media.

According to still further features in the described preferred embodiments the optical media is a compact disk (CD) or a digital video disk (DVD).

According to still further features in the described preferred embodiments the at least one consensus structural characteristic is derived from relative positions of atoms of the structurally characterized binding sites.

According to still further features in the described preferred embodiments the at least one consensus structural characteristic is derived from a type of atoms of the structurally characterized binding sites.

According to still further features in the described preferred embodiments the at least one consensus structural characteristic is derived from atomic contacts of atoms of the structurally characterized binding sites.

According to still further features in the described preferred embodiments the atomic contacts include intermolecular and intramolecular atomic contacts.

According to still further features in the described preferred embodiments the at least one consensus structural characteristic is derived by structurally superimposing at least said ligand binding site of at least two of said plurality of proteins.

According to still further features in the described preferred embodiments the at least one consensus structural characteristic includes cluster data, the cluster data including coordinate information of at least one atom of the binding site of the at least one of the plurality of proteins.

According to still further features in the described preferred embodiments the cluster data is generated by: (i) generating from the structural data a table of protein atoms and their number of atomic neighbors within a designated threshold distance; (ii) sorting the protein atoms by decreasing number of neighbors; (iii) removing a first atom and its neighbors from the table of protein atoms; and optionally (iv) repeating steps (i)-(iii), to thereby obtain additional clusters, until the table of protein atoms is empty.
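Steps (i)-(iv) above can be sketched in a few lines of Python. This is a minimal illustration and not the inventors' implementation: the function name `greedy_clusters`, the tie-breaking rule (lowest index first) and the sample coordinates are assumptions introduced here, and atom class is ignored for brevity.

```python
from math import dist

def greedy_clusters(points, threshold):
    """Group points greedily; returns clusters as lists of indices into `points`."""
    remaining = set(range(len(points)))   # the "table" of atoms not yet clustered
    clusters = []
    while remaining:
        # (i) neighbour table, restricted to atoms still in the table
        neighbours = {
            i: [j for j in remaining
                if j != i and dist(points[i], points[j]) <= threshold]
            for i in remaining
        }
        # (ii) atom with the most neighbours comes first
        #      (ties broken by lowest index: an assumption, not from the text)
        center = min(remaining, key=lambda i: (-len(neighbours[i]), i))
        # (iii) remove the first atom and its neighbours from the table
        members = [center] + sorted(neighbours[center])
        clusters.append(members)
        remaining -= set(members)
        # (iv) the loop repeats until the table is empty
    return clusters

# Made-up coordinates (in angstroms) purely for illustration.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 0, 0), (10.5, 0, 0), (30, 0, 0)]
clusters = greedy_clusters(points, threshold=1.5)
```

With these sample points, the first three atoms form one cluster, the next two a second, and the isolated atom a singleton cluster.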

The present invention successfully addresses the shortcomings of the presently known configurations by providing a rapid and accurate method which can be utilized to derive a consensus structure for a protein binding site of a ligand of interest.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

In the drawings:

FIG. 1 is a black box diagram illustrating the system of the present invention.

FIG. 2 is a flowchart outlining the steps used to define ligands according to the teachings of the present invention.

FIG. 3 is a flowchart outlining the steps used for generating a table listing the individual atoms and their respective number of neighbors according to the teachings of the present invention.

FIG. 4 is a flowchart outlining the steps used for generating cluster data from the table generated by the flowchart of Figure 3.

FIG. 5 is a table listing putative clusters obtained by the algorithm of the present invention. A marked (filled) cell indicates that the particular PDB entry contributes a member to the cluster, while an unmarked (empty) cell indicates that the particular entry does not. The Roman numerals indicate the atom class.

FIG. 6 is a table illustrating cluster typing of 17 PDB files. Cells are color designated according to the following cluster types: white - Hydrophobic (most frequent class, IV); light gray - Hydrogen Bond Acceptor (most frequent class, II); gray - Hydrogen Bond Donor (most frequent class, III) and black - empty cells.

FIG. 7 illustrates the shape and positions of clusters 1, 2, 5 and 10, relative to the adenine ring of ATP/ANP (described in the table shown by Figure 5). Cluster 10 shows the position of donor atoms which hydrogen bond with the N1 atom of the 6-member adenine ring; cluster 5 shows the position of acceptor atoms which hydrogen bond with atom N6 at the edge of this ring; clusters 1 and 2 show the positions of hydrophobic atoms which contact in one case the C2 atom of the adenine ring from one side of the ring plane, and in the other case, the C5 atom of the ring from the other side of the ring plane. For both of the hydrophobic clusters, multiple atoms of the aromatic ring contact individual protein atoms. The maximum radius of the clusters is 2 Å.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is of a database of consensus structural characteristics defining a consensus ligand binding site which can be utilized to predict putative ligand binding sites of proteins of interest, to design synthetic ligands and binding proteins and to modify existing ligand binding sites and/or ligands to thereby modify binding specificity, ligand-binding protein affinity and/or binding stability thereof.

The principles and operation of the present invention may be better understood with reference to the drawings and accompanying descriptions.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.

While reducing the present invention to practice, an algorithm was developed for finding specific atoms in a specific geometrical arrangement which are characteristic of specific ligand binding sites.

This algorithm, which is termed herein "CLUSTER", enables generation of characteristics defining ligand binding sites by grouping information derived from similar binding sites and assigning such grouped information to a single coordinate system of superimposed ligand atoms. Contacting atoms of such a group can then be clustered according to spatial location and atom class. Generated cluster centers can then be utilized to form a network of coordinates useful for defining a consensus binding structure. Such consensus binding structures are useful for detecting ligand binding sites, and therefore ligand binding activity, in uncharacterized proteins, as well as for predicting the effect of protein modifications on ligand binding activity.

As used herein the term "ligand" refers to an ion or a molecule capable of specifically binding with a binding site of a protein. A ligand can be, for example, an energy rich molecule, substrate or catalyst capable of binding to an enzyme, a receptor region capable of binding to a protein molecule or an epitope capable of binding to an antibody.

Thus, according to one aspect of the present invention there is provided a method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest. The method is effected by obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest and/or while free of this ligand and extracting information from such structural data to thereby generate at least one consensus structural characteristic which characterizes the ligand binding site.

Preferably, the structural data is derived from x-ray crystallography, NMR spectra analysis or theoretical models of these proteins and as such it includes a list of proposed atomic contacts for each atom, three dimensional coordinate data of each atom, distances between adjacent atoms, solvent accessibility of an atom and atom class.
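By way of illustration only, the per-atom structural data listed above could be held in a record such as the following. The class name `AtomRecord`, its field names, the units and the sample values are all hypothetical and are not taken from the described system; the atom class labels follow the Roman numerals used in the drawings.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Tuple

@dataclass
class AtomRecord:
    atom_name: str                        # atom name within its residue
    residue: str                          # residue label, e.g. "VAL 57" (made up)
    coords: Tuple[float, float, float]    # three dimensional coordinates, angstroms
    atom_class: str                       # e.g. "II", "III" or "IV"
    solvent_accessibility: float          # accessible surface (assumed square angstroms)
    contacts: List[str] = field(default_factory=list)  # proposed atomic contacts

def distance(a: AtomRecord, b: AtomRecord) -> float:
    """Distance between two atoms, derived from their coordinates."""
    return dist(a.coords, b.coords)

# Made-up records purely for illustration.
donor = AtomRecord("N", "VAL 57", (0.0, 0.0, 0.0), "III", 3.2, ["N1"])
hydrophobic = AtomRecord("CB", "ALA 70", (3.0, 4.0, 0.0), "IV", 10.0)
separation = distance(donor, hydrophobic)
```

Keeping the contact list on each atom, rather than in a separate table, is merely a convenience of this sketch.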

As further detailed in the examples section which follows, such consensus structural characteristic(s) can be derived from information pertaining to the relative positions of the atoms, the type of atoms and the intermolecular and/or intramolecular atomic contacts of atoms defining the binding site of the structurally characterized proteins.

As is further described in Example 1 of the Examples section which follows, such information is preferably derived from a non-redundant set of binding sites of a specific ligand, grouped and clustered according to predefined criteria thus generating one or more consensus structural characteristics.

According to a preferred embodiment of the present invention, a consensus structural characteristic of a specific ligand generated according to the present invention is tested for its ability to predict binding site regions of proteins which are known to bind the specific ligand but which were not used for generating the consensus structural characteristic.

Determining the predictive ability of a consensus structural characteristic is particularly useful in cases where a specific ligand binding site is characterizable by more than one consensus structural characteristic. In such cases, a predictive probability generated for each consensus structural characteristic can be utilized to assign a 'weight' for each consensus structural characteristic of a plurality of consensus structural characteristics used for determining protein ligand binding regions.
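One simple way such weights might be assigned is to normalize the predictive probabilities so that they sum to one. The text does not specify a weighting scheme, so the following is only a sketch under that assumption; the function name and the probability values are hypothetical.

```python
def weights_from_probabilities(probabilities):
    """Normalize per-characteristic predictive probabilities into weights."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    return {name: p / total for name, p in probabilities.items()}

# Made-up predictive probabilities for three consensus characteristics.
probabilities = {"characteristic_a": 0.9, "characteristic_b": 0.6, "characteristic_c": 0.5}
weights = weights_from_probabilities(probabilities)
```

Under this scheme a characteristic that predicts known binding sites more reliably contributes proportionally more to a combined score.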

The consensus structural characteristics generated according to the teachings of the present invention can be utilized for: (i) screening individual proteins to thereby detect putative binding sites therein, (ii) designing synthetic ligands and binding proteins and (iii) modifying existing ligand binding sites and/or ligands (e.g. drugs) to thereby modify binding specificity, ligand-binding protein affinity and/or binding stability thereof.

For example, consensus structural characteristics generated by the present invention can be utilized by a system useful for screening proteins for putative ligand binding sites.

Thus according to another aspect of the present invention, and as shown in Figure 1, there is provided a system useful for screening a database including structural data of proteins for individual proteins including a putative binding site of a ligand of interest, which is referred to herein as system 10.

System 10 includes a computing platform 12 which includes data storage media 14 which stores, as accessible records, consensus structural characteristic(s) data which defines a consensus binding site of a ligand. System 10 further includes a data processor 16 for executing a software application designed and configured for comparing structural data derived from individual proteins of the database to the consensus structural characteristic(s) data stored in storage media 14. Such a comparison enables detection of a putative binding site of an individual protein or proteins.

According to a preferred embodiment of this aspect of the present invention, system 10 further includes a server 18 communicating with, or forming a part of, computing platform 12. Server 18 is capable of communicating with a user client 20 operated by a user, for providing the user access to the software application and optionally the consensus structural characteristic(s) data of computing platform 12.

As used herein, the phrase "user client" generally refers to a computer and includes, but is not limited to, personal computers (PC) having an operating system such as DOS, Windows, OS/2™ or Linux; Macintosh™ computers; computers having JAVA™ -OS as the operating system; and graphical workstations such as the computers of Sun Microsystems™ and Silicon Graphics™, and other computers having some version of the UNIX operating system such as AIX™ or SOLARIS™ of Sun Microsystems™; or any other known and available operating system; personal digital assistants (PDA), cellular telephones having Internet capabilities (e.g., wireless application protocol, WAP) and Web TVs.

For purposes of this specification, the term "Windows™" includes, but is not limited to, Windows2000™, Windows95™, Windows 3.x™ in which "x" is an integer such as "1", Windows NT™, Windows98™, Windows CE™ and any upgraded versions of these operating systems by Microsoft Corp. (USA).

Preferably, communication between server 18 and user client 20 is mediated by a communication network 22. As used herein, the phrase "communication network" preferably refers to the Internet as manifested by the World Wide Web (WWW) of computers, although the system of the present invention can also be implemented within Intranets or Extranets or any other open or closed communication network. Thus, server 18 provides a user of user client 20 with access to the software application and optionally also the records stored by data storage media 14.

Preferably, such access is provided via a Web site stored, maintained and operated by system 10. As such, a user preferably accesses the data via a Web browser application operating in user client 20.

As used herein, the phrase "Web browser" or the term "browser" refers to any software application which can display text, graphics, or both (using built in features or dedicated plug-ins), from Web pages on World Wide Web sites. Examples of Web browsers include Netscape Navigator, Internet Explorer, Opera, iCab and the like.

Herein, the term "Web site" is used to refer to at least one Web page, and preferably a plurality of Web pages, virtually connected to form a coherent group of interlinked documents.

Herein, the term "Web page" refers to any document written in a mark-up language including, but not limited to, HTML (hypertext mark-up language) or VRML (virtual reality modeling language), dynamic HTML, XML (extensible mark-up language) or related computer languages thereof, as well as to any collection of such documents reachable through one specific Internet address or at one specific World Wide Web site, or any document obtainable through a particular URL (Uniform Resource Locator).

Thus, the present invention provides a novel approach for detecting putative ligand binding sites and for further characterizing known ligand binding sites.

The algorithm of the present invention can be utilized to search a protein structure database for a consensus spatial arrangement of atoms around a given chemical moiety. Such a consensus spatial arrangement is based on a network of atom clusters each including a single atom type and each being derived from a ligand-directed, structural multiple alignment of ligand binding sites.

Although atomic level characterization of ligand binding sites has been previously described, such prior art characterization cannot be utilized to define consensus structural characteristics of ligand binding sites. For example, Kobayashi and Go (1997) developed a method to search for protein local structures at ligand binding sites by superimposing two proteins and characterizing superimposed regions according to four atom classes: carbon, oxygen, nitrogen, or sulfur. Although such an approach can provide useful information it cannot be used to derive a consensus structure.

Bruno et al., (1997) generated a computerized listing of crystallographic and theoretical data on intermolecular non-bonded interactions. This listing, which is known as the IsoStar library was later used to identify interaction sites in proteins (Verdonk et al., 1999).

Although the IsoStar library superimposes entries from several files to place all contacting groups in the same system of coordinates, such superimposition merely catalogues the contacting groups and as such it cannot be used for defining consensus structural characteristics of ligand binding sites.

Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.

EXAMPLES

Reference is now made to the following examples, which together with the above descriptions, illustrate the invention in a non limiting fashion.

EXAMPLE 1

Database Creation

The goal of the present invention is to find features common to ligand binding sites of proteins or any given chemical moiety thereof. At the atomic level, numerous protein surface regions which are complementary to small ligands can be uncovered. In addition, non-specific regions of ligands that form only a small number of contacts can also be uncovered. To filter out this background, analysis was restricted to ligands having five or more heavy (viz. non-hydrogen) atoms and five or more contact residues between the ligand and the protein. This minimum number of atoms required for ligand binding analysis excludes atomic ions and water molecules.

Ligand selection:

The first step in creating the database of the present invention is effected by selection of ligands. In the PDB database, a ligand is described by its three-letter code name and listed under HETATM or ATOM in the coordinates section of an entry. Different regions of a single ligand which are listed under different codes are considered as separate ligands; no distinction was made between ligands diffused into the protein crystal or co-crystallized. In addition, only the ligands listed under HETATM are selected for analysis; this analysis excludes ligands that are covalently bound to protein atoms as well as PO4³⁻ and SO4²⁻ ions.
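The selection of HETATM-listed ligands can be sketched as follows, assuming the standard fixed-column layout of PDB coordinate records. The function names, the residue codes in `EXCLUDED` (water, phosphate, sulfate, an assumed reading of the exclusions named above) and the sample records are introduced here for illustration; detection of covalently bound ligands is not attempted, and the five-heavy-atom minimum is taken from the Database Creation paragraph.

```python
from collections import defaultdict

EXCLUDED = {"HOH", "PO4", "SO4"}  # assumed residue codes for excluded groups

def select_ligands(pdb_lines, min_heavy_atoms=5):
    """Group HETATM heavy atoms by (code, chain, residue number) and filter."""
    groups = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue                          # only HETATM-listed ligands
        res_name = line[17:20].strip()        # three-letter code, columns 18-20
        if res_name in EXCLUDED:
            continue
        element = line[76:78].strip().upper() # element symbol, columns 77-78
        if element in ("H", "D"):
            continue                          # hydrogens are not heavy atoms
        key = (res_name, line[21], int(line[22:26]))
        groups[key].append(line[12:16].strip())
    return {k: v for k, v in groups.items() if len(v) >= min_heavy_atoms}

def hetatm(serial, atom, res, chain, seq, elem):
    """Build a made-up HETATM record with fields in their standard columns."""
    return (f"HETATM{serial:5d} {atom:<4s} {res:>3s} {chain}{seq:4d}{'':4s}"
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{20.0:6.2f}{'':10s}{elem:>2s}")

lines = (
    [hetatm(i + 1, name, "ATP", "A", 500, elem)
     for i, (name, elem) in enumerate(
         [("PG", "P"), ("O1G", "O"), ("N1", "N"), ("C2", "C"), ("C4", "C"), ("H2", "H")])]
    + [hetatm(10, "O", "HOH", "A", 601, "O")]
    + [hetatm(11 + i, name, "SO4", "A", 700, elem)
       for i, (name, elem) in enumerate(
           [("S", "S"), ("O1", "O"), ("O2", "O"), ("O3", "O"), ("O4", "O")])]
    + [hetatm(20, "MG", "MG", "A", 800, "MG")]
)
ligands = select_ligands(lines)
```

In this made-up input only the ATP group survives: the water and sulfate groups are excluded by code, and the single-atom ion falls below the heavy-atom minimum.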

Figure 2 is a flow chart outlining the process of ligand selection. As shown therein, a PDB entry which contains a ligand as determined by X-ray crystallography is printed out and included in the database of the present invention.

Creating the non-homologous dataset:

Since some proteins bind several ligands and since a protein can bind the same ligand more than once, especially if this protein has several sub-units, one entry can appear more than once in the database. To generate the non-homologous dataset of the present invention, only one entry from such multiple entries was considered.

A binding pocket is defined by the amino acids of a protein which are in contact with ligand atoms. "Inter-atomic contact" refers to the contact surface between atoms. Since such structures can exhibit a high degree of both sequential and structural homology, two binding pockets are considered different if the list of residues in contact with the ligand (the constituents of the binding site) differs by approximately 50% or more of the residues.
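The pocket comparison can be illustrated with a short sketch. The text does not state how the percentage difference is computed; here it is assumed to be the fraction of residues, out of all residues appearing in either list, that are not shared by both. The function name, the threshold default of one half and the residue labels are hypothetical.

```python
def pockets_differ(residues_a, residues_b, threshold=0.5):
    """True if the two contact-residue lists differ by `threshold` or more."""
    a, b = set(residues_a), set(residues_b)
    union = a | b
    if not union:
        return False                      # two empty lists are not "different"
    unshared = len(a ^ b)                 # residues present in only one list
    return unshared / len(union) >= threshold

# Made-up contact lists purely for illustration.
pocket_1 = ["GLY 12", "LYS 16", "THR 17", "ASP 57"]
pocket_2 = ["GLY 12", "LYS 16", "THR 17", "GLU 60"]   # 2 of 5 residues unshared
pocket_3 = ["GLY 12", "ARG 90", "PHE 91", "TYR 120"]  # 6 of 7 residues unshared
```

Under this assumed measure, `pocket_1` and `pocket_2` would be treated as the same binding pocket, while `pocket_1` and `pocket_3` would be treated as different.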

In order not to lose binding data from these homology constraints, the proteins exhibiting a high degree of homology between binding-sites were further analyzed for function (as defined in the ENZYME Nomenclature database). Proteins of the same function were gathered, and only the family member with the highest resolution was retained as a representative. Within this group, structures having a crystallographic resolution of 2.6 Å or better were taken for analysis.
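The representative-selection rule can be sketched as below, assuming each entry is summarized by a PDB identifier, a function label and a resolution in Å. The tuple layout, the placeholder identifiers and the function labels are hypothetical.

```python
def select_representatives(entries, max_resolution=2.6):
    """Keep the best-resolved entry per function, then apply the cutoff.

    entries: iterable of (pdb_id, function_label, resolution_in_angstrom).
    A numerically smaller resolution value means a better-resolved structure.
    """
    best = {}
    for pdb_id, function, resolution in entries:
        current = best.get(function)
        if current is None or resolution < current[1]:
            best[function] = (pdb_id, resolution)
    return sorted(pdb_id for pdb_id, res in best.values()
                  if res <= max_resolution)


# Placeholder identifiers and invented resolutions.
entries = [
    ("aaa1", "function A", 2.2),
    ("aaa2", "function A", 1.9),   # best of function A
    ("bbb1", "function B", 3.0),   # best of function B, but worse than 2.6
    ("ccc1", "function C", 2.6),   # exactly at the cutoff
]
representatives = select_representatives(entries)
```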

Based upon the above described criteria, a database of ATP and ATP-analogue ligand-containing PDB files was created. This database was divided into two separate data sets, which are shown in Tables 1-2 below: one including an ATP-analogue ligand (Table 1) and the other including an ATP ligand (Table 2).

Table 1 - A non-redundant dataset of protein/ANP complexes according to the present invention

Table 2 - The 10 protein/ligand complexes in the non-redundant ATP dataset of the present invention

The CLUSTER algorithm:

While reducing the present invention to practice, an algorithm was developed for finding specific atoms in a specific geometrical arrangement which are characteristic of, for example, adenine binding-sites.

This algorithm, which is termed herein as "CLUSTER", performs structural multiple alignment of binding-sites, inputs a list of entries from the dataset and, following processing which is described below, outputs files in PDB format.

Processing includes the following steps:

(i) For each entry in the input list find all protein atoms in contact with the ligand and assign their coordinates to a file in PDB format; this step is effected via LPC software (Sobolev et al., 1999).

(ii) Superimpose the molecular structures of these files using the rigid adenine moiety as template. (It is sufficient to use three atoms for this task since three atoms define a plane; ligand atoms which are numbered as C6, C2, and C4 in PDB file 1ATP were chosen in this case).

(iii) Produce a table of protein atoms and their number of atomic neighbors within a designated threshold distance such that all neighbors contact the same ligand atom.

(iv) Sort protein atoms by decreasing number of neighbors. Each atom sorted is considered a center (or 'coordinator') of a cluster, the atom having the largest number of neighbors being postulated to be closest to the true center.

(v) Remove the first atom and all its neighbors from the table and output to file in a PDB format. This file contains the coordinates of the atoms in a given cluster.

(vi) Repeat step (v), to obtain additional clusters, until no further atoms remain in the table.

The above described processing steps (i-vi) are outlined in Figures 3-4.
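Step (ii) above can be illustrated with a three-atom reference frame. The sketch below is one simple way of bringing structures into a common coordinate system and is not asserted to be the procedure used by CLUSTER: each structure's atoms are re-expressed in a frame built from three non-collinear ligand atoms (standing in for C6, C2 and C4), so that structures sharing the same rigid moiety coincide. All function names and coordinates are assumptions made for the example.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def local_frame(p1, p2, p3):
    """Frame with origin p1, x along p1->p2, z normal to the p1-p2-p3 plane."""
    x = unit(sub(p2, p1))
    z = unit(cross(x, sub(p3, p1)))
    y = cross(z, x)
    return p1, (x, y, z)


def to_frame(point, frame):
    """Coordinates of a point expressed in the given local frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, axis) for axis in axes)


# Structure A: three reference ligand atoms and one protein atom (invented).
ref_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
atom_a = (0.5, 0.5, 2.0)

# Structure B: the same geometry rotated 90 degrees about z and translated.
ref_b = [(5.0, -3.0, 1.0), (5.0, -2.0, 1.0), (4.0, -3.0, 1.0)]
atom_b = (4.5, -2.5, 3.0)

aligned_a = to_frame(atom_a, local_frame(*ref_a))
aligned_b = to_frame(atom_b, local_frame(*ref_b))
```

After the transformation both protein atoms have the same coordinates, which is the condition needed before neighbors can be counted across structures. For reference atoms that are not exactly congruent, a least-squares fit would be the more usual choice.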

As shown in Figure 3, each individual file of an input, which includes the number and identity of the superimposed files (Nf) and the threshold distance between neighboring atoms thereof, is processed by aligning binding-pocket structures, thus generating a table listing the individual atoms and their respective number of neighbors (as described by steps i-iii hereinabove). Every atom and its neighbors form contact with the same ligand atom. The threshold distance is the maximum distance between two neighboring atoms; m and k define the current line numbers in files M and K, respectively; Na represents a variable used to count neighbors for each atom.

As shown in Figure 4, the table output of Figure 3 is further processed to determine cluster formation (as described by steps iv-vi hereinabove). Protein atoms in the table generated by the flow chart of Figure 3 are sorted according to a decreasing number of neighbors and each atom of the total (N) is considered as coordinator of a cluster. The algorithm then removes the first atom (M) and all its neighbors (L) and outputs to file in a PDB format. When L is of the same atom class as M, it is listed as ATOM, otherwise it is listed under the REMARK line of the file. To obtain all clusters, the previous steps are repeated until no further atoms remain in the table.
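Steps (iii)-(vi) can be sketched as a greedy clustering over already superimposed atoms. This is a minimal sketch under stated assumptions: the tuple layout and function name are hypothetical, neighbor counts are computed once before sorting (the text leaves open whether they are recomputed after each removal), and the ATOM/REMARK distinction by atom class is omitted.

```python
import math


def cluster_atoms(atoms, threshold):
    """Greedy clustering of superimposed protein atoms.

    atoms: list of (atom_id, contacted_ligand_atom, (x, y, z)).
    Returns a list of clusters, each a list of atom_ids with the
    coordinator first.
    """
    def neighbors(i, pool):
        # Neighbors must lie within the threshold distance and must
        # contact the same ligand atom as atom i.
        _, ligand_atom, xyz = atoms[i]
        return [j for j in pool
                if j != i
                and atoms[j][1] == ligand_atom
                and math.dist(xyz, atoms[j][2]) <= threshold]

    remaining = set(range(len(atoms)))
    # Steps (iii)-(iv): count neighbors, then sort by decreasing count.
    counts = {i: len(neighbors(i, remaining)) for i in remaining}
    order = sorted(remaining, key=lambda i: (-counts[i], i))

    clusters = []
    # Steps (v)-(vi): remove each coordinator with its remaining neighbors.
    for i in order:
        if i not in remaining:
            continue
        members = [i] + sorted(neighbors(i, remaining))
        remaining.difference_update(members)
        clusters.append([atoms[m][0] for m in members])
    return clusters


# Invented coordinates for illustration.
atoms = [
    ("a1", "N6", (0.0, 0.0, 0.0)),
    ("a2", "N6", (1.0, 0.0, 0.0)),
    ("a3", "N6", (0.5, 0.5, 0.0)),
    ("b1", "N6", (5.0, 0.0, 0.0)),   # too far from the others
    ("c1", "C2", (0.2, 0.0, 0.0)),   # close, but contacts another ligand atom
]
clusters = cluster_atoms(atoms, threshold=1.5)
```

In the example, the three mutually close atoms contacting N6 form one cluster, while the distant atom and the atom contacting C2 each form a singleton cluster.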

The largest distance between any atom in a cluster and its coordinator is the pseudo radius. The real radius is determined as the largest distance between the calculated geometric center of a cluster and any atom in that cluster.
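The two radii defined above can be computed as in the following sketch, which assumes that a cluster is given as a list of Cartesian coordinates that includes its coordinator. The function names are introduced here for illustration.

```python
import math


def pseudo_radius(coordinator, members):
    """Largest distance between the coordinator and any atom in the cluster."""
    return max(math.dist(coordinator, m) for m in members)


def real_radius(members):
    """Largest distance between the geometric center and any cluster atom."""
    n = len(members)
    center = tuple(sum(m[k] for m in members) / n for k in range(3))
    return max(math.dist(center, m) for m in members)


# Invented collinear example: the coordinator sits at one end of the cluster.
members = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coordinator = members[0]

r_pseudo = pseudo_radius(coordinator, members)
r_real = real_radius(members)
```

In this example the pseudo radius (2.0) exceeds the real radius (1.0) because the coordinator is not at the geometric center of the cluster.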

Thus, at the end of the above described process one obtains files in PDB format each representing a cluster. The data in each file contain ATOM lines which when taken together, define a cluster as a group of atoms from different proteins that are (i) within a given radius in a three dimensional space, (ii) which contact the same ligand atom and (iii) which belong to the same LPC atom class (atoms which fulfill (i) and (ii) but not (iii) are listed under the "REMARK" line described above).

EXAMPLE 2

In an attempt to find ligand binding site similarities at the amino acid residue level, the above described rules and processing steps were utilized to generate two non-redundant datasets, one including seven structures of 5'-adenyl-imido-triphosphate (ANP), which substitutes nitrogen for the O3B group (oxygen 3 of the beta phosphate, using PDB nomenclature) of adenosine triphosphate (ATP) (shown in Table 1 above), and the other including ten structures of ATP (shown in Table 2 above).

The binding pocket residues of the seven ANP files were determined using the LPC software described hereinabove. Rasmol software was used for viewing the proteins and their ligands. The following was observed from analysis of the seven files:

(i) In all cases, a positively-charged binding site residue (Arg or Lys) contacts an oxygen of the gamma phosphate; this residue probably serves for hydrolysis of the phosphate group.

(ii) In all cases, a Thr or Ser contacts the phosphate tail or ribose moiety of the ANP ligand.

(iii) In 6 of 7 cases, an aromatic residue (Phe or Tyr) is in contact with the aromatic adenine ring of the ligand; this aromatic residue probably serves for ring stabilization.

(iv) In 5 of 7 cases, a positively-charged binding site residue (Arg or Lys) contacts the adenine or ribose moiety.

Thus, using LPC, we found some common residues in ATP/ANP binding-sites that were not detected in previous sequence studies (for example, Walker et al., 1982). Sometimes, such residues cannot be detected by sequence similarity techniques because they appear in a different order or on different chains along the sequence vector. Thus, structural analysis, even at the amino-acid residue level, seems a potentially powerful approach for studying and extracting commonalities in ligand-binding sites. However, this level of analysis does not give us information about the binding site structure itself and is not sufficient for ATP binding predictions.

EXAMPLE 3

To provide a more detailed analysis of protein-ligand binding, an atomic contact level for each cluster was analyzed and defined. Eight classes of atom-atom contacts were defined (Table 3 below). Atom-atom contact analysis characterizes 120 (± 12) variables (contacts) per ANP binding pocket as compared to the 25 variables characterizable by amino acid residue analysis, and as such it enables a more detailed analysis of binding.

Table 3 - Classes of atom-atom contacts (from Sobolev et al., Bioinformatics 15, 327)

Class   Name               Description

I       Hydrophilic        N and O that can donate and accept hydrogen bonds

II      Acceptor           N or O that can only accept a hydrogen bond

III     Donor              N that can only donate a hydrogen bond

IV      Hydrophobic        Cl, Br, I and all C atoms that are not in aromatic rings and do not have a covalent bond to a N or O atom

V       Aromatic           C in aromatic rings irrespective of any other bonds formed by the atom

VI      Neutral            Non-aromatic C atoms that have a covalent bond to at least one atom from class I or two or more atoms from class II or III; S, F, P, and metal atoms in all cases

VII     Neutral-donor      Non-aromatic C atoms that have a covalent bond with only one atom of class III

VIII    Neutral-acceptor   Non-aromatic C atoms that have a covalent bond with only one atom of class II

Atom-atom contact analysis according to the present invention identified 13 atomic contacts specific to ANP ligand atoms. These contacts were common to all 7, or in some cases to 6 of the 7 PDB entries studied (Table 4).

Atom-atom contact definition substantially increases the possibility of uncovering a structural consensus at the atomic contact level. Therefore it is highly plausible that atom-atom contact definition is the structural consensus unit defining ligand-protein binding pocket interactions.

EXAMPLE 4

The algorithm described by the flow chart of Figure 3 defines and uncovers a cluster of atoms in a three dimensional space. In principle, a set of atoms defining a cluster can include a mixture of atom classes. From a physical-chemical point of view, atoms having either attractive (legitimate) or repulsive (illegitimate) contacts are probably favored by a cluster.

Thus, one can define a cluster type according to the non-neutral atom class which mainly contributes to the cluster. In specific cases, the definition of a cluster type could be further narrowed. For example, in some PDB files the cluster consists of only the backbone nitrogen of an Ala residue which is hydrogen bonded with a particular ligand atom. In such cases, instead of defining the cluster as a donor type, or as including a nitrogen backbone, one can conclude that the cluster is composed of backbone nitrogen atoms of an Ala residue. Such a conclusion can substantially reduce processing when searching for putative binding sites.

Table 4 - Atomic contacts formed by ANP which are common to all 7 (or 6 of 7) binding pockets

Ligand atom   Atom - atom contact          Frequency

OXG¹          hydrophilic - donor²         7

O1B           hydrophilic - donor          7

C3            neutral - hydrophobic        6

C5            aromatic - hydrophobic       7

C6            aromatic - hydrophobic       6

N6            hydrophilic - hydrophobic    6

N6            hydrophilic - acceptor       6

N1            hydrophilic - donor          6

N1            hydrophilic - hydrophobic    6

C2            aromatic - hydrophobic       7

C2            aromatic - acceptor          6

N3            hydrophilic - hydrophobic    7

C4            aromatic - hydrophobic       7

¹X represents any of the three gamma-phosphate oxygens.

²This contact has an additional specificity in that it is always from one of the two positively charged residues (Lys, Arg).
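The frequency column of Table 4 corresponds to counting, for each contact, the number of PDB entries in which it occurs. The sketch below shows such a tally under the assumption that each entry is summarized as a list of (ligand atom, ligand atom class, protein atom class) tuples; the data layout, function name and toy entries are invented for illustration and do not reproduce the actual dataset.

```python
from collections import Counter


def common_contacts(contacts_by_entry, min_entries):
    """Return contacts present in at least min_entries entries.

    contacts_by_entry: {pdb_id: [(ligand_atom, ligand_class, protein_class), ...]}
    Each contact is counted at most once per entry.
    """
    counts = Counter()
    for contacts in contacts_by_entry.values():
        counts.update(set(contacts))
    return {contact: n for contact, n in counts.items() if n >= min_entries}


# Toy data with three hypothetical entries.
contacts_by_entry = {
    "entry1": [("N6", "hydrophilic", "acceptor"), ("C2", "aromatic", "hydrophobic")],
    "entry2": [("N6", "hydrophilic", "acceptor"), ("C2", "aromatic", "hydrophobic"),
               ("C2", "aromatic", "hydrophobic")],   # duplicate within one entry
    "entry3": [("C2", "aromatic", "hydrophobic"), ("N3", "hydrophilic", "hydrophobic")],
}
frequent = common_contacts(contacts_by_entry, min_entries=2)
```

For a seven-entry dataset such as the ANP set, `min_entries=6` would correspond to the "all 7 (or 6 of 7)" criterion.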

EXAMPLE 5

Definition of cluster types

Figure 5 is a table listing putative clusters obtained by the algorithm of the present invention which is described hereinabove. A marked (filled) cell indicates that the particular PDB entry contributes a member to the cluster, while an unmarked (empty) cell indicates that the particular entry does not. The Roman numerals indicate the atom class. Note that sometimes a PDB entry contributes more than one atom to a cluster (for example, PDB file 1ank contributes two atoms to Cluster 1). The clusters listed in Figure 5 were constructed solely on the basis of atom proximity (without consideration of atom class). The clustering procedure described herein creates a 'cluster type' based on the identity of the most-frequent, non-neutral atom class (or subclass thereof) contributing to the cluster. A minimal number of members is required for a cluster to be considered useful (for example, half or more of the PDB files being analyzed).
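The assignment of a cluster type can be sketched as below. This is an illustration under assumptions: classes VI, VII and VIII are all treated as neutral (the text does not say whether the neutral-donor and neutral-acceptor classes count as neutral), membership is measured by the number of distinct contributing PDB files, and the entry names are placeholders.

```python
from collections import Counter

# Assumption: all three "neutral" classes are ignored when typing a cluster.
NEUTRAL_CLASSES = {"VI", "VII", "VIII"}


def cluster_type(members, n_files, min_fraction=0.5):
    """Return the most frequent non-neutral atom class, or None.

    members: list of (pdb_id, atom_class) pairs, classes as Roman numerals.
    n_files: number of PDB files being analyzed.
    A cluster contributed to by fewer than min_fraction of the files is
    regarded as not useful and yields None.
    """
    contributing = {pdb_id for pdb_id, _ in members}
    if len(contributing) < min_fraction * n_files:
        return None
    counts = Counter(c for _, c in members if c not in NEUTRAL_CLASSES)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Placeholder entries: four of seven files contribute to the cluster.
members = [("e1", "IV"), ("e2", "IV"), ("e3", "III"), ("e4", "VI")]
type_large = cluster_type(members, n_files=7)
type_small = cluster_type(members[:3], n_files=7)   # only three files contribute
```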

EXAMPLE 6 Defining a cluster

A cluster according to the present invention is defined as protein atoms from different structures which:

(i) are close to one another following superimposition;

(ii) are of the same atom type; and

(iii) form contact with the same ligand atom.

The table illustrated in Figure 6 was generated in accordance with these definitions. Cells are color designated according to the following cluster types: white - Hydrophobic (most frequent class, IV); light gray - Hydrogen Bond Acceptor (most frequent class, II); gray - Hydrogen Bond Donor (most frequent class, III) and black - empty cells.

Note that a donor atom (class III) from PDB file 1nsy was assigned to cluster 2 (Hydrophobic). This assignment is justified because aromatic rings act as hydrogen bond acceptors; both hydrophobic and donor atoms therefore form attractive contacts with the adenine conjugated system and are assignable to the Hydrophobic cluster type.

The hydrophobic atom (class IV) from PDB file 4at1 was not assigned to cluster 10 (Hydrogen Bond Donor) because all of the atoms of this cluster form a hydrogen bond with the N1 atom of the ligand, whereas hydrophobic atom CG1 of Ile-12 from PDB file 4at1 forms a repulsive contact and was therefore not included in the cluster.

EXAMPLE 7

Adenine cluster typing

The type and number of clusters in a protein define its binding structure. The consensus binding structure for a given ligand is determined according to the average relative positions for all acceptable clusters within a multiple-protein data set. This network of clusters generates a unique shape with a specific chemical and/or physical property assigned to each vertex. This unique shape can be used to search and locate real, putative, and pseudo ligand binding sites in any given protein for which resolved atomic coordinates are available.
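The consensus shape described above can be pictured as a set of typed vertices (one per cluster, placed at the average position of its atoms) together with the distances between them. The sketch below builds such a representation; the data layout, names and coordinates are assumptions made for illustration, and the search of a target protein for a matching arrangement is not shown.

```python
import math
from itertools import combinations


def centroid(points):
    """Average position of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def consensus_shape(clusters):
    """Build typed vertices and inter-vertex distances.

    clusters: {name: (cluster_type, [(x, y, z), ...])} in the common
    (superimposed) coordinate system.
    Returns (vertices, edges) where vertices maps name -> (type, centroid)
    and edges maps (name_a, name_b) -> distance between centroids.
    """
    vertices = {name: (ctype, centroid(points))
                for name, (ctype, points) in clusters.items()}
    edges = {(a, b): math.dist(vertices[a][1], vertices[b][1])
             for a, b in combinations(sorted(vertices), 2)}
    return vertices, edges


# Invented coordinates for two clusters.
clusters = {
    "cluster_A": ("donor", [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    "cluster_B": ("acceptor", [(3.0, 3.0, 0.0), (3.0, 5.0, 0.0)]),
}
vertices, edges = consensus_shape(clusters)
```

Because inter-vertex distances do not depend on the orientation of the protein, a representation of this kind could in principle be compared against candidate atom arrangements in a protein without first superimposing it on a ligand.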

For example, Figure 7 depicts the shape and positions of clusters 1, 2, 5 and 10, relative to the adenine ring of ATP/ANP (described in the table shown by Figure 5).

As illustrated in Figure 7, following superimposition, the atoms of a single cluster are in close proximity to each other, are of the same atom type and contact the same ligand atom.

Cluster 10 shows the position of donor atoms which hydrogen bond with the N1 atom of the 6-member adenine ring; cluster 5 shows the position of acceptor atoms which hydrogen bond with atom N6 at the edge of this ring; clusters 1 and 2 show the positions of hydrophobic atoms which contact in one case the C2 atom of the adenine ring from one side of the ring plane, and in the other case, the C5 atom of the ring from the other side of the ring plane. For both of the hydrophobic clusters, multiple atoms of the aromatic ring contact individual protein atoms. The maximum radius of the clusters in this example is 2 Å.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.

REFERENCES

1. Villar HO, Kauvar LM: Amino acid preferences at protein binding sites. FEBS Lett 1994, 349:125-130.

2. Numav N, Kidokoros S: Prediction of the active sites of proteins from amino acid sequences. Biol Pharm Bull 1993, 16:1160-1163.

3. Williamson RM: Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J Theor Biol 1995, 174:179-188.

4. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257:342-358.

5. Brady L, Brzozowski AM, Derewenda ZS, Dodson E, Dodson G, Tolley S, Turkenburg JP, Christiansen L, Huge-Jensen B, Norskov L: A serine protease triad forms the catalytic centre of a triacylglycerol lipase. Nature 1990, 343:767-770.

6. Sussman JL, Harel M, Frolow F, Oefner C, Goldman A, Toker L, Silman I: Atomic structure of acetylcholinesterase from Torpedo californica: a prototypic acetylcholine-binding protein. Science 1991, 253:872-879.

7. Pearl L: Similarity of active site structures. Nature 1993, 362:24.

8. Kobayashi N, Go N: A method to search for similar protein local structures at ligand binding sites and its application to adenine recognition. Eur Biophys J 1997, 26:135-144.

9. Shi Y, Berg JM: A direct comparison of the properties of natural and designed zinc finger proteins. Chem Biol 1995, 2:83-89.

10. Moodie SL, Mitchell JBO, Thornton JM: Pattern recognition of adenylate: an example of a fuzzy recognition template. J Mol Biol 1996, 263:486-500.

11. Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M: Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15:327-332.

12. Walker J, Saraste M, Runswick M, Gay N: Distantly related sequences in the α- and β-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J 1982, 1:945-951.

13. Denessiouk K, Johnson M: When fold is not important: a common structural framework for adenine and AMP binding in 12 unrelated protein families. Proteins 2000, 38:310-326.

14. Bruno IJ, Cole JC, Lommerse JPM, Rowland RS, Taylor R, Verdonk ML: IsoStar: a library of information about nonbonded interactions. J Comput Aided Mol Des 1997, 11:525-537.

15. Verdonk ML, Cole JC, Taylor R: SuperStar: a knowledge-based approach for identifying interaction sites in proteins. J Mol Biol 1999, 289:1093-1108.

Claims

WHAT IS CLAIMED IS:

1. A method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest, the method comprising:

(a) obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest; and

(b) extracting from said structural data at least one consensus structural characteristic characterizing an interaction between the ligand of interest and at least one of said plurality of proteins, thereby identifying at least one consensus structural characteristic of a binding site of the ligand of interest.

2. The method of claim 1, wherein the at least one consensus structural characteristic is derived from relative positions of atoms of the binding sites of said plurality of proteins.

3. The method of claim 1, wherein the at least one consensus structural characteristic is derived from a type of atoms of the binding sites of said plurality of proteins.

4. The method of claim 1, wherein the at least one consensus structural characteristic is derived from atomic contacts of atoms of the binding sites of said plurality of proteins.

5. The method of claim 4, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

6. The method of claim 1, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least said ligand binding site of at least two of said plurality of proteins.

7. The method of claim 1, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of the binding site of said at least one of said plurality of proteins.

8. The method of claim 7, wherein said cluster data is generated by:

(i) generating from said structural data a table of protein atoms and their number of atomic neighbors within a designated threshold distance;

(ii) sorting said protein atoms by decreasing number of neighbors;

(iii) removing a first atom and its neighbors from said table of protein atoms; and optionally

(iv) repeating steps (i)-(iii) to thereby obtain additional clusters, until said table of protein atoms is empty.

9. A method of identifying at least one consensus structural characteristic of binding sites of a ligand of interest, the method comprising:

(a) obtaining first structural data pertaining to a plurality of proteins while complexed with the ligand of interest and identifying binding sites of at least one of said plurality of proteins interacting with the ligand of interest;

(b) obtaining second structural data pertaining to said at least one of said plurality of proteins while free of the ligand of interest; and

(c) extracting from said second structural data at least one consensus structural characteristic of a binding site being for binding the ligand of interest, thereby identifying at least one consensus structural characteristic of a binding site of the ligand of interest.

10. The method of claim 9, wherein the at least one consensus structural characteristic is derived from relative positions of atoms of the binding sites of said plurality of proteins.

11. The method of claim 9, wherein the at least one consensus structural characteristic is derived from a type of atoms of the binding sites of said plurality of proteins.

12. The method of claim 9, wherein the at least one consensus structural characteristic is derived from atomic contacts of atoms of the binding sites of said plurality of proteins.

13. The method of claim 12, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

14. The method of claim 9, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least said ligand binding site of at least two of said plurality of proteins.

15. The method of claim 9, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of the binding site of said at least one of said plurality of proteins.

16. The method of claim 15, wherein said cluster data is generated by:

17. A method of screening a database of structural data of proteins for individual proteins potentially binding a ligand of interest, the method comprising:

(a) defining at least one consensus structural characteristic of structurally characterized binding sites of the ligand of interest; and

(b) identifying individual proteins in which said at least one consensus structural characteristic exists.

18. The method of claim 17, wherein step (a) is effected by:

(i) obtaining structural data pertaining to a plurality of proteins while complexed with the ligand of interest; and

(ii) extracting from said structural data at least one consensus structural characteristic characterizing an interaction between the ligand of interest and at least one of said plurality of proteins, thereby identifying at least one consensus structural characteristic of a binding site of the ligand of interest.

19. The method of claim 17, wherein step (a) is effected by:

(i) obtaining first structural data pertaining to a plurality of proteins while complexed with the ligand of interest and identifying structures of at least one of said plurality of proteins interacting with the ligand of interest;

(ii) obtaining second structural data pertaining to said structures of said at least one of said plurality of proteins while free of the ligand of interest; and

(iii) extracting from said second structural data at least one consensus structural characteristic characterizing a structure being for binding the ligand of interest, thereby identifying at least one consensus structural characteristic of a binding site of the ligand of interest.

20. The method of claim 17, wherein the at least one consensus structural characteristic is derived from relative positions of atoms of said structurally characterized binding sites.

21. The method of claim 17, wherein the at least one consensus structural characteristic is derived from a type of atoms of said structurally characterized binding sites.

22. The method of claim 17, wherein the at least one consensus structural characteristic is derived from atomic contacts of atoms of said structurally characterized binding sites.

23. The method of claim 22, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

24. The method of claim 17, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least two of said structurally characterized binding sites.

25. The method of claim 17, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of at least one of said structurally characterized binding sites.

26. The method of claim 25, wherein said cluster data is generated by:

27. A method of generating a polypeptide capable of binding a ligand of interest comprising:

(a) defining at least one consensus structural characteristic of structurally characterized binding sites of the ligand of interest; and

(b) synthesizing or modifying the polypeptide so as to include an amino acid region having said at least one consensus structural characteristic thus generating the polypeptide capable of binding a ligand of interest.

28. The method of claim 27, wherein the at least one consensus structural characteristic is derived from relative positions of atoms of said structurally characterized binding sites.

29. The method of claim 27, wherein the at least one consensus structural characteristic is derived from a type of atoms of said structurally characterized binding sites.

30. The method of claim 27, wherein the at least one consensus structural characteristic is derived from atomic contacts of atoms of said structurally characterized binding sites.

31. The method of claim 30, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

32. The method of claim 27, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least two of said structurally characterized binding sites.

33. The method of claim 27, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of at least one of said structurally characterized binding sites.

34. The method of claim 33, wherein said cluster data is generated by:

35. A method of predicting the effect of a binding site modification on binding between a protein and a specific ligand thereof, the method comprising:

(a) defining at least one consensus structural characteristic of structurally characterized binding sites of the specific ligand; and

(b) determining the effects of the binding site on the binding between the protein and the specific ligand according to said at least one consensus structural characteristic.

36. The method of claim 35, wherein said consensus structural characteristic is derived from parameters selected from the group consisting of relative positions of atoms of said structurally characterized binding sites, a type of atoms of said structurally characterized binding sites, intermolecular atomic contacts of atoms of said structurally characterized binding sites and intramolecular atomic contacts of atoms of said structurally characterized binding sites.

37. A system useful for screening a database of structural data of proteins for individual proteins potentially binding a ligand of interest, the system including:

(a) a data storage media storing, as records, at least one consensus structural characteristic data defining a binding site of the ligand; and

(b) a data processor for executing a software application being for comparing structural data of the individual proteins to said at least one consensus characteristic defining a binding site of the ligand to thereby enable detection of said binding site within an individual protein.

38. The system of claim 37, wherein said at least one consensus structural characteristic data is derived from relative positions of atoms of structurally characterized binding sites of said ligand.

39. The system of claim 37, wherein said at least one consensus structural characteristic data is derived from a type of atoms of structurally characterized binding sites of said ligand.

40. The system of claim 37, wherein said at least one consensus structural characteristic data is derived from atomic contacts of atoms of structurally characterized binding sites of said ligand.

41. The system of claim 40, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

42. The system of claim 37, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least said binding site of at least two proteins being capable of binding the ligand.

43. The system of claim 37, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of said binding site of at least one protein being capable of binding the ligand.

44. The system of claim 43, wherein said cluster data is generated by:

(i) generating from said structural data a table of protein atoms and their number of atomic neighbors within a designated threshold distance;

(ii) sorting said protein atoms by decreasing number of neighbors;

(iii) removing a first atom and its neighbors from said table of protein atoms; and optionally

(iv) repeating steps (i)-(iii) to thereby obtain additional clusters, until said table of protein atoms is empty.

45. The system of claim 37, further comprising a server communicating with or forming a part of, said computing platform, said server being capable of communicating with a user client being operated by a user for providing said user access to said software application.

46. The system of claim 45, wherein said server forms a part of a communication network.

47. The system of claim 46, wherein said communication network is the World Wide Web.

48. The system of claim 45, wherein said user client is a computer operating a Web browser application.

49. A data storage media comprising, as retrievable records, at least one consensus structural characteristic data defining a binding site of a ligand.

50. The data storage media of claim 49, wherein said at least one consensus structural characteristic data is derived from relative positions of atoms of structurally characterized binding sites of said ligand.

51. The data storage media of claim 49, wherein said at least one consensus structural characteristic data is derived from a type of atoms of structurally characterized binding sites of said ligand.

52. The data storage media of claim 49, wherein said at least one consensus structural characteristic data is derived from atomic contacts of atoms of structurally characterized binding sites of said ligand.

53. The data storage media of claim 49, wherein said atomic contacts include intermolecular and intramolecular atomic contacts.

54. The data storage media of claim 49, wherein said at least one consensus structural characteristic is derived by structurally superimposing at least a ligand binding site of at least two proteins being capable of binding said ligand.

55. The data storage media of claim 49, wherein said at least one consensus structural characteristic includes cluster data, said cluster data including coordinate information of at least one atom of the ligand binding site of at least one protein being capable of binding said ligand.

56. The data storage media of claim 55, wherein said cluster data is generated by:

(i) generating from said structural data a table of protein atoms and their number of atomic neighbors within a designated threshold distance;

(ii) sorting said protein atoms by decreasing number of neighbors;

(iii) removing a first atom and its neighbors from said table of protein atoms; and optionally

(iv) repeating steps (i)-(iii) to thereby obtain additional clusters, until said table of protein atoms is empty.

57. The data storage media of claim 49, wherein the media is selected from the group consisting of a magnetic media, an optical media and an optical-magnetic media.

58. The data storage media of claim 57, wherein said optical media is a computer disk (CD) or a digital video disk (DVD).