WO2001035255A2 - Large scale comparative protein structure modeling - Google Patents

Info

Publication number
WO2001035255A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
protein
model
alignment
amino acid
Prior art date
Application number
PCT/US2000/030753
Other languages
French (fr)
Other versions
WO2001035255A3 (en)
Inventor
Andrej Sali
Roberto Sanchez
Francisco Melo
Original Assignee
The Rockefeller University
Priority date
Filing date
Publication date
Application filed by The Rockefeller University
Priority to EP00978441A (patent EP1236124A2)
Priority to JP2001536721A (patent JP2003525483A)
Priority to CA002391469A (patent CA2391469A1)
Publication of WO2001035255A2
Publication of WO2001035255A3

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment

Landscapes

  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Peptides Or Proteins (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Preparation Of Compounds By Using Micro-Organisms (AREA)

Abstract

A method and system for creating 3-dimensional models of polypeptides and proteins given their amino acid sequence; also the models they generate.

Description

LARGE SCALE COMPARATIVE PROTEIN STRUCTURE MODELING
The research leading to the present invention was supported, at least in part, by a grant
from . Accordingly, the Government may have certain rights in the invention.
Background
This invention relates to processes for generating models of 3-dimensional protein structures, to the systems needed to implement the processes, and also to the data generated
by those systems.
In a few years, the genome projects will have provided us with the amino acid sequences of more than a million proteins - the catalysts, inhibitors, messengers, receptors,
transporters, and building blocks of the living organisms [1]. The full potential of the
genome projects will only be realized once we assign and understand the function of these
proteins. While protein function is best determined experimentally, it can sometimes be
predicted by matching the sequence of a protein with proteins of known function [1]. One
way to improve such sequence-based predictions of function is to rely on the known native 3D structure of proteins [1]. The 3D structure of a protein generally provides more
information about its function than sequence because interactions of a protein with other
molecules are determined by amino acid residues that are close in space but are frequently distant in sequence. In addition, since evolution tends to conserve function and function
depends more directly on structure than on sequence, structure is more conserved in evolution
than sequence. The net result is that patterns in space are frequently more recognizable than patterns in sequence. Unfortunately, 3D structures have been determined by x-ray
crystallography or NMR spectroscopy for only a fraction of known protein sequences; while there are approximately 450,000 protein sequences in TrEMBL/SWISS-PROT [2], there are
only 10,000 known protein structures in the Protein Data Bank [3]. However, a useful 3D
model can frequently be obtained by comparative or homology protein structure modeling,
which can construct all-atom 3D models for those proteins that are related to at least one
known protein structure [4].
Despite errors in comparative modeling, it has been applied successfully to many biological
problems [4, 5]. Comparative modeling can be helpful in proposing and testing hypotheses in molecular biology, such as hypotheses about ligand binding sites [6],
substrate specificity [7], and drug design [8]. It can also provide starting models in
x-ray crystallography [9] and NMR spectroscopy [10]. Explicit 3D modeling and model
evaluation are the best way of either confirming or rejecting a match between two remotely
related proteins [11]. This is important because most of the related protein pairs share less than 30% sequence identity [11]. It is frequently possible to predict correctly important
features of the target protein that do not occur in the template structure. For example, the
location of a binding site can be predicted from clusters of charged residues [12], and the size
of a ligand can be predicted from the volume of the binding site cleft [13]. Another use of 3D
models is that some binding and active sites, which cannot possibly be found by searching for local sequence patterns [14], frequently should be detectable by searching for small 3D motifs that are known to bind or act on specific ligands [15]. In general, medium
resolution models based on at least 30% sequence identity to a known protein structure frequently allow a refinement of the functional prediction based on sequence alone.
The fraction of the known protein sequences that have at least one segment related to one or
more known structures currently ranges from 20 to 50%, depending on the genome [16].
Thus, the number of sequences that can be modeled with useful accuracy by comparative modeling is already more than an order of magnitude larger than the number of
experimentally determined protein structures. Furthermore, the fraction of protein sequences
that can be modeled reliably by comparative modeling is increasing rapidly. The main
reasons for this improvement are the increases in the numbers of known folds and the
structures per fold family as well as the improvement in the fold recognition and comparative modeling techniques. It has been estimated that globular protein domains cluster in only a few thousand fold families, approximately 900 of which have already been structurally
defined [18]. Assuming the current growth rate in the number of known protein structures,
the structure of at least one member of most globular folds will be determined in less than ten
years [11]. Structural genomics may in fact accelerate this goal [19]. As a result, comparative modeling will be applicable to most of the globular protein domains soon after
the completion of the human genome project.
Despite the usefulness of comparative modeling, it is still not a common sequence analysis tool for the biologist, partly due to the lack of easy access to reliable and evaluated models.
Our ModBase [5, 11] database of comparative models attempts to resolve this problem.
Our results include the modeling of five procaryotic and eucaryotic genomes [11]. A calculation resulted in the models for substantial segments of 17.2%, 18.1%, 19.2%,
20.4%, and 15.7% of all proteins in the genomes of S. cerevisiae (6218 proteins in
the genome), E. coli (4290 proteins), M. genitalium (468 proteins), C. elegans (7299 proteins, incomplete), and M. jannaschii (1735 proteins), respectively. An important
feature of this study was an evaluation of all the models by a statistical potential
function. This allowed identification of those models that were likely to be based on correct
templates and at least approximately correct alignments. As a result, 236 yeast proteins
without any prior structural information were assigned to a particular fold family; 40 of these proteins did not have any prior functional annotation. All the alignments and models for the
five genomes are available on the Internet at URL http://guitar.rockefeller.edu. (The program
Modeller used for sequence-structure alignment, model building, and model evaluation can
be obtained from MSI, San Diego, CA.) The yeast models are also accessible through the
Saccharomyces Genome Database (URL http://genome-www.stanford.edu/Saccharomyces/).
Another database related to protein structure is the Swiss-Model database [20].
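The coverage percentages quoted above can be turned into approximate protein counts. The following is a minimal sketch that only multiplies the genome sizes by the stated fractions; because the percentages are rounded, the resulting counts are estimates rather than the exact numbers of modeled proteins. The dictionary layout and the function name `approx_modeled` are illustrative assumptions.

```python
# Genome sizes and modeled fractions as quoted in the text above.
# The data structure and function name are illustrative, not from the source.
GENOMES = {
    "S. cerevisiae": (6218, 17.2),
    "E. coli": (4290, 18.1),
    "M. genitalium": (468, 19.2),
    "C. elegans": (7299, 20.4),  # genome incomplete at the time
    "M. jannaschii": (1735, 15.7),
}


def approx_modeled(total_proteins: int, percent_modeled: float) -> int:
    """Estimate how many proteins have a model for a substantial segment."""
    return round(total_proteins * percent_modeled / 100)


if __name__ == "__main__":
    for organism, (total, percent) in GENOMES.items():
        print(f"{organism}: about {approx_modeled(total, percent)} of {total}")
```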
The database will make it efficient for both experts and non-experts to use comparative
models, allowing them to spend more time designing experiments. In addition, the automation is essential for access to models by non-experts. Finally, automation encourages
development of better methods. Comparative models in the database will be used in many
different ways, depending on their accuracy (pages 395-397 in ref. 5).
Typical applications of comparative modeling. Several applications are listed above. It is
frequently possible to extract more information from a comparative model than from the modeled sequence, or even from its alignment to related protein sequences or structures [5]. For example, while the preferred ligand for brain lipid binding protein can be predicted from
its comparative model, the ligand preferences cannot easily be predicted from the sequence or
its alignment to structurally defined fatty acid binding proteins [13]. Another example is
provided by several mouse mast cell proteases, which have a conserved surface region of
positively charged residues that binds proteoglycans [12]. This region is not easily recognizable in the sequence or its alignment to a known structure because the constituting residues occur at variable and sequentially non-local positions that form a binding site only
when the protease is fully folded.
Fold Assignment. Establishing a match between a given protein sequence and a well
characterized protein is perhaps the most frequent computational task in biology. In
particular, fold assignment generally allows a strong prediction of function and design of
experiments to test it. We argue that fold assignment by alignment, modeling, and model
evaluation, as proposed here, is a strategy that will reveal a significant number of weak relationships that are detected neither by multiple sequence comparison nor by threading. We
have already demonstrated that reliable models can be obtained for an additional 9%
of the yeast proteins, despite insignificant PSI-BLAST [22] matches to known structures on
which the models are based. These non-trivial matches increase the fold assignment coverage of the yeast proteins from 35% for PSI-BLAST to 44%, even though PSI-BLAST is
one of the most sensitive sequence matching programs. An underlying reason is that the evaluation of a 3D model based on an energy function is more sensitive than the evaluation of
a sequence alignment based on an amino acid residue substitution matrix. Similarly, there are also reasons to believe that the present invention will add to the fold assignments obtained by
threading alone, although this has not yet been demonstrated. These reasons are as follows:
(i) Multiple sequence-structure alignments obtained by PSI-BLAST and MODELLER for the
pipeline tend to be more accurate than the pairwise sequence-structure alignments obtained
by most current threading programs, because of the demonstrated usefulness of multiple sequences in alignment [23]. (ii) The model evaluated by the pipeline is more
accurate and more complete than the threading model even when the alignments are equal,
because the insertions, deletions, sidechains, rigid body shifts and distortions are modeled
explicitly. (iii) We can use more complex and therefore more accurate scoring functions for
model evaluation because only a few models per sequence are evaluated in the proposed approach, while threading has to evaluate on the order of 10^9 structures for each
sequence-structure pair. We note that our approach does not replace multiple sequence
alignment and threading. In fact, it builds on them because they can be used to
propose candidate fold assignments for further processing by comparative model building
and model evaluation.
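The coverage argument above can be illustrated with a small sketch: a fold assignment is accepted either when the sequence match is significant on its own or when the model built from an insignificant match passes model evaluation. The boolean inputs and function names are assumptions made for illustration; a real pipeline would derive them from PSI-BLAST significance values and from a model evaluation score.

```python
def fold_assigned(seq_match_significant: bool, model_reliable: bool) -> bool:
    """Accept a fold assignment on either line of evidence."""
    return seq_match_significant or model_reliable


def coverage_percent(records: list[tuple[bool, bool]]) -> float:
    """Percentage of proteins with an accepted fold assignment.

    Each record is (seq_match_significant, model_reliable) for one protein.
    """
    assigned = sum(fold_assigned(seq, model) for seq, model in records)
    return 100.0 * assigned / len(records)


# Toy data chosen to mirror the proportions quoted in the text:
# 35 proteins with a significant sequence match, 9 more rescued by
# model evaluation alone, and 56 with no accepted assignment.
toy = (
    [(True, True)] * 30
    + [(True, False)] * 5
    + [(False, True)] * 9
    + [(False, False)] * 56
)
sequence_only = 100.0 * sum(seq for seq, _ in toy) / len(toy)
combined = coverage_percent(toy)
```

With these toy proportions, `sequence_only` is 35.0 and `combined` is 44.0, matching the increase from 35% to 44% described above.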
New applications. Large-scale modeling should encourage new kinds of applications for the
many resulting models, based on their large number and completeness at the level of the
family, organism, or functional network. For example, a collection of experimentally determined complexes of proteins with their ligands, aligned with comparative models for the rest of the family members, will permit a facile comparison of ligand binding requirements
and also reveal permitted substitutions in and around important residues. A specific example
of a new opportunity for tackling existing problems by virtue of providing many protein
models from many genomes is the selection of a target protein for which a drug needs to be developed. A good choice is a protein that is likely to have high ligand specificity; specificity
is important because specific drugs are less likely to be toxic. Large-scale modeling
facilitates imposing the specificity filter in target selection by enabling a structural comparison
of the ligand binding sites of many proteins, either human or from other organisms. Such
comparisons may make it possible to select rationally a target whose binding site is structurally most different from the binding sites of all the other proteins that may potentially
react with the same drug. For example, when a human pathogenic organism needs to be
inhibited, it may be possible to select as the target that pathogen's protein that is structurally
most different from all the human homologs. Alternatively, when a human metabolic
pathway needs to be regulated, the target identification could focus on that particular protein
in the pathway that has the binding site most dissimilar from its human homologs. For such
applications, comparative models of all sequences are needed, even if they are very similar to
each other.
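One way to read the specificity filter described above is as a max-min selection: for each candidate target, find its most similar human homolog, then pick the candidate for which even that closest homolog is least similar. The sketch below assumes that some structural comparison has already produced a binding-site dissimilarity score per homolog (larger means more different); the score scale, the dictionary layout, and the function name are hypothetical.

```python
def select_target(dissimilarity: dict[str, list[float]]) -> str:
    """Return the candidate whose closest human homolog is most dissimilar.

    `dissimilarity` maps each candidate protein to the binding-site
    dissimilarity scores against its human homologs. A candidate with no
    human homologs is treated as maximally specific.
    """
    return max(
        dissimilarity,
        key=lambda name: min(dissimilarity[name], default=float("inf")),
    )


# Hypothetical scores for three pathogen proteins.
candidates = {
    "protein_A": [0.2, 0.9],  # one very similar human homolog
    "protein_B": [0.5, 0.6],  # all homologs moderately different
    "protein_C": [0.4],
}
chosen = select_target(candidates)
```

Here `protein_B` is chosen because its closest human homolog (score 0.5) is still more different than the closest homolog of either alternative. Other selection rules, such as maximizing the average dissimilarity, are equally compatible with the text.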
Summary of the Invention
In a first general aspect, the invention comprises a computerized process of generating
a 3-dimensional model of a protein, the process comprising the steps of:
(1) an inputting step, wherein a query protein amino acid sequence is inputted into a
computer;
(2) a sequence search step, wherein one or more protein databases are searched so as
to identify potentially sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein;
(3) an alignment step, wherein for each sequence-related protein an optimal degree of alignment is created between the amino acid sequence of the query protein and that of each
sequence-related protein (at least one of which has a known 3-dimensional structure);
(4) a model-building step, wherein for each query sequence alignment obtained in the
alignment step, electronically stored retrievable information is created, said information
defining a model of a 3-dimensional structure for all or part of the query protein amino acid
sequence;
(5) a model evaluation step, wherein the good model that is probably closest in
structure to the actual structure of all or part of the query protein is selected from all models generated for the query amino acid sequence;
(6) a model storage step, wherein information generated in steps (2), (3), (4), and/or
(5) is stored electronically or electromagnetically such that said information is retrievable (to
generate data or images or structures that define a 3-dimensional protein structure).
The above process is at times referred to in this application as the "pipeline".
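By way of illustration only, steps (1) through (6) can be sketched as a simple driver loop. The following Python outline is a minimal sketch under stated assumptions: the names `Model`, `run_pipeline`, `search`, `align`, and `build` are hypothetical placeholders, a higher score is assumed to indicate a better model, and nothing here reproduces the code of Appendix 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Model:
    """Hypothetical record for one model; not a structure from Appendix 1."""
    query_id: str
    template_id: str
    score: float  # assumption: higher means more likely to be correct


def run_pipeline(
    queries: Dict[str, str],
    search: Callable[[str], List[str]],
    align: Callable[[str, str], str],
    build: Callable[[str, str, str], Model],
    store: Dict[str, Model],
) -> Dict[str, Model]:
    """Carry each query through steps (1)-(6); all callables are stubs."""
    for query_id, sequence in queries.items():                # (1) input
        best: Optional[Model] = None
        for template_id in search(sequence):                  # (2) sequence search
            alignment = align(sequence, template_id)          # (3) alignment
            model = build(query_id, template_id, alignment)   # (4) model building
            if best is None or model.score > best.score:      # (5) model evaluation
                best = model
        if best is not None:
            store[query_id] = best                            # (6) model storage
    return store
```

In this sketch a query with no sequence-related proteins simply produces no stored model.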
In a preferred mode, the process is conducted such that a plurality (two or more) of
query protein amino acid sequences are processed in steps (1) through (6) in time periods that overlap each other. It is also preferred that when the processing of a query protein amino acid
sequence is completed, the process begins again with another query protein amino acid
sequence, unless of course, the processing of all query amino acid sequences is complete. As
a result, the process is preferably implemented in a looping manner with the responsibilities for executing the steps distributed in a controlled optimal manner on a plurality of processors in a single or many computers. For example, in a parallel processing mode, various
possibilities exist, one of them being that multiple sequences are processed in parallel through the entire process, another that multiple sequences are just processed through an
individual step before the next batch of sequences is processed.
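The two parallel processing modes mentioned above can be contrasted in a short sketch. It is illustrative only: the function names are hypothetical, each step is assumed to be a function that takes and returns a string, and threads from the Python standard library stand in for the plurality of processors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Step = Callable[[str], str]  # assumption: each step maps one item to one item


def per_sequence_parallel(seqs: Sequence[str], steps: Sequence[Step],
                          workers: int = 4) -> List[str]:
    """Mode 1: each worker carries one sequence through the entire process."""
    def run_all(item: str) -> str:
        for step in steps:
            item = step(item)
        return item

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_all, seqs))


def per_step_parallel(seqs: Sequence[str], steps: Sequence[Step],
                      workers: int = 4) -> List[str]:
    """Mode 2: the whole batch completes one step before the next step starts."""
    items = list(seqs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for step in steps:
            items = list(pool.map(step, items))
    return items
```

Both functions return results in input order; they differ only in when each sequence reaches each step.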
Both the computerized process of this invention and the system of this invention will
generate electronically or electromagnetically stored information that is retrievable to
generate data, 2-dimensional images (for example on paper or a computer screen), or structures (such as models) that represent a 3-dimensional protein or polypeptide structure.
Such stored information is another aspect of this invention as are such data, images, and
structures. Similarly, a database of the models and data created by the process is also an
aspect of the invention. A database that incorporates such models and data with other databases (with data such as sequence information, expression information, etc.) is also an aspect of the invention. A system of interacting databases, one of which is a database that
comprises the models and data generated by the present invention, is also an aspect of this
invention.
The process of this invention can be extended by additional steps in which the electronically stored data of one or more proteins is used in combination with other
electronically stored data representing the structure of another molecule, such as a ligand, the
use being one in which the ability of the other molecule to bind to, fit with, or dock on, the
protein or proteins is investigated. Such studies can, for example, be used to identify
inhibitors or activators of enzymes; or activators or blockers of cell receptors.
The process can also be extended (and the data produced by it further used) generally
in protein structure analysis. Additionally, the data can be subject to further improvement by other model-building processes, model improvement processes, and protein structure analysis
methods. The structural data of a protein can be used to predict, interpret, and modify its
function and also design molecules with similar function. It can also be used to design ligands
and other compounds that modulate the protein's function.
The data and corresponding models produced can be used to annotate (assign information such as a structure and/or function) nucleic acid or protein sequences, for
example for database purposes. This information can be used in target selection in research
and drug development.
Figure 1 can be understood, for example by the following summary of a general
aspect of the invention, where the bold-faced numbers refer to the reference numbers in the
Figure:
Another general aspect of this invention is a system for accepting as input a query
amino acid sequence and carrying out the computerized process of the invention so as to
output a 3-dimensional model 11 of a protein with the query amino acid sequence:
(1) a collection of one or more query amino acid sequences 1;
(2) a collection 13 of one or more databases;
(3) a sequence matching engine 3 for searching one or more protein databases so as to
identify sequence-related proteins, those proteins that have an amino acid sequence exceeding
a pre-specified degree of sequence similarity with the query protein;
(4) an alignment engine 5 that, for each sequence-related protein, creates an optimal
degree of alignment between the amino acid sequence of the query protein and that of each sequence-related protein, said alignment identifying sequence gaps where one sequence has
no equivalent amino acid residues in the other;
(5) a model building engine 7 that, for each query sequence alignment obtained in the
alignment step, generates electronically stored retrievable information that defines a model of a 3-dimensional structure for all or part of the query protein amino acid sequence; and
(6) a model evaluation engine 9 that discriminates good models from bad models for
the query amino acid sequence, and sorts the good models according to their overall accuracy.
In a preferred embodiment, the interaction between the collections and engines is
under the control of a process control engine 21.
In general, an engine specified herein refers to a combination of a computer and software with appropriate instructions for the computer.
In the inputting step, most frequently, the sequence will be inputted as a text file.
Query sequences (also referred to as ORFs) from one or more organisms can be automatically downloaded from databases such as GENPEPT or TrEMBL using publicly
available software such as FTP (file-transfer-protocol) or Netscape.
Within the sequence search step, it is preferred that a search be done for sequences
showing some degree of similarity to the query amino acid sequence; preferably a database of non-redundant sequences is searched. All sufficiently similar sequences are collected and
represent a "PSSM profile" [22], or profile, of the query sequence (the query sequence PSSM profile). The search is preferably a stringent search (one in which the E value is 10⁻¹
or smaller, preferably about 5 × 10⁻³; the E value corresponding to a stringent search will
depend on the size of the database being searched, e.g., for a database of 500,000 sequences,
an E value of 5 × 10⁻³ would be typical). It is preferably done with a large database of protein
amino acid sequences (e.g., of the order of 500,000 sequences or more; examples are
GENPEPT and TrEMBL). Within the sequence search step, it is also preferred that a search
be done for sequences showing some degree of similarity to each sequence of known 3-dimensional structure in one or more databases. For each such sequence of known 3-D
structure, all sufficiently similar sequences are collected and represent a 3-D sequence PSSM
profile. An example of such a database is the publicly accessible Protein Data Bank (PDB).
This search is also preferably a stringent search. It can be done for one edition of the PDB
and then at an appropriate time, days, weeks, or months later, be redone. As a result the 3-D
sequence PSSM profile can be done once and reused many times until the next time it is updated.
Also within the sequence search step, it is preferred that non-stringent searches be
done. A non-stringent search is one with an E value cutoff between 10⁻⁴ and 10,000,
typically with an E value cutoff of 100. In one set of such searches, the search is for
sequences related to the query sequence PSSM and the population searched are all the sequences in the 3-D sequence database collection. In another set of such searches, the search
is for sequences related to each 3-D sequence PSSM and each such 3-D sequence PSSM is
compared to the query sequence. The concept of reciprocal PSSM searches has been disclosed (Teichmann and Chothia, P.N.A.S. (1998)).
The use of the low stringency searches is an important advantage of the present
invention. It is possible because of the quality of the alignment, model-building, and model
evaluation steps. The introduction of the low stringency searches increases the chance of
finding an amino acid sequence that although remote from a sequence point of view is
nevertheless associated with a 3-dimensional model structure that is close to or the same as
the correct structure. This is not only important for purposes of analyzing a single query sequence but also when the entire process is used to process thousands of query sequences in
automated fashion. In the latter case, it increases the chances that the process will correctly
analyze a large percentage of the input proteins.
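The relation between the stringent and non-stringent searches can be illustrated by partitioning search hits on their E values. This is a sketch only: the two cutoffs are the example values quoted above (about 5 × 10⁻³ and 100), while the function name, the category labels, and the input format of (identifier, E value) pairs are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

STRINGENT_E = 5e-3       # example stringent cutoff quoted in the text
NON_STRINGENT_E = 100.0  # typical non-stringent cutoff quoted in the text


def split_hits(hits: List[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Partition (hit_id, e_value) pairs by the two E value cutoffs."""
    result: Dict[str, List[str]] = {
        "stringent": [],
        "non_stringent_only": [],
        "rejected": [],
    }
    for hit_id, e_value in hits:
        if e_value <= STRINGENT_E:
            result["stringent"].append(hit_id)
        elif e_value <= NON_STRINGENT_E:
            # Remote candidates: kept, and left for the alignment,
            # model-building, and model evaluation steps to accept or reject.
            result["non_stringent_only"].append(hit_id)
        else:
            result["rejected"].append(hit_id)
    return result
```

Hits in the middle category are the ones that the low stringency searches add to the process.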
In the alignment step, the alignment preferably identifies sequence gaps where one sequence has no equivalent amino acid residues in the other (it is preferred to generate, for
each sequence-related protein, two possible alignments of the query sequence).
In the alignment step, a gap penalty function is preferably used. It is particularly preferred that the function be a variable gap penalty function. Highly preferred are variable
gap penalty functions that favor gaps in regions that have one or more of the following
characteristics: are solvent exposed, are curved, are outside secondary structure segments
(i.e., are random coil segments), are between two Cα positions close in space.
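One way to picture a variable gap penalty is as a base penalty that is reduced once for each of the listed characteristics. The sketch below is hypothetical throughout: the multiplicative form, the weights, the 8 Å distance cutoff, and the names `Position` and `gap_open_penalty` are invented for illustration and do not reproduce the function used by MODELLER.

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Structural context of a candidate gap site in the known structure."""
    solvent_exposed: bool
    curved: bool
    in_secondary_structure: bool
    ca_distance: float  # distance in Å between the Cα atoms flanking the gap


def gap_open_penalty(pos: Position, base: float = 10.0,
                     close_cutoff: float = 8.0) -> float:
    """Reduce the base penalty for each gap-favoring feature (made-up weights)."""
    penalty = base
    if pos.solvent_exposed:
        penalty *= 0.7
    if pos.curved:
        penalty *= 0.8
    if not pos.in_secondary_structure:
        penalty *= 0.6
    if pos.ca_distance <= close_cutoff:
        penalty *= 0.9
    return penalty
```

With these illustrative weights, a gap in an exposed, curved loop costs roughly a third of a gap in a buried helix, so the alignment tends to place gaps where they are structurally reasonable.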
In the model building step, it is preferred to use a method that employs a means for
satisfying spatial restraints on the Cartesian coordinates of an atom of a query protein amino
acid sequence, said restraints defined by a conditional probability distribution based on proteins of known 3-dimensional structure. It is further preferred that such a means take into account one or more (most preferably all) of the following types of restraints:
(1) a homology restraint, in which the most likely coordinates are those indicated by
the known 3-dimensional structures of the best matched amino acid sequences;
(2) physical and chemical restraints, such as:
(A) ideal bond lengths;
(B) the impossibility of two atoms being located in the same space;
(C) ideal bond angles;
(D) ideal improper dihedral angles; and
(E) ideal chiralities.
Values for implementing the chemical restraints are available from the CHARMM program on the world wide web at http://yuri.harvard.edu/charmm/charmm.html.
MODELLER, a software application for executing the model building step, is available
from MSI (San Diego, CA). MODELLER has been discussed in the scientific literature.
In the model evaluation step it is preferred to compare the 3-dimensional models
obtained in the model-building step to models previously identified as being "good" and to
models previously identified as being "bad". To generate such models, amino acid sequences of proteins of known 3-D structure are run as query sequences through the process
of this invention. A good model is one in which at least 30% of the Cα atoms are within 3.5
Å of the equivalent Cα in the known "correct" structure after the model and the correct
structure have been optimally superimposed. Bad models have fewer than 15% equivalent Cα
atoms that meet that criterion. While and/or after the population of good and bad models has been created, a genetic algorithm (GA) is informed of the data and the GA generates a scoring function. The scoring function is described in more detail below; the code for its
implementation in the overall pipeline is included in Appendix 1.
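The good and bad model criteria can be expressed as a short calculation. The following sketch assumes that the model and the correct structure have already been optimally superimposed and that the two coordinate lists contain equivalent Cα atoms in the same order; the function names and the "intermediate" label for models between the two thresholds are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def fraction_within(model: Sequence[Coord], native: Sequence[Coord],
                    cutoff: float = 3.5) -> float:
    """Fraction of equivalent Cα pairs within `cutoff` Å of each other.

    Assumes the two structures are already optimally superimposed.
    """
    if len(model) != len(native) or not model:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    close = sum(1 for m, n in zip(model, native) if math.dist(m, n) <= cutoff)
    return close / len(model)


def classify(fraction: float) -> str:
    """Apply the 30% and 15% thresholds stated in the text."""
    if fraction >= 0.30:
        return "good"
    if fraction < 0.15:
        return "bad"
    return "intermediate"  # assumption: neither good nor bad
```

For example, a ten-residue model with three Cα atoms within 3.5 Å of their equivalents has a fraction of 0.3 and is classified as good.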
Preferred are scoring functions that are a function of protein compactness, sequence
identity (seq.ide), and z score; more preferably they are of the form 1 - a^x, where a is preferably a
function of the seq.ide (most preferably cos(seq.ide)) and x is preferably a function of
compactness, seq.ide, and the z score (most preferably (compactness + seq.ide)/z score).
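A numerical sketch of the most preferred form follows. It assumes that the scoring function reads 1 - a^x with a = cos(seq.ide) and x = (compactness + seq.ide)/z score, that seq.ide is expressed as a fraction between 0 and 1 and passed to the cosine in radians, and that the z score is nonzero. The units and sign conventions are not specified in this section, so these are assumptions, and the function name is hypothetical.

```python
import math


def model_score(seq_ide: float, compactness: float, z_score: float) -> float:
    """Evaluate 1 - cos(seq_ide) ** ((compactness + seq_ide) / z_score).

    Assumptions: seq_ide is a fraction in [0, 1] (radians for the cosine)
    and z_score is nonzero; neither convention is stated in the text.
    """
    if not 0.0 <= seq_ide <= 1.0:
        raise ValueError("seq_ide is assumed to be a fraction in [0, 1]")
    if z_score == 0:
        raise ValueError("z_score must be nonzero")
    a = math.cos(seq_ide)                   # base: function of sequence identity
    x = (compactness + seq_ide) / z_score   # exponent
    return 1.0 - a ** x
```

Under these assumptions the score is 0 when the sequence identity is 0 and rises as the identity increases for a fixed positive z score.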
Brief description of the drawings
Figure 1 shows the system of the invention.
Figure 2 shows in panel (A) the percentages of false positives and negatives as a function
of model sequence length and in panel (B) the percentage of structure overlap as a function of
percent sequence identity.
Figure 3 shows in panel (A) the distribution of the sequence identity between models
and in (B) the corresponding distribution of the alignment significance score.
Figure 4: Modeling a putative interaction of a predicted YDL117W SH3 domain with a
proline rich peptide.
Figure 5: Sample optimization of parameters u (y axis) and v (x axis) for the gap
penalty function.
Figure 6 shows a flow chart illustrating one example of embodying the process of
the invention.
Detailed description
Glossary
An "amino acid sequence" refers only to such a sequence; there are no considerations
of the three-dimensional structure of the protein. An amino acid sequence of interest may represent all or part of a polypeptide chain of a naturally occurring protein or human-
designed protein.
A "sequence similarity search" between two amino acid sequences, such as one
labeled a target sequence and another labeled a template sequence, can be illustrated by the following simple case: If one sequence is ATYHCP (using the one letter standard codes to
represent a sequence of amino acids) and the other sequence is also ATYHCP, then there is
a perfect match. If the other sequence is ATYLLL, then there is only a partial match. How
the degree of sequence similarity is scored depends to some extent on the search engine used to do the searching and its parameters.
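The simple case above can be reproduced with a position-by-position comparison. This sketch shows only one elementary way of scoring, percent identity between equal-length sequences; the function name is hypothetical, and actual search engines generally use substitution matrices and gap handling rather than this bare count.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


perfect = percent_identity("ATYHCP", "ATYHCP")  # all six positions agree
partial = percent_identity("ATYHCP", "ATYLLL")  # only A, T, Y agree
```

Here the perfect match scores 100 and the partial match scores 50, since three of the six positions agree.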
A "target sequence" is the sequence for which information is sought in a process. A
"template sequence" of known sequence is the sequence against which the target sequence is
compared for purposes of furthering the process.
A "gap penalty function" is used to identify gaps or discontinuities in an amino acid
sequence when one sequence is compared to another. For example if one sequence has a sequence ATYHCPLT and the other has the sequence ATYGSVMCPLT then there is a gap
in the first sequence corresponding to the underlined portion of the second sequence.
The "generation of retrievable electronically stored information" refers to the
common operation in the execution of a computer program where the information is stored
in the computer's memory, for example either the primary memory or the "permanent"
memory (e.g., hard drive or tape).
The "E value" of a similarity search is the number of sequence matches that by
chance are expected to be as good as or better than the one retrieved by the search.
A "database of non-redundant database sequences" is a database in which only
unique known protein sequences are retained.
Overview
The native three-dimensional (3D) structure of a protein is valuable in testing,
understanding, and modifying protein function. Thus, it would be useful to know the native structure of the thousands of protein sequences that are emerging from the many genome
projects. While 3D structures of only a tiny fraction of known protein sequences have been
defined by x-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy,
comparative modeling can frequently provide a useful 3D model of a protein.
Comparative modeling relies on the knowledge of related protein structures: It consists of fold assignment by comparison with all known protein structures, alignment with related
known protein structures, model building using the alignment, and model evaluation. The
present invention relates, in part, to the creation, maintenance, and facilitation of the use of an up-to-date database of accurate comparative models for all known protein
sequences that are related to at least one known protein structure. It is envisioned that
approximately 450,000 proteins can be processed initially, resulting in models for
approximately 150,000 proteins, growing at the rate of approximately 50,000 models per
year.
A database can be derived, for example, by an automated modeling "pipeline" relying on
PSI-BLAST and the program Modeller. Comparative models consist of 3D coordinates
for all nonhydrogen atoms in the modeled part of a protein. The database includes
the alignments used to obtain the models. The "pipeline" is designed to maximize both the
number and accuracy of 3-D models. This is achieved by (i) using multiple sequences and structures to increase the sensitivity of fold assignment and accuracy of the alignments, and
(ii) improving the model evaluation scheme to result in a smaller number of false
positives and negatives.
As in any comparative modeling exercise, models are generated in a four step
procedure [4, 5, 11]: (i) Fold assignment, (ii) sequence-structure alignment, (iii) model
building, and (iv) model evaluation. Large-scale comparative modeling is an automated and integrated application of these four steps to thousands of protein sequences, not only a
few. Because large-scale modeling can only be performed in a completely automated manner, the primary current challenge in large-scale comparative modeling is to build an
automated, rapid, robust, sensitive, and accurate comparative modeling pipeline applicable
to whole genomes; such a pipeline should perform at least as well as a human expert on individual proteins. The use of the CLUSTOR program (http://www.activetools.com) allows efficient and robust
execution on a cluster of processors so that approximately 450,000 known protein sequences
can be processed in a reasonable period of time with the existing and requested computers
(i.e., not longer than three months).
Various complete genomic sequences and databases of expressed sequence tags can be
processed.
Fold assignment. Fold assignment can, for example, be done using PSI BLAST.
Alignment. For example, the matched parts of a protein sequence to be modeled and a single related known protein structure can be aligned by the ALIGN2D command of
Modeller. This results in improved alignments due to the placement of gaps in
structurally reasonable contexts. In general, the accuracy of an alignment can also be
increased by comparing many related sequences and structures at the same time.
Model building. For example, model building can be done by Modeller using a single
template structure and the default modeling protocol. Because model accuracy generally
increases with the number of known protein structures used to calculate the model, the
pipeline allows multiple templates to be selected automatically for use by the existing
modeling method [24].
Model evaluation. See the scoring function in Example 4.
Web interface. For Web access to the database see Example 5. A preliminary, small version
of a database, ModBase, is already accessible at http://guitar.rockefeller.edu.
It is essential for the usefulness of the database that it be calculated with the most recent
versions of the TREMBL/SWISS-PROT protein sequence database [2] and the Protein Data
Bank of known protein structures [3].
System implementation
The invention can, for example, be implemented on a cluster of 32 processors (although only a
single processor is needed if the intention is to use the process of the invention on one query
sequence at a time). This cluster is needed in order to help calculate the models for all protein sequences that are related to at least one known protein structure. Equally
important, it is also needed to keep the database of models up-to-date with respect to the
growth in the sequence and structure databases, and the improvements in the modeling software.
The size of the computer system is dictated by the following considerations: The calculation of a small ensemble of models for one protein sequence takes about one hour of CPU time on a single
Pentium III processor. The 32 processors are used for generating and maintaining an exhaustive
database of comparative models. The size of the computations involved is estimated as follows. Given the growth of the number of known protein sequences at a rate of more than 150,000
sequences per year, the growth of the Protein Data Bank at a rate of more than 4,000 structures
per year, and the significant improvements in our modeling software occurring once or twice a
year, a reasonable throughput to keep the database up-to-date is approximately 50,000 models
per 3 months. This allows for 1 hour of CPU time on a single processor per model, approximately the time needed for calculation of one model with the current Modeller
procedure. This CPU time includes all the steps in modeling a given protein sequence, from fold
assignment, through alignment and model building, to model evaluation.
A cluster of 16 2-processor boards, with 256 MB of RAM on each board, is offered by Alta
Technology, Salt Lake City. The Clustor node licenses, from Active Tools, San Francisco, are required for efficient distribution of the individual tasks to the 32 processors.
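The sizing estimate can be checked with simple arithmetic. The sketch below uses the figures given above (32 processors, about 1 hour of CPU time per model, 50,000 models per 3 months); the 90-day length of the 3-month period and the function name are assumptions.

```python
def quarterly_utilization(processors: int = 32,
                          cpu_hours_per_model: float = 1.0,
                          models: int = 50_000,
                          days: int = 90):
    """Return (available CPU hours, required CPU hours, utilization)."""
    available = processors * 24 * days       # CPU hours in one 3-month period
    required = models * cpu_hours_per_model  # CPU hours needed for the models
    return available, required, required / available


available, required, utilization = quarterly_utilization()
```

Under these assumptions the cluster supplies 69,120 CPU hours per 3 months against 50,000 required, a utilization of about 72%, which leaves some margin for failed or repeated calculations.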
References referred to above
[1] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. J. Mol. Biol., 283:707-725, 1998.
[2] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence data bank and
its supplement TrEMBL in 1999. Nuc. Acids Res., 27:49-54, 1999.
[3] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng. Protein data bank.
In F. H. Allen, G. Bergerhoff, and R. Sievers, editors, Crystallographic databases:
Information content, software systems, scientific applications, pages 107-132. Data
Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, 1987.
[4] R. Sanchez and A. Sali. Advances in comparative protein-structure modeling. Curr.
Opin. Struct. Biol., 7:206-214, 1997.
[5] R. Sanchez and A. Sali. Comparative protein structure modeling in genomics. J.
Comp. Phys., 151:388-401, 1999.
[6] A. Sali, R. Matsumoto, H. P. McNeil, M. Karplus, and R. L. Stevens. Three-dimensional models of four mouse mast cell chymases. Identification of proteoglycan-binding regions and protease-specific antigenic epitopes. J. Biol. Chem., 268:9023-9034, 1993.
[7] A. Caputo, M. N. G. James, J. C. Powers, D. Hudig, and R. C. Bleackley. Conversion of
the substrate specificity of mouse proteinase granzyme B. Nature Struct. Biol., 1:364-367, 1994.
[8] C. S. Ring, E. Sun, J. H. McKerrow, G. K. Lee, P. J. Rosenthal, I. D. Kuntz,
and F. E. Cohen. Structure-based inhibitor design by using protein models for the development of antiparasitic agents. Proc. Natl. Acad. Sci. USA, 90:3583-3587, 1993.
[9] M. Carson, C. E. Bugg, L. Delucas, and S. Narayana. Comparison of homology
models with the experimental structure of a novel serine protease. Acta Crystallogr.,
D50:889-899, 1994.
[10] T. Nagata, V. Gupta, W-Y. Kim, A. Sali, B. T. Chait, K. Shigesada, Y. Ito,
and M. H. Werner. Immunoglobulin motif DNA recognition and heterodimerization for
the PEBP2/CBF Runt-domain. Nat. Str. Biol., 6:615-619, 1999.
[11] R. Sanchez and A. Sali. Large-scale protein structure modeling of the
Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.
[12] R. Matsumoto, A. Sali, N. Ghildyal, M. Karplus, and R. L. Stevens. Packaging of proteases and proteoglycans in the granules of mast cells and other hematopoietic cells.
A cluster of histidines in mouse mast cell protease-7 regulates its binding to heparin
serglycin proteoglycan. J. Biol. Chem., 270:19524-19531, 1995.
[13] L. Z. Xu, R. Sanchez, A. Sali, and N. Heintz. Ligand specificity of brain lipid
binding protein. J. Biol. Chem., 271:24711-24719, 1996.
[14] A. Bairoch. PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res.,
20:2013-2018, 1992.
[15] J. S. Fetrow and J. Skolnick. Method for prediction of protein function from
sequence using the sequence-to-structure-to-function paradigm with application to
glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol., 281:949-968, 1998.
[16] D. T. Jones. Genthreader: An efficient and reliable protein fold recognition
method for genomic sequences. J. Mol. Biol., 287:797-815, 1999.
[18] C. A. Orengo, F. M. G. Pearl, J. E. Bray, A. E. Todd, A. C. Martin, L. Lo Conte, and J. M. Thornton. The CATH database provides insights into protein structure/function relationship. Nuc. Acids Res., 27:275-279, 1999.
[19] A. Sali. 100,000 protein structures for the biologist. Nature Structural Biology, 5:1029-1032, 1998.
[20] M. C. Peitsch. PROMOD and SWISS-MODEL - Internet-based tools for automated
comparative protein modeling. Biochem. Soc. Trans, 24:274-279, 1996.
[22] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J.
Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res., 25:3389-3402, 1997.
[23] P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature
Structural Biology, 6:108-111, 1999.
[24] A. Sali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol., 234:779-815, 1993.
[25] S.F. Altschul. Generalized affine gap costs for protein sequence alignment. Proteins,
32:88-96, 1998.
[26] A. Bateman, E. Birney, R. Durbin, S. R. Eddy, R. D. Finn, and E. L. L. Sonnhammer.
Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nuc.
Acids Res., 27:260-262, 1999.
[28] M. J. Sippl. Recognition of errors in three-dimensional structures of proteins.
Proteins, 17:355-362, 1993.
[29] F. Corpet, J. Gouzy, and D. Kahn. Recent improvements of the prodom database of
protein domain families. Nuc. Acids Res., 27:263-267, 1999.
[30] S. E. Brenner, D. Barken, and M. Levitt. The presage database for structural
genomics. Nuc. Acids Res., 27:251-253, 1999.
[31] S. A. Chervitz, E. T. Hester, C. A. Ball, K. Dolinski, S. S. Dwight, M. A. Harris, G. Juvik, A. Malekian, S. Roberts, T. Roe, C. Scafe, M. Schroeder, G. Sherlock, S. Weng, Y.
Zhu, J. M. Cherry, and D. Botstein. Using the Saccharomyces Genome Database (SGD)
for analysis of protein similarities and structure. Nucleic Acids Research, 27:74-78,
1999.
Additional Publications of interest:
R. Sanchez and A. Sali. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.
A. Sali. 100,000 protein structures for the biologist. Nature Structural Biology,
5:1029-1032, 1998.
R. Sanchez and A. Sali. Comparative protein structure modeling in genomics. J.
Comp. Phys., 151:388-401, 1999.
Example 1
Large-scale protein structure modeling of the Saccharomyces cerevisiae genome
(Abbreviations: 3D, three-dimensional; PDB, Protein Data Bank; ORF, open reading
frame.)
(All examples in this patent application are intended to be illustrative of the
invention, not to narrow it.)
Summary
Fold assignment, sequence-structure alignment, model building, and model evaluation were
completely automated. As an illustration, the method was applied to the proteins in the Saccharomyces cerevisiae (baker's yeast) genome. It resulted in all-atom 3D models
for substantial segments of 1071 (17%) of the yeast proteins, only 40 of which have had their 3D structure determined experimentally. Of the 1071 modeled yeast proteins,
236 were related clearly to a protein of known structure for the first time; 41 of these have
not been previously characterized at all.
Sequence matching of the proteins encoded by the Saccharomyces cerevisiae (baker's yeast) genome [6] has resulted in assignment of 58% of the yeast proteins into 11 functional classes
with 93 sub-classes (URL http://www.mips.biochem.mpg.de/mips/yeast/index.html).
One way to add to sequence-based predictions of function would be to determine or
predict the three-dimensional (3D) structures of proteins. The 3D structure of a
protein generally provides more information about its function than its sequence because
interactions of a protein with other molecules are determined by amino acid residues that are close in space but are frequently distant in sequence. In addition, since evolution tends to conserve function and function depends more directly on structure than on
sequence, structure is more conserved in evolution than sequence [7]. The net result is
that patterns in space are frequently more recognizable than patterns in sequence. For
example, several mouse mast cell proteases have a conserved surface region of positively charged residues that binds proteoglycans [8]. This region is not easily
recognizable in the sequence because the constituting residues occur at variable and
sequentially non-local positions that form a binding site only when the protease is fully
folded. Approximately 10,000 protein structures have been determined experimentally by
X-ray crystallography and nuclear magnetic resonance spectroscopy [9] (URL http://www.pdb.bnl.gov/statistics.html), while there are over 450,000 entries in the
GenPept sequence database alone [10]. To bridge this increasingly large gap between the
numbers of known protein sequences and structures, we calculated useful all-atom 3D
models for a significant fraction of the translated open reading frames (ORFs) in the
yeast genome [6]. Specifically, we show how to automate modeling of thousands of
proteins and how to predict the overall accuracy of the models with a high degree of
certainty. We also discuss new ways of using a large number of protein models and point out several unexpected similarities between previously uncharacterized yeast ORFs and
proteins of known structure.
MATERIALS AND METHODS
Protein Structure Modeling Method. Comparative protein structure modeling [11,
12] was the method chosen for this study. Comparative protein structure modeling of a target sequence consists of (i) identification of known structures related to the target sequence
(templates), (ii) alignment of the templates with the target sequence, (iii) building a model
based on the alignment, and (iv) evaluation of the model. This flowchart has been
implemented in a UNIX Perl script that calls the appropriate programs for the individual tasks, each of which is described in more detail below. Program Clustor was used to
distribute efficiently smaller jobs on many workstations, without having to adapt the
individual programs for parallel execution (URL http://www.activetools.com). All the
alignments and models are available on the Internet at URL http://guitar.rockefeller.edu. The models are also accessible through the Saccharomyces Genome Database (SGD) (URL http://genome-www.stanford.edu/Saccharomyces/).
Databases. The 6218 Saccharomyces cerevisiae ORF sequences were obtained from the
SGD, Mycoplasma genitalium and Methanococcus jannaschii sequences were obtained from
The Institute for Genome Research (URL http://www.tigr.org/tdb/mdb/mdb.html), Caenorhabditis elegans sequences from Sanger Centre (URL
ftp://ftp.sanger.ac.uk/pub/databases/wormpep/), and Escherichia coli sequences from the E.
coli Genome Center (URL http://www.genetics.wisc.edu:80/index.html). Experimentally determined protein structures were obtained from the
Protein Data Bank (PDB) (March, 1997) [9].
Template Search. To find template structures for modeling of the translated ORF
sequences, each of the 6218 ORFs from yeast was compared with each of the 2045 potential templates
corresponding to the protein chains representative of the PDB. The representative protein
chains had at most 95% sequence identity to each other, or had length difference of at least
30 residues or 30%; they were also the highest quality structures within each group.
Although a small fraction of the yeast ORFs (< 7%) is likely to be incorrect [3], this is not a
serious limitation because an ORF which matches a known protein structure is likely to
correspond to a real protein. The matching was done by the program Align, which implements the local dynamic programming method with a new gap penalty function and has
a search sensitivity higher than that of Blast [13]. Each ORF-PDB matching was run with
the default gap penalty parameters first. A match was considered significant or insignificant
if the alignment score was more than 22 or less than 19 nats, respectively, where the nat is a
unit for measuring significance of a match [14]. All pairs with intermediate scores between 19 and 22 nats were realigned using 600 combinations of the
gap penalty parameters. The match was finally considered significant if the best of the 600
alignments had a score of at least 22 nats. The matching part of the PDB chain from a
significant hit was used as the template structure for the corresponding region of the ORF.
Target-Template Alignment. To obtain the target-template alignment for comparative
modeling, the matching parts of the template structure and the ORF sequence were
re-aligned by the use of the Align2d command of the Modeller program [15-17]. This
program implements a global dynamic programming method for comparison of two
sequences, but also relies on the observation that evolution tends to place residue insertions and deletions in the regions that are solvent exposed, curved, outside secondary structure
segments, and between two Cα positions close in space. Gaps in these structurally reasonable
positions are favored by a variable gap penalty function that is calculated from the template
structure alone. As a result, the alignment errors are reduced by approximately one third
relative to the standard sequence alignment techniques. Nevertheless, there is clearly a need for even more accurate sequence-structure alignments and for using multiple template
structures, so that more accurate models are obtained [16].
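The two-stage significance test described under Template Search (accept above 22 nats, reject below 19 nats, and re-align the intermediate cases with many gap penalty settings) can be sketched as follows. This is a hedged illustration only: `score_fn` stands in for a call to the Align program and `gap_settings` for the 600 parameter combinations, neither of which is specified here. The text does not say how a default-run score of exactly 19 or 22 nats is handled; the sketch treats both as intermediate.

```python
from typing import Callable, Iterable, Optional, Tuple

SIGNIFICANT_NATS = 22.0    # above this, the default-parameter match is accepted
INSIGNIFICANT_NATS = 19.0  # below this, the match is rejected outright

GapParams = Tuple[float, float]  # hypothetical (opening, extension) penalties


def is_significant_match(
    score_fn: Callable[[Optional[GapParams]], float],
    gap_settings: Iterable[GapParams],
) -> bool:
    """Decide whether an ORF-PDB match is significant.

    score_fn(None) is assumed to return the score in nats obtained with the
    default gap penalties; score_fn(params) the score for one alternative
    setting.
    """
    default_score = score_fn(None)
    if default_score > SIGNIFICANT_NATS:
        return True
    if default_score < INSIGNIFICANT_NATS:
        return False
    # Intermediate case: re-align with every alternative gap penalty setting
    # and keep the best score.
    best = max((score_fn(p) for p in gap_settings), default=default_score)
    return best >= SIGNIFICANT_NATS
```

Only the intermediate matches pay the cost of the repeated alignments, which keeps the expensive parameter scan away from the clear-cut cases.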
Model Building. The refined sequence-structure alignment was used by Modeller to construct a 3D model of the ORF region [15-17]. Model building began by extracting distance and dihedral angle restraints on the target sequence from its alignment with the
template structure. These template-derived restraints were combined with most of the
CHARMM energy terms [18] to obtain a full objective function. Finally, this function was optimized to construct a model that satisfied all the spatial restraints as well as possible.
Assignment of a model into the 'good' or 'bad' class. The overall accuracy of a model
was measured by an overlap between the model and the actual structure. The overlap was
defined as the fraction of residues whose Cα atoms are within 3.5 Å of each other in the globally superposed pair of structures. Models that overlap with the correct structures
in more than 30% of their residues were defined here as 'good' models. A method for
predicting whether or not a given model is good was developed as follows. Using the PDB,
1085 protein chains of known structure that had less than 30% sequence identity to each
other were picked. Comparative models for these proteins were calculated by the standard
procedure described above. In addition, many bad models were obtained by the same
procedure, except that only target-template alignments with a relatively low alignment significance score from 15 to 20 nats were used. In the end, there were 3993 and 6270 good and bad models, respectively. There were more models than proteins because
most proteins were modeled several times on a different template structure each time. The
distribution of the target-template sequence identity for the good models was similar to that
for the matching of the yeast ORFs with PDB chains (Fig. 3A). The quality score
(Q_SCORE) of a model was defined as the ProsaII Z-score [19] divided by the natural
logarithm of sequence length, which made Q_SCORE almost independent of sequence length. The ProsaII Z-score approximates the difference in free energy of an evaluated
model and the mean free energy of the same sequence threaded through unrelated folds,
expressed in units of standard deviation. The free energies were calculated with statistical potentials of mean force for single residues and pairs of residues [19]. The distributions of Q_SCORE for good and bad models were obtained for different sequence length
ranges. The posterior probability that a model was good, given that it had a certain
Q_SCORE value, was obtained by using Bayes' theorem [20] and assuming equal
prior probabilities for good and bad models:
p(GOOD/Q_SCORE) = p(Q_SCORE/GOOD)/[p(Q_SCORE/GOOD) + p(Q_SCORE/BAD)].
A model with p(GOOD/Q_SCORE) above 0.5 is predicted to be in the good class and thus
to have an at least approximately correct fold. For proteins longer than 100 residues, it is possible
to identify good models with less than 5% of false positives and 8% of false negatives (Fig.
2A).
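The three quantities defined above can be written down compactly. The sketch below assumes that the two structures are already globally superposed and that entry i in one coordinate list corresponds to entry i in the other; the function names are hypothetical, and the class-conditional densities p(Q_SCORE/GOOD) and p(Q_SCORE/BAD) are taken as inputs because their estimation from the good and bad model sets is not detailed here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def overlap_fraction(model_ca: Sequence[Point], native_ca: Sequence[Point],
                     cutoff: float = 3.5) -> float:
    """Fraction of residues whose C-alpha atoms lie within `cutoff` angstroms.

    Assumes both coordinate lists are already globally superposed and
    hold one entry per equivalent residue.
    """
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    close = sum(1 for a, b in zip(model_ca, native_ca)
                if math.dist(a, b) <= cutoff)
    return close / len(model_ca)


def q_score(z_score: float, length: int) -> float:
    """Z-score divided by the natural logarithm of the sequence length."""
    return z_score / math.log(length)


def p_good(p_q_given_good: float, p_q_given_bad: float) -> float:
    """Posterior probability of the good class under equal priors."""
    return p_q_given_good / (p_q_given_good + p_q_given_bad)
```

With equal priors the prior terms cancel, so a model is assigned to the good class exactly when the good-class density at its Q_SCORE is the larger of the two.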
Prediction of the overall accuracy of a model. For the models predicted to be in the good
class, the fraction of the Cα atoms modeled within 3.5 Å of the correct positions depends
on the percentage sequence identity between the modeled sequence and the template. This dependence was determined by using the 3993 good models for proteins of known structure
described in the previous paragraph (Fig. 2B). Above 40% sequence identity, the median
overlap between a model and the corresponding experimental structure is more than 90%
(Fig. 2A). There are few errors in the alignment and the model is as close to the correct
structure as the template. Many models in this range have errors that are comparable to the differences between experimental structures of the same protein determined by different
techniques or in different environments [12]. For 30-40% sequence identity, the overlap
between a model and the corresponding experimental structure is 75-90%. Because the
alignment errors begin to appear, the models overlap with the correct structures less than the
templates do. At very low sequence identity of less than 30%, the overlap drops to 50-75%. These model evaluation results can be understood in terms of the well known relationship between structural and sequence similarities of two proteins [7], the "geometrical"
nature of modeling that forces the model to be as close to the template as possible [15], and
the inability of any current modeling procedure to recover from an incorrect alignment [16].
RESULTS
Template Search. The ORF-PDB matching procedure identified one or more possibly
related structures for 2256 or 36.3% of the ORFs (Fig. 3A). The average length of the local
alignments was 174 residues and the average pairwise sequence identity was 27%.
Evaluation of the Models. Model evaluation indicates that 1071 (17.2%) of the yeast ORFs
have at least one segment of residues with a reliable model (Fig. 3). A small number
of ORFs have a reliable model for more than one domain, resulting in a total of 1168 non-overlapping reliable models for all ORFs. In comparison, only 40 of the yeast
proteins have had their structures determined experimentally [9]. The average length of a
reliable model is 176 residues and 85% of the reliable models are longer than 50 residues.
The average pairwise sequence identity on which the reliable models are based is 34%. Most
of the models based on more than the average sequence identity are predicted to overlap
with the correct structures in more than 80% of their residues (Fig. 2B).
Modeling of Additional Genomes. The fraction of the ORFs in yeast that appear to be
modeled reliably is similar to that for several other genomes. The application of our modeling procedure to the genomes of Escherichia coli (4290 ORFs), Mycoplasma genitalium (468 ORFs), Caenorhabditis elegans (7299 ORFs, incomplete), and
Methanococcus jannaschii (1735 ORFs) resulted in 3D models for 18.1%, 19.2%,
20.4%, and 15.7% of all ORFs, respectively. The E. coli number can be compared with
another study in which comparative models based on a suitable template were calculated for 10-15% of the E. coli ORFs [21].
Fold Assignment Rate. Fold recognition [22], sequence profile methods [23] and
Hidden Markov Models [24] are generally considered to be more sensitive for detecting
remote relationships than the local sequence alignment applied here. Thus, in the future,
these methods will supplement the matching by pairwise sequence comparison in our
pipeline for automated comparative protein structure modeling. However, it is not clear how many more accurate models can be calculated for the matches from the more
sophisticated methods. The reason is that accurate 3D modeling requires both a correct fold
assignment and an approximately correct target-template alignment. Unfortunately, it
appears that when a correct target-template match is made in the absence of statistically
significant sequence similarity already detectable by simple methods, it is rarely possible to produce an accurate alignment [25]. Nevertheless, we now estimate what would have
happened to the fold assignment rate alone if fold recognition and Hidden Markov Models
were applied to the yeast genome. A recent automated fold recognition survey assigned
folds to 103 (22.0%) of the 468 ORFs in the small Mycoplasma genitalium genome [26]. In comparison, our procedure resulted in reliable models for 90 of the 468 ORFs (19.2%), 81 of which were shared with the fold recognition survey. For another
benchmark, the PFAM database obtained by Hidden Markov Models [27] related 315 yeast
proteins to a protein with known structure, which is a relatively small fraction of the 1071
matches obtained here. We identified 263 of the 315 PFAM matches, and 248 of these corresponded to reliable models. Thus, fold recognition and Hidden Markov Models
would provide a small but significant increase in the number of target-template matches for
model evaluation by our combined alignment/modeling approach. However, even the
existing procedure based on local sequence alignment appears to be able to identify some matches that were not identified by fold recognition (there are nine such cases for the M. genitalium genome). The reason is that in the combined alignment/modeling procedure
the final decision about whether or not a given match is correct is made by evaluating the 3D
model implied by the alignment, rather than by scoring the alignment directly. Because
model evaluation works well (Fig. 2A), the cutoff for accepting a match at the sequence
matching stage can be lowered significantly, thus minimizing the loss of correct matches
without adding many false positives. This results in a relatively large number of reliable
models based on low sequence similarity (Fig. 3); for example, 261 yeast ORFs have at
least one reliable model based on a match with a significance score worse than 24
nats (Figure 3B), which is too low to establish a real relationship. The combined
alignment/modeling approach to confirming a remote relationship has already been proven
successful in several individual cases [16, 28, 29]. Another example is the model of the
component PRE4 of the yeast 20S proteasome complex (YFR050C). The model was based on the structure of subunit B of the Thermoplasma acidophilum (1pmaB) proteasome; the
target and the template have only 16% sequence identity, with an alignment significance
score of 22 nats. However, the model of YFR050C was predicted to be good (pG = 0.99).
The crystallographic structure of YFR050C, determined after the model was calculated
(1rypN) [38], showed that the fold assignment was correct.
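The M. genitalium comparison above can be checked with simple set arithmetic. The snippet uses only the counts quoted in the text; the combined (union) figure it derives is an illustrative calculation and is not a result reported here.

```python
TOTAL_ORFS = 468        # M. genitalium ORFs
FOLD_RECOGNITION = 103  # ORFs assigned a fold by the survey [26]
RELIABLE_MODELS = 90    # ORFs with reliable models from the present procedure
SHARED = 81             # ORFs found by both approaches

only_modeling = RELIABLE_MODELS - SHARED           # missed by fold recognition
only_fold_recognition = FOLD_RECOGNITION - SHARED  # missed by the present procedure
union = FOLD_RECOGNITION + RELIABLE_MODELS - SHARED


def percent(count: int, total: int = TOTAL_ORFS) -> float:
    """Percentage of all ORFs, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


print(only_modeling, only_fold_recognition, union)
print(percent(FOLD_RECOGNITION), percent(RELIABLE_MODELS), percent(union))
```

The nine ORFs that are modeled reliably but not found by fold recognition are the nine cases mentioned in the text.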
DISCUSSION
Usefulness of Models with Errors. It is essential for assessing the value of 3D
protein models to estimate their overall accuracy [19, 30]. In general, mistakes in comparative modeling include sidechain packing errors, small distortions and rigid body
shifts in correctly aligned regions, errors in inserted regions (loops), incorrect alignments,
and incorrect templates [16]. Fortunately, a 3D model does not have to be absolutely perfect to be helpful in biology [11]. One reason is that knowing only the fold of a
protein is relatively frequently sufficient to predict its approximate biochemical function.
For example, only nine out of 80 fold families known in 1994 contained proteins
(domains) that were not in the same functional class, although 32% of all protein structures belonged to one of the nine superfolds [4]. A model is likely to have the correct fold when
the overlap with the actual structure is at least 30%. Such models are obtained when a
correct template and an approximately correct alignment are used. This appears to be the case for 1071 ORFs, as predicted by our model evaluation procedure (Figure 2). Models for
two yeast ORFs calculated before the actual structures were deposited in the PDB are discussed in Sanchez and Sali, Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998).
Almost half of the 1071 reliably modeled ORFs share more than approximately 35%
sequence identity with their templates (Figure 3A). In such cases, it is frequently possible
to predict correctly important features of the target protein that do not occur in the template
structure. For example, the location of a binding site can be predicted from clusters of
charged residues [8], and the size of a ligand can be predicted from the volume of the
binding site cleft [31].
Usefulness of Comparative Models. Comparative models are calculated from a sequence
alignment between the protein to be modeled and a related protein of known structure. Thus,
a question arises as to what additional insights that are not already possible from sequence
matching alone can possibly be obtained by 3D modeling. The first advantage of 3D modeling is that it provides the best way of either confirming or rejecting a remote match [16], as discussed above. This is important because most of the related protein pairs share
less than 30% sequence identity (Fig. 3A). For example, only 10.7% of the yeast ORFs
have been matched reliably with known structures by Fasta (URL
http://pedant.mips.biochem.mpg.de/frishman/pedant.html), as opposed to 17.2% in our
study. Another case in point is that 236 of the 1071 yeast ORFs with reliable models had
no previously identified links to a protein of known structure in the major annotations
of the yeast genome, including Sacch3D, Pedant, GeneQuiz, and PFAM (Table 1). Of these
236 proteins for which some structural information is now available, 41 also did not have a
clear link to a protein sequence with known function. A subset of these 41 newly
characterized proteins is listed in Table 1. Additional confidence in these matches is
provided by the conservation of the known functionally important residues in the target
models.
The second advantage of 3D modeling over sequence matching is that some binding and
active sites cannot possibly be found by searching for local sequence patterns [32,33], but frequently should be detectable by searching for small 3D motifs that are known to bind or
act on specific ligands [34]. This is a consequence of the facts (i) that structure is more
conserved than sequence [7], (ii) that 3D motifs tend to consist of residues distant in sequence, and (iii) that there are some 3D motifs whose residues do not follow the same order in sequence, even though they have the same arrangement in space. An example of
this is the serine catalytic triad that almost certainly arose by convergent evolution in serine
proteases of the trypsin and subtilisin type, and also in some lipases [34]. The 3D motifs
could be defined in terms of features extracted from known protein-ligand structures, such as
the constituting atoms and distances between them, shape, secondary structure, and electrostatic properties. Enumeration of active and binding sites for many proteins in
the genome, such as various metal and nucleotide binding sites, will facilitate
experimental determination of protein function.
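A distance-based 3D motif of the kind described above can be sketched as a small search over residue triples. The example is hypothetical throughout: the `Residue` record, the use of one representative atom per residue, the reference distances, and the tolerance are illustrative assumptions and are not taken from any published motif definition.

```python
import math
from itertools import product
from typing import Dict, List, NamedTuple, Tuple


class Residue(NamedTuple):
    number: int
    name: str                        # three-letter code, e.g. "SER"
    xyz: Tuple[float, float, float]  # one representative atom per residue


def find_motif(
    residues: List[Residue],
    motif_names: Tuple[str, str, str],
    motif_distances: Dict[Tuple[int, int], float],
    tolerance: float = 1.0,
) -> List[Tuple[int, ...]]:
    """Return residue-number triples that match the motif geometry.

    motif_distances maps index pairs (0, 1), (0, 2), (1, 2) of the motif
    to reference distances in angstroms. Sequence order is ignored, so
    residues that are distant or permuted in sequence can still match.
    """
    candidates = [[r for r in residues if r.name == n] for n in motif_names]
    hits = []
    for triple in product(*candidates):
        if len({r.number for r in triple}) < 3:
            continue  # the same residue cannot fill two motif positions
        if all(
            abs(math.dist(triple[i].xyz, triple[j].xyz) - d) <= tolerance
            for (i, j), d in motif_distances.items()
        ):
            hits.append(tuple(r.number for r in triple))
    return hits
```

Because only spatial distances are compared, a search of this form does not depend on the order of the motif residues along the chain, which is the property that local sequence patterns lack.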
The third advantage of 3D modeling over sequence matching is that a 3D model
frequently allows a refinement of the functional prediction based on sequence alone because the ligand binding is most directly determined by the structure of the binding site rather than
its sequence. An example of this is provided by a predicted SH3 domain in the yeast ORF
YDL117W (Tab. 1). Since there are known 3D structures of SH3 domains bound to
proline-rich peptide ligands, it was possible to calculate a 3D model of such a complex for
the putative yeast SH3 domain (The 3-dimensional coordinates of the protein with the SH3 domain and those of the ligand are shown in tables B-2 and B-3; see also Fig. 4). Based on
the model, the SH3 residues that interact with the peptide were predicted. This model can then be used to construct site-directed mutants with altered or destroyed binding capacity,
which in turn could be used to test hypotheses about the sequence-structure-function
relationships for this SH3 domain. In addition, since the structural features of the putative
binding site are similar to the features of the well characterized SH3 domains, the model of
the complex increases the likelihood that an actual SH3 domain has been recognized,
irrespective of the specific peptide ligand modeled into the SH3 cleft.
Conclusion. Our results show that comparative modeling efficiently increases the value
of sequence information from the genome projects, although it is not yet possible to model
all proteins with useful accuracy. The main bottlenecks are the absence of structurally
defined members in many protein families and the difficulties in detection of weak similarities, both for fold recognition and sequence-structure alignment. However, while
only 900 out of the total of a few thousand domain folds are known [35, 36], the
structure of most globular folds is likely to be determined in less than ten years [35]. Thus, comparative modeling will conceivably be applicable to most of the globular protein
domains close to the completion of the human genome project.
The computations were done on Silicon Graphics, SUN, DEC and Linux PC computers.
References for Example 1
[1] Oliver, S. G. (1996) Nature 379, 597-600.
[2] Koonin, E. V & Mushegian, A. R. (1996) Curr. Opin. Gen. Dev. 6, 757-762.
[3] Dujon, B. (1996) Trends Genet. 12, 263-270.
[4] Orengo, C. A, Jones, D. T, & Thornton, J. M. (1994) Nature 372, 631-634.
[5] Miklos, G. L. G & Rubin, G. M. (1996) Cell 86, 521-529.
[6] Goffeau, A, Barrell, B. G, Bussey, H, Davis, R. W, Dujon, B, Feldmann, H, Galibert, F,
Hoheisel, J. D, Jacq, C, Johnston, M, Louis, E. J, Mewes, H. W, Murakami, Y, Philippsen,
P, Tettelin, H, & Oliver, S. G. (1996) Science 274, 563-567.
[7] Chothia, C & Lesk, A. M. (1986) EMBO J. 5, 823-826.
[8] Matsumoto, R, Sali, A, Ghildyal, N, Karplus, M, & Stevens, R. L. (1995) J. Biol.
Chem. 270, 19524-19531.
[9] Abola, E. E, Bernstein, F. C, Bryant, S. H, Koetzle, T, & Weng, J. (1987) in Crystallographic databases _ Information, content, software systems, scientific applications,
eds. Allen, F. H, Bergerhoff, G, & Sievers, R. (Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester), pp. 107-132.
[10] Benson, D. A, Boguski, M. S, Lipman, D. J, Ostell, J, & Ouellette, B. F. F. (1997)
Nucleic Acids Res 26, 1-7.
[11] Johnson, M. S, Srinivasan, N, Sowdhamini, R, & Blundell, T. L. (1994) CRC
Crit. Rev. Biochem. Mol. Biol. 29, 1-68.
[12] Sanchez, R & Sali, A. (1997) Curr. Opin. Struct. Biol. 7, 206-214.
[13] Altschul, S. F. (1998) Proteins 32, 88-96.
[14] Altschul, S. F & Gish, W. (1996) Methods Enzymol 266, 460-480.
[15] Sali, A & Blundell, T. L. (1993) J. Mol. Biol. 234, 779-815.
[16] Sanchez, R & Sali, A. (1997) Proteins Suppl. 1, 50-58.
[17] Dunbrack Jr., R. L, Gerloff, D. L, Bower, M, Chen, X, Lichtarge, O, & Cohen, F. E.
(1997) Folding & Design 2, R27-R42.
[18] Brooks, B. R, Bruccoleri, R. E, Olafson, B. D, States, D. J, Swaminathan, S, &
Karplus, M. (1983) J. Comp. Chem. 4, 187-217.
[19] Sippl, M. J. (1993) Proteins 17, 355-362.
[20] Box, G. E. P & Tiao, G. C. (1992) Bayesian Inference in Statistical Analysis.
(Wiley-Interscience).
[21] Peitsch, M. C, Wilkins, M. R, Tonella, L, Sanchez, J. C, Appel, R. D, & Hochstrasser,
D. F. (1997) Electrophoresis 18, 498-501.
[22] Bowie, J. U, Lüthy, R, & Eisenberg, D. (1991) Science 253, 164-170.
[23] Altschul, S. F, Madden, T. L, Schaffer, A. A, Zhang, J. Z, Miller, W, & Lipman, D. J.
(1997) Nucleic Acids Res. 25, 3389-3402.
[24] Krogh, A, Brown, M, Mian, I. S, Sjolander, K, & Haussler, D. (1994) J. Mol.
Biol. 235, 1501-1531.
[25] Levitt, M. (1997) Proteins Suppl. 1, 92-104.
[26] Fischer, D & Eisenberg, D. (1997) Proc. Natl. Acad. Sci. USA 94, 11929-11934.
[27] Sonnhammer, E. L. L, Eddy, S. R, & Durbin, R. (1997) Proteins 28, 405-420.
[28] Guenther, B, Onrust, R, Sali, A, O'Donnell, M, & Kuriyan, J. (1997) Cell 91,
335-345.
[29] Wolf, E, Vassilev, A, Makino, Y, Sali, A, Nakatani, Y, & Burley, S. K. (1998) Cell
94, 51-61.
[30] Lüthy, R, Bowie, J. U, & Eisenberg, D. (1992) Nature 356, 83-85.
[31] Xu, L. Z, Sanchez, R, Sali, A, & Heintz, N. (1996) J. Biol. Chem. 271, 24711-24719.
[32] Bairoch, A. (1992) Nucl. Acids Res. 20, 2013-2018.
[33] Pawson, T. (1995) Nature 373, 573-580.
[34] Wallace, A, Borkakoti, N, & Thornton, J. M. (1997) Protein Sci. 6, 2308-2323.
[35] Holm, L & Sander, C. (1996) Science 273, 595-602.
[36] Hubbard, T. J. P, Murzin, A. G, Brenner, S. E, & Chothia, C. (1997) Nucleic Acids
Research 25, 236-239.
[37] Shilton, B. H, Li, Y, Tesier, D, Thomas, D. Y, & Cygler, M. (1996) Protein Sci. 5,
395.
[38] Groll, M, Ditzel, L, Lowe, J, Stock, D, Bochtler, M, Bartunik, H. D, & Huber,
R. (1997) Nature 386, 463.
[39] Sicheri, F & Kuriyan, J. (1997) Curr. Opin. Struct. Biol. 7, 777-785.
[40] Musacchio, A, Saraste, M, & Wilmanns, M. (1994) Nat. Str. Biol. 1, 546-551.
[41] Nicholls, A, Sharp, K. A, & Honig, B. (1991) Proteins 11, 281-296.
[42] Wallace, A. C, Laskowski, R. A, & Thornton, J. M. (1995) Protein Engineering 8,
127-134.
Table 1: Examples of previously uncharacterized yeast proteins. These ORFs do
not have clear similarity to any protein of known function according to the
following sources (October 31, 1997): MIPS (URL
http://www.mips.biochem.mpg.de/mips/yeast/index.html), YPD (URL http://quest7.proteome.com/YPDhome.html), GeneQuiz (URL
http://www.sander.ebi.ac.uk/genequiz), Sacch3D (http://.www.sander.ebi.ac.uk/genequiz),
Pedant (URL http://pedant.mips.biochem.mpg.de/frishman/pedant.html), and PFAM [27].
The examples were selected partly by considering conservation of the functionally important
residues (conserved features). Thus they have higher sequence similarity to known protein structures than most of the other previously uncharacterized yeast proteins. For each ORF and
its corresponding template, the starting and ending residues of the matching regions are
indicated. The number in parentheses in the percent sequence identity column is the
alignment significance score in nats [13]. The overall model accuracy is given by
p(GOOD/Q_SCORE). The complete list of 236 previously uncharacterized yeast proteins with reliable models is available at http://guitar.rockefeller.edu.
Table 1
Examples of previously uncharacterized yeast proteins with reliable models
Yeast protein         Related protein of known 3D structure
ORF       Residues    PDB code   Residues   Name
YDL117W   13-64       1lckA      65A-115A   P56-LCK SH3 domain
YCR033W   885-935     1idz       140-190    C-MYB DNA binding domain
YNL181W   44-341      1fmcA      2A-215A    7-α-hydroxysteroid dehydrogenase
YOR221C   124-368     1mla       87-296     malonyl-CoA ACP transacylase
YPL217C   63-182      1etu       5-145      elongation factor Tu (domain I)
Table 1 (continued)
Yeast protein         Percent sequence   Model
ORF       Residues    identity           accuracy   Conserved features
YDL117W   13-64       30 (24.5)          0.97       W31 conserved; other binding residues conserved or similar.
YCR033W   885-935     21 (22.3)          0.99       N interacting with DNA is conserved; K's replaced by R's.
YNL181W   44-341      14 (25.5)          0.98       K163 conserved; Y159F.
YOR221C   124-368     17 (23.7)          0.95       Active site residues S92, R117, and H201 are conserved.
YPL217C   63-182      22 (22.7)          0.86       GTP binding loops are similar. Conserved GKTTL motif.
Table 2
1 LYS N 8.73 -1.21 17.5 27 ILE N 1.009 10.3 13.514
1 LYS CA 9.955 -1.882 17.01 27 ILE CA 0.728 9.488 12.369
1 LYS CB 10.104 -3.274 17.65 27 ILE CB 0.405 8.059 12.7
1 LYS CG 10.136 -3.27 19.17 27 ILE CG2 -0.736 8.049 13.737
1 LYS CD 11 .29 -247 19.78 27 ILE CG1 0.073 7.296 11.407
1 LYS CE 1 1.255 -2.468 21.32 27 ILE CD1 1.196 7.277 10.385
1 LYS NZ 12.492 -1.872 21.86 27 ILE C -0.423 10.023 11.583
1 LYS C 9.849 -2.097 15.54 27 ILE O -1.558 10.066 12.053
1 LYS O 8.754 -2.081 14.98 28 ALA N -0.123 10.476 10.35
2 ALA N 10.998 -2.299 14.87 28 ALA CA -1.128 10.859 9.406
2 ALA CA 10.93 -2.547 13.46 28 ALA CB -0.539 1 1.46 8.1 19
2 ALA CB 12.091 -1.923 12.67 28 ALA C •1.818 9.586 9.035
2 ALA C 11.017 -4.026 13.29 28 ALA O -3.036 9.541 8.858 ALA O 11.974 -4.664 13.73 29 GLY N -1.011 8.508 8.936 ARG N 9.967 -4.617 12.7 29 GLY CA -1.472 '7.196 8.583 ARG CA 9.951 -6.025 12.46 29 GLY C -0.771 *6.717 7.342 ARG CB 8.548 -6.588 12.16 29 GLY O -0.484 5.528 7.22 ARG CG 7.785 -5.932 11 .01 30SER N -0.491 7.605 6.368 ARG CO 6.346 -6.446 10.93 30SER CA 0.198 7.157 5.184 ARG NE 5.879 -6.614 12.34 30 SER CB 0.324 8.263 4.12 ARG CZ 4.973 -7.572 12.67 30SER OG -0.963 8.664 3.673 ARG NH1 4.416 -8.362 1 1.7 30 SER C 1.592 6.754 5.56 ARG NH2 4.614 -7.759 13.97 30 SER O 1.956 5.579 5.549 ARG C 10.892 -6.377 11 .35 31 TRP N 2.406 7.766 5.919 ARG O 11.51 -7.437 11.37 31 TRP . CA 3.758 7.561 6.347 TYR N 11.022 -5.494 10.34 31 TRP CB 4.821 8.207 5.426 TYR CA 1 1.873 -5.83 9.235 31 TRP CG 4.993 7.527 4.075 TYR CB 11.105 -6.078 7.928 31 TRP CD2 5.951 6.479 3.819 TYR CG 10.235 -7.269 8.131 31 TRP CD1 4.336 7.742 2.904 TYR CD1 10.717 -8.542 7.935 31 TRP NE1 4.819 6.9 1.923 TYR CD2 8.93 -7.109 8.528 31 TRP CE2 5.809 6.121 2.476 TYR CE1 9.911 -9.64 8.125 31 TRP CE3 6.862 5.874 4.628 TYR CE2 8.113 -8.196 8.722 31 TRP CZ2 6.593 5.139 1.928 TYR CZ 8.606 -9.464 8.519 31 TRP CZ3 7.651 4.883 4.077 TYR OH 7.773 -10.585 8.715 31 TRP CH2 7.516 4.525 2.749 TYR C 12.789 -4.683 8.959 31 TRP C 3.824 8.217 7.687 TYR O 12.542 -3.555 9.374 31 TRP O 3.134 9.208 7.931 GLY N 13.91 -4.974 8.266 32 PHE N 4.634 7.657 8.604 GLY CA 14.816 -3.93 7.899 32 PHE CA 4.727 8.178 9.937 GLY C 14.225 -3.242 6.713 32 PHE CB 5.01 7.104 11.004 GLY O 13.528 -3.864 5.91 32 PHE CG 3.818 6.259 11.263 TRP N 14.482 -1.927 6.57 32 PHE CD1 3.398 5.323 10.351 TRP CA 13.966 -1.249 5.42 32 PHE CD2 3.141 6.388 12.45 TRP CB 12.664 -0.476 5.69 32 PHE CE1 2.298 4.542 10^613 TRP CG 11.902 -0.047 4.459 32 PHE CE2 2.046 5.605 12.72 TRP CD2 11.907 1.279 3.909 32 PHE CZ 1.619 4.677 11.799 TRP CD1 11.069 -0.789 3.674 32 PHE C 5 913 9.081 10.001 TRP NE1 10.562 -0.01 2.663 32 PHE O 6.851 8.959 9.215 TRP CE2 11.066 1.265 2.796 33 TYR N 
5.888 10.027 10.959 TRP CE3 12.554 2.416 4.295 33 TYR CA * 7.029 10.864 11.16 TRP CZ2 10.861 2.389 2.052 33 TYR CB 6.687 12.36 11.286 TRP CZ3 12.347 3.549 3.541 33 TYR CG 7.97 13.113 11.228 TRP CH2 11.516 3 533 2.441 33 TYR CD1 8.513 13.436 10.007 TRP C 15.019 0.284 4.995 33 TYR CD2 8.628 13.498 12.373 TRP O 15.772 0.226 5.824 33 TYR CE1 9.696 14.131 9.926 Table 2 (continued)
7 SER N 15.11 -0.024 3.677 33 TYR CE2 9 812 14.194 12.299 7 SER CA 16.128 0.868 3.212 33 TYR CZ 10.348 14.512 11.074 7 SER CB 16.951 0.296 2.041 33 TYR OH 11.563 15.225 1 1 7 SER OG 17.954 1.219 1.64 33 TYR C 7.571 10.392 12.47 7 SER C 15.427 2.089 2.712 33 TYR O 6.83 10.281 13.446
7 SER O 14.465 2 1.948 34 GLY N 8.877 10.062 12.526
8 GLY N 15.901 3.269 3.156 34 GLY CA 9.35 9.5 13.758 8 GLY CA 15.296 4.502 2.749 34 GLY C 10.794 9.818 13.996 8 GLY C 15.546 4.672 1.291 34 GLY O 11.495 10.367 13.145
8 GLY O 16.69 4.694 0.836 35 LYS N 11.242 9.451 15.215
9 GLN N 14.449 4.758 0.519 35 LYS CA 12.57 9.622 15.72 9 GLN CA 14.565 4.948 -0.889 35 LYS CB 12.585 10.285 17.107 GLN CB 13.236 4.681 -1.616 35 LYS CG 13.93 10.191 17.83 GLN CG 13.335 4.803 -3.133 35 LYS CD 14.067 11.109 19.045 GLN CD 12.03 4.288 -3.724 35 LYS CE 15.381 * 10.917 19.8 GLN OE1 11.658 4.649 -4.838 35 LYS NZ 15.385 11.738 21.031 GLN NE2 11.317 3.415 -2.961 35 LYS C 13.16 8.267 15.916 GLN C 15.005 6.353 -1.175 35 LYS O 12.499 7.344 16.391 GLN O 15.909 6.575 -1.986 36 LEU N 14447 8.123 15.566 THR N 14.403 7.331 -0.462 36 LEU CA 15 105 6.862 15.739 THR CA 14.609 8.724 -0.735 36 LEU CB 16 167 6.568 14.664 THH CB 13.313 9.415 -1.075 36 LEU CG 15.584 6.482 13.24 THR OG1 13.537 10.723 -1.581 36 LEU CD2 14.406 5.493 13.183 THR CG2 12.422 9.454 0.178 36 LEU CD1 16.678 6.178 12.201 THR C 15.249 9.407 0.444 36 LEU C 15.825 6.923 17.04 THR O 15.693 8.775 1.402 36 LEU O 16.601 7.844 17.29 LYS N 15.372 10.747 0.332 37 LEU N 15.567 5.937 17.919 LYS CA 15.938 1 1.634 1.309 37 LEU CA 16.252 5.898 19.175 LYS CB 15.993 13.083 0.783 37 LEU CB 15.804 4.749 20.089 LYS CG 16.576 14.097 1.772 37 LEU CG 14.354 4.896 20.57 LYS CD 16.779 15.504 1.194 37 LEU CD2 14.059 6.337 21.02 LYS CE 18.057 15.671 0.371 37 LEU CD1 14.029 3.852 21.65 LYS NZ 18.22 17.091 -0.019 37 LEU C 17.687 5.684 18.854 LYS C 15.102 11.67 2.554 37 LEU O 18.571 6.217 19.521 LYS O 15.63 11.629 3.663 38 ARG N 17.945 4.879 17.807 GLY N 13.767 11.761 2.391 38 ARG CA 19.292 4.623 17.409 GLY CA 12.856 11.95 3.492 38 ARG CB 19.514 3.243 16.761 GLY C 12.757 10.787 4.44 38 ARG CG 20.979 2.983 16.404 GLY O 12.737 10.978 5.657 38 ARG CD 21.862 2.755 17.634 ASP N 12.672 9.548 3.93 38 ARG NE 23.284 2.875 17.205 ASP CA 12.423 8.424 4.795 38 ARG CZ 24.27 2.899 18.151 ASP CB 11.711 7.274 4.065 38 ARG NH1 23 952 2.717 19.465 ASP CG 12.52 6.935 2.827 38 ARG NH2 25.567 3.125 17.788 ASP OD1 13.666 7.442 2.7 38 ARG C 19.637 5.635 16.372 ASP OD2 11.985 6.186 1 .97 38 ARG O 18 863 5.941 15.466 ASP C 13.661 7.934 5.489 39 ASN N 
20.845 6.19 16.489 ASP O 14.785 8.263 5.109 39 ASN CA 21.337 7.172 15.575 LEU N 13.47 7.134 6.567 39 ASN CB 21.086 6.797 14.101 LEU CA 14.603 6.632 7.297 39 ASN CG 21.747 7.84 13.207 LEU CB 14.475 6.597 8.829 39 ASN OD1 21.116 8.804 12.779 LEU CG 14.062 7.919 9.468 39 ASN ND2 23.063 7.655 12.915 LEU CD2 14.373 7.96 10.97 39 ASN C 20.667 8.47 15.852 LEU CD1 12.589 8.163 9.163 39 ASN O 21.069 9.494 15.3 LEU C 14.81 5.196 6.95 40 LYS N 19.673 8.459 16.769 LEU O 13.88 4.487 6.568 40 LYS CA 18.999 9.657 17.185 Table 2 (continued)
5 GLY N 16.067 4.731 7.087 40 LYS CB 19 913 10.584 18.005 5 GLY CA 16.382 3.354 6.86 40 LYS CG 19.221 11.827 18.574 5 GLY C 16.545 2.779 8.223 40 LYS CD 20.094 12.608 19.565 5 GLY O 16.927 3.486 9.156 40 LYS CE 19.504 13.954 19.988 6 PHE N 16.246 1.478 8.388 40 LYS NZ 20.478 14.712 20.797 6 PHE CA 16.36 0.946 9.708 40 LYS C 18.579 10.39 15.956 6 PHE CB 15.12 1.235 10.57 40 LYS O 18.669 11.614 15.897 6 PHE CG 13.906 0.831 9.807 41 LYS N 18.104 9.648 14.936 6 PHE CD 1 13.456 -0.469 9.813 41 LYS CA 17.77 10.297 13.707 6 PHE CD2 13.218 1.773 9.075 41 LYS CB 17.216 9.359 12.626 6 PHE CE1 12.33 -0.817 9.106 41 LYS CG 17.436 9.899 11.212 PHE CE2 12.092 1.431 8.367 41 LYS CD 16.977 1 1.342 11.003 PHE CZ 11.646 0.133 8.386 41 LYS CE 17.424 11.938 9.664 PHE C 16.611 -0.519 9.676 41 LYS NZ 17.159 13.394 9.637 PHE O 16.405 -1.197 8.671 41 LYS C 16.693 11.255 14.041 LEU N 17.103 -1.025 10.82 41 LYS O 16.678 12.365 13.53 LEU CA 17 395 -2.409 10.99 42 CYS N 15.719 10.804 14.849 LEU CB 18.75 -2.647 1 .68 42 CYS CA 14.712 1 Ϊ.656 15.406 LEU CG 19.948 -2.132 10.86 42 CYS CB 15 225 12.597 16.521 LEU CD2 21.285 -2.604 11.45 42 CYS SG 16.477 13.803 15.98 LEU CD1 19.894 -0.60S 10.66 42 CYS C 14.057 12.446 14.326 LEU C 16.321 -2 93 11.88 42 CYS O 13.455 13.486 14.587 LEU O 15.732 -2.186 12.67 43 SER N 14.131 11.967 13.074 GLU N 16.016 -4.231 11.77 43 SER CA 13.5 12.714 12.037 GLU CA 14.984 -4.794 12.58 43 SER CB 14.321 13.925 11.57 GLU CB 14.786 -6.293 12.31 43 SER OG 14.545 14.812 12.654 GLU CG 14.342 -6.591 10.87 43 SER C 13.385 11.82 10.864 GLU CD 14.608 -8.068 10.6 43 SER O 14.302 11.061 10.548 GLU OE1 15.09 -8.765 11.54 44 GLY N 12.232 11.893 10.183 GLU OE2 14.341 -8.513 9.454 44 GLY CA 12.107 11.117 8.995 GLU C 15.446 -4.653 13.99 44 GLY C 10.779 10.445 9 GLU O 16.639 -4.755 14.28 44 GLY O 10.012 10.536 9.956 GLY N 14.505 -4.386 14.92 45 TYR N 10.52 9.685 7.918 GLY CA 14.838 -4.294 16.31 45 TYR CA 9.284 8.986 7.76 GLY C 15.061 -2.863 16.69 45 TYR CB 8.478 9.402 6.517 GLY O 15.207 
-2.553 17.87 45 TYR CG 7.964 10.783 6.731 ASP N 15.087 -1.946 15.71 45 TYR CD1 8.775 1 1.88 6.55 ASP CA 15.309 -0 56 16.02 45 TYR CD2 6.649 10.972 7.089 ASP CB 15.615 0.29 14.77 45 TYR CE1 8.273 13.145 6.748 ASP CG 16.229 1.619 15.19 45 TYR CE2 6.146 12.235 7.288 ASP OD1 16.37 1.853 16.42 45 TYR CZ 6.962 13.325 7.119 ASP OD2 16.568 2.419 14.28 45 TYR OH 6.454 14.626 7.319 ASP C 14.055 -0.042 16.67 45"TYR C 9.587 7.535 7.592 ASP O 12.964 -0.528 16.37 45 TYR O 10.615 7.157 7.031 ILE N 14.177 0.969 17.55 46 PHE N 8.69 6.683 8.125 ILE CA 13.032 1.495 18.25 46 PHE CA 8.814 5.26 7.992 ILE CB 13.237 1.679 19.73 46 PHE CB 9.343 4.538 9.249 ILE CG2 11.932 2.268 20.29 46 PHE CG 8.457 4.829 10.405 ILE CG1 13.654 0.371 20.42 46 PHE CD1 8.694 5.933 11.19 ILE CD1 -15.088 -0.051 20.1 46 PHE CD2 7.404 3.998 10.711 ILE C 12.787 2.877 17.74 J 46 PHE CE1 7.888 6.213 12.267 ILE O 13.73 3.622 17.47 46 PHE CE2 6 592 4.272 11.787 MET N 11.505 3.259 17.57 46 PHE CZ 6.834 5.381 12.564 MET CA 11.258 4.583 17.1 46 PHE C 7.459 '4.729 7.671 MET CB 10.837 4.63 15.62 46 PHE O 6.444 5.385 7.917 Table 2 (continued)
2 MET CG 11.927 4.165 14.65 47 PRO N 7.431 3.561 7.096 2 MET SD 12.265 2.377 14.69 47 PRO CA 6.193 2.942 6.723 2 MET CE 13.56 2.436 13.42 47 PRO CD 8.552 3.049 6.329 2 MET C 10.146 5.184 17.88 47 PRO CB 6.549 1.835 5.729 MET O 9.194 4.51 1 18.28 47 PRO CG 8.073 1.67 5.857 GLU N 10.258 6.496 18.15 47 PRO C 5.44 2.469 7.923 GLU CA 9.195 7.185 18.81 47 PRO O 6.057 2.003 8.B8 GLU CB 9.675 8.269 19.79 48 HIS N 4.102 2.58 7.868 GLU CG 8.537 8.969 20.53 48 HIS CA 3.229 2.171 8.925 GLU CD 8.01 8.012 21 .59 48 HIS ND1 0.445 2.21 10.746 GLU OE1 8.85 7.397 22.3 48 HIS CG 0.7 1 .882 9.431 GLU OE2 6.762 7.881 21.7 48 HIS CB 1.761 2.503 8.566 GLU C 8.472 7.863 17.69 48 HIS NE2 -1.047 '•0.659 10.178 GLU O 9.098 8.473 16.83 48 HIS CD2 -0.218 0.934 9.103 VAL N 7.131 7.76 17.66 48 HIS CE1 -0.609 1.45 11.141 VAL CA 6.448 8.363 16.56 48 HIS C 3.331 0.692 9.128 VAL CB 5.254 7.585 16.08 48 HIS O 3 475 0.219 10.255 VAL CGI 4.236 7.483 17.23 49 ASN N 3.28 -0.075 8.026 VAL CG2 4.689 8 276 14.83 49 ASN CA 3.22 -1 .498 8.147 VAL C 5.974 9.722 16.96 49 ASN CB 2.877 -2.199 6.822 VAL O 5.316 9.889 17.98 49 ASN CG 3 967 -1 .898 5.801 THR N 6.352 10.746 16.16 49 ASN OD1 4.331 -0.748 5.55 THR CA 5.891 12.069 16.46 49 ASN ND2 4.522 -2.98 5.192 THR CB 6.549 13.148 15.63 49 ASN C 4.452 -2.074 8.778 THR OG1 6.101 14.43 16.05 49 ASN O 4.319 -2.962 9.621 THR CG2 6.236 12.941 14.15 50 PHE N 5.67 -1.621 8.419 THR C 4.408 12.122 16.24 50 PHE CA 6.813 -2.171 9.096 THR O 3.67 12.569 17.12 50 PHE CB 8.137 -1.567 8.598 ARG N 3.918 11.649 15.08 50 PHE CG 8.367 -2.037 7.201 ARG CA 2.504 1 1.662 14.85 50 PHE CD1 7.782 -1.413 6.122 ARG CB 1 .92 13.066 14.6 50 PHE CD2 9.188 -3.114 6.978 ARG CG 0.399 13.081 14.4 50 PHE CE1 8.017 -1 .858 4.839 ARG CD -0.411 12.788 15.67 50 PHE CE2 9.431 -3.569 5.703 ARG NE -1.854 12.843 15.29 50 PHE CZ 8.848 -2.938 4.625 ARG CZ -2.82 12.68 16.24 50 PHE C 6.579 -1 .746 10.541 ARG NH1 -2.468 12.495 17.54 50 PHE O 6.729 -2.606 11.453 ARG NH2 -4.142 12.714 15.89 50 PHE OXT 6.25 -0.551 
10.746 ARG C 2.241 10.835 13.63 ARG O 3.119 10.651 12.79
Table 3
1 PRO N 8.47 9.24 0.67 56 PRO N 5.63 -0.16 1.07 1 PRO CA 7.82 10.5 0.39 56 PRO CA 5.69 -1.53 0.61 PRO CD 7.65 8.77 1.82 56 PRO CD 6.83 0.24 1.77 PRO CB 7.85 1 1.3 1.7 56 PRO CB 7.08 -2.04 0.96 PRO CG 7.65 10 2.69 56 PRO CG 7.89 -0.78 1.3 PRO C 7.35 10.7 -1.02 56 PRO C 4.7 -2.35 1.37 PRO O 6.76 1 1.7 -1.44 56 PRO O 4.4 -2 2.51 LEU N 7.64 9.59 -1.67 57 PRO N 4.24 -3.4 0.8 LEU CA 7.42 9.31 -3.04 57 PRO CA 3.37 -4.27 1.53 LEU CB 8.74 9.56 -3 85 57 PRO CD 3.96 -3.43 -0.63 LEU CG 8.94 11 -4.48 57 PRO CB 2.58 -5.07 0.47 LEU CD2 8.61 12.1 -3.54 57 PRO CG 3.31 , -4.81 -0.86 LEU CD1 8.19 11 -5.8 57 PRO C 4.24 ' '-5.1 2.43 LEU C 6.79 7.92 -3.12 57 PRO O 5.45 -5.15 2.21 LEU O 5.57 7.87 -2.9 58 LEU N 3.65 •5.74 3.47 PRO N 7.42 6.77 -3.37 58 LEU CA 4.49 -6.49 4 36 PRO CA 6.63 5.59 -3.65 58 LEU CB 3.76 -7.1 5.57 PRO CD 8.8 6.63 -3.8 58 LEU CG 4.72 -7.67 6.65 PRO CB 7.59 4.59 -4.21 58 LEU CD2 4 -8.66 7.59 PRO CG 8.78 5.42 -4.71 58 LEU CD1 5.44 •6.53 7.38 PRO C 5.94 5.21 -2.38 58 LEU C 5.05 -7.62 3.56 PRO O 6 24 5 67 -1 .3 58 LEU O 4.43 -8.1 2.62 PRO N 4.94 4.45 -2.67 59 PRO N 6.24 -8 3 92 PRO CA 4.02 3.95 -1.69 59 PRO CA 6.85 -9.09 3.2 PRO CD 4.44 4.34 -4.03 59 PRO CD 7.23 -7.03 4.36 PRO CB 2.87 3.29 -2.47 59 PRO CB 8.34 -8.76 3.11 PRO CG 3.41 3.2 -3.92 59 PRO CG 8.57 -7.75 4 24 PRO C 4.72 3.05 -0.76 59 PRO C 6.58 •10.4 3.95 PRO O 5.76 2.5 -1.12 59 PRO O 6.76 ■ ■10.4 5.19 LEU N 4.15 2.88 0.45 59 PRO OXT 6.21 • ■11.4 3.29 LEU CA 4.72 2.01 1.43 LEU CB 3.87 1.96 2.69 LEU CG 3.71 3.28 3.44 LEU CD2 3.03 3.03 4.79 LEU CD1 2.95 4.33 2.6 LEU C 4.61 0.63 0.87 LEU O 3.61 ' 0.27 0.27
Figure 2. Predicting the overall accuracy of comparative models. The good and bad
models for proteins of known structure are used to tune the prediction of reliability of a model when the actual structure is not known (Fig. 3). See Materials and Methods for
details. (A) A rule for assigning a comparative model into either the 'good' or 'bad' class,
based on its Q_SCORE. The inset shows the distributions of Q_SCORE for the good and
bad models with 100 to 150 residues. Such distributions are used with the Bayes theorem to
calculate the posterior probability that a model is good, given that it has a certain Q_SCORE
value, p(GOOD/Q_SCORE). The main plot shows the percentages of false positives (bad
models classified as good) and false negatives (good models classified as bad) as a function
of sequence length. The curves were obtained by the jack-knife procedure. (B) A rule for estimating the accuracy of a reliable model (as predicted by its Q_SCORE), based on the
percentage sequence identity to the template. The overlaps of an experimentally determined
protein structure with its model (continuous line) and with a template on which the model
was based (dashed line) are shown as a function of the target-template sequence identity.
This identity was calculated from the modeling alignment. The structure overlap is defined
as the fraction of the equivalent Cα atoms. For comparison of the model with the actual structure (filled circles), two Cα atoms were considered equivalent if they were within 3.5 Å of
each other and belonged to the same residue. For comparison of the template structure with
the actual target structure (open circles), two Cα atoms were considered equivalent if
they were within 3.5 Å after alignment and rigid-body superposition by the align3d command in Modeller [15]. The points correspond to the median values and the error
bars in the positive and negative directions correspond to the average positive and negative differences from the median, respectively. Points labeled α, β, and γ correspond to the models
reported in Fig. 1C of Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998). The empty
circle at 25% sequence identity corresponds to an unusually accurate model (illustrated in
Fig. 3B of Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998)).
Figure 3: Protein structure models for yeast ORFs. (A) Distribution of the sequence identity
between the models and the corresponding templates as a function of model sequence length. The 3992 reliable models for substantial segments of 1071 different ORFs that are predicted
to be based on a correct template and approximately correct alignment are represented by the
upper bars for a given point, and the 4588 unreliable models that are predicted to be based
on a mostly incorrect alignment or an incorrect template are represented by the lower bars
for a given point. The last histogram at label "All/6" is the sum of the other six histograms divided by six. (B) The corresponding distribution of the alignment significance score calculated by the program Align [13].
Figure 4: Modeling a putative interaction of a predicted YDL117W SH3 domain with a
proline rich peptide. A segment in the yeast ORF YDL117W sequence (top panel) was
predicted to be remotely related to the SH3 domains, many of which have known 3D
structure (Tab. 1). The automated prediction was possible because of the sensitivity
afforded by evaluating a 3D model implied by the match. The 3D model of the SH3 domain in turn made it possible to address the biochemical function of YDL117W by calculating a 3D model of a complex between the predicted SH3 domain and a putative ligand, a
proline-rich peptide (middle panel). The ligand in the SH3 model is in fact a proline-rich
segment that occurs downstream in the same ORF. This peptide, PLPPLPPLP (positions
212-220), contains the signature PXXP sequence typical of the SH3 binding peptides [39].
The coordinates of the protein and the ligand in the complex are shown in Tables B-2 and B-
3, respectively. The model of the complex was obtained by the same comparative method as
the model of the SH3 domain [15], relying on the crystallographic structure of the complex
between the FYN SH3 domain and its peptide ligand (PPAYPPPPVP) [40]. Both inter- and
intra-molecular interactions between SH3 domains and Pro-rich peptides have already been documented [39]. The SH3 residues making hydrophobic contacts and hydrogen bonds to the
ligand peptide. The bottom panel shows a schematic representation of the SH3-peptide
interaction [42]. This model should facilitate designing experiments such as site-directed
mutagenesis for mapping of functionally important residues on the SH3 domain as well as its ligand. This should be compared to the starting point where no functional information about
this ORF or about the proteins previously related to it was known. More generally, the
wealth of information in the bottom panel of this Figure and Table 2 (protein coordinates)
and Table 3 (ligand coordinates) relative to the top, sequence-only panel of this figure is a
case in point for the utility of structural models in planning biological experiments. For the many proteins whose structures have not been determined by experiment, maximal structural information is obtained by both (i) establishing a match to a known
protein structure and (ii) calculating an all-atom 3D model based on that match, using the
methods described in this paper.
Example 2
Variable gap penalty function for protein sequence-structure alignment
(Abbreviations: 3D, three-dimensional; PDB, Protein DataBank)
Overview
One of the main sources of error in comparative protein structure modeling is misalignment
of the target sequence with template structures. We describe a dynamic programming
procedure for aligning a sequence with a structure that mitigates this problem. The procedure
uses a variable gap penalty function that depends on the structural context of an insertion or a
deletion. It avoids insertions and deletions within helices or sheets, buried regions, straight segments, and also between two residues that are distant in space. Several examples of the improved alignments are shown. The variable gap penalty function may also be useful in
other applications where sequence-structure comparisons are needed, such as in template
matching and threading.
Introduction
In a few years, the genome projects will have provided us with amino acid sequences of
approximately 500,000 proteins. The full potential of the genome projects will only
be realized once we can assign, understand, and manipulate the function of these new
proteins. Such control of protein function generally requires the knowledge of protein
three-dimensional structure. Unfortunately, experimental methods for protein structure determination, such as X-ray crystallography and NMR spectroscopy, are time consuming
and not successful with all proteins; consequently, three-dimensional structures have
been determined for only a tiny fraction of proteins for which the amino acid sequence is
known. Fortunately, in the absence of a high-resolution protein structure determined by
X-ray crystallography or NMR spectroscopy, a useful three-dimensional (3D) model of a given sequence can often be calculated by comparative modeling (Blundell et al., 1987;
Greer, 1990; Johnson et al., 1994; Bajorath et al., 1994; Holm et al., 1994; Sali, 1995;
Sanchez & Sali, 1997).
Comparative or homology protein modeling uses experimentally determined protein
structures
(templates) to predict the conformation of another protein with a similar amino acid
sequence (target). This is possible because a small change in the sequence usually results in a small change in the 3D structure (Lesk & Chothia, 1986). All comparative modeling
methods begin with an alignment between the target and templates; the main difference
between the different modeling methods is in how the 3D model is calculated from a given
alignment. Once an appropriate template structure is found, the usefulness of comparative
protein models has been limited by the errors in sidechain packing, distortions in correctly and incorrectly aligned regions, distortions in unaligned regions, and, most importantly, by
the difficulty of sequence alignment when sequence identity between the target and
templates is less than about 35% (Sali et al., 1995). If the alignment is incorrect, the atoms
will be positioned incorrectly by all the current comparative modeling methods. Even a shift
in the alignment by only one residue will produce an rms error in the backbone atoms on the order of 4 Å. When the sequence identity between the sequence and structure is about 30%, the best methods for aligning sequences with structures align incorrectly about 20% of
residues, as judged by structure-structure alignments (Johnson & Overington, 1993). This is
a major problem because about one half of all pairs of related proteins are related by less than 30% sequence identity.
In principle, the best approach for aligning a sequence with a structure would be to score all
possible alignments by scoring the best full model of the sequence as implied by the
corresponding alignment (Sali et al., 1995). However, this computationally difficult problem has not yet been solved in practice. So far, the existing approaches for improving sequence-structure alignments relative to the dynamic programming solution from a
sequence-sequence alignment (Needleman & Wunsch, 1970) all rely on the availability of
structural information for one of the proteins in the alignment. Generally, the objective
function that is optimized to get the best alignment consists of two parts: that corresponding to the aligned regions and that corresponding to the insertions and deletions (i.e., gaps). To
improve the scoring of equivalent regions, relative to the scoring by a single 20 by 20 amino
acid substitution table of the Dayhoff type (Dayhoff et al., 1978), environment dependent
amino acid substitution tables have been used in template matching (Overington et al., 1990;
Bowie et al., 1991; Overington et al., 1992), and single body and pairwise statistical potentials have been used in threading (Jones et al., 1992; Godzik et al., 1992). Another
group of improvements concentrates on the gap penalty rather than on the scores for the
aligned regions. Usually, a linear gap penalty function is used, g = u + v·l, that depends on
the gap initiation and extension parameters, u and v, and on the number of residues in the
gap, l. The optimal values for parameters u and v in sequence-sequence alignments have been explored exhaustively (Barton & Sternberg, 1987; Johnson & Overington, 1993;
Vogt et al., 1995). Several variations on the simple linear gap penalty have also been
proposed. For example, it has been shown from analysis of gap length distribution in
reference alignments that the gap penalty should be a logarithmic function of gap length (Benner et al., 1993); this appears to complicate the implementation in the dynamic programming algorithms (Gotoh, 1982) and the linear gap penalty is still widely used.
Another improvement involves making gap penalty dependent on the (predicted) local
secondary structure (Lesk et al., 1986) and on the local variability of the already aligned
sequences (Taylor, 1986).
One important type of structural information has not yet been used in sequence-structure
alignments: It has been observed anecdotally that the insertions relative to a given structure tend to happen between residues that are close in space. Similarly, regions that span two residues
close in space tend to be deleted more frequently. In this communication, we propose a
dynamic programming algorithm with a variable gap penalty function that attempts to mimic
this observation. In addition, we also incorporate other structural information to facilitate insertions or deletions of residues that are exposed, occur in bends, and are not within secondary structure segments. The new algorithm is almost as fast as the dynamic
programming version with the linear gap penalty. Several sample alignments obtained by the
linear and variable gap penalty functions are compared to show advantages of the variable
gap penalty function in comparative modeling. In addition to comparative modeling, the new algorithm can also be applied for 3D template matching and threading.
2 Variable gap penalty for sequence - structure alignment
We describe a dynamic programming algorithm of the Needleman & Wunsch type
(Needleman & Wunsch, 1970) to obtain an optimal alignment of one or more
pre-aligned protein sequences (i.e., sequence block) with one or more pre-aligned protein structures and sequences (i.e., structure block). The distinguishing feature of the
algorithm is its variable gap penalty function. The algorithm is implemented in the
ALIGN2D command of the computer program Modeller. (This procedure is implemented in
the computer program Modeller, which is freely available to academic researchers via the World
Wide Web at URL http://guitar.rockefeller.edu. Graphical interfaces to Modeller are provided by Quanta, InsightII, and GeneExplorer (MSI, San Diego, CA; e-mail dje@msi.com). For a detailed discussion of this and related problems see Sankoff & Kruskal, 1983.)
The method as described here is for the global alignment only (Needleman & Wunsch, 1970), although the ALIGN2D command implements calculation of locally optimal
alignments as well (Smith & Waterman, 1981; Sali et al., 1998). The algorithm can be used either with similarity or distance residue-residue scores (Sali et al., 1998) and its extension to
environment dependent substitution matrices is straightforward.
The problem of the optimal alignment of two sequences as addressed by the algorithm of
Needleman & Wunsch is as follows. We are given two sequences of residues and an M
times N scoring matrix W where M and N are the numbers of residues in the first and
second sequence. The scoring matrix is composed of scores $W_{i,j}$ describing the differences between residues i and j from the first and second sequence, respectively. The goal is to obtain an optimal set of equivalences that match residues of the first sequence to the residues
of the second sequence. The equivalence assignments are subject to the following
"progression rule": for residues i and k from the first sequence and residues j and l
from the second sequence, if residue i is equivalenced to residue j, if residue k is equivalenced to residue l and if k is greater than i, then l must also be greater than j. The
optimal set of equivalences is the one with the smallest alignment score. The
alignment score is a sum of scores corresponding to matched residues, also increased
for occurrences of non-equivalenced residues (i.e., gaps).
The recursive dynamic programming formulae for the global alignment of the structure block with the sequence block are:
$$D_{i,j} = \min_{\substack{\max(0,\,i-L) \le i' < i \\ \max(0,\,j-L) \le j' < j}} \left( D_{i',j'} + G_{i,j,i',j'} + W_{i,j} \right) \qquad (1)$$

$$D_{0,j} = G_{\ldots}, \qquad W_{M+1,j} = 0$$
where M and N are the lengths of the structure and sequence blocks, respectively, L indicates the maximal allowed gap length, G is the variable gap penalty function, and W is the residue-residue substitution score for positions i and j from the structure and sequence blocks, respectively. The 20 by 20 residue-residue weight matrix asl.mat, whose values lie between 0 and 1000 (Sali et al., 1995), is used to obtain W. D is calculated for i ≤ M + 1 and j ≤ N + 1. The minimal score for the global alignment of the two blocks, d, corresponds to the smallest element in $D_{M+1,\,0<j\le N+1}$ and $D_{0<i\le M+1,\,N+1}$. The residue-residue equivalence assignments (i.e., the alignment) are obtained by backtracking in matrix D, starting from the element d.
Function G is the variable gap penalty function for simultaneous insertions from positions i' to i in the structure block and from positions j' to j in the sequence block. That is, for the situation where positions i' and j' are aligned with each other and positions i and j are aligned with each other, while the intervening positions in either or both blocks are not aligned with each other. If i' = i - 1 (j' = j - 1), there is no insertion in the structure (sequence) block. This formulation of the alignment by global dynamic programming allows for the gap penalty function of any form.
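The recursion above can be sketched in Python as follows. This is a minimal illustration, not the ALIGN2D implementation: the function names, the toy mismatch scores, and the treatment of the ends (a single virtual start pair (0, 0) and a single virtual end pair (M+1, N+1), both with zero substitution score and both subject to the same limit L) are assumptions made for the example. Scores are distances, so the alignment score is minimized, and the gap function is passed in as an arbitrary callable.

```python
import math


def align_with_gap_function(W, gap, L):
    """Global alignment that minimizes substitution scores plus gap penalties.

    W[i-1][j-1] is the distance score for positions i and j (1-based).
    gap(i, j, ip, jp) is the penalty for aligning (ip, jp) and then (i, j)
    while leaving every position in between unaligned.
    L limits how far back the previous aligned pair may lie.
    """
    M, N = len(W), len(W[0])
    D = [[math.inf] * (N + 2) for _ in range(M + 2)]
    prev = {}
    D[0][0] = 0.0  # virtual start pair (assumption of this sketch)
    for i in range(1, M + 2):
        for j in range(1, N + 2):
            # Index M+1 / N+1 is used only by the virtual end pair.
            if (i == M + 1) != (j == N + 1):
                continue
            w = 0.0 if i == M + 1 else W[i - 1][j - 1]
            for ip in range(max(0, i - L), i):
                for jp in range(max(0, j - L), j):
                    # Index 0 is used only by the virtual start pair.
                    if (ip == 0) != (jp == 0):
                        continue
                    cand = D[ip][jp] + gap(i, j, ip, jp) + w
                    if cand < D[i][j]:
                        D[i][j] = cand
                        prev[(i, j)] = (ip, jp)
    if math.isinf(D[M + 1][N + 1]):
        raise ValueError("no alignment satisfies the gap length limit L")
    # Backtrack from the virtual end pair to recover the equivalences.
    pairs = []
    node = prev[(M + 1, N + 1)]
    while node != (0, 0):
        pairs.append(node)
        node = prev[node]
    pairs.reverse()
    return D[M + 1][N + 1], pairs


def linear_gap(u, v):
    """Linear penalty u + v * (number of skipped positions); 0 if none."""
    def gap(i, j, ip, jp):
        skipped = (i - ip - 1) + (j - jp - 1)
        return 0.0 if skipped == 0 else u + v * skipped
    return gap


def mismatch_scores(a, b, mismatch=10.0):
    """Toy distance scores: 0 for identical residues, `mismatch` otherwise."""
    return [[0.0 if x == y else mismatch for y in b] for x in a]


score, pairs = align_with_gap_function(
    mismatch_scores("ACDE", "ADE"), linear_gap(2.0, 1.0), L=3)
print(score, pairs)  # 3.0 [(1, 1), (3, 2), (4, 3)]
```

Because the gap penalty is an arbitrary callable, the same loop accepts a structure-dependent penalty without modification; the cost is the extra double loop over the previous aligned pair, which the limit L keeps bounded.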
The main difference in the recursion from the linear gap penalty case is that a slightly slower procedure for finding the optimal gap lengths must be used for gap openings in the block of prealigned sequences vj' because of the penalty dependence on the distance between the two spanning Cα positions in the block of structures T. The CPU time is saved by limiting the minimization over l and l' to values of l and l' that are smaller than L; this is equivalent to limiting the maximal length of a gap to L positions. In practice, the new algorithm is only slightly slower than the O(M × N) variant of the original dynamic programming algorithm with the linear gap penalty function (Gotoh, 1982). The variable gap penalty function is defined as
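$$G_{i,j,i',j'} = \begin{cases} 0, & l = 0 \text{ and } l' = 0 \\ R(i,i')\,u + (l + l')\,v, & l > 0 \text{ or } l' > 0 \end{cases}$$

$$l = \begin{cases} i - i' - 1, & 0 < i' \text{ and } i \le M \\ \max(0,\; i - i' - 1 - e), & i' = 0 \text{ or } i = M + 1 \end{cases} \qquad (2)$$

$$l' = \begin{cases} j - j' - 1, & 0 < j' \text{ and } j \le N \\ \max(0,\; j - j' - 1 - e), & j' = 0 \text{ or } j = N + 1 \end{cases}$$

$$R(i,i') = 1 + \left[ \omega_H H_i + \omega_S S_i + \omega_B B_i + \omega_C C_i + \omega_D \max(0,\; d - d_0)^{\gamma} \right]$$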
where l and l' are the lengths of insertions in the structure and sequence blocks, respectively,
v is the gap extension penalty, u is the gap opening penalty, e is the maximal number of
residues at sequence termini which are not penalized with a gap-penalty if not equivalenced
(i.e., overhangs), and R is the function that modulates the gap penalty function depending on
the structure block at the position of the insertion. R is at least 1, but can be larger
to make gap opening more difficult in the following circumstances: within helices or
strands, at buried positions, or at straight main chain positions. $H_i$ is 1 if all the structures in the structure block have all positions from i' to i occupied by helical residues. $S_i$ is 1 if all templates have all positions from i' to i occupied by β-strand residues. $B_i$ is the average
buriedness of residues from position i' to i in the structure block, $C_i$ is the average backbone
straightness of residues in the structure block from position i' to i, d is the distance between
the Cα atoms at positions i' and i, averaged over all structures in the structure block, $d_0$ is the
distance that is small enough to correspond to no increase in the opening gap penalty, and γ
is a constant to be determined empirically. The values of all four features, H, S, B,
and C, lie between 0 and 1. See the next Section for exact definitions of secondary
structure, backbone straightness, and residue buriedness. Reasonable values for all
parameters ($u$, $v$, $\omega_i$, $d_0$, $\gamma$) were obtained by a trial-and-error procedure (Results). The variable gap penalty function is reduced to the special case of the linear gap penalty
function when all weights $\omega_i$ are set to 0.
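As a rough illustration of how the penalty is assembled, the sketch below evaluates the modulating function R and the resulting gap penalty for a loop-like and a helix-like insertion site. The function and argument names are hypothetical, the numerical weights are illustrative only, and the sign convention (positive penalties added to a minimized score) is an assumption of the example. Insertion lengths are counted as the number of unaligned positions between two aligned positions, so that adjacent aligned positions give a length of 0.

```python
def insertion_length(pos, prev_pos, block_length, e):
    """Number of unaligned positions between prev_pos and pos.

    At the termini (prev_pos == 0 or pos == block_length + 1) up to e
    overhanging residues are not counted.
    """
    raw = pos - prev_pos - 1
    if prev_pos == 0 or pos == block_length + 1:
        return max(0, raw - e)
    return raw


def modulation(H, S, B, C, d, weights, d0, gamma):
    """R = 1 + weighted sum of the structural features at the gap site."""
    w_H, w_S, w_B, w_C, w_D = weights
    return 1.0 + (w_H * H + w_S * S + w_B * B + w_C * C
                  + w_D * max(0.0, d - d0) ** gamma)


def variable_gap_penalty(l, l_prime, R, u, v):
    """G = 0 without insertions, otherwise R * u + (l + l') * v."""
    if l == 0 and l_prime == 0:
        return 0.0
    return R * u + (l + l_prime) * v


# Illustrative parameter values (assumptions, not recommended settings).
weights = (0.35, 1.2, 0.9, 1.2, 0.6)   # w_H, w_S, w_B, w_C, w_D
d0, gamma, u, v = 7.6, 1.1, 450.0, 0.0

# An exposed, bent loop site versus a buried, straight helical site.
R_loop = modulation(H=0, S=0, B=0.2, C=0.3, d=6.0,
                    weights=weights, d0=d0, gamma=gamma)
R_helix = modulation(H=1, S=0, B=0.8, C=1.0, d=6.0,
                     weights=weights, d0=d0, gamma=gamma)
print(variable_gap_penalty(1, 0, R_loop, u, v))
print(variable_gap_penalty(1, 0, R_helix, u, v))
```

With these values the helical site has a larger R than the loop site, so opening a gap there costs more. Setting every weight to zero gives R = 1 and recovers the linear penalty u + (l + l')v.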
2.1 Secondary structure assignments
The algorithm for assignment of α-helices and β-strands depends on the Cα positions only. It is based on the idea of matching distance matrices of short segments of residues with
"library" distance matrices corresponding to the individual secondary structure types (Richards & Kundrot, 1988). The main difference between (Richards & Kundrot, 1988) and
the current implementation is that β-strands are assigned only when there are at least two
spatially neighboring β-strands that can form a β-sheet. For each secondary structure type,
the library Cα distance matrix was calculated by averaging distance matrices for a sample of the corresponding secondary structure segments, which was obtained by running program
Dssp (Kabsch & Sander, 1983) on 10 high-quality unrelated protein structures.
Distances that lay more than 2 standard deviations away from the mean of all distances were
omitted from the final averages. The secondary structure defining distance matrices and
parameters are shown in Table 4. The algorithm assigns helices first and strands second.
For each secondary structure type, it proceeds as follows:
1. Define the degree of the current secondary structure fit for each Cα atom. Use two
criteria: distance RMS (i.e., DRMS) (Levitt, 1983) and maximal distance difference (MDD). The DRMS and MDD are both obtained by comparing the library distance matrix
with the distance matrix for a segment starting at the given Cα position. Assign the current
secondary structure type to all Cα's in all segments whose DRMS and MDD are less
than cutoffs c1 and c2, respectively, and are not already assigned to "earlier" secondary
structure types.
2. Split kinked contiguous segments of the same type into separate segments: Kinking
residues are all residues in segments with both DRMS and MDD beyond cutoffs c3 and c4 , respectively. The actual single kink residue separating the two new segments of the same
type is the central kinking residue.
3. If the current secondary structure type is β-strand: Eliminate those runs of strand residues that are not closer than c5 Å to other strand residues, separated by at least two residues in
sequence.
4. Eliminate those segments that are shorter than the cutoff length (c6 ).
5. Remove the isolated kinking residues (those that occur on their own or begin or end a
segment).
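Step 1 of the procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: DRMS is taken as the root-mean-square of the differences between corresponding inter-Cα distances, MDD as the largest absolute difference, and the function names and the toy three-residue "library" are hypothetical. Steps 2 to 5 (kinks, strand pairing, minimal length, isolated kinks) and the bookkeeping for residues already assigned to an earlier type are not covered.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between the given 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def drms_and_mdd(segment, library):
    """Compare two distance matrices over the upper triangle."""
    n = len(library)
    diffs = [abs(segment[a][b] - library[a][b])
             for a in range(n) for b in range(a + 1, n)]
    drms = math.sqrt(sum(x * x for x in diffs) / len(diffs))
    return drms, max(diffs)


def matching_positions(ca_coords, library, c1, c2):
    """Indices of all Ca atoms in windows with DRMS < c1 and MDD < c2."""
    n = len(library)
    assigned = set()
    for start in range(len(ca_coords) - n + 1):
        window = distance_matrix(ca_coords[start:start + n])
        drms, mdd = drms_and_mdd(window, library)
        if drms < c1 and mdd < c2:
            assigned.update(range(start, start + n))
    return sorted(assigned)


# Toy library: three collinear Ca atoms spaced 3.8 apart (an assumption
# for the example, not a real secondary structure template).
library = distance_matrix([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)])

# Toy chain: four collinear atoms followed by two right-angle turns.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
         (11.4, 3.8, 0.0), (15.2, 3.8, 0.0)]
print(matching_positions(chain, library, c1=0.5, c2=0.9))  # [0, 1, 2, 3]
```

Using both criteria together means that a window is rejected either when it deviates moderately everywhere (DRMS) or when a single distance is far off (MDD).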
2.2 Backbone straightness
Local mainchain curvature at residue i is defined as the angle α (0° < α < 180°) between the
least-squares lines through Cα atoms i - 3 to i, and from i + 3 to i. Straightness is defined as
1 for all residues within helices and strands, and as 1 - min(180°, max(0°, α))/180° otherwise.
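The straightness measure can be sketched as below. Several choices here are assumptions rather than details given in the text: the least-squares line through a set of points is taken as their principal axis (found by power iteration), both lines are oriented along the chain direction so that a perfectly straight chain gives α = 0°, straightness is read as 1 - min(180°, max(0°, α))/180°, and positions closer than three residues to either terminus are simply rejected. All function names are hypothetical.

```python
import math


def _unit(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def line_direction(points, iterations=50):
    """Direction of the least-squares line, oriented from first to last point."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - centroid[k] for k in range(3)] for p in points]
    scatter = [[sum(c[a] * c[b] for c in centered) for b in range(3)]
               for a in range(3)]
    chord = [points[-1][k] - points[0][k] for k in range(3)]
    v = _unit(chord)
    # Power iteration converges to the dominant eigenvector of the scatter
    # matrix, which is the direction of the least-squares line.
    for _ in range(iterations):
        v = _unit([sum(scatter[a][b] * v[b] for b in range(3))
                   for a in range(3)])
    if sum(v[k] * chord[k] for k in range(3)) < 0:
        v = [-x for x in v]
    return v


def curvature_angle(ca, i):
    """Angle in degrees between the lines through Ca i-3..i and i..i+3."""
    if i < 3 or i + 3 >= len(ca):
        raise ValueError("need three residues on each side of position i")
    before = line_direction(ca[i - 3:i + 1])
    after = line_direction(ca[i:i + 4])
    cos_a = sum(x * y for x, y in zip(before, after))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def straightness(ca, i, in_helix_or_strand=False):
    """1 inside helices and strands, otherwise 1 - alpha / 180."""
    if in_helix_or_strand:
        return 1.0
    alpha = curvature_angle(ca, i)
    return 1.0 - min(180.0, max(0.0, alpha)) / 180.0


straight = [(3.8 * k, 0.0, 0.0) for k in range(7)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (3.0, 1.0, 0.0), (3.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
print(straightness(straight, 3))  # close to 1.0
print(straightness(bent, 3))      # close to 0.5 for a right-angle bend
```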
2.3 Residue buriedness
The residue buriedness is defined as 1 - a, where a is the fractional side-chain solvent
accessibility ranging from 0 to 1 (Sali & Overington, 1994).
3 Reference alignments for optimization of parameters
Reference alignments were obtained by least-squares superposition of two proteins of known
structure (Abola et al., 1987). All the proteins were at least 30 residues long and were
determined by either X-ray crystallography at resolution 3 Å or by NMR. All the pairs had
proteins with the sequence identity in the range from 30% to 45%; none of the pairs had
both proteins with more than 50% sequence identity to any other pair. MODELLER's
ALIGN3D with the cutoff of 5 Å was used for least-squares superposition. Bad
superpositions were removed.
References
Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T., & Weng, J. (1987).
Protein data bank. In: Crystallographic databases - Information, content, software
systems, scientific applications, (Allen, F. H., Bergerhoff, G., & Sievers, R., eds) pp.
107-132. Data Commission of the International Union of Crystallography
Bonn/Cambridge/Chester.
Bajorath, J., Stenkamp, R., & Aruffo, A. (1994). Knowledge-based model building of
proteins: Concepts and examples. Protein Sci. 2, 1798-1810.
Barton, G. J. & Sternberg, M. J. (1987). Evaluation and improvements in the automatic alignment of proteins sequences. Protein Eng. 1, 89-94.
Benner, S. A., Gonnet, G. H., & Cohen, M. A. (1993). Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229, 1065-1082.
Blundell, T. L., Sibanda, B. L., Sternberg, M. J. E., & Thornton, J. M. (1987).
Knowledge-based prediction of protein structures and the design of novel molecules.
Nature, 326, 347-352.
Bowie, J. U., Lüthy, R., & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164-170.
Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). In: Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, (Dayhoff, M. O., ed) pp. 345-352. National
Biomedical Research Foundation Washington D.C.
Godzik, A., Kolinski, A., & Skolnick, J. (1992). Topology fingerprint approach to the
inverse protein folding problem. J. Mol. Biol. 227, 227-238.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.
Greer, J. (1990). Comparative modelling methods: application to the family of the
mammalian serine proteases. Proteins, 7, 317-334.
Holm, L., Rost, B., Sander, C., Schneider, R., & Vriend, G. (1994). Data based modeling of proteins. In: Statistical mechanics, Protein Structure, and Protein Substrate Interactions,
(Doniach, S., ed) pp. 277-296. Plenum Press New York.
Johnson, M. S. & Overington, J. P. (1993). A structural basis for sequence
comparisons: An evaluation of scoring methodologies. J. Mol. Biol. 233, 716-738.
Johnson, M. S., Srinivasan, N., Sowdhamini, R., & Blundell, T. L. (1994). Knowledge-based
protein modelling. CRC Crit. Rev. Biochem. Mol. Biol. 29, 1-68.
Jones, D. T., Taylor, W. R., & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86-89.
Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition
of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
Koretke, K. K., Luthey-Schulten, Z., & Wolynes, P. G. (1996). Self-consistently optimized
statistical mechanical energy functions for sequence structure alignment. Prot. Sci. 5,
1043-1059.
Lesk, A. M. & Chothia, C. H. (1986). The response of protein structures to amino-acid sequence changes. Philos. Trans. R. Soc. London Ser. B, 317, 345-356.
Lesk, A. M., Levitt, M., & Chothia, C. (1986). Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng. 1, 77-78.
Levitt, M. (1983). Molecular dynamics of native protein. II. Analysis and nature of
motion. J. Mol. Biol. 168, 621-657.
Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
Overington, J., Donnelly, D., Johnson, M. S., Sali, A., & Blundell, T. L. (1992).
Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Sci. 1, 216-226.
Overington, J., Johnson, M. S., Sali, A., & Blundell, T. L. (1990). Tertiary structural
constraints on protein evolutionary diversity; templates, key residues and structure
prediction. Proc. Roy. Soc. Lond. B 241, 132-145.
Richards, F. M. & Kundrot, C. E. (1988). Identification of structural motifs from protein
coordinate data: Secondary structure and first level super-secondary structure. Proteins, 3,
71-84.
Sali, A. (1995). Modelling mutations and homologous proteins. Curr. Opin. Biotech. 6, 437-451.
Sali, A., Badretdinov, A., Sanchez, R., & Feyfant, E. (1998). Modeller, A Protein Structure Modeling Program, Release 5. URL http://guitar.rockefeller.edu/.
Sali, A. & Overington, J. (1994). Derivation of rules for comparative protein
modeling from a database of protein structure alignments. Protein Sci. 3, 1582-1596.
Sali, A., Potterton, L., Yuan, F., van Vlijmen, H., & Karplus, M. (1995). Evaluation of comparative protein structure modeling by MODELLER. Proteins, 23, 318-326.
Sanchez, R. & Sali, A. (1997). Advances in comparative protein-structure modeling. Curr.
Opin. Struct. Biol. 7, 206-214.
Sankoff, D. & Kruskal, J. B. (1983). Time warps, string edits, and macromolecules:
The theory and practice of sequence comparison. Reading, MA: Addison-Wesley
Publishing Company.
Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences.
J. Mol. Biol. 147, 195-197.
Taylor, W. R. (1986). Identification of protein sequence homology by consensus template
alignment. J. Mol. Biol. 188, 233-258.
Vogt, G., Etzold, T., & Argos, P. (1995). An assessment of amino acid exchange matrices in
aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 249, 816-831.
Tables and Figures
Table 4: Parameters and distance matrices for defining α-helices and β-sheets. The distances for the α-helix are shown in the upper right triangle of the matrix, and the distances for the β-strand are shown in the lower left triangle. See text for description of cutoffs c_i.
Table 4
α-helix: c1 = 0.50, c2 = 0.90, c3 = 0.80, c4 = 1.50, c5 = 0.00, c6 = 6
β-sheet: c1 = 0.80, c2 = 2.00, c3 = 1.30, c4 = 1.60, c5 = 6.10, c6 = 5
     -    3.802    5.518    5.541    6.683
 3.804        -    3.802    5.519    5.556
 6.601    3.802        -    3.801    5.530
 9.678    6.607    3.803        -    3.801
12.679    9.665    6.558    3.803        -
Table 5. Optimization of parameters for the gap penalty function
Table 5
RUN   u      v      α/tf   ωg     ωs     ωc     wo     d0     7      SCORE
1     -900   -50    0      0      0      0      0      0      0      3780
3     -900   -50    0.50   0.75   0      0      0      0      0      3357
6     -900   -50    0      0      0.90   0      0      0      0      3418
7     -900   -50    0      0      0      1.20   0      0      0      3520
8     -900   -50    0      0      0      0      Q      0      0      3731
9     -900   -50    0      0      0      0      0.6    J      0.50   3717
11    -450   0      0.50   0.75   0.90   1.20   0.6    7.0    0.5    2954
12    -450   0      0.40   0.75   0.90   1.20   0.6    7.0    0.5    2944
13    -450   0      0.50   1.30   0.90   1.20   0.6    7.0    0.5    2926
14    -450   0      0.50   0.75   0.90   1.20   0.6    7.0    0.5    2953
15    -450   0      0.50   0.75   0.90   1.20   0.6    7.0    0.5    2953
16    -450   0      0.50   0.75   0.90   1.20   0.9    7.0    0.5    2924
17    -450   0      0.50   0.75   0.90   1.20   0.6    7JS    1.10   2926
18    -450   0      0.40   1.30   0.90   1.20   0.9    7.6    1.1    2900
19    -450   0      0.35   1.30   0.90   1.20   0.9    7.6    1.1    2900
20    -450   0      0.35   1.20   0.90   1.20   0.9    7.6    1.1    2901
21    -450   0      0.35   1.20   0.90   1.20   0.9    7.6    1.1    2902
22    -450   0      0.35   1.20   0.90   1.20   0.9    7.6    1.1    2901
23    -450   0      0.35   1.20   0.90   1.20   0.6    7.6    1.1    2879
24    -450   0      0.35   1.20   0.90   1.20   0.6    8 5    1.2    2872
25    -450   0      0.35   1.20   0.90   1.20   0.6    8.6    1.2    2876
26    -300   0      0.35   1.20   0.90   1.20   0.6    8.6    1.2    3101
27    -300   0      0.35   1.20   0.90   1.20   0.6    8.6    1.2    2849
28    -300   0      0.35   1.20   0.90   1.20   0.6    8.6    1.2    3074
29    -950   :80    0.35   1.20   0.90   1.20   0.6    8.6    1.2    4120
Underlined: parameters that were optimized in the current run.
Example 3
Role of PSI BLAST
To find template structures for modeling the target sequences, the program PSI-BLAST was used (Altschul et al., 1997). PSI-BLAST iteratively collects sets of intermediate sequences to find homologs. The main steps of the procedure are: (i) for a given sequence, an initial set of homologs is collected from the sequence database using a conventional scoring matrix (for example, we use BLOSUM62); (ii) a weighted multiple alignment is made from the query sequence and the homologs whose match scores are better than a specified E-value cutoff (for example, we use 0.0005); (iii) a position-specific scoring matrix is constructed from this alignment; (iv) this matrix is then used to search the database for new homologs; (v) new homologs with a good match score are used to construct a new position-specific scoring matrix, which is then used in a further search for homologs; and (vi) rounds of matrix reconstruction and new searches are iterated until no new homologs are found or until the number of iterations reaches a specified limit (for example, we use 20). The parameters used for the search have been shown to be optimal for the application of PSI-BLAST as a fold-recognition method (Park et al., 1998). After the PSI-BLAST search converges or reaches the maximal number of iterations, the position-specific scoring matrix is stored. A new search can then be done using Gapped BLAST with the query and the stored scoring matrix against a representative set of PDB chains. The representative protein chains have at most 95% sequence identity to each other, or have a length difference of at least 30 residues or 30%; they are also the highest quality structures within each group (highest resolution).
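The iteration in steps (i) to (vi) can be sketched as follows. This is a minimal illustration in Python, not the PSI-BLAST implementation: the `search` callable, the toy link table, and all identifiers are hypothetical stand-ins for a real profile search, and only the E-value cutoff (0.0005) and the iteration limit (20) are taken from the text.

```python
from typing import Callable, Dict, Set


def iterative_profile_search(
    query: str,
    search: Callable[[str, Set[str]], Dict[str, float]],
    e_value_cutoff: float = 0.0005,
    max_iterations: int = 20,
) -> Set[str]:
    """Collect homologs until no new ones appear or the iteration limit is hit.

    `search(query, members)` is a hypothetical stand-in for one database
    search with a profile built from `members`; it returns {hit_id: e_value}.
    """
    homologs: Set[str] = set()
    for _ in range(max_iterations):
        hits = search(query, homologs)
        accepted = {hit for hit, e_value in hits.items() if e_value < e_value_cutoff}
        new = accepted - homologs
        if not new:  # converged: this round found no new homologs
            break
        homologs |= new  # the next profile is built from the enlarged set
    return homologs


# Toy data (invented): each sequence "reaches" further hits once it is in the profile.
TOY_LINKS = {
    "query": {"a": 1e-10, "b": 1e-5},
    "a": {"c": 1e-6},
    "c": {"d": 0.5},  # too weak to pass the cutoff
}


def toy_search(query: str, members: Set[str]) -> Dict[str, float]:
    hits: Dict[str, float] = {}
    for source in {query} | members:
        for hit, e_value in TOY_LINKS.get(source, {}).items():
            hits[hit] = min(e_value, hits.get(hit, float("inf")))
    return hits


found = iterative_profile_search("query", toy_search)
```

In this toy run the search converges after three rounds; with `max_iterations=1` it would stop after the first round regardless of convergence.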
Since the scoring matrix that is generated depends on the query sequence, this type of search is not symmetric. For this reason, position-specific scoring matrices are also calculated for the sequences of the representative PDB structures, and those are used to search against the target sequences using the same parameters described above.
Usually a match with an E-value of 10^-4 is considered significant; here, for example, matches with E-values down to 102 can be accepted with the intention of finding more remote relationships. Using such a permissive cutoff of course increases the number of false positive hits, but this is dealt with in the model evaluation step. Other typical problems of remote matches with local alignment procedures, such as short coverage of the query sequence and matches of different structures to the same region of the query sequence, are dealt with in the sequence-structure alignment and model evaluation steps.
Template selection. Structures in PDB were clustered by comparing them against each other. Structures that have a high enough structural similarity are aligned to each other and form one single template containing several structures. These mega-templates are used to construct the models. The prealignment of the structures avoids the need to realign them during the target-template alignment step. A template is selected for modeling if any of its component structures was matched against the target sequence during the template search step. The mega-templates form a depository of multiple structure alignments similar to CATH, SCOP, PICASSO, and HOMSTRAD, except that they contain alignments between relatively similar structures only.
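The selection rule stated above (a mega-template is used if any of its component structures was matched) amounts to a set intersection. The sketch below assumes a simple dictionary layout; the cluster names and chain identifiers are invented for illustration only.

```python
from typing import Dict, List, Set


def select_templates(clusters: Dict[str, List[str]], matched: Set[str]) -> List[str]:
    """Return ids of mega-templates with at least one matched member structure."""
    return [
        cluster_id
        for cluster_id, members in clusters.items()
        if matched.intersection(members)
    ]


# Hypothetical mega-templates, each grouping prealigned, structurally similar chains.
clusters = {
    "T1": ["1abcA", "2xyzB"],
    "T2": ["3defA"],
    "T3": ["4ghiA", "5jklC"],
}
# Hypothetical chains hit by the template search for one target sequence.
selected = select_templates(clusters, {"2xyzB", "5jklC"})
```

Because the member structures of a selected mega-template are already aligned to each other, only the target needs to be aligned to the group in the next step.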
References for this example
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. Z., Miller, W., & Lipman, D. J.
(1997). Nucl. Acids Res. 25, 3389-3402.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., & Chothia, C.
(1998). J. Mol. Biol. 284, 1201-1210.
Example 4
The scoring function
(The following references are also useful for explaining how the program generates parameters from the PDB: F. Melo and E. Feytmans, J. Mol. Biol., vol. 267, pp. 207-222 (1997); and F. Melo and E. Feytmans, J. Mol. Biol., vol. 277, pp. 1141-1152 (1998).)
The current implementation of the model evaluation module uses a scoring function that depends on three variables:
1. Compactness
2. Sequence identity
3. z-score of combined energy (non-bonded interactions + accessible surface)
The compactness is a measure of how globular and spherical a protein model is. The second variable represents the percentage of sequence identity in the alignment between target and template. The z-score is calculated using the sequence space, rather than the structure space, as the reference system. In this particular case, 200 random sequences are threaded onto the same fold to obtain the combined energy z-score of the model. The energy terms taken into account involve pairwise contacts and exposure to solvent. The pairwise term is a statistical potential that uses Cα and Cβ atoms and includes all terms within a Euclidean distance of 15 Å. The accessible surface term involves only Cβ atoms, using a distance threshold of 10 Å.
The scoring function was obtained by running a genetic algorithm (GA) that evolves mathematical expressions; each expression is evaluated for its ability to apply a non-linear transform to the variables that maximizes the standard distance between good and bad models in the training set. After several runs of this GA, the best mathematical expression was selected. This was structure 341, so the scoring function was named GA341. The application of GA341 to the test set used in our study (about 2800 good models and about 5800 bad models) resulted in a standard distance of 5.172 between good and bad models. The GA341 expression, which constitutes the current scoring function, relates the three variables mentioned above, and its structure is as follows:
$$\mathrm{GA341} = 1 - \left[\cos(\mathrm{seq.ide})\right]^{(\mathrm{compactness} + \mathrm{seq.ide}) / \exp(\mathrm{zscore})}$$
The calculation of the three independent variables is described in detail in the next pages. The dependency of this scoring function on the three variables is shown in several graphs in the pages attached. A summary of the performance of the method in the benchmark we have used to test the method is also shown. The classification procedure works as follows:
GA341 score > 0.5 ⇒ model is classified as GOOD
GA341 score < 0.5 ⇒ model is classified as BAD
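As a hedged illustration of the classification rule, the sketch below evaluates a GA341-style score and applies the 0.5 threshold. Two things are assumptions rather than statements of this document: the exponent is taken in the form (seq.ide + compactness) / exp(zscore), following the form in which the GA341 expression is commonly written, and the sequence identity is passed as a fraction between 0 and 1 so that the cosine stays positive. The input values are invented.

```python
import math


def ga341(seq_ide: float, compactness: float, z_score: float) -> float:
    """GA341-style score under the assumptions stated above.

    seq_ide is assumed to be a fraction in [0, 1]; z_score is the combined
    energy z-score (more negative means a more native-like model).
    """
    exponent = (seq_ide + compactness) / math.exp(z_score)
    return 1.0 - math.cos(seq_ide) ** exponent


def classify(score: float, threshold: float = 0.5) -> str:
    """Apply the 0.5 threshold. The text leaves score == 0.5 unspecified;
    this sketch arbitrarily labels it BAD."""
    return "GOOD" if score > threshold else "BAD"


# Invented inputs: same identity and compactness, different z-scores.
strong = ga341(seq_ide=0.30, compactness=1.0, z_score=-6.0)
weak = ga341(seq_ide=0.30, compactness=1.0, z_score=0.0)
labels = (classify(strong), classify(weak))
```

With these inputs a strongly negative z-score drives the score toward 1, while a z-score near zero leaves it close to 0, which is the behaviour the threshold relies on.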
Compactness: This is a measurement of how compact a globular protein is. The mathematical expression to calculate the compactness is as follows:
Figure imgf000076_0001
where di is the largest distance between non-bonded residue pairs observed in the protein; and VAA is the volume of each residue.
z-score: The first step in model evaluation consists of calculating the energy of all non-bonded interactions ($E^{DD}$) and the sum of the energy of exposure to solvent for each residue ($E^{AS}$). Then, the sequence of the model is shuffled and two hundred random sequences are generated. These sequences are threaded onto the model fold, and the energy of non-bonded interactions and of exposure to the solvent is calculated for each one of these random models. In this second step, two distributions of energy that will be used as a reference system are generated: $E_r^{DD}$ and $E_r^{AS}$. Then the standard deviation of each distribution is calculated ($\sigma_r^{DD}$ and $\sigma_r^{AS}$). These standard deviations are used to normalize $E^{DD}$ and $E^{AS}$, obtaining $\bar{E}^{DD}$ and $\bar{E}^{AS}$ respectively. They are also used to normalize each value in each distribution. Once all the energies are normalized, the combined energy is calculated as follows:
$$\bar{E}^{COMB} = \bar{E}^{DD} + \bar{E}^{AS}$$
Then, for each random model, the combined energy is also calculated:
$$\bar{E}_r^{COMB} = \bar{E}_r^{DD} + \bar{E}_r^{AS}$$
Then, the standard deviation ($\sigma_r^{COMB}$) and the average ($\mu_r^{COMB}$) of the distribution of combined energies are calculated and used to obtain the z-score of the model:
$$z^{COMB} = \frac{\bar{E}^{COMB} - \mu_r^{COMB}}{\sigma_r^{COMB}}$$
sequence identity: The third and last independent variable in the GA341 expression is the percentage of sequence identity in the target-template alignment. This is simply the number of identical residues in the alignment divided by the total number of aligned positions.
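The normalization and z-score steps can be written out as a short sketch. It assumes that the combined energy is the sum of the two normalized terms, that population standard deviations are used, and that "aligned positions" means alignment columns without a gap in either sequence; none of these details is confirmed by the text. The energies themselves would come from the statistical potentials, which are not reproduced here, so the numbers below are invented.

```python
from statistics import mean, pstdev
from typing import Sequence


def combined_zscore(
    e_dd: float,
    e_as: float,
    random_dd: Sequence[float],
    random_as: Sequence[float],
) -> float:
    """z-score of the model's combined energy against shuffled-sequence models."""
    sd_dd = pstdev(random_dd)  # spread of non-bonded energies of random models
    sd_as = pstdev(random_as)  # spread of solvent-exposure energies
    model_comb = e_dd / sd_dd + e_as / sd_as
    random_comb = [d / sd_dd + a / sd_as for d, a in zip(random_dd, random_as)]
    return (model_comb - mean(random_comb)) / pstdev(random_comb)


def sequence_identity(target: str, template: str, gap: str = "-") -> float:
    """Percentage of identical residues over aligned (gap-free) columns."""
    aligned = [(a, b) for a, b in zip(target, template) if a != gap and b != gap]
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


# Invented example: three random models instead of the two hundred in the text.
z = combined_zscore(0.0, 0.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
identity = sequence_identity("ACDEF", "AC-EY")
```

A model whose energies lie well below those of the shuffled sequences receives a negative z-score, as in this example.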
The detailed parametrization of the statistical potentials (distance dependent and accessible surface) is given in the tables below:
Distance-dependent potential that describes the energy of interaction between non-bonded residue pairs
Figure imgf000077_0001
Accessible surface potential that describes the energy of exposure to solvent for each residue
Figure imgf000077_0002
Example 5
ModBase
Contents
The database currently contains models for segments of approximately 17,000 proteins
from the completely sequenced genomes of Saccharomyces cerevisiae, Mycoplasma
genitalium, Caenorhabditis elegans, Escherichia coli, Methanobacterium
thermoautotrophicum, Synechocystis sp., Pyrococcus horikoshii, Methanococcus jannaschii, Haemophilus influenzae, and Mycoplasma pneumoniae, as well as all Arabidopsis thaliana
and Homo sapiens proteins in the SwissProt database [27]. The sources of the genomes are listed at http://guitar.rockefeller.edu/modbase/sources.html. Each model has its
non-hydrogen atom coordinates stored in a flat file in the PDB format. The database also
contains all fold assignments, alignments, and model evaluations.
Models are generated with an entirely automated four-step procedure implemented in the ModPipe pipeline software [10, 28]: (i) fold assignment, (ii) sequence-structure alignment, (iii) model building, and (iv) model evaluation. The procedure can be applied independently and in parallel on a cluster of workstations to thousands of protein sequences, including complete genomes and large protein sequence databases. For fold assignment, each sequence from a genome is compared with a non-redundant set of proteins of known 3D structure using Psi-Blast [29]. Next, for each target protein sequence, a multiple global alignment with the matching structures is constructed by the ALIGN2D command in the program Modeller [30]. This alignment tends to be more accurate than the Psi-Blast alignment because (i) it includes all the sequences and structures that are sufficiently similar to the target sequence, (ii) it uses a structure-dependent gap penalty function to position gaps in a structurally reasonable environment, and (iii) it matches complete structural domains as obtained from the known template structures [Roberto Sanchez, Francisco Melo, Nebojsa Mirkovic and Andrej Sali, in preparation]. In the third step, the sequence-structure alignment is used to build a 3D model for the matched parts of the target protein sequence by the program Modeller. Finally, the model is evaluated as discussed next.
Model evaluation is essential for assessing the value of 3D protein models in any protein structure prediction [31, 32]. It is especially important for ModPipe because a relatively permissive cutoff is used to select known protein structures for model building in the first step, fold assignment. This permissiveness reduces the number of missed hits, but it also increases the number of false fold assignments and alignment mistakes. Fold assignment errors begin to appear when relatively dissimilar template-target sequences are matched (i.e., < 30% sequence identity). In addition, even if the fold is assigned correctly, errors in the alignment may still result in a bad model. The alignment errors can be significant when the sequence identity drops below 35%. A reliable model is obtained only if both the correct fold assignment and an approximately correct alignment are made.
The overall accuracy of a model is measured by the overlap between the model and the actual structure. The overlap is defined as the fraction of residues whose Cα atoms are within 3.5 Å of each other in the globally superposed pair of structures. Models that overlap with the correct structures in more than 30% of their residues are defined here as 'good' models. Such models are likely to have a correct fold, which is frequently sufficient for coarse prediction of protein function [33]. A method for calculating the probability that a given model is good, pG, was developed [10] and is used to evaluate all the models in ModBase. If a given model has pG > 0.5, it is called a 'reliable' model. The method depends on a statistical scoring function [32] and was calibrated using 3,993 good and 6,270 bad models for 1,085 proteins of known structure [10]. An assessment of the method by the jack-knife procedure indicated that, for models longer than 100 residues, the classification results in less than 5% false positives and less than 8% false negatives.
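The overlap measure defined above is simple to compute once the two structures are superposed. The sketch below assumes that the superposition has already been done and that the two coordinate lists hold equivalent residues in the same order; the coordinates are invented, and `math.dist` requires Python 3.8 or later.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def overlap(model_ca: Sequence[Point], actual_ca: Sequence[Point], cutoff: float = 3.5) -> float:
    """Fraction of equivalent Cα pairs lying within `cutoff` Å of each other."""
    close = sum(1 for m, a in zip(model_ca, actual_ca) if math.dist(m, a) <= cutoff)
    return close / len(model_ca)


def is_good_model(fraction: float) -> bool:
    """A model is 'good' if more than 30% of its residues overlap."""
    return fraction > 0.30


# Invented, already superposed Cα coordinates (Å); pair distances are 1, 3, 4 and 5 Å.
model = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
actual = [(1.0, 0.0, 0.0), (10.0, 3.0, 0.0), (20.0, 0.0, 4.0), (35.0, 0.0, 0.0)]
fraction = overlap(model, actual)
good = is_good_model(fraction)
```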
Combined 3D modeling and model evaluation is the best way of either confirming or
rejecting a match between remotely related sequence and structure [10, 34]. This is
important because most of the related protein pairs share less than 30% sequence identity [10]. As a result ModBase includes reliable models based on templates that are not
detectable as significant matches by PSI-BLAST alone.
Access and Interface
ModBase has a web interface at http://guitar.rockefeller.edu/modbase/. Models for yeast proteins are also accessible through links from the Sacch3D [35] database at http://genome-www.stanford.edu/Sacch3D. The database is searchable by SwissProt/TrEMBL and GenPept accession numbers, as well as by ORF names, keywords, model reliability, model size, target-template sequence identity, and alignment significance. It is also possible to perform sequence similarity searches against the model sequences using Blast [29]. A search returns a table of the models that satisfy all search criteria. The table lists the modeled regions, the templates used to construct the models, target-template similarities, and model reliabilities. For each model, it also includes links to a more detailed description of the model, to a summary of all models for a given protein, and to the PDB for a detailed description of the template structure used in modeling. If the modeled sequence is present in SwissProt/TrEMBL, its description is displayed together with a link to the database. The model description page contains a graphical representation of the target-template alignment. In addition, it is linked to the model coordinates in the PDB format, to the target-template alignment used to derive the model, and to a display of the model by the 3D visualization program Rasmol [36]. The model description page also contains links to the ModBase entries related to the target sequence and to the CATH domains [17] contained in the model. Finally, statistical data, such as distributions of several model properties in ModBase, can also be displayed.
Using Comparative Models
It is frequently possible to extract more information from a comparative model than from the modeled sequence alone, or even from its alignment to a related protein structure [28]. For example, the preferred ligand of brain lipid binding protein could be predicted correctly from the volume and shape of the ligand binding cleft in its comparative model [37]. Another example is provided by mouse mast cell proteases, some of which have a conserved surface region of positively charged residues that binds proteoglycans [38]. This region is not easily recognizable in the sequence or its alignment to a known structure because the constituent residues occur at variable and sequentially non-local positions that form a binding site only when the protease is fully folded.
In general, comparative modeling has been applied successfully to many biological problems [6]. It can be helpful in proposing and testing hypotheses in molecular biology, such as hypotheses about ligand binding sites [37, 38], substrate specificity [39], drug design [40], and protein-protein interactions [41]. It can also provide starting models in x-ray crystallography [42] and NMR spectroscopy [43]. Another use of 3D models is that some binding and active sites that cannot be found by searching for local sequence patterns should frequently be detectable by searching for small 3D motifs that are known to bind or act on specific ligands [44-46]. Finally, comparative models in combination with model evaluation can also be used to confirm or reject remote sequence-structure relationships, complementing the existing sequence matching and threading methods for fold assignment [10, 34].
References for this example
[1] Collins, F. S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., and Walters, L.
(1998) Science 282, 682-689.
[2] Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y.
(1998) J. Mol. Biol. 283, 707-725.
[3] Koonin, E. V., Tatusov, R. L., and Galperin, M. Y. (1998) Curr. Opin. Str. Biol. 3,
355-363.
[4] Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F. F., Rapp, B. A.,
and Wheeler, D. L. (1999) Nucl. Acids Res. 27, 12-17.
[5] Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T., and Weng, J. (1987) Protein data bank. In F. H. Allen, G. Bergerhoff, and R. Sievers (eds.), Crystallographic databases: Information, content, software systems, scientific applications, pp. 107-132. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester.
[6] Johnson, M. S., Srinivasan, N., Sowdhamini, R., and Blundell, T. L. (1994) CRC Crit. Rev. Biochem. Mol. Biol. 29, 1-68.
[8] Guex, N., Diemand, A., and Peitsch, M. C. (1999) Trends Biochem. Sci. 24, 364-367.
[9] Fischer, D. and Eisenberg, D. (1997) Proc. Natl. Acad. Sci. USA 94, 11929-11934.
[10] Sanchez, R. and Sali, A. (1998) Proc. Natl. Acad. Sci. USA 95, 13597-13602.
[11] Rychlewski, L., Zhang, B., and Godzik, A. (1998) Fold. Des. 3, 229-238.
[12] Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y.,
and Bork, P. (1998) J. Mol. Biol. 280, 323-326.
[13] Grandori, R. (1998) Prot. Eng. 11, 1129-1135.
[14] Teichmann, S. A., Park, J., and Chothia, C. (1998) Proc. Natl. Acad. Sci. USA
95, 14658-14663.
[15] Jones, D. T. (1999) J. Mol. Biol. 287, 797-815.
[16] Hubbard, T. J. P., Ailey, B., Brenner, S. E., Murzin, A. G., and Chothia, C.
(1999) Nucl. Acids Res. 27, 254-256.
[17] Orengo, C. A., Pearl, F. M. G., Bray, J. E., Todd, A. E., Martin, A. C., Conte, L. L., and Thornton, J. M. (1999) Nucl. Acids Res. 27, 275-279.
[18] Holm, L. and Sander, C. (1999) Nucl. Acids Res. 27, 244-247.
[19] Holm, L. and Sander, C. (1996) Science 273, 595-602.
[20] Terwilliger, T. C., Waldo, G., Peat, T. S., Newman, J. M., Chu, K., and Berendzen, J.
(1998) Protein Sci. 7, 1851-1856.
[21] Sali, A. (1998) Nat. Struct. Biol. 5, 1029-1032.
[22] Zarembinski, T. I., Hung, L. W., Mueller-Dieckmann, H. J., Kim, K. K., Yokota, H.,
Kim, R., and Kim, S. H. (1998) Proc. Nat. Acad. Sci. 95, 15189-15193.
[23] Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R., Gaasterland,
T., Lin, D., Sali, A., Studier, F. W., and Swaminathan, S. (1999) Nat. Genet. 23,
151-157.
[24] Montelione, G. T. and Anderson, S. (1999) Nat. Str. Biol. 6, 11-12.
[25] Cort, J. R., Koonin, E. V., Bash, P. A., and Kennedy, M. A. (1999) Nucl.
Acids Res. 27, 4018-4027.
[26] Peitsch, M. C., Wilkins, M. R., Tonella, L., Sanchez, J. C., Appel, R. D., and Hochstrasser, D. F. (1997) Electrophoresis 18, 498-501.
[27] Bairoch, A. and Apweiler, R. (1999) Nucl. Acids Res. 27, 49-54.
[28] Sanchez, R. and Sali, A. (1999) J. Comp. Phys. 151, 388-401.
[29] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. Z., Miller, W., and Lipman, D. J.
(1997) Nucl. Acids Res. 25, 3389-3402.
[30] Sali, A. and Blundell, T. L. (1993) J. Mol. Biol. 234, 779-815.
[31] Luthy, R., Bowie, J. U., and Eisenberg, D. (1992) Nature 356, 83-85.
[32] Sippl, M. J. (1993) Proteins 17, 355-362.
[33] Orengo, C. A., Jones, D. T., and Thornton, J. M. (1994) Nature 372, 631-634.
[34] Guenther, B., Onrust, R., Sali, A., O'Donnell, M., and Kuriyan, J. (1997) Cell 91,
335-345.
[35] Chervitz, S. A., Hester, E. T., Ball, C. A., Dolinski, K., Dwight, S. S., Harris, M. A.,
Juvik, G., Malekian, A., Roberts, S., Roe, T., Scafe, C., Schroeder, M., Sherlock, G., Weng, S., Zhu, Y., Cherry, J. M., and Botstein, D. (1999) Nucl. Acids Res. 27, 74-78.
[36] Sayle, R. and Milner-White, E. J. (1995) Trends in Biochemical Sciences 20, 374.
[37] Xu, L. Z., Sanchez, R., Sali, A., and Heintz, N. (1996) J. Biol. Chem. 271, 24711-24719.
[38] Matsumoto, R., Sali, A., Ghildyal, N., Karplus, M., and Stevens, R. L. (1995) J. Biol.
Chem. 270, 19524-19531.
[39] Caputo, A., James, M. N. G., Powers, J. C., Hudig, D., and Bleackley, R. C. (1994)
Nature Struct. Biol. 1, 364-367.
[40] Ring, C. S., Sun, E., McKerrow, J. H., Lee, G. K., Rosenthal, P. J., Kuntz, I. D., and
Cohen, F. E. (1993) Proc. Natl. Acad. Sci. USA 90, 3583-3587.
[41] Vakser, I. A. (1997) Proteins Suppl. 1, 226-230.
[42] Carson, M., Bugg, C. E., Delucas, L., and Narayana, S. (1994) Acta Crystallogr. D50,
889-899.
[43] Nagata, T., Gupta, V., Kim, W.-Y., Sali, A., Chait, B. T., Shigesada, K., Ito, Y., and
Werner, M. H. (1999) Nat. Str. Biol. 6, 615-619.
[44] Wallace, A., Borkakoti, N., and Thornton, J. M. (1997) Protein Sci. 6, 2308-2323.
[45] Fetrow, J. S. and Skolnick, J. (1998) J. Mol. Biol. 281, 949-968.
[46] Kleywegt, G. J. (1999) J. Mol. Biol. 285, 1887-1897.
[47] Dunbrack Jr., R. L., Gerloff, D. L., Bower, M., Chen, X., Lichtarge, O., and
Cohen, F. E. (1997) Folding & Design 2, R27-R42.
[48] Koehl, P. and Levitt, M. (1999) Nat. Struct. Biol. 6, 108-111.
[49] Jones, D. (1997) Curr. Opin. Struct. Biol. 7, 377-387.
[50] Eddy, S. R. (1996) Curr. Opin. Struct. Biol. 6, 361-365.
Example 6
Example of a 3-dimensional structure obtained for the YBL007C yeast protein using the process and system of this invention, including PSI-BLAST and the scoring function set forth above.
In Table 6, the 6 columns for each atom, reading left to right, are: residue number, amino acid that is the residue, atom name for which the coordinates are shown, and the x, y, and z coordinates, respectively.
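To make the six-column layout concrete, the sketch below reads one well-formed row of Table 6 into a typed record. The `AtomRecord` name and the parser are illustrative assumptions, not part of the described system; rows that do not split into exactly six whitespace-separated fields are rejected rather than guessed at.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    residue_number: int
    residue_name: str
    atom_name: str
    x: float
    y: float
    z: float


def parse_row(row: str) -> AtomRecord:
    """Parse 'residue_number residue_name atom_name x y z' into an AtomRecord."""
    fields = row.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {row!r}")
    number, residue, atom, x, y, z = fields
    return AtomRecord(int(number), residue, atom, float(x), float(y), float(z))


# One row as it appears in Table 6: the Cα atom of residue 14 (GLU).
record = parse_row("14 GLU CA 3.845 -18.919 28.665")
```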
Table 6
Figure imgf000090_0001
Table 6 (continued)
12 THR CA 4.25| -15.83β| 24.247 18 ILE CD1 6.1861 -13.3l|31.432
12 THR CB S.277 -15.715 25.336 18 ILE -9.283 -13.38 27.633
12 THR OGl 4.656 15.863| 26.607 18 -8.897 -12.62 26.747
12 THR CC2 S.926 -14.323 25.24 19 GLN 10.58 -13.47 27.997
12 THR 3.547 -17.136 24.502 19IGLN 11.62 -12.67 27.419
12 THR 2.323 -17.217 24.561 19 GLN CB -12.83 -13.51 27.004
13 PRO 4.333 -18.178 24.556 19 GLN 13.49 -14.16128.221
13 PRO CA 3.799 19.473J 24.88 19 GLN ■14.67 -14.98 27.753
13 PRO CD 5.471| -18.266| 23.656 19 OE1 15.38 -14.61 26.819
13 4.859 -20.481 24.43! 19 CLN NE2 -14.9 -16.13 28.434
13 PRO CG 5.598 -19.754 23.3 19 GLN -12.1 -11.75 28. SI
13 3.395 -19.607 26.326 19|GLN 11.82 -11.98129.685
13 2.51 -20.409 26.621 20 GLU 12.85 10.69128.144
14 4.061 -18.868 27.239 20 GLU -13.37 -9.782 29.133
14 GLU CA 3.845 -18.919 28.665 20 GLU CB -14.22 -8.617 28.574
14 GLU CB 4.963 -18.236 29.468 20 GLU CG -13.48 -7.53 27.784
14 GLU CO 6.289 -18.999 .29.423 20 GLU CD -14.49 -6.442 27.422
14 CD 7.269 -18.2621 30.32 20 GLU OE1 15.69 -6.622 27.751
14 GLU OE1 6.874 -17.207 I 30.885 20|GLU OE2 14.07 5.415126.817
14 GLU OE2 8.423 -18.746 30.46 20 GLU -14.3 10.56130.009
14 2.557 -18.277 29.101 20 GLU 14.97 -11.49 29.561
14 GLU 1.924 -18.737 30.049 21 ASP 14.34 -10.19 31.303 lSlGLU 2.15 -17.185 28.427 21 ASP -IS. -10.77 32.297
ISIGLU 1.033 -16.351 28.794 21 ASP CB -16.7 -10.77 31.914
15|CLU 0.91S -15.081 27.924 21 ASP CG 17.31 -9.387132.085 lSfGLU CG 2.03 -14.049 28.094 21 ASP OD1 -16.59 -8.462 32.534
IS GLU CD 1.715 -13.223 29.327 21 ASP OD2 -18.53 9.245 31.781
15 OE1 1.911 -13.742 30.458 21|ASP -14.84 -12.2 32.567
IS GLU OE2 1.259 -12.063 29.151 21 ASP •15.57 -12.89 33.275
15 GLU -0.27 -17.061 28.601 22 ASP -13.7 -12.7 32.061
15 GLU -0.35 -18.095 27.94 22 ASP 13.39 -14.06 32.398
16 -1.33 -16.503 29.222 22 ASP CB -12.29 -14.71 31.541
16 LEU -2.66 -17.051 29.12 12.26 -16.19 31.881
16 LEU -3.28 -17.361 30.495 22 ASP OD1 13.37 -16.76
16 CG -4.77 -17.789 30.428 22 -11.15 -16.77 31.97
16 LEU CD2 -5.4 -17.848 31.835 22 ASP -12.9 -14.04 33.808
16 LEU CD1 -4.94 -19.104 29.651 22 ASP (-12.32 -13.05 34.246
16 LEU -3.57 -16.057 28.455 23 LEU -13.13 -15.13 34.56
16 LEU -3.63 -14.903 28.876 23 -12.67 -15.15 35.914
17 -4.29 -16.492 27.394 23 LEU -13.7 -15.68 36.929
17 ALA -5.24 -15.657 26.716 23 LEU CG 14.94 -14.77 37.043
17 CB -5.72 -16.211 25.361 23 LEU CD2 •15.73 -14.75 35.73
17 ALA -6.46 -15.51 27.578
23 LEU -14.56 -13.36 37.536
17 ALA -6.79 -16.537 28.276 23 11.45 -16.01 35.968
18 ILE -7.16 -14.423 27.553 23 LEU 11.39 -17.09 35.393
18 ILE CA -8.32 -14.253 28.376 24 LEU 10.42 -15.5 36.675
18 ILE -7.99 -13.52 29.647 24 LEU CA -9.138 -16.12 36.764
18 ILE CG2 -9.27 -13.361 30.482
24 LEU CB -8.023 -15.19 36.243 lβllLE CGI -6.86 -14.228 30.407 24 LEU CG -8.06 -14.85 34.736
LEU CD2 -9.315 -14.06 34.357
24 LEU CD1 -7.85 -16.1 33.864
24|LEU 8.8391 -16.38 38.208
Table 6 (continued)
Figure imgf000092_0001
Table 6 (continued)
6 TRP CDl 0.289 -10.82 39.539 42 VAL 14.1 -22.4 35.77
6 TRP NE1 -0.219 -9.655 40.058 42 VAL -15.43 22.12 36.299
6 TRP CE2 -1.542 -9.55 39.682 42 VAL CB -16.14-20.99 35.596
6 TRP CE3 -3.11 -10.85 38.4 42 VAL CGI 17.57-20.85 36.142
6 CZ2 -2.479 -8.59 39.945 42 VAL CC2 -15.3 19.72 35.812
6 CZ3 4.049 -9.877 38.66 42 VAL -16.22 -23.38 36.129
6 TRP CH2 3.738 -8.769 39.419 42 VAL -15.9 24.22 35.293
6 TRP 0.61S -14.56 37.022 43 ILE -17.28 23.53 36.95
6 TRP 0.474 -14.64 35.8031 43 ILE CA -18.14 -24.68 36.907
7 THR 0.619 15.64 37.8211 43 ILE -19.17-24.69 38.001
7 THR CA 0.37 -16.94 37.271 43 ILE CC2 20.13 -23.5 37.778
7 THR CB 1.129 -18.04 37.958 43 CGI -19.87 -26.06 38.032
7 THR OG1 0.747 18.1139.324 43 CDl -20.75 -26.29 39.259
7 THR CC2 2.632 -17.74 37.843 43 ILE -18.87 -24.65 3S.606
7 THR -1.082 -17.17 37.515 43 ILE -19.31 -23.6 35.14
-1.554 -17.02 38.64 44 GLY -18.99 25.84 34.975
-1.841 -17.53 36.466 44 CA -19.69 25.93 33.73
VAL CA 3.251 -17.61 36.682 44 -18.74 -25.62 32.612
VAL CB 3.962 -16.45 36.063 44 GLY -17.53 25.76 32.743 VAL 5.466 -16.62 36.287 45 -19.3 -25.18 31.468 VAL CG2 -3.382 15.1636.659 45 CA -18.56 •24.89 30.279
VAL ■3.795 18.8736.084 45 SER -19.45 24.49 29.099
3.225 -19.44 35.154 45 SER -20.1 -23.26 29.381
LYS -4.923 19.3536.649 45 SER -17.64 -23.74 30.536 LYS 5.592 -20.51 36.144 45 SER -16.62 23.59 29.859 LYS 5.442 -21.71 37.105 46 -17.95 -22.91 31.543
CG -6.079 -23.03 36.657 46 ASP -17.17 -21.74 31.788
LYS CD -7.595 -23.08 36.839 46 CB -17.66 i-20.92 32.994
LYS 7.998 -23.32 38.298 46 ASP -19.03 20.35 32.642
-7.39 -24.58 38.788 46 ASP OD1 20.32 31.43
LYS -7.038 -20.15 35.965 46 ASP -19.74 ■19.91 33.586
7.651 -19.54 36.B42 46 ASP -15.75 -22.14 32.063
LYS -7.617 -20.48 34.794 46 ASP -15.47 -23.24 32.537
LYS CA -8.987 -20.14 34.523 47 -14.83 -21.21 31.739
LYS 9.264 -19.82 33.043 47 CA -13.41 -21.34 31.894
CG -8.552 -18.6 32.47 47 SER -12.96 21.53 33.34
LYS CD -8.56 -18.59 30.938 47 SER OG -13.21 -20.34 34.079
CE -9.944 -18.83 30.328 47 -22.45 31.052
NZ -9.85 -18.84 28.85 47 SER -13.42 -23.57 31.058
LYS 9.842 -21.34 34.793 48 GLU -11.86 -22.12 30.269
LYS 9.632 -22.39 34.192 48 -11.17 22.99 29.363
10.83 -21.25 35.712 48 -10.25 ■22.24 28.388
-11.7 -22.38 35.847 48 CG -10.97 21.28 27.44
CB -11.2 -23.53 36.749 48 GLU CD -11.62 -22.08 26.329
ARG •12.09 -24.78 36.631 48 GLU OEl -11.94 23.28 26.573
ARG CD ■12.02 -25.74 37.815 48 GLU OE2 -11.81 -21.52 25.218
NE 10.66 -26.36 37.882 48 GLU -10.32 23.95 30.142
ARG CZ -10.26 -26.94 39.048 48 GLU -10.04 25.05 29.675
NH1 11.11 ■26.99 40.113 49 -9.887 23.53 31.348
ARG NH2 9.003 -27.48 39.159 49 GLU CA -8.946 24.17 32.236
ARG 13.02 -21.97 36.418 49 -9.254 -25.65 32.578
ARG -13.09 -21.28 37.437 49 GLU CG -8.835 -26.67 31.51
Table 6 (continued)
Figure imgf000094_0001
Table 6 (continued)
Figure imgf000095_0001
Example 7
Additional Overview of the process
Coarse Parallelization in ModPipe
Parallelization in ModPipe occurs in each of its basic modules: GET_MATRICES for calculation of PSI-BLAST position-specific substitution matrices (PSSM), SEARCH for template search, PARSE for PSI-BLAST output parsing, ALIGN for template selection and alignment, MODEL for model building, and EVAL for model evaluation. There are three forms of coarse parallelization: using the CLUSTOR program, using a queuing system, or running locally on a symmetric multiprocessor machine.
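The third form, running locally on a multiprocessor machine, can be illustrated with a worker pool that applies each module to all jobs in parallel before moving on to the next module. This is only a sketch of the coarse-grained pattern: `run_stage` is a hypothetical placeholder that records the stage name instead of calling PSI-BLAST or MODELLER, and the job layout and the second sequence identifier are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Module names as listed in the text, in pipeline order.
STAGES = ["GET_MATRICES", "SEARCH", "PARSE", "ALIGN", "MODEL", "EVAL"]


def run_stage(stage: str, job: Dict) -> Dict:
    """Placeholder for one module applied to one sequence (hypothetical).

    A real module would launch an external program here; this sketch only
    appends the stage name so the flow of a job can be inspected.
    """
    return {"id": job["id"], "done": job["done"] + [stage]}


def run_pipeline(sequence_ids: List[str], workers: int = 4) -> List[Dict]:
    """Run every stage over all jobs, one stage at a time, in parallel."""
    jobs = [{"id": sequence_id, "done": []} for sequence_id in sequence_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in STAGES:
            # Jobs are independent within a stage, so they can run concurrently;
            # pool.map returns results in input order.
            jobs = list(pool.map(lambda job, stage=stage: run_stage(stage, job), jobs))
    return jobs


results = run_pipeline(["YBL007C", "seqB"])
```

Because each job depends only on its own earlier output, the same pattern carries over to a queuing system or to distributing files to remote nodes, where the copy-in, execute, and copy-out steps replace the in-process call.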
Coarse parallelization using CLUSTOR (ActiveTools, San Francisco, CA)
ModPipe can make use of CLUSTOR to distribute computations on several computers.
CLUSTOR allows distributed computing over any number of computers (nodes) by copying
input files from a central machine (root) to the nodes, executing programs on the nodes, and
copying the program output files from the nodes back to the root. The procedure for each
ModPipe step works as follows:
1. PSI-BLAST PSSM calculation: The ModPipe GET_MATRICES module prepares a CLUSTOR run file that will process one sequence at a time on each of the nodes using the get_matrix ModPipe routine. The input file is the sequence, which is copied from the root to the node. get_matrix calculates a PSSM by running PSI-BLAST with the input file sequence against the non-redundant database of sequences (nr). Both PSI-BLAST and the nr database must be previously installed on the nodes. Once the PSI-BLAST run on the node finishes, CLUSTOR copies the PSSM file back to the root. The same procedure is used to obtain PSSMs for all the template sequences.
2. Template Search: The ModPipe SEARCH module prepares a CLUSTOR run file that
will call the do_search ModPipe routine on the nodes to make two searches per
sequence (filtered and non-filtered PSI-BLAST search). The inputs for this search are
the sequence file and the PSSM file obtained in the previous step. Both files are copied
to the node where do_search runs PSI-BLAST against the template sequences using the
input sequence plus its PSSM as a query. Each of the two runs produces one output file. The two output files are copied back to the root.
3. Parsing: The ModPipe PARSE module prepares a CLUSTOR run file that will execute
the appropriate ModPipe routines on the nodes to parse the BLAST output files from the search step and produce a table of target-template hits for each target or template sequence. The input for parsing is the BLAST output file. The do_parse ModPipe
routine is called by CLUSTOR on the node, and the table of hits produced by the routine
is copied back to the root.
4. Template selection and alignment: The ModPipe ALIGN module creates a CLUSTOR
run file that will execute the do_align ModPipe routine on the nodes to do both template selection and target-template alignment. The hit tables produced by the previous step are
copied over to the nodes along with the target and template sequences, target and
template PSSMs, and the template structures. On the node do_align calls ModPipe
routines for template selection, and MODELLER and/or BLAST to generate the
alignments. The alignment files and an extended hit table, containing alignment and
template selection information, are copied back to the root. MODELLER and BLAST must be previously installed on all nodes.
5. Modeling: The ModPipe MODEL module creates a CLUSTOR run file that calls the
make_model ModPipe routine for each alignment calculated in the previous step. The
alignment file and template structure are copied to the nodes, where CLUSTOR calls the
make_model routine, which in turn executes MODELLER to calculate the model. The model file is then copied back to the root along with an extended table that includes the
model identifier in addition to all the previous alignment data. MODELLER must be
previously installed on all nodes.
6. Evaluation: The ModPipe EVAL module creates a CLUSTOR run file that calls
ModPipe's make_eval routine for each model calculated in the previous step. The model
file is copied to the node where CLUSTOR calls make_eval, which in turn executes the
eval program to evaluate the model structure. The evaluation result is returned to the root to be included in the data table.
Coarse parallelization using a queuing system
ModPipe can make use of basically any queuing system to distribute computations as
batch jobs on several computers. A queuing system allows distributed computing over any
number of computers given that they all have access to one common directory, usually through
an NFS mounted file system. The queuing system can then schedule the execution of ModPipe modules on one or more processors on the queuing system nodes (computers controlled by the
queuing system). In this case the file transfer to and from the nodes is not essential because they
all have access to a common directory. But for the sake of efficiency ModPipe modules copy
the necessary data and programs to temporary local directories on each node. The transfer of
input and output is then accomplished by simple copying from and to the NFS mounted file
system, respectively. The queuing system only controls the execution of the modules. Each module, when running in the queuing system mode, takes as arguments the number of batch jobs
and the number of processors to be used per batch job to process a list of inputs (sequences,
alignments, or models). The module then divides the input list according to the number of batch jobs that will be executed and creates a new set of scripts (one for each batch job) along with a file containing command line options for the individual jobs to be executed on the node. Once
the scripts created by the module are executed on the nodes, they call a multiprocessing module,
which runs a number of copies of the corresponding subroutine (get_matrix for
GET_MATRICES, do_search for SEARCH, etc.) according to the number of processors that
should be used on that node. Each individual subroutine job takes its arguments from the files
created by the module and processes a single input (sequence, alignment, or model). The input
files are copied by the subroutine from the common NFS-mounted directory to the node. The
input and output files are the same as those described for the CLUSTOR parallelization.
Coarse parallelization using a symmetric multiprocessor (SMP) system
ModPipe can do coarse parallelization locally on an SMP by using the methods described for
the queuing system. When running locally on an SMP, the ModPipe modules create a single script (equivalent to a single batch job) that is executed locally on the SMP instead of being submitted to the queuing system. The rest of the procedure is the same, with the script
calling a multiprocessing routine that in turn executes one copy of the corresponding
subroutine for each processor on the SMP.
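The division of an input list into batch jobs, as described for the queuing-system and SMP modes, can be sketched as follows. This is a minimal illustration in Python rather than the Perl of ModPipe itself. The function name split_into_batches and the sample identifiers are hypothetical. The scheme of giving the first few batches one extra input follows the integer-division-and-remainder logic visible in the evaluation script of Appendix 1.

```python
def split_into_batches(items, n_jobs):
    """Split items into n_jobs contiguous batches whose sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(items), n_jobs)
    batches = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` batches each take one additional input.
        size = base + (1 if job < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# Hypothetical input list: seven model identifiers split over three batch jobs.
models = ["model%02d" % i for i in range(1, 8)]
for job_number, batch in enumerate(split_into_batches(models, 3), start=1):
    print("batch job", job_number, "->", batch)
```

In SMP mode the same routine would be called with a single batch, so that all inputs go to one script that is run locally.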
Example 8
Additional illustrative example of the process.
Figure 6 shows a flowchart for comparative protein structure modeling on the
genome scale [11]. The figure relates to a particular example of the process of the
invention.
(It is preferable to use PSI-BLAST instead of Align. It is preferable to use the scoring
function and assessment method described elsewhere in this application, rather than
PROSAII, for model evaluation.) To find template structures for modeling, each protein
sequence is compared with each of the 2045 potential templates corresponding to the protein chains representative of the Protein Data Bank (PDB) of
known protein structures [3]. The representative PDB proteins have at most 95% sequence
identity to each other, or have a length difference of at least 30 residues or 30%; they are also
the highest quality structures within each group. The matching is done by the program Align
[25], which implements the local dynamic programming method with a new gap penalty function and has a search sensitivity higher than that of BLAST. Each sequence-structure
matching is run with the default gap penalty parameters first. (These are not to be confused with the
variable gap penalty function noted below.) A match is considered significant if the alignment score is more than 22 nats and insignificant if it is less than 19 nats, where the nat is a unit for measuring the significance of a match [25]. All the pairs with intermediate
scores, between 19 and 22 nats, are realigned using 600 combinations of the gap
penalty parameters. Such a match is finally considered significant if the best of the 600
alignments has a score of at least 22 nats. The PDB chain from a significant match is used as
the template structure for the corresponding region of the sequence. To obtain the
target-template alignment for comparative modeling, the matching parts of the template
structure and the protein sequence are re-aligned by the use of the Align2d command of the
Modeller program [24]. This command implements a global dynamic programming method
for comparison of two sequences, but also relies on the observation that evolution tends to place residue insertions and deletions in regions that are solvent exposed, curved, outside secondary structure segments, and between two Cα positions close in space. Gaps in these
structurally reasonable positions are favored by a variable gap penalty function that is
calculated from the template structure alone. As a result, the alignment errors are reduced by
approximately one third relative to the standard sequence alignment techniques. The refined sequence-structure alignment is used by Modeller to construct a 3D model of the matched protein sequence region, containing all mainchain and sidechain non-hydrogen atoms.
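The two-stage significance test for a sequence-structure match described earlier in this example (thresholds of 22 and 19 nats, with realignment of intermediate cases) can be sketched as follows. This is an illustrative sketch only: the function name is_significant and the realign callback are hypothetical, and the alignment scores themselves would come from the Align program, which is not reproduced here.

```python
from typing import Callable, Iterable

SIGNIFICANT_NATS = 22.0    # scores above this are significant (value from the text)
INSIGNIFICANT_NATS = 19.0  # scores below this are insignificant (value from the text)


def is_significant(default_score: float,
                   realign: Callable[[], Iterable[float]]) -> bool:
    """Decide whether a match is significant.

    default_score is the score obtained with the default gap penalties.
    realign is a hypothetical callback that returns the scores obtained with
    the alternative gap penalty combinations (600 in the text); it is called
    only for intermediate scores, so the expensive step is skipped otherwise.
    """
    if default_score > SIGNIFICANT_NATS:
        return True
    if default_score < INSIGNIFICANT_NATS:
        return False
    # Intermediate case: significant only if the best realignment reaches 22 nats.
    best = max(realign(), default=float("-inf"))
    return best >= SIGNIFICANT_NATS


# Usage with made-up scores: an intermediate match rescued by one realignment.
print(is_significant(20.3, lambda: [19.8, 21.1, 22.4]))
```

Passing the realignment step as a callback keeps the decision rule separate from the alignment program and makes the rule easy to test on its own.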
Model building begins by extracting distance and dihedral angle restraints on the target
sequence from its alignment with the template structure. These template-derived restraints
are combined with most of the CHARMM energy terms to obtain a full objective function. Finally, this function is optimized to construct a model that satisfies all the spatial restraints
as well as possible. The overall accuracy of the resulting model is predicted by a procedure
that relies on a Z-score from the program PROSAII [28]. The PROSAII Z-score
approximates the difference between the free energy of an evaluated model and the mean free energy of the same sequence threaded through unrelated folds, expressed in units of standard
deviation. The free energies are calculated with statistical potentials of mean force for single
residues and pairs of residues [28]. Using many models of proteins with known structure,
the distributions of the PROSAII Z-score were obtained for good models, which have more
than 30% of their Cα atoms within 3.5 Å of their actual positions, and for bad models. These distributions are used with Bayes' theorem to calculate the probability that a given
model with a certain Z-score is either good or bad. Once a model is predicted to be good, its
overall accuracy is evaluated more precisely based on an empirical relationship between the
fraction of the correctly modeled Cα atoms and the percentage sequence identity to the template [11]. The modeling flowchart in Figure 6 can result in duplicate and overlapping models of some sequence regions. The flowchart has been implemented in a UNIX Perl
script that calls the appropriate programs for the individual tasks. The program Clustor is used to
distribute smaller jobs efficiently over many workstations, without having to adapt the
individual programs for parallel execution (http://www.activetools.com).
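The Z-score definition and the Bayesian classification of a model as good or bad can be illustrated with a short sketch. Several things here are assumptions and not taken from this example: the Z-score distributions of good and bad models are approximated by normal distributions, all numerical parameters are invented, and the prior probability of a good model is set to 0.5. The empirical distributions and the energy calculation of PROSAII are not reproduced.

```python
import math
from statistics import mean, pstdev


def z_score(model_energy, reference_energies):
    """Energy of the model relative to the mean energy of the same sequence
    threaded through unrelated folds, in units of standard deviation."""
    return (model_energy - mean(reference_energies)) / pstdev(reference_energies)


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (an assumed shape for the distributions)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def probability_good(z, good, bad, prior_good=0.5):
    """Bayes' theorem: P(good | z) from the two class-conditional densities.

    good and bad are (mean, standard deviation) pairs; the values used below
    are hypothetical placeholders, not fitted to any data.
    """
    weighted_good = normal_pdf(z, *good) * prior_good
    weighted_bad = normal_pdf(z, *bad) * (1.0 - prior_good)
    return weighted_good / (weighted_good + weighted_bad)


# Usage with made-up numbers: a model with a strongly negative Z-score.
z = z_score(-10.0, [-2.0, 0.0, 2.0])
print(round(z, 2), round(probability_good(z, good=(-8.0, 2.0), bad=(-2.0, 2.0)), 3))
```

With equal priors and equal spreads, the posterior is 0.5 exactly halfway between the two assumed means and rises toward 1 as the Z-score becomes more negative.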
References for this example
[3] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng. Protein data bank.
In F. H. Allen, G. Bergerhoff, and R. Sievers, editors, Crystallographic databases: Information content, software systems, scientific applications, pages 107-132. Data
Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, 1987.
[11] R. Sanchez and A. Sali. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.
[24] A. Sali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol., 234:779-815, 1993.
[25] S.F. Altschul. Generalized affine gap costs for protein sequence alignment. Proteins,
32:88-96, 1998.
[28] M. J. Sippl. Recognition of errors in three-dimensional structures of proteins. Proteins, 17:355-362, 1993.
Appendix 1 eval1.pl
# collects all sel files from the sequence directories, extracts the
# alignments and calculates model.

use lib $ENV{MODPIPELIB};
use RSlib::Error;
use RSlib::modbase;
use ModPipe::misc;
use Getopt::Long;

# program name
$program = $0;
$program =~ s/.*\///g;

# check input
&check_options;
$listfile = $ARGV[0];

# initialize modpipe configuration
if ( ! modpipe_init ) {
  Error(0, "modpipe_init failed") }

# read list file
$nomodel = 0;
$witheval1 = 0;
$count = 0;
$failed = 0;
open(LIST, $listfile);
while ( <LIST> ) {
  chomp;
  @F = split(/\|/, $_);
  $mod = $F[2];
  $mod =~ s/ //g;
  $moddir = ModDir($mod);
  $modfile = $moddir."/$mod.pdb";
  $datfile = $moddir."/$mod.dat";
  # does the model file exist?
  if ( -e $modfile || -e "$modfile.gz" ) {
    open(DAT, $datfile);
    while ( <DAT> ) {
      chomp;
      ($sid, $eval1) = (split(/\|/, $_))[13,14];
    }
    close(DAT);
    if ( $eval1 && ! $replace && $eval1 ne "F" ) {
      $witheval1++;
    } else {
      if ( $eval1 eq "F" ) { $failed++ }
      $count++;
      $mod[$count] = $mod;
      $sid[$count] = $sid;
    }
  } else {
    $nomodel++;
    push(@notfound, $mod);
  }
}
close(LIST);

if ( $check ) {
  message "$nomodel model files were not found";
  foreach $model (@notfound) {
    message("$model not found") }
  message "$failed models with previously failed evaluation";
  message "$witheval1 models were already successfully evaluated";
}

if ( $count < 1 ) {
  message "no models to process.";
  exit;
} else {
  message "processing $count models";
}

if ( $check ) { exit }

if ( $run eq "lsf" || $run eq "lsflocal" ) {

  # LSF run : NG is number of processor groups, NP is number of processors
  # per group. For example 2x32 submits jobs on two groups of
  # 32 processors each, this means submitting two bsub jobs
  # with -n32 and -ptile=32 each.

  # create tmp directory for this run
  $lsfdir = "lsf-eval1".time;
  system("mkdir $lsfdir");
  $tmpdir = $tmpdirroot."/$lsfdir";
  print "LSF SETUP in directory $lsfdir\n";
  print "RUNNING in directory $tmpdir\n";

  # parse --lsfp option for number of groups (ng) and number of
  # processors per group (np)
  ($ng, $np) = split(/x/, $lsfp);

  # decide which queue to use unless it was passed as an argument
  # (this is ACL specific)
  unless ( $queue ) { $queue = "small" }

  # use 12 hour wall time unless it was passed as an argument
  unless ( $wtime ) { $wtime = "12:00" }

  # divide the data to be processed into ng groups
  $nsg = int($count/$ng);
  $rest = $count%$ng;
  $seq = 1;
  for $i ( 1 .. $ng ) {
    $ns[$i] = $nsg;
    if ( $i <= $rest ) { $ns[$i]++ }
    $ini = $seq; $fin = $seq + $ns[$i] - 1;
    $seq = $fin+1;
    #print "group $i -> $ns[$i] : $ini - $fin\n";

    # divide sequences for each group $i
    $datfile = "$lsfdir/group$i.dat";
    open(DAT, ">$datfile");
    for $s ( $ini .. $fin ) {
      printf DAT ("%s -sid %d -tmpdir=%s %s %s\n", $mod[$s], $sid[$s], $tmpdir, $debugopt);
    }
    close(DAT);

    # prepare script for group
    open(RUN, ">$lsfdir/group$i.run");
    #print RUN "use lib \"/home/modpipe/lib/perl5\";\n";
    print RUN "use lib \"$ENV{MODPIPELIB}\";\n";
    print RUN "use RSlib::multiproc;\n";
    print RUN "unless ( -e \"$tmpdir\" ) {\n";
    print RUN "  system(\"mkdir -p $tmpdir\");\n";
    print RUN "}\n";
    print RUN "multiproc1($np,\"$perl $scriptdir/make_eval1.pl\",\"$datfile\");";
    close(RUN);

    # now submit this group's jobs to the queue
    # It is important to send the job to the background "&" to make
    # the program continue to the next group.
    if ( !$test && $run eq "lsf" ) {
      system("bsub -q $queue -W $wtime -n $np -R \"span[ptile=$np]\" -o $lsfdir/group$i.out $perl $lsfdir/group$i.run &");
    }
    #print "group $i : using queue $queue with wtime=$wtime and $np processors.\n";

    # THE FOLLOWING LINE CAN BE USED TO RUN THIS LOCALLY ON AN SMP
    # if the lsfp option is set to 1xN where N is the number of
    # processors to be used
    if ( !$test && $run eq "lsflocal" ) {
      system("$perl $lsfdir/group$i.run 2>&1 > $lsfdir/group$i.log &");
    }
  }
}

exit;

# SUBROUTINES

# check options
sub check_options {
  my $result = GetOptions("run=s"   => \$run,
                          "lsfp=s"  => \$lsfp,
                          "queue=s" => \$queue,
                          "wtime=s" => \$wtime,
                          "check"   => \$check,
                          "test"    => \$test,
                          "replace" => \$replace,
                          "debug=i" => \$debug
                         );
  if ( $#ARGV < 0 || ( ! $run && ! $check ) ) {
    $message = "usage : $program modeldatafile -run=(clustor|lsf|lsflocal) "
             . "[-lsfp=NGxNP -queue=queuename -wtime=HH:MM] [-replace] [-check] "
             . "[-test] [-debug=debuglevel]";
    Error(0, $message);
  }
  if ( $#ARGV > 0 || ! $result ) {
    $message = "don't understand options";
    Error(0, $message);
  }
  $DEBUG = $debug;
  if ( $debug ) { $debugopt = "-debug=$debug" }
}
Appendix 1 (continued) ModEval.pm
#!/usr/local/bin/perl
package ModPipe::ModEval;
use RSlib::Error;
use strict;
use vars '@ISA', '@EXPORT', '$NAME', '$VERSION', '$DATE', '$AUTHOR';
require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(EnePair1 EneSurf1 ZScore1 Compactness Eval1 Eval0 Zscore0);

#
# ModPipe::ModEval
#
$NAME = "ModPipe::ModEval";
$VERSION = "1.00";
$DATE = "09-27-1999";
$AUTHOR = "Roberto Sanchez (sancher\@rockefeller.edu)";

#
# Eval0 (old ProsaII based pG)
#
sub Eval0 {
  use RSlib::pG;
  my ($result, $zscore);
  my ($pdbfile, $seqlen, $pglib) = @_;

  ($result, $zscore) = Zscore0($pdbfile);
  if ( $result ) {
    Error(1, "Zscore0 failed");
    return 1 }
  Error(3, "Zscore0 = $zscore");

  my ($pg, $nzs) = get_pg($seqlen, $zscore, $pglib);
  Error(3, "pg = $pg, nzs = $nzs");
  return (0, $pg);
}

# Zscore0
sub Zscore0 {
  my $pdbfile = $_[0];
  my $zscore = `prosa-zscore $pdbfile | grep Z-SCORE`;
  chomp($zscore);
  $zscore = (split(/\s+/, $zscore))[1];
  if ( $zscore eq "" ) { return 1 }
  return (0, $zscore);
}

# Eval1
# INPUT : pdbfile, sequence identity
# OUTPUT :
sub Eval1 {
  my ($result, $enepair, $enesurf, $zscore, $comp);

  my $nargs = 2;
  check_args($nargs, \@_);
  my ($pdbfile, $seqid) = @_;
  $seqid = $seqid/100;

  ($result, $enepair) = EnePair1($pdbfile);
  if ( $result ) {
    Error(1, "EnePair1 failed");
    return 1 }
  Error(3, "EnePair1 = $enepair");

  ($result, $enesurf) = EneSurf1($pdbfile);
  if ( $result ) {
    Error(1, "EneSurf1 failed");
    return 1 }
  Error(3, "EneSurf1 = $enesurf");

  ($result, $zscore) = ZScore1($pdbfile);
  if ( $result ) {
    Error(1, "ZScore1 failed");
    return 1 }
  Error(3, "Zscore1 = $zscore");

  ($result, $comp) = Compactness($pdbfile);
  if ( $result ) {
    Error(1, "Compactness failed");
    return 1 }
  Error(3, "Compactness = $comp");

  my $score = 1 - ((cos($seqid))**(($comp+$seqid)/exp($zscore)));
  return (0, $score);
}

# EnePair1
# INPUT :
# OUTPUT :
sub EnePair1 {
  my $nargs = 1;
  check_args($nargs, \@_);

  my ($pdbfile) = @_;
  my $exe = $main::enepair1exerun;
  my $potential = $main::pair1potentialrun;

  Error(3, "running on $pdbfile using $exe with $potential");
  # make sure files are in place
  CopyEnePair1Files();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  unless ( -e $potential ) {
    Error(0, "couldn't find potential file $potential");
  }
  my $output = `$exe $potential $pdbfile`;
  unless ( $output =~ /pair_ene/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $enepair = (split(/\s+/, $output))[2];
  return (0, $enepair);
}

# EneSurf1
# INPUT :
# OUTPUT :
sub EneSurf1 {
  my $nargs = 1;
  check_args($nargs, \@_);

  my ($pdbfile) = @_;
  my $exe = $main::enesurf1exerun;
  my $potential = $main::surf1potentialrun;

  Error(3, "running on $pdbfile using $exe with $potential");
  # make sure files are in place
  CopyEneSurf1Files();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  unless ( -e $potential ) {
    Error(0, "couldn't find potential file $potential");
  }
  my $output = `$exe $potential $pdbfile`;
  unless ( $output =~ /pair_surf/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $enesurf = (split(/\s+/, $output))[2];
  return (0, $enesurf);
}

# ZScore1
# INPUT :
# OUTPUT :
sub ZScore1 {
  my $nargs = 1;
  check_args($nargs, \@_);
  my ($pdbfile) = @_;
  my $exe = $main::zscore1exerun;

  Error(3, "running on $pdbfile using $exe");
  # make sure files are in place
  CopyZScore1Files();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  # check prerequisite files
  my $rfile1 = "$pdbfile.ene_pair.native";
  my $rfile2 = "$pdbfile.ene_pair.random";
  my $rfile3 = "$pdbfile.ene_surf.native";
  my $rfile4 = "$pdbfile.ene_surf.random";

  if ( ! -e $rfile1 || ! -e $rfile2 || ! -e $rfile3 || ! -e $rfile4 ) {
    Error(1, "One or more of these files is missing:\n\t\t$rfile1\n\t\t$rfile2\n\t\t$rfile3\n\t\t$rfile4\n");
    return 1;
  }
  my $output = `$exe $pdbfile`;
  unless ( $output =~ /Z-score/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $zscore = (split(/\s+/, $output))[2];
  return (0, $zscore);
}

# Compactness
# INPUT :
# OUTPUT :
sub Compactness {
  my $nargs = 1;
  check_args($nargs, \@_);

  my ($pdbfile) = @_;
  my $exe = $main::compactnessexerun;

  Error(3, "running on $pdbfile using $exe");
  # make sure files are in place
  CopyCompactnessFiles();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  my $output = `$exe $pdbfile`;
  unless ( $output =~ /compactness/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $comp = (split(/\s+/, $output))[2];
  return (0, $comp);
}

# CopyEnePair1Files
sub CopyEnePair1Files {
  use File::Basename;
  my ($name, $path);
  my $nargs = 0;
  check_args($nargs, \@_);
  my $exesrc = $main::enepair1exesrc;
  my $potentialsrc = $main::pair1potentialsrc;
  my $exerun = $main::enepair1exerun;
  my $potentialrun = $main::pair1potentialrun;
  unless ( -e $exesrc ) {
    Error(0, "couldn't find executable $exesrc");
  }
  unless ( -e $potentialsrc ) {
    Error(0, "couldn't find potential $potentialsrc");
  }
  unless ( -e $exerun ) {
    ($name, $path) = fileparse($exerun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $exesrc $exerun");
  }
  unless ( -e $potentialrun ) {
    ($name, $path) = fileparse($potentialrun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $potentialsrc $potentialrun");
  }
}

# CopyEneSurf1Files
sub CopyEneSurf1Files {
  use File::Basename;
  my ($name, $path);
  my $nargs = 0;
  check_args($nargs, \@_);
  my $exesrc = $main::enesurf1exesrc;
  my $potentialsrc = $main::surf1potentialsrc;
  my $exerun = $main::enesurf1exerun;
  my $potentialrun = $main::surf1potentialrun;
  unless ( -e $exesrc ) {
    Error(0, "couldn't find executable $exesrc");
  }
  unless ( -e $potentialsrc ) {
    Error(0, "couldn't find potential $potentialsrc");
  }
  unless ( -e $exerun ) {
    ($name, $path) = fileparse($exerun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $exesrc $exerun");
  }
  unless ( -e $potentialrun ) {
    ($name, $path) = fileparse($potentialrun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $potentialsrc $potentialrun");
  }
}

# CopyZScore1Files
sub CopyZScore1Files {
  use File::Basename;
  my ($name, $path);
  my $nargs = 0;
  check_args($nargs, \@_);
  my $exesrc = $main::zscore1exesrc;
  my $exerun = $main::zscore1exerun;
  unless ( -e $exesrc ) {
    Error(0, "couldn't find executable $exesrc");
  }
  unless ( -e $exerun ) {
    ($name, $path) = fileparse($exerun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $exesrc $exerun");
  }
}

# CopyCompactnessFiles
sub CopyCompactnessFiles {
  use File::Basename;
  my ($name, $path);
  my $nargs = 0;
  check_args($nargs, \@_);
  my $exesrc = $main::compactnessexesrc;
  my $exerun = $main::compactnessexerun;
  unless ( -e $exesrc ) {
    Error(0, "couldn't find executable $exesrc");
  }
  unless ( -e $exerun ) {
    ($name, $path) = fileparse($exerun);
    unless ( -e $path ) {
      system("mkdir -p $path");
    }
    system("cp $exesrc $exerun");
  }
}

1;
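The composite score computed in the ModEval.pm listing above from sequence identity, compactness, and Z-score can be restated as a small function. The sketch below is in Python rather than Perl and transcribes the expression as it reads in the listing, which was recovered from a damaged printing, so the exact form should be treated with caution. The function name eval1_score and the sample values are hypothetical, and the units of the compactness value are not stated in the listing.

```python
import math


def eval1_score(seqid_percent, compactness, zscore):
    """Score = 1 - cos(s) ** ((compactness + s) / exp(zscore)),
    where s is the sequence identity expressed as a fraction."""
    seqid = seqid_percent / 100.0
    exponent = (compactness + seqid) / math.exp(zscore)
    # For s between 0 and 1, cos(s) lies between about 0.54 and 1, so the
    # power is well defined and the score stays between 0 and 1 whenever
    # the exponent is non-negative.
    return 1.0 - math.cos(seqid) ** exponent


# Made-up inputs: the same model geometry with a favourable and an
# unfavourable Z-score.
print(eval1_score(50.0, 0.2, -5.0))
print(eval1_score(50.0, 0.2, 2.0))
```

Under this expression a more negative Z-score enlarges the exponent and pushes the score toward 1, while zero sequence identity gives a score of 0.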

Claims

What is claimed is:
1. A computerized process of generating a 3-dimensional model of a protein, the process comprising the steps of:
(1) an inputting step, wherein a query protein amino acid sequence is inputted into a
computer;
(2) a sequence search step, wherein one or more protein databases are searched so as
to identify potentially sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein and
for which the 3-dimensional structure is known;
(3) an alignment step, wherein for each sequence-related protein an optimal degree of
alignment is created between the amino acid sequence of the query protein and that of each sequence-related protein, at least one of which has a known 3-dimensional structure;
(4) a model-building step, wherein for each query sequence alignment obtained in the
alignment step, electronically stored retrievable information that defines a model of a 3-
dimensional structure for all or part of the query protein amino acid sequence is created;
(5) a model evaluation step, wherein the model that is probably closest in structure to the actual structure of the query protein is selected from all models generated for
the query amino acid sequence;
(6) a model storage step, wherein information generated in step (5) is stored
electronically, magnetically, or electromagnetically, such that said information is retrievable.
2. A process of Claim 1 wherein a plurality of query protein amino acid sequences
are processed in steps (1) through (6) in time periods that overlap each other.
3. A process of Claim 1 wherein a variable gap penalty function is used in the
alignment step.
4. A process of Claim 1 wherein, in the model evaluation step, a model is compared
to previously identified populations of good and bad models.
5. A process of Claim 1 wherein, in the model evaluation step, a scoring function is used, said function being dependent on compactness, sequence identity, and z-score,
and optionally additional variables.
6. A process of Claim 1, wherein in the sequence searching step, a low stringency
search for sequence similarity is part of the step.
7. A process of Claim 6 wherein a high stringency search is done before the low
stringency search is done.
8. A process of Claim 7 wherein the high stringency searches comprise (1) one
between the query sequence and a database with a plurality of known amino acid sequences
corresponding to either known or unknown 3-dimensional structures and (2) one between the
sequences of a database of proteins of known 3-dimensional structure and sequences of the database with a plurality of known amino acid sequences corresponding to either known or
unknown structures.
9. A process of Claim 1 which further comprises a ligand-docking step, in which the ability of a molecule to bind to a model generated by a step of the process of Claim 1 is tested.
10. A system for accepting as input a query amino acid sequence and carrying out the
computerized process of the invention so as to output a 3-dimensional model of protein with
the query amino acid sequence :
(1) a collection of one or more query amino acid sequences;
(2) a collection of one or more databases;
(3) a sequence matching engine for searching one or more protein databases so as to
identify sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein ;
(4) an alignment engine that, for each sequence-related protein, creates an optimal
degree of alignment between the amino acid sequence of the query protein and that of each sequence-related protein, said alignment identifying sequence gaps where one sequence has
no equivalent residues in the other;
(5) a model building engine that, for each query sequence alignment obtained in the
alignment step, generates electronically stored retrievable information that defines a model
of a 3-dimensional structure for the query protein amino acid sequence ; and
(6) a model evaluation engine that discriminates good models from bad models for the query amino acid sequence, and sorts the models according to their accuracy.
11. A system of Claim 10 wherein the interaction between the collections and
engines is under the control of a process control engine.
12. Electronically, magnetically, or electromagnetically stored data generated by the
process of Claim 1.
13. Electronically, magnetically, or electromagnetically stored data generated by step (2), (3), (4), or (5) of the process of Claim 1.
14. A 2-dimensional or 3-dimensional representation of the data of Claim 12.
15. A 2-dimensional or 3-dimensional representation of the data of Claim 13.
16. Electronically, magnetically, or electromagnetically stored data generated by the
system of Claim 10.
17. A computerized process of generating a 3-dimensional model of a protein, the
process comprising the steps of:
(1) an inputting step, wherein a query protein amino acid sequence is inputted into a
computer;
(2) a sequence search step, wherein one or more protein databases are searched so as
to identify potentially sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein and
for which the 3-dimensional structure is known;
(3) an alignment step, wherein for each sequence-related protein an optimal degree of
alignment is created between the amino acid sequence of the query protein and that of each
sequence-related protein, at least one of which has a known 3-dimensional structure;
(4) a model-building step, wherein for each query sequence alignment obtained in the
alignment step, electronically stored retrievable information that defines a model of a 3-
dimensional structure for all or part of the query protein amino acid sequence is created;
(5) a model evaluation step, wherein the model that is probably closest in
structure to the actual structure of the query protein is selected from all models generated for the query amino acid sequence;
(6) a model storage step, wherein information generated in steps (2), (3), (4) or (5) is
stored electronically, magnetically, or electromagnetically, such that said information is
retrievable.
PCT/US2000/030753 1999-11-09 2000-11-08 Large scale comparative protein structure modeling WO2001035255A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00978441A EP1236124A2 (en) 1999-11-09 2000-11-08 Large scale comparative protein structure modeling
JP2001536721A JP2003525483A (en) 1999-11-09 2000-11-08 Large-scale comparative protein structure modeling
CA002391469A CA2391469A1 (en) 1999-11-09 2000-11-08 Large scale comparative protein structure modeling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US43773899A 1999-11-09 1999-11-09
US09/437,738 1999-11-09

Publications (2)

Publication Number Publication Date
WO2001035255A2 true WO2001035255A2 (en) 2001-05-17
WO2001035255A3 WO2001035255A3 (en) 2002-06-06

Family

ID=23737682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/030753 WO2001035255A2 (en) 1999-11-09 2000-11-08 Large scale comparative protein structure modeling

Country Status (4)

Country Link
EP (1) EP1236124A2 (en)
JP (1) JP2003525483A (en)
CA (1) CA2391469A1 (en)
WO (1) WO2001035255A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001069508A2 (en) * 2000-03-14 2001-09-20 Inpharmatica Limited Multiple sequence alignment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997048060A1 (en) * 1996-06-14 1997-12-18 Immunex Corporation Method and system for protein modeling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993001484A1 (en) * 1991-07-11 1993-01-21 The Regents Of The University Of California A method to identify protein sequences that fold into a known three-dimensional structure

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
BOWIE J U ET AL: "A METHOD TO IDENTIFY PROTEIN SEQUENCES THAT FOLD INTO A KNOWN THREE-DIMENSIONAL STRUCTURE" SCIENCE, AMERICAN ASSOCIATION FOR THE ADVANCEMENT OF SCIENCE,, US, vol. 253, 12 July 1991 (1991-07-12), pages 164-170, XP000764972 ISSN: 0036-8075 *
LESK A M: "ASSESSMENT OF AB INITIO PROTEIN STRUCTURE PREDICTION" PROCEEDINGS OF THE SECOND ANNUAL INTERNATIONAL CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY. RECOMB '98. NEW YORK, NY, MARCH 22 - 25 1998, PROCEEDINGS OF THE ANNUAL INTERNATIONAL CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY, NEW YORK, NY, ACM, US, 22 March 1998 (1998-03-22), pages 163-171, XP000869301 ISBN: 0-89791-976-9 *
SALI A ET AL: "COMPARATIVE PROTEIN MODELLING BY SATISFACTION OF SPATIAL RESTRAINTS" JOURNAL OF MOLECULAR BIOLOGY, LONDON, GB, vol. 234, 1993, pages 779-815, XP002948615 ISSN: 0022-2836 *
SALI, A.: "Evaluation of Comparative Protein Structure Modeling by MODELLER-3" PROTEINS: STRUCTURE, FUNCTION, AND GENETICS, SUPPL., [Online] vol. 1, 1997, pages 50-58, XP002190864 Retrieved from the Internet: <URL:http://guitar.rockefeller.edu/publications> [retrieved on 2002-02-20] *
SALI, A.: "Modeling Mutations and Homologous Proteins" CURRENT OPINION IN BIOTECHNOLOGY, vol. 6, no. 4, August 1995 (1995-08), pages 437-451, XP008000909 ISSN: 0958-1669 *
SALI, A.: "100,000 protein structures for the biologist" NATURE STRUCTURAL BIOLOGY, [Online] vol. 5, no. 12, December 1998 (1998-12), pages 1029-1032, XP002190863 Retrieved from the Internet: <URL:http://guitar.rockefeller.edu/publications> [retrieved on 2002-02-20] *
SANCHEZ, R.; SALI, A.: "Advances in comparative protein-structure modelling" CURRENT OPINION IN STRUCTURAL BIOLOGY, [Online] vol. 7, 1997, pages 206-214, XP002190862 ISSN: 0959-440X Retrieved from the Internet: <URL:http://guitar.rockefeller.edu/publications> [retrieved on 2002-02-20] *
XIRU ZHANG: "A HYBRID ALGORITHM FOR DETERMINING PROTEIN STRUCTURE" IEEE EXPERT, IEEE INC. NEW YORK, US, vol. 9, no. 4, 1 August 1994 (1994-08-01), pages 66-74, XP000485629 ISSN: 0885-9000 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001069508A2 (en) * 2000-03-14 2001-09-20 Inpharmatica Limited Multiple sequence alignment
WO2001069508A3 (en) * 2000-03-14 2002-06-13 Inpharmatica Ltd Multiple sequence alignment

Also Published As

Publication number Publication date
WO2001035255A3 (en) 2002-06-06
CA2391469A1 (en) 2001-05-17
JP2003525483A (en) 2003-08-26
EP1236124A2 (en) 2002-09-04

Similar Documents

Publication Publication Date Title
Dhakal et al. Artificial intelligence in the prediction of protein–ligand interactions: recent advances and future directions
Aggarwal et al. DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks
Schauperl et al. AI-based protein structure prediction in drug discovery: impacts and challenges
Higgins et al. Bioinformatics: Sequence, Structure and Databanks: A Practical Approach
Pérot et al. Druggable pockets and binding site centric chemical space: a paradigm shift in drug discovery
Aloy et al. Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking
Balaji et al. PALI—a database of Phylogeny and ALIgnment of homologous protein structures
Frishman et al. Functional and structural genomics using PEDANT
Fiser Protein structure modeling in the proteomics era
JP2002523057A (en) Methods and systems for predicting protein function
Hillisch et al. Modern methods of drug discovery
Akalın Introduction to bioinformatics
Linial et al. Methodologies for target selection in structural genomics
Vakser et al. Predicting 3D structures of protein-protein complexes
Sánchez et al. Comparative protein structure modeling in genomics
Dawson et al. The classification of protein domains
Popov et al. Knowledge of native protein–protein interfaces is sufficient to construct predictive models for the selection of binding candidates
Moriaud et al. Computational fragment-based approach at PDB scale by protein local similarity
Xu et al. Protein databases on the internet
WO2001018627A2 (en) Method and apparatus for computer automated detection of protein and nucleic acid targets of a chemical compound
Schafroth et al. Predicting peptide binding to MHC pockets via molecular modeling, implicit solvation, and global optimization
AU780941B2 (en) System and method for searching a combinatorial space
WO2001035255A2 (en) Large scale comparative protein structure modeling
US20030180803A1 (en) Lead molecule generation
Li et al. Simultaneous Prediction of Interaction Sites on the Protein and Peptide Sides of Complexes through Multilayer Graph Convolutional Networks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CA JP MX

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
ENP Entry into the national phase

Ref country code: JP

Ref document number: 2001 536721

Kind code of ref document: A

Format of ref document f/p: F

WWE WIPO information: entry into national phase

Ref document number: 2391469

Country of ref document: CA

WWE WIPO information: entry into national phase

Ref document number: 2000978441

Country of ref document: EP

AK Designated states

Kind code of ref document: A3

Designated state(s): CA JP MX

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

WWP WIPO information: published in national office

Ref document number: 2000978441

Country of ref document: EP

WWW WIPO information: withdrawn in national office

Ref document number: 2000978441

Country of ref document: EP