WO2001035255A2

WO2001035255A2 - Large scale comparative protein structure modeling

Info

Publication number: WO2001035255A2
Application number: PCT/US2000/030753
Authority: WO
Inventors: Andrej Sali; Roberto Sanchez; Francisco Melo
Original assignee: The Rockefeller University
Priority date: 1999-11-09
Filing date: 2000-11-08
Publication date: 2001-05-17
Also published as: WO2001035255A3; CA2391469A1; JP2003525483A; EP1236124A2

Abstract

A method and system for creating 3-dimensional models of polypeptides and proteins given their amino acid sequence; also the models they generate.

Description

LARGE SCALE COMPARATIVE PROTEIN STRUCTURE MODELING

The research leading to the present invention was supported, at least in part, by a grant

from . Accordingly, the Government may have certain rights in the invention.

Background

This invention relates to processes for generating models of 3-dimensional protein structures, to the systems needed to implement the processes, and also to the data generated

by those systems.

In a few years, the genome projects will have provided us with the amino acid sequences of more than a million proteins - the catalysts, inhibitors, messengers, receptors,

transporters, and building blocks of the living organisms [1]. The full potential of the

genome projects will only be realized once we assign and understand the function of these

proteins. While protein function is best determined experimentally, it can sometimes be

predicted by matching the sequence of a protein with proteins of known function [1]. One

way to improve such sequence-based predictions of function is to rely on the known native 3D structure of proteins [1]. The 3D structure of a protein generally provides more

information about its function than sequence because interactions of a protein with other

molecules are determined by amino acid residues that are close in space but are frequently distant in sequence. In addition, since evolution tends to conserve function and function

depends more directly on structure than on sequence, structure is more conserved in evolution

than sequence. The net result is that patterns in space are frequently more recognizable than patterns in sequence. Unfortunately, 3D structures have been determined by x-ray

crystallography or NMR spectroscopy for only a fraction of known protein sequences; while there are approximately 450,000 protein sequences in TrEMBL/SWISS-PROT [2], there are

only 10,000 known protein structures in the Protein Data Bank [3]. However, a useful 3D

model can frequently be obtained by comparative or homology protein structure modeling,

which can construct all-atom 3D models for those proteins that are related to at least one

known protein structure [4].

Despite errors in comparative modeling, it has been applied successfully to many biological

problems [4, 5]. Comparative modeling can be helpful in proposing and testing hypotheses in molecular biology, such as hypotheses about ligand binding sites [6],

substrate specificity [7], and drug design [8]. It can also provide starting models in

x-ray crystallography [9] and NMR spectroscopy [10]. Explicit 3D modeling and model

evaluation are the best way of either confirming or rejecting a match between two remotely

related proteins [11]. This is important because most of the related protein pairs share less than 30% sequence identity [11]. It is frequently possible to predict correctly important

features of the target protein that do not occur in the template structure. For example, the

location of a binding site can be predicted from clusters of charged residues [12], and the size

of a ligand can be predicted from the volume of the binding site cleft [13]. Another use of 3D

models is that some binding and active sites, which cannot possibly be found by searching for local sequence patterns [14], frequently should be detectable by searching for small 3D motifs that are known to bind or act on specific ligands [15]. In general, medium

resolution models based on at least 30% sequence identity to a known protein structure frequently allow a refinement of the functional prediction based on sequence alone.

The fraction of the known protein sequences that have at least one segment related to one or

more known structures currently ranges from 20 to 50%, depending on the genome [16].

Thus, the number of sequences that can be modeled with useful accuracy by comparative modeling is already more than an order of magnitude larger than the number of

experimentally determined protein structures. Furthermore, the fraction of protein sequences

that can be modeled reliably by comparative modeling is increasing rapidly. The main

reasons for this improvement are the increases in the numbers of known folds and the

structures per fold family as well as the improvement in the fold recognition and comparative modeling techniques. It has been estimated that globular protein domains cluster in only a few thousand fold families, approximately 900 of which have already been structurally

defined [18]. Assuming the current growth rate in the number of known protein structures,

the structure of at least one member of most globular folds will be determined in less than ten

years [11]. Structural genomics may in fact accelerate this goal [19]. As a result, comparative modeling will be applicable to most of the globular protein domains soon after

the completion of the human genome project.

Despite the usefulness of comparative modeling, it is still not a common sequence analysis tool for the biologist, partly due to the lack of easy access to reliable and evaluated models.

Our ModBase [5, 11] database of comparative models attempts to resolve this problem.

Our results include the modeling of five procaryotic and eucaryotic genomes [11]. A calculation resulted in the models for substantial segments of 17.2%, 18.1%, 19.2%,

20.4%, and 15.7% of all proteins in the genomes of S. cerevisiae (6218 proteins in

the genome), E. coli (4290 proteins), M. genitalium (468 proteins), C. elegans (7299 proteins, incomplete), and M. jannaschii (1735 proteins), respectively. An important

feature of this study was an evaluation of all the models by a statistical potential

function. This allowed identification of those models that were likely to be based on correct

templates and at least approximately correct alignments. As a result, 236 yeast proteins

without any prior structural information were assigned to a particular fold family; 40 of these proteins did not have any prior functional annotation. All the alignments and models for the

five genomes are available on the Internet at URL http://guitar.rockefeller.edu. (The program

Modeller, used for sequence-structure alignment, model building, and model evaluation, can

be obtained from MSI, San Diego, CA.) The yeast models are also accessible through the

Saccharomyces Genome Database (URL http://genome-www.stanford.edu/Saccharomyces/).

Another database related to protein structure is the Swiss-Model database [20].

The database will make it efficient for both experts and non-experts to use comparative

models, allowing them to spend more time designing experiments. In addition, the automation is essential for access to models by non-experts. Finally, automation encourages

development of better methods. Comparative models in the database will be used in many

different ways, depending on their accuracy (pages 395-397 in ref. 5).

Typical applications of comparative modeling. Several applications are listed above. It is

frequently possible to extract more information from a comparative model than from the modeled sequence, or even from its alignment to related protein sequences or structures [5]. For example, while the preferred ligand for brain lipid binding protein can be predicted from

its comparative model, the ligand preferences cannot easily be predicted from the sequence or

its alignment to structurally defined fatty acid binding proteins [13]. Another example is

provided by several mouse mast cell proteases, which have a conserved surface region of

positively charged residues that binds proteoglycans [12]. This region is not easily recognizable in the sequence or its alignment to a known structure because the constituting residues occur at variable and sequentially non-local positions that form a binding site only

when the protease is fully folded.

Fold Assignment. Establishing a match between a given protein sequence and a well

characterized protein is perhaps the most frequent computational task in biology. In

particular, fold assignment generally allows a strong prediction of function and design of

experiments to test it. We argue that fold assignment by alignment, modeling, and model

evaluation, as proposed here, is a strategy that will reveal a significant number of weak relationships that are detected neither by multiple sequence comparison nor by threading. We

have already demonstrated that reliable models can be obtained for an additional 9%

of the yeast proteins, despite insignificant PSI-BLAST [22] matches to known structures on

which the models are based. These non-trivial matches increase the fold assignment coverage of the yeast proteins from 35% for PSI-BLAST to 44%, even though PSI-BLAST is

one of the most sensitive sequence matching programs. An underlying reason is that the evaluation of a 3D model based on an energy function is more sensitive than the evaluation of

a sequence alignment based on an amino acid residue substitution matrix. Similarly, there are also reasons to believe that the present invention will add to the fold assignments obtained by

threading alone, although this has not yet been demonstrated. These reasons are as follows:

(i) Multiple sequence-structure alignments obtained by PSI-BLAST and MODELLER for the

pipeline tend to be more accurate than the pairwise sequence-structure alignments obtained

by most current threading programs, because of the demonstrated usefulness of multiple sequences in alignment [23]. (ii) The model evaluated by the pipeline is more

accurate and more complete than the threading model even when the alignments are equal,

because the insertions, deletions, sidechains, rigid body shifts and distortions are modeled

explicitly. (iii) We can use more complex and therefore more accurate scoring functions for

model evaluation because only a few models per sequence are evaluated in the proposed approach, while threading has to evaluate on the order of 10⁹ structures for each

sequence-structure pair. We note that our approach does not replace multiple sequence

alignment and threading. In fact, it builds on them because they can be used to

propose candidate fold assignments for further processing by comparative model building

and model evaluation.

New applications. Large-scale modeling should encourage new kinds of applications for the

many resulting models, based on their large number and completeness at the level of the

family, organism, or functional network. For example, a collection of experimentally determined complexes of proteins with their ligands, aligned with comparative models for the rest of the family members, will permit a facile comparison of ligand binding requirements

and also reveal permitted substitutions in and around important residues. A specific example

of a new opportunity for tackling existing problems by virtue of providing many protein

models from many genomes is the selection of a target protein for which a drug needs to be developed. A good choice is a protein that is likely to have high ligand specificity; specificity

is important because specific drugs are less likely to be toxic. Large-scale modeling

facilitates imposing the specificity filter in target selection by enabling a structural comparison

of the ligand binding sites of many proteins, either human or from other organisms. Such

comparisons may make it possible to select rationally a target whose binding site is structurally most different from the binding sites of all the other proteins that may potentially

react with the same drug. For example, when a human pathogenic organism needs to be

inhibited, it may be possible to select as the target that pathogen's protein that is structurally

most different from all the human homologs. Alternatively, when a human metabolic

pathway needs to be regulated, the target identification could focus on that particular protein

in the pathway that has the binding site most dissimilar from its human homologs. For such

applications, comparative models of all sequences are needed, even if they are very similar to

each other.

Summary of the Invention

In a first general aspect, the invention comprises a computerized process of generating

a 3-dimensional model of a protein, the process comprising the steps of:

(1) an inputting step, wherein a query protein amino acid sequence is inputted into a

computer;

(2) a sequence search step, wherein one or more protein databases are searched so as

to identify potentially sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein;

(3) an alignment step, wherein for each sequence-related protein an optimal degree of alignment is created between the amino acid sequence of the query protein and that of each

sequence-related protein (at least one of which has a known 3-dimensional structure);

(4) a model-building step, wherein for each query sequence alignment obtained in the

alignment step, electronically stored retrievable information is created, said information

defining a model of a 3-dimensional structure for all or part of the query protein amino acid

sequence;

(5) a model evaluation step, wherein the good model that is probably closest in

structure to the actual structure of all or part of the query protein is selected from all models generated for the query amino acid sequence;

(6) a model storage step, wherein information generated in steps (2), (3), (4), and/or

(5) is stored electronically or electromagnetically such that said information is retrievable (to

generate data or images or structures that define a 3-dimensional protein structure).

The above process is at times referred to in this application as the "pipeline".
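
The six steps above can be sketched as a simple driver routine. This is only a minimal illustration under stated assumptions: the names `run_pipeline`, `Model`, and `PipelineResult` are hypothetical, the five callables stand in for the search, alignment, model-building, evaluation, and storage engines, and the convention that a higher score means a better model is assumed rather than taken from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Model:
    template_id: str
    alignment: Any
    coordinates: Any
    score: float  # assumed convention: higher means more likely correct


@dataclass
class PipelineResult:
    query: str
    hits: List[str] = field(default_factory=list)
    models: List[Model] = field(default_factory=list)
    best: Optional[Model] = None


def run_pipeline(
    query: str,
    search: Callable[[str], List[str]],
    align: Callable[[str, str], Any],
    build: Callable[[str, Any], Any],
    evaluate: Callable[[Any], float],
    store: Callable[[PipelineResult], None],
) -> PipelineResult:
    result = PipelineResult(query=query)            # step (1): input the query
    result.hits = search(query)                     # step (2): sequence search
    for template_id in result.hits:
        alignment = align(query, template_id)       # step (3): alignment
        coordinates = build(query, alignment)       # step (4): model building
        score = evaluate(coordinates)               # score used in step (5)
        result.models.append(Model(template_id, alignment, coordinates, score))
    if result.models:                               # step (5): select the best model
        result.best = max(result.models, key=lambda m: m.score)
    store(result)                                   # step (6): store the results
    return result
```

Passing the engines in as callables keeps the driver independent of any particular search or modeling program, which mirrors the way the text describes the steps separately from the software used to carry them out.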

In a preferred mode, the process is conducted such that a plurality (two or more) of

query protein amino acid sequences are processed in steps (1) through (6) in time periods that overlap each other. It is also preferred that when the processing of a query protein amino acid

sequence is completed, the process begins again with another query protein amino acid

sequence, unless of course, the processing of all query amino acid sequences is complete. As

a result, the process is preferably implemented in a looping manner with the responsibilities for executing the steps distributed in a controlled optimal manner on a plurality of processors in a single or many computers. For example, in a parallel processing mode, various

possibilities exist, one of them being that multiple sequences are processed in parallel through the entire process, another that multiple sequences are just processed through an

individual step before the next batch of sequences is processed.
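
The two parallel modes mentioned above might be sketched as follows. This is an assumption-laden illustration, not the scheduling scheme of the invention: `process_sequence` is a hypothetical stand-in for the whole process, the batch mode is one possible reading of "processed through an individual step", and a thread pool is used only to keep the sketch short (CPU-bound modeling would more plausibly run in separate processes or on separate machines).

```python
from concurrent.futures import ThreadPoolExecutor


def process_sequence(seq):
    """Hypothetical stand-in for steps (1) through (6) applied to one query."""
    return seq, len(seq)


def process_all(sequences, workers=4):
    # Mode 1: every sequence goes through the entire process independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sequence, sequences))


def process_in_batches(sequences, steps, batch_size=2, workers=4):
    # Mode 2: a batch is pushed through one step at a time; the members of
    # the batch are handled in parallel within each step.
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(sequences), batch_size):
            batch = sequences[start:start + batch_size]
            for step in steps:
                batch = list(pool.map(step, batch))
            results.extend(batch)
    return results
```

In both functions `pool.map` preserves input order, so results can be matched back to the query sequences by position.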

Both the computerized process of this invention and the system of this invention will

generate electronically or electromagnetically stored information that is retrievable to

generate data, 2-dimensional images (for example on paper or a computer screen), or structures (such as models) that represent a 3-dimensional protein or polypeptide structure.

Such stored information is another aspect of this invention as are such data, images, and

structures. Similarly, a database of the models and data created by the process is also an

aspect of the invention. A database that incorporates such models and data with other databases (with data such as sequence information, expression information, etc.) is also an aspect of the invention. A system of interacting databases, one of which is a database that

comprises the models and data generated by the present invention is also an aspect of this

invention.

The process of this invention can be extended by additional steps in which the electronically stored data of one or more proteins is used in combination with other

electronically stored data representing the structure of another molecule, such as a ligand, the

use being one in which the ability of the other molecule to bind to, fit with, or dock on, the

protein or proteins is investigated. Such studies can, for example, be used to identify

inhibitors or activators of enzymes; or activators or blockers of cell receptors.

The process can also be extended (and the data produced by it further used) generally

in protein structure analysis. Additionally, the data can be subject to further improvement by other model-building processes, model improvement processes, and protein structure analysis

methods. The structural data of a protein can be used to predict, interpret, and modify its

function and also design molecules with similar function. It can also be used to design ligands

and other compounds that modulate the protein's function.

The data and corresponding models produced can be used to annotate (assign information such as a structure and/or function) nucleic acid or protein sequences, for

example for database purposes. This information can be used in target selection in research

and drug development.

Figure 1 can be understood, for example by the following summary of a general

aspect of the invention, where the bold-faced numbers refer to the reference numbers in the

Figure:

Another general aspect of this invention is a system for accepting as input a query

amino acid sequence and carrying out the computerized process of the invention so as to

output a 3-dimensional model 11 of a protein with the query amino acid sequence, the system comprising:

(1) a collection of one or more query amino acid sequences 1;

(2) a collection 13 of one or more databases;

(3) a sequence matching engine 3 for searching one or more protein databases so as to

identify sequence-related proteins, those proteins that have an amino acid sequence exceeding

a pre-specified degree of sequence similarity with the query protein;

(4) an alignment engine 5 that, for each sequence-related protein, creates an optimal

degree of alignment between the amino acid sequence of the query protein and that of each sequence-related protein, said alignment identifying sequence gaps where one sequence has

no equivalent amino acid residues in the other;

(5) a model building engine 7 that, for each query sequence alignment obtained in the

alignment step, generates electronically stored retrievable information that defines a model of a 3-dimensional structure for all or part of the query protein amino acid sequence; and

(6) a model evaluation engine 9 that discriminates good models from bad models for

the query amino acid sequence, and sorts the good models according to their overall accuracy.

In a preferred embodiment, the interaction between the collections and engines is

under the control of a process control engine 21.

In general, an engine specified herein refers to a combination of a computer and software with appropriate instructions for the computer.

In the inputting step, most frequently, the sequence will be inputted as a text file.

Query sequences (also referred to as ORFs) from one or more organisms can be automatically downloaded from databases such as GENPEPT or TrEMBL using publicly

available software such as FTP (file-transfer-protocol) or Netscape.

Within the sequence search step, it is preferred that a search be done for sequences

showing some degree of similarity to the query amino acid sequence; preferably a database of non-redundant sequences is searched. All sufficiently similar sequences are collected and

represent a "PSSM profile" [22], or profile, of the query sequence (the "query sequence PSSM profile"). The search is preferably a stringent search (one in which the E value is 10⁻¹

or smaller, preferably about 5 × 10⁻³). The E value corresponding to a stringent search will

depend on the size of the database being searched (e.g., for a database of 500,000 sequences,

an E value of 5 × 10⁻³ would be typical). The search is preferably done with a large database of protein

amino acid sequences (e.g., of the order of 500,000 sequences or more; examples are

GENPEPT and TrEMBL). Within the sequence search step, it is also preferred that a search

be done for sequences showing some degree of similarity to each sequence of known 3-dimensional structure in one or more databases. For each such sequence of known 3-D

structure, all sufficiently similar sequences are collected and represent a 3-D sequence PSSM

profile. An example of such a database is the publicly accessible Protein Data Bank (PDB).

This search is also preferably a stringent search. It can be done for one edition of the PDB

and then at an appropriate time, days, weeks, or months later, be redone. As a result the 3-D

sequence PSSM profile can be done once and reused many times until the next time it is updated.

Also within the sequence search step, it is preferred that non-stringent searches be

done. A non-stringent search is one with an E value cutoff between 10⁻⁴ and 10,000,

typically with an E value cutoff of 100. In one set of such searches, the search is for

sequences related to the query sequence PSSM and the population searched are all the sequences in the 3-D sequence database collection. In another set of such searches, the search

is for sequences related to each 3-D sequence PSSM and each such 3-D sequence PSSM is

compared to the query sequence. The concept of reciprocal PSSM searches has been disclosed (Teichmann and Chothia, P.N.A.S. (1998)). The use of the low stringency searches is an important advantage of the present

invention. It is possible because of the quality of the alignment, model-building, and model

evaluation steps. The introduction of the low stringency searches increases the chance of

finding an amino acid sequence that although remote from a sequence point of view is

nevertheless associated with a 3-dimensional model structure that is close to or the same as

the correct structure. This is not only important for purposes of analyzing a single query sequence but also when the entire process is used to process thousands of query sequences in

automated fashion. In the latter case, it increases the chances that the process will correctly

analyze a large percentage of the input proteins.
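
The difference between the stringent and non-stringent searches can be illustrated by filtering one list of hits with two cutoffs. This is a minimal sketch: the cutoffs are the preferred and typical values quoted above (about 5 × 10⁻³ and 100), while the function name `filter_hits`, the hit identifiers, and the E values in the example list are invented for illustration.

```python
STRINGENT_E = 5e-3       # preferred stringent cutoff quoted in the text
NON_STRINGENT_E = 100.0  # typical non-stringent cutoff quoted in the text


def filter_hits(hits, e_cutoff):
    """Keep (identifier, E value) pairs whose E value is at or below the cutoff."""
    return [(hit_id, e_value) for hit_id, e_value in hits if e_value <= e_cutoff]


# Hypothetical search output: identifiers and E values are made up.
hits = [("hitA", 1e-30), ("hitB", 0.5), ("hitC", 50.0), ("hitD", 5000.0)]

stringent = filter_hits(hits, STRINGENT_E)
relaxed = filter_hits(hits, NON_STRINGENT_E)
```

The relaxed list is a superset of the stringent one; the extra candidates are the remote matches that the later alignment, model-building, and model evaluation steps are expected to confirm or reject.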

In the alignment step, the alignment preferably identifies sequence gaps where one sequence has no equivalent amino acid residues in the other (it is preferred to generate, for

each sequence-related protein, two possible alignments of the query sequence).

In the alignment step, a gap penalty function is preferably used. It is particularly preferred that the function be a variable gap penalty function. Highly preferred are variable

gap penalty functions that favor gaps in regions that have one or more of the following

characteristics: are solvent exposed, are curved, are outside secondary structure segments

(i.e., are random coil segments), are between two C_α positions close in space.
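
One way to picture a variable gap penalty of this kind is a base penalty that is lowered for every favourable structural feature at the position where the gap would open. The sketch below is purely illustrative: the multiplicative form, the `base` and `discount` values, and the names `PositionEnv` and `gap_penalty` are assumptions and are not taken from the text or from any existing program.

```python
from dataclasses import dataclass


@dataclass
class PositionEnv:
    solvent_exposed: bool
    curved: bool
    in_secondary_structure: bool
    ca_positions_close: bool  # the two flanking C-alpha positions are close in space


def gap_penalty(env, base=10.0, discount=0.2):
    """Illustrative penalty: each favourable feature lowers it by a fixed fraction."""
    favourable = sum([
        env.solvent_exposed,
        env.curved,
        not env.in_secondary_structure,
        env.ca_positions_close,
    ])
    return base * (1.0 - discount) ** favourable


# A buried, straight helix position versus an exposed, curved loop position.
buried_helix = PositionEnv(False, False, True, False)
exposed_loop = PositionEnv(True, True, False, True)
```

With these settings a gap in the exposed loop costs less than half as much as a gap in the buried helix, so an alignment algorithm using the penalty would tend to place gaps in structurally reasonable positions.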

In the model building step, it is preferred to use a method that employs a means for

satisfying spatial restraints on the Cartesian coordinates of an atom of a query protein amino

acid sequence, said restraints defined by a conditional probability distribution based on proteins of known 3-dimensional structure. It is further preferred that such a means take into account one or more (most preferably all) of the following types of restraints:

(1) an homology restraint, in which the most likely coordinates are those indicated by

the known 3-dimensional structures of the best matched amino acid sequences;

(2) physical and chemical restraints, such as:

(A) ideal bond lengths;

(B) the impossibility of two atoms being located in the same space;

(C) ideal bond angles;

(D) ideal improper dihedral angles; and

(E) ideal chiralities.

Values for implementing the chemical restraints are available from the CHARMM program on the world wide web at http://yuri.harvard.edu/charmm/charmm.html.

MODELLER, a software application for executing the model building step, is available

from MSI (San Diego, CA). MODELLER has been discussed in the scientific literature.

In the model evaluation step it is preferred to compare the 3-dimensional models

obtained in the model-building step to models previously identified as being "good" and to

models previously identified as being "bad". To generate such models, amino acid sequences of proteins of known 3-D structure are run as query sequences through the process

of this invention. A good model is one in which at least 30% of the Cα atoms are within 3.5

Å of the equivalent Cα atoms in the known "correct" structure after the model and the correct

structure have been optimally superimposed. Bad models have less than 15% equivalent Cα

atoms that meet that criterion. While and/or after the population of good and bad models has been created, a genetic algorithm (GA) is informed of the data and the GA generates a scoring function. The scoring function is described in more detail below; the code for its

implementation in the overall pipeline is included in Appendix 1.
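
The good/bad criterion can be expressed in a few lines once the model and the known structure have been superimposed. This is a minimal sketch under stated assumptions: the coordinates are assumed to be already optimally superimposed and listed in equivalent order, the fraction is taken over the equivalent Cα pairs supplied, and the label "intermediate" for models between the two thresholds is a hypothetical name, since the text defines only the good and bad classes.

```python
import math


def fraction_within(model_ca, native_ca, cutoff=3.5):
    """Fraction of equivalent C-alpha pairs no more than `cutoff` angstroms apart.

    Both inputs are sequences of (x, y, z) tuples, already superimposed.
    """
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff)
    return close / len(model_ca)


def classify(fraction):
    if fraction >= 0.30:
        return "good"
    if fraction < 0.15:
        return "bad"
    return "intermediate"  # hypothetical label for the range the text leaves open


# Toy coordinates: only the first of four C-alpha pairs is within 3.5 angstroms.
native = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (4.0, 0.0, 10.0), (8.0, 0.0, 10.0), (12.0, 0.0, 10.0)]
```

Models labelled this way for proteins of known structure form the training population from which a scoring function can then be derived.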

Preferred are scoring functions that are a function of protein compactness, sequence

identity (seq.ide), and z score; more preferred are those of the form 1 - a^x, where a is preferably a

function of the seq.ide (most preferably cos(seq.ide)) and x is preferably a function of

compactness, seq.ide, and the z score (most preferably (compactness + seq.ide)/z score).

Brief description of the drawings

Figure 1 shows the system of the invention.

Figure 2 shows in panel (A) the percentages of false positives and negatives as a function

of model sequence length and in panel (B) the percentage of structure overlap as a function of

percent sequence identity.

Figure 3 shows in panel (A) the distribution of the sequence identity between models

and in (B) the corresponding distribution of the alignment significance score.

Figure 4: Modeling a putative interaction of a predicted YDL117W SH3 domain with a

proline rich peptide.

Figure 5: Sample optimization of parameters u (y axis) and v (x axis) for the gap

penalty function.

Figure 6 shows a flow chart illustrating one example of embodying the process of

the invention.

Detailed description

Glossary

An "amino acid sequence" refers only to such a sequence; there are no considerations

of the three-dimensional structure of the protein. An amino acid sequence of interest may represent all or part of a polypeptide chain of a naturally occurring protein or human-

designed protein.

A "sequence similarity search" between two amino acid sequences, such as one

labeled a target sequence and another labeled a template sequence, can be illustrated by the following simple case: If one sequence is ATYHCP (using the one letter standard codes to

represent a sequence of amino acids) and the other sequence is also ATYHCP, then there is

a perfect match. If the other sequence is ATYLLL, then there is only a partial match. How

the degree of sequence similarity is scored depends to some extent on the search engine used to do the searching and its parameters.
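
The two cases above can be reproduced with a simple position-by-position comparison. This is only a sketch of the simplest possible score (percent identity over equal-length sequences); the function name `percent_identity` is hypothetical, and actual search engines use substitution matrices and statistical scores rather than this count.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions carrying the same residue in both sequences.

    Assumes the two sequences have equal length and are compared without gaps.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


print(percent_identity("ATYHCP", "ATYHCP"))  # 100.0 (perfect match)
print(percent_identity("ATYHCP", "ATYLLL"))  # 50.0 (partial match: ATY only)
```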

A "target sequence" is the sequence for which information is sought in a process. A

"template sequence" of known sequence is the sequence against which the target sequence is

compared for purposes of furthering the process.

A "gap penalty function" is used to identify gaps or discontinuities in an amino acid

sequence when one sequence is compared to another. For example if one sequence has a sequence ATYHCPLT and the other has the sequence ATYGSVMCPLT then there is a gap

in the first sequence corresponding to the underlined portion of the second sequence.
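
The region where the two example sequences fail to line up can be located with a generic string comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in; it is not a biological alignment method and applies no substitution matrix or gap penalty function, and the helper name `find_unmatched` is hypothetical.

```python
from difflib import SequenceMatcher


def find_unmatched(first, second):
    """Return (tag, segment_of_first, segment_of_second) for every non-matching block."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(find_unmatched("ATYHCPLT", "ATYGSVMCPLT"))
# [('replace', 'H', 'GSVM')]
```

The flanking segments ATY and CPLT match exactly, while the single residue H of the first sequence faces the four residues GSVM of the second, so any alignment of the two must leave three positions of the first sequence empty.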

The "generation of retrievable electronically stored information" refers to the

common operation in the execution of a computer program where the information is stored

in the computer's memory, for example either the primary memory or the "permanent"

memory (e.g., hard drive or tape).

The "E value" of a similarity search is the number of sequence matches that by

chance are expected to be as good as or better than the one retrieved by the search.

A "database of non-redundant database sequences" is a database in which only

unique known protein sequences are retained.
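
Building such a database amounts to discarding exact repeats. This is a minimal sketch assuming that "unique" means identical residue strings regardless of letter case; the function name `non_redundant` and the record layout of (identifier, sequence) pairs are hypothetical.

```python
def non_redundant(records):
    """Keep the first record seen for each distinct sequence.

    `records` is an iterable of (identifier, sequence) pairs.
    """
    seen = set()
    unique = []
    for record_id, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((record_id, sequence))
    return unique
```

For example, `non_redundant([("a", "ATY"), ("b", "aty"), ("c", "ATYH")])` keeps the records `a` and `c` and drops `b` as a repeat of `a`.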

Overview

The native three-dimensional (3D) structure of a protein is valuable in testing,

understanding, and modifying protein function. Thus, it would be useful to know the native structures of the thousands of protein sequences that are emerging from the many genome

projects. While 3D structures of only a tiny fraction of known protein sequences have been

defined by x-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy,

comparative modeling can frequently provide a useful 3D model of a protein.

Comparative modeling relies on the knowledge of related protein structures: It consists of fold assignment by comparison with all known protein structures, alignment with related

known protein structures, model building using the alignment, and model evaluation. The

present invention relates, in part, to the creation, maintenance, and facilitation of the use of an up-to-date database of accurate comparative models for all known protein

sequences that are related to at least one known protein structure. It is envisioned that

approximately 450,000 proteins can be processed initially, resulting in models for

approximately 150,000 proteins, growing at the rate of approximately 50,000 models per

year.

A database can be derived, for example, by an automated modeling "pipeline" relying on

PSI-BLAST and the program Modeller. Comparative models consist of 3D coordinates

for all nonhydrogen atoms in the modeled part of a protein. The database includes

the alignments used to obtain the models. The "pipeline" is designed to maximize both the

number and accuracy of 3-D models. This is achieved by (i) using multiple sequences and structures to increase the sensitivity of fold assignment and accuracy of the alignments, and

(ii) improving the model evaluation scheme to result in a smaller number of false

positives and negatives.

As in any comparative modeling exercise, models are generated in a four step

procedure [4, 5, 11]: (i) Fold assignment, (ii) sequence-structure alignment, (iii) model

building, and (iv) model evaluation. Large-scale comparative modeling is an automated and integrated application of these four steps to thousands of protein sequences, not only a

few. Because large-scale modeling can only be performed in a completely automated manner, the primary current challenge in large-scale comparative modeling is to build an

automated, rapid, robust, sensitive, and accurate comparative modeling pipeline applicable

to whole genomes; such a pipeline should perform at least as well as a human expert on individual proteins. The use of the CLUSTOR program (http://www.activetools.com) allows efficient and robust

execution on a cluster of processors so that approximately 450,000 known protein sequences

can be processed in a reasonable period of time with the existing and requested computers

(i.e., not longer than three months).

Various complete genomic sequences and databases of expressed sequence tags can be

processed.

Fold assignment. Fold assignment can, for example, be done using PSI BLAST.

Alignment. For example, the matched parts of a protein sequence to be modeled and a single related known protein structure can be aligned by the ALIGN2D command of

Modeller. This results in improved alignments due to the placement of gaps in

structurally reasonable contexts. In general, the accuracy of an alignment can also be

increased by comparing many related sequences and structures at the same time.

Model building. For example, model building can be done by Modeller using a single

template structure and the default modeling protocol. Because model accuracy generally

increases with the number of known protein structures used to calculate the model, the

pipeline allows multiple templates to be selected automatically for use by the existing

modeling method [24].

Model evaluation. (See the scoring function in Example 4.)

Web interface. For Web access to the database see Example 5. A preliminary, small version

of a database, ModBase, is already accessible at http://guitar.rockefeller.edu.

It is essential for the usefulness of the database that it be calculated with the most recent

versions of the TREMBL/SWISS-PROT protein sequence database [2] and the Protein Data

Bank of known protein structures [3].

System implementation

The invention can, for example, be implemented on a cluster of 32 processors (although only a

single processor is needed if the intention is to use the process of the invention one query

sequence at a time). This cluster is needed in order to help calculate the models for all protein sequences that are related to at least one known protein structure. Equally

important, it is also needed to keep the database of models up-to-date with respect to the

growth in the sequence and structure databases, and the improvements in the modeling software.

The size of the computer system is dictated by the following considerations: The calculation of a small ensemble of models for one protein sequence takes about one hour of CPU time on a single

Pentium III processor. The 32 processors are used for generating and maintaining an exhaustive

database of comparative models. The size of the computations involved is estimated as follows. Given the growth of the number of known protein sequences at a rate greater than 150,000

sequences per year, the growth of the Protein Data Bank at a rate of more than 4,000 structures

per year, and the significant improvements in our modeling software occurring once or twice a

year, a reasonable throughput to keep the database up-to-date is approximately 50,000 models

per 3 months. This allows for 1 hour of CPU time on a single processor per model, approximately the time needed for calculation of one model with the current Modeller

procedure. This CPU time includes all the steps in modeling a given protein sequence, from fold

assignment, alignment, and model building to model evaluation.
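The throughput estimate above can be checked with a few lines of arithmetic. This is a rough sketch that assumes "3 months" means 90 days of continuous operation on all 32 processors; the variable names are illustrative.

```python
PROCESSORS = 32              # cluster size stated in the text
DAYS = 90                    # "3 months", taken here as 90 days (assumption)
MODELS_PER_PERIOD = 50_000   # target throughput stated in the text

cpu_hours = PROCESSORS * DAYS * 24              # total CPU-hours available
hours_per_model = cpu_hours / MODELS_PER_PERIOD

print(cpu_hours)                  # 69120
print(round(hours_per_model, 2))  # 1.38
```

Under these assumptions the budget is about 1.4 CPU-hours per model, which is consistent with the stated 1 hour per model and leaves roughly a quarter of the capacity for reruns and downtime.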

A cluster of 16 2-processor boards, with 256 MB of RAM on each board, is offered by Alta

Technology, Salt Lake City. The Clustor node licenses, from Active Tools, San Francisco, are required for efficient distribution of the individual tasks to the 32 processors.

References referred to above

[1] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. J. Mol. Biol., 283:707-725, 1998.

[2] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence data bank and

its supplement TrEMBL in 1999. Nuc. Acids Res., 27:49-54, 1999.

[3] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng. Protein data bank.

In F. H. Allen, G. Bergerhoff, and R. Sievers, editors, Crystallographic databases -

Information, content, software systems, scientific applications, pages 107-132. Data

Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, 1987.

[4] R. Sanchez and A. Sali. Advances in comparative protein-structure modeling. Curr.

Opin. Struct. Biol., 7:206-214, 1997.

[5] R. Sanchez and A. Sali. Comparative protein structure modeling in genomics. J.

Comp. Phys., 151:388-401, 1999.

[6] A. Sali, R. Matsumoto, H. P. McNeil, M. Karplus, and R. L. Stevens. Three-dimensional models of four mouse mast cell chymases. Identification of proteoglycan-binding regions and protease-specific antigenic epitopes. J. Biol. Chem., 268:9023-9034, 1993.

[7] A. Caputo, M. N. G. James, J. C. Powers, D. Hudig, and R. C. Bleackley. Conversion of

the substrate specificity of mouse proteinase granzyme B. Nature Struct. Biol., 1:364-367, 1994.

[8] C. S. Ring, E. Sun, J. H. McKerrow, G. K. Lee, P. J. Rosenthal, I. D. Kuntz,

and F. E. Cohen. Structure-based inhibitor design by using protein models for the development of antiparasitic agents. Proc. Natl. Acad. Sci. USA, 90:3583-3587, 1993.

[9] M. Carson, C. E. Bugg, L. Delucas, and S. Narayana. Comparison of homology

models with the experimental structure of a novel serine protease. Acta Crystallogr.,

D50:889-899, 1994.

[10] T. Nagata, V. Gupta, W-Y. Kim, A. Sali, B. T. Chait, K. Shigesada, Y. Ito,

and M. H. Werner. Immunoglobulin motif DNA recognition and heterodimerization for

the PEBP2/CBF Runt-domain. Nat. Str. Biol., 6:615-619, 1999.

[11] R. Sanchez and A. Sali. Large-scale protein structure modeling of the

Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.

[12] R. Matsumoto, A. Sali, N. Ghildyal, M. Karplus, and R. L. Stevens. Packaging of proteases and proteoglycans in the granules of mast cells and other hematopoietic cells.

A cluster of histidines in mouse mast cell protease-7 regulates its binding to heparin

serglycin proteoglycan. J. Biol. Chem., 270:19524-19531, 1995.

[13] L. Z. Xu, R. Sanchez, A. Sali, and N. Heintz. Ligand specificity of brain lipid

binding protein. J. Biol. Chem., 271:24711-24719, 1996.

[14] A. Bairoch. PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res.,

20:2013-2018, 1992.

[15] J. S. Fetrow and J. Skolnick. Method for prediction of protein function from

sequence using the sequence-to-structure-to-function paradigm with application to

glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol., 281:949-968, 1998.

[16] D. T. Jones. Genthreader: An efficient and reliable protein fold recognition

method for genomic sequences. J. Mol. Biol., 287:797-815, 1999.

[18] C. A. Orengo, F. M. G. Pearl, J. E. Bray, A. E. Todd, A. C. Martin, L. Lo Conte, and J. M. Thornton. The CATH database provides insights into protein structure/function relationship. Nuc. Acids Res., 27:275-279, 1999.

[19] A. Sali. 100,000 protein structures for the biologist. Nature Structural Biology, 5:1029-1032, 1998.

[20] M. C. Peitsch. PROMOD and SWISS-MODEL - Internet-based tools for automated

comparative protein modeling. Biochem. Soc. Trans, 24:274-279, 1996.

[22] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J.

Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs. Nucleic Acids Res., 25:3389-3402, 1997.

[23] P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature

Structural Biology, 6:108-111, 1999.

[24] A. Sali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial

restraints. J. Mol. Biol., 234:779-815, 1993.

[25] S.F. Altschul. Generalized affine gap costs for protein sequence alignment. Proteins,

32:88-96, 1998.

[26] A. Bateman, E. Birney, R. Durbin, S. R. Eddy, R. D. Finn, and E. L. L. Sonnhammer.

Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nuc.

Acids Res., 27:260-262, 1999.

[28] M. J. Sippl. Recognition of errors in three-dimensional structures of proteins.

Proteins, 17:355-362, 1993.

[29] F. Corpet, J. Gouzy, and D. Kahn. Recent improvements of the prodom database of

protein domain families. Nuc. Acids Res., 27:263-267, 1999.

[30] S. E. Brenner, D. Barken, and M. Levitt. The presage database for structural

genomics. Nuc. Acids Res., 27:251-253, 1999.

[31] S. A. Chervitz, E. T. Hester, C. A. Ball, K. Dolinski, S. S. Dwight, M. A. Harris, G. Juvik, A. Malekian, S. Roberts, T. Roe, C. Scafe, M. Schroeder, G. Sherlock, S. Weng, Y.

Zhu, J. M. Cherry, and D. Botstein. Using the Saccharomyces Genome Database (SGD)

for analysis of protein similarities and structure. Nucleic Acids Research, 27:74-78,

1999.

Additional Publications of interest:

R. Sanchez and A. Sali. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.

A. Sali. 100,000 protein structures for the biologist. Nature Structural Biology,

5:1029-1032, 1998.

R. Sanchez and A. Sali. Comparative protein structure modeling in genomics. J.

Comp. Phys., 151:388-401, 1999.

Example 1

Large-scale protein structure modeling of the Saccharomyces cerevisiae genome

(Abbreviations: 3D, three-dimensional; PDB, Protein Data Bank; ORF, open reading

frame.)

(All examples in this patent application are intended to be illustrative of the

invention, not to narrow it.)

Summary

Fold assignment, sequence-structure alignment, model building, and model evaluation were

completely automated. As an illustration, the method was applied to the proteins in the Saccharomyces cerevisiae (baker's yeast) genome. It resulted in all-atom 3D models

for substantial segments of 1071 (17%) of the yeast proteins, only 40 of which have had their 3D structure determined experimentally. Of the 1071 modeled yeast proteins,

236 were related clearly to a protein of known structure for the first time; 41 of these have

not been previously characterized at all.

Sequence matching of the proteins encoded by the Saccharomyces cerevisiae (baker's yeast) genome [6] has resulted in assignment of 58% of the yeast proteins into 11 functional classes

with 93 sub-classes (URL http://www.mips.biochem.mpg.de/mips/yeast/index.html).

One way to add to sequence-based predictions of function would be to determine or

predict the three-dimensional (3D) structures of proteins. The 3D structure of a

protein generally provides more information about its function than its sequence because

interactions of a protein with other molecules are determined by amino acid residues that are close in space but are frequently distant in sequence. In addition, since evolution tends to conserve function and function depends more directly on structure than on

sequence, structure is more conserved in evolution than sequence [7]. The net result is

that patterns in space are frequently more recognizable than patterns in sequence. For

example, several mouse mast cell proteases have a conserved surface region of positively charged residues that binds proteoglycans [8]. This region is not easily

recognizable in the sequence because the constituting residues occur at variable and

sequentially non-local positions that form a binding site only when the protease is fully

folded. Approximately 10,000 protein structures have been determined experimentally by

X-ray crystallography and nuclear magnetic resonance spectroscopy [9] (URL http://www.pdb.bnl.gov/statistics.html), while there are over 450,000 entries in the

GenPept sequence database alone [10]. To bridge this increasingly large gap between the

numbers of known protein sequences and structures, we calculated useful all-atom 3D

models for a significant fraction of the translated open reading frames (ORFs) in the

yeast genome [6]. Specifically, we show how to automate modeling of thousands of

proteins and how to predict the overall accuracy of the models with a high degree of

certainty. We also discuss new ways of using a large number of protein models and point out several unexpected similarities between previously uncharacterized yeast ORFs and

proteins of known structure.

MATERIALS AND METHODS

Protein Structure Modeling Method. Comparative protein structure modeling [11,

12] was the method chosen for this study. Comparative protein structure modeling of a target sequence consists of (i) identification of known structures related to the target sequence

(templates), (ii) alignment of the templates with the target sequence, (iii) building a model

based on the alignment, and (iv) evaluation of the model. This flowchart has been

implemented in a UNIX Perl script that calls the appropriate programs for the individual tasks, each of which is described in more detail below. Program Clustor was used to

distribute efficiently smaller jobs on many workstations, without having to adapt the

individual programs for parallel execution (URL http://www.activetools.com). All the

alignments and models are available on the Internet at URL http://guitar.rockefeller.edu. The models are also accessible through the Saccharomyces Genome Database (SGD) (URL http://genome-www.stanford.edu/Saccharomyces/).

Databases. The 6218 Saccharomyces cerevisiae ORF sequences were obtained from the

SGD; Mycoplasma genitalium and Methanococcus jannaschii sequences were obtained from

The Institute for Genomic Research (URL http://www.tigr.org/tdb/mdb/mdb.html), Caenorhabditis elegans sequences from the Sanger Centre (URL

ftp://ftp.sanger.ac.uk/pub/databases/wormpep/), and Escherichia coli sequences from the E.

coli Genome Center (URL http://www.genetics.wisc.edu:80/index.html). Experimentally determined protein structures were obtained from the

Protein Data Bank (PDB) (March, 1997) [9].

Template Search. To find template structures for modeling of the translated ORF

sequences, each of the 6218 ORFs from yeast was compared with each of the 2045 potential templates

corresponding to the protein chains representative of the PDB. The representative protein

chains had at most 95% sequence identity to each other, or had length difference of at least

30 residues or 30%; they were also the highest quality structures within each group.

Although a small fraction of the yeast ORFs (< 7%) is likely to be incorrect [3], this is not a

serious limitation because an ORF which matches a known protein structure is likely to

correspond to a real protein. The matching was done by the program Align, which implements the local dynamic programming method with a new gap penalty function and has

a search sensitivity higher than that of Blast [13]. Each ORF-PDB matching was run with

the default gap penalty parameters first. A match was considered significant or insignificant

if the alignment score was more than 22 or less than 19 nats, respectively, where the nat is a

unit for measuring significance of a match [14]. All the pairs with intermediate matches with scores between 19 and 22 nats were realigned using 600 combinations of the

gap penalty parameters. The match was finally considered significant if the best of the 600

alignments had a score of at least 22 nats. The matching part of the PDB chain from a

significant hit was used as the template structure for the corresponding region of the ORF.

Target-Template Alignment. To obtain the target-template alignment for comparative

modeling, the matching parts of the template structure and the ORF sequence were

re-aligned by the use of the Align2d command of the Modeller program [15-17]. This

program implements a global dynamic programming method for comparison of two

sequences, but also relies on the observation that evolution tends to place residue insertions and deletions in the regions that are solvent exposed, curved, outside secondary structure

segments, and between two C_α positions close in space. Gaps in these structurally reasonable

positions are favored by a variable gap penalty function that is calculated from the template

structure alone. As a result, the alignment errors are reduced by approximately one third

relative to the standard sequence alignment techniques. Nevertheless, there is clearly a need for even more accurate sequence-structure alignments and for using multiple template

structures, so that more accurate models are obtained [16].
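The three-way significance rule of the Template Search step (accept above 22 nats, reject below 19 nats, realign the intermediate cases over a grid of gap penalties) can be sketched as follows. This is an illustration under stated assumptions, not the Align program itself: `rescore` stands in for a realignment with one gap-penalty combination, and `gap_grid` for the 600 combinations, whose actual values are not given in the text.

```python
from typing import Callable, Iterable, Tuple

ACCEPT_NATS = 22.0  # scores above this are significant (from the text)
REJECT_NATS = 19.0  # scores below this are insignificant (from the text)

GapParams = Tuple[float, float]  # hypothetical (open, extend) penalty pair


def is_significant(
    default_score: float,
    rescore: Callable[[GapParams], float],
    gap_grid: Iterable[GapParams],
) -> bool:
    """Decide whether an ORF-PDB match is significant."""
    if default_score > ACCEPT_NATS:
        return True
    if default_score < REJECT_NATS:
        return False
    # Intermediate case: realign with every gap-penalty combination and
    # keep the best score; an empty grid falls back to the default score.
    best = max((rescore(g) for g in gap_grid), default=default_score)
    return best >= ACCEPT_NATS
```

Only the intermediate matches pay for the expensive grid of realignments, so the clear hits and clear misses are settled by a single alignment with the default parameters.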

Model Building. The refined sequence-structure alignment was used by Modeller to construct a 3D model of the ORF region [15-17]. Model building began by extracting distance and dihedral angle restraints on the target sequence from its alignment with the

template structure. These template-derived restraints were combined with most of the

CHARMM energy terms [18] to obtain a full objective function. Finally, this function was optimized to construct a model that satisfied all the spatial restraints as well as possible.

Assignment of a model into the 'good' or 'bad' class. The overall accuracy of a model

was measured by an overlap between the model and the actual structure. The overlap was

defined as the fraction of residues whose C_α atoms are within 3.5 Å of each other in the globally superposed pair of structures. Models that overlap with the correct structures

in more than 30% of their residues were defined here as 'good' models. A method for

predicting whether or not a given model is good was developed as follows. Using the PDB,

1085 protein chains of known structure that had less than 30% sequence identity to each

other were picked. Comparative models for these proteins were calculated by the standard

procedure described above. In addition, many bad models were obtained by the same

procedure, except that only target-template alignments with a relatively low alignment significance score from 15 to 20 nats were used. In the end, there were 3993 and 6270 good and bad models, respectively. There were more models than proteins because

most proteins were modeled several times on a different template structure each time. The

distribution of the target-template sequence identity for the good models was similar to that

for the matching of the yeast ORFs with PDB chains (Fig 3A). The quality score

(Q_SCORE) of a model was defined as the ProsaII Z-score [19] divided by the natural

logarithm of sequence length, which made Q_SCORE almost independent of sequence length. The ProsaII Z-score approximates the difference in free energy of an evaluated

model and the mean free energy of the same sequence threaded through unrelated folds,

expressed in units of standard deviation. The free energies were calculated with statistical potentials of mean force for single residues and pairs of residues [19]. The distributions of Q_SCORE for good and bad models were obtained for different sequence length

ranges. The posterior probability that a model was good, given that it had a certain

Q_SCORE value, was obtained by using Bayes' theorem [20] and assuming equal

prior probabilities for good and bad models:

p(GOOD/Q_SCORE) = p(Q_SCORE/GOOD)/[p(Q_SCORE/GOOD) + p(Q_SCORE/BAD)].

A model with p(GOOD/Q_SCORE) above 0.5 is predicted to be in the good class and thus

have an at least approximately correct fold. For proteins longer than 100 residues, it is possible

to identify good models with less than 5% of false positives and 8% of false negatives (Fig.

2A).
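The Q_SCORE definition and the equal-prior posterior above can be illustrated with a short sketch. It assumes that the class-conditional distributions are estimated as simple histograms over training Q_SCORE values. The bin edges, the function names, and the fallback of 0.5 for empty bins are assumptions for illustration, not details taken from the text.

```python
import math
from bisect import bisect_right
from typing import Optional, Sequence


def q_score(z_score: float, length: int) -> float:
    """Z-score divided by the natural logarithm of the sequence length."""
    return z_score / math.log(length)


def bin_index(x: float, edges: Sequence[float]) -> Optional[int]:
    """Index of the half-open bin [edges[i], edges[i+1]) containing x."""
    i = bisect_right(edges, x) - 1
    return i if 0 <= i < len(edges) - 1 else None


def bin_fraction(q: float, samples: Sequence[float], edges: Sequence[float]) -> float:
    """Histogram estimate of p(Q_SCORE/class): share of samples in q's bin."""
    target = bin_index(q, edges)
    if target is None or not samples:
        return 0.0
    hits = sum(1 for s in samples if bin_index(s, edges) == target)
    return hits / len(samples)


def p_good(q: float, good: Sequence[float], bad: Sequence[float],
           edges: Sequence[float]) -> float:
    """Posterior p(GOOD/Q_SCORE) with equal priors for the two classes."""
    pg = bin_fraction(q, good, edges)
    pb = bin_fraction(q, bad, edges)
    if pg + pb == 0.0:
        return 0.5  # no training data in this bin: fall back to the prior
    return pg / (pg + pb)
```

A model would then be assigned to the good class when `p_good(...)` exceeds 0.5. In the procedure described above the distributions were obtained separately for different sequence length ranges, so in practice one such pair of histograms would be kept per length range.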

Prediction of the overall accuracy of a model. For the models predicted to be in the good

class, the fraction of the C_α atoms modeled within 3.5 Å of the correct positions depends

on the percentage sequence identity between the modeled sequence and the template. This dependence was determined by using the 3993 good models for proteins of known structure

described in the previous paragraph (Fig. 2B). Above 40% sequence identity, the median

overlap between a model and the corresponding experimental structure is more than 90%

(Fig. 2A). There are few errors in the alignment and the model is as close to the correct

structure as the template. Many models in this range have errors that are comparable to the differences between experimental structures of the same protein determined by different

techniques or in different environments [12]. For 30-40% sequence identity, the overlap

between a model and the corresponding experimental structure is 75-90%. Because the

alignment errors begin to appear, the models overlap with the correct structures less than the

templates do. At very low sequence identity of less than 30%, the overlap drops to 50-75%. These model evaluation results can be understood in terms of the well known relationship between structural and sequence similarities of two proteins [7], the "geometrical"

nature of modeling that forces the model to be as close to the template as possible [15], and

the inability of any current modeling procedure to recover from an incorrect alignment [16].

RESULTS

Template Search. The ORF-PDB matching procedure identified one or more possibly

related structures for 2256 or 36.3% of the ORFs (Fig. 3A). The average length of the local

alignments was 174 residues and the average pairwise sequence identity was 27%.

Evaluation of the Models. Model evaluation indicates that 1071 (17.2%) of the yeast ORFs

have at least one segment of residues with a reliable model (Fig. 3). A small number

of ORFs have a reliable model for more than one domain, resulting in the total of 1168 non-overlapping reliable models for all ORFs. In comparison, only 40 of the yeast

proteins have had their structures determined experimentally [9]. The average length of a

reliable model is 176 residues and 85% of the reliable models are longer than 50 residues.

The average pairwise sequence identity on which the reliable models are based is 34%. Most

of the models based on more than the average sequence identity are predicted to overlap

with the correct structures in more than 80% of their residues (Fig. 2B).

Modeling of Additional Genomes. The fraction of the ORFs in yeast that appear to be

modeled reliably is similar to that for several other genomes. The application of our modeling procedure to the genomes of Escherichia coli (4290 ORFs), Mycoplasma genitalium (468 ORFs), Caenorhabditis elegans (7299 ORFs, incomplete), and

Methanococcus jannaschii (1735 ORFs) resulted in the 3D models for 18.1%, 19.2%,

20.4%, and 15.7% of all ORFs, respectively. The E. coli number can be compared with

another study in which comparative models based on a suitable template were calculated for 10-15% of the E. coli ORFs [21].

Fold Assignment Rate. Fold recognition [22], sequence profile methods [23] and

Hidden Markov Models [24] are generally considered to be more sensitive for detecting

remote relationships than the local sequence alignment applied here. Thus, in the future,

these methods will supplement the matching by pairwise sequence comparison in our

pipeline for automated comparative protein structure modeling. However, it is not clear how many more accurate models can be calculated for the matches from the more

sophisticated methods. The reason is that accurate 3D modeling requires both a correct fold

assignment and an approximately correct target-template alignment. Unfortunately, it

appears that when a correct target-template match is made in the absence of statistically

significant sequence similarity already detectable by simple methods, it is rarely possible to produce an accurate alignment [25]. Nevertheless, we now estimate what would have

happened to the fold assignment rate alone if fold recognition and Hidden Markov Models

were applied to the yeast genome. A recent automated fold recognition survey assigned

folds to 103 (22.0%) of the 468 ORFs in the small Mycoplasma genitalium genome [26]. In comparison, our procedure resulted in reliable models for 90 of the 468 ORFs (19.2%), 81 of which were shared with the fold recognition survey. For another

benchmark, the PFAM database obtained by Hidden Markov Models [27] related 315 yeast

proteins to a protein with known structure, which is a relatively small fraction of the 1071

matches obtained here. We identified 263 of the 315 PFAM matches, and 248 of these corresponded to reliable models. Thus, fold recognition and Hidden Markov Models

would provide a small but significant increase in the number of target-template matches for

model evaluation by our combined alignment/modeling approach. However, even the

existing procedure based on local sequence alignment appears to be able to identify some matches that were not identified by fold recognition (there are nine such cases for the M. genitalium genome). The reason is that in the combined alignment/modeling procedure

the final decision about whether or not a given match is correct is made by evaluating the 3D

model implied by the alignment, rather than by scoring the alignment directly. Because

model evaluation works well (Fig. 2A), the cutoff for accepting a match at the sequence

matching stage can be lowered significantly, thus minimizing the loss of correct matches

without adding many false positives. This results in a relatively large number of reliable

models based on low sequence similarity (Fig. 3); for example, 261 yeast ORFs have at

least one reliable model based on a match with a significance score worse than 24

nats (Figure 3B), which is too low to establish a real relationship. The combined

alignment/modeling approach to confirming a remote relationship has already been proven

successful in several individual cases [16, 28, 29]. Another example is the model of the

component PRE4 of the yeast 20S proteasome complex (YFR050C). The model was based on the structure of subunit B of the Thermoplasma acidophilum (1pmaB) proteasome; the

target and the template have only 16% sequence identity, with the alignment significance

score of 22 nats. However, the model of YFR050C was predicted to be good (pG = 0.99).

The crystallographic structure of YFR050C, determined after the model was calculated

(1rypN) [38], showed that the fold assignment was correct.
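The counts quoted in the Fold Assignment Rate comparison can be cross-checked with simple set arithmetic. The sketch below uses only the numbers stated above and assumes that "shared" refers to the same ORFs in both sets; the combined total is derived here and is not a figure reported in the study.

```python
# M. genitalium comparison (counts from the text)
total_orfs = 468
fold_recognition = 103   # ORFs assigned a fold by the survey [26]
modeled = 90             # ORFs with reliable models from this procedure
shared = 81              # ORFs found by both approaches

only_modeling = modeled - shared              # found here, missed by the survey
only_survey = fold_recognition - shared       # found by the survey only
combined = fold_recognition + modeled - shared

# Yeast PFAM comparison (counts from the text)
pfam_matches = 315
pfam_identified = 263
pfam_missed = pfam_matches - pfam_identified

print(only_modeling, only_survey, combined)   # 9 22 112
print(round(100 * combined / total_orfs, 1))  # 23.9
print(pfam_missed)                            # 52
```

The nine ORFs found only by the combined alignment/modeling procedure match the "nine such cases" mentioned for the M. genitalium genome.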

DISCUSSION

Usefulness of Models with Errors. It is essential for assessing the value of 3D

protein models to estimate their overall accuracy [19, 30]. In general, mistakes in comparative modeling include sidechain packing errors, small distortions and rigid body

shifts in correctly aligned regions, errors in inserted regions (loops), incorrect alignments,

and incorrect templates [16]. Fortunately, a 3D model does not have to be absolutely perfect to be helpful in biology [11]. One reason is that knowing only the fold of a

protein is relatively frequently sufficient to predict its approximate biochemical function.

For example, only nine out of 80 fold families known in 1994 contained proteins

(domains) that were not in the same functional class, although 32% of all protein structures belonged to one of the nine superfolds [4]. A model is likely to have the correct fold when

the overlap with the actual structure is at least 30%. Such models are obtained when a

correct template and an approximately correct alignment are used. This appears to be the case for 1071 ORFs, as predicted by our model evaluation procedure (Figure 2). Models for

two yeast ORFs calculated before the actual structures were deposited to PDB are discussed in Sanchez and Sali, Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998).

Almost half of the 1071 reliably modeled ORFs share more than approximately 35%

sequence identity with their templates (Figure 3A). In such cases, it is frequently possible

to predict correctly important features of the target protein that do not occur in the template

structure. For example, the location of a binding site can be predicted from clusters of

charged residues [8], and the size of a ligand can be predicted from the volume of the

binding site cleft [31].

Usefulness of Comparative Models. Comparative models are calculated from a sequence

alignment between the protein to be modeled and a related protein of known structure. Thus,

a question arises as to what additional insights that are not already possible from sequence

matching alone can possibly be obtained by 3D modeling. The first advantage of 3D modeling is that it provides the best way of either confirming or rejecting a remote match [16], as discussed above. This is important because most of the related protein pairs share

less than 30% sequence identity (Fig. 3A). For example, only 10.7% of the yeast ORFs

have been matched reliably with known structures by Fasta (URL

http://pedant.mips.biochem.mpg.de/frishman/pedant.html), as opposed to 17.2% in our

study. Another case in point is that 236 of the 1071 yeast ORFs with reliable models had

no previously identified links to a protein of known structure in the major annotations

of the yeast genome, including Sacch3D, Pedant, GeneQuiz, and PFAM (Table 1). Of these

236 proteins for which some structural information is now available, 41 also did not have a

clear link to a protein sequence with known function. A subset of these 41 newly

characterized proteins is listed in Table 1. Additional confidence in these matches is

provided by the conservation of the known functionally important residues in the target

models.

The second advantage of 3D modeling over sequence matching is that some binding and

active sites cannot possibly be found by searching for local sequence patterns [32,33], but frequently should be detectable by searching for small 3D motifs that are known to bind or

act on specific ligands [34]. This is a consequence of the facts (i) that structure is more

conserved than sequence [7], (ii) that 3D motifs tend to consist of residues distant in sequence, and (iii) that there are some 3D motifs whose residues do not follow the same order in sequence, even though they have the same arrangement in space. An example of

this is the serine catalytic triad that almost certainly arose by convergent evolution in serine

proteases of the trypsin and subtilisin type, and also in some lipases [34]. The 3D motifs

could be defined in terms of features extracted from known protein-ligand structures, such as

the constituting atoms and distances between them, shape, secondary structure, and electrostatic properties. Enumeration of active and binding sites for many proteins in

the genome, such as various metal and nucleotide binding sites, will facilitate

experimental determination of protein function.
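A 3D motif defined by inter-atomic distances, as suggested above, can be searched for without reference to the order of residues in the sequence. The following is a minimal sketch for a three-residue motif. The coordinates, the residue labels, the tolerance, and the function name `find_motif` are all illustrative assumptions, and a real search would also use the other features listed (shape, secondary structure, electrostatics).

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def find_motif(atoms: Dict[str, Point], motif: Sequence[float],
               tol: float = 1.0) -> List[Tuple[str, str, str]]:
    """Return ordered triples (a, b, c) whose distances d(a,b), d(b,c), d(a,c)
    each match the corresponding entry of `motif` within `tol` angstroms."""
    hits = []
    for a, b, c in permutations(atoms, 3):
        d = (math.dist(atoms[a], atoms[b]),
             math.dist(atoms[b], atoms[c]),
             math.dist(atoms[a], atoms[c]))
        if all(abs(x - y) <= tol for x, y in zip(d, motif)):
            hits.append((a, b, c))
    return hits


# Made-up coordinates: three residues forming a 3-4-5 triangle plus a decoy.
atoms = {
    "Ser": (0.0, 0.0, 0.0),
    "His": (3.0, 0.0, 0.0),
    "Asp": (3.0, 4.0, 0.0),
    "Gly": (50.0, 50.0, 50.0),
}
print(find_motif(atoms, (3.0, 4.0, 5.0), tol=0.5))  # [('Ser', 'His', 'Asp')]
```

Because all orderings of the candidate residues are tried, the same spatial arrangement is found regardless of where the residues fall in the sequence, which is the property that makes such motifs detectable after convergent evolution.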

The third advantage of 3D modeling over sequence matching is that a 3D model

frequently allows a refinement of the functional prediction based on sequence alone because the ligand binding is most directly determined by the structure of the binding site rather than

its sequence. An example of this is provided by a predicted SH3 domain in the yeast ORF

YDL117W (Tab. 1). Since there are known 3D structures of SH3 domains bound to

proline-rich peptide ligands, it was possible to calculate a 3D model of such a complex for

the putative yeast SH3 domain (The 3-dimensional coordinates of the protein with the SH3 domain and those of the ligand are shown in tables B-2 and B-3; see also Fig. 4). Based on

the model, the SH3 residues that interact with the peptide were predicted. This model can then be used to construct site-directed mutants with altered or destroyed binding capacity,

which in turn could be used to test hypotheses about the sequence-structure-function

relationships for this SH3 domain. In addition, since the structural features of the putative

binding site are similar to the features of the well characterized SH3 domains, the model of

the complex increases the likelihood that an actual SH3 domain has been recognized,

irrespective of the specific peptide ligand modeled into the SH3 cleft.

Conclusion. Our results show that comparative modeling efficiently increases the value

of sequence information from the genome projects, although it is not yet possible to model

all proteins with useful accuracy. The main bottlenecks are the absence of structurally

defined members in many protein families and the difficulties in detection of weak similarities, both for fold recognition and sequence-structure alignment. However, while

only 900 out of the total of a few thousand domain folds are known [35, 36], the

structure of most globular folds is likely to be determined in less than ten years [35]. Thus, comparative modeling will conceivably be applicable to most of the globular protein

domains close to the completion of the human genome project.

The computations were done on Silicon Graphics, SUN, DEC and Linux PC computers.

References for Example 1

[1] Oliver, S. G. (1996) Nature 379, 597-600.

[2] Koonin, E. V & Mushegian, A. R. (1996) Curr. Opin. Gen. Dev. 6, 757-762.

[3] Dujon, B. (1996) Trends Genet. 12, 263-270.

[4] Orengo, C. A, Jones, D. T, & Thornton, J. M. (1994) Nature 372, 631-634.

[5] Miklos, G. L. G & Rubin, G. M. (1996) Cell 86, 521-529.

[6] Goffeau, A, Barrell, B. G, Bussey, H, Davis, R. W, Dujon, B, Feldmann, H, Galibert, F,

Hoheisel, J. D, Jacq, C, Johnston, M, Louis, E. J, Mewes, H. W, Murakami, Y, Philippsen,

P, Tettelin, H, & Oliver, S. G. (1996) Science 274, 563-567.

[7] Chothia, C & Lesk, A. M. (1986) EMBO J. 5, 823-826.

[8] Matsumoto, R, Sali, A, Ghildyal, N, Karplus, M, & Stevens, R. L. (1995) J. Biol.

Chem. 270, 19524-19531.

[9] Abola, E. E, Bernstein, F. C, Bryant, S. H, Koetzle, T, & Weng, J. (1987) in Crystallographic databases - Information content, software systems, scientific applications,

eds. Allen, F. H, Bergerhoff, G, & Sievers, R. (Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester), pp. 107-132.

[10] Benson, D. A, Boguski, M. S, Lipman, D. J, Ostell, J, & Ouellette, B. F. F. (1997) Nucleic Acids Res. 26, 1-7.

[11] Johnson, M. S, Srinivasan, N, Sowdhamini, R, & Blundell, T. L. (1994) CRC

Crit. Rev. Biochem. Mol. Biol. 29, 1-68.

[12] Sanchez, R & Sali, A. (1997) Curr. Opin. Struct. Biol. 7, 206-214.

[13] Altschul, S. F. (1998) Proteins 32, 88-96.

[14] Altschul, S. F & Gish, W. (1996) Methods Enzymol 266, 460-480.

[15] Sali, A & Blundell, T. L. (1993) J. Mol. Biol. 234, 779-815.

[16] Sanchez, R & Sali, A. (1997) Proteins Suppl. 1, 50-58.

[17] Dunbrack Jr., R. L, Gerloff, D. L, Bower, M, Chen, X, Lichtarge, O, & Cohen, F. E.

(1997) Folding & Design 2, R27-R42.

[18] Brooks, B. R, Bruccoleri, R. E, Olafson, B. D, States, D. J, Swaminathan, S, &

Karplus, M. (1983) J. Comp. Chem. 4, 187-217.

[19] Sippl, M. J. (1993) Proteins 17, 355-362.

[20] Box, G. E. P & Tiao, G. C. (1992) Bayesian Inference in Statistical Analysis. (Wiley-Interscience).

[21] Peitsch, M. C, Wilkins, M. R, Tonella, L, Sanchez, J. C, Appel, R. D, & Hochstrasser, D. F. (1997) Electrophoresis 18, 498-501.

[22] Bowie, J. U, Lüthy, R, & Eisenberg, D. (1991) Science 253, 164-170.

[23] Altschul, S. F, Madden, T. L, Schaffer, A. A, Zhang, J. Z, Miller, W, & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402.

[24] Krogh, A, Brown, M, Mian, I. S, Sjolander, K, & Haussler, D. (1994) J. Mol. Biol. 235, 1501-1531.

[25] Levitt, M. (1997) Proteins Suppl. 1, 92-104.

[26] Fischer, D & Eisenberg, D. (1997) Proc. Natl. Acad. Sci. USA 94, 11929-11934.

[27] Sonnhammer, E. L. L, Eddy, S. R, & Durbin, R. (1997) Proteins 28, 405-420.

[28] Guenther, B, Onrust, R, Sali, A, O'Donnell, M, & Kuriyan, J. (1997) Cell 91,

335-345.

[29] Wolf, E, Vassilev, A, Makino, Y, Sali, A, Nakatani, Y, & Burley, S. K. (1998) Cell

94, 51-61.

[30] Luthy, R, Bowie, J. U, & Eisenberg, D. (1992) Nature 356, 83-85.

[31] Xu, L. Z, Sanchez, R, Sali, A, & Heintz, N. (1996) J. Biol. Chem. 271, 24711-24719.

[32] Bairoch, A. (1992) Nucl. Acids Res. 20, 2013-2018.

[33] Pawson, T. (1995) Nature 373, 573-580.

[34] Wallace, A, Borkakoti, N, & Thornton, J. M. (1997) Protein Sci. 6, 2308-2323.

[35] Holm, L & Sander, C. (1996) Science 273, 595-602.

[36] Hubbard, T. J. P, Murzin, A. G, Brenner, S. E, & Chothia, C. (1997) Nucleic Acids

Research 25, 236-239.

[37] Shilton, B. H, Li, Y, Tesier, D, Thomas, D. Y, & Cygler, M. (1996) Protein Sci. 5,

395.

[38] Groll, M, Ditzel, L, Lowe, J, Stock, D, Bochtler, M, Bartunik, H. D, & Huber,

R. (1997) Nature 386, 463.

[39] Sicheri, F & Kuriyan, J. (1997) Curr. Opin. Struct. Biol. 7, 777-785.

[40] Musacchio, A, Saraste, M, & Wilmanns, M. (1994) Nat. Str. Biol. 1, 546-551.

[41] Nicholls, A, Sharp, K. A, & Honig, B. (1991) Proteins 11, 281-296.

[42] Wallace, A. C, Laskowski, R. A, & Thornton, J. M. (1995) Protein Engineering 8,

127-134.

Table 1: Examples of previously uncharacterized yeast proteins. These ORFs do not have clear similarity to any protein of known function according to the following sources (October 31, 1997): MIPS (URL http://www.mips.biochem.mpg.de/mips/yeast/index.html), YPD (URL http://quest7.proteome.com/YPDhome.html), GeneQuiz (http://www.sander.ebi.ac.uk/genequiz), Sacch3D (http://.www.sander.ebi.ac.uk/genequiz), Pedant (URL http://pedant.mips.biochem.mpg.de/frishman pedant.html), and PFAM [27]. The examples were selected partly by considering conservation of the functionally important residues (conserved features). Thus they have higher sequence similarity to known protein structures than most of the other previously uncharacterized yeast proteins. For each ORF and its corresponding template, the starting and ending residues of the matching regions are indicated. The number in parentheses in the percent sequence identity column is the alignment significance score in nats [13]. The overall model accuracy is given by p(GOOD/Q_SCORE). The complete list of 236 previously uncharacterized yeast proteins with reliable models is available at http://guitar.rockefeller.edu.

Table 1

Examples of previously uncharacterized yeast proteins with reliable models

Yeast protein            Related protein of known 3D structure
ORF        Residues      PDB code   Residues    Name
YDL117W    13-64         1lckA      65A-115A    P56-LCK SH3 domain
YCR033W    885-935       1idz       140-190     C-MYB DNA binding domain
YNL181W    44-341        1fmcA      2A-215A     7-α-hydroxysteroid dehydrogenase
YOR221C    124-368       1mla       87-296      malonyl-CoA ACP transacylase
YPL217C    63-182        1etu       5-145       elongation factor Tu (domain I)

Table 1 (continued)

Yeast protein            Percent sequence   Model
ORF        Residues      identity           accuracy   Conserved features
YDL117W    13-64         30 (24.5)          0.97       W31 conserved; other binding residues conserved or similar.
YCR033W    885-935       21 (22.3)          0.99       N interacting with DNA is conserved; K's replaced by R's.
YNL181W    44-341        14 (25.5)          0.98       K163 conserved; Y159F.
YOR221C    124-368       17 (23.7)          0.95       Active site residues S92, R117, and H201 are conserved.
YPL217C    63-182        22 (22.7)          0.86       GTP binding loops are similar. Conserved GKTTL motif.

Table 2

Res. Name Atom x y z    Res. Name Atom x y z

1 LYS N 8.73 -1.21 17.5    27 ILE N 1.009 10.3 13.514
1 LYS CA 9.955 -1.882 17.01    27 ILE CA 0.728 9.488 12.369
1 LYS CB 10.104 -3.274 17.65    27 ILE CB 0.405 8.059 12.7
1 LYS CG 10.136 -3.27 19.17    27 ILE CG2 -0.736 8.049 13.737
1 LYS CD 11.29 -2.47 19.78    27 ILE CG1 0.073 7.296 11.407
1 LYS CE 11.255 -2.468 21.32    27 ILE CD1 1.196 7.277 10.385
1 LYS NZ 12.492 -1.872 21.86    27 ILE C -0.423 10.023 11.583
1 LYS C 9.849 -2.097 15.54    27 ILE O -1.558 10.066 12.053
1 LYS O 8.754 -2.081 14.98    28 ALA N -0.123 10.476 10.35
2 ALA N 10.998 -2.299 14.87    28 ALA CA -1.128 10.859 9.406
2 ALA CA 10.93 -2.547 13.46    28 ALA CB -0.539 11.46 8.119
2 ALA CB 12.091 -1.923 12.67    28 ALA C -1.818 9.586 9.035
2 ALA C 11.017 -4.026 13.29    28 ALA O -3.036 9.541 8.858
2 ALA O 11.974 -4.664 13.73    29 GLY N -1.011 8.508 8.936
3 ARG N 9.967 -4.617 12.7    29 GLY CA -1.472 7.196 8.583
3 ARG CA 9.951 -6.025 12.46    29 GLY C -0.771 6.717 7.342
3 ARG CB 8.548 -6.588 12.16    29 GLY O -0.484 5.528 7.22
3 ARG CG 7.785 -5.932 11.01    30 SER N -0.491 7.605 6.368
3 ARG CD 6.346 -6.446 10.93    30 SER CA 0.198 7.157 5.184
3 ARG NE 5.879 -6.614 12.34    30 SER CB 0.324 8.263 4.12
3 ARG CZ 4.973 -7.572 12.67    30 SER OG -0.963 8.664 3.673
3 ARG NH1 4.416 -8.362 11.7    30 SER C 1.592 6.754 5.56
3 ARG NH2 4.614 -7.759 13.97    30 SER O 1.956 5.579 5.549
3 ARG C 10.892 -6.377 11.35    31 TRP N 2.406 7.766 5.919
3 ARG O 11.51 -7.437 11.37    31 TRP CA 3.758 7.561 6.347
4 TYR N 11.022 -5.494 10.34    31 TRP CB 4.821 8.207 5.426
4 TYR CA 11.873 -5.83 9.235    31 TRP CG 4.993 7.527 4.075
4 TYR CB 11.105 -6.078 7.928    31 TRP CD2 5.951 6.479 3.819
4 TYR CG 10.235 -7.269 8.131    31 TRP CD1 4.336 7.742 2.904
4 TYR CD1 10.717 -8.542 7.935    31 TRP NE1 4.819 6.9 1.923
4 TYR CD2 8.93 -7.109 8.528    31 TRP CE2 5.809 6.121 2.476
4 TYR CE1 9.911 -9.64 8.125    31 TRP CE3 6.862 5.874 4.628
4 TYR CE2 8.113 -8.196 8.722    31 TRP CZ2 6.593 5.139 1.928
4 TYR CZ 8.606 -9.464 8.519    31 TRP CZ3 7.651 4.883 4.077
4 TYR OH 7.773 -10.585 8.715    31 TRP CH2 7.516 4.525 2.749
4 TYR C 12.789 -4.683 8.959    31 TRP C 3.824 8.217 7.687
4 TYR O 12.542 -3.555 9.374    31 TRP O 3.134 9.208 7.931
5 GLY N 13.91 -4.974 8.266    32 PHE N 4.634 7.657 8.604
5 GLY CA 14.816 -3.93 7.899    32 PHE CA 4.727 8.178 9.937
5 GLY C 14.225 -3.242 6.713    32 PHE CB 5.01 7.104 11.004
5 GLY O 13.528 -3.864 5.91    32 PHE CG 3.818 6.259 11.263
6 TRP N 14.482 -1.927 6.57    32 PHE CD1 3.398 5.323 10.351
6 TRP CA 13.966 -1.249 5.42    32 PHE CD2 3.141 6.388 12.45
6 TRP CB 12.664 -0.476 5.69    32 PHE CE1 2.298 4.542 10.613
6 TRP CG 11.902 -0.047 4.459    32 PHE CE2 2.046 5.605 12.72
6 TRP CD2 11.907 1.279 3.909    32 PHE CZ 1.619 4.677 11.799
6 TRP CD1 11.069 -0.789 3.674    32 PHE C 5.913 9.081 10.001
6 TRP NE1 10.562 -0.01 2.663    32 PHE O 6.851 8.959 9.215
6 TRP CE2 11.066 1.265 2.796    33 TYR N 5.888 10.027 10.959
6 TRP CE3 12.554 2.416 4.295    33 TYR CA 7.029 10.864 11.16
6 TRP CZ2 10.861 2.389 2.052    33 TYR CB 6.687 12.36 11.286
6 TRP CZ3 12.347 3.549 3.541    33 TYR CG 7.97 13.113 11.228
6 TRP CH2 11.516 3.533 2.441    33 TYR CD1 8.513 13.436 10.007
6 TRP C 15.019 -0.284 4.995    33 TYR CD2 8.628 13.498 12.373
6 TRP O 15.772 0.226 5.824    33 TYR CE1 9.696 14.131 9.926
7 SER N 15.11 -0.024 3.677    33 TYR CE2 9.812 14.194 12.299
7 SER CA 16.128 0.868 3.212    33 TYR CZ 10.348 14.512 11.074
7 SER CB 16.951 0.296 2.041    33 TYR OH 11.563 15.225 11
7 SER OG 17.954 1.219 1.64    33 TYR C 7.571 10.392 12.47
7 SER C 15.427 2.089 2.712    33 TYR O 6.83 10.281 13.446
7 SER O 14.465 2 1.948    34 GLY N 8.877 10.062 12.526
8 GLY N 15.901 3.269 3.156    34 GLY CA 9.35 9.5 13.758
8 GLY CA 15.296 4.502 2.749    34 GLY C 10.794 9.818 13.996
8 GLY C 15.546 4.672 1.291    34 GLY O 11.495 10.367 13.145
8 GLY O 16.69 4.694 0.836    35 LYS N 11.242 9.451 15.215
9 GLN N 14.449 4.758 0.519    35 LYS CA 12.57 9.622 15.72
9 GLN CA 14.565 4.948 -0.889    35 LYS CB 12.585 10.285 17.107
9 GLN CB 13.236 4.681 -1.616    35 LYS CG 13.93 10.191 17.83
9 GLN CG 13.335 4.803 -3.133    35 LYS CD 14.067 11.109 19.045
9 GLN CD 12.03 4.288 -3.724    35 LYS CE 15.381 10.917 19.8
9 GLN OE1 11.658 4.649 -4.838    35 LYS NZ 15.385 11.738 21.031
9 GLN NE2 11.317 3.415 -2.961    35 LYS C 13.16 8.267 15.916
9 GLN C 15.005 6.353 -1.175    35 LYS O 12.499 7.344 16.391
9 GLN O 15.909 6.575 -1.986    36 LEU N 14.447 8.123 15.566
10 THR N 14.403 7.331 -0.462    36 LEU CA 15.105 6.862 15.739
10 THR CA 14.609 8.724 -0.735    36 LEU CB 16.167 6.568 14.664
10 THR CB 13.313 9.415 -1.075    36 LEU CG 15.584 6.482 13.24
10 THR OG1 13.537 10.723 -1.581    36 LEU CD2 14.406 5.493 13.183
10 THR CG2 12.422 9.454 0.178    36 LEU CD1 16.678 6.178 12.201
10 THR C 15.249 9.407 0.444    36 LEU C 15.825 6.923 17.04
10 THR O 15.693 8.775 1.402    36 LEU O 16.601 7.844 17.29
11 LYS N 15.372 10.747 0.332    37 LEU N 15.567 5.937 17.919
11 LYS CA 15.938 11.634 1.309    37 LEU CA 16.252 5.898 19.175
11 LYS CB 15.993 13.083 0.783    37 LEU CB 15.804 4.749 20.089
11 LYS CG 16.576 14.097 1.772    37 LEU CG 14.354 4.896 20.57
11 LYS CD 16.779 15.504 1.194    37 LEU CD2 14.059 6.337 21.02
11 LYS CE 18.057 15.671 0.371    37 LEU CD1 14.029 3.852 21.65
11 LYS NZ 18.22 17.091 -0.019    37 LEU C 17.687 5.684 18.854
11 LYS C 15.102 11.67 2.554    37 LEU O 18.571 6.217 19.521
11 LYS O 15.63 11.629 3.663    38 ARG N 17.945 4.879 17.807
12 GLY N 13.767 11.761 2.391    38 ARG CA 19.292 4.623 17.409
12 GLY CA 12.856 11.95 3.492    38 ARG CB 19.514 3.243 16.761
12 GLY C 12.757 10.787 4.44    38 ARG CG 20.979 2.983 16.404
12 GLY O 12.737 10.978 5.657    38 ARG CD 21.862 2.755 17.634
13 ASP N 12.672 9.548 3.93    38 ARG NE 23.284 2.875 17.205
13 ASP CA 12.423 8.424 4.795    38 ARG CZ 24.27 2.899 18.151
13 ASP CB 11.711 7.274 4.065    38 ARG NH1 23.952 2.717 19.465
13 ASP CG 12.52 6.935 2.827    38 ARG NH2 25.567 3.125 17.788
13 ASP OD1 13.666 7.442 2.7    38 ARG C 19.637 5.635 16.372
13 ASP OD2 11.985 6.186 1.97    38 ARG O 18.863 5.941 15.466
13 ASP C 13.661 7.934 5.489    39 ASN N 20.845 6.19 16.489
13 ASP O 14.785 8.263 5.109    39 ASN CA 21.337 7.172 15.575
14 LEU N 13.47 7.134 6.567    39 ASN CB 21.086 6.797 14.101
14 LEU CA 14.603 6.632 7.297    39 ASN CG 21.747 7.84 13.207
14 LEU CB 14.475 6.597 8.829    39 ASN OD1 21.116 8.804 12.779
14 LEU CG 14.062 7.919 9.468    39 ASN ND2 23.063 7.655 12.915
14 LEU CD2 14.373 7.96 10.97    39 ASN C 20.667 8.47 15.852
14 LEU CD1 12.589 8.163 9.163    39 ASN O 21.069 9.494 15.3
14 LEU C 14.81 5.196 6.95    40 LYS N 19.673 8.459 16.769
14 LEU O 13.88 4.487 6.568    40 LYS CA 18.999 9.657 17.185
15 GLY N 16.067 4.731 7.087    40 LYS CB 19.913 10.584 18.005
15 GLY CA 16.382 3.354 6.86    40 LYS CG 19.221 11.827 18.574
15 GLY C 16.545 2.779 8.223    40 LYS CD 20.094 12.608 19.565
15 GLY O 16.927 3.486 9.156    40 LYS CE 19.504 13.954 19.988
16 PHE N 16.246 1.478 8.388    40 LYS NZ 20.478 14.712 20.797
16 PHE CA 16.36 0.946 9.708    40 LYS C 18.579 10.39 15.956
16 PHE CB 15.12 1.235 10.57    40 LYS O 18.669 11.614 15.897
16 PHE CG 13.906 0.831 9.807    41 LYS N 18.104 9.648 14.936
16 PHE CD1 13.456 -0.469 9.813    41 LYS CA 17.77 10.297 13.707
16 PHE CD2 13.218 1.773 9.075    41 LYS CB 17.216 9.359 12.626
16 PHE CE1 12.33 -0.817 9.106    41 LYS CG 17.436 9.899 11.212
16 PHE CE2 12.092 1.431 8.367    41 LYS CD 16.977 11.342 11.003
16 PHE CZ 11.646 0.133 8.386    41 LYS CE 17.424 11.938 9.664
16 PHE C 16.611 -0.519 9.676    41 LYS NZ 17.159 13.394 9.637
16 PHE O 16.405 -1.197 8.671    41 LYS C 16.693 11.255 14.041
17 LEU N 17.103 -1.025 10.82    41 LYS O 16.678 12.365 13.53
17 LEU CA 17.395 -2.409 10.99    42 CYS N 15.719 10.804 14.849
17 LEU CB 18.75 -2.647 11.68    42 CYS CA 14.712 11.656 15.406
17 LEU CG 19.948 -2.132 10.86    42 CYS CB 15.225 12.597 16.521
17 LEU CD2 21.285 -2.604 11.45    42 CYS SG 16.477 13.803 15.98
17 LEU CD1 19.894 -0.605 10.66    42 CYS C 14.057 12.446 14.326
17 LEU C 16.321 -2.93 11.88    42 CYS O 13.455 13.486 14.587
17 LEU O 15.732 -2.186 12.67    43 SER N 14.131 11.967 13.074
18 GLU N 16.016 -4.231 11.77    43 SER CA 13.5 12.714 12.037
18 GLU CA 14.984 -4.794 12.58    43 SER CB 14.321 13.925 11.57
18 GLU CB 14.786 -6.293 12.31    43 SER OG 14.545 14.812 12.654
18 GLU CG 14.342 -6.591 10.87    43 SER C 13.385 11.82 10.864
18 GLU CD 14.608 -8.068 10.6    43 SER O 14.302 11.061 10.548
18 GLU OE1 15.09 -8.765 11.54    44 GLY N 12.232 11.893 10.183
18 GLU OE2 14.341 -8.513 9.454    44 GLY CA 12.107 11.117 8.995
18 GLU C 15.446 -4.653 13.99    44 GLY C 10.779 10.445 9
18 GLU O 16.639 -4.755 14.28    44 GLY O 10.012 10.536 9.956
19 GLY N 14.505 -4.386 14.92    45 TYR N 10.52 9.685 7.918
19 GLY CA 14.838 -4.294 16.31    45 TYR CA 9.284 8.986 7.76
19 GLY C 15.061 -2.863 16.69    45 TYR CB 8.478 9.402 6.517
19 GLY O 15.207 -2.553 17.87    45 TYR CG 7.964 10.783 6.731
20 ASP N 15.087 -1.946 15.71    45 TYR CD1 8.775 11.88 6.55
20 ASP CA 15.309 -0.56 16.02    45 TYR CD2 6.649 10.972 7.089
20 ASP CB 15.615 0.29 14.77    45 TYR CE1 8.273 13.145 6.748
20 ASP CG 16.229 1.619 15.19    45 TYR CE2 6.146 12.235 7.288
20 ASP OD1 16.37 1.853 16.42    45 TYR CZ 6.962 13.325 7.119
20 ASP OD2 16.568 2.419 14.28    45 TYR OH 6.454 14.626 7.319
20 ASP C 14.055 -0.042 16.67    45 TYR C 9.587 7.535 7.592
20 ASP O 12.964 -0.528 16.37    45 TYR O 10.615 7.157 7.031
21 ILE N 14.177 0.969 17.55    46 PHE N 8.69 6.683 8.125
21 ILE CA 13.032 1.495 18.25    46 PHE CA 8.814 5.26 7.992
21 ILE CB 13.237 1.679 19.73    46 PHE CB 9.343 4.538 9.249
21 ILE CG2 11.932 2.268 20.29    46 PHE CG 8.457 4.829 10.405
21 ILE CG1 13.654 0.371 20.42    46 PHE CD1 8.694 5.933 11.19
21 ILE CD1 15.088 -0.051 20.1    46 PHE CD2 7.404 3.998 10.711
21 ILE C 12.787 2.877 17.74    46 PHE CE1 7.888 6.213 12.267
21 ILE O 13.73 3.622 17.47    46 PHE CE2 6.592 4.272 11.787
22 MET N 11.505 3.259 17.57    46 PHE CZ 6.834 5.381 12.564
22 MET CA 11.258 4.583 17.1    46 PHE C 7.459 4.729 7.671
22 MET CB 10.837 4.63 15.62    46 PHE O 6.444 5.385 7.917
22 MET CG 11.927 4.165 14.65    47 PRO N 7.431 3.561 7.096
22 MET SD 12.265 2.377 14.69    47 PRO CA 6.193 2.942 6.723
22 MET CE 13.56 2.436 13.42    47 PRO CD 8.552 3.049 6.329
22 MET C 10.146 5.184 17.88    47 PRO CB 6.549 1.835 5.729
22 MET O 9.194 4.511 18.28    47 PRO CG 8.073 1.67 5.857
23 GLU N 10.258 6.496 18.15    47 PRO C 5.44 2.469 7.923
23 GLU CA 9.195 7.185 18.81    47 PRO O 6.057 2.003 8.88
23 GLU CB 9.675 8.269 19.79    48 HIS N 4.102 2.58 7.868
23 GLU CG 8.537 8.969 20.53    48 HIS CA 3.229 2.171 8.925
23 GLU CD 8.01 8.012 21.59    48 HIS ND1 0.445 2.21 10.746
23 GLU OE1 8.85 7.397 22.3    48 HIS CG 0.7 1.882 9.431
23 GLU OE2 6.762 7.881 21.7    48 HIS CB 1.761 2.503 8.566
23 GLU C 8.472 7.863 17.69    48 HIS NE2 -1.047 0.659 10.178
23 GLU O 9.098 8.473 16.83    48 HIS CD2 -0.218 0.934 9.103
24 VAL N 7.131 7.76 17.66    48 HIS CE1 -0.609 1.45 11.141
24 VAL CA 6.448 8.363 16.56    48 HIS C 3.331 0.692 9.128
24 VAL CB 5.254 7.585 16.08    48 HIS O 3.475 0.219 10.255
24 VAL CG1 4.236 7.483 17.23    49 ASN N 3.28 -0.075 8.026
24 VAL CG2 4.689 8.276 14.83    49 ASN CA 3.22 -1.498 8.147
24 VAL C 5.974 9.722 16.96    49 ASN CB 2.877 -2.199 6.822
24 VAL O 5.316 9.889 17.98    49 ASN CG 3.967 -1.898 5.801
25 THR N 6.352 10.746 16.16    49 ASN OD1 4.331 -0.748 5.55
25 THR CA 5.891 12.069 16.46    49 ASN ND2 4.522 -2.98 5.192
25 THR CB 6.549 13.148 15.63    49 ASN C 4.452 -2.074 8.778
25 THR OG1 6.101 14.43 16.05    49 ASN O 4.319 -2.962 9.621
25 THR CG2 6.236 12.941 14.15    50 PHE N 5.67 -1.621 8.419
25 THR C 4.408 12.122 16.24    50 PHE CA 6.813 -2.171 9.096
25 THR O 3.67 12.569 17.12    50 PHE CB 8.137 -1.567 8.598
26 ARG N 3.918 11.649 15.08    50 PHE CG 8.367 -2.037 7.201
26 ARG CA 2.504 11.662 14.85    50 PHE CD1 7.782 -1.413 6.122
26 ARG CB 1.92 13.066 14.6    50 PHE CD2 9.188 -3.114 6.978
26 ARG CG 0.399 13.081 14.4    50 PHE CE1 8.017 -1.858 4.839
26 ARG CD -0.411 12.788 15.67    50 PHE CE2 9.431 -3.569 5.703
26 ARG NE -1.854 12.843 15.29    50 PHE CZ 8.848 -2.938 4.625
26 ARG CZ -2.82 12.68 16.24    50 PHE C 6.579 -1.746 10.541
26 ARG NH1 -2.468 12.495 17.54    50 PHE O 6.729 -2.606 11.453
26 ARG NH2 -4.142 12.714 15.89    50 PHE OXT 6.25 -0.551 10.746
26 ARG C 2.241 10.835 13.63
26 ARG O 3.119 10.651 12.79

Table 3

Res. Name Atom x y z    Res. Name Atom x y z

51 PRO N 8.47 9.24 0.67    56 PRO N 5.63 -0.16 1.07
51 PRO CA 7.82 10.5 0.39    56 PRO CA 5.69 -1.53 0.61
51 PRO CD 7.65 8.77 1.82    56 PRO CD 6.83 0.24 1.77
51 PRO CB 7.85 11.3 1.7    56 PRO CB 7.08 -2.04 0.96
51 PRO CG 7.65 10 2.69    56 PRO CG 7.89 -0.78 1.3
51 PRO C 7.35 10.7 -1.02    56 PRO C 4.7 -2.35 1.37
51 PRO O 6.76 11.7 -1.44    56 PRO O 4.4 -2 2.51
52 LEU N 7.64 9.59 -1.67    57 PRO N 4.24 -3.4 0.8
52 LEU CA 7.42 9.31 -3.04    57 PRO CA 3.37 -4.27 1.53
52 LEU CB 8.74 9.56 -3.85    57 PRO CD 3.96 -3.43 -0.63
52 LEU CG 8.94 11 -4.48    57 PRO CB 2.58 -5.07 0.47
52 LEU CD2 8.61 12.1 -3.54    57 PRO CG 3.31 -4.81 -0.86
52 LEU CD1 8.19 11 -5.8    57 PRO C 4.24 -5.1 2.43
52 LEU C 6.79 7.92 -3.12    57 PRO O 5.45 -5.15 2.21
52 LEU O 5.57 7.87 -2.9    58 LEU N 3.65 -5.74 3.47
53 PRO N 7.42 6.77 -3.37    58 LEU CA 4.49 -6.49 4.36
53 PRO CA 6.63 5.59 -3.65    58 LEU CB 3.76 -7.1 5.57
53 PRO CD 8.8 6.63 -3.8    58 LEU CG 4.72 -7.67 6.65
53 PRO CB 7.59 4.59 -4.21    58 LEU CD2 4 -8.66 7.59
53 PRO CG 8.78 5.42 -4.71    58 LEU CD1 5.44 -6.53 7.38
53 PRO C 5.94 5.21 -2.38    58 LEU C 5.05 -7.62 3.56
53 PRO O 6.24 5.67 -1.3    58 LEU O 4.43 -8.1 2.62
54 PRO N 4.94 4.45 -2.67    59 PRO N 6.24 -8 3.92
54 PRO CA 4.02 3.95 -1.69    59 PRO CA 6.85 -9.09 3.2
54 PRO CD 4.44 4.34 -4.03    59 PRO CD 7.23 -7.03 4.36
54 PRO CB 2.87 3.29 -2.47    59 PRO CB 8.34 -8.76 3.11
54 PRO CG 3.41 3.2 -3.92    59 PRO CG 8.57 -7.75 4.24
54 PRO C 4.72 3.05 -0.76    59 PRO C 6.58 -10.4 3.95
54 PRO O 5.76 2.5 -1.12    59 PRO O 6.76 -10.4 5.19
55 LEU N 4.15 2.88 0.45    59 PRO OXT 6.21 -11.4 3.29
55 LEU CA 4.72 2.01 1.43
55 LEU CB 3.87 1.96 2.69
55 LEU CG 3.71 3.28 3.44
55 LEU CD2 3.03 3.03 4.79
55 LEU CD1 2.95 4.33 2.6
55 LEU C 4.61 0.63 0.87
55 LEU O 3.61 0.27 0.27

Figure 2. Predicting the overall accuracy of comparative models. The good and bad

models for proteins of known structure are used to tune the prediction of reliability of a model when the actual structure is not known (Fig. 3). See Materials and Methods for

details. (A) A rule for assigning a comparative model into either the 'good' or 'bad' class,

based on its Q_SCORE. The inset shows the distributions of Q_SCORE for the good and

bad models with 100 to 150 residues. Such distributions are used with the Bayes theorem to

calculate the posterior probability that a model is good, given that it has a certain Q_SCORE

value, p(GOOD/Q_SCORE ). The main plot shows the percentages of false positives (bad

models classified as good) and false negatives (good models classified as bad) as a function

of sequence length. The curves were obtained by the jack-knife procedure. (B) A rule for estimating the accuracy of a reliable model (as predicted by its Q_SCORE), based on the

percentage sequence identity to the template. The overlaps of an experimentally determined

protein structure with its model (continuous line) and with a template on which the model

was based (dashed line) are shown as a function of the target-template sequence identity.

This identity was calculated from the modeling alignment. The structure overlap is defined

as the fraction of the equivalent C_α atoms. For comparison of the model with the actual structure (filled circles), two C_α atoms were considered equivalent if they were within 3.5 Å of each other and belonged to the same residue. For comparison of the template structure with the actual target structure (open circles), two C_α atoms were considered equivalent if they were within 3.5 Å after alignment and rigid-body superposition by the align3d command in Modeller [15]. The points correspond to the median values and the error bars in the positive and negative directions correspond to the average positive and negative differences from the median, respectively. Points labeled α, β, γ correspond to the models reported in Fig. 1C of Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998). The empty circle at 25% sequence identity corresponds to an unusually accurate model (illustrated in Fig. 3B of Proc. Natl. Acad. Sci. U.S.A. 95, 13597-13602 (1998)).
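The Bayes calculation described for panel (A) can be sketched as follows. This is a minimal illustration, not the procedure used to produce the figure: the function name `posterior_good`, the histogram bin width, the default prior (the class frequencies in the training sets), and the sample Q_SCORE values are all assumptions made for the example. In practice the two distributions would be taken from models in the same length class (e.g., 100 to 150 residues).

```python
import math

def posterior_good(q_score, good_scores, bad_scores, bin_width=0.25, prior_good=None):
    """Posterior p(GOOD/Q_SCORE) from histogram estimates of the
    Q_SCORE distributions of known good and known bad models."""
    def likelihood(samples):
        # Fraction of training models whose Q_SCORE falls in the same
        # histogram bin as the query: an estimate of p(Q_SCORE/class).
        target_bin = math.floor(q_score / bin_width)
        hits = sum(1 for s in samples if math.floor(s / bin_width) == target_bin)
        return hits / len(samples)

    if prior_good is None:
        # Assumption: the prior is the fraction of good models in the training data.
        prior_good = len(good_scores) / (len(good_scores) + len(bad_scores))
    numerator = likelihood(good_scores) * prior_good
    evidence = numerator + likelihood(bad_scores) * (1.0 - prior_good)
    # Fall back to the prior when the bin is empty for both classes.
    return numerator / evidence if evidence > 0 else prior_good

# Hypothetical training sets of Q_SCORE values (not data from this work)
good = [0.8, 0.9, 0.85, 0.3]
bad = [0.1, 0.2, 0.8, 0.3]
print(posterior_good(0.8, good, bad))  # 0.75
```

A model would then be assigned to the 'good' class when this posterior exceeds a chosen cutoff; the cutoff itself is not specified here.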

Figure 3. Protein structure models for yeast ORFs. (A) Distribution of the sequence identity between the models and the corresponding templates as a function of model sequence length. The 3992 reliable models for substantial segments of 1071 different ORFs that are predicted to be based on a correct template and approximately correct alignment are represented by the upper bars for a given point, and the 4588 unreliable models that are predicted to be based on a mostly incorrect alignment or an incorrect template are represented by the lower bars for a given point. The last histogram at label "All/6" is the sum of the other six histograms divided by six. (B) The corresponding distribution of the alignment significance score calculated by the program Align [13].

proline rich peptide. A segment in the yeast ORF YDL117W sequence (top panel) was

predicted to be remotely related to the SH3 domains, many of which have known 3D

structure (Tab. 1). The automated prediction was possible because of the sensitivity

afforded by evaluating a 3D model implied by the match. The 3D model of the SH3 domain in turn made it possible to address the biochemical function of YDL117W by calculating a 3D model of a complex between the predicted SH3 domain and a putative ligand, a

proline-rich peptide (middle panel). The ligand in the SH3 model is in fact a proline-rich

segment that occurs downstream in the same ORF. This peptide, PLPPLPPLP (positions

212-220), contains the signature PXXP sequence typical of the SH3 binding peptides [39].

The coordinates of the protein and the ligand in the complex are shown in Tables B-2 and B-3, respectively. The model of the complex was obtained by the same comparative method as

the model of the SH3 domain [15], relying on the crystallographic structure of the complex

between the FYN SH3 domain and its peptide ligand (PPAYPPPPVP) [40]. Both inter- and

intra-molecular interactions between SH3 domains and Pro-rich peptides have already been documented [39]. The bottom panel shows a schematic representation of the SH3-peptide interaction [42], including the SH3 residues making hydrophobic contacts and hydrogen bonds to the ligand peptide. This model should facilitate designing experiments such as site-directed mutagenesis for mapping of functionally important residues on the SH3 domain as well as its ligand. This should be compared to the starting point where no functional information about

this ORF or about the proteins previously related to it was known. More generally, the

wealth of information in the bottom panel of this Figure and Table 2 (protein coordinates)

and Table 3 (ligand coordinates) relative to the top, sequence-only panel of this figure is a

case in point for the utility of structural models in planning biological experiments. For the many proteins whose structures have not been determined by experiment, maximal structural information is obtained by both (i) establishing a match to a known

protein structure and (ii) calculating an all-atom 3D model based on that match, using the

methods described in this paper.

Example 2

Variable gap penalty function for protein sequence-structure alignment

(Abbreviations: 3D, three-dimensional; PDB, Protein DataBank)

Overview

One of the main sources of error in comparative protein structure modeling is the misalignment of the target sequence with template structures. We describe a dynamic programming procedure for aligning a sequence with a structure that mitigates this problem. The procedure uses a variable gap penalty function that depends on the structural context of an insertion or a deletion. It avoids insertions and deletions within helices or sheets, buried regions, straight segments, and also between two residues that are distant in space. Several examples of the improved alignments are shown. The variable gap penalty function may also be useful in other applications where sequence-structure comparisons are needed, such as in template matching and threading.

Introduction

In a few years, the genome projects will have provided us with amino acid sequences of

approximately 500,000 proteins. The full potential of the genome projects will only

be realized once we can assign, understand, and manipulate the function of these new

proteins. Such control of protein function generally requires the knowledge of protein

three-dimensional structure. Unfortunately, experimental methods for protein structure determination, such as X-ray crystallography and NMR spectroscopy, are time consuming

and not successful with all proteins; consequently, three-dimensional structures have

been determined for only a tiny fraction of proteins for which the amino acid sequence is

known. Fortunately, in the absence of a high-resolution protein structure determined by

X-ray crystallography or NMR spectroscopy, a useful three-dimensional (3D) model of a given sequence can often be calculated by comparative modeling (Blundell et al., 1987;

Greer, 1990; Johnson et al., 1994; Bajorath et al., 1994; Holm et al., 1994; Sali, 1995;

Sanchez & Sali, 1997).

Comparative or homology protein modeling uses experimentally determined protein structures (templates) to predict the conformation of another protein with a similar amino acid

sequence (target). This is possible because a small change in the sequence usually results in a small change in the 3D structure (Lesk & Chothia, 1986). All comparative modeling

methods begin with an alignment between the target and templates; the main difference

between the different modeling methods is in how the 3D model is calculated from a given

alignment. Once an appropriate template structure is found, the usefulness of comparative

protein models has been limited by the errors in sidechain packing, distortions in correctly and incorrectly aligned regions, distortions in unaligned regions, and, most importantly, by

the difficulty of sequence alignment when sequence identity between the target and

templates is less than about 35% (Sali et al., 1995). If the alignment is incorrect, the atoms

will be positioned incorrectly by all the current comparative modeling methods. Even a shift in the alignment by only one residue will produce an rms error in the backbone atoms on the order of 4 Å. When the sequence identity between the sequence and structure is about 30%, the best methods for aligning sequences with structures misalign about 20% of residues, as judged by structure-structure alignments (Johnson & Overington, 1993). This is a major problem because about one half of all pairs of related proteins are related by less than 30% sequence identity.
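The size of the error caused by a one-residue shift can be illustrated with a short calculation. Consecutive C_α atoms in a polypeptide chain are approximately 3.8 Å apart, so placing every residue at the position of its neighbor displaces each C_α atom by about that distance, regardless of the fold. The sketch below uses a hypothetical straight trace and a helper name (`rms_after_shift`) introduced only for this example.

```python
import math

CA_SPACING = 3.8  # approximate distance (in Å) between consecutive C-alpha atoms

def rms_after_shift(ca_trace, shift=1):
    """RMS distance between each C-alpha atom and the atom `shift`
    residues further along the same trace (no re-superposition)."""
    pairs = zip(ca_trace[:-shift], ca_trace[shift:])
    squared = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical trace: 20 atoms spaced CA_SPACING apart along the x axis
trace = [(i * CA_SPACING, 0.0, 0.0) for i in range(20)]
print(round(rms_after_shift(trace), 2))  # 3.8
```

The same value is obtained for any trace with this spacing between neighbors, which is consistent with the error of about 4 Å quoted above.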

In principle, the best approach for aligning a sequence with a structure would be to score all

possible alignments by scoring the best full model of the sequence as implied by the

corresponding alignment (Sali et al., 1995). However, this computationally difficult problem has not yet been solved in practice. So far, the existing approaches for improving sequence-structure alignments relative to the dynamic programming solution from a sequence-sequence alignment (Needleman & Wunsch, 1970) all rely on the availability of structural information for one of the proteins in the alignment. Generally, the objective function that is optimized to get the best alignment consists of two parts: that corresponding to the aligned regions and that corresponding to the insertions and deletions (i.e., gaps). To improve the scoring of equivalent regions, relative to the scoring by a single 20 by 20 amino acid substitution table of the Dayhoff type (Dayhoff et al., 1978), environment dependent amino acid substitution tables have been used in template matching (Overington et al., 1990; Bowie et al., 1991; Overington et al., 1992), and single body and pairwise statistical potentials have been used in threading (Jones et al., 1992; Godzik et al., 1992). Another group of improvements concentrates on the gap penalty rather than on the scores for the aligned regions. Usually, a linear gap penalty function is used, g = u + v·l, that depends on the gap initiation and extension parameters, u and v, and on the number of residues in the gap, l. The optimal values for parameters u and v in sequence-sequence alignments have been explored exhaustively (Barton & Sternberg, 1987; Johnson & Overington, 1993; Vogt et al., 1995). Several variations on the simple linear gap penalty have also been proposed. For example, it has been shown from analysis of gap length distribution in reference alignments that the gap penalty should be a logarithmic function of gap length (Benner et al., 1993); however, this appears to complicate the implementation in the dynamic programming algorithms (Gotoh, 1982) and the linear gap penalty is still widely used. Another improvement involves making the gap penalty dependent on the (predicted) local secondary structure (Lesk et al., 1986) and on the local variability of the already aligned sequences (Taylor, 1986).
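The difference between the two gap penalty shapes mentioned above can be seen in a few lines of code. The linear form g = u + v·l is the one given in the text; the logarithmic form u + v·ln(l) and the parameter values are assumptions chosen only to show the qualitative behavior, not the function or parameters of any cited work.

```python
import math

def linear_gap_penalty(l, u, v):
    """Linear penalty g = u + v*l for a gap of l residues."""
    return u + v * l

def log_gap_penalty(l, u, v):
    """Illustrative logarithmic penalty u + v*ln(l); the exact
    functional form is an assumption made for this sketch."""
    return u + v * math.log(l)

# Hypothetical parameters: gap initiation u = 10, extension v = 2
for l in (1, 5, 20):
    print(l, linear_gap_penalty(l, 10.0, 2.0), round(log_gap_penalty(l, 10.0, 2.0), 2))
```

With the same u and v, the logarithmic penalty grows much more slowly with gap length, so long gaps are penalized relatively less than under the linear form.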

One important type of structural information has not yet been used in sequence-structure alignments: it has been observed anecdotally that the insertions relative to a given structure tend to happen between residues that are close in space. Similarly, regions that span two residues

close in space tend to be deleted more frequently. In this communication, we propose a

dynamic programming algorithm with a variable gap penalty function that attempts to mimic

this observation. In addition, we also incorporate other structural information to facilitate insertions or deletions of residues that are exposed, occur in bends, and are not within secondary structure segments. The new algorithm is almost as fast as the dynamic

programming version with the linear gap penalty. Several sample alignments obtained by the

linear and variable gap penalty functions are compared to show advantages of the variable

gap penalty function in comparative modeling. In addition to comparative modeling, the new algorithm can also be applied for 3D template matching and threading.

2 Variable gap penalty for sequence - structure alignment

We describe a dynamic programming algorithm of the Needleman & Wunsch type

(Needleman & Wunsch, 1970) to obtain an optimal alignment of one or more

pre-aligned protein sequences (i.e., sequence block) with one or more pre-aligned protein structures and sequences (i.e., structure block). The distinguishing feature of the

algorithm is its variable gap penalty function. The algorithm is implemented in the

ALIGN2D command of the computer program Modeller, which is freely available to academic researchers via the World Wide Web at URL http://guitar.rockefeller.edu. Graphical interfaces to Modeller are provided by Quanta, InsightII, and GeneExplorer (MSI, San Diego, CA; e-mail dje@msi.com). For a detailed discussion of this and related problems see Sankoff & Kruskal (1983).

1. The method as described here is for the global alignment only (Needleman & Wunsch, 1970), although the ALIGN2D command implements calculation of locally optimal

alignments as well (Smith & Waterman, 1981; Sali et al., 1998). The algorithm can be used either with similarity or distance residue-residue scores (Sali et al., 1998) and its extension to

environment dependent substitution matrices is straightforward.

The problem of the optimal alignment of two sequences as addressed by the algorithm of

Needleman & Wunsch is as follows. We are given two sequences of residues and an M

times N scoring matrix W where M and N are the numbers of residues in the first and

second sequence. The scoring matrix is composed of scores W_{i,j} describing the differences between residues i and j from the first and second sequence, respectively. The goal is to obtain an optimal set of equivalences that match residues of the first sequence to the residues

of the second sequence. The equivalence assignments are subject to the following

'progression rule': for residues i and k from the first sequence and residues j and l

from the second sequence, if residue i is equivalenced to residue j, if residue k is equivalenced to residue l, and if k is greater than i, then l must also be greater than j. The

optimal set of equivalences is the one with the smallest alignment score. The

alignment score is a sum of scores corresponding to matched residues, also increased

for occurrences of non-equivalenced residues (i.e., gaps).

The recursive dynamic programming formulae for the global alignment of the structure block with the sequence block are:

$$D_{i,j} = \min_{\substack{\max(0,\,i-L) \le i' < i \\ \max(0,\,j-L) \le j' < j}} \left( D_{i',j'} + G_{i,i',j,j'} + W_{i,j} \right)$$

$$D_{0,j} = D_{i,0} = W_{i,N+1} = W_{M+1,j} = 0$$

where M and N are the lengths of the structure and sequence blocks, respectively, L indicates the maximal allowed gap length, G is the variable gap penalty function, and W is the residue-residue substitution score for positions i and j from the structure and sequence blocks, respectively. The 20 by 20 residue-residue weight matrix asl.mat, whose values lie between 0 and 1000 (Sali et al., 1995), is used to obtain W. D is calculated up to i = M + 1 and j = N + 1. The minimal score for the global alignment of the two blocks, d, corresponds to the smallest element in D_{M+1, 0<j≤N+1} and D_{0<i≤M+1, N+1}. The residue-residue equivalence assignments (i.e., the alignment) are obtained by backtracking in matrix D, starting from the element d.

Function G is the variable gap penalty function for simultaneous insertions from positions i' to i in the structure block and from positions j' to j in the sequence block. That is, for the situation where positions i' and j' are aligned with each other and positions i and j are aligned with each other, while the intervening positions in either or both blocks are not aligned with each other. If i' = i - 1 (j' = j - 1), there is no insertion in the structure (sequence) block. This formulation of the alignment by global dynamic programming allows for the gap penalty function of any form.
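The recursion can be illustrated with a minimal Python sketch. It is not the ALIGN2D implementation: the names `align_global` and `linear_gap` are hypothetical, penalties are taken as positive numbers added to a distance-like score, only the virtual start (0, 0) and virtual end (M + 1, N + 1) are used as boundary cells, and a gap of at most `max_gap` unaligned positions is allowed in each block. The gap function is passed in as a callable, so any form of G can be plugged in.

```python
import math

def align_global(W, gap, max_gap):
    """Align a structure block (rows of W) with a sequence block (columns of W).

    W[i-1][j-1] is the substitution score for positions i and j (lower is better).
    gap(i, ip, j, jp) is the penalty for matching (i, j) directly after (ip, jp).
    Returns (score, list of matched 1-based (i, j) pairs).
    """
    M, N = len(W), len(W[0])
    D = [[math.inf] * (N + 2) for _ in range(M + 2)]
    back = [[None] * (N + 2) for _ in range(M + 2)]
    D[0][0] = 0.0  # virtual start; (M + 1, N + 1) is the virtual end
    for i in range(1, M + 2):
        for j in range(1, N + 2):
            if (i == M + 1) != (j == N + 1):
                continue  # the virtual end is only matched with itself
            w = 0.0 if i == M + 1 else W[i - 1][j - 1]
            # Assumption: at most max_gap positions are skipped in each block.
            for ip in range(max(0, i - 1 - max_gap), i):
                for jp in range(max(0, j - 1 - max_gap), j):
                    score = D[ip][jp] + gap(i, ip, j, jp) + w
                    if score < D[i][j]:
                        D[i][j] = score
                        back[i][j] = (ip, jp)
    if back[M + 1][N + 1] is None:
        raise ValueError("no alignment possible within max_gap")
    # Backtrack from the virtual end to the virtual start.
    pairs = []
    i, j = back[M + 1][N + 1]
    while (i, j) != (0, 0):
        pairs.append((i, j))
        i, j = back[i][j]
    pairs.reverse()
    return D[M + 1][N + 1], pairs

def linear_gap(u, v):
    """Linear gap penalty: opening u plus extension v per unaligned position."""
    def gap(i, ip, j, jp):
        l, lp = i - ip - 1, j - jp - 1
        return 0.0 if l == 0 and lp == 0 else u + (l + lp) * v
    return gap

score, pairs = align_global([[0, 9], [9, 9], [9, 0]], linear_gap(2.0, 1.0), max_gap=2)
```

In this toy case the second structure position is left unaligned, so the score is a single gap opening plus one extension. The four nested loops make the cost grow with the square of `max_gap`, which is why the text limits the gap length to L.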

The main difference in the recursion from the linear gap penalty case is that a slightly slower procedure for finding the optimal gap lengths must be used for gap openings in the block of prealigned sequences, because the penalty depends on the distance between the two spanning Cα positions in the block of structures. The CPU time is saved by limiting the minimization over l and l' to values of l and l' that are smaller than L; this is equivalent to limiting the maximal length of a gap to L positions. In practice, the new algorithm is only slightly slower than the O(M × N) variant of the original dynamic programming algorithm with the linear gap penalty function (Gotoh, 1982). The variable gap penalty function is defined as

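$$G_{i,i',j,j'} = \begin{cases} 0, & l = 0 \text{ and } l' = 0 \\ R(i,i')\,u + (l + l')\,v, & l > 0 \text{ or } l' > 0 \end{cases}$$

$$l = \begin{cases} i - i' - 1, & 0 < i' \text{ and } i \le M \\ \max(0,\; i - i' - 1 - e), & i' = 0 \text{ or } i = M + 1 \end{cases} \qquad (2)$$

$$l' = \begin{cases} j - j' - 1, & 0 < j' \text{ and } j \le N \\ \max(0,\; j - j' - 1 - e), & j' = 0 \text{ or } j = N + 1 \end{cases}$$

$$R(i,i') = 1 + \left[ \omega_H H_i + \omega_S S_i + \omega_B B_i + \omega_C C_i + \omega_D \max(0,\; d - d_0)^{\gamma} \right]$$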

where l and l' are the lengths of insertions in the structure and sequence blocks, respectively, v is the gap extension penalty, u is the gap opening penalty, e is the maximal number of residues at sequence termini that are not penalized with a gap penalty if not equivalenced (i.e., overhangs), and R is the function that modulates the gap penalty function depending on the structure block at the position of the insertion. R is at least 1, but can be larger to make gap opening more difficult in the following circumstances: within helices or strands, at buried positions, or at straight main chain positions. H_i is 1 if all the structures in the structure block have all positions from i' to i occupied by helical residues. S_i is 1 if all templates have all positions from i' to i occupied by β-strand residues. B_i is the average buriedness of residues from position i' to i in the structure block, C_i is the average backbone straightness of residues in the structure block from position i' to i, d is the distance between the Cα atoms at positions i' and i, averaged over all structures in the structure block, d₀ is the distance that is small enough to correspond to no increase in the opening gap penalty, and γ is a constant to be determined empirically. The values of all four features, H, S, B, and C, lie between 0 and 1. See the next Section for exact definitions of secondary structure, backbone straightness, and residue buriedness. Reasonable values for all parameters (u, v, ω, d₀, γ) were obtained by a trial-and-error procedure (Results). The variable gap penalty function reduces to the special case of the linear gap penalty function when all weights ω are set to 0.
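A minimal Python sketch of the gap penalty terms defined above follows. The function names are hypothetical, and two points are assumptions rather than facts from the text: the insertion length is taken as i - i' - 1 (so that i' = i - 1 means no insertion), and γ is applied as an exponent on the distance term max(0, d - d₀). The extracted formula is too damaged to confirm either choice.

```python
def insertion_length(i, ip, block_length, e):
    """Number of penalized unaligned positions between aligned positions ip and i.

    At the termini (ip == 0 or i == block_length + 1) up to e overhanging
    residues are not penalized.
    """
    raw = i - ip - 1  # assumption: ip == i - 1 means no insertion
    if ip == 0 or i == block_length + 1:
        return max(0, raw - e)
    return raw

def gap_modulation(H, S, B, C, d, w_H, w_S, w_B, w_C, w_D, d0, gamma):
    """R(i, i'): at least 1; larger inside helices or strands, at buried or
    straight positions, and when the spanning C-alpha atoms are far apart."""
    distance_term = max(0.0, d - d0) ** gamma  # assumption: gamma is an exponent
    return 1.0 + (w_H * H + w_S * S + w_B * B + w_C * C + w_D * distance_term)

def variable_gap_penalty(l, lp, R, u, v):
    """G: zero without insertions, otherwise modulated opening plus extension."""
    if l == 0 and lp == 0:
        return 0.0
    return R * u + (l + lp) * v

# With all weights set to 0, R is 1 and G is the ordinary linear gap penalty.
r_linear = gap_modulation(1, 0, 0.5, 1, 12.0, 0, 0, 0, 0, 0, 8.6, 1.2)
# A helical, half-buried, straight position with close spanning C-alpha atoms.
r_helix = gap_modulation(1, 0, 0.5, 1, 5.0, 0.35, 1.2, 0.9, 1.2, 0.6, 8.6, 1.2)
```

The weights in the second call are the values listed for the later runs in Table 5; the feature values are made up for illustration.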

2.1 Secondary structure assignments

The algorithm for assignment of α-helices and β-strands depends on the Cα positions only. It is based on the idea of matching distance matrices of short segments of residues with

'library' distance matrices corresponding to the individual secondary structure types (Richards & Kundrot, 1988). The main difference between the method of Richards & Kundrot (1988) and

the current implementation is that β-strands are assigned only when there are at least two

spatially neighboring β-strands that can form a β-sheet. For each secondary structure type,

the library C_α distance matrix was calculated by averaging distance matrices for a sample of the corresponding secondary structure segments, which was obtained by running program

DSSP (Kabsch & Sander, 1983) on a set of 10 high-quality unrelated protein structures.

Distances that lay more than 2 standard deviations away from the mean of all distances were

omitted from the final averages. The secondary structure defining distance matrices and

parameters are shown in Table 4. The algorithm assigns helices first and strands second.

For each secondary structure type, it proceeds as follows:

1. Define the degree of the current secondary structure fit for each C_α atom. Use two

criteria: distance RMS (i.e., DRMS) (Levitt, 1983) and maximal distance difference (MDD). The DRMS and MDD are both obtained by comparing the library distance matrix

with the distance matrix for a segment starting at the given Cα position. Assign the current

secondary structure type to all Cα's in all segments whose DRMS and MDD are less

than cutoffs c₁ and c₂, respectively, and are not already assigned to 'earlier' secondary

structure types.

2. Split kinked contiguous segments of the same type into separate segments: Kinking

residues are all residues in segments with both DRMS and MDD beyond cutoffs c₃ and c₄ , respectively. The actual single kink residue separating the two new segments of the same

type is the central kinking residue.

3. If the current secondary structure type is β-strand: Eliminate those runs of strand residues that are not closer than c₅ Å to other strand residues, separated by at least two residues in

sequence.

4. Eliminate those segments that are shorter than the cutoff length (c₆ ).

5. Remove the isolated kinking residues (those that occur on their own or begin or end a

segment).
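The fit test in step 1 can be sketched as follows. This is an illustration only: the function names are hypothetical, the library matrix is read from its upper triangle, and DRMS is taken as the root mean square of the differences over all residue pairs in the segment.

```python
import math

def distance_matrix(coords):
    """All pairwise distances between C-alpha coordinates (x, y, z tuples)."""
    n = len(coords)
    return [[math.dist(coords[a], coords[b]) for b in range(n)] for a in range(n)]

def drms_and_mdd(coords, library):
    """Compare a segment with a library distance matrix (upper triangle used).

    Returns (DRMS, MDD): the root mean square and the maximal absolute
    difference between corresponding distances.
    """
    n = len(coords)
    dm = distance_matrix(coords)
    diffs = [dm[a][b] - library[a][b] for a in range(n) for b in range(a + 1, n)]
    drms = math.sqrt(sum(x * x for x in diffs) / len(diffs))
    mdd = max(abs(x) for x in diffs)
    return drms, mdd

def segment_matches(coords, library, c1, c2):
    """Step 1 criterion: DRMS below c1 and MDD below c2."""
    drms, mdd = drms_and_mdd(coords, library)
    return drms < c1 and mdd < c2

# Toy three-residue segment and library (not real secondary structure data).
segment = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
exact_library = [[0, 3, 5], [0, 0, 4], [0, 0, 0]]
shifted_library = [[0, 3, 6], [0, 0, 4], [0, 0, 0]]
```

In practice the library would be one triangle of the matrix in Table 4 and the cutoffs would be c₁ and c₂ for the corresponding secondary structure type.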

2.2 Backbone straightness

Local mainchain curvature at residue i is defined as the angle α (0° < α < 180°) between the

least-squares lines through Cα atoms i - 3 to i, and from i + 3 to i. Straightness is defined as

1 for all residues within helices and strands, and as 1 - min(180°, max(0°, α))/180° otherwise.
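As a small illustration, the straightness value can be computed from the angle as below. The function name is hypothetical, the angle is assumed to be supplied in degrees, and the clamped angle is assumed to be divided by 180° as a whole.

```python
def straightness(alpha_deg, in_helix_or_strand=False):
    """Backbone straightness in [0, 1] from the local curvature angle in degrees."""
    if in_helix_or_strand:
        return 1.0  # helices and strands are defined as fully straight
    clamped = min(180.0, max(0.0, alpha_deg))
    return 1.0 - clamped / 180.0
```

The least-squares line fitting that yields the angle is not shown here.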

2.3 Residue buriedness

The residue buriedness is defined as 1 - a, where a is the fractional side-chain solvent

accessibility ranging from 0 to 1 (Sali & Overington, 1994).

3 Reference alignments for optimization of parameters

Reference alignments were obtained by least-squares superposition of two proteins of known

structure (Abola et al., 1987). All the proteins were at least 30 residues long and were

determined either by X-ray crystallography at a resolution of 3 Å or by NMR. All the pairs had

proteins with the sequence identity in the range from 30% to 45%; none of the pairs had

both proteins with more than 50% sequence identity to any other pair. MODELLER's

ALIGN3D with the cutoff of 5 Å was used for least-squares superposition. Bad

superpositions were removed.

References

Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T., & Weng, J. (1987).

Protein data bank. In: Crystallographic databases - Information, content, software

systems, scientific applications, (Allen, F. H., Bergerhoff, G., & Sievers, R., eds) pp.

107-132. Data Commission of the International Union of Crystallography

Bonn/Cambridge/Chester.

Bajorath, J., Stenkamp, R., & Aruffo, A. (1994). Knowledge-based model building of

proteins: Concepts and examples. Protein Sci. 2, 1798-1810.

Barton, G. J. & Sternberg, M. J. (1987). Evaluation and improvements in the automatic alignment of proteins sequences. Protein Eng. 1, 89-94.

Benner, S. A., Gonnet, G. H., & Cohen, M. A. (1993). Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229, 1065-1082.

Blundell, T. L., Sibanda, B. L., Sternberg, M. J. E., & Thornton, J. M. (1987).

Knowledge-based prediction of protein structures and the design of novel molecules.

Nature, 326, 347-352.

Bowie, J. U., Lüthy, R., & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164-170.

Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). In: Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, (Dayhoff, M. O., ed) pp. 345-352. National

Biomedical Research Foundation Washington D.C.

Godzik, A., Kolinski, A., & Skolnick, J. (1992). Topology fingerprint approach to the

inverse protein folding problem. J. Mol. Biol. 227, 227-238.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.

Greer, J. (1990). Comparative modelling methods: application to the family of the

mammalian serine proteases. Proteins, 7, 317-334.

Holm, L., Rost, B., Sander, C., Schneider, R., & Vriend, G. (1994). Data based modeling of proteins. In: Statistical mechanics, Protein Structure, and Protein Substrate Interactions,

(Doniach, S., ed) pp. 277-296. Plenum Press New York.

Johnson, M. S. & Overington, J. P. (1993). A structural basis for sequence

comparisons: An evaluation of scoring methodologies. J. Mol. Biol. 233, 716-738.

Johnson, M. S., Srinivasan, N., Sowdhamini, R., & Blundell, T. L. (1994). Knowledge-based

protein modelling. CRC Crit. Rev. Biochem. Mol. Biol. 29, 1-68.

Jones, D. T., Taylor, W. R., & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86-89.

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition

of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.

Koretke, K. K., Luthey-Schulten, Z., & Wolynes, P. G. (1996). Self-consistently optimized

statistical mechanical energy functions for sequence structure alignment. Prot. Sci. 5,

1043-1059.

Lesk, A. M. & Chothia, C. H. (1986). The response of protein structures to amino-acid sequence changes. Philos. Trans. R. Soc. London Ser. B, 317, 345-356.

Lesk, A. M., Levitt, M., & Chothia, C. (1986). Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng. 1, 77-78.

Levitt, M. (1983). Molecular dynamics of native protein. II. Analysis and nature of

motion. J. Mol. Biol. 168, 621-657.

Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.

Overington, J., Donnelly, D., Johnson, M. S., Sali, A., & Blundell, T. L. (1992).

Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Sci. 1, 216-226.

Overington, J., Johnson, M. S., Sali, A., & Blundell, T. L. (1990). Tertiary structural

constraints on protein evolutionary diversity; templates, key residues and structure

prediction. Proc. Roy. Soc. Lond. B 241, 132-145.

Richards, F. M. & Kundrot, C. E. (1988). Identification of structural motifs from protein

coordinate data: Secondary structure and first level super-secondary structure. Proteins, 3,

71-84.

Sali, A. (1995). Modelling mutations and homologous proteins. Curr. Opin. Biotech. 6, 437-451.

Sali, A., Badretdinov, A., Sanchez, R., & Feyfant, E. (1998). Modeller, A Protein Structure Modeling Program, Release 5. URL http://guitar.rockefeller.edu/.

Sali, A. & Overington, J. (1994). Derivation of rules for comparative protein

modeling from a database of protein structure alignments. Protein Sci. 3, 1582-1596.

Sali, A., Potterton, L., Yuan, F., van Vlijmen, H., & Karplus, M. (1995). Evaluation of comparative protein structure modeling by MODELLER. Proteins, 23, 318-326.

Sanchez, R. & Sali, A. (1997). Advances in comparative protein-structure modeling. Curr.

Opin. Struct. Biol. 7, 206-214.

Sankoff, D. & Kruskal, J. B. (1983). Time warps, string edits, and macromolecules:

The theory and practice of sequence comparison. Reading, MA: Addison-Wesley

Publishing Company.

Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences.

J. Mol. Biol. 147, 195-197.

Taylor, W. R. (1986). Identification of protein sequence homology by consensus template

alignment. J. Mol. Biol. 188, 233-258.

Vogt, G., Etzold, T., & Argos, P. (1995). An assessment of amino acid exchange matrices in

aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 249, 816-831.

Tables and Figures

Table 4: Parameters and distance matrices for defining α-helices and β-sheets. The distances for the α-helix are shown in the upper right triangle of the matrix, and the distances for the β-strand are shown in the lower left triangle. See text for description of cutoffs c_i.

α-helix: c₁ = 0.50, c₂ = 0.90, c₃ = 0.80, c₄ = 1.50, c₅ = 0.00, c₆ = 6

β-sheet: c₁ = 0.80, c₂ = 2.00, c₃ = 1.30, c₄ = 1.60, c₅ = 6.10, c₆ = 5

-        3.802    5.518    5.541    6.683
3.804    -        3.802    5.519    5.556
6.601    3.802    -        3.801    5.530
9.678    6.607    3.803    -        3.801
12.679   9.665    6.558    3.803    -

Table 5. Optimization of parameters for the gap penalty function

RUN   u      v     ω_H    ω_S    ω_B    ω_C    ω_D    d₀     γ      SCORE
1     -900   -50   0      0      0      0      0      0      0      3780
3     -900   -50   0.50   0.75   0      0      0      0      0      3357
6     -900   -50   0      0      0.90   0      0      0      0      3418
7     -900   -50   0      0      0      1.20   0      0      0      3520
8     -900   -50   0      0      0      0      Q      0      0      3731
9     -900   -50   0      0      0      0      0.6    J      0.50   3717
11    -450   0     0.50   0.75   0.90   1.20   0.6    7.0    0.5    2954
12    -450   0     0.40   0.75   0.90   1.20   0.6    7.0    0.5    2944
13    -450   0     0.50   1.30   0.90   1.20   0.6    7.0    0.5    2926
14    -450   0     0.50   0.75   0.90   1.20   0.6    7.0    0.5    2953
15    -450   0     0.50   0.75   0.90   1.20   0.6    7.0    0.5    2953
16    -450   0     0.50   0.75   0.90   1.20   0.9    7.0    0.5    2924
17    -450   0     0.50   0.75   0.90   1.20   0.6    7JS    1.10   2926
18    -450   0     0.40   1.30   0.90   1.20   0.9    7.6    1.1    2900
19    -450   0     0.35   1.30   0.90   1.20   0.9    7.6    1.1    2900
20    -450   0     0.35   1.20   0.90   1.20   0.9    7.6    1.1    2901
21    -450   0     0.35   1.20   0.90   1.20   0.9    7.6    1.1    2902
22    -450   0     0.35   1.20   0.90   1.20   0.9    7.6    1.1    2901
23    -450   0     0.35   1.20   0.90   1.20   0.6    7.6    1.1    2879
24    -450   0     0.35   1.20   0.90   1.20   0.6    8.5    1.2    2872
25    -450   0     0.35   1.20   0.90   1.20   0.6    8.6    1.2    2876
26    -300   0     0.35   1.20   0.90   1.20   0.6    8.6    1.2    3101
27    -300   0     0.35   1.20   0.90   1.20   0.6    8.6    1.2    2849
28    -300   0     0.35   1.20   0.90   1.20   0.6    8.6    1.2    3074
29    -950   -80   0.35   1.20   0.90   1.20   0.6    8.6    1.2    4120

Underlined: parameters that were optimized in the current run.

Example 3

Role of PSI BLAST

To find template structures for modeling of the target sequences, the program PSI-BLAST

was used (Altschul et al., 1997). PSI-BLAST iteratively collects sets of intermediate sequences to find homologs. The main steps of the procedure are: (i) for a given

sequence, an initial set of homologs is collected from the sequence database using a

conventional scoring matrix (for example, we use BLOSUM62); (ii) a weighted multiple

alignment is made from the query sequence and the homologs whose match scores are better than a specified E-value cutoff (for example, we use 0.0005); (iii) a position specific scoring

matrix is constructed from this alignment; (iv) this matrix is then used to search the database

for new homologs; (v) new homologs with a good match score are used to construct a new

position-specific scoring matrix, which is then used in a further search for homologs; and

(vi) rounds of matrix reconstruction and new searches are iterated until no new homologs are found or until the number of iterations reaches a specified limit (for example, we use 20). The parameters used for the search have been shown to be optimal for the application of

PSI-BLAST as a fold-recognition method (Park et al., 1998). After the PSI-BLAST search

converges or reaches the maximal number of iterations, the position-specific scoring matrix is

stored. A new search can be done using Gapped BLAST with the query and the stored

scoring matrix against a representative set of PDB chains. The representative protein chains have at most 95% sequence identity to each other, or have length difference of at least 30 residues or 30%; they are also the highest quality structures within each group

(highest resolution).
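The iteration in steps (i) to (vi) can be sketched as a generic loop. This is not PSI-BLAST itself: `search` and `build_profile` are hypothetical callables supplied by the caller (a database search returning E-values per hit, and a profile builder), and only the control flow with the E-value cutoff and the iteration limit is shown.

```python
def iterative_profile_search(query, search, build_profile,
                             e_cutoff=0.0005, max_iterations=20):
    """Iterate search and profile construction until no new homologs appear.

    search(query, profile) -> dict mapping hit identifier to E-value;
    profile is None in the first round (conventional scoring matrix).
    build_profile(query, homologs) -> position-specific profile object.
    Returns (profile, homologs, rounds performed).
    """
    profile = None
    homologs = set()
    rounds = 0
    while rounds < max_iterations:
        rounds += 1
        hits = search(query, profile)
        accepted = {name for name, e_value in hits.items() if e_value < e_cutoff}
        new = accepted - homologs
        if not new:
            break  # converged: the last search found nothing new
        homologs |= new
        profile = build_profile(query, sorted(homologs))
    return profile, homologs, rounds

# Toy stand-ins: hit "b" is only detectable once "a" is part of the profile.
def toy_search(query, profile):
    hits = {"a": 1e-10, "z": 1.0}
    if profile is not None and "a" in profile:
        hits["b"] = 1e-5
    return hits

def toy_profile(query, homologs):
    return tuple(homologs)

profile, homologs, rounds = iterative_profile_search("QUERY", toy_search, toy_profile)
```

The toy case shows why the search is iterated: the second homolog is found only through the profile built from the first.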

Since the scoring matrix that is generated depends on the query sequence, this type of search is not symmetric. For this reason, position-specific scoring matrices are also

calculated for the sequences of the representative PDB structures and those are used to search against the target sequences using the same parameters described above.

Usually a match with an E-value of 10⁻⁴ is considered significant; however, matches

with E-values down to 10² can be accepted with the intention of finding more remote relationships. Using such a permissive cutoff of course increases the number of false positive

hits, but this is dealt with in the model evaluation step. Also other typical problems of remote matches with local alignment procedures like short coverage of the query sequence

and matches of different structures for the same region of the query sequence are dealt with

in the sequence-structure alignment and model evaluation steps.

Template selection. Structures in PDB were clustered by comparing them against each

other. Structures that have a high enough structural similarity are aligned to each other and

form one single template containing several structures. These mega-templates are used to

construct the models. The prealignment of the structures avoids the need to realign them during the target-template alignment step. A template is selected for modeling if any of its

component structures was matched against the target sequence during the template

search step. The mega-templates form a depository of multiple structure alignments such as CATH, SCOP, PICASSO, and HOMSTRAD, except that they are for alignments between relatively

similar structures only.

References for this example

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. Z., Miller, W., & Lipman, D. J.

(1997). Nucl. Acids Res. 25, 3389-3402.

Park, J., Karplus, K., Barret, C, Hughey, R., Haussler, D., Hubbard, T., & Chothia, C.

(1998). J. Mol. Biol, 284, 1201-1210.

Example 4

The scoring function

(The following references are also useful as regards explaining how the program generates parameters from the PDB:F. Melo and E. Feytmans, J. Mol. Biol. Vol. 267, pp 207-222 (1997);

and F. Melo and E. Feytmans, J. Mol. Biol., vol. 277, pp. 1141-1152 (1998).)

The current implementation of the model evaluation module uses a scoring function that depends on three variables:


1.- Compactness

2.- Sequence identity

3.- z-score of combined energy (non-bonded interactions + accessibility surface)

The compactness is a measure of how globular and spherical a protein model is. The second variable represents the percentage of sequence identity in the alignment between target and template. The z-score is calculated using as a reference system the sequence space rather than the structure space. In this particular case, 200 random sequences are threaded in the same fold to obtain the combined energy z-score of the model. The energy terms taken into account involve pairwise contacts and exposure to solvent. The pairwise term is a statistical potential that uses Cα and Cβ atoms and includes all terms within a Euclidean distance of 15 Å. The accessible surface term involves only Cβ atoms, using a distance threshold of 10 Å.

The scoring function has been obtained by running a genetic algorithm (GA) which evolves mathematical expressions that are evaluated for their ability to do a non-linear transform of the variables that maximizes the standard distance between good and bad models in the training set. After several runs of this GA, the best mathematical expression was selected. This was structure 341; thus the scoring function was named GA341. The application of GA341 to the test set used in our study (about 2800 good models and about 5800 bad models) resulted in a standard distance of 5.172 between good and bad models. The GA341 expression, which constitutes the current scoring function, relates the three variables mentioned above, and its structure is as follows:

$$GA341 = 1 - \left[ \cos(\mathit{seq.ide}) \right]^{(\mathit{compactness} + \mathit{seq.ide}) / e^{\mathit{zscore}}}$$

The calculation of the three independent variables is described in detail in the next pages. The dependency of this scoring function on the three variables is shown in several graphs in the pages attached. A summary of the performance of the method in the benchmark we have used to test the method is also shown. The classification procedure works as follows:

GA341 score > 0.5 ⇒ model is classified as GOOD

GA341 score < 0.5 ⇒ model is classified as BAD
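The classification can be sketched in Python as below. The form of the expression is an assumption: the exponent is taken as (compactness + seq.ide) divided by exp(zscore), which matches later published descriptions of GA341 but cannot be confirmed from the damaged formula in this text. Sequence identity is assumed to be a fraction between 0 and 1, and the function names are hypothetical.

```python
import math

def ga341(seq_identity, compactness, z_score):
    """GA341-style score in [0, 1] under the assumed form of the expression.

    seq_identity: fraction of identical aligned positions (0 to 1).
    z_score: combined-energy z-score (more negative means better energy).
    """
    exponent = (compactness + seq_identity) / math.exp(z_score)
    return 1.0 - math.cos(seq_identity) ** exponent

def classify(score):
    """GOOD above 0.5, BAD below; a score of exactly 0.5 is not covered
    by the text and is treated as BAD here."""
    return "GOOD" if score > 0.5 else "BAD"

strong = ga341(0.5, 0.5, -5.0)  # favourable z-score
weak = ga341(0.5, 0.5, 2.0)     # unfavourable z-score
```

Under this form a strongly negative z-score drives the score towards 1, while a zero sequence identity gives a score of 0 regardless of the other variables.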

Compactness: This is a measurement of how compact a globular protein is. The mathematical expression to calculate the compactness is as follows:

where di is the largest distance between non-bonded residue pairs observed in the protein; and V^AA is the volume of each residue.

z-score: The first step in model evaluation consists of calculating the energy of all non-bonded interactions (E^{DD}) and the sum of the energy of exposure to solvent for each residue (E^{AS}). Then, the sequence of the model is shuffled and two hundred random sequences are generated. These sequences are threaded into the model fold and the energy of non-bonded interactions and exposure to the solvent are calculated for each one of these random models. In this second step, two distributions of energy that will be used as a reference system are generated: E_r^{DD} and E_r^{AS}. Then the standard deviation of each distribution is calculated (σ_r^{DD} and σ_r^{AS}). These standard deviations are used to normalize E^{DD} and E^{AS}, obtaining \tilde{E}^{DD} and \tilde{E}^{AS}, respectively. They are also used to normalize each value in each distribution. Once all the energies are normalized, the combined energy of the model is calculated as follows:

$$E^{COMB} = \tilde{E}^{DD} + \tilde{E}^{AS}$$

Then, for each random model, the combined energy is also calculated:

$$E_r^{COMB} = \tilde{E}_r^{DD} + \tilde{E}_r^{AS}$$

Then, the standard deviation (σ_r^{COMB}) and the average (μ_r^{COMB}) of the distribution of combined energies are calculated and used to obtain the z-score of the model:

$$z^{COMB} = \frac{E^{COMB} - \mu_r^{COMB}}{\sigma_r^{COMB}}$$

Sequence identity: The third and last variable in the GA341 expression is the percentage of sequence identity in the target-template alignment. This is simply the number of identical residues in the alignment divided by the total number of aligned positions.
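The z-score procedure can be sketched as follows. The two energy functions are hypothetical callables standing in for the distance-dependent and accessible-surface potentials evaluated on the fixed model fold. The use of the population standard deviation and of a seeded random generator are assumptions made for the illustration.

```python
import random
import statistics

def combined_z_score(sequence, energy_dd, energy_as, n_random=200, seed=0):
    """z-score of the combined energy of a model against shuffled sequences.

    energy_dd(seq) and energy_as(seq) return the non-bonded and solvent
    exposure energies of seq threaded onto the model fold.
    """
    rng = random.Random(seed)
    dd_random, as_random = [], []
    for _ in range(n_random):
        shuffled = list(sequence)
        rng.shuffle(shuffled)  # same composition, random order
        dd_random.append(energy_dd(shuffled))
        as_random.append(energy_as(shuffled))
    # Normalize each energy term by the spread of its random distribution.
    sd_dd = statistics.pstdev(dd_random)
    sd_as = statistics.pstdev(as_random)
    model = energy_dd(sequence) / sd_dd + energy_as(sequence) / sd_as
    combined = [a / sd_dd + b / sd_as for a, b in zip(dd_random, as_random)]
    mu = statistics.mean(combined)
    sigma = statistics.pstdev(combined)
    return (model - mu) / sigma

# Toy energies: each position matching a fixed pattern lowers the energy.
native = "AB" * 10

def toy_dd(seq):
    return -float(sum(a == b for a, b in zip(seq, native)))

def toy_as(seq):
    return -0.5 * sum(a == b for a, b in zip(seq, native))

z = combined_z_score(native, toy_dd, toy_as)
```

With these toy energies the native sequence fits the pattern better than its shuffled versions, so the z-score is negative.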

The detailed parametrization of the statistical potentials (distance dependent and accessible surface) is given in the tables below:

Distance-dependent potential that describes the energy of interaction between non-bonded residue pairs

Accessible surface potential that describes the energy of exposure to solvent for each residue

Example 5

ModBase

Contents

The database currently contains models for segments of approximately 17,000 proteins

from the completely sequenced genomes of Saccharomyces cerevisiae, Mycoplasma

genitalium, Caenorhabditis elegans, Escherichia coli, Methanobacterium

thermoautotrophicum, Synechocystis sp., Pyrococcus horikoshii, Methanococcus jannaschii, Haemophilus influenzae, and Mycoplasma pneumoniae, as well as all Arabidopsis thaliana

and Homo sapiens proteins in the SwissProt database [27]. The sources of the genomes are listed at http://guitar.rockefeller.edu/modbase/sources.html. Each model has its

non-hydrogen atom coordinates stored in a flat file in the PDB format. The database also

contains all fold assignments, alignments, and model evaluations.

Models are generated with an entirely automated four step procedure implemented in the

ModPipe pipeline software [10, 28]: (i) fold assignment, (ii) sequence-structure alignment, (iii)

model building, and (iv) model evaluation. The procedure can be applied independently and in parallel on a cluster of workstations to thousands of protein sequences, including complete genomes and large protein sequence databases. For fold assignment, each

sequence from a genome is compared with a non-redundant set of proteins of known 3D

structure using Psi-Blast [29]. Next, for each target protein sequence, a multiple global alignment with the matching structures is constructed by the ALIGN2D command in the

program Modeller [30]. This alignment tends to be more accurate than the

Psi-Blast alignment because (i) it includes all the sequences and structures that are

sufficiently similar to the target sequence, (ii) it uses a structure-dependent gap penalty function to position gaps in a structurally reasonable environment, and (iii) it matches

complete structural domains as obtained from the known template structures [Roberto Sanchez, Francisco Melo, Nebojsa Mirkovic and Andrej Sali, in preparation]. In the third

step, the sequence-structure alignment is used to build a 3D model for the matched parts of

the target protein sequence by the program Modeller. Finally, the model is evaluated as

discussed next.

Model evaluation is essential for assessing the value of 3D protein models in any protein

structure prediction [ 31, 32]. It is especially important for ModPipe because a relatively

permissive cutoff is used to select known protein structures for model building in the first

fold assignment step. This permissiveness reduces the number of missed hits, but it also increases the number of false fold assignments and alignment mistakes. The fold

assignment errors begin to appear when relatively dissimilar template-target sequences are

matched (i.e., < 30% sequence identity). In addition, even if the fold is assigned

correctly, errors in the alignment may still result in a bad model. The alignment errors can be significant when the sequence identity drops below 35%. A reliable model is obtained only if both the correct fold assignment and an approximately correct alignment are made.

The overall accuracy of a model is measured by an overlap between the model and the actual

structure. The overlap is defined as the fraction of residues whose Cα atoms are within 3.5 Å of each other in the globally superposed pair of structures. Models that overlap with the correct structures in more than 30% of their residues are defined here as 'good' models.

Such models are likely to have a correct fold, which is frequently sufficient for coarse

prediction of protein function [33]. A method for calculating the probability of whether a

given model is good, pG, was developed [10] and is used to evaluate all the models in ModBase. If a given model has pG > 0.5, it is called a 'reliable' model. The method

depends on a statistical scoring function [32] and was calibrated using 3,993 and 6,270 good

and bad models for 1,085 proteins of known structure [10]. An assessment of the method by

the jack-knife procedure indicated that for models longer than 100 residues the classification

results in less than 5% of false positives and less than 8% of false negatives.
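The overlap measure defined above can be sketched as below. The sketch assumes that the two structures have already been globally superposed and that equivalent residues are paired by list position; the function names are hypothetical.

```python
import math

def overlap_fraction(model_ca, actual_ca, cutoff=3.5):
    """Fraction of paired C-alpha atoms within cutoff angstroms of each other."""
    if len(model_ca) != len(actual_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for a, b in zip(model_ca, actual_ca) if math.dist(a, b) <= cutoff)
    return close / len(model_ca)

def is_good_model(overlap):
    """A model is 'good' if more than 30% of its residues overlap."""
    return overlap > 0.30

# Toy coordinates: two of the four residues lie within 3.5 angstroms.
model = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
actual = [(0.0, 1.0, 0.0), (4.0, 3.0, 0.0), (8.0, 5.0, 0.0), (12.0, 9.0, 0.0)]
fraction = overlap_fraction(model, actual)
```

The superposition step itself, which determines the coordinates passed in, is outside the scope of this sketch.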

Combined 3D modeling and model evaluation is the best way of either confirming or

rejecting a match between remotely related sequence and structure [10, 34]. This is

important because most of the related protein pairs share less than 30% sequence identity [10]. As a result ModBase includes reliable models based on templates that are not

detectable as significant matches by PSI-BLAST alone.

Access and Interface

ModBase has a web interface at http://guitar.rockefeller.edu/modbase/. Models for yeast

proteins are also accessible through links from the Sacch3D [35] database at http://genome-www.stanford.edu/Sacch3D. The database is searchable by SwissProt/TrEMBL and GenPept accession numbers, as well as by ORF names, keywords,

model reliability, model size, target-template sequence identity, and alignment significance. It is also possible to perform sequence similarity searches against the model sequences using

Blast [29]. Searching results in a table of models satisfying all search criteria. The table lists the modeled regions, the templates used to construct the models, target-template similarities,

and model reliabilities. For each model, it also includes links to a more detailed description

of the model, to a summary of all models for a given protein, and to the PDB for a detailed

description of the template structure used in modeling. If the modeled sequence is present in SwissProt/TrEMBL, its description is displayed together with a link to the database. The

model description page contains a graphical representation of the target-template

alignment. In addition, it is linked to the model coordinates in the PDB format, to the

target-template alignment used to derive the model , and to a display of the model by the

3D visualization program Rasmol [36] . The model description page also contains links to the ModBase entries related to the target sequence and to the CATH domains [17] contained in the model. Finally, statistical data, such as distributions of several model

properties in ModBase can also be displayed.

Using Comparative Models

It is frequently possible to extract more information from a comparative model than

from the modeled sequence alone, or even from its alignment to a related protein

structure [28]. For example, the preferred ligand of brain lipid binding protein could be

predicted correctly from the volume and shape of the ligand binding cleft in its comparative model [37]. Another example is provided by mouse mast cell proteases, some of which have a conserved surface region of positively charged residues that binds proteoglycans [38]. This region is not easily recognizable in the sequence or its alignment to a known structure

because the constituting residues occur at variable and sequentially non-local positions in sequence that form a binding site only when the protease is fully folded.

In general, comparative modeling has been applied successfully to many biological

problems [6]. It can be helpful in proposing and testing hypotheses in molecular

biology, such as hypotheses about ligand binding sites [37, 38], substrate specificity [39], drug design [40], and protein-protein interactions [41]. It can also provide starting

models in x-ray crystallography [42] and NMR spectroscopy [43]. In addition, some binding

and active sites that cannot be found by searching

for local sequence patterns should frequently be detectable by searching for small 3D

motifs that are known to bind or act on specific ligands [44-46]. Finally, comparative models in combination with model evaluation can also be used to

confirm or reject remote sequence-structure relationships, complementing the existing

sequence matching and threading methods for fold assignment [10, 34].

References for this example

[1] Collins, F. S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., and Walters, L.

(1998) Science 282, 682-689.

[2] Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y.

(1998) J. Mol. Biol. 283, 707-725.

[3] Koonin, E. V., Tatusov, R. L., and Galperin, M. Y. (1998) Curr. Opin. Str. Biol. 3,

355-363.

[4] Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F. F., Rapp, B. A.,

and Wheeler, D. L. (1999) Nucl. Acids Res. 27, 12-17.

[5] Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T., and Weng, J. (1987)

Protein data bank. In F. H. Allen, G. Bergerhoff, and R. Sievers (eds.), Crystallographic databases - Information, content, software systems, scientific applications, pp. 107-132. Data

Commission of the International Union of Crystallography, Bonn/Cambridge/Chester.

[6] Johnson, M. S., Srinivasan, N., Sowdhamini, R., and Blundell, T. L. (1994) CRC Crit.

Rev. Biochem. Mol. Biol. 29, 1-68.

[8] Guex, N., Diemand, A., and Peitsch, M. C. (1999) Trends Biochem. Sci. 24, 364-367.

[9] Fischer, D. and Eisenberg, D. (1997) Proc. Natl. Acad. Sci. USA 94, 11929-11934.

[10] Sanchez, R. and Sali, A. (1998) Proc. Natl. Acad. Sci. USA 95, 13597-13602.

[11] Rychlewski, L., Zhang, B., and Godzik, A. (1998) Fold. Des. 3, 229-238.

[12] Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y.,

and Bork, P. (1998) J. Mol. Biol. 280, 323-326.

[13] Grandori, R. (1998) Prot. Eng. 11, 1129-1135.

[14] Teichmann, S. A., Park, J., and Chothia, C. (1998) Proc. Natl. Acad. Sci. USA

95, 14658-14663.

[15] Jones, D. T. (1999) J. Mol. Biol. 287, 797-815.

[16] Hubbard, T. J. P., Ailey, B., Brenner, S. E., Murzin, A. G., and Chothia, C.

(1999) Nucl. Acids Res. 27, 254-256.

[17] Orengo, C. A., Pearl, F. M. G., Bray, J. E., Todd, A. E., Martin, A. C., Conte, L.

L., and Thornton, J. M. (1999) Nucl. Acids Res. 27, 275-279.

[18] Holm, L. and Sander, C. (1999) Nucl. Acids Res. 27, 244-247.

[19] Holm, L. and Sander, C. (1996) Science 273, 595-602.

[20] Terwilliger, T. C., Waldo, G., Peat, T. S., Newman, J. M., Chu, K., and Berendzen, J.

(1998) Protein Sci. 7, 1851-1856.

[21] Sali, A. (1998) Nat. Struct. Biol. 5, 1029-1032.

[22] Zarembinski, T. I., Hung, L. W., Mueller-Dieckmann, H. J., Kim, K. K., Yokota, H.,

Kim, R., and Kim, S. H. (1998) Proc. Nat. Acad. Sci. 95, 15189-15193.

[23] Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R., Gaasterland,

T., Lin, D., Sali, A., Studier, F. W., and Swaminathan, S. (1999) Nat. Genet. 23,

151-157.

[24] Montelione, G. T. and Anderson, S. (1999) Nat. Str. Biol. 6, 11-12.

[25] Cort, J. R., Koonin, E. V., Bash, P. A., and Kennedy, M. A. (1999) Nucl.

Acids Res. 27, 4018-4027.

[26] Peitsch, M. C., Wilkins, M. R., Tonella, L., Sanchez, J. C., Appel, R. D., and Hochstrasser, D. F. (1997) Electrophoresis 18, 498-501.

[27] Bairoch, A. and Apweiler, R. (1999) Nucl. Acids Res. 27, 49-54.

[28] Sanchez, R. and Sali, A. (1999) J. Comp. Phys. 151, 388-401.

[29] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman,

D. J. (1997) Nucl. Acids Res. 25, 3389-3402.

[30] Sali, A. and Blundell, T. L. (1993) J. Mol. Biol. 234, 779-815.

[31] Luthy, R., Bowie, J. U., and Eisenberg, D. (1992) Nature 356, 83-85.

[32] Sippl, M. J. (1993) Proteins 17, 355-362.

[33] Orengo, C. A., Jones, D. T., and Thornton, J. M. (1994) Nature 372, 631-634.

[34] Guenther, B., Onrust, R., Sali, A., O'Donnell, M., and Kuriyan, J. (1997) Cell 91,

335-345.

[35] Chervitz, S. A., Hester, E. T., Ball, C. A., Dolinski, K., Dwight, S. S., Harris, M. A.,

Juvik, G., Malekian, A., Roberts, S., Roe, T., Scafe, C., Schroeder, M., Sherlock, G., Weng, S., Zhu, Y., Cherry, J. M., and Botstein, D. (1999) Nucl. Acids Res. 27, 74-78.

[36] Sayle, R. and Milner-White, E. J. (1995) Trends in Biochemical Sciences 20, 374.

[37] Xu, L. Z., Sanchez, R., Sali, A., and Heintz, N. (1996) J. Biol. Chem. 271, 24711-24719.

[38] Matsumoto, R., Sali, A., Ghildyal, N., Karplus, M., and Stevens, R. L. (1995) J. Biol.

Chem. 270, 19524-19531.

[39] Caputo, A., James, M. N. G., Powers, J. C., Hudig, D., and Bleackley, R. C. (1994)

Nature Struct. Biol. 1, 364-367.

[40] Ring, C. S., Sun, E., McKerrow, J. H., Lee, G. K., Rosenthal, P. J., Kuntz, I. D., and

Cohen, F. E. (1993) Proc. Natl. Acad. Sci. USA 90, 3583-3587.

[41] Vakser, I. A. (1997) Proteins Suppl. 1, 226-230.

[42] Carson, M., Bugg, C. E., Delucas, L., and Narayana, S. (1994) Acta Crystallogr. D50,

889-899.

[43] Nagata, T., Gupta, V., Kim, W.-Y., Sali, A., Chait, B. T., Shigesada, K., Ito, Y., and

Werner, M. H. (1999) Nat. Str. Biol. 6, 615-619.

[44] Wallace, A., Borkakoti, N., and Thornton, J. M. (1997) Protein Sci. 6, 2308-2323.

[45] Fetrow, J. S. and Skolnick, J. (1998) J. Mol. Biol. 281, 949-968.

[46] Kleywegt, G. J. (1999) J. Mol. Biol. 285, 1887-1897.

[47] Dunbrack Jr., R. L., Gerloff, D. L., Bower, M., Chen, X., Lichtarge, O., and

Cohen, F. E. (1997) Folding & Design 2, R27-R42.

[48] Koehl, P. and Levitt, M. (1999) Nat. Struct. Biol. 6, 108-111.

[49] Jones, D. (1997) Curr. Opin. Struct. Biol. 7, 377-387.

[50] Eddy, S. R. (1996) Curr. Opin. Struct. Biol. 6, 361-365.

Example 6

Example of a 3-dimensional structure obtained for the YBL007C yeast protein using the

process and system of this invention, including PSI-BLAST and the scoring function set forth above.

In Table 6, the 6 columns for each atom, reading left to right, are: residue number,

amino acid residue, atom name, and the x, y, and z coordinates, respectively.
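As an illustration of the six-column layout just described, the following minimal Python sketch parses one well-formed row into named fields. The names `AtomRecord` and `parse_atom_row` are hypothetical and are not part of the described system; the sketch assumes whitespace-separated fields with all six values present, which many rows of the scanned table below do not satisfy.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """One atom row: residue number, residue name, atom name, x, y, z."""
    residue_number: int
    residue_name: str
    atom_name: str
    x: float
    y: float
    z: float


def parse_atom_row(row: str) -> AtomRecord:
    """Parse a whitespace-separated six-field row; reject anything else."""
    fields = row.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {row!r}")
    resnum, resname, atom, x, y, z = fields
    return AtomRecord(int(resnum), resname, atom, float(x), float(y), float(z))


if __name__ == "__main__":
    # Row taken from Table 6 (residue 14, glutamate, alpha carbon).
    print(parse_atom_row("14 GLU CA 3.845 -18.919 28.665"))
```

Rows that fail the six-field check are rejected rather than guessed at, which seems the safer choice for coordinates recovered from a scanned table.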

Table 6

Table 6 (continued)

12 THR CA 4.25| -15.83β| 24.247 18 ILE CD1 6.1861 -13.3l|31.432

12 THR CB S.277 -15.715 25.336 18 ILE -9.283 -13.38 27.633

12 THR OGl 4.656 15.863| 26.607 18 -8.897 -12.62 26.747

12 THR CC2 S.926 -14.323 25.24 19 GLN 10.58 -13.47 27.997

12 THR 3.547 -17.136 24.502 19IGLN 11.62 -12.67 27.419

12 THR 2.323 -17.217 24.561 19 GLN CB -12.83 -13.51 27.004

13 PRO 4.333 -18.178 24.556 19 GLN 13.49 -14.16128.221

13 PRO CA 3.799 19.473J 24.88 19 GLN ■14.67 -14.98 27.753

13 PRO CD 5.471| -18.266| 23.656 19 OE1 15.38 -14.61 26.819

13 4.859 -20.481 24.43! 19 CLN NE2 -14.9 -16.13 28.434

13 PRO CG 5.598 -19.754 23.3 19 GLN -12.1 -11.75 28. SI

13 3.395 -19.607 26.326 19|GLN 11.82 -11.98129.685

13 2.51 -20.409 26.621 20 GLU 12.85 10.69128.144

14 4.061 -18.868 27.239 20 GLU -13.37 -9.782 29.133

14 GLU CA 3.845 -18.919 28.665 20 GLU CB -14.22 -8.617 28.574

14 GLU CB 4.963 -18.236 29.468 20 GLU CG -13.48 -7.53 27.784

14 GLU CO 6.289 -18.999 .29.423 20 GLU CD -14.49 -6.442 27.422

14 CD 7.269 -18.2621 30.32 20 GLU OE1 15.69 -6.622 27.751

14 GLU OE1 6.874 -17.207 I 30.885 20|GLU OE2 14.07 5.415126.817

14 GLU OE2 8.423 -18.746 30.46 20 GLU -14.3 10.56130.009

14 2.557 -18.277 29.101 20 GLU 14.97 -11.49 29.561

14 GLU 1.924 -18.737 30.049 21 ASP 14.34 -10.19 31.303 lSlGLU 2.15 -17.185 28.427 21 ASP -IS. -10.77 32.297

ISIGLU 1.033 -16.351 28.794 21 ASP CB -16.7 -10.77 31.914

15|CLU 0.91S -15.081 27.924 21 ASP CG 17.31 -9.387132.085 lSfGLU CG 2.03 -14.049 28.094 21 ASP OD1 -16.59 -8.462 32.534

IS GLU CD 1.715 -13.223 29.327 21 ASP OD2 -18.53 ^■9.245 31.781

15 OE1 1.911 -13.742 30.458 21|ASP -14.84 -12.2 32.567

IS GLU OE2 1.259 -12.063 29.151 21 ASP •15.57 -12.89 33.275

15 GLU -0.27 -17.061 28.601 22 ASP -13.7 -12.7 32.061

15 GLU -0.35 -18.095 27.94 22 ASP 13.39 -14.06 32.398

16 -1.33 -16.503 29.222 22 ASP CB -12.29 -14.71 31.541

16 LEU -2.66 -17.051 29.12 12.26 -16.19 31.881

16 LEU -3.28 -17.361 30.495 22 ASP OD1 13.37 -16.76

16 CG -4.77 -17.789 30.428 22 -11.15 -16.77 31.97

16 LEU CD2 -5.4 -17.848 31.835 22 ASP -12.9 -14.04 33.808

16 LEU CD1 -4.94 -19.104 29.651 22 ASP (-12.32 -13.05 34.246

16 LEU -3.57 -16.057 28.455 23 LEU -13.13 -15.13 34.56

16 LEU -3.63 -14.903 28.876 23 -12.67 -15.15 35.914

17 -4.29 -16.492 27.394 23 LEU -13.7 -15.68 36.929

17 ALA -5.24 -15.657 26.716 23 LEU CG 14.94 -14.77 37.043

17 CB -5.72 -16.211 25.361 23 LEU CD2 •15.73 -14.75 35.73

17 ALA -6.46 -15.51 27.578

23 LEU -14.56 -13.36 37.536

17 ALA -6.79 -16.537 28.276 23 11.45 -16.01 35.968

18 ILE -7.16 -14.423 27.553 23 LEU 11.39 -17.09 35.393

18 ILE CA -8.32 -14.253 28.376 24 LEU 10.42 -15.5 36.675

18 ILE -7.99 -13.52 29.647 24 LEU CA -9.138 -16.12 36.764

18 ILE CG2 -9.27 -13.361 30.482

24 LEU CB -8.023 -15.19 36.243 lβllLE CGI -6.86 -14.228 30.407 24 LEU CG -8.06 -14.85 34.736

LEU CD2 -9.315 -14.06 34.357

24 LEU CD1 -7.85 -16.1 33.864

24|LEU 8.8391 -16.38 38.208 Table 6 (continued)

Table 6 (continued ) 6 TRP CDl 0.289 -10.82 39.539 42 VAL 14.1 -22.4 35.77 6 TRP NE1 -0.219 -9.655 40.058 42 VAL -15.43 22.12 36.299 6 TRP CE2 -1.542 -9.55 39.682 42 VAL CB -16.14-20.99 35.596 6 TRP CE3 -3.11 -10.85 38.4 42 VAL CGI 17.57-20.85 36.142 6 CZ2 -2.479 -8.59 39.945 42 VAL CC2 -15.3 19.72 35.812 6 CZ3 4.049 -9.877 38.66 42 VAL -16.22 -23.38 36.129 6 TRP CH2 3.738 -8.769 39.419 42 VAL -15.9 24.22 35.293 6 TRP 0.61S -14.56 37.022 43 ILE -17.28 23.53 36.95 6 TRP 0.474 -14.64 35.8031 43 ILE CA -18.14 -24.68 36.907 7 THR 0.619 15.64 37.8211 43 ILE -19.17-24.69 38.001 7 THR CA 0.37 -16.94 37.271 43 ILE CC2 20.13 -23.5 37.778 7 THR CB 1.129 -18.04 37.958 43 CGI -19.87 -26.06 38.032 7 THR OG1 0.747 18.1139.324 43 CDl -20.75 -26.29 39.259 7 THR CC2 2.632 -17.74 37.843 43 ILE -18.87 -24.65 3S.606 7 THR -1.082 -17.17 37.515 43 ILE -19.31 -23.6 35.14

-1.554 -17.02 38.64 44 GLY -18.99 25.84 34.975

-1.841 -17.53 36.466 44 CA -19.69 25.93 33.73

VAL CA 3.251 -17.61 36.682 44 -18.74 -25.62 32.612

VAL CB 3.962 -16.45 36.063 44 GLY -17.53 25.76 32.743 VAL 5.466 -16.62 36.287 45 -19.3 -25.18 31.468 VAL CG2 -3.382 15.1636.659 45 CA -18.56 •24.89 30.279

VAL ■3.795 18.8736.084 45 SER -19.45 24.49 29.099

3.225 -19.44 35.154 45 SER -20.1 -23.26 29.381

LYS -4.923 19.3536.649 45 SER -17.64 -23.74 30.536 LYS 5.592 -20.51 36.144 45 SER -16.62 23.59 29.859 LYS 5.442 -21.71 37.105 46 -17.95 -22.91 31.543

CG -6.079 -23.03 36.657 46 ASP -17.17 -21.74 31.788

LYS CD -7.595 -23.08 36.839 46 CB -17.66 i-20.92 32.994

LYS 7.998 -23.32 38.298 46 ASP -19.03 20.35 32.642

-7.39 -24.58 38.788 46 ASP OD1 20.32 31.43

LYS -7.038 -20.15 35.965 46 ASP -19.74 ■19.91 33.586

^■7.651 -19.54 36.B42 46 ASP -15.75 -22.14 32.063

LYS -7.617 -20.48 34.794 46 ASP -15.47 -23.24 32.537

LYS CA -8.987 -20.14 34.523 47 -14.83 -21.21 31.739

LYS 9.264 -19.82 33.043 47 CA -13.41 -21.34 31.894

CG -8.552 -18.6 32.47 47 SER -12.96 21.53 33.34

LYS CD -8.56 -18.59 30.938 47 SER OG -13.21 -20.34 34.079

CE -9.944 -18.83 30.328 47 -22.45 31.052

NZ -9.85 -18.84 28.85 47 SER -13.42 -23.57 31.058

LYS 9.842 -21.34 34.793 48 GLU -11.86 -22.12 30.269

LYS 9.632 -22.39 34.192 48 -11.17 22.99 29.363

10.83 -21.25 35.712 48 -10.25 ■22.24 28.388

-11.7 -22.38 35.847 48 CG -10.97 21.28 27.44

CB -11.2 -23.53 36.749 48 GLU CD -11.62 -22.08 26.329

ARG •12.09 -24.78 36.631 48 GLU OEl -11.94 23.28 26.573

ARG CD ■12.02 -25.74 37.815 48 GLU OE2 -11.81 -21.52 25.218

NE 10.66 -26.36 37.882 48 GLU -10.32 23.95 30.142

ARG CZ -10.26 -26.94 39.048 48 GLU -10.04 25.05 29.675

NH1 11.11 ■26.99 40.113 49 -9.887 23.53 31.348

ARG NH2 9.003 -27.48 39.159 49 GLU CA -8.946 24.17 32.236

ARG 13.02 -21.97 36.418 49 -9.254 -25.65 32.578

ARG -13.09 -21.28 37.437 49 GLU CG -8.835 -26.67 31.51 Table 6 (continued)


Example 7

Additional Overview of the process

Coarse Parallelization in ModPipe

Parallelization in ModPipe occurs in each of its basic modules: GET_MATRICES for

calculation of PSI-BLAST position specific substitution matrices (PSSM), SEARCH for

template search, PARSE for PSI-BLAST output parsing, ALIGN for template selection and

alignment, MODEL for model building and EVAL for model evaluation. There are three forms

of coarse parallelization: using the CLUSTOR program, using a queuing system or running locally on a symmetric multiprocessor machine.
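The module sequence above can be pictured with a minimal Python sketch (ModPipe itself is a set of Perl scripts, so this is an illustration, not the implementation). Each module is treated as a function applied independently to every input, which is what allows the inputs of a stage to be processed concurrently. The module names come from the text; `run_module`, `run_pipeline`, and `make_placeholder` are hypothetical names, and the one-output-per-input flow is a simplifying assumption, since in practice one sequence can yield several hits and models.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# Module order as listed in the text.
MODULES = ("GET_MATRICES", "SEARCH", "PARSE", "ALIGN", "MODEL", "EVAL")


def run_module(worker: Callable, inputs: Iterable, n_workers: int = 2) -> List:
    """Apply one module's worker to each input independently, keeping input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, inputs))


def run_pipeline(workers: Dict[str, Callable], inputs: Iterable,
                 n_workers: int = 2) -> List:
    """Run the modules in order; each stage consumes the previous stage's outputs."""
    data = list(inputs)
    for name in MODULES:
        data = run_module(workers[name], data, n_workers)
    return data


def make_placeholder(name: str) -> Callable:
    """Hypothetical stand-in worker that only records which module handled the item."""
    return lambda item: item + [name]


if __name__ == "__main__":
    workers = {name: make_placeholder(name) for name in MODULES}
    for result in run_pipeline(workers, [["seqA"], ["seqB"]]):
        print(result)
```

In this sketch the thread pool stands in for whichever back end distributes the work (CLUSTOR, a queuing system, or a local SMP); only the per-input independence is the point being illustrated.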

Coarse parallelization using CLUSTOR (ActiveTools, San Francisco, CA)

ModPipe can make use of CLUSTOR to distribute computations on several computers.

CLUSTOR allows distributed computing over any number of computers (nodes) by copying

input files from a central machine (root) to the nodes, executing programs on the nodes, and

copying the program output files from the nodes back to the root. The procedure for each

ModPipe step works as follows:

1. PSI-BLAST PSSM calculation: The ModPipe GET_MATRICES module prepares a

CLUSTOR run file that will process one sequence at a time on each of the nodes using the get_matrix ModPipe routine. The input file is the sequence that is copied from the

root to the node. get_matrix calculates a PSSM by running PSI-BLAST with the input

file sequence against the non-redundant database of sequences (nr). Both PSI-BLAST

and the nr database must be previously installed on the nodes. Once the PSI-BLAST run

on the node finishes, CLUSTOR copies the PSSM file back to the root. The same

procedure is used to obtain PSSMs for all the template sequences.

2. Template Search: The ModPipe SEARCH module prepares a CLUSTOR run file that

will call the do_search ModPipe routine on the nodes to make two searches per

sequence (filtered and non-filtered PSI-BLAST search). The inputs for this search are

the sequence file and the PSSM file obtained in the previous step. Both files are copied

to the node where do_search runs PSI-BLAST against the template sequences using the

input sequence plus its PSSM as a query. Each of the two runs produces one output file. The two output files are copied back to the root.

3. Parsing: The ModPipe PARSE module prepares a CLUSTOR run file that will execute

the appropriate ModPipe routines on the nodes to parse the BLAST output files from the search step and produce a table of target-template hits for each target or template sequence. The input for parsing is the BLAST output file. The do_parse ModPipe

routine is called by CLUSTOR on the node and the table of hits produced by the routine

is copied back to the root.

4. Template selection and alignment: The ModPipe ALIGN module creates a CLUSTOR

run file that will execute the do_align ModPipe routine on the nodes to do both template selection and target-template alignment. The hit tables produced by the previous step are

copied over to the nodes along with the target and template sequences, target and

template PSSMs, and the template structures. On the node do_align calls ModPipe

routines for template selection, and MODELLER and/or BLAST to generate the

alignments. The alignment files and an extended hit table, containing alignment and

template selection information, are copied back to the root. MODELLER and BLAST must be previously installed on all nodes.

5. Modeling: The ModPipe MODEL module creates a CLUSTOR run file that calls the

make_model ModPipe routine for each alignment calculated in the previous step. The

alignment file and template structure are copied to the nodes where CLUSTOR calls the

make_model routine, which in turn executes MODELLER to calculate the model. The model file is then copied back to the root along with an extended table that includes the

model identifier in addition to all the previous alignment data. MODELLER must be

previously installed on all nodes.

6. Evaluation: The ModPipe EVAL module creates a CLUSTOR run file that calls

ModPipe's make_eval routine for each model calculated in the previous step. The model

file is copied to the node where CLUSTOR calls make_eval, which in turn executes the

eval program to evaluate the model structure. The evaluation result is returned to the root to be included into the data table.

Coarse parallelization using a queuing system

ModPipe can use essentially any queuing system to distribute computations as

batch jobs on several computers. A queuing system allows distributed computing over any

number of computers given that they all have access to one common directory, usually through

an NFS mounted file system. The queuing system can then schedule the execution of ModPipe modules on one or more processors on the queuing system nodes (computers controlled by the

queuing system). In this case, file transfer to and from the nodes is not essential because they

all have access to a common directory. For the sake of efficiency, however, ModPipe modules copy

the necessary data and programs to temporary local directories on each node. The transfer of

input and output is then accomplished by simple copying from and to the NFS mounted file

system, respectively. The queuing system only controls the execution of the modules. Each module, when running in the queuing system mode, takes as arguments the number of batch jobs

and the number of processors to be used per batch job to process a list of inputs (sequences,

alignments, or models). The module then divides the input list according to the number of batch jobs that will be executed and creates a new set of scripts (one for each batch job) along with a file containing command line options for the individual jobs to be executed on the node. Once

the scripts created by the module are executed on the nodes they call a multiprocessing module,

which runs a number of copies of the corresponding subroutine (get_matrix for

GET_MATRICES, do_search for SEARCH, etc.) according to the number of processors that

should be used on that node. Each individual subroutine job takes its arguments from the files

created by the module and processes a single input (sequence, alignment or model). The input

files are copied by the subroutine from the common NFS mounted directory to the node. The

input and output files are the same as those described for the CLUSTOR parallelization.

Coarse parallelization using a symmetric multiprocessor (SMP) system

ModPipe can do coarse parallelization locally on an SMP by using the methods described for

the queuing system. When running locally on an SMP, the ModPipe modules create a single script (equivalent to a single batch job) that is executed on the SMP locally instead of being submitted to the queuing system. The rest of the procedure is the same, with the script

calling a multiprocessing routine that in turn executes one copy of the corresponding

subroutine for each processor on the SMP.
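The batching used in the queuing-system and SMP modes can be sketched as follows. This is a hedged Python illustration under stated assumptions, not the ModPipe code: `split_into_batches` and `run_batch` are hypothetical names, the rule that the first few batches receive one extra input when the list does not divide evenly is an assumption about how the remainder is handled, and threads stand in for the processors of a node.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(items: Sequence, n_batches: int) -> List[list]:
    """Divide items into n_batches contiguous groups of near-equal size.

    The first (len(items) % n_batches) groups get one extra item (assumed rule).
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(list(items[start:start + size]))
        start += size
    return batches


def run_batch(worker: Callable, batch: Sequence, n_procs: int) -> list:
    """Process one batch with up to n_procs concurrent workers, keeping order."""
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        return list(pool.map(worker, batch))


if __name__ == "__main__":
    sequences = [f"seq{i}" for i in range(7)]
    # Three batch jobs, two workers each; str.upper is a placeholder subroutine.
    for batch in split_into_batches(sequences, 3):
        print(run_batch(str.upper, batch, n_procs=2))
```

Running on an SMP corresponds to the special case of a single batch (`n_batches=1`) processed with as many workers as there are processors.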

Example 8

Additional illustrative example of the process.

Figure 6 shows a flowchart for comparative protein structure modeling on the

genome scale [11]. The figure relates to a particular example of the process of the

invention.

(It is preferable to use PSI-BLAST instead of Align. It is preferable to use the scoring

function and assessment method described elsewhere in this application, rather than

PROSAII, for model evaluation.) To find template structures for modeling, each protein

sequence is compared with each of the 2045 potential templates corresponding to the protein chains representative of the Protein Data Bank (PDB) of

known protein structures [3]. The representative PDB proteins have at most 95% sequence

identity to each other, or have a length difference of at least 30 residues or 30%; they are also

the highest quality structures within each group. The matching is done by the program Align

[25], which implements the local dynamic programming method with a new gap penalty function and has a search sensitivity higher than that of BLAST. Each sequence-structure

matching is run with the default gap penalty parameters first. (These are not to be confused with the

variable gap penalty function noted below.) A match is considered significant or insignificant if the alignment score is more than 22 or less than 19 nats, respectively, where the nat is a unit for measuring significance of a match [25]. All the pairs with intermediate

scores between 19 and 22 nats are realigned using 600 combinations of the gap

penalty parameters. The match is finally considered significant if the best of the 600

alignments has a score of at least 22 nats. The PDB chain from a significant match is used as

the template structure for the corresponding region of the sequence. To obtain

target-template alignment for comparative modeling, the matching parts of the template

structure and the protein sequence are re-aligned by the use of the Align2d command of the

Modeller program [24]. This command implements a global dynamic programming method

for comparison of two sequences, but also relies on the observation that evolution tends to place residue insertions and deletions in the regions that are solvent exposed, curved, outside secondary structure segments, and between two C_α positions close in space. Gaps in these

structurally reasonable positions are favored by a variable gap penalty function that is

calculated from the template structure alone. As a result, the alignment errors are reduced by

approximately one third relative to the standard sequence alignment techniques. The refined sequence-structure alignment is used by Modeller to construct a 3D model of the matched protein sequence region, containing all mainchain and sidechain non-hydrogen atoms.

Model building begins by extracting distance and dihedral angle restraints on the target

sequence from its alignment with the template structure. These template-derived restraints

are combined with most of the CHARMM energy terms to obtain a full objective function. Finally, this function is optimized to construct a model that satisfies all the spatial restraints

as well as possible. The overall accuracy of the resulting model is predicted by a procedure

that relies on a Z-score from the program PROSAII [28]. The PROSAII Z-score

approximates the difference between the free energy of an evaluated model and the mean free energy of the same sequence threaded through unrelated folds, expressed in units of standard

deviation. The free energies are calculated with statistical potentials of mean force for single

residues and pairs of residues [28]. Using many models of proteins with known structure,

the distributions of the PROSAII Z-score were obtained for good models, which have more

than 30% of their C_α atoms within 3.5 Å of their actual positions, and for bad models. These distributions are used with Bayes' theorem to calculate the probability that a given

model with a certain Z-score is either good or bad. Once a model is predicted to be good, its

overall accuracy is evaluated more precisely based on an empirical relationship between the

fraction of the correctly modeled C_α atoms and the percentage sequence identity to the template [11]. The modeling flowchart in this Figure can result in duplicate and overlapping models of some sequence regions. The flowchart has been implemented in a UNIX Perl

script that calls the appropriate programs for the individual tasks. The program Clustor is used to

distribute individual programs for parallel execution (http://www.activetools.com).
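Two decision rules from this example lend themselves to a short illustration: the significance test on the alignment score, and the use of Bayes' theorem on the Z-score. The Python sketch below is a hedged reading of the text, not the implementation. The 19 and 22 nat thresholds come from the text; the function names are hypothetical, and the Gaussian form of the Z-score distributions and all numeric parameters in the usage example are assumptions made only for illustration (the text says the distributions were obtained empirically from many models).

```python
import math
from typing import Optional, Sequence, Tuple

SIGNIFICANT_NATS = 22.0    # threshold from the text
INSIGNIFICANT_NATS = 19.0  # threshold from the text


def classify_match(default_score: float,
                   realigned_scores: Optional[Sequence[float]] = None) -> str:
    """Classify a sequence-structure match by its alignment score in nats."""
    if default_score > SIGNIFICANT_NATS:
        return "significant"
    if default_score < INSIGNIFICANT_NATS:
        return "insignificant"
    # Intermediate score: the pair must be realigned with other gap penalties.
    if not realigned_scores:
        return "realign"
    best = max(realigned_scores)
    return "significant" if best >= SIGNIFICANT_NATS else "insignificant"


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Normal density, used here as an assumed stand-in for an empirical distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def prob_good(z: float, good: Tuple[float, float], bad: Tuple[float, float],
              prior_good: float) -> float:
    """Bayes' theorem: P(good | z) from the two Z-score distributions and a prior."""
    like_good = gaussian_pdf(z, *good) * prior_good
    like_bad = gaussian_pdf(z, *bad) * (1.0 - prior_good)
    return like_good / (like_good + like_bad)


if __name__ == "__main__":
    print(classify_match(20.5, realigned_scores=[18.7, 22.4]))
    # Hypothetical (mean, sd) pairs and prior, chosen only for the demonstration.
    print(prob_good(-7.0, good=(-8.0, 2.0), bad=(-2.0, 2.0), prior_good=0.5))
```

In practice the two densities would be replaced by the empirically derived Z-score distributions for good and bad models.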

References for this example

[3] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng. Protein data bank. In F. H. Allen, G. Bergerhoff, and R. Sievers, editors, Crystallographic databases - Information, content, software systems, scientific applications, pages 107-132. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, 1987.

[11] R. Sanchez and A. Sali. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA, 95:13597-13602, 1998.

[24] A. Sali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234:779-815, 1993.

[25] ... 32:88-96, 1998.

[28] M. J. Sippl. Recognition of errors in three-dimensional structures of proteins. Proteins, 17:355-362, 1993.

Appendix 1 eval1.pl

# collects all sel files from the sequence directories, extracts the
# alignments and calculates model.

use lib $ENV{MODPIPELIB};
use RSlib::Error;
use RSlib::modbase;
use ModPipe::misc;
use Getopt::Long;

# program name
$program = $0;
$program =~ s/.*\///g;

# check input
&check_options;
$listfile = $ARGV[0];

# initialize modpipe configuration
if ( ! modpipe_init ) {
  Error(0, "modpipe_init failed") }

# read list file
$nomodel = 0;
$witheval1 = 0;
$count = 0;
$failed = 0;
open(LIST, $listfile);
while ( <LIST> ) {
  chomp;
  @F = split(/\|/, $_);
  $mod = $F[2]; $mod =~ s/ //g;
  $moddir = ModDir($mod);
  $modfile = $moddir."/$mod.pdb";
  $datfile = $moddir."/$mod.dat";
  # does the model file exist?
  if ( -e $modfile || -e "$modfile.gz" ) {
    open(DAT, $datfile);
    while ( <DAT> ) {
      chomp;
      ($sid, $eval1) = (split(/\|/, $_))[13,14];
    }
    close(DAT);
    if ( $eval1 && ! $replace && $eval1 ne "F" ) {
      $witheval1++;
    } else {
      if ( $eval1 eq "F" ) { $failed++ }
      $count++;
      $mod[$count] = $mod;
      $sid[$count] = $sid;
    }
  } else {
    $nomodel++;
    push(@notfound, $mod);
  }
}
close(LIST);

if ( $check ) {
  message "$nomodel model files were not found";
  foreach $model (@notfound) {
    message(" $model not found") }
  message "$failed models with previously failed evaluation";
  message "$witheval1 models were already successfully evaluated";
}

if ( $count < 1 ) {
  message "no models to process.";
  exit;
} else {
  message "processing $count models";
}

if ( $check ) { exit }

if ( $run eq "lsf" || $run eq "lsflocal" ) {

  # LSF run : NG is number of processor groups, NP is number of processor
  # per group. For example 2x32 submits jobs on two groups of
  # 32 processors each, this means submitting two bsub jobs
  # with -n32 and -ptile=32 each.

  # create tmp directory for this run
  $lsfdir = "lsf-eval1".time;
  system("mkdir $lsfdir");
  $tmpdir = $tmpdirroot."/$lsfdir";
  print "LSF SETUP in directory $lsfdir\n";
  print "RUNNING in directory $tmpdir\n";

  # parse --lsfp option for number of groups (ng) and number of
  # processors per group (np)
  ($ng, $np) = split(/x/, $lsfp);

  # decide which queue to use unless it was passed as an argument
  # (this is ACL specific)
  unless ( $queue ) { $queue = "small" }

  # use 12 hour wall time limit unless it was passed as an argument
  unless ( $wtime ) { $wtime = "12:00" }

  # divide the data to be processed into ng groups
  $nsg = int($count/$ng);
  $rest = $count%$ng;
  $seq = 1;
  for $i ( 1 .. $ng ) {
    $ns[$i] = $nsg;
    if ( $i <= $rest ) { $ns[$i]++ }
    $ini = $seq; $fin = $seq + $ns[$i] - 1;
    $seq = $fin+1;
    #print "group $i -> $ns[$i] : $ini - $fin\n";

    # divide sequences for each group $i
    $datfile = "$lsfdir/group$i.dat";
    open(DAT, ">$datfile");
    for $s ( $ini .. $fin ) {
      printf DAT ("%s -sid %d -tmpdir=%s %s\n", $mod[$s], $sid[$s], $tmpdir, $debugopt);
    }
    close(DAT);

    # prepare script for group
    open(RUN, ">$lsfdir/group$i.run");
    #print RUN "use lib \"/home/modpipe/lib/perl5\";\n";
    print RUN "use lib \"$ENV{MODPIPELIB}\";\n";
    print RUN "use RSlib::multiproc;\n";
    print RUN "unless ( -e \"$tmpdir\" ) {\n";
    print RUN " system(\"mkdir -p $tmpdir\");\n";
    print RUN "}\n";
    print RUN "multiproc1($np, \"$perl $scriptdir/make_eval1.pl\", \"$datfile\");";
    close(RUN);

    # now submit this group's jobs to the queue
    # It is important to send the job to the background "&" to make
    # the program continue to the next group.
    if ( !$test && $run eq "lsf" ) {
      system("bsub -q $queue -W $wtime -n $np -R \"span[ptile=$np]\" -o $lsfdir/group$i.out $perl $lsfdir/group$i.run &");
    }
    #print "group $i : using queue $queue with wtime=$wtime and $np processors.\n";

    # THE FOLLOWING LINE CAN BE USED TO RUN THIS LOCALLY ON AN SMP
    # if the lsfp option is set to 1xN where N is the number of
    # processors to be used
    if ( !$test && $run eq "lsflocal" ) {
      system("$perl $lsfdir/group$i.run 2>&1 > $lsfdir/group$i.log &");
    }
  }
}

exit;

# SUBROUTINES

# check options
sub check_options {
  my $result = GetOptions(
    "run=s"   => \$run,
    "lsfp=s"  => \$lsfp,
    "queue=s" => \$queue,
    "wtime=s" => \$wtime,
    "check"   => \$check,
    "test"    => \$test,
    "replace" => \$replace,
    "debug=i" => \$debug
  );
  if ( $#ARGV < 0 || ( ! $run && ! $check ) ) {
    $message = "usage : $program modeldatafile -run=(clustor|lsf|lsflocal) [-lsfp=NGxNP -queue=queuename -wtime=HH:MM] [-replace] [-check] [-test] [-debug=debuglevel]";
    Error(0, $message);
  }
  if ( $#ARGV > 0 || ! $result ) {
    $message = "don't understand options";
    Error(0, $message);
  }
  $DEBUG = $debug;
  if ( $debug ) { $debugopt = "-debug=$debug" }
}

Appendix 1 (continued) ModEval.pm

#!/usr/local/bin/perl
package ModPipe::ModEval;
use RSlib::Error;
use strict;
use vars '@ISA', '@EXPORT', '$NAME', '$VERSION', '$DATE', '$AUTHOR';
require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(EnePair1 EneSurf1 ZScore1 Compactness Eval1 Eval0 Zscore0);

# ---
#
# ModPipe::ModEval
#
$NAME = "ModPipe::ModEval";
$VERSION = "1.00";
$DATE = "09-27-1999";
$AUTHOR = "Roberto Sanchez (sancher\@rockefeller.edu)";

# ---
# Eval0 (old ProsaII based pG)
#
sub Eval0 {
  use RSlib::pG;
  my ($result, $zscore);
  my ($pdbfile, $seqlen, $pglib) = @_;

  ($result, $zscore) = Zscore0($pdbfile);
  if ( $result ) {
    Error(1, "Zscore0 failed");
    return 1 }
  Error(3, "Zscore0 = $zscore");

  my ($pg, $nzs) = get_pg($seqlen, $zscore, $pglib);
  Error(3, "pg = $pg, nzs = $nzs");
  return (0, $pg);
}

# Zscore0
sub Zscore0 {
  my $pdbfile = $_[0];
  my $zscore = `prosa-zscore $pdbfile | grep Z-SCORE`;
  chomp($zscore);
  $zscore = (split(/\s+/, $zscore))[1];
  if ( $zscore eq "" ) { return 1 }
  return (0, $zscore);
}

# --- Eval1
# INPUT : pdbfile, sequence identity
# OUTPUT :
sub Eval1 {
  my ($result, $enepair, $enesurf, $zscore, $comp);
  my $nargs = 2;
  check_args($nargs, \@_);
  my ($pdbfile, $seqid) = @_;
  $seqid = $seqid/100;

  ($result, $enepair) = EnePair1($pdbfile);
  if ( $result ) {
    Error(1, "EnePair1 failed");
    return 1 }
  Error(3, "EnePair1 = $enepair");

  ($result, $enesurf) = EneSurf1($pdbfile);
  if ( $result ) {
    Error(1, "EneSurf1 failed");
    return 1 }
  Error(3, "EneSurf1 = $enesurf");

  ($result, $zscore) = ZScore1($pdbfile);
  if ( $result ) {
    Error(1, "ZScore1 failed");
    return 1 }
  Error(3, "Zscore1 = $zscore");

  ($result, $comp) = Compactness($pdbfile);
  if ( $result ) {
    Error(1, "Compactness failed");
    return 1 }
  Error(3, "Compactness = $comp");

  my $score = 1 - ((cos($seqid))**(($comp+$seqid)/exp($zscore)));
  return (0, $score);
}

# --- EnePair1
# INPUT :
# OUTPUT :
sub EnePair1 {
  my $nargs = 1;
  check_args($nargs, \@_);
  my ($pdbfile) = @_;
  my $exe = $main::enepair1exerun;
  my $potential = $main::pair1potentialrun;

  Error(3, "running on $pdbfile using $exe with $potential");
  # make sure files are in place
  CopyEnePair1Files();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  unless ( -e $potential ) {
    Error(0, "couldn't find potential file $potential");
  }

  my $output = `$exe $potential $pdbfile`;
  unless ( $output =~ /pair_ene/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $enepair = (split(/\s+/, $output))[2];
  return (0, $enepair);
}

# --- EneSurf1
# INPUT :
# OUTPUT :
sub EneSurf1 {
  my $nargs = 1;
  check_args($nargs, \@_);
  my ($pdbfile) = @_;
  my $exe = $main::enesurf1exerun;
  my $potential = $main::surf1potentialrun;

  Error(3, "running on $pdbfile using $exe with $potential");
  # make sure files are in place
  CopyEneSurf1Files();
  # check files and executable
  unless ( -e $exe ) {
    Error(0, "couldn't find executable $exe") }
  unless ( -e $pdbfile ) {
    Error(1, "couldn't find PDB file $pdbfile");
    return 1;
  }
  unless ( -e $potential ) {
    Error(0, "couldn't find potential file $potential");
  }

  my $output = `$exe $potential $pdbfile`;
  unless ( $output =~ /pair_surf/ ) {
    Error(1, "failed for $pdbfile");
    return 1;
  }
  my $enesurf = (split(/\s+/, $output))[2];
  return (0, $enesurf);
}

# --- ZScore1
# INPUT :
# OUTPUT :
sub ZScore1 {

# Compactness my $nargs = 1; # INPUT : check_args ($nargs, \@_) ; # OUTPUT : my ($pdbfile) = (_•_; sub Compactness { my $exe = $main: : zscorelexerun; my $nargs = 1 ; check_args ( $nargs , \@_) ;

Error (3, "running on $pdbfile using $exe" ) ; my ($pdbfile) = @_;

# make sure files are in my $exe = place $main: : compactnessexerun;

CopyZScorelFiles ( ) ;

# check files and Error (3, "running on $pdbfile executable using $exe" ) ; unless ( -e $exe ) { # make sure files are in Error (0, "couldn' t find place executable $exe") } CopyCompactnessFiles ( ) ; unless ( -e $pdbfile ) { # check files and Error (1, "couldn' t find PDB executable file $pdbfile") ; unless ( -e $exe ) { return 1 ,- Error (0, "couldn' t find } executable $exe") }

# check prerequisite files unless ( -e $pdbfile ) { my $rfilel = Error (1, "couldn¹ t find PDB

"$pdbfile. ene_pair. native" ; file $pdbfile") ; my $rfile2 = return 1 ; " $pdbfile . ene_pair . random" ; } my $rfile3 = " $pdbfile . ene_surf . native " ; my $output = ^* $exe $pdbfile - my $rfile4 = unless ( $output =- " $pdbfile . ene_surf . random" ; /compactness/ ) {

Error ( 1 , " failed for if ( !-e $rfilel | | ! -e $pdbfile") ; $rfile2 || ! -e $rfile3 || ! -e return 1; $rfile4 ) { }

Error(l,"One or more of my $comp = these files is (split (/\s+/,$output) ) [2] ; missing: \n\t\t$rfilel\n\t\t$rfil return (0,$comp); e2\n\t\t$rfile3\n\t\t$rfile4\n") } return 1 ;

} my $output = '$exe $pdbfile" ; # CopyEnePairlFiles unless ( $output =~ /Z-score/ ) { sub CopyEnePairlFiles {

Error (1, "failed for use File: :Basename; $pdbfile") ; my ($name, $path) ; return 1; my $nargs = 0 ;

} check_args ( $nargs , \@_) ; my $zscore = my $exesrc = (split(/\s+/,$output) ) [2] ; $main: :enepairlexesrc; return (0,$zscore); my $potentialsrc = } $main: :pairlpotentialsrc; Appendix 1 (continued) my $exerun = Error ( 0 , "couldn ' t find $main: :enepairlexerun; potential $potentialsrc" ) ; my $potentialrun = } $main: :pairlpotentialrun; unless ( -e $exerun ) { unless ( -e $exesrc ) { ($name, $path) = Error (0, "couldn' t find fileparse($exerun) ; executable $exesrc"); unless ( -e $path ) { } system ( "mkdir -p $path" unless ( -e $potentialsrc ⁾ { } Error (0 , "couldn ' t find system("cp $exesrc potential $potentialsrc" ) ; $exerun" ) ,- } } unless ( -e $exerun ) { unless ( -e $potentialrun ) { ($name, $path) = ($name, $path) = fileparse ($exerun) ; fileparse ($potentialrun) ; unless ( -e $path ) { unless ( -e $path ) { system( "mkdir -p $path"); system( "mkdir -p $path"); } } systemC'cp $exesrc systemC'cp $potentialsrc $exerun" ) ; $potentialrun" ) ; } } unless ( -e $potentialrun ) { ($name, $path) = fileparse ($ρotentialrun) ; unless ( -e $path ) { # CopyZScorelfiles - system( "mkdir -p $path" ) ; } sub CopyZScorelFiles { systemC'cp $potentialsrc use File: :Basename; $potentialrun" ) ; my ($name, $path) ,- } my $nargs = 0; check_args ( $nargs , \@_) ;

} my $exesrc = $main: : zscorelexesrc;

# CopyEneSurflFiles my $exerun = $main: : zscorelexerun; sub CopyEneSurflFiles { use File: :Basename; unless ( -e $exesrc ) { my ($name, $path) ; _^Error (0, "couldn't find my $nargs = 0; executable $exesrc"); check_args ( $nargs , \@_) ; } my $exesre = $main: :enesurflexesrc; unless ( -e $exerun ) { my $potentialsrc = ($name, $path) = $main: : surflpotentialsrc; fileparse ($exerun) ; my $exerun = unless ( -e $path ) { $main: :enesurflexerun; system( "mkdir -p $ρath") my $potentialrun = } $main: : surflpotentialrun; system("cp $exesrc $exerun" ) ; unless ( -e $exesrc ) { } Error (0, "couldn' t find executable $exesrc"); } unless ( -e $potentialsrc ) { -- CopyCompactnessFiles Appendix 1 (continued)

sub CopyCompactnessFiles { use File: :Basename; my ( $name , $path) ; my $nargs = 0; check_args ($nargs, \@_) ; my $exesrc = $main: :compactnessexesrc; my $exerun = $main: :compactnessexerun; unless ( -e $exesrc ) { Error ( 0 , "couldn ' t find executable $exesrc"); } unless ( -e $exerun ) { ($name, $path) = fileparse ($exerun) ; unless ( -e $path ) { system) "mkdir -p $path") } system ("cp $exesrc $exerun" ) ; }

}

1;
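The evaluation routines in the listing above share one pattern: run an external program, confirm that a keyword appears in its output, and take the third whitespace-separated field as the value. A minimal Python sketch of that pattern follows. The function name `parse_tool_output` and the sample output line are hypothetical; the real output format of the executables is not shown in the listing.

```python
import re
from typing import Optional, Tuple


def parse_tool_output(output: str, keyword: str) -> Tuple[int, Optional[str]]:
    """Return (status, value): status 0 on success, 1 on failure.

    Hypothetical helper. The keyword check and the use of the third
    whitespace-separated field follow the Perl listing; the rest is an
    assumption made for illustration.
    """
    if keyword not in output:
        return 1, None
    fields = re.split(r"\s+", output)
    if len(fields) < 3 or fields[2] == "":
        return 1, None
    return 0, fields[2]


# Made-up output line, for illustration only.
status, value = parse_tool_output("pair_ene = -123.4\n", "pair_ene")
print(status, value)
```

Returning a status code alongside the value mirrors the listing's `return (0, $value)` and `return 1` convention, so a caller can test the status before using the value.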

Claims

What is claimed is:

1. A computerized process of generating a 3-dimensional model of a protein, the process comprising the steps of:

computer;

to identify potentially sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein and

for which the 3-dimensional structure is known;

(3) an alignment step, wherein for each sequence-related protein an optimal degree of

alignment is created between the amino acid sequence of the query protein and that of each sequence-related protein, at least one of which has a known 3-dimensional structure;

alignment step, electronically stored retrievable information that defines a model of a 3-

dimensional structure for all or part of the query protein amino acid sequence is created;

(5) a model evaluation step, wherein the model that is probably closest in structure to the actual structure of the query protein is selected from all models generated for

the query amino acid sequence;

(6) a model storage step, wherein information generated in step (5) is stored

electronically, magnetically, or electromagnetically, such that said information is retrievable.
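The flow of steps (2) through (6) can be sketched as a small driver function. This is only an illustration under stated assumptions: the name `run_pipeline` and the callables `search`, `align`, `build`, and `evaluate` are hypothetical stand-ins for the searching, alignment, model-building, and evaluation steps, and a plain dictionary stands in for electronic storage.

```python
from typing import Any, Callable, Dict, List, Optional


def run_pipeline(
    query: str,
    search: Callable[[str], List[Any]],
    align: Callable[[str, Any], Any],
    build: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    store: Dict[str, Any],
) -> Optional[Any]:
    """Hypothetical driver: every stage is supplied by the caller."""
    templates = search(query)                       # sequence searching step
    alignments = [align(query, t) for t in templates]  # alignment step
    models = [build(a) for a in alignments]         # model-building step
    if not models:
        return None                                 # no related protein found
    best = max(models, key=evaluate)                # model evaluation step
    store[query] = best                             # model storage step
    return best


# Toy stages, for illustration only.
scores = {"t1": 0.2, "t2": 0.9}
storage: Dict[str, Any] = {}
best = run_pipeline(
    "SEQ",
    search=lambda q: ["t1", "t2"],
    align=lambda q, t: (q, t),
    build=lambda a: {"template": a[1]},
    evaluate=lambda m: scores[m["template"]],
    store=storage,
)
print(best)
```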

2. A process of Claim 1 wherein a plurality of query protein amino acid sequences

are processed in steps (1) through (6) in time periods that overlap each other.
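Processing several query sequences in overlapping time periods can be sketched with a worker pool. The function `model_many` and the per-query callable `model_one` are hypothetical names, and a thread pool is only one possible way to overlap the work; the claim does not prescribe a mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def model_many(
    queries: Iterable[str],
    model_one: Callable[[str], Any],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Run a hypothetical per-query pipeline for many queries concurrently."""
    query_list = list(queries)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        results = list(pool.map(model_one, query_list))
    return dict(zip(query_list, results))


# Toy per-query function: the "model" is just the sequence length.
print(model_many(["MKV", "ACDEFG"], len))
```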

3. A process of Claim 1 wherein a variable gap penalty function is used in the

alignment step.
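As an illustration of a variable gap penalty, the sketch below scores a global alignment in which the cost of a gap depends on the template position next to it. This is an assumption made for the example: the claim does not state here how the penalty varies, and the function name, the match and mismatch scores, and the penalty values are all hypothetical.

```python
from typing import Sequence


def global_alignment_score(
    query: str,
    template: str,
    gap_pen: Sequence[float],
    match: float = 1.0,
    mismatch: float = -1.0,
) -> float:
    """Best global alignment score with a position-dependent gap penalty.

    gap_pen[j] is the (hypothetical) cost of a gap adjacent to template
    position j, for example higher inside a helix or strand.
    """
    n, m = len(query), len(template)
    if m == 0 or len(gap_pen) != m:
        raise ValueError("gap_pen needs one value per template residue")

    def pen(j: int) -> float:
        return gap_pen[min(j, m - 1)]

    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - pen(0)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - pen(j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == template[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + s,        # align the two residues
                dp[i][j - 1] - pen(j - 1),   # template residue left unaligned
                dp[i - 1][j] - pen(j),       # query residue inserted
            )
    return dp[n][m]


# Uniform penalty versus a penalty that makes a gap at position 2 costly.
print(global_alignment_score("ACD", "ACGD", [1, 1, 1, 1]))
print(global_alignment_score("ACD", "ACGD", [1, 1, 5, 1]))
```

With the uniform penalty the best alignment skips the template residue `G`; raising the penalty at that position makes the alignment prefer a gap elsewhere, which is the effect a variable penalty is meant to have.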

4. A process of Claim 1 wherein, in the model evaluation step, a model is compared

to previously identified populations of good and bad models.
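One way to compare a model with populations of good and bad models is to fit a distribution to a quality measure in each population and ask which population the new model more plausibly belongs to. The sketch below does this with normal distributions and Bayes' rule. It is an assumed illustration, not the method of the specification: the function `prob_good`, the choice of normal distributions, and the sample values are all hypothetical.

```python
from statistics import NormalDist
from typing import Sequence


def prob_good(
    value: float,
    good_values: Sequence[float],
    bad_values: Sequence[float],
    prior_good: float = 0.5,
) -> float:
    """Posterior probability that a model with this value is a good model.

    Each population needs at least two distinct values so that a normal
    distribution can be fitted.
    """
    good = NormalDist.from_samples(good_values)
    bad = NormalDist.from_samples(bad_values)
    w_good = prior_good * good.pdf(value)
    w_bad = (1.0 - prior_good) * bad.pdf(value)
    total = w_good + w_bad
    if total == 0.0:
        return prior_good  # value is far from both populations
    return w_good / total


# Made-up z-scores for previously classified models.
good_z = [-8.0, -7.0, -9.0, -8.5]
bad_z = [-1.0, 0.0, -2.0, -0.5]
print(prob_good(-8.0, good_z, bad_z))
print(prob_good(-1.0, good_z, bad_z))
```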

5. A process of Claim 1 wherein, in the model evaluation step, a scoring function is used, said function being dependent on compactness, sequence identity, and z-score,

and optionally on additional variables.
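The Eval1 routine in Appendix 1 combines these three quantities into one score. Read from that listing, the score appears to be 1 - cos(seqid)^((comp + seqid) / exp(zscore)). The Python sketch below reproduces that reading. Two points are assumptions: the exponent is taken as reconstructed from the garbled listing, and `seqid` is assumed to be a fraction between 0 and 1, which keeps the cosine positive so the power is well defined.

```python
import math


def model_score(seqid: float, comp: float, zscore: float) -> float:
    """Score as read from Eval1: 1 - cos(seqid) ** ((comp + seqid) / exp(zscore)).

    Assumption: seqid is a fraction in [0, 1]; the listing does not say
    whether it is passed as a fraction or a percentage.
    """
    if not 0.0 <= seqid <= 1.0:
        raise ValueError("seqid is assumed to be a fraction in [0, 1]")
    exponent = (comp + seqid) / math.exp(zscore)
    return 1.0 - math.cos(seqid) ** exponent


# A more negative z-score enlarges the exponent and raises the score.
print(model_score(0.5, 0.5, 0.0))
print(model_score(0.5, 0.5, -3.0))
```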

6. A process of Claim 1, wherein in the sequence searching step, a low stringency

search for sequence similarity is part of the step.

7. A process of Claim 6 wherein a high stringency search is done before the low

stringency search is done.

8. A process of Claim 7 wherein the high stringency searches comprise (1) one

between the query sequence and a database with a plurality of known amino acid sequences

corresponding to either known or unknown 3-dimensional structures and (2) one between the

sequences of a database of proteins of known 3-dimensional structure and sequences of the database with a plurality of known amino acid sequences corresponding to either known or

unknown structures.

9. A process of Claim 1 which further comprises a ligand-docking step, in which the ability of a molecule to bind to a model generated by a step of the process of Claim 1 is tested.

10. A system for accepting as input a query amino acid sequence and carrying out the

computerized process of the invention so as to output a 3-dimensional model of a protein with

the query amino acid sequence:

(1) a collection of one or more query amino acid sequences;

(2) a collection of one or more databases;

(3) a sequence matching engine for searching one or more protein databases so as to

identify sequence-related proteins, those proteins that have an amino acid sequence exceeding a pre-specified degree of sequence similarity with the query protein;

(4) an alignment engine that, for each sequence-related protein, creates an optimal

no equivalent residues in the other;

(5) a model building engine that, for each query sequence alignment obtained in the

alignment step, generates electronically stored retrievable information that defines a model

of a 3-dimensional structure for the query protein amino acid sequence; and

(6) a model evaluation engine that discriminates good models from bad models for the query amino acid sequence, and sorts the models according to their accuracy.
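The behavior of the model evaluation engine in element (6), separating good models from bad ones and sorting by accuracy, can be sketched as a filter followed by a sort. The function `rank_models`, the score dictionary, and the cutoff value are hypothetical; the claim does not fix a particular score or threshold.

```python
from typing import Dict, List, Tuple


def rank_models(scores: Dict[str, float], cutoff: float) -> List[Tuple[str, float]]:
    """Keep models whose score reaches the cutoff, best first.

    Assumes a higher score means a more accurate model.
    """
    good = [(name, score) for name, score in scores.items() if score >= cutoff]
    return sorted(good, key=lambda item: item[1], reverse=True)


# Made-up scores for three models of one query sequence.
print(rank_models({"model_a": 0.35, "model_b": 0.92, "model_c": 0.71}, cutoff=0.5))
```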

11. A system of Claim 10 wherein the interaction between the collections and

engines is under the control of a process control engine.

12. Electronically, magnetically, or electromagnetically stored data generated by the

process of Claim 1.

13. Electronically, magnetically, or electromagnetically stored data generated by step (2), (3), (4), or (5) of the process of Claim 1.

14. A 2-dimensional or 3-dimensional representation of the data of Claim 12.

15. A 2-dimensional or 3-dimensional representation of the data of Claim 13.

16. Electronically, magnetically, or electromagnetically stored data generated by the

system of Claim 10.

17. A computerized process of generating a 3-dimensional model of a protein, the

process comprising the steps of:

computer;

for which the 3-dimensional structure is known;

alignment is created between the amino acid sequence of the query protein and that of each

sequence-related protein, at least one of which has a known 3-dimensional structure;

(4) a model-building step, wherein for each query sequence alignment obtained in the

(5) a model evaluation step, wherein the model that is probably closest in

structure to the actual structure of the query protein is selected from all models generated for the query amino acid sequence;

(6) a model storage step, wherein information generated in steps (2), (3), (4) or (5) is

stored electronically, magnetically, or electromagnetically, such that said information is

retrievable.