WO2001016862A2 - Methods and compositions utilizing a branch and terminate algorithm for protein design - Google Patents
- Publication number
- WO2001016862A2 (application PCT/US2000/040805)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- protein
- rotamers
- rotamer
- dee
- energy
Classifications
- C—CHEMISTRY; METALLURGY
- C07—ORGANIC CHEMISTRY
- C07K—PEPTIDES
- C07K1/00—General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
Definitions
- the present invention relates to an apparatus and method for quantitative protein design and optimization
- the invention describes the use of the Branch and Terminate algorithm in protein design
- the present invention provides methods executed by a computer under the control of a program, the computer including a memory for storing the program
- the methods comprise the steps of receiving a protein backbone structure with variable residue positions, establishing a group of potential rotamers for each of the variable residue positions, wherein at least one variable residue position has rotamers from at least two different amino acid side chains, and analyzing the interaction of each of the rotamers with all or part of the remainder of the protein backbone structure to generate a set of optimized protein sequences
- the methods further comprise classifying each variable residue position as either a core, surface or boundary residue
- the analyzing step may include a Branch and Terminate (B&T) computation either alone or in combination with a Dead-End Elimination (DEE) computation
- B&T Branch and Terminate
- DEE Dead-End Elimination
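To make the DEE computation concrete, the following is a minimal sketch of the original singles elimination criterion, as commonly stated from Desmet et al. (1992), which this document cites: a rotamer is discarded when its best-case energy is still worse than the worst-case energy of some other rotamer at the same position. The data layout (`self_e`, `pair_e`) and the function names are assumptions made for illustration, and the sketch does not cover the Goldstein or doubles variants or the B&T algorithm itself.

```python
def pair(pair_e, i, r, j, s):
    """Pair energy between rotamer r at position i and rotamer s at position j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def dee_singles(self_e, pair_e):
    """Iteratively apply the original singles DEE criterion.

    self_e[i][r]         : rotamer/template energy of rotamer r at position i
    pair_e[(i, j)][r][s] : rotamer/rotamer energy, stored once for i < j
    Returns, per position, the sorted list of surviving rotamer indices.
    """
    n = len(self_e)
    alive = [set(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in sorted(alive[i]):
                # Best case for r: its lowest pair energy at every other position.
                best_r = self_e[i][r] + sum(
                    min(pair(pair_e, i, r, j, s) for s in alive[j]) for j in others)
                for t in sorted(alive[i]):
                    if t == r:
                        continue
                    # Worst case for t: its highest pair energy at every other position.
                    worst_t = self_e[i][t] + sum(
                        max(pair(pair_e, i, t, j, s) for s in alive[j]) for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)  # r cannot be in the global optimum
                        changed = True
                        break
    return [sorted(a) for a in alive]
```

Because the criterion only removes rotamers that provably cannot be part of the global minimum, the surviving set can be handed to any exact search without changing its answer.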
- the analyzing step includes the use of at least one scoring function selected from the group consisting of a Van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solv
- the invention provides nucleic acid sequences encoding a protein sequence generated by the present methods, and expression vectors and host cells containing the nucleic acids
- the invention provides a computer readable memory to direct a computer to function in a specified manner, comprising a side chain module to correlate a group of potential rotamers for residue positions of a protein backbone model, and a ranking module to analyze the interaction of each of said rotamers with all or part of the remainder of said protein to generate a set of optimized protein sequences
- the memory may further comprise an assessment module to assess the correspondence between potential energy test results and theoretical potential energy data
- FIG. 1 illustrates a general purpose computer configured in accordance with an embodiment of the invention
- FIG. 2 illustrates processing steps associated with an embodiment of the invention
- Figure 3 illustrates processing steps associated with a ranking module used in accordance with an embodiment of the invention. After any DEE step, any one of the previous DEE steps may be repeated. In addition, any one of the DEE steps may be eliminated; for example, original singles DEE (step 74) need not be run.
- Figure 4 is a schematic representation of the minimum and maximum quantities (defined in Eq. 24 to 27) that are used to construct speed enhancements.
- the minima and maxima are utilized directly to find the (iJ ⁇ pair and for the comparison of extrema.
- the differences between the quantities, denoted with arrows, are used to construct the q re and q uv metrics.
- Figures 5A, 5B, 5C and 5D depict several super-secondary structure parameters for α/β proteins.
- the definitions are similar to those previously developed for α/β proteins (Janin & Chothia, J Mol Biol 143:95-128 (1980); Cohen et al., J Mol Biol 156:821-862 (1982)).
- the helix center is defined as the average Cα position of the residues in the helix.
- the helix axis is defined as the principal moment of the Cα atoms of the residues in the helix (Chothia et al., Proc Natl Acad Sci USA 78:4146-4150 (1981); J Mol Biol 145:215-250 (1981)).
- the strand axis is defined as the average of the least-squares lines fit through the midpoints of sequential Cα positions of two central β-strands.
- the sheet plane is defined as the least-squares plane fit through the Cα positions of the residues of the sheet.
- the sheet axis is defined as the vector perpendicular to the sheet plane that passes through the helix center.
- Ω is the angle between the strand axis and the helix axis after projection onto the sheet plane;
- θ is the angle between the helix axis and the sheet plane;
- h is the distance between the helix center and the sheet plane;
- σ is the rotation angle about the helix axis.
- Figures 6A, 6B, 6C and 6D depict four supersecondary structure parameters for β/β protein interactions.
- Figures 6A and 6B are relevant to β barrel proteins;
- Figure 6C is relevant to β-sheet interactions.
- Figure 6A shows only three strands, and depicts R, the barrel radius; α, the tilt of the strands relative to the barrel axis; a, the distance from Cα to Cα along the strands; and b, the interstrand distance.
- Figure 6B shows the twist and coiling angles of the β-sheet, with residues A, B and C from one strand, residues D, E and F in strand 2, and residues G, H and I from strand 3.
- the circles represent the positions of the residues when projected onto the surface of the barrel.
- θ is the mean twist of the β-sheet about an axis perpendicular to the strand direction
- T is the mean twist of the β-sheet about an axis parallel to the strand direction
- ε is the mean coiling of the β-sheet along the strands
- η is the mean coiling of the β-sheet along a line perpendicular to the strands.
- Figure 6C depicts two β-sheets, with the chain direction being shown with arrows.
- Figure 6D depicts two β-sheets of distance h with angle ⁇ between the average strand vectors. There is also ⁇ , perpendicular to vectors defining ⁇ .
- Figures 7A, 7B, 7C and 7D depict four supersecondary structure parameters for α/α interactions. d is the distance between the helices and ⁇ is the angle between the axes of the helices; ⁇ is defined as the rotation around the helix axis; ⁇ is the angle between two strand axes after projection onto a plane.
- the dark circle represents a view of the helix from the end.
- Figure 9 depicts the benchmark times of B&T versus other combinatorial search algorithms.
- Figure 10 depicts the optimization times resulting from the combination of B&T (hashed bars) and DEE (solid bars) algorithms.
- the bars on the extreme left and right of the figure are the times for lone B&T and DEE optimization, respectively.
- the remaining bars are the cumulative B&T and DEE optimization times when the two algorithms are used in succession.
- the sudden jumps in DEE times arise from lengthy Goldstein doubles calculations.
- the present invention is directed to the quantitative design and optimization of amino acid sequences, using an "inverse protein folding” approach, which seeks the optimal sequence for a desired structure. Inverse folding is similar to protein design, which seeks to find a sequence or set of sequences that will fold into a desired structure. These approaches can be contrasted with a “protein folding” approach which attempts to predict a structure taken by a given sequence.
- the general preferred approach of the present invention is as follows, although alternate embodiments are discussed below.
- a known protein structure is used as the starting point.
- the residues to be optimized are then identified, which may be the entire sequence or subset(s) thereof.
- the side chains of any positions to be varied are then removed.
- the resulting structure consisting of the protein backbone and the remaining sidechains is called the template.
- Each variable residue position is then preferably classified as a core residue, a surface residue, or a boundary residue; each classification defines a subset of possible amino acid residues for the position (for example, core residues generally will be selected from the set of hydrophobic residues, surface residues generally will be selected from the hydrophilic residues, and boundary residues may be either).
- Each amino acid can be represented by a discrete set of all allowed conformers of each side chain, called rotamers.
- To arrive at an optimal sequence for a backbone, all possible sequences of rotamers must be screened, where each backbone position can be occupied either by each amino acid in all its possible rotameric states, or by a subset of amino acids, and thus a subset of rotamers
- a Monte Carlo search may be done to generate a rank-ordered list of sequences in the neighborhood of the DEE or B&T solution. Starting at the DEE or B&T solution, random positions are changed to other rotamers, and the new sequence energy is calculated. If the new sequence meets the criteria for acceptance, it is used as a starting point for another jump. After a predetermined number of jumps, a rank-ordered list of sequences is generated
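The Monte Carlo search around the DEE or B&T solution can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text only says a new sequence must meet "the criteria for acceptance", so the Metropolis rule, the `temperature` parameter, the `energy` callback and the function name are all assumptions.

```python
import math
import random


def monte_carlo_neighbors(start, n_rotamers, energy, n_jumps=1000,
                          temperature=1.0, seed=0):
    """Random walk from `start`; returns (assignment, energy) pairs, lowest first.

    start      : tuple of rotamer indices, one per variable position
    n_rotamers : number of rotamers available at each position
    energy     : callable mapping an assignment tuple to its energy
    """
    rng = random.Random(seed)
    current = tuple(start)
    e_current = energy(current)
    seen = {current: e_current}
    for _ in range(n_jumps):
        pos = rng.randrange(len(current))
        if n_rotamers[pos] < 2:
            continue  # nothing to change at a single-rotamer position
        new_rot = rng.choice(
            [r for r in range(n_rotamers[pos]) if r != current[pos]])
        trial = current[:pos] + (new_rot,) + current[pos + 1:]
        e_trial = energy(trial)
        seen[trial] = e_trial
        # Metropolis acceptance (assumed; the source does not name the criterion).
        if e_trial <= e_current or \
                rng.random() < math.exp((e_current - e_trial) / temperature):
            current, e_current = trial, e_trial
    return sorted(seen.items(), key=lambda item: item[1])
```

Every sequence visited is recorded, whether or not the jump is accepted, so the returned list is the rank-ordered neighborhood of the starting solution.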
- B&T may also be used to generate a rank-ordered list of sequences in the neighborhood of the DEE or B&T solution. In fact, this search may be performed without prior knowledge of the DEE or B&T solution. The results may then be experimentally verified by physically generating one or more of the protein sequences followed by experimental testing. The information obtained from the testing can then be fed back into the analysis, to modify the procedure if necessary
- the present invention provides a computer-assisted method of designing a protein
- the method comprises providing a protein backbone structure with variable residue positions, and then establishing a group of potential rotamers for each of the residue positions
- the backbone, or template includes the backbone atoms and any fixed side chains
- the interactions between the protein backbone and the potential rotamers, and between pairs of the potential rotamers, are then processed to generate a set of optimized protein sequences, preferably a single global optimum, which then may be used to generate other related sequences
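As an illustration of what "a single global optimum" means here, the sketch below scores a rotamer assignment as the sum of rotamer/template energies and rotamer/rotamer pair energies, then finds the minimum by exhaustive enumeration. Exhaustive enumeration is a stand-in that is only feasible for toy problems; it is not the B&T or DEE procedure, and the data layout and function names are assumptions.

```python
from itertools import product


def total_energy(assignment, self_e, pair_e):
    """Sum of rotamer/template terms and rotamer/rotamer pair terms.

    self_e[i][r]         : energy of rotamer r at position i with the template
    pair_e[(i, j)][r][s] : energy between rotamer r at i and rotamer s at j
    """
    e = sum(self_e[i][r] for i, r in enumerate(assignment))
    for (i, j), table in pair_e.items():
        e += table[assignment[i]][assignment[j]]
    return e


def global_optimum(self_e, pair_e):
    """Lowest-energy assignment, found by trying every combination."""
    choices = [range(len(e)) for e in self_e]
    return min(product(*choices),
               key=lambda a: total_energy(a, self_e, pair_e))
```

The number of combinations grows as the product of the rotamer counts at each position, which is why pruning methods are needed for realistic designs.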
- FIG. 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment of the invention
- the apparatus 20 includes a central processing unit 22 which communicates with a memory 24 and a set of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) 26 through a bus 28
- the general interaction between a central processing unit 22, a memory 24, input/output devices 26, and a bus 28 is known in the art
- the present invention is directed toward the automated protein design program 30 stored in the memory 24
- the automated protein design program 30 may be implemented with a side chain module 32. As discussed in detail below, the side chain module establishes a group of potential rotamers for a selected protein backbone structure
- the protein design program 30 may also be implemented with a ranking module 34. As discussed in detail below, the ranking module 34 analyzes the interaction of rotamers with the protein backbone structure to generate optimized protein sequences
- the protein design program 30 may also include a search module 36 to execute a search, for example a Monte Carlo search as described below, in relation to the optimized protein sequences
- an assessment module 38 may also be used to assess physical parameters associated with the derived proteins, as discussed further below
- the memory 24 also stores a protein backbone structure 40, which is downloaded by a user through the input/output devices 26
- the memory 24 also stores information on potential rotamers derived by the side chain module 32
- the memory 24 stores protein sequences 44 generated by the ranking module 34. The protein sequences 44 may be passed as output to the input/output devices 26
- Fig. 2 illustrates processing steps executed in accordance with the method of the invention. As described below, many of the processing steps are executed by the protein design program 30
- the first processing step illustrated in Fig. 2 is to provide a protein backbone structure (step 50). As previously indicated, the protein backbone structure is downloaded through the input/output devices 26 using standard techniques
- the protein backbone structure corresponds to a selected protein
- by "protein" herein is meant at least two amino acids linked together by a peptide bond
- protein includes proteins, oligopeptides and peptides
- the peptidyl group may comprise naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic structures, i.e. "analogs", such as peptoids (see Simon et al., PNAS USA 89(20):9367 (1992))
- the amino acids may either be naturally occurring or non-naturally occurring; as will be appreciated by those in the art, any structure for which a set of rotamers is known or can be generated can be used as an amino acid
- the side chains may be in either the (R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or (L) configuration
- the chosen protein may be any protein for which a three dimensional structure is known or can be generated, that is, for which there are three dimensional coordinates for each atom of the protein. Generally this can be determined using X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. In general, if X-ray structures are used, structures at 2 Å resolution or better are preferred, but not required
- the proteins may be from any organism, including prokaryotes and eukaryotes, with enzymes from bacteria, fungi, extremophiles such as the archaebacteria, insects, fish, animals (particularly mammals and particularly human) and birds all possible
- Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including ligands, cell surface receptors, antigens, antibodies, cytokines, hormones, and enzymes
- Suitable classes of enzymes include, but are not limited to, hydrolases such as proteases, carbohydrases, lipases; isomerases such as racemases, epimerases, tautomerases, or mutases; transferases, kinases, oxidoreductases, and phosphatases. Suitable enzymes are listed in the Swiss-Prot enzyme database
- Suitable protein backbones include, but are not limited to, all of those found in the protein data base compiled and serviced by the Brookhaven National Lab
- by "protein backbone structure" or grammatical equivalents herein is meant the three dimensional coordinates that define the three dimensional structure of a particular protein
- the structures which comprise a protein backbone structure are the nitrogen, the carbonyl carbon, the α-carbon, and the carbonyl oxygen, along with the direction of the vector from the α-carbon to the β-carbon
- the protein backbone structure which is input into the computer can either include the coordinates for both the backbone and the amino acid side chains, or just the backbone, i.e. with the coordinates for the amino acid side chains removed. If the former is done, the side chain atoms of each amino acid of the protein structure may be "stripped" or removed from the structure of a protein, as is known in the art, leaving only the coordinates for the "backbone" atoms (the nitrogen, carbonyl carbon and oxygen, and the α-carbon, and the hydrogens attached to the nitrogen and α-carbon)
- the protein backbone structure is altered prior to the analysis outlined below
- the representation of the starting protein backbone structure is reduced to a description of the spatial arrangement of its secondary structural elements
- the relative positions of the secondary structural elements are defined by a set of parameters called supersecondary structure parameters. These parameters are assigned values that can be systematically or randomly varied to alter the arrangement of the secondary structure elements to introduce explicit backbone flexibility
- the atomic coordinates of the backbone are then changed to reflect the altered supersecondary structural parameters, and these new coordinates are input into the system for use in the subsequent protein design automation
- a protein is first parsed into a collection of secondary structural elements which are then abstracted into geometrical objects
- an α-helix is represented by its helical axis and geometric center
- the relative orientation and distance between these objects are summarized as super-secondary structure parameters
- Concerted backbone motion can be introduced by simply modulating a protein's super-secondary structure parameter values. Accordingly, when all or part of the backbone is to be altered, the portion to be altered is classified as belonging to a particular supersecondary structure element, i.e. α/β, β/β or α/α, and then the supersecondary structural elements as outlined below are altered. As will be appreciated by those in the art, these elements need not be covalently linked, i.e. part of the same protein; for example, this can be done to evaluate protein-protein interactions
- both the backbone can be moved and the amino acid side chain can be optimized as outlined herein. Similarly, the backbone can be held constant and only the amino acid side chains are optimized. Combinations of any of these at any position may be done. In general, when supersecondary structural parameters are altered, this is done on more than one amino acid, i.e. the backbone atoms of a plurality of amino acids that contribute to the secondary structure are moved
- the helix center is defined as the average Cα position of the residues chosen for backbone movement.
- the helix axis is defined as the principal moment of the Cα atoms of these residues (see Chothia et al., 1981, supra).
- the strand axis is defined as the average of the least-squares lines fit through the midpoints of sequential Cα positions of the two central β-strands.
- the sheet plane is defined as the least-squares plane fit through the Cα positions of the two central β-strands.
- the sheet axis is defined as the vector perpendicular to the sheet plane that passes through the helix center.
- Ω is the angle between the strand axis and the helix axis after projection onto the sheet plane
- θ is the angle between the helix axis and the sheet plane
- h is the distance between the helix center and the sheet plane
- σ is the rotation angle about the helix axis
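Some of the geometric definitions above can be sketched in a few lines. This is an illustrative reading rather than the patent's code: the "principal moment" is interpreted as the direction of greatest spread of the Cα coordinates (found here by power iteration), the sheet plane is taken as a given input (a point and a normal) instead of being fit by least squares, and all function names are assumptions. Only the helix center, the helix axis, the distance h, and the angle between the helix axis and the sheet plane are computed.

```python
import math


def centroid(points):
    """Average position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def principal_axis(points, iterations=200):
    """Unit vector along the direction of greatest spread (power iteration)."""
    c = centroid(points)
    d = [[p[k] - c[k] for k in range(3)] for p in points]
    # 3x3 scatter matrix of the centered coordinates.
    m = [[sum(row[a] * row[b] for row in d) for b in range(3)]
         for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(m[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return tuple(v)


def helix_vs_sheet(ca_atoms, plane_point, plane_normal):
    """Return (h, angle in degrees between helix axis and sheet plane)."""
    length = math.sqrt(sum(x * x for x in plane_normal))
    n = [x / length for x in plane_normal]
    center = centroid(ca_atoms)
    axis = principal_axis(ca_atoms)
    # Distance from the helix center to the plane.
    h = abs(sum((center[k] - plane_point[k]) * n[k] for k in range(3)))
    # Angle to a plane is the complement of the angle to its normal.
    sin_angle = min(1.0, abs(sum(axis[k] * n[k] for k in range(3))))
    return h, math.degrees(math.asin(sin_angle))
```

Changing h or the angle and then regenerating the Cα coordinates accordingly is the kind of concerted backbone move the parameters are meant to describe.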
- the supersecondary structure parameter value ⁇ is altered by changing the angle (either positively or negatively) by up to about 25 degrees, with changes of ± 1°, 2.5°, 5°, 7.5°, and 10° being particularly preferred.
- the supersecondary structure parameter value ⁇ is altered by changing the angle (either positively or negatively) by up to about 25 degrees, with changes of ± 1°, 2.5°, 5°, 7.5°, and 10° being particularly preferred
- the supersecondary structure parameter value ⁇ is altered by changing the angle (either positively or negatively) by up to about 25 degrees, with changes of ± 1°, 2.5°, 5°, 7.5°, and 10° being particularly preferred.
- the supersecondary structure parameter value h is altered by changes (either positive or negative) of up to about 8 Å, with changes of ± 0.25, 0.50, 0.75, 1.00, 1.25 and 1.5 Å being particularly preferred
- changes can be made, depending on the protein (i.e. how close or far other secondary structure elements are) and whether other parameter values are changed; for example, larger changes in ⁇ can be made if the helix is also moved away from the sheet (i.e. h is increased)
- the helix center is defined as the average Cα position of the residues in the helix
- the helix axis is defined as the principal moment of the Cα atoms of the residues in the helix. ⁇ is defined as the rotation around the helix axis. ⁇ is the angle between two strand axes after projection onto a plane
- d, the distance between the helices, can be altered, generally as outlined above for h
- ⁇ , ⁇ and ⁇ can be altered as above
- the coordinate positions for the positions chosen are altered to reflect the change, to form a "new" or "altered" backbone protein structure, i.e. one that has all or part of the backbone atoms in a different position relative to the starting structure. It should be noted that this process can be repeated, i.e. additional backbone changes can be made, on the same or different residues.
- the backbone of one or more optimal sequences can be altered and an optimization can be run
- movement of the backbone can be done manually, i.e. sections of the protein backbone can be randomly or arbitrarily moved
- the backbone atoms of one or more amino acids can be moved some distance, generally an angstrom or more, in any direction. This can be done using standard modeling programs, for example, Molecular Dynamics minimization, Monte Carlo dynamics, or random backbone coordinate/angle motion. It is also possible to move the backbone atoms of single residues, that are either components of secondary structural elements or not
- the protein backbone structure contains at least one variable residue position
- the residues, or amino acids, of proteins are generally sequentially numbered starting with the N-terminus of the protein
- a protein having a methionine at its N-terminus is said to have a methionine at residue or amino acid position 1, with the next residues as 2, 3, 4, etc.
- the wild type (i.e. naturally occurring) protein may have one of at least 20 amino acids, in any number of rotamers
- by "variable residue position" herein is meant an amino acid position of the protein to be designed that is not fixed in the design method as a specific residue or rotamer, generally the wild-type residue or rotamer
- all of the residue positions of the protein are variable. That is, every amino acid side chain may be altered in the methods of the present invention. This is particularly desirable for smaller proteins, although the present methods allow the design of larger proteins as well. While there is no theoretical limit to the length of the protein which may be designed this way, there is a practical computational limit
- residue positions of the protein are variable, and the remainder are "fixed", that is, they are identified in the three dimensional structure as being in a set conformation
- a fixed position is left in its original conformation (which may or may not correlate to a specific rotamer of the rotamer library being used)
- residues may be fixed as a non-wild type residue; for example, when known site-directed mutagenesis techniques have shown that a particular residue is desirable (for example, to eliminate a proteolytic site or alter the substrate specificity of an enzyme), the residue may be fixed as a particular amino acid
- the methods of the present invention may be used to evaluate mutations de novo, as is discussed below
- a fixed position may be "floated": the amino acid at that position is fixed, but different rotamers of that amino acid are tested
- the variable residues may be at least one, or anywhere from 0.1% to 99.9% of the total
- residues which can be fixed include, but are not limited to, structurally or biologically functional residues
- residues which are known to be important for biological activity, such as the residues which form the active site of an enzyme, the substrate binding site of an enzyme, the binding site for a binding partner (ligand/receptor, antigen/antibody, etc.), phosphorylation or glycosylation sites which are crucial to biological function, or structurally important residues, such as disulfide bridges, metal binding sites, critical hydrogen bonding residues, residues critical for backbone conformation such as proline or glycine, residues critical for packing interactions, etc., may all be fixed in a conformation or as a single rotamer, or "floated"
- residues which may be chosen as variable residues may be those that confer undesirable biological attributes, such as susceptibility to proteolytic degradation, dimerization or aggregation sites, glycosylation sites which may lead to immune responses, unwanted binding activity, unwanted allostery, undesirable enzyme activity but
- the methods of the present invention allow computational testing of "site-directed mutagenesis" targets without actually making the mutants, or prior to making the mutants. That is, quick analysis of sequences in which a small number of residues are changed can be done to evaluate whether a proposed change is desirable. In addition, this may be done on a known protein, or on a protein optimized as described herein
- a domain of a larger protein may essentially be treated as a small independent protein, that is, a structural or functional domain of a large protein may have minimal interactions with the remainder of the protein and may essentially be treated as if it were autonomous
- all or part of the residues of the domain may be variable
- step 52. This step may be implemented using the side chain module 32
- the side chain module 32 includes at least one rotamer library, as described below, and program code that correlates the selected protein backbone structure with corresponding information in the rotamer library
- the side chain module 32 may be omitted and the potential rotamers 42 for the selected protein backbone structure may be downloaded through the input/output devices 26
- each amino acid side chain has a set of possible conformers, called rotamers. See Ponder et al., Acad Press Inc (London) Ltd, pp 775-791 (1987); Dunbrack et al., Struc Biol 1(5):334-340 (1994); Desmet et al., Nature 356:539-542 (1992), all of which are hereby expressly incorporated by reference in their entirety
- rotamers for every amino acid side chain are used
- a backbone dependent rotamer library allows different rotamers depending on the position of the residue in the backbone; thus, for example, certain leucine rotamers are allowed if the position is within an α-helix, and different leucine rotamers are allowed if the position is not in an α-helix
- a backbone independent rotamer library utilizes all rotamers of an amino acid
- a preferred embodiment does a type of "fine tuning" of the rotamer library by expanding the possible χ (chi) angle values of the rotamers by plus and minus one standard deviation (or more) about the mean value, in order to minimize possible errors that might arise from the discreteness of the library. This is particularly important for aromatic residues, and fairly important for hydrophobic residues, due to the increased requirements for flexibility in the core and the rigidity of aromatic rings; it is not as important for the other residues. Thus a preferred embodiment expands the χ1 and χ2 angles for all amino acids except Met, Arg and Lys
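The "fine tuning" step can be sketched as follows: each library rotamer is replaced by the combinations of mean and mean ± one standard deviation for its first two χ angles, while any further χ angles keep their mean value. The function name, the argument layout, and the angle values used with it are hypothetical; the source gives no library values.

```python
from itertools import product


def expand_rotamer(chi_means, chi_sds, n_expand=2):
    """Expand the first `n_expand` chi angles to mean - sd, mean, mean + sd.

    chi_means : mean chi angles of one library rotamer, in degrees
    chi_sds   : matching standard deviations, in degrees
    Returns a list of chi-angle tuples, one per expanded rotamer.
    """
    options = []
    for k, (mean, sd) in enumerate(zip(chi_means, chi_sds)):
        if k < n_expand:
            options.append((mean - sd, mean, mean + sd))
        else:
            options.append((mean,))  # later chi angles are left at the mean
    return [tuple(combo) for combo in product(*options)]
```

Under this scheme a rotamer with two or more χ angles becomes nine rotamers, and one with a single χ angle becomes three.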
- alanine has 1 rotamer
- glycine has 1 rotamer
- arginine has 55 rotamers
- threonine has 9 rotamers
- lysine has 57 rotamers
- glutamic acid has 69 rotamers
- asparagine has 54 rotamers
- aspartic acid has 27 rotamers
- tryptophan has 54 rotamers
- tyrosine has 36 rotamers
- cysteine has 9 rotamers
- glutamine has 69 rotamers
- histidine has 54 rotamers
- valine has 9 rotamers
- isoleucine has 45 rotamers
- leucine has 36 rotamers
- methionine has 21 rotamers
- serine has 9 rotamers
- phenylalanine has 36 rotamers
- proline is not generally used, since it will rarely be chosen for any position, although it can be included if desired. Similarly, a preferred embodiment omits cysteine as a consideration, only to avoid potential disulfide problems, although it can be included if desired
- At least one variable position has rotamers from at least two different amino acid side chains; that is, a sequence is being optimized, rather than a structure
- each variable residue position, that is, the group or set of potential rotamers at each variable position, is every possible rotamer of each amino acid. This is especially preferred when the number of variable positions is not high, as this type of analysis can be computationally expensive
- each variable position is classified as either a core, surface or boundary residue position, although in some cases, as explained below, the variable position may be set to glycine to minimize backbone strain
- the classification of residue positions as core, surface or boundary may be done in several ways, as will be appreciated by those in the art
- the classification is done via a visual scan of the original protein backbone structure, including the side chains, and assigning a classification based on a subjective evaluation of one skilled in the art of protein modelling
- a preferred embodiment utilizes an assessment of the orientation of the Cα-Cβ vectors relative to a solvent accessible surface computed using only the template Cα atoms
- the solvent accessible surface for only the Cα atoms of the target fold is generated using the Connolly algorithm with an add-on radius ranging from about 4 to about 12 Å, with from about 6 to about 10 Å being preferred, and 8 Å being particularly preferred
- the Cα radius used ranges from about 1.6 Å to about 2.3 Å, with from about 1.8 to about 2.1 Å being preferred, and 1.95 Å being especially preferred
- a residue is classified as a core position if a) the distance for its Cα, along its Cα-Cβ vector, to the solvent
- a core residue will generally be selected from the group of hydrophobic residues consisting of alanine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the α scaling factor of the van der Waals scoring function, described below, is low, methionine is removed from the set), and the rotamer set for each core position potentially includes rotamers for these eight amino acid side chains (all the rotamers if a backbone independent library is used, and subsets if
- proline, cysteine and glycine are not included in the list of possible amino acid side chains, and thus the rotamers for these side chains are not used
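To give a sense of the combinatorics, the sketch below combines the rotamer counts listed earlier with the eight-residue core set and computes how many rotamers a core position carries and how quickly the number of rotamer sequences grows. It assumes every listed rotamer of each allowed amino acid is used at every position; the names are illustrative.

```python
# Rotamer counts per amino acid, as listed in this document (proline omitted).
ROTAMERS = {
    "alanine": 1, "glycine": 1, "arginine": 55, "threonine": 9,
    "lysine": 57, "glutamic acid": 69, "asparagine": 54,
    "aspartic acid": 27, "tryptophan": 54, "tyrosine": 36,
    "cysteine": 9, "glutamine": 69, "histidine": 54, "valine": 9,
    "isoleucine": 45, "leucine": 36, "methionine": 21, "serine": 9,
    "phenylalanine": 36,
}

# The eight hydrophobic residues allowed at core positions.
CORE = ["alanine", "valine", "isoleucine", "leucine",
        "phenylalanine", "tyrosine", "tryptophan", "methionine"]


def rotamers_per_position(amino_acids):
    """Total rotamers at one position given its allowed amino acids."""
    return sum(ROTAMERS[name] for name in amino_acids)


def search_space_size(n_positions, amino_acids):
    """Number of rotamer sequences for n positions sharing one allowed set."""
    return rotamers_per_position(amino_acids) ** n_positions
```

With these counts a core position has 238 rotamers, so even ten core positions give 238 to the tenth power combinations, which is the practical computational limit the text refers to.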
- the variable residue position has a φ angle (that is, the dihedral angle defined by 1) the carbonyl carbon of the preceding amino acid, 2) the nitrogen atom of the current residue, 3) the α-carbon of the current residue, and 4) the carbonyl carbon of the current residue) greater than 0°
- the position is set to glycine to minimize backbone strain
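The φ test can be sketched with a standard four-point dihedral calculation. The sign convention used here is the usual one (cis is 0°, and a clockwise rotation of the far bond when looking along the central bond is positive); the function names are assumptions.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def set_to_glycine(prev_c, n, ca, c):
    """True when phi (C of previous residue, N, CA, C) is greater than 0."""
    return dihedral(prev_c, n, ca, c) > 0.0
```

For example, the points (1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1) give a dihedral of +90°, so a position with that backbone geometry would be set to glycine under this rule.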
- processing proceeds to step 54 of Figure 2
- This processing step entails analyzing interactions of the rotamers with each other and with the protein backbone to generate optimized protein sequences
- the ranking module 34 may be used to perform these operations. That is, computer code is written to implement the following functions. Simplistically, as is generally outlined above, the processing initially comprises the use of a number of scoring functions, described below, to calculate energies of interactions of the rotamers, either to the backbone itself or other rotamers
- the scoring functions include a Van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic scoring function
- at least one scoring function is used to score each position, although the scoring functions may differ depending on the position classification or other considerations, like favorable interaction with an α-helix dipole
- the total energy which is used in the calculations is the sum of the energy of each scoring function used at
- In Equation 1, the total energy is the sum of the energy of the van der Waals potential ($E_{vdw}$), the energy of atomic solvation ($E_{as}$), the energy of hydrogen bonding ($E_{h\text{-}bonding}$), the energy of secondary structure ($E_{ss}$) and the energy of electrostatic interaction ($E_{elec}$)
- $E_{vdw}$: van der Waals potential
- $E_{as}$: the energy of atomic solvation
- $E_{h\text{-}bonding}$: the energy of hydrogen bonding
- $E_{ss}$: the energy of secondary structure
- $E_{elec}$: the energy of electrostatic interaction
- van der Waals' scoring function is used as is known in the art.
- van der Waals' forces are the weak, non-covalent and non-ionic forces between atoms and molecules, that is, the induced dipole and electron repulsion (Pauli principle) forces
- the van der Waals scoring function is based on a van der Waals potential energy
- van der Waals potential energy calculations include a Lennard-Jones 12/6 potential with radii and well depth parameters from the Dreiding force field (Mayo et al., J. Prot. Chem., 1990, expressly incorporated herein by reference) or the exponential 6 potential. Equation 2, shown below, is the preferred Lennard-Jones potential
- $R_0$ is the geometric mean of the van der Waals radii of the two atoms under consideration
- $D_0$ is the geometric mean of the well depth of the two atoms under consideration
- $E_{vdw}$ and $R$ are the energy and interatomic distance between the two atoms under consideration, as is more fully described below
- the van der Waals forces are scaled using a scaling factor, α. Equation 3 shows the use of α in the van der Waals Lennard-Jones potential equation
- The role of the α scaling factor is to change the importance of packing effects in the optimization and design of any particular protein. Specifically, a reduced van der Waals steric constraint can compensate for the restrictive effect of a fixed backbone and discrete side-chain rotamers in the simulation and can allow a broader sampling of sequences compatible with a desired fold
- α values ranging from about 0.70 to about 1.10 can be used, with α values from about 0.8 to about 1.05 being preferred, and from about 0.85 to about 1.0 being especially preferred. Specific α values which are preferred are 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05
- the van der Waals scaling factor is used in the total energy calculations for each variable residue position, including core, surface and boundary positions
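The effect of the scaling factor can be illustrated with a minimal sketch. It assumes the common 12-6 form $E_{vdw} = D_0[(\alpha R_0/R)^{12} - 2(\alpha R_0/R)^{6}]$, in which $R_0$ is the separation at the energy minimum; the function name and the numerical parameters are illustrative assumptions and are not taken from the Dreiding force field.

```python
import math


def scaled_lj_energy(r, r0_a, r0_b, d0_a, d0_b, alpha=0.9):
    """Lennard-Jones 12-6 energy with the van der Waals radii scaled by alpha.

    Assumed form: E = D0 * [(alpha*R0/R)**12 - 2*(alpha*R0/R)**6].
    """
    r0 = math.sqrt(r0_a * r0_b)  # geometric mean of the two radii
    d0 = math.sqrt(d0_a * d0_b)  # geometric mean of the two well depths
    x = (alpha * r0 / r) ** 6
    return d0 * (x * x - 2.0 * x)


# Illustrative (not Dreiding) parameters for two identical atoms in close contact.
clash_full = scaled_lj_energy(3.0, 4.0, 4.0, 0.1, 0.1, alpha=1.0)
clash_soft = scaled_lj_energy(3.0, 4.0, 4.0, 0.1, 0.1, alpha=0.9)
```

With these inputs the reduced scaling factor gives a smaller repulsive energy for the same close contact, which is the sense in which a lower value relaxes the steric constraint.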
- an atomic solvation potential scoring function is used. As is appreciated by those in the art, solvent interactions of a protein are a significant factor in protein stability, and residue/protein hydrophobicity has been shown to be the major driving force in protein folding. Thus, there is an entropic cost to solvating hydrophobic surfaces, in addition to the potential for misfolding or aggregation. Accordingly, the burial of hydrophobic surfaces within a protein structure is beneficial to both folding and stability. Similarly, there can be a disadvantage for burying hydrophilic residues
- the accessible surface area of a protein atom is generally defined as the area of the surface over which a water molecule can be placed while making van der Waals contact with this atom and not penetrating any other protein atom
- the solvation potential is generally scored by taking the total possible exposed surface area of the moiety or two independent moieties (either a rotamer or the first rotamer and the second rotamer), which is the reference, and subtracting out the "
- a preferred embodiment calculates the scoring function on the basis of the "buried" portion, i.e., the total possible exposed surface area is calculated, and then the calculated surface area after the interaction of the moieties is subtracted, leaving the buried surface area
- the pairwise solvation potential is implemented in two components, “singles” (rotamer/template) and “doubles” (rotamer/rotamer), as is more fully described below
- the reference state is defined as the rotamer in question at residue position i with the backbone atoms only of residues i-1, i and i+1, although in some instances just i may be used
- the solvation potential is not calculated for the interaction of each backbone atom with a particular rotamer, although more may be done as required
- the area of the side chain is calculated with the backbone atoms excluding solvent but not counted in the area
- the folded state is defined as the area of the rotamer in question at residue i, but now in the context of the entire template structure including non-optimized side chains, i.e., every other fixed position residue
- the rotamer/template buried area is the difference between the reference and the folded states The
- a correction for a possible overestimation of buried surface area which may exist in the calculation of the energy of interaction between two rotamers (but not the interaction of a rotamer with the backbone). Since, as is generally outlined below, rotamers are only considered in pairs, that is, a first rotamer is only compared to a second rotamer during the "doubles" calculations, this may overestimate the amount of buried surface area in locations where more than two rotamers interact, that is, where rotamers from three or more residue positions come together. Thus, a correction or scaling factor is used as outlined below.
- The general energy of solvation is shown in Equation 4.
- Equation 5: $E_{sa} = f_1(A_{\text{buried hydrophobic}})$
- $f_1$ is a constant which ranges from about 10 to about 50 cal/mol/Å², with 23 or 26 cal/mol/Å² being preferred.
- Equation 7 or 8 may be used:
- Equation 8: $E_{sa} = f_1(A_{\text{buried hydrophobic}}) + f_2(A_{\text{buried hydrophilic}}) + f_3(A_{\text{exposed hydrophobic}}) + f_4(A_{\text{exposed hydrophilic}})$
- f 3 -f
- backbone atoms are not included in the calculation of surface areas, and values of 23 cal/mol/Å² ($f_1$) and -86 cal/mol/Å² ($f_2$) are determined.
- this overcounting problem is addressed using a scaling factor that compensates for only the portion of the expression for pairwise area that is subject to overcounting.
- values of -26 cal/mol/Å² ($f_1$) and 100 cal/mol/Å² ($f_2$) are determined.
- Atomic solvation energy is expensive, in terms of computational time and resources. Accordingly, in a preferred embodiment, the solvation energy is calculated for core and/or boundary residues, but not surface residues, with both a calculation for core and boundary residues being preferred, although any combination of the three is possible
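The buried-area bookkeeping described above can be sketched as follows. This is a hedged illustration only: it assumes a two-term energy of the form $f_1 A_{\text{buried hydrophobic}} + f_2 A_{\text{buried hydrophilic}}$, assumes a sign convention in which negative energies are favorable (so the pair -26 and 100 cal/mol/Å² is used as the default), and uses hypothetical function names and surface areas.

```python
def buried_area(reference_area, folded_area):
    """Buried area in Å^2: exposed area in the reference state minus exposed area in the folded state."""
    return reference_area - folded_area


def solvation_energy(buried_hydrophobic, buried_hydrophilic, f1=-26.0, f2=100.0):
    """Two-term solvation energy in kcal/mol; f1 and f2 are in cal/mol/Å^2."""
    return (f1 * buried_hydrophobic + f2 * buried_hydrophilic) / 1000.0


# Hypothetical surface areas for one rotamer.
hydrophobic = buried_area(150.0, 50.0)  # 100 Å^2 of nonpolar area buried
hydrophilic = buried_area(30.0, 20.0)   # 10 Å^2 of polar area buried
energy = solvation_energy(hydrophobic, hydrophilic)
```

Under these assumptions the benefit from hydrophobic burial outweighs the penalty from hydrophilic burial, so the rotamer receives a net favorable solvation score.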
- a hydrogen bond potential scoring function is used. A hydrogen bond potential is used as predicted hydrogen bonds do contribute to designed protein stability (see Stickle et al., J. Mol. Biol. 226:1143 (1992); Huyghues-Despointes et al., Biochem. 34:13267 (1995), both of which are expressly incorporated herein by reference). As outlined previously, explicit hydrogens are generated on the protein backbone structure
- the hydrogen bond potential consists of a distance-dependent term and an angle-dependent term, as shown in Equation 9
- Equation 10 is used for sp3 donor to sp3 acceptor
- Equation 11 is used for sp3 donor to sp2 acceptor
- Equation 12 is used for sp2 donor to sp3 acceptor
- Equation 13 is used for sp2 donor to sp2 acceptor
- $\theta$ is the donor-hydrogen-acceptor angle
- $\phi$ is the hydrogen-acceptor-base angle (the base is the atom attached to the acceptor, for example the carbonyl carbon is the base for a carbonyl oxygen acceptor)
- $\varphi$ is the angle between the normals of the planes defined by the six atoms attached to the sp2 centers (the supplement of $\varphi$ is used when $\varphi$ is less than 90°)
- the hydrogen-bond function is only evaluated when 2.6 Å ≤ R ≤ 3.2 Å, $\theta$ > 90°, $\phi$ - 109.5° < 90° for the sp3 donor - sp3 acceptor case, and $\phi$ > 90° for the sp3 donor - sp2 acceptor case; preferably, no switching functions are used
- Template donors and acceptors that are involved in template-template hydrogen bonds are preferably not included in the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen bond is considered to exist when 2.5 Å ≤ R ≤
- the hydrogen-bond potential may also be combined or used with a weak coulombic term that includes a distance-dependent dielectric constant of 40R, where R is the interatomic distance. Partial atomic charges are preferably only applied to polar functional groups. A net formal charge of +1 is used for Arg and Lys and a net formal charge of -1 is used for Asp and Glu; see Gasteiger, et al., Tetrahedron 36:3219-3288 (1980); Rappe, et al., J. Phys. Chem. 95:3358-3363 (1991)
- an explicit penalty is given for buried polar hydrogen atoms which are not hydrogen bonded to another atom
- this penalty for polar hydrogen burial is from about 0 to about 3 kcal/mol, with from about 1 to about 3 being preferred and 2 kcal/mol being particularly preferred
- This penalty is only applied to buried polar hydrogens not involved in hydrogen bonds
- a hydrogen bond is considered to exist when $E_{HB}$ ranges from about 1 to about 4 kcal/mol, with $E_{HB}$ of less than -2 kcal/mol being preferred
- the penalty is not applied to template hydrogens, i.e., unpaired buried hydrogens of the backbone
- the hydrogen bonding scoring function is used for all positions, including core, surface and boundary positions. In alternate embodiments, the hydrogen bonding scoring function may be used on only one or two of these.
- In a preferred embodiment, a secondary structure propensity scoring function is used. This is based on the specific amino acid side chain, and is conformation independent. That is, each amino acid has a certain propensity to take on a secondary structure, either α-helix or β-sheet, based on its φ and ψ angles. See Muñoz et al., Current Opin. in Biotech. 6:382 (1995); Minor, et al., Nature 367:660-663 (1994); Padmanabhan, et al., Nature 344:268-270 (1990); Muñoz, et al., Folding & Design 1(3):167-178 (1996); and Chakrabartty, et al., Protein Sci. 3:843 (1994), all of which are expressly incorporated herein by reference. Thus, for variable residue positions that are
- when a variable residue position is in a β-sheet backbone conformation, the β-sheet propensity scoring function is used. β-sheet backbone conformation is generally described by φ angles from -30 to -100 and ψ angles from +40 to +180. In alternate preferred embodiments, variable residue positions which are within areas of the backbone which are not assignable to either β-sheet or α-helix structure may also be subjected to secondary structure propensity calculations
- energies associated with secondary propensities are calculated using Equation 14
- $E_\alpha$ (or $E_\beta$) is the energy of α-helical (or β-sheet) propensity
- $\Delta G^\circ_{aa}$ is the standard free energy of helix propagation of the amino acid
- $\Delta G^\circ_{ala}$ is the standard free energy of helix propagation of alanine used as a standard, or standard free energy of β-sheet formation of the amino acid, both of which are available in the literature (see Chakrabartty, et al., (1994) (supra), and Muñoz, et al., (1996) (supra), both of which are expressly incorporated herein by reference)
- $N_{ss}$ is the propensity scale factor which is set to range from 1 to 4, with 2.0 being preferred. This potential is preferably selected in order to scale the propensity energies to a similar range as the other terms in the scoring function
- β-sheet propensities are preferably calculated only where the i-1 and i+1 residues are also in β-sheet conformation
- the secondary structure propensity scoring function is used only in the energy calculations for surface variable residue positions
- the secondary structure propensity scoring function is used in the calculations for core and boundary regions as well
- an electrostatic scoring function is used, as shown below in Equation 15
- At least one scoring function is used for each variable residue position; in preferred embodiments, two, three or four scoring functions are used for each variable residue position
- the preferred first step in the computational analysis comprises the determination of the interaction of each possible rotamer with all or part of the remainder of the protein. That is, the energy of interaction, as measured by one or more of the scoring functions, of each possible rotamer at each variable residue position with either the backbone or other rotamers, is calculated
- the interaction of each rotamer with the entire remainder of the protein, i.e., both the entire template and all other rotamers, is done
- the first step of the computational processing is done by calculating two sets of interactions for each rotamer at every position (step 70 of Figure 3): the interaction of the rotamer side chain with the template or backbone (the "singles" energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position (the "doubles" energy), whether that position is varied or floated
- the backbone in this case includes both the atoms of the protein structure backbone, as well as the atoms of any fixed residues, wherein the fixed residues are defined as a particular conformation of an amino acid
- the total singles energy is the sum of the energy of each scoring function used at a particular position, as shown in Equation 1, wherein n is either 1 or zero, depending on whether that particular scoring function was used at the rotamer position
- each singles $E_{total}$ for each possible rotamer is stored in the memory 24 within the computer, such that it may be used in subsequent calculations, as outlined below
- a first variable position, i, has three (an unrealistically low number) possible rotamers (which may be either from a single amino acid or different amino acids) which are labelled $i_a$, $i_b$, and $i_c$
- a second variable position, j, also has three possible rotamers, labelled $j_d$, $j_e$, and $j_f$
- nine doubles energies ($E_{total}$) are calculated in all: $E_{total}(i_a, j_d)$, $E_{total}(i_a, j_e)$, $E_{total}(i_a, j_f)$, $E_{total}(i_b, j_d)$, $E_{total}(i_b, j_e)$, $E_{total}(i_b, j_f)$, $E_{total}(i_c, j_d)$, $E_{total}(i_c, j_e)$, and $E_{total}(i_c, j_f)$
- each doubles $E_{total}$ for each possible rotamer pair is stored in memory 24 within the computer, such that it may be used in subsequent calculations, as outlined below
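The precomputation of singles and doubles energies for the two-position example above can be sketched as follows. The dictionary layout, the function name `build_energy_tables`, and the placeholder scoring callables are assumptions made for illustration; they stand in for whichever scoring functions are selected.

```python
from itertools import product


def build_energy_tables(rotamers, singles_fn, doubles_fn):
    """Precompute rotamer/template and rotamer/rotamer energies.

    rotamers maps each variable position to its list of rotamer labels.
    """
    positions = sorted(rotamers)
    singles = {(p, r): singles_fn(p, r) for p in positions for r in rotamers[p]}
    doubles = {}
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            # One entry per rotamer pair between two different positions.
            for ri, rj in product(rotamers[i], rotamers[j]):
                doubles[(i, ri, j, rj)] = doubles_fn(i, ri, j, rj)
    return singles, doubles


rotamers = {"i": ["a", "b", "c"], "j": ["d", "e", "f"]}
singles, doubles = build_energy_tables(
    rotamers,
    singles_fn=lambda p, r: 0.0,          # placeholder scoring function
    doubles_fn=lambda i, ri, j, rj: 0.0,  # placeholder scoring function
)
```

For three rotamers at each of two positions, the tables hold six singles entries and nine doubles entries, matching the count given in the text.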
- the next step of the computational processing may occur. Generally speaking, the goal of the computational processing is to determine a set of optimized protein sequences
- by "optimized protein sequence" herein is meant a sequence that best fits the mathematical equations herein
- a global optimized sequence is the one sequence that best fits Equation 1, i.e., the sequence that has the lowest energy of any possible sequence
- the set comprises the globally optimal sequence in its optimal conformation, i.e., the optimum rotamer at each variable position. That is, computational processing is run until the simulation program converges on a single sequence which is the global optimum
- the set comprises at least two optimized protein sequences
- the computational processing step may eliminate a number of disfavored combinations but be stopped prior to convergence, providing a set of sequences of which the global optimum is one
- further computational analysis for example using a different method, may be run on the set, to further eliminate sequences or rank them differently
- the global optimum may be reached, and then further computational processing may occur, which generates additional optimized sequences in the neighborhood of the global optimum. If a set comprising more than one optimized protein sequence is generated, the sequences may be rank ordered in terms of theoretical quantitative stability, as is more fully described below
- the computational processing step first comprises an elimination step, sometimes referred to as "applying a cutoff", either a singles elimination or a doubles elimination
- Singles elimination comprises the elimination of all rotamers with template interaction energies of greater than about 10 kcal/mol prior to any computation, with elimination energies of greater than about 15 kcal/mol being preferred and greater than about 25 kcal/mol being especially preferred
- doubles elimination is done when a rotamer has interaction energies greater than about 10 kcal/mol with all rotamers at a second residue position, with energies greater than about 15 being preferred and greater than about 25 kcal/mol being especially preferred
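The two cutoff steps can be sketched as below. This is an illustrative reading, not a definitive implementation: the function name, the table layout, the choice of 25 kcal/mol as the default, and the decision to evaluate the doubles cutoff against the rotamers that survive the singles cutoff are all assumptions.

```python
def apply_cutoffs(rotamers, singles, doubles, cutoff=25.0):
    """Apply a singles cutoff, then a doubles cutoff, and return the surviving rotamers."""

    def pair(i, r, j, s):
        key = (i, r, j, s)
        return doubles[key] if key in doubles else doubles[(j, s, i, r)]

    # Singles elimination: drop rotamers whose template energy exceeds the cutoff.
    kept = {p: [r for r in rots if singles[(p, r)] <= cutoff]
            for p, rots in rotamers.items()}

    # Doubles elimination: drop a rotamer if, for some other position,
    # its interaction with every rotamer at that position exceeds the cutoff.
    def clashes_everywhere(i, r):
        return any(
            kept[j] and all(pair(i, r, j, s) > cutoff for s in kept[j])
            for j in kept if j != i
        )

    return {i: [r for r in rots if not clashes_everywhere(i, r)]
            for i, rots in kept.items()}


# Hypothetical energies in kcal/mol.
rotamers = {1: ["a", "b", "c"], 2: ["d", "e"]}
singles = {(1, "a"): 30.0, (1, "b"): 0.0, (1, "c"): 1.0, (2, "d"): 0.0, (2, "e"): 0.0}
doubles = {(1, "a", 2, "d"): 0.0, (1, "a", 2, "e"): 0.0,
           (1, "b", 2, "d"): 40.0, (1, "b", 2, "e"): 50.0,
           (1, "c", 2, "d"): 0.0, (1, "c", 2, "e"): 0.0}
surviving = apply_cutoffs(rotamers, singles, doubles)
```

In this example rotamer "a" fails the singles cutoff and rotamer "b" fails the doubles cutoff, because it clashes with every rotamer at position 2.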
- the computational processing comprises direct determination of total sequence energies, followed by comparison of the total sequence energies to ascertain the global optimum and rank order the other possible sequences, if desired
- the energy of a total sequence is shown below in Equation 17
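Direct determination of total sequence energies can be sketched as follows, assuming that the energy of a complete rotamer sequence is a template term plus the sum of the singles energies plus the sum of the doubles energies over all pairs of positions. The function names and the small energy tables are hypothetical.

```python
from itertools import product


def total_energy(assignment, singles, doubles, e_template=0.0):
    """Energy of one complete rotamer sequence (assumed pairwise-decomposable form)."""
    positions = sorted(assignment)
    energy = e_template + sum(singles[(p, assignment[p])] for p in positions)
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            energy += doubles[(i, assignment[i], j, assignment[j])]
    return energy


def rank_sequences(rotamers, singles, doubles):
    """Enumerate every rotamer sequence and sort by total energy, lowest first."""
    positions = sorted(rotamers)
    scored = []
    for combo in product(*(rotamers[p] for p in positions)):
        assignment = dict(zip(positions, combo))
        scored.append((total_energy(assignment, singles, doubles), assignment))
    scored.sort(key=lambda item: item[0])
    return scored


# Hypothetical three-position problem with two rotamers per position.
rotamers = {1: ["a", "b"], 2: ["a", "b"], 3: ["a", "b"]}
singles = {(1, "a"): 0.0, (1, "b"): 1.0, (2, "a"): 0.0,
           (2, "b"): 0.5, (3, "a"): 2.0, (3, "b"): 0.0}
doubles = {(i, r, j, s): 0.0
           for i, j in [(1, 2), (1, 3), (2, 3)] for r in "ab" for s in "ab"}
doubles.update({(1, "a", 2, "a"): 5.0, (1, "a", 2, "b"): -1.0,
                (1, "b", 2, "a"): -3.0, (2, "a", 3, "b"): 4.0})
ranked = rank_sequences(rotamers, singles, doubles)
best_energy, best_assignment = ranked[0]
```

Exhaustive enumeration of this kind is only feasible for very small problems, since the number of sequences is the product of the rotamer counts at every position.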
- the computational processing includes one or more Branch and Terminate (B&T) computational steps as outlined below, and optionally a DEE step, also outlined below
- B&T Branch and Terminate
- the present invention provides a novel deterministic combinatorial search algorithm, called "Branch and Terminate" (B&T), derived from the Branch-and-Bound search method
- B&T Branch and Terminate
- the B&T approach is based on the construction of an efficient, but very restrictive bounding expression, which is used for the search of a combinatorial tree representing the protein system
- the bounding expression is used both to determine the optimal organization of the tree and to perform a highly effective pruning procedure named "termination"
- the B&T method rivals the current deterministic standard, Dead-End Elimination (DEE), sometimes finding the solution up to 21 times faster
- DEE Dead-End Elimination
- the B&T algorithm is an effective optimization algorithm when used alone. Moreover, it can increase the problem size limit of amino acid side chain placement calculations, such as protein design, by completing DEE optimizations that reach a point at which the DEE criteria become inefficient. Together the two algorithms make it possible to find solutions to problems that are intractable by either algorithm alone.
- B&T is used when DEE algorithms are not sufficient, due either to the nature of their energy distributions or their sheer size.
- the optimization of long hydrophilic side chains on β-sheets is typically composed of large numbers of rotamers with interaction energies that are very small in magnitude. DEE is able to reduce the combinatorial size of the problem significantly at the outset, but soon after, elimination becomes inefficient, relying entirely on computationally expensive DEE doubles calculations (Lasters, I. & Desmet, J., Prot. Eng. 6, 717-722 (1993); Gordon, D. B. & Mayo, S. L., J. Comp. Chem.
- B&T Branch-and-Terminate
- B&B Branch-and-Bound
- Backtrack algorithms are commonly used in atomic-level simulations to construct self-avoiding chains, and they have been used in protein design to engineer metal binding sites into proteins (Hellinga, H. W. & Richards, F. M., J. Mol. Biol. 222, 763-785 (1991))
- a bounding function is used that maximizes the efficiency of pruning for problems in which the total energy can be decomposed into interactions between pairs of rotamers
- the root of the tree is placed at the top, and branches extend downward
- Each level of depth of the tree corresponds to an amino acid position, and each node represents a particular rotamer choice at that position
- a path that extends all the way from the tree root through all levels of branches to a leaf describes a complete rotamer sequence
- the problem is to search for the path corresponding to the sequence with the lowest energy
- a partial path from the root describes a rotamer sequence that is incompletely specified
- the path can be interpreted physically as specifying a unique composite rotamer, or "super-rotamer", that occupies a subset of the amino acid positions. Extending the path deeper into the tree corresponds to appending additional rotamers to the super-rotamer, which can be repeated until all positions are specified. According to this interpretation, a full search of the tree would entail the construction of all possible super-rotamers to completion
- the pruning determination is accomplished by comparing a lower energy bound for the partially specified rotamer sequence to a known reference energy. As shown in Equation 18, given a reference energy of any plausible sequence, it must be true that the energy of the GMEC is less than or equal to the energy of any plausible sequence
- the Branch-and-Bound algorithm consists of an exhaustive traversal of the combinatorial tree, applying this criterion to each node as it is encountered
- the reference energy is updated. This way, the effectiveness of the bounding criterion is increased over the course of the optimization. Moreover, upon completion of the search, the reference energy is the global minimum energy
- the corresponding sequence is also stored during each update, which produces the corresponding GMEC
- the successful implementation of a B&B type of algorithm depends largely on the construction of the bounding expression. A bounding expression that is very stringent will produce lower bounds that are high in energy, and therefore will result in more sub-trees that can be pruned by the bounding criterion. The size of the resulting tree will be smaller than one pruned by a less stringent expression, and the search will be faster. It is therefore important to design the bounding expression to most fully utilize the sequence information available.
- The construction of such a bounding expression is shown in Example 1. Given a partially constructed super-rotamer and the available rotamers at the remaining positions, the approach is to utilize the corresponding energetic information as fully as possible while keeping the computational order of the bounding expression constant. The result is a novel, highly-effective bounding expression that provides the basis for the remaining B&T techniques.
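The general Branch-and-Bound scheme described above can be sketched as follows. The lower bound used here is a simple pairwise bound chosen for illustration; it is a stand-in and is not the bounding expression of Example 1. The function name, the table layout, and the energies are hypothetical.

```python
import math


def branch_and_bound(rotamers, singles, doubles):
    """Depth-first Branch-and-Bound over rotamer choices with a simple pairwise lower bound."""
    positions = sorted(rotamers)
    best = {"energy": math.inf, "assignment": None}

    def pair(i, r, j, s):
        key = (i, r, j, s)
        return doubles[key] if key in doubles else doubles[(j, s, i, r)]

    def lower_bound(depth, chosen, partial_energy):
        # Partial energy plus, for each unassigned position, the best case of its
        # singles energy, its interactions with the chosen rotamers, and the minimum
        # interaction with each later unassigned position.
        bound = partial_energy
        for a in range(depth, len(positions)):
            j = positions[a]
            best_j = math.inf
            for s in rotamers[j]:
                e = singles[(j, s)]
                e += sum(pair(positions[b], chosen[b], j, s) for b in range(depth))
                e += sum(min(pair(j, s, positions[c], t) for t in rotamers[positions[c]])
                         for c in range(a + 1, len(positions)))
                best_j = min(best_j, e)
            bound += best_j
        return bound

    def search(depth, chosen, partial_energy):
        if depth == len(positions):
            if partial_energy < best["energy"]:  # update the reference energy
                best["energy"] = partial_energy
                best["assignment"] = dict(zip(positions, chosen))
            return
        if lower_bound(depth, chosen, partial_energy) >= best["energy"]:
            return  # prune: this sub-tree cannot beat the reference energy
        i = positions[depth]
        for r in rotamers[i]:
            e = partial_energy + singles[(i, r)]
            e += sum(pair(positions[b], chosen[b], i, r) for b in range(depth))
            search(depth + 1, chosen + [r], e)

    search(0, [], 0.0)
    return best["energy"], best["assignment"]


# Hypothetical three-position problem with two rotamers per position.
rotamers = {1: ["a", "b"], 2: ["a", "b"], 3: ["a", "b"]}
singles = {(1, "a"): 0.0, (1, "b"): 1.0, (2, "a"): 0.0,
           (2, "b"): 0.5, (3, "a"): 2.0, (3, "b"): 0.0}
doubles = {(i, r, j, s): 0.0
           for i, j in [(1, 2), (1, 3), (2, 3)] for r in "ab" for s in "ab"}
doubles.update({(1, "a", 2, "a"): 5.0, (1, "a", 2, "b"): -1.0,
                (1, "b", 2, "a"): -3.0, (2, "a", 3, "b"): 4.0})
gmec_energy, gmec = branch_and_bound(rotamers, singles, doubles)
```

A tighter bound prunes more sub-trees, which is why the construction of the bounding expression dominates the performance of this kind of search.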
- the invention further provides a termination function, in which the bounding function is used to deterministically remove rotamers at all amino acid positions, thereby reducing the overall size of the tree before searching. Termination is additionally effective when performed at every level of recursion of the search, sometimes increasing the overall speed of the optimization by an order of magnitude
- the enhancements of the B&T algorithm relative to the B&B method are based on a process called "termination". Because all the pairwise interactions are precomputable, the organization of the combinatorial tree is arbitrary (i.e., there is no specific order in which different amino acid positions must be assigned to different levels of the tree). However, organization of the tree can have a significant influence on the speed of the calculation. For example, a greater reduction in the size of the search is derived from pruning a branch at the root of the tree rather than pruning a branch closer to the leaves. Placing a branch at the leaves that would be pruned if placed at the root would be inefficient because the same pruning step would necessarily be repeated for every leaf
- "Terminate" is intended to be contrasted with "eliminate," which is used to describe rotamers that are analogously discarded by using the DEE criterion. Indeed, many of the same rotamers are discarded. As with DEE, termination may be performed iteratively until no further rotamers are terminated. Iterative termination is executed as the preprocessing step before search of the tree
- termination serves as an effective preprocessing step
- the hallmark of the B&T algorithm is that termination is employed at every level of recursion
- the rotamers defined at levels above the level of the current amino acid position may be considered a root comprised of a single, partially specified super-rotamer
- Termination, then, consists of temporarily considering each of the rotamers at all the remaining positions as candidates for the next appendage of the super-rotamer and applying the bounding criterion to each one. All rotamers terminated this way may be discarded from the optimization of the sub-tree with this partially specified super-rotamer root
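Iterative termination as a preprocessing step (an empty super-rotamer root) can be sketched as below. The bound is again a simple pairwise lower bound used as a stand-in for the bounding expression of Example 1, and the reference energy is assumed to be the energy of an actual sequence built from the supplied rotamers, so that no position is ever emptied. All names and energies are hypothetical.

```python
def terminate(rotamers, singles, doubles, reference_energy):
    """Repeatedly discard rotamers whose lower bound exceeds the reference energy."""
    live = {p: list(r) for p, r in rotamers.items()}

    def pair(i, r, j, s):
        key = (i, r, j, s)
        return doubles[key] if key in doubles else doubles[(j, s, i, r)]

    def bound(j, s):
        # Lower bound on the energy of any sequence that contains rotamer s at position j.
        others = [k for k in live if k != j]
        e = singles[(j, s)]
        for k in others:
            e += min(singles[(k, t)] + pair(j, s, k, t) for t in live[k])
        for a, k in enumerate(others):
            for m in others[a + 1:]:
                e += min(pair(k, t, m, u) for t in live[k] for u in live[m])
        return e

    changed = True
    while changed:  # iterate until no further rotamers are terminated
        changed = False
        for j in live:
            keep = [s for s in live[j] if bound(j, s) <= reference_energy]
            if len(keep) != len(live[j]):
                live[j] = keep
                changed = True
    return live


# Hypothetical three-position problem with two rotamers per position.
rotamers = {1: ["a", "b"], 2: ["a", "b"], 3: ["a", "b"]}
singles = {(1, "a"): 0.0, (1, "b"): 1.0, (2, "a"): 0.0,
           (2, "b"): 0.5, (3, "a"): 2.0, (3, "b"): 0.0}
doubles = {(i, r, j, s): 0.0
           for i, j in [(1, 2), (1, 3), (2, 3)] for r in "ab" for s in "ab"}
doubles.update({(1, "a", 2, "a"): 5.0, (1, "a", 2, "b"): -1.0,
                (1, "b", 2, "a"): -3.0, (2, "a", 3, "b"): 4.0})
loose = terminate(rotamers, singles, doubles, reference_energy=0.0)
tight = terminate(rotamers, singles, doubles, reference_energy=-0.5)
```

In this example a loose reference energy terminates nothing, while a reference energy equal to the global minimum leaves a single rotamer at each position, which illustrates why a good reference energy makes termination more effective.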
- the recursive step in a B&B search consists of application of the bounding criterion to the rotamers at only one amino acid position
- the benefits of the extra reductions in the sizes of sub-trees far outweigh the costs of calculation of extra bounds for termination
- the resulting increase in efficiency makes the B&T search significantly faster than a similarly constructed B&B search
- the energetic information produced by the termination process can be used to determine the optimal search order for the remainder of the tree. Because termination effectively replaces the usual bounding process, the resulting breadth-first algorithm is called "Branch-and-Terminate." We also describe a variation of the B&T method that can rapidly find approximate solutions close to the GMEC.
- residue ordering is performed at every level of recursion depth.
- an optimal ordering may be obtained by combining energetic and list-size sorting criteria using the following heuristic. Positions are sorted in descending order according to a rank index, as computed in Equation 22
- N is the number of rotamers at the position
- $E_{top}$ is the bounding energy of the top-ranked rotamer of that position
- $E_{top,min}$ and $E_{top,max}$ are the minimum and maximum top-ranked bounding energies of all positions, respectively
- the quantity f is selected to control the relative weighting of the two criteria
- a value of zero for f corresponds to sort based entirely on the number of residues per position, and a value of one produces a ranking based entirely on bounding energies
- the approximation is based on the observation that the GMEC rotamers are often among those with the lowest termination bounding energies according to the bounding expression. This indicates that the bounding expression has predictive properties
- the ranked rotamer lists are arbitrarily truncated after the pre-processing termination step, and the B&T search is conducted on the abbreviated set of rotamers
- the Branch-and-Terminate algorithm herein is tailored for rotamer selection, but the algorithm is in fact generalizable to any combinatorial optimization problem in which all the interaction energies are pairwise and pre-computable
- the bounding expression we describe is similarly general
- the B&T and DEE algorithms are used in succession. DEE is used to eliminate rotamers and to perform unification until the optimization reaches iterations that are inefficient. Inefficiency typically occurs after several unifications when the total number of rotamers and unified super-rotamers gets very large (>5000) and very few eliminations result even from lengthy Goldstein doubles calculations
- DEE optimization is aborted, and the state information is transferred to a B&T implementation
- Rotamer lists and energy tables are transferred directly, including references to unified super-rotamers, which are transparently represented as ordinary rotamers in the B&T algorithm
- DEP dead-ending pairs
- the computational processing includes one or more Dead-End Elimination (DEE) computational steps
- DEE Dead-End Elimination
- the DEE theorem is the basis for a very fast discrete search program that was designed to pack protein side chains on a fixed backbone with a known sequence. See Desmet, et al., Nature 356:539-542 (1992); Desmet, et al., The Protein Folding Problem and Tertiary Structure Prediction, Ch. 10:1-49 (1994); Goldstein.
- Equation 23: $E(i_a) + \sum_j \min_t E(i_a, j_t) > E(i_b) + \sum_j \max_t E(i_b, j_t)$
- In Equation 23, rotamer $i_a$ is being compared to rotamer $i_b$
- the left side of the inequality is the best possible interaction energy ($E_{total}$) of $i_a$ with the rest of the protein, that is, "min over t" means find the rotamer t on position j that has the best interaction with rotamer $i_a$
- the right side of the inequality is the worst possible (max) interaction energy of rotamer $i_b$ with the rest of the protein. If this inequality is true, then rotamer $i_a$ is Dead-Ending and can be Eliminated
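The original DEE test for a single pair of rotamers at one position can be sketched as follows. The function name, the table layout, and the energies are hypothetical; the inequality itself follows the criterion stated above (best case for the candidate worse than worst case for the competitor).

```python
def dee_eliminates(i, a, b, rotamers, singles, doubles):
    """Return True if rotamer a at position i is dead-ending relative to rotamer b."""

    def pair(p, r, q, s):
        key = (p, r, q, s)
        return doubles[key] if key in doubles else doubles[(q, s, p, r)]

    others = [j for j in rotamers if j != i]
    # Best possible energy of a: singles energy plus the minimum interaction at each other position.
    best_a = singles[(i, a)] + sum(
        min(pair(i, a, j, t) for t in rotamers[j]) for j in others)
    # Worst possible energy of b: singles energy plus the maximum interaction at each other position.
    worst_b = singles[(i, b)] + sum(
        max(pair(i, b, j, t) for t in rotamers[j]) for j in others)
    return best_a > worst_b


# Hypothetical two-position problem.
rotamers = {1: ["a", "b"], 2: ["c", "d"]}
singles = {(1, "a"): 3.0, (1, "b"): 0.0, (2, "c"): 0.0, (2, "d"): 0.0}
doubles = {(1, "a", 2, "c"): 1.0, (1, "a", 2, "d"): 2.0,
           (1, "b", 2, "c"): 0.0, (1, "b", 2, "d"): 1.0}
a_is_dead = dee_eliminates(1, "a", "b", rotamers, singles, doubles)
b_is_dead = dee_eliminates(1, "b", "a", rotamers, singles, doubles)
```

Each test needs only one sum over the other positions, which is the source of the speed noted in the text.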
- the speed of DEE comes from the fact that the theorem only requires sums over the sequence length to test and eliminate rotamers
- a variation of DEE is performed. Goldstein DEE, based on Goldstein, (1994) (supra), hereby expressly incorporated by reference, is a variation of the DEE computation, as shown in Equation 24
- Equation 24: $E(i_a) - E(i_b) + \sum_j \min_t [E(i_a, j_t) - E(i_b, j_t)] > 0$
- the Goldstein Equation 24 says that a first rotamer a of a particular position i (rotamer $i_a$) will not contribute to a local energy minimum if the energy of conformation with $i_a$ can always be lowered by just changing the rotamer at that position to $i_b$, keeping the other residues equal. If this inequality is true, then rotamer $i_a$ is Dead-Ending and can be Eliminated
- a first DEE computation is done where rotamers at a single variable position are compared, ("singles" DEE) to eliminate rotamers at a single position
- This analysis is repeated for every variable position, to eliminate as many single rotamers as possible
- the minimum and maximum calculations of Equation 23, depending on which DEE variation is used thus conceivably allowing the elimination of further rotamers
- the singles DEE computation can be repeated until no more rotamers can be eliminated, that is, when the inequality is no longer true such that all of them could conceivably be found on the global optimum
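Iterating singles DEE to convergence can be sketched as follows, here using the Goldstein form of the criterion (singles energy difference plus, for each other position, the minimum difference in pair energies, compared with zero). The function name, table layout, and energies are hypothetical.

```python
def goldstein_singles_dee(rotamers, singles, doubles):
    """Repeat Goldstein singles DEE over all positions until nothing more is eliminated."""
    live = {p: list(r) for p, r in rotamers.items()}

    def pair(i, r, j, s):
        key = (i, r, j, s)
        return doubles[key] if key in doubles else doubles[(j, s, i, r)]

    def dead(i, a, b):
        # Goldstein criterion: a can always be improved by switching to b.
        gap = singles[(i, a)] - singles[(i, b)]
        for j in live:
            if j != i:
                gap += min(pair(i, a, j, t) - pair(i, b, j, t) for t in live[j])
        return gap > 0

    changed = True
    while changed:
        changed = False
        for i in live:
            for a in list(live[i]):
                if any(dead(i, a, b) for b in live[i] if b != a):
                    live[i].remove(a)
                    changed = True  # eliminations can enable further eliminations
    return live


# Hypothetical two-position problem.
rotamers = {1: ["a", "b"], 2: ["c", "d"]}
singles = {(1, "a"): 3.0, (1, "b"): 0.0, (2, "c"): 0.0, (2, "d"): 0.0}
doubles = {(1, "a", 2, "c"): 1.0, (1, "a", 2, "d"): 2.0,
           (1, "b", 2, "c"): 0.0, (1, "b", 2, "d"): 1.0}
remaining = goldstein_singles_dee(rotamers, singles, doubles)
```

In this example the elimination of a rotamer at position 1 is what allows a rotamer at position 2 to be eliminated, which shows why the computation is repeated.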
- doubles DEE is additionally done
- pairs of rotamers are evaluated, that is, a first rotamer at a first position and a second rotamer at a second position are compared to a third rotamer at the first position and a fourth rotamer at the second position, either using original or Goldstein DEE. Pairs are then flagged as nonallowable, although single rotamers cannot be eliminated, only the pair
- the minimum calculations of Equation 24 change (depending on which DEE variation is used), thus conceivably allowing the flagging of further rotamer pairs. Accordingly, the doubles DEE computation can be repeated until no more rotamer pairs can be flagged
- rotamer pairs are initially prescreened to eliminate rotamer pairs prior to DEE. This is done by doing relatively computationally inexpensive calculations to eliminate certain pairs up front. This may be done in several ways, as is outlined below
- To search exhaustively for all dead-ending rotamers at a residue position i it is necessary to compare every rotamer to every other rotamer available at i
- each column corresponds to a particular rotamer, $i_r$, as a candidate for elimination
- each row corresponds to one of the possible reference rotamers $i_t$
- an exhaustive search of $n^2 - n$ matrix elements is necessary. Such a matrix is evaluated for each of the positions that may be represented by i
- the rotamer pair with the lowest interaction energy with the rest of the system is found. Inspection of the energy distributions in sample comparison matrices has revealed that an $i_u j_v$ pair that dead-end eliminates a particular $i_r j_s$ pair can also eliminate other $i_r j_s$ pairs. In fact, there are often a few $i_u j_v$ pairs, which we call "magic bullets," that eliminate a significant number of $i_r j_s$ pairs. We have found that one of the most potent magic bullets is the pair for which maximum interaction energy, $\epsilon_{max}([i_u j_v])$, is least (see Equations 29-31). This pair is referred to as $[i_u j_v]_{mb}$. If this rotamer pair is used in the first round of doubles DEE, it tends to eliminate pairs faster
- The magic bullet Goldstein calculation will also discover all dead-ending pairs that would be discovered by Equation 23 or 24, thereby making it unnecessary. This stems from the fact that $\epsilon_{max}([i_u j_v]_{mb})$ must be less than or equal to any $\epsilon_{max}([i_u j_v])$ that would successfully eliminate a pair by Equations 23 or 24
- the last DEE speed enhancement refines the search of the remaining quarter of the matrix. This is done by constructing a metric from the precomputed extrema to detect those matrix elements likely to result in a dead-ending pair
- the first-order doubles criterion is applied only to those doubles for which $q_{rs}$ > 0.98 and $q_{uv}$ > 0.99.
- the sample data analyses predict that by using these two metrics, as many as half of the dead-ending elements may be found by evaluating only two to five percent of the reduced matrix.
- single and double DEE, using either or both of original DEE and Goldstein DEE, is run iteratively until no further elimination is possible. Usually, convergence is not complete, and further elimination must occur to achieve convergence. This is generally done using "super residue" DEE. In a preferred embodiment, additional DEE computation is done by the creation of "super residues" or "unification", as is generally described in Desmet et al., Nature 356:539-542 (1992); Desmet et al., The Protein Folding Problem and Tertiary Structure Prediction.
- a super residue is a combination of two or more variable residue positions which is then treated as a single residue position
- the super residue is then evaluated in singles DEE, and doubles DEE, with either other residue positions or super residues
- the disadvantage of super residues is that there are many more rotameric states which must be evaluated; that is, if a first variable residue position has 5 possible rotamers, and a second variable residue position has 4 possible rotamers, there are 20 possible super residue rotamers which must be evaluated
- these super residues may be eliminated similar to singles, rather than being flagged like pairs
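The growth in rotameric states under unification can be illustrated with a short sketch. The helper names below are hypothetical; dropping already-flagged pairs when the super residue is formed, and defining the super rotamer's self energy as the two template energies plus the pair energy, are assumptions made for illustration rather than details stated in the text.

```python
from itertools import product


def unify(rotamers_i, rotamers_j, flagged=frozenset()):
    """Combine two positions into one super residue.

    Each super rotamer is an (r, s) tuple. Pairs already flagged as
    dead-ending are dropped outright (assumed here), since a super rotamer
    can be eliminated like a single rather than merely flagged.
    """
    return [(r, s) for r, s in product(rotamers_i, rotamers_j) if (r, s) not in flagged]


def super_single_energy(single_i, single_j, pair_ij, r, s):
    """Self energy of super rotamer (r, s): both template terms plus the pair term."""
    return single_i[r] + single_j[s] + pair_ij[r][s]
```

With 5 rotamers at one position and 4 at the other, `unify(range(5), range(4))` yields the 20 super rotamers mentioned above.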
- the selection of which positions to combine into super residues may be done in a variety of ways. In general, random selection of positions for super residues results in inefficient elimination; it can be done, although this is not preferred
- the first consideration in the selection of positions for a super residue is the number of rotamers at the position. If the position has too many rotamers, it is never unified into a super residue, as the computation becomes too unwieldy. Thus, only positions with fewer than about 100,000 rotamers are chosen, with fewer than about 50,000 being preferred and fewer than about 10,000 being especially preferred
- the evaluation of whether to form a super residue is done as follows. All possible pairs of positions are ranked using Equation 33, and the pair with the highest value is chosen for unification
- Equation 33: $\dfrac{\text{fraction of flagged pairs}}{\log(\text{number of super rotamers resulting from the potential unification})}$
- Equation 33 is looking for the pair of positions that has the highest fraction or percentage of flagged pairs but the fewest number of super rotamers. That is, the pair that gives the highest value for Equation 33 is preferably chosen. Thus, a pair of positions that has the highest number of flagged pairs but also a very large number of super rotamers (that is, the number of rotamers at position i times the number of rotamers at position j) may not be chosen (although it could be) over a pair with a lower percentage of flagged pairs but fewer super rotamers
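A small sketch of the Equation 33 ranking follows, assuming the equation is the ratio of the flagged-pair fraction to the logarithm of the super rotamer count. The function names and the `candidates` layout are hypothetical, and the natural logarithm is an assumption (the base only rescales every score, so it does not change the ranking).

```python
import math


def unification_score(n_flagged, n_rot_i, n_rot_j):
    """Fraction of flagged rotamer pairs divided by log(number of super rotamers)."""
    n_super = n_rot_i * n_rot_j
    if n_super < 2:
        return 0.0  # log(1) = 0; a single super rotamer offers nothing to unify
    return (n_flagged / n_super) / math.log(n_super)


def choose_pair(candidates):
    """candidates maps (i, j) -> (flagged pairs, rotamers at i, rotamers at j)."""
    return max(candidates, key=lambda ij: unification_score(*candidates[ij]))
```

For example, a pair of positions with 10 of 20 rotamer pairs flagged outranks a pair with 500 of 10,000 flagged, even though the latter has more flagged pairs in absolute terms.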
- positions are chosen for super residues that have the highest average energy, that is, for positions i and j, the average energy of all rotamers for i and all rotamers for j is calculated, and the pair with the highest average energy is chosen as a super residue
- Super residues are preferably made one at a time. After a super residue is chosen, the singles and doubles DEE computations are repeated, with the super residue treated as if it were a regular residue. As for singles and doubles DEE, the elimination of rotamers in the super residue DEE will alter the minimum energy calculations of DEE. Thus, repeating singles and/or doubles DEE can result in further elimination of rotamers
- FIG 3 is a detailed illustration of the processing operations associated with a ranking module 34 of the invention
- the calculation and storage of the singles and doubles energies 70 is the first step, although these may be recalculated every time. Step 72 is the optional application of a cutoff, where singles or doubles energies that are too high are eliminated prior to further processing. Either or both of original singles DEE 74 or Goldstein singles DEE 76 may be done, with the elimination of original singles DEE 74 being generally preferred
- Original doubles (78) and/or Goldstein doubles (80) DEE is run
- Super residue DEE is then generally run, either original (82) or Goldstein (84) super residue DEE. This preferably results in convergence at a global optimum sequence. As is depicted in Figure 3, after any step any or all of the previous steps can be rerun, in any order
- DEE is run until the global optimum sequence is found. That is, the set of optimized protein sequences contains a single member, the global optimum
- the various DEE steps are run until a manageable number of sequences is found, i.e., no further processing is required
- sequences represent a set of optimized protein sequences, and they can be evaluated as is more fully described below
- a manageable number of sequences depends on the length of the sequence, but generally ranges from about 1 to about $10^{15}$ possible rotamer sequences. This range can be extended to approximately $10^{30}$ if B&T is used as the next analyzing step
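As a rough illustration of how the size of the remaining search space might be checked, the sketch below counts rotamer sequences as the product of the per-position rotamer counts and compares the result with the approximate limits quoted above. The function names and the returned labels are hypothetical, and treating the quoted ranges as hard thresholds is a simplifying assumption.

```python
import math


def log10_sequence_count(rotamer_counts):
    """log10 of the number of rotamer sequences (product of per-position counts)."""
    return sum(math.log10(n) for n in rotamer_counts)


def suggest_next_step(rotamer_counts, direct_limit=15.0, bt_limit=30.0):
    """Pick a follow-up step from the size of the remaining search space."""
    size = log10_sequence_count(rotamer_counts)
    if size <= direct_limit:
        return "direct"                # small enough without B&T
    if size <= bt_limit:
        return "branch-and-terminate"  # manageable if B&T is the next step
    return "continue DEE"
```

Working in log space avoids building very large integers or overflowing floats when many positions remain.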
- DEE is run to a point, resulting in a set of optimized sequences (in this context, a set of remainder sequences), and then further computational processing of a different type may be run
- direct calculation of sequence energy as outlined above is done on the remainder possible sequences
- a Monte Carlo search can be run
- B&T can be run
- the computation processing need not comprise a DEE computational step
- a Monte Carlo search is undertaken, as is known in the art. See Metropolis et al., J. Chem. Phys. 21:1087 (1953), hereby incorporated by reference
- a random sequence comprising random rotamers is chosen as a start point
- the variable residue positions are classified as core, boundary or surface residues and the set of available residues at each position is thus defined
- a random sequence is generated, and a random rotamer for each amino acid is chosen. This serves as the starting sequence of the Monte Carlo search
- a Monte Carlo search then makes a random jump at one position, either to a different rotamer of the same amino acid or a rotamer of a different amino acid, and then a new sequence energy ($E_{\text{total sequence}}$) is calculated, and if the new sequence energy meets the Boltzmann criteria for acceptance, it is used as the starting point for another jump If
- additional sequences are also optimized protein sequences
- the generation of additional optimized sequences is generally preferred so as to evaluate the differences between the theoretical and actual energies of a sequence
- the set of sequences is at least about 75% homologous to each other, with at least about 80% homologous being preferred, at least about 85% homologous being particularly preferred, and at least about 90% being especially preferred
- homology as high as 95% to 98% is desirable
- Homology in this context means sequence similarity or identity, with identity being preferred. Identical in this context means identical amino acids at corresponding positions in the two sequences which are being compared
- Homology in this context includes amino acids which are identical and those which are similar (functionally equivalent). This homology will be determined using standard techniques known in the art, such as the Best Fit sequence program described by Devereux et al., Nucl. Acid Res.,
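For sequences designed on the same backbone, identity at corresponding positions can be computed directly, without an alignment step. The sketch below is a simplified stand-in for the standard alignment tools mentioned above, not a reimplementation of them; the function name is hypothetical, and it assumes both sequences have the same length.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of corresponding positions that carry the same amino acid."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)
```

Two ten-residue sequences that differ at a single position are 90% identical by this measure.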
- the search module 36 may be written to execute a Monte Carlo search as described above. Starting with the global solution, random positions are changed to other rotamers allowed at the particular position, both rotamers from the same amino acid and rotamers from different amino acids. A new sequence energy ($E_{\text{total sequence}}$) is calculated, and if the new sequence energy meets the Boltzmann criteria for acceptance, it is used as the starting point for another jump. See Metropolis et al., 1953, supra, hereby incorporated by reference.
- the best scoring sequences may be output as a rank-ordered list
- at least about $10^6$ jumps are made, with at least about $10^7$ jumps being preferred and at least about $10^8$ jumps being particularly preferred
- at least about 100 to 1000 sequences are saved, with at least about 10,000 sequences being preferred and at least about 100,000 to 1,000,000 sequences being especially preferred.
- the temperature is preferably set to 1000 K
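The Monte Carlo search described above can be sketched as a Metropolis loop over rotamer sequences. This is a minimal illustration under stated assumptions: energies are taken to be in kcal/mol (hence the value of the Boltzmann constant), each rotamer index at a position stands for one rotamer of any allowed amino acid, and the names `single`, `pair`, `total_energy` and `monte_carlo` are hypothetical. It tracks only the best sequence seen rather than a rank-ordered list.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); the energy unit is an assumption


def total_energy(seq, single, pair):
    """Sum of template energies plus pair energies for one rotamer sequence."""
    energy = sum(single[i][r] for i, r in enumerate(seq))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            energy += pair[(i, j)][seq[i]][seq[j]]
    return energy


def monte_carlo(single, pair, start, jumps=10000, temperature=1000.0, seed=0):
    """Metropolis search: random single-position jumps, Boltzmann acceptance."""
    rng = random.Random(seed)
    beta = 1.0 / (K_B * temperature)
    current, e_cur = list(start), total_energy(start, single, pair)
    best, e_best = list(current), e_cur
    for _ in range(jumps):
        i = rng.randrange(len(current))        # position to change
        r = rng.randrange(len(single[i]))      # proposed rotamer at that position
        if r == current[i]:
            continue
        trial = current[:]
        trial[i] = r
        e_new = total_energy(trial, single, pair)
        # Downhill moves are always accepted; uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) * beta):
            current, e_cur = trial, e_new
            if e_cur < e_best:
                best, e_best = current[:], e_cur
    return best, e_best
```

Recomputing the full energy at every jump keeps the sketch short; a practical implementation would update only the terms involving the changed position.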
- each optimized protein sequence preferably comprises at least about 5-10% variant amino acids from the starting or wild-type sequence, with at least about 15-20% changes being preferred and at least about 30% changes being particularly preferred
- the designed proteins are chemically synthesized as is known in the art. This is particularly useful when the designed proteins are short, preferably less than 150 amino acids in length, with less than 100 amino acids being preferred, and less than 50 amino acids being particularly preferred, although as is known in the art, longer proteins can be made chemically or enzymatically
- the optimized sequence is used to create a nucleic acid such as DNA which encodes the optimized sequence and which can then be cloned into a host cell and expressed
- nucleic acids, and particularly DNA, can be made which encode each optimized protein sequence. This is done using well known procedures
- the choice of codons, suitable expression vectors and suitable host cells will vary depending on a number of factors, and can be easily optimized as needed
- the designed proteins are experimentally evaluated and tested for structure, function and stability, as required. This will be done as is known in the art, and will depend in part on the original protein from which the protein backbone structure was taken
- the designed proteins are more stable than the known protein that was used as the starting point, although in some cases, if some constraints are placed on the methods, the designed protein may be less stable
- Stable in this context means that the new protein retains either biological activity or conformation past the point at which the parent molecule did. Stability includes, but is not limited to, thermal stability, i.e., an increase in the temperature at which reversible or irreversible denaturing starts to occur; proteolytic stability, i.e., a decrease in the amount of protein which is irreversibly cleaved in the presence of a particular protease (including autolysis); stability to alterations in pH or oxidative conditions,
- modelled proteins are at least about 5% more stable than the original protein, with at least about 10% being preferred and at least about 20-50% being especially preferred
- the results of the testing operations may be computationally assessed, as shown with step 62 of Figure 2
- An assessment module 38 may be used in this operation. That is, computer code may be prepared to analyze the test data with respect to any number of metrics
- the protein is utilized (step 66), as discussed below. If a protein is not selected, the accumulated information may be used to alter the ranking module 34, and/or step 56 is repeated and more sequences are searched.
- the experimental results are used for design feedback and design optimization.
- proteins and enzymes exhibiting increased thermal stability may be used in industrial processes that are frequently run at elevated temperatures, for example carbohydrate processing (including saccharification and liquefaction of starch to produce high fructose corn syrup and other sweeteners), protein processing (for example the use of proteases in laundry detergents, food processing, feed stock processing, baking, etc.), etc.
- useful pharmaceutical proteins such as analogs of known proteinaceous drugs which are more thermostable, less proteolytically sensitive, or contain other desirable changes.
- Rotamers were selected from a backbone dependent library (Dunbrack, R.L. & Karplus, M., J. Mol. Biol. 230, 543-574 (1993)).
- α-helical surface positions: the 12 residues occupying the b, c, and f locations in the heptad repeat of one helix of the coiled-coil GCN4-p1 dimer (E.K. O'Shea, et al., Science.
- the core residues (positions 3, 5, 7, 20, 26, 30, 34, 39, 52, 54) were selected from the set of hydrophobic amino acids, and the boundary residues (positions 1, 12, 23, 33, 37, 45, 50, 56) were selected from the composite list of hydrophilic and hydrophobic residues. There were $1.9 \times 10^{34}$ possible rotameric combinations
- Pairwise Bounding Expression. This section describes the construction of a stringent expression for a lower bound for a system composed only of one and two-body interactions in terms of both a partially specified sequence and the set of rotamers available at its unspecified positions
- the total potential energy can be expressed as the sum of energies between all pairs.
- i and j refer to amino-acid positions, and E(i,j) is the energy of interaction between the amino acids at those positions.
- a protein system also consists of single-body interactions. Because each body is an amino-acid side chain at a particular position on the protein backbone, there is an energy contribution both from side chain interactions with other side chains as well as interactions with the protein template scaffolding. Both energies of interaction depend on the side chain position, amino acid identity, and configuration. Thus the total potential energy can be expressed,
- To ensure that the bounding expression satisfies the condition in Equation 3, we use the following inequalities (Equations 37 and 38):
- Equation 37: $\min_r [E(i_r, \text{template})] \le E(i_g, \text{template})$
- Equation 39: $E_{bound} = \sum_i \min_r [E(i_r, \text{template})] + \sum_i \sum_{j>i} \min_r \min_s [E(i_r, j_s)]$
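The bound described here, the sum of the minimum template energy at each position plus the sum over position pairs of the minimum pair energy, can be checked numerically against a brute-force minimum. The sketch below is illustrative only; the data layout (`single[i][r]`, `pair[(i, j)][r][s]` with i before j) and the function names are hypothetical.

```python
from itertools import product


def total_energy(seq, single, pair):
    """Template energies plus pair energies for a fully specified rotamer sequence."""
    energy = sum(single[i][r] for i, r in enumerate(seq))
    energy += sum(table[seq[i]][seq[j]] for (i, j), table in pair.items())
    return energy


def lower_bound(single, pair):
    """Minimize every one-body and two-body term independently and add them up."""
    bound = sum(min(energies) for energies in single)
    bound += sum(min(min(row) for row in table) for table in pair.values())
    return bound


def brute_force_minimum(single, pair):
    """Exhaustive minimum over all rotamer sequences (tiny systems only)."""
    choices = [range(len(energies)) for energies in single]
    return min(total_energy(seq, single, pair) for seq in product(*choices))
```

Because each term is minimized separately, the rotamers that achieve the individual minima need not be mutually consistent, which is why the bound can fall below the true minimum energy.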
- Equation 35 represents a generic strategy for producing a bounding expression from any total energy expression
- more restrictive bounds can be obtained from energy expressions that sum over three or four-body interactions
- the computational cost to implement such bounds on a protein system is very high. Fortunately, there are variations of Equation 35 that are equivalent in terms of computational cost yet yield better bounds
- Equation 41 can be decomposed into two subsets, fixed (F) and variable (V). Equation 41 can be rewritten as Equation 43,
- Equation 44 Equation 44
- Equation 45 min ⁇ E p ⁇ r ) ⁇ ⁇ mm E p r (i i )
- The middle two terms of Equation 44 differ only in their indices, and are therefore equivalent to one another. However, there is a difference once the minimum operators are applied, since the rotamers of the fixed subset (F) will restrict the selection of the minimum energy rotamer pair for the minimized third term, but not for the second. Therefore, we reverse the order of the summation for the second term and combine it with the third term to make use of Equation 45, such that the minimum will be as large as possible,
- Equation 47 min £ p ⁇ l ⁇ r (',. )
- The expression is generalizable to any system consisting only of two-body interactions such that the total energy of the system can be expressed as in Equation 41.
- The computational cost of evaluating Equation 49 is proportional to $p^2 n^2$, where p is the number of positions and n is the average number of rotamers at each position.
- Termination consists of evaluating the bounding expression for rotamers at all the unspecified positions. Therefore, a position is temporarily considered a member of set F while its rotamers are being evaluated. Since the expensive second term of the final summation is dependent only on V, its possible values may be precomputed for all rotamers i, once per position and placed into a table for lookup during the evaluation of Equation 49.
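The idea of bounding a partially specified sequence, with the V-only term looked up from a precomputed table, can be illustrated with a generic depth-first branch-and-bound search. This is a sketch under stated assumptions and not the patent's B&T procedure: positions are fixed in index order (so the variable set at any depth is simply the remaining positions), the bound takes one minimum per variable position over its template, fixed-pair and tabulated terms, and all names are hypothetical.

```python
def pair_tail_table(single, pair):
    """tail[j][s]: sum over later positions k of min_t E(j_s, k_t); depends only on V."""
    p = len(single)
    return [
        [sum(min(pair[(j, k)][s]) for k in range(j + 1, p)) for s in range(len(single[j]))]
        for j in range(p)
    ]


def branch_and_bound(single, pair):
    """Depth-first search that discards partial sequences whose bound cannot win."""
    p = len(single)
    tail = pair_tail_table(single, pair)   # computed once, reused by every bound call
    best = {"energy": float("inf"), "seq": None}

    def bound(fixed, e_fixed):
        total = e_fixed
        for j in range(len(fixed), p):     # each still-variable position
            total += min(
                single[j][s] + tail[j][s]
                + sum(pair[(i, j)][r][s] for i, r in enumerate(fixed))
                for s in range(len(single[j]))
            )
        return total

    def descend(fixed, e_fixed):
        if len(fixed) == p:                # fully specified sequence
            if e_fixed < best["energy"]:
                best["energy"], best["seq"] = e_fixed, list(fixed)
            return
        j = len(fixed)
        for s in range(len(single[j])):
            e_next = e_fixed + single[j][s] + sum(
                pair[(i, j)][r][s] for i, r in enumerate(fixed)
            )
            fixed.append(s)
            if bound(fixed, e_next) < best["energy"]:
                descend(fixed, e_next)     # otherwise this branch is discarded
            fixed.pop()

    descend([], 0.0)
    return best["seq"], best["energy"]
```

Taking the minimum over the whole bracket for each variable position, rather than minimizing each term separately, gives a bound that is at least as tight, since the minimum of a sum is never smaller than the sum of the minima.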
- the size limit may be raised even higher once the limitations of the approximate form of the algorithm become better understood
- the approximate algorithm found the GMEC solutions up to a thousand times faster than either of the exact methods
- the DEE implementation to which the B&T method is compared incorporates some conservative approximations in the form of high energy threshold rejection (HETR) criteria (De Maeyer, M., et al., Folding & Design 2:53-56 (1997))
- HETR high energy threshold rejection
- Analogous techniques may provide a way to construct a faster, approximate B&T algorithm with a clearly defined accuracy
- truncation based on bounding energies might be an effective replacement for HETR cutoffs in DEE
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU80372/00A AU8037200A (en) | 1999-09-01 | 2000-09-01 | Methods and compositions utilizing a branch and terminate algorithm for protein design |
EP00971083A EP1222603A2 (en) | 1999-09-01 | 2000-09-01 | Methods and compositions utilizing a branch and terminate algorithm for protein design |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15181899P | 1999-09-01 | 1999-09-01 | |
US60/151,818 | 1999-09-01 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001016862A2 true WO2001016862A2 (en) | 2001-03-08 |
WO2001016862A3 WO2001016862A3 (en) | 2002-01-03 |
Family
ID=22540365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/040805 WO2001016862A2 (en) | 1999-09-01 | 2000-09-01 | Methods and compositions utilizing a branch and terminate algorithm for protein design |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1222603A2 (en) |
AU (1) | AU8037200A (en) |
WO (1) | WO2001016862A2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001090960A2 (en) * | 2000-05-24 | 2001-11-29 | California Institute Of Technology | Methods for protein design |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5680331A (en) * | 1992-10-05 | 1997-10-21 | Chiron Corporation | Method and apparatus for mimicking protein active sites |
WO1998047089A1 (en) * | 1997-04-11 | 1998-10-22 | California Institute Of Technology | Apparatus and method for automated protein design |
-
2000
- 2000-09-01 EP EP00971083A patent/EP1222603A2/en not_active Withdrawn
- 2000-09-01 AU AU80372/00A patent/AU8037200A/en not_active Abandoned
- 2000-09-01 WO PCT/US2000/040805 patent/WO2001016862A2/en not_active Application Discontinuation
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5680331A (en) * | 1992-10-05 | 1997-10-21 | Chiron Corporation | Method and apparatus for mimicking protein active sites |
WO1998047089A1 (en) * | 1997-04-11 | 1998-10-22 | California Institute Of Technology | Apparatus and method for automated protein design |
Non-Patent Citations (3)
Title |
---|
D GORDON AND S L MAYO: "Branch-and-Terminate: a combinatorial optimization algorithm for protein design" STRUCTURE WITH FOLDING & DESIGN, vol. 7, no. 9, 15 October 1999 (1999-10-15), pages 1089-1098, XP001028197 * |
KLEPEIS JL ET AL: "Protein Folding and Peptide Docking: A Molecular Modeling and Global Optimization Approach" COMPUTERS & CHEMICAL ENGINEERING, vol. 22, 24 - 27 May 1998, pages S3-S10, XP001027996 UK * |
LATHROP R H AND SMITH T F: "A Branch-and-Bound Algorithm for Optimal Protein Threading with Pairwise (Contact Potential) Amino Acid Interactions" PROCEEDINGS OF THE TWENTY-SEVENTH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, vol. V: Biotechnology Computing, 4 - 7 January 1994, pages 365-374, XP001027999 HI, USA * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001090960A2 (en) * | 2000-05-24 | 2001-11-29 | California Institute Of Technology | Methods for protein design |
WO2001090960A3 (en) * | 2000-05-24 | 2003-03-27 | California Inst Of Techn | Methods for protein design |
Also Published As
Publication number | Publication date |
---|---|
WO2001016862A3 (en) | 2002-01-03 |
EP1222603A2 (en) | 2002-07-17 |
AU8037200A (en) | 2001-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gordon et al. | Branch-and-terminate: a combinatorial optimization algorithm for protein design | |
US6708120B1 (en) | Apparatus and method for automated protein design | |
Looger et al. | Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable: implications for protein design and structural genomics | |
Guerois et al. | The SH3-fold family: experimental evidence and prediction of variations in the folding pathways | |
Voigt et al. | Trading accuracy for speed: A quantitative comparison of search algorithms in protein sequence design | |
Hobohm et al. | A sequence property approach to searching protein databases | |
EP1255209A2 (en) | Apparatus and method for automated protein design | |
Kraemer-Pecore et al. | Computational protein design | |
Abagyan et al. | Ab InitioFolding of peptides by the optimal-Bias Monte Carlo minimization procedure | |
Rose | Reframing the protein folding problem: Entropy as organizer | |
WO2001016810A2 (en) | A computer-based method for macromolecular engineering and design | |
WO2001016862A2 (en) | Methods and compositions utilizing a branch and terminate algorithm for protein design | |
US20030049680A1 (en) | Methods and compositions utilizing hybrid exact rotamer optimization algorithms for protein design | |
US20020052004A1 (en) | Methods and compositions utilizing hybrid exact rotamer optimization algorithms for protein design | |
AU2005211654B2 (en) | Apparatus and method for automated protein design | |
AU2002302138B2 (en) | Apparatus and method for automated protein design | |
Eskow et al. | An optimization approach to the problem of protein structure prediction | |
Kingsford | Computational approaches to problems in protein structure and function | |
Schuster et al. | Sequence redundancy in biopolymers: A study on RNA and protein structures | |
Cootes et al. | Automated Protein Design and Sequence Optimisation Scoring Functions and the Search Problem | |
Tariman | Genetic algorithms for stochastic context-free grammar parameter estimation | |
Verma | Development and application of a free energy force field for all atom protein folding | |
HARAUZ | Pattern Recognition and Artificial Intelligence ES Gelsema and LN Kanal (Editors)© Elsevier Science Publishers BV (North-Holland), 1988 437 | |
Šali et al. | Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research Mission Bay Genentech Hall 600 16th Street, Suite N472D University of California, San Francisco | |
HARAUZ | Pattern recognition and artificial intelligence in molecular biology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2000971083 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2000971083 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2000971083 Country of ref document: EP |
|
NENP | Non-entry into the national phase in: |
Ref country code: JP |