WO1998018814A1

WO1998018814A1 - Nucleic acid-level analysis of protein structure

Info

Publication number: WO1998018814A1
Application number: PCT/US1997/019673
Authority: WO
Inventors: David Halitsky; Jacques R. Fresco
Original assignee: Cumulative Inquiry, Inc.
Priority date: 1996-10-28
Filing date: 1997-10-27
Publication date: 1998-05-07

Abstract

Methods for analyzing protein structure are disclosed. The methods of the invention permit identification of polypeptides which have structural homology to a known polypeptide, but have little sequence homology to the known polypeptide. The methods of the invention are useful for designing novel proteins having desired structural or functional characteristics.

Description

NUCLEIC ACID-LEVEL ANALYSIS OF PROTEIN STRUCTURE

Background of the Invention

The invention relates to methods of evaluating, altering, and designing protein structures.

Summary of the Invention

Methods of the invention incorporate considerations of mRNA sequence and structure and codon-anticodon energetics into the analysis and design of protein structure. Many prior art methods for analyzing or designing protein structure have relied in part or in whole on analysis at the amino acid level. Proteins, however, are the product of a process which involves a number of cellular entities and their interactions.

The interaction of mRNA molecules with the protein translation machinery (e.g., ribosomes and tRNAs) and with other elements of the cellular environment (e.g., water and salt molecules), together with mRNA intrachain interactions, places physico-chemical restraints on the overall process.

Not only are proteins the product of a process, but the process itself has evolved over time. Some constraints, e.g., those imposed by the interaction of mRNAs with environmental elements or with primitive ribosomal structures, or those of mRNA structure and energetics, e.g., the propensity to form secondary structure, may have been more important, or at least different, primordially than they are currently. While not wishing to be bound by theory, the inventors postulate that evidence of those prior constraints may be seen in the sequence of current messages.

Methods of the invention provide for the analysis and design of protein structures on the basis of patterns or features of the nucleic acid message, e.g., codon usage patterns or coding modalities. Methods of the invention are based on dividing the genetic code, that is the codon-anticodon pairs which specify amino acids (and stops), into classes, sometimes referred to herein as subcodes or coding modalities, and evaluating a nucleic acid sequence which encodes a protein structure based on its class (i.e., subcode or coding modality). Relevant subcodes or coding modalities can be defined using choice parameters which are a function of message-level properties, wherein each property is related to the composition or structure of the nucleic acid, and is other than the identity of the amino acid (or stop) encoded and other than codon bias. Examples of structural choice parameters, which can serve as methods or rules for assignment of codons into classes, include the nature of the substituents on the coding bases (e.g., so-called keto-rich bases U and G or amino-rich bases A and C), size of the coding bases (e.g., purine vs. pyrimidine), hydrogen-bonding and base-stacking energies of the coding bases in overlapping base pairs, and the like. Examples of compositional choice parameters include frequencies of subclasses of codons within more than one of the three alternative reading frames in which a nucleic acid message can be read. Alternative subcodes or coding modalities are not necessarily entirely disjointed, discrete, or unique, and identical subcodes or coding modalities can be obtained using structural and/or compositional parameters.
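By way of illustration only, the assignment of triplets to classes under two structural choice parameters can be sketched as follows. The sketch assumes a "majority" rule (a triplet is "rich" in a base group when at least two of its three bases belong to that group); that rule, the function names, and the class labels are assumptions made for the example and are not taken from the specification.

```python
PURINES = set("AG")   # larger, two-ring bases (R); U and C are pyrimidines (Y)
KETO = set("UG")      # keto-rich bases per the text; A and C are amino-rich


def majority(triplet, group):
    """True when at least two of the three bases belong to `group` (assumed rule)."""
    return sum(base in group for base in triplet) >= 2


def classify(triplet):
    """Apply two binary choice parameters to one triplet, giving 2**2 = 4 classes."""
    size = "R" if majority(triplet, PURINES) else "Y"
    substituent = "keto" if majority(triplet, KETO) else "amino"
    return size + "-" + substituent


def classify_message(seq):
    """Classify each in-frame triplet of a message; DNA input is read as RNA."""
    rna = seq.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3   # ignore a trailing partial triplet
    return [classify(rna[i:i + 3]) for i in range(0, usable, 3)]


print(classify_message("GGUCCAAACUUC"))
```

Each additional, non-degenerate binary choice parameter doubles the number of possible classes, which is why two parameters give four labels here.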

Methods of the invention allow the identification, analysis, modification and design of protein structures on the basis of patterns or features revealed by the nucleic acid, e.g., the messenger nucleic acid. For example, the identification of a "run" of amino acid residues of a class can be indicative of an evolutionarily conserved region. The identification of a "minority" class codon in a run of majority class codons can be indicative of a structure- or function-critical residue. The discovery of a critical residue can be used in the design or modification of a protein, e.g., to develop a second generation protein. For example, in situations where it is desirable to alter structure or activity of a protein, it may be desirable to alter a critical residue(s) (or a residue which interacts with a critical residue, e.g., an adjacent residue or a residue elsewhere in the protein (or in another protein) with which it interacts). In the case where a change which does not result in significant alterations in structure or activity is desired, residues other than the identified critical residue (or other than residues which interact with it) are changed.

Methods of the invention provide for nearest neighbor frequencies calculated based upon the frequency or pattern of selected classes of codons, i.e., by codon class of the amino acid, and thus provide a higher degree of relevance for analysis of single-class-rich protein structures. Conventional tables of nearest neighbor amino acids do not take into account the classes described herein, and as such, provide only "average" values across multiple classes of codons. Also, unlike tables of the invention, conventional nearest-neighbor tables do not take into account the fact that consistent secondary/tertiary structures of proteins can be shown to correlate with: a) "out of frame" properties of protein messages; b) "interframe" properties of protein messages, i.e., correlations between properties of messages read in frame 1, properties of messages read in frame 2, and/or properties of messages read in frame 3, as defined below.

In general, the invention features a method of evaluating protein structure. The method includes: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning one or a plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying triplets of the nucleic acid sequence as members of a class of a binary choice alphabet of n degrees of freedom, and wherein the classes can be generated by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence, thereby evaluating the protein structure. Triplets can be assigned to a class based on whether they satisfy a value for the message-level property, e.g., a triplet can be assigned to a class based on whether its value for a parameter is above or below a predefined value, e.g., enthalpy for formation of a codon-anticodon duplex, or whether or not it possesses a particular characteristic, e.g., whether it is GC rich. The message-level property is other than the identity of the amino acid or punctuation which a triplet encodes and is other than codon bias.

The class constant table provides a measure of the frequency with which a first and a second amino acid occur as nearest neighbors and wherein nearest neighbor frequencies are determined within a codon class, and wherein a class is a function of a message level property of a nucleic acid, e.g., the codon, which encodes an amino acid. The class can be any class generated by the binary choice parameter-based methods referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons and a second class, e.g., low enthalpy codons, the table is generated for nearest neighbors where both neighbors are encoded by codons of either the first class or codons of the second class.

In another aspect, the invention features a method of evaluating a protein structure. The method includes: providing a class-constant table of nearest neighbor relationships for amino acid residues; providing a nucleic acid which encodes a protein structure; and comparing one or a plurality of the observed nearest neighbor pairs in the protein structure with the frequencies provided by the class constant table, thereby evaluating the protein structure. In preferred embodiments, the comparison can include: assigning an expected frequency from the class constant table to one or a plurality of the observed nearest neighbor pairs and determining how many of the observed nearest neighbor pairs fall above or below a predetermined value; determining the likelihood of occurrence, as predicted by the class constant table, for an observed nearest neighbor pair; or determining if an observed nearest neighbor pair of a first and a second amino acid residue from the protein structure is predicted by the class constant table to occur at a predetermined frequency.
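One possible way of building such a table is sketched below, for illustration only. The sketch assumes the standard genetic code, uses a purine-rich versus pyrimidine-rich split as a stand-in for whichever classes are actually chosen, and counts only those adjacent codon pairs in which both codons fall in the same class. The function names and the two-of-three "rich" rule are assumptions of the example, not features recited in the specification.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated UUU, UUC, UUA, UUG, UCU, ...
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_class(codon):
    """Stand-in binary class: 'R' if at least two bases are purines, else 'Y'."""
    return "R" if sum(b in "AG" for b in codon) >= 2 else "Y"


def class_constant_table(messages):
    """Relative frequency of amino acid neighbor pairs, tallied within each class.

    Assumes every message is in frame and contains only A, C, G, U (or T).
    """
    counts = defaultdict(Counter)
    for msg in messages:
        rna = msg.upper().replace("T", "U")
        usable = len(rna) - len(rna) % 3
        codons = [rna[i:i + 3] for i in range(0, usable, 3)]
        for first, second in zip(codons, codons[1:]):
            cls = codon_class(first)
            if cls != codon_class(second):
                continue  # the pair spans two classes, so it is not class-constant
            counts[cls][(CODE[first], CODE[second])] += 1
    table = {}
    for cls, pairs in counts.items():
        total = sum(pairs.values())
        table[cls] = {pair: n / total for pair, n in pairs.items()}
    return table


print(class_constant_table(["AAAGAAAAAUUUUUC"]))
```

An observed neighbor pair from a protein of interest can then be looked up under the class of its two codons and compared against a predetermined frequency.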

In another aspect, the invention features a method of evaluating a protein structure for resistance to change, e.g., evolutionary or mutational change. The method includes: identifying regions of a protein which are encoded by runs of a single subcode, thereby identifying regions which have been resistant to change and which are therefore predicted to be functionally or structurally significant. E.g., the method can include determining if the nucleic acid sequence which encodes the protein structure includes a run of triplets, e.g., a run of at least 20, 40, 60, or 120 triplets in length, in which at least 20, 40, 60, 80, 90 or 95%, or all, of the triplets in the run are from one class. Any of the ways of generating classes described herein can be used in this method.
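The run criterion described above can be sketched, by way of example only, as a sliding window over a list of per-triplet class labels. The sketch assumes the labels have already been assigned by some classification; the function name, the merging of overlapping windows into one span, and the default thresholds (20 triplets, 90%) are choices made for the example.

```python
def single_class_runs(classes, target, min_len=20, min_fraction=0.9):
    """Return (start, end) spans of triplet indices, end exclusive.

    A span is the union of overlapping windows of `min_len` triplets in which
    at least `min_fraction` of the triplets belong to class `target`.
    """
    n = len(classes)
    if n < min_len:
        return []
    hits = [1 if c == target else 0 for c in classes]
    in_window = sum(hits[:min_len])
    spans = []
    for start in range(n - min_len + 1):
        if start > 0:
            # slide the window one triplet to the right
            in_window += hits[start + min_len - 1] - hits[start - 1]
        if in_window >= min_fraction * min_len:
            end = start + min_len
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)  # merge with the previous span
            else:
                spans.append((start, end))
    return spans


labels = ["Y"] * 5 + ["R"] * 25 + ["Y"] * 5
print(single_class_runs(labels, "R"))
```

Because a window may tolerate a small share of other-class triplets, a reported span can extend slightly past the strictly single-class stretch.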

In another aspect, the invention includes a method of evaluating a protein structure for the presence of critical amino acid residues. The method includes: identifying critical amino acid residues by identifying "minority codons" in runs encoded by codons of a single class or subcode, thereby identifying residues which have been resistant to change and which are therefore believed to be functionally important. Any of the ways of generating classes described herein can be used in this method.
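As a minimal sketch, and assuming a run has already been located as a span of triplet indices over a list of class labels, the minority codons within that run could be flagged as follows. The function name and the use of the most frequent label as the "majority" class are assumptions of the example.

```python
from collections import Counter


def minority_positions(classes, start, end):
    """Return the majority class of classes[start:end] and the triplet indices
    within that span whose class differs from it. The span must be non-empty."""
    window = classes[start:end]
    majority, _ = Counter(window).most_common(1)[0]
    return majority, [start + i for i, c in enumerate(window) if c != majority]


labels = list("RRRRYRRRRR")
print(minority_positions(labels, 0, 10))
```

For an in-frame message, triplet index i corresponds to residue i + 1 of the encoded protein, so the flagged indices map directly onto candidate critical residues.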

In another aspect, the invention features a method for evaluating a protein structure. The method includes: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning at least one of the subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets, the at least four classes of triplets being represented in at least a portion of the nucleic acid sequence in a ratio of about 3:5:3:5; thereby evaluating the protein structure.

In another aspect, the invention features a method for identifying coding regions of a nucleic acid sequence, the method comprising: providing the nucleic acid sequence; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets; assigning the plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets A, B, C, and D; determining whether the plurality of subject triplets are distributed into the at least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5; thereby identifying coding regions of the nucleic acid sequence.
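For illustration, the ratio test can be sketched as a comparison of observed class shares against 3/16, 5/16, 3/16 and 5/16. The specification does not quantify "about", so the tolerance below is a hypothetical parameter, as are the function name and the use of the literal labels "A" to "D".

```python
from collections import Counter

# Expected shares implied by a 3:5:3:5 ratio over classes A, B, C, D
EXPECTED = {"A": 3 / 16, "B": 5 / 16, "C": 3 / 16, "D": 5 / 16}


def matches_ratio(classes, tolerance=0.05):
    """True when each class share is within `tolerance` of its expected share.

    `tolerance` is an assumed stand-in for "about"; it is not from the source.
    """
    counts = Counter(classes)
    total = sum(counts[k] for k in EXPECTED)
    if total == 0:
        return False
    return all(
        abs(counts[k] / total - share) <= tolerance
        for k, share in EXPECTED.items()
    )


print(matches_ratio(list("AAABBBBBCCCDDDDD")))
```

Applied window by window along a sequence, a test of this kind would mark the windows whose class distribution is consistent with the stated ratio.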

In another aspect, the invention features a method for identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of a test protein, the method comprising: providing a nucleic acid sequence which encodes all or a portion of the test protein; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame; assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of freedom by applying n first binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees of freedom by applying n second binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; and identifying a protein which includes a polypeptide portion encoded by the plurality of triplets in the second reading frame; thereby identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of the test protein.

In another aspect, the invention features a method for identifying a mutation-prone region of a nucleic acid sequence, e.g., a viral nucleic acid sequence. The method includes: providing the nucleic acid sequence; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame; assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; and assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; thereby identifying a mutation-prone region of the nucleic acid sequence.

In another aspect, the invention includes a method of providing a protein structure, e.g., the structure of a protein of known function, in which one or a plurality of amino acid residues are changed. The method includes: providing a nucleic acid sequence which encodes a candidate protein structure; evaluating the sequence by a method described herein; and altering one or a plurality of amino acid residues in the candidate protein structure, thereby providing a protein structure.

In yet another aspect, the invention features a machine-readable data storage medium, including a data storage material encoded with machine-readable data which, when used with a machine programmed with instructions for using the data, is capable of storing, retrieving, or displaying databases, binary choice alphabets, protein sequences, or nucleic acid sequences of the invention. The storage medium can be used in methods of the invention. In preferred embodiments the storage medium is recorded with: a class constant nearest neighbor table; the classes into which the triplets of a nucleic acid are assigned; or a nucleic acid sequence which encodes a protein structure, or a protein structure, which is to be analyzed or which has been altered by application of a method described herein.

Methods referred to herein can further include creating a record of one or more protein structures to be analyzed or modified, e.g., proteins, protein portions or fragments, or nucleic acids which encode all or part of such protein structure. The protein or nucleic acid structure which is to be analyzed or modified, or the structure which has been identified, evaluated or modified, or both, can be recorded. The record can be encoded in the form of a machine-readable data storage medium. The recorded structure, e.g., a nucleic acid or amino acid sequence, can be displayed on a machine, e.g., on a monitor, or in printed form.

Methods referred to herein can further include providing an identified or modified substance, e.g., a protein or nucleic acid, e.g., chemically synthesizing the identified substance based on the structure identified by way of the methods described herein. In preferred embodiments, the method includes assessing the biological activity of the identified substance. The biological activity of the identified substance can be assessed in vitro or in vivo. In preferred embodiments, the identified substance can be combined with a carrier suitable for introduction into any living cell or organism, e.g., an animal model, e.g., naturally derived or synthetic polymers, solvents, dispersion media, coatings, antibacterial and antifungal agents and the like.

Methods referred to herein can further include providing a three dimensional representation of the protein structure, or a representation of the primary sequence of the protein structure, either before or after a modification. The structure can be compared to the candidate structure or can be evaluated for the ability to exhibit a predetermined structure, e.g., possession of a structural component such as a helix or a turn segment, or an activity, e.g., the ability to dock with a second protein.

In methods referred to herein the nucleic acid sequence can be any of: a genomic sequence; an mRNA sequence; a sequence which encodes a protein structure of known function; a sequence for which the reading frame, if it exists, is known; a sequence for which the reading frame, if it exists, is unknown; a sequence which includes a coding portion; a sequence which includes a non-coding portion; or a sequence from a multiprotein data base.

Methods of the invention allow a wide variety of information to be extracted from nucleic acid sequences and allow a wide variety of useful manipulations, e.g., the identification of useful protein structures and the design of improved or altered function protein structures. These include, but are not limited to: providing a protein structure encoded by codons of a first subcode which has a predetermined property of a protein structure encoded by codons of a second class or subcode. This allows: provision of a protein structure having a novel amino acid sequence but which has a desired property, e.g., secondary structure, of a known protein; provision of protein structure with improved or altered function; identifying regions of proteins which are encoded by runs of codons of a single class or subcode, thereby identifying regions which have been resistant to evolutionary or mutational change and which may therefore be functionally important; identifying a critical amino acid residue(s) in a protein structure by identifying "minority codons" in runs encoded by codons of a single class or subcode, thereby identifying amino acid replacements which, although disfavored at the mRNA level, exhibit sufficiently favored characteristics at the protein level that they have been maintained and may therefore be functionally important; determination of nearest neighbor relationships based upon nearest neighbors encoded by codons drawn from the same class or subcode; distinguishing a coding region from a non-coding region by determining whether the region obeys nearest neighbor relationships involving codons drawn from the same class or subcode; assignment of function (or structure) to a protein or polypeptide of unknown structure by recognizing codon patterns in message-level nucleic acid which encodes the protein or polypeptide structure of unknown function (e.g., the protein or polypeptide is encoded in a first subcode) similar to codon patterns in message-level nucleic acid which encodes the structure of known function (but different primary sequence) (e.g., which is encoded by a second subcode).

DEFINITIONS

As used herein, "protein structure" refers to a structure of at least two amino acids linked by a peptide bond. A protein structure can include an entire protein, or a part thereof. For example, a protein structure can include a domain or other region having a characteristic structural, chemical, or biological property. Examples of structural elements include helices, turns, sheets, helix-turn structure; tertiary amino acid structure; and the like. Examples of chemical properties include net charge, side chain bulk, side chain charge, acidity, nucleophilicity, hydrophobicity, and the like. Examples of biological properties include catalytic activity, promoter or suppressor activity, ability to bind to or interact with a second molecule such as DNA, RNA, a protein, a metal atom, immunological activity, and the like. Examples of known domains which can be included in protein structures include: zinc fingers, binding regions, and the like. A protein structural element can be from a naturally occurring protein or can be a non-naturally occurring (e.g., a novel) construct. The protein structure can be of a predetermined length. In preferred embodiments it is at least 8, 16, 32, 64 or 128 amino acids in length.

As used herein, a predetermined property is a property other than the sequence of amino acids, and can include one or more of the following: (1) three dimensional structure, e.g., secondary structure, tertiary structure, or quaternary structure; (2) a charge-related property, e.g., due to positively or negatively charged side chain residues, including, but not limited to: the presence of a predetermined charge at a predetermined location in the sequence, the net charge on a protein or polypeptide, and the like; (3) hydrophobicity, e.g., due to the presence of water-insoluble side-chain residues; (4) an activity associated with an intramolecular interaction or an intermolecular interaction. Intermolecular interactions include binding activity, catalytic activity, and the like.

An "amino acid alphabet," as used herein, refers to a group of codons which encode amino acids or stop codons.

As used herein, a "binary choice" amino acid alphabet of n degrees of freedom, refers to an amino acid alphabet which is structured into 2ⁿ subcodes, by the application of binary choices dictated by n choice parameters, and where a choice parameter is a function of nucleic acid sequence and/or codon patterns of the nucleic acid (e.g., an mRNA).

A "binary choice parameter" or "opposition," as used herein, refers to a parameter by which a polynucleotide codon or triplet can be assigned one of two values. The assigned values allow the triplets to be assigned to classes. It will be appreciated that application of more than one non-degenerate binary choice parameter can divide triplets into more than two classes. The division into classes can be based on a predetermined value. E.g., all triplets with a value less than the predetermined value are in one class and all with values above the predetermined value are in a second class, or all triplets having a predetermined characteristic, e.g., being pyrimidine-rich, are in a first class and all triplets being pyrimidine-poor are in a second class.

The term "coding modality," as used herein, refers to a pattern of codon usage in a nucleic acid message, e.g., the frequency that one or more codons appears in a nucleic acid sequence, the relative frequency that one or more codons appears in two or more reading frames of a nucleic acid message, and the like.

A "triplet", as used herein, refers to three contiguous (sequential) nucleic acid residues (e.g., read in the 5'-3' direction along the nucleic acid strand). A triplet can be a codon (e.g., when a coding nucleic acid sequence is read in the coding frame) or can be a non-reading frame triplet or non-coding triplet.

A leading triplet, as used herein, refers to a triplet which is 5' to the most 3' base in the subject triplet. Thus, in a sequence 12345, the leading triplet is 123.

A final triplet, as used herein, refers to a triplet which is 3' to the most 5' base in a subject triplet. Thus, in a sequence 12345, the final triplet is 345.

A class of triplets, as used herein, refers to all triplets which fall within a particular subgroup of triplets under a selected binary choice alphabet.

A message-level property, as used herein, refers to a property of a nucleic acid (e.g., mRNA) of three or more bases in length, which property is other than the identity of or physical or chemical property of an amino acid (or punctuation) encoded by the nucleic acid (wherein such physical and chemical characteristics include, e.g., size, hydrophobicity, hydrophilicity), and is other than codon-bias. Structural message-level properties include physical and energetic properties of the nucleic acid. Examples include: UA-rich triplets vs. CG-rich triplets; UG-rich triplets vs. AC-rich triplets; purine-rich ("R-rich") triplets vs. pyrimidine-rich ("Y-rich") triplets; assigning a plurality of codons in said sequence to (1) either a Y-rich subcode or an R-rich subcode and (2) to either an E-rich (UG-rich) subcode or an M-rich (AC-rich) subcode. Compositional message-level properties include frequencies of particular codon groups in one or more reading frames of a message.

The term "reading frame" is known in the art and refers to a frame for reading, e.g., translating, a nucleic acid message. For example, a sequence of nucleotides 123456789 can be read in three reading frames (e.g., in groups of three nucleotides, each triplet being a codon): Reading Frame 1: 123 456 789; Reading Frame 2: 234 567; or Reading Frame 3: 345 678.

"Evaluating protein structure," as used herein, refers to determining properties of a protein or polypeptide. For example, evaluating protein structure includes: determining the three-dimensional structure of a protein or polypeptide; comparing the three-dimensional structure of a known protein or polypeptide with that of an unknown protein or polypeptide; determining the function of a protein or polypeptide; comparing the function of a known protein or polypeptide with that of an unknown protein or polypeptide; and the like.
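The three reading frames of the 123456789 example can be reproduced with a short sketch such as the following. The function name and the 1-based frame numbering in the signature are choices made for the example.

```python
def triplets_in_frame(seq, frame):
    """Split `seq` into complete triplets read in reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    offset = frame - 1                      # bases skipped before the first triplet
    usable = (len(seq) - offset) // 3 * 3   # drop any trailing partial triplet
    return [seq[i:i + 3] for i in range(offset, offset + usable, 3)]


for frame in (1, 2, 3):
    print(frame, triplets_in_frame("123456789", frame))
```

Frames 2 and 3 each yield one fewer complete triplet than frame 1 for this nine-base example, matching the listing in the definition.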

Other features and advantages of the invention will be apparent from the following detailed description, and from the claims.

Brief Description of the Drawings

Figure 1 schematically depicts alternate reading frames for a nucleic acid message.

Figure 2 depicts the distinction between "wildcard" and "constant" codon doublets.

Figure 3 shows the 64 codons divided into four groups based on the "wildcard" and "constant" distinction and the leading base of the codon.

Figure 4 shows the frequencies of codons in the groups of Figure 1 in a test mRNA database.

Detailed Description

In general, the invention features a method of evaluating protein structure. The method includes: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning one or a plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying triplets, e.g., a subject triplet or a leading and following triplet of the subject triplet, of the nucleic acid sequence as members of a class of a binary choice alphabet of n degrees of freedom, and wherein the classes can be generated by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence, thereby evaluating the protein structure. Triplets can be assigned to a class based on whether they satisfy a value for the message-level property, e.g., a triplet can be assigned to a class based on whether its value for a parameter is above or below a predefined value, or whether or not it possesses a particular characteristic, e.g., whether it is GC rich.

The message-level property is other than the identity of the amino acid or punctuation which a triplet encodes and is other than codon bias.

In preferred embodiments the method includes making a record, e.g., on a machine readable medium, of the class assigned to one or more triplets.

In preferred embodiments: n is chosen from the integers 1, 2, 3, and 4.

In preferred embodiments the message-level property is a function of a physical or chemical property of one or more bases of a nucleic acid; or is a function of a physical or chemical property which affects the tendency of a nucleic acid to form secondary structure.

In preferred embodiments triplets are assigned to a first and a second class: the first class having the property that a message made of triplets drawn exclusively from the first class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets, and the second class having the property that a message made of triplets drawn exclusively from the second class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets.

In preferred embodiments the message-level property is: a function of the UA content of a subject triplet; a function of the GC content of a subject triplet; a function of the size or molecular weight of a triplet; a function of whether the triplet is keto rich or amino rich; a function of whether the triplet is purine rich or pyrimidine rich; or a function of the enthalpy of the interaction between the triplet and a fully or partially complementary nucleic acid.

In preferred embodiments: the binary choice parameter is applied to the subject triplet, e.g., applied to the codon which encodes an amino acid, to place a subject triplet in a class.

In preferred embodiments: the class into which a subject triplet is assigned is a function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a function of the application of a binary choice parameter to a first set of contiguous bases which includes all or a subset of the bases of the subject triplet, e.g., bases 4 and 5, and of the application of a binary choice parameter to a second, different, set of contiguous bases which includes all or a subset of the bases of the subject triplet, e.g., bases 5 and 6; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of triplets to a second class, as a function of subject triplet value.

In preferred embodiments: the class into which a subject triplet is assigned is a function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a function of the application of a binary choice parameter to a first subset of the bases of the subject triplet, e.g., 4 and 5, and of the application of a binary choice parameter to a second, different, subset of the bases of the subject triplet

(1) providing a value for a subject triplet of bases 456, wherein the value is a function of (S¹ + S²)/2, wherein S¹ is a function of the application of a binary choice parameter (e.g., the value for enthalpy of anticodon-codon formation above or below a predetermined value) to a first subset of the bases of the subject triplet, e.g., bases 4 and 5 of the subject triplet, and S² is a function of the application of a binary choice parameter to a second, different, subset of the bases of the subject triplet, e.g., bases 5 and 6 of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of triplets to a second class.

In preferred embodiments: the class into which a subject triplet is assigned is a function of the application of a binary choice parameter to one or both of a leading triplet or a final triplet of the subject triplet.

In preferred embodiments: the class into which a subject triplet is assigned is a function of: (1) providing a value, e.g., enthalpy, of a triplet of bases 456, wherein the value is a function of (S¹ + S²)/2, wherein S¹ is the value, e.g., enthalpy, of the base pair doublet 45 of the subject triplet, and S² is the value, e.g., enthalpy, of the base pair doublet 56 of the subject triplet; and (2) assigning a plurality of subject triplets to a first class, e.g., a low enthalpy class, and a plurality of triplets to a second class, e.g., a high enthalpy class.
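The (S¹ + S²)/2 calculation can be illustrated with a minimal sketch, assuming a caller-supplied table of doublet values. The function names, the threshold, and the numbers in `example_doublet_values` are hypothetical placeholders chosen for illustration; they are not values from Table IA.

```python
def triplet_value(triplet, doublet_values):
    """Return (S1 + S2) / 2 for a triplet of bases 456.

    S1 is the value of doublet 45 and S2 the value of doublet 56.
    """
    if len(triplet) != 3:
        raise ValueError("a triplet must contain exactly three bases")
    s1 = doublet_values[triplet[0:2]]  # doublet 45
    s2 = doublet_values[triplet[1:3]]  # doublet 56
    return (s1 + s2) / 2


def assign_class(triplet, doublet_values, threshold):
    """Binary choice: 'high' if the triplet value exceeds threshold, else 'low'."""
    return "high" if triplet_value(triplet, doublet_values) > threshold else "low"


# Hypothetical placeholder numbers, for illustration only.
example_doublet_values = {"GC": 4.0, "CA": 2.0, "AU": 1.0, "UA": 1.0}

print(triplet_value("GCA", example_doublet_values))       # (4.0 + 2.0) / 2
print(assign_class("GCA", example_doublet_values, 2.0))   # high
print(assign_class("AUA", example_doublet_values, 2.0))   # low
```

Any two-valued rule could replace the threshold comparison; the threshold form is used here only because the text mentions a value above or below a predetermined value.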

In preferred embodiments: a subject triplet 456 of a nucleic acid sequence of bases 123 456 789 is assigned into a class as a function of: (1) performing one or more of (i), (ii), and (iii):

(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or more of triplet 123, 234, or 345, to yield a leading value;

(ii) applying a binary choice parameter to 456, to provide a center value;

(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one or more of triplet 567, 678, or 789, to yield a following value;

(2) assigning one or a plurality of subject triplets 456 into a class based on the values determined in one or more of (i), (ii), and (iii), thereby assigning one or a plurality of subject triplets into classes.
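Steps (i) to (iii) can be sketched as follows. This is an illustrative sketch only: `neighbour_values` and `gc_rich` are hypothetical names, and the GC-rich rule (at least two G or C bases) is an assumed example of a binary choice parameter, not one prescribed by the text.

```python
def neighbour_values(seq, start, parameter):
    """Apply `parameter` to the triplets around the subject triplet.

    `seq` is a string of bases, `start` the 0-based index of the first base
    of the subject triplet, and `parameter` any function mapping a triplet
    to one of two outcomes.
    """
    def window(i):
        return seq[i:i + 3] if 0 <= i and i + 3 <= len(seq) else None

    leading = [window(start - k) for k in (3, 2, 1)]    # 123, 234, 345
    following = [window(start + k) for k in (1, 2, 3)]  # 567, 678, 789
    return {
        "leading": [parameter(t) for t in leading if t],
        "center": parameter(window(start)),
        "following": [parameter(t) for t in following if t],
    }


def gc_rich(triplet):
    """Hypothetical binary choice parameter: at least two G or C bases."""
    return sum(base in "GC" for base in triplet) >= 2


# Bases 123 456 789: the subject triplet 456 starts at index 3.
result = neighbour_values("AUGGCCUUA", 3, gc_rich)
print(result)
# {'leading': [False, True, True], 'center': True, 'following': [True, False, False]}
```

How the leading, center, and following values are combined into a single class is left open here, as it is in the text.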

In preferred embodiments: the class into which a subject triplet is assigned is a function of the application of a first binary choice parameter to a leading triplet and a second binary choice parameter to a following triplet of a subject triplet.

In preferred embodiments: the evaluation includes determining if the nucleic acid sequence includes a run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the triplets in the run are from a first class. The method allows for evaluating a protein structure for resistance to change, e.g., evolutionary or mutational change, by identifying regions of the protein structure which are encoded by a run of a single class or subcode, thereby identifying regions which have been resistant to change and which are therefore predicted to be functionally or structurally significant. In preferred embodiments a codon, preferably within the run, is changed so as to alter the sequence of the encoded amino acid to provide an altered sequence.
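A run search of this kind might be sketched with a sliding window over the class labels of successive triplets. The function name and the short demonstration sequence are hypothetical; run lengths in practice would be those recited above (e.g., 20 triplets or more).

```python
def find_runs(classes, target, run_length, min_fraction):
    """Return start indices of windows of `run_length` triplets in which at
    least `min_fraction` of the triplets belong to class `target`."""
    hits = []
    if run_length <= 0 or run_length > len(classes):
        return hits
    count = sum(c == target for c in classes[:run_length])
    for start in range(len(classes) - run_length + 1):
        if start > 0:
            # Slide the window by one triplet.
            count -= classes[start - 1] == target
            count += classes[start + run_length - 1] == target
        if count / run_length >= min_fraction:
            hits.append(start)
    return hits


# Toy data: class labels for 16 consecutive triplets.
labels = ["a"] * 5 + ["b"] + ["a"] * 4 + ["b"] * 6
print(find_runs(labels, "a", run_length=10, min_fraction=0.9))  # [0]
```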

In preferred embodiments: the evaluation comprises identifying a triplet from a first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons in length, in which at least 20, 40, 80, 90 or 95 %, or all, of the codons are from the second class, thereby identifying the triplet of the first class as encoding a critical residue, e.g., a structure or function critical residue. In a preferred embodiment, a codon is changed so as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical residue, or a residue which interacts with the critical residue, and thereby provide an altered sequence.
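One way to sketch the search for a first-class triplet embedded in a second-class run is shown below. The function name and the toy labels are hypothetical, and treating every qualifying window as a "run" is an assumption of this sketch.

```python
def critical_positions(classes, minority, majority, run_length, min_fraction):
    """Return indices of `minority` triplets that lie inside at least one
    window of `run_length` triplets where the share of `majority` triplets
    is at least `min_fraction`."""
    found = set()
    for start in range(len(classes) - run_length + 1):
        window = classes[start:start + run_length]
        share = sum(c == majority for c in window) / run_length
        if share >= min_fraction:
            found.update(
                start + offset
                for offset, c in enumerate(window)
                if c == minority
            )
    return sorted(found)


# Toy data: one first-class ("a") triplet inside a second-class ("b") run.
toy = ["b"] * 4 + ["a"] + ["b"] * 5
print(critical_positions(toy, "a", "b", run_length=10, min_fraction=0.9))  # [4]
```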

In preferred embodiments: the nucleic acid encodes a protein structure of known or unknown function.

In another aspect, the invention features, a class-constant table of nearest neighbor relationships for amino acid residues which provides, for each of a plurality of class constant nearest neighbors, a frequency of occurrence which is a function of the occurrence of the class constant nearest neighbor pair in a collection of protein structures, e.g., a collection of at least 10, 50, 100, or 500 proteins.

In preferred embodiments: the assignment of amino acids into a class is done by assigning a codon which encodes it into a class as a function of classifying triplets, e.g., the subject codon or a leading and following triplet of the subject codon, as a member of a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence.

The table can be recorded on a machine readable medium.

In another aspect, the invention features, a method of evaluating a protein structure. The method includes: providing a class-constant table of nearest neighbor relationships for amino acid residues; providing a nucleic acid which encodes a protein structure; and comparing one or a plurality of the observed nearest neighbor pairs in the protein structure with the frequencies provided by the class constant table, thereby evaluating the protein structure. The class constant table provides a measure of the frequency with which a first and a second amino acid occur as nearest neighbors and wherein nearest neighbor frequencies are determined within a codon class, and wherein a class is a function of a message level property of a nucleic acid, e.g., the codon, which encodes an amino acid. The class can be any class generated by the binary choice parameter-based methods referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons and a second class, e.g., low enthalpy codons, the table is generated for nearest neighbors where both neighbors are encoded by codons of either the first class or codons of the second class.

In preferred embodiments, the comparison can include: assigning an expected frequency from the class constant table to one or a plurality of the observed nearest neighbor pairs and determining how many of the observed nearest neighbor pairs fall above or below a predetermined value; determining the likelihood of occurrence, as predicted by the class constant table, for an observed nearest neighbor pair; or determining if an observed nearest neighbor pair of a first and a second amino acid residue from the protein structure is predicted by the class constant table to occur at a predetermined frequency.
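A class-constant table and a comparison against it could be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: nearest neighbors are taken to be residues adjacent in the primary sequence, each residue is supplied already labelled with its codon class, and frequencies are normalized over all class-constant pairs. All names and the toy data are hypothetical.

```python
from collections import Counter


def class_constant_table(proteins):
    """Build relative frequencies of class-constant nearest neighbor pairs.

    `proteins` is a list of proteins, each a list of (amino_acid, codon_class)
    tuples in sequence order. Only adjacent pairs whose codons share a class
    are counted.
    """
    counts = Counter()
    for residues in proteins:
        for (aa1, cls1), (aa2, cls2) in zip(residues, residues[1:]):
            if cls1 == cls2:
                counts[(cls1, aa1, aa2)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def rare_pairs(residues, table, cutoff):
    """Return indices i where the class-constant pair (i, i+1) has a table
    frequency below `cutoff` (pairs absent from the table count as 0)."""
    rare = []
    for i, ((aa1, cls1), (aa2, cls2)) in enumerate(zip(residues, residues[1:])):
        if cls1 == cls2 and table.get((cls1, aa1, aa2), 0.0) < cutoff:
            rare.append(i)
    return rare


# Toy collection of two short "proteins".
training = [
    [("A", "high"), ("G", "high"), ("L", "low"), ("K", "low")],
    [("A", "high"), ("G", "high"), ("S", "high")],
]
table = class_constant_table(training)
query = [("A", "high"), ("G", "high"), ("S", "high"), ("W", "high")]
print(rare_pairs(query, table, cutoff=0.3))  # [1, 2]
```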

In preferred embodiments the method includes making a record of observed class constant nearest neighbors in the protein structure on a machine-readable medium.

In preferred embodiments: the method further includes determining if an observed nearest neighbor of the protein structure is that predicted, at a predetermined frequency, by the table, thereby evaluating the protein structure. In preferred embodiments the method can be used to identify coding regions in a nucleic acid sequence. A coding region can be identified by comparing observed nearest neighbors in the protein structure with a class constant nearest neighbor table, the presence of observed pairs which correspond to predicted pairs in the table being predictive of a coding region. In a preferred embodiment, a codon in the coding region is changed so as to alter its encoded amino acid.

In preferred embodiments the method can identify structure or function critical residues, the occurrence of a nearest neighbor of low probability being predictive of a critical amino acid residue. In a preferred embodiment, a codon is changed so as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical residue, or a residue which interacts with the critical residue.

In preferred embodiments: the protein structure is from a protein of known or unknown function.

In preferred embodiments: the protein structure is evaluated for the presence of a first nearest neighbor with a predicted occurrence below a predetermined value which is located in a run of residues, wherein at least 20, 40, 80, 90 or 95% of the residues in the run are members of nearest neighbors pairs having an expected frequency from the table of greater than a predetermined value, thereby identifying a critical residue. In a preferred embodiment, a codon is changed so as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical residue, or a residue which interacts with the critical residue.

In preferred embodiments: the nearest neighbor includes or is adjacent to a critical residue.

In another aspect, the invention includes, a machine-readable medium on which is recorded a class-constant nearest neighbor table.

In another aspect, the invention features a method of evaluating a protein structure for resistance to change, e.g., evolutionary or mutational change. The method includes: identifying regions of a protein which are encoded by runs of a single subcode, thereby identifying regions which have been resistant to change and which are therefore predicted to be functionally or structurally significant. E.g., the method can include determining if the nucleic acid sequence which encodes the protein structure includes a run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length (or e.g., 16, 32, 48, 64, 128, or 256 triplets in length), in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the triplets in the run are from one class. Any of the ways of generating classes described herein can be used in this method.

In preferred embodiments: the evaluation comprises identifying a triplet from a first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons in length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the codons are from the second class, thereby identifying the triplet of the first class as encoding a critical residue, e.g., a structure or function critical residue. In a preferred embodiment, a codon is changed so as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical residue, or a residue which interacts with the critical residue.

In another aspect, the invention features, a method for evaluating a protein structure. The method includes: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning at least one of the subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets, the at least four classes of triplets being represented in at least a portion of the nucleic acid sequence in a ratio of about 3:5:3:5; thereby evaluating the protein structure.

In preferred embodiments n is 1, 2, 3, or 4. In preferred embodiments the method includes making a record, e.g., on a machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary choice parameter referred to herein.

In another aspect, the invention features, a method for identifying coding regions of a nucleic acid sequence, the method comprising: providing the nucleic acid sequence; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets; assigning the plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets A, B, C, and D; determining whether the plurality of subject triplets are distributed into the at least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5; thereby identifying coding regions of the nucleic acid sequence.

In preferred embodiments n is 1, 2, 3, or 4.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.

In preferred embodiments the method includes making a record, e.g., on a machine-readable medium, of the class assigned to one or more triplets.

In another aspect, the invention features, a method for identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of a test protein, the method comprising: providing a nucleic acid sequence which encodes all or a portion of the test protein; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame; assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of freedom by applying n first binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees of freedom by applying n second binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; and identifying a protein which includes a polypeptide portion encoded by the plurality of triplets in the second reading frame; thereby identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of the test protein.

In preferred embodiments, for each of the first and second binary choice alphabets, n is 1, 2, 3, or 4. In preferred embodiments n is two.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.

In preferred embodiments the first reading frame is frame 1 and the second reading frame is frame 2 or 3.
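Assorting the same bases into triplets in different reading frames can be sketched as below. The sketch assumes that frame 1 starts at the first base, frame 2 at the second, and frame 3 at the third; the function name and the example sequence are hypothetical.

```python
def triplets_in_frame(seq, frame):
    """Split `seq` into consecutive triplets, starting at base 1, 2 or 3
    for frame 1, 2 or 3. Trailing bases that do not fill a triplet are dropped."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    offset = frame - 1
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


message = "AUGGCCUUAG"
print(triplets_in_frame(message, 1))  # ['AUG', 'GCC', 'UUA']
print(triplets_in_frame(message, 2))  # ['UGG', 'CCU', 'UAG']
print(triplets_in_frame(message, 3))  # ['GGC', 'CUU']
```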

In preferred embodiments, the classes can be generated by application of a binary choice parameter referred to herein. In preferred embodiments the step of identifying a protein which includes a polypeptide portion encoded by the plurality of triplets in the second reading frame comprises reading all or a portion of a protein sequence from a database of protein sequences.

Structural Binary Choice Parameters

Structural binary choice parameters can be selected from a variety of physical or physico-chemical qualities related to the structure of the nucleic acid (polynucleotide) sequence, including the primary or secondary structure of the nucleic acid sequence, the physical or chemical nature of the nucleotide bases, the physical or chemical nature of the codons, and the like. Thus, for example, properties related to the ability of a nucleic acid sequence to form secondary structure, e.g., by hybridization of subsequences of the nucleic acid sequence, can be selected as binary choice parameters. For example, the self-pairing of a nucleic acid sequence could be greater in, e.g., a highly UA (or GC)-rich region of the nucleic acid, while a nucleic acid which is not UA (or GC)-rich would be less prone to self-pairing.

Other exemplary binary choice parameters include the size of the nucleotide bases (e.g., pyrimidine vs. purine), H-bonding qualities due to H-bond donor or acceptor substituents (e.g., amino vs. keto-containing nucleotide bases), and the like.

Binary choice parameters can also be related to selected properties of codons, including the relative enthalpy of codon-anticodon interactions (which can include the relative enthalpy of the interaction of a codon with its anticodon plus the flanking complementary bases, e.g., the relative enthalpy of pentamers with their antiparallel complements); the ability of a codon to be "read" by a tRNA (which can be related to codon-anticodon interaction enthalpy, size, polarity, and the like); and other such codon-level parameters. However, a codon-level parameter is not a function of the amino acid encoded by the codon.

Compositional Binary Choice Parameters

Compositional binary choice parameters can be selected from observed frequencies of certain codon groups in one message reading frame and/or correlations among frequencies of particular codon groups in two different reading frames of the same message. Compositional choice parameters include those derived from enthalpic and statistical analysis of mRNA pentamers; compositional choice parameters also include any derived from energetic and statistical analysis of mRNA n-mers (i.e., n > 3), where such analyses can be shown to yield constant intra- and inter-frame frequencies of particular codon groups.

Application of Binary Choice Parameters

The application of a first binary choice parameter, e.g., with choices a and b, will structure the triplets into classes (or subcodes) a and b. The application of a second choice parameter, with choices c and d, will structure the triplets into classes ac, ad, bc, and bd. Application of a third binary choice parameter would structure triplets into 2³ or eight subcodes. Thus, application of n binary choice parameters to the genetic code will result in the formation of a binary choice alphabet having 2ⁿ classes or subcodes. It is possible that some subcodes will be empty when the binary choice alphabet is applied to a given nucleic acid sequence.
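The bookkeeping of n binary choice parameters yielding 2ⁿ subcodes can be sketched as follows. The two parameters shown (`first_base_purine` and `gc_rich`) are hypothetical examples chosen only to make the sketch runnable; they are not the parameters of the Examples.

```python
from itertools import product


def subcode(triplet, parameters):
    """Label a triplet by the outcome of each binary choice parameter.

    `parameters` is a list of functions, each returning one of two labels.
    """
    return "".join(p(triplet) for p in parameters)


def all_subcodes(label_pairs):
    """Enumerate the 2**n possible subcodes for n pairs of labels."""
    return ["".join(combo) for combo in product(*label_pairs)]


# Two hypothetical parameters, chosen only to illustrate the bookkeeping.
def first_base_purine(triplet):
    return "a" if triplet[0] in "AG" else "b"


def gc_rich(triplet):
    return "c" if sum(base in "GC" for base in triplet) >= 2 else "d"


print(all_subcodes([("a", "b"), ("c", "d")]))        # ['ac', 'ad', 'bc', 'bd']
print(subcode("GCA", [first_base_purine, gc_rich]))  # ac
print(subcode("UUA", [first_base_purine, gc_rich]))  # bd
```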

The binary choice parameter can be applied directly to a subject triplet to assign triplets into a class. For example, the binary choice parameter can be based upon relative enthalpy of a codon-anticodon interaction (e.g., the codons are divided into group(s) of codons having high relative enthalpy and group(s) of codons having low relative enthalpy) and that parameter applied to a subject codon such as 234. A subject triplet can also be assigned a class by a method in which the binary choice parameter is applied to bases which are not in the subject triplet, or which do not correspond exactly to the bases of the subject triplet. E.g., the binary choice parameter can be applied to one or more base pairs which do not define the triplet. E.g., in evaluation of triplet 234, the binary choice parameter can be applied to triplet 123 and triplet 345, and the classes into which the triplets 123 and 345 fall can be used to assign a class or subcode to the triplet 234. In other words, the subcode of 234 can be a function of the application of the binary choice parameter to the triplets 123 and 345.

Frame Choice

Methods of the invention require the division of a sequence of bases into triplets. The simplest way is to consider a string of bases, 123456789, as triplets of 123 456 789. Mechanistically, this or any mode of division into triplets can be viewed as a process with two components, a "ratchet" or advance component and a "read" or selection component. As will be seen below, the ratchet component varies by the number of base pairs advanced after the determination of a triplet.

The read component refers to the length in base pairs, of the segment of base pairs from which the triplet will be chosen.

The simplest system, that used by most evolutionarily current cellular mechanisms, is "ratchet 3/read 3" (that is, the mRNA is advanced, or ratcheted, through the reading mechanism three bases at a time, and the message is read by the reading mechanism in groups of three bases (one codon)). Other systems, however, are possible. Without being bound by theory, it is postulated that other systems may have existed in earlier stages in the evolution of the cellular protein translation machinery. In fact, examples of current frame-shift suppressing tRNAs are known. Thus, possible alternate systems include "ratchet 3/read 5 on center" (in which the mRNA is ratcheted into the reading mechanism three bases at a time, and the reading mechanism reads the group of three bases at the center of a group of five bases in the reading mechanism). If a read value is more than 3 (e.g., in a "ratchet 3/read 5 on center" system), then additional choices are imposed: the triplet must be selected from the 3+N bases which are read. Thus, a string 1 2 3 4 5 6 7 8 9 10 can be divided into the following triplets: 234 | 567 | 8 9 10, which would be generated by reads 1[234]5 | 4[567]8 | 7[8 9 10]11, wherein the bracketed bases, the on center bases, are chosen.

For example, read-ratchet mechanisms or configurations can be divided into the following classes (the selected triplet is shown in brackets):

Class 1: ratchet 3; read 3

Class 2: ratchet 3; read 5 and select the center triplet, 1[234]5

Class 2a: ratchet 3; read 5 and select the leading triplet, [123]45

Class 2b: ratchet 3; read 5 and select the final triplet, 12[345]

Class 2c: ratchet 3; read 5 and read any triplet

Class 3: ratchet 3; frameshift; read 5, any triplet
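The ratchet and read components for Classes 1, 2, 2a, and 2b can be sketched as a single parsing function. The function name and its parameters are hypothetical, and letters stand in for base positions 1 to 11 so that the selected positions are easy to see.

```python
def read_ratchet(seq, ratchet=3, read=5, select="center"):
    """Return the triplets produced by advancing `ratchet` bases at a time,
    reading `read` bases, and selecting one triplet from each read.

    `select` is "leading", "center" or "final"; with read=3 all three
    coincide (Class 1).
    """
    offsets = {"leading": 0, "center": (read - 3) // 2, "final": read - 3}
    offset = offsets[select]
    triplets = []
    for start in range(0, len(seq) - read + 1, ratchet):
        window = seq[start:start + read]
        triplets.append(window[offset:offset + 3])
    return triplets


positions = "ABCDEFGHIJK"  # stand-ins for bases 1 through 11
print(read_ratchet(positions, read=3))                    # ['ABC', 'DEF', 'GHI']
print(read_ratchet(positions, read=5, select="center"))   # ['BCD', 'EFG', 'HIJ']
print(read_ratchet(positions, read=5, select="leading"))  # ['ABC', 'DEF', 'GHI']
print(read_ratchet(positions, read=5, select="final"))    # ['CDE', 'FGH', 'IJK']
```

With read 5 on center, the output corresponds to triplets 234, 567, and 8 9 10. Classes 2c and 3 involve a free choice of triplet or a frameshift and are not modelled here.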

Class 2 approaches allow the assignment of a binary choice parameter to a codon 234 as a function of the binary choice parameter outcome for one or both of 123 and 345, e.g., 123 is classified as UG-(k) or AC-(a) rich; and 345 is classified as UG-(k) or AC-(a) rich, which gives the following possible classes for 234: kk, ka, aa, ak. If, for example, 123 is k, and 345 is a, then 234 is ka. Note that although only one binary choice is applied, there are 4 degrees of freedom with regard to 234, because the binary choice parameter is applied twice.

A binary choice parameter which divides triplets into classes on the basis of enthalpy, e.g., of the codon-anticodon interaction (e.g., into enthalpically strong and enthalpically weak classes) is particularly useful.

Read-ratchet configurations wherein the read value is greater than 3 make possible the context-sensitive (as opposed to context-free) assignment of triplets into classes by binary choice parameters, e.g., allow triplet 234 to be assigned a value which is a function of the binary choice parameter outcome of one and, more preferably both, of 123 and 345.
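A context-sensitive assignment of this kind can be sketched using the keto (k) versus amino (a) example. Treating "rich" as at least two of three bases is an assumption of this sketch, and both function names are hypothetical.

```python
def keto_or_amino(triplet):
    """Return 'k' if the triplet is UG-rich, 'a' if it is AC-rich.

    "Rich" is taken here to mean at least two of the three bases, which is
    an assumption made for this sketch.
    """
    keto = sum(base in "UG" for base in triplet)
    return "k" if keto >= 2 else "a"


def context_class(pentamer):
    """Class of the center triplet 234 of pentamer 12345, built from the
    outcomes for the leading triplet 123 and the final triplet 345."""
    if len(pentamer) != 5:
        raise ValueError("expected five bases")
    return keto_or_amino(pentamer[0:3]) + keto_or_amino(pentamer[2:5])


print(context_class("UGUAC"))  # ka (UGU is k, UAC is a)
print(context_class("ACAUG"))  # ak (ACA is a, AUG is k)
```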

Binary Choice Alphabets

A binary choice alphabet can be constructed by selection of suitable pre-selected binary choice parameters. For example, binary choice parameters corresponding to enthalpy (of codon-anticodon interaction), size, polarity, charge, hydrophobicity, etc. can be selected and combined in any desired combination to arrive at a binary choice alphabet. Alternatively, a binary choice alphabet can be constructed by segregation of codons into 2ⁿ classes, without selecting the groups based on binary choice parameters. For example, a computer can rapidly segregate codons into randomly-selected classes to create a binary choice alphabet.

However the binary choice alphabet is constructed, it is generally preferable to validate the alphabet to ensure that the alphabet will be predictive of protein structure. It has been found that, in preferred embodiments, valid binary choice alphabets having 2ⁿ classes will generally include at least four classes (A, B, C, D) for which the following relationships are true when triplets of a nucleic acid sequence are parsed with the binary choice alphabet: the ratio A:B:C:D is about 3:5:3:5 (e.g., from about 2:4:2:4 to about 7:11:7:11); the ratio (A+B)/(C+D) is about 1 (e.g., from about 0.9 to about 1.1); and the ratio (A+D)/(B+C) is about 1 (e.g., from about 0.9 to about 1.1). Thus, in preferred embodiments, application of a binary choice alphabet to a nucleic acid sequence which encodes a protein will yield at least four classes in which triplets are arrayed according to these ratios. It is therefore possible to validate a binary choice alphabet by searching for the appearance of the desired ratios. If the ratios are found, then the alphabet may have predictive value for protein structure evaluation. If the ratios are not found, the alphabet may not have such predictive value. It will be appreciated that the presence or absence of the ratios provides a useful "check" for a selected binary choice alphabet. In preferred embodiments, regions of triplets are checked for lengths which are multiples of 16 (e.g., 3+5+3+5=16), such as 16 triplets, 32 triplets, 64 triplets, and the like. Thus, in a preferred embodiment, groups of N x 16 sequential triplets (wherein N is an integer, e.g., between 1 and 16) are evaluated to determine whether the desired ratios are present.
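The ratio check can be sketched as below. How "about 3:5:3:5" should be tested is not specified; this sketch assumes that B/A and D/C must each lie between 11/7 and 4/2, the bounds implied by 7:11:7:11 and 2:4:2:4. The function names and the toy label strings are hypothetical.

```python
from collections import Counter


def validate_alphabet(a, b, c, d, tol=0.1):
    """Check class counts A, B, C, D against the three stated relationships.

    The 3:5:3:5 test is interpreted here as B/A and D/C each lying between
    11/7 and 4/2. That reading is an assumption of this sketch.
    """
    if min(a, b, c, d) <= 0:
        return False
    low, high = 11 / 7, 4 / 2
    ratio_ok = low <= b / a <= high and low <= d / c <= high
    sums_ok = abs((a + b) / (c + d) - 1) <= tol   # (A+B)/(C+D) about 1
    cross_ok = abs((a + d) / (b + c) - 1) <= tol  # (A+D)/(B+C) about 1
    return ratio_ok and sums_ok and cross_ok


def validate_windows(classes, n=1):
    """Apply validate_alphabet to consecutive blocks of n * 16 triplets
    labelled 'A', 'B', 'C' or 'D'."""
    size = 16 * n
    results = []
    for start in range(0, len(classes) - size + 1, size):
        counts = Counter(classes[start:start + size])
        results.append(
            validate_alphabet(counts["A"], counts["B"], counts["C"], counts["D"])
        )
    return results


print(validate_alphabet(3, 5, 3, 5))   # True
print(validate_alphabet(4, 4, 4, 4))   # False
print(validate_windows(list("AAABBBBBCCCDDDDD" + "A" * 16)))  # [True, False]
```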

Another means for validating a binary choice alphabet is by comparing the frequency of codon groups when the message is read in one reading frame (e.g., Frame 1) with the frequency of the same codon groups when the message is read in another reading frame (e.g., Frame 3). It has also been found that valid binary choice alphabets will generally include at least four classes A, B, C, D such that the frequency of codons A, B, C, D varies systematically from one frame to another, e.g., from Frame 2 to Frame 1, Frame 2 to Frame 3, and/or Frame 1 to Frame 3. It is therefore possible to further validate a binary choice alphabet by searching for a systematic inter-frame variation in the frequencies of codon groups defined by the alphabet. If such systematic variation is found, the alphabet may have predictive value. One of ordinary skill in the art, in light of the teachings herein, will be able to select useful binary choice alphabets according to these criteria using no more than routine experimentation.

Examples

Example 1: Generation of a predictive amino acid alphabet based on binary choices which are a function of enthalpy of codon-anticodon interaction

This example provides a predictive four letter amino acid alphabet (4a^) for the representation of protein primary structures (s^^s) from the energetic properties of mRNA molecules, i.e., the translation of mRNAs. While not wishing to be bound by theory, the basis for deriving an amino acid alphabet from codon-anticodon interactions can be rationalized as follows: if the genetic code was not "frozen" prior to the onset of translation and the evolution of protein primary structure, then the evolutionary trajectory of this code may have been one factor which determined important properties of protein primary structure. Energetics of codon-anticodon interactions may have been relevant to the evolution of the genetic code before ribosomes existed, when these interactions occurred in an aqueous medium.

The configuration of the reading frame may also provide a basis for deriving an amino acid alphabet. Figure 1 schematically depicts two alternative reading frames for a nucleic acid sequence, each reading frame defining an energetic packet or triplet; each nucleic acid base of the message is represented by a black square. Again, while not being bound by theory, evolution may have favored systems which would allow slippage from frame 1 to frame 2. This would impose entropic requirements on the code. It is noted that in this system, which permits "slippage", energy packaging may be analogous to human linguistic systems which permit slippage and routines for assigning syllabic stress (f0) and consequent systematic recasting of signifying sound tokens. For comparison, refer to Grimm's Law and Verner's Law for Indo-European.

It is shown herein that the energies of codon-anticodon interactions pattern systematically, that this pattern implicitly defines a particular amino acid alphabet, and that this amino acid alphabet characterizes protein primary structure predictively, i.e., provides insight into protein secondary and tertiary structure. Table IA shows the Ornstein-Fresco ΔH values for the 10 possible base pair overlaps. Using those ΔH values, average ΔH values were calculated for all 64 possible codon-anticodon interactions for all possible mRNA pentamers (or "five-envelopes") with codons as the center three bases. The average ΔH values shown in Tables IB and IC assume that there is no wobble pairing of codon and anticodon. The average ΔH values in Tables IB and IC were calculated according to the following formula: for any pentamer ABCDE, ΔH is calculated for the codon BCD according to the formula: (ΔH(AB) + ΔH(BC) + ΔH(CD) + ΔH(DE)) / 4. The 64 codon triplets are shown in the first column of Table IB. Values for a codon in each of all possible five-envelopes are shown in each row. For example, in the case of UUU, the enthalpy value for a UUU codon preceded by a U and followed by a U is 2.80. The enthalpic value when UUU is preceded by a U and is followed by a C is 2.45. The average value for a codon in all possible "five-envelopes" is given in the penultimate column on the right side of the table. For the UUU codon, the average for all possible five-envelopes is 2.43. That average is calculated for all codons in Table IB. The final column (far right) of Table IB provides the average enthalpic value for all codons having a common leading doublet. For example, all codons which begin with the doublet UU have an average enthalpic value of 2.11. Table IC shows the values from the penultimate column of Table IB. Note that the values in Table IC hover around four values, 0.6, 1.2, 1.8, and 2.4.
It can also be seen, as indicated in the caption of Table IC, that for any given doublet XX, the average enthalpic value for the codons XXU and XXA is about 0.6 higher than the average value for the codons XXC and XXG. The energetic pattern evident in Table IC manifests itself in mRNAs. Table IIA shows 16 enthalpically defined codon groups (separated by dashed lines) produced by ranking the codons according to the interaction ΔH of the leading doublet, that is, the first two base pairs of the codon, and by the codon interaction enthalpy value from Table IC. In Table IIA the first column shows all codons. The second column identifies the first doublet in the third bases of the codon. The third column provides the ΔH of the first doublet, the fourth column provides the mean codon ΔH over all 16 possible pentameric envelopes (as set out in Table IB, penultimate column) and the fifth column provides a letter for a group designation. The horizontal divisions segregate the first doublets according to the eight energy levels shown in Table IA. Each of the groups thus formed by horizontal division is further subdivided on the basis of the average value for the codon over the possible five-envelopes of Table IB and by which of the 4 energy levels identified in Table IC it falls into. Table IIB is analogous except that the first binary choice applied is the ΔH for the second or final doublet of the codon.
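The five-envelope average described above for a pentamer ABCDE can be sketched as follows, taking the four overlapping doublets to be AB, BC, CD, and DE. The function names are hypothetical. The two doublet values in the demonstration are not copied from Table IA; they are back-calculated from the two worked UUU examples in the text (2.80 for UUUUU and 2.45 for UUUUC) under the assumption that the averaging formula is as stated.

```python
from itertools import product


def envelope_value(pentamer, doublet_dh):
    """Average doublet ΔH over the four overlapping doublets of ABCDE."""
    if len(pentamer) != 5:
        raise ValueError("expected five bases")
    doublets = [pentamer[i:i + 2] for i in range(4)]  # AB, BC, CD, DE
    return sum(doublet_dh[d] for d in doublets) / 4


def codon_average(codon, doublet_dh, bases="UCAG"):
    """Mean envelope value of `codon` over all 16 flanking-base combinations.

    Requires a `doublet_dh` table covering every doublet that can occur.
    """
    values = [
        envelope_value(left + codon + right, doublet_dh)
        for left, right in product(bases, repeat=2)
    ]
    return sum(values) / len(values)


# Back-calculated from the worked examples, not taken from Table IA:
# UUUUU -> 2.80 implies ΔH(UU) = 2.80; UUUUC -> 2.45 then implies ΔH(UC) = 1.40.
partial_dh = {"UU": 2.80, "UC": 1.40}

print(round(envelope_value("UUUUU", partial_dh), 2))  # 2.8
print(round(envelope_value("UUUUC", partial_dh), 2))  # 2.45
```

`codon_average` is not called in the demonstration because it needs a complete doublet table, which is not reproduced in this section.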

Table IIC shows the frequency of Table IIA, or leading, codon groups and of Table IIB, or following codon groups in a test mRNA database.

The leading or L codon groups of Table IIC correspond to frame 1 of the mRNA and the final or F codon groups in Table IIC correspond to frame 3 of the mRNA. The middle column of Table IIC shows the difference in frequency between the L groups and the F groups shown in the first and last columns of Table IIC. It can be seen that the differences are very small, which may be a consequence of an original evolutionary pentameric energy packaging scheme. One possible explanation for this conserved "epiphenomenon" is that the present day "ratchet 3/read 3" translation system evolved from a "ratchet 3/read 5 on center" primordial translation system. Present day "frame shift suppressor" tRNAs with anticodon loops greater than 3, are possibly mutant analogs of ancestor tRNAs which regularly read pentamers. According to this view, as ratchet 3/read 3 translation systems evolved from ratchet 3/read 5 ancestral translation systems, mRNAs would have had to be repackaged in one of two alternative reading frames different from the original reading frame. For example, an original evolutionary ratchet 3/read 5 on center system would read pentamer 12345 as 234. This corresponds to present day frame 2. However, a ratchet 3/read 3 translation system reading from that same pentamer 12345, would read 123, corresponding to present day reading frame 1, or else it would read 345, corresponding to present day reading frame 3. It is believed that the prevalence of the "weak" bases U and A at the 5' ends of the anticodon loops of tRNA pentamers would favor repackaging of codons into present day frame 1 rather than into present day frame 3.

If such an evolution from a ratchet 3/read 5 on center to a ratchet 3/read 3 translation system occurred, the resulting frameshift from reading frame 2 to reading frame 1 would have the potential to cause disastrous changes in protein structure as the alternate reading frame was read. There are at least two ways in which catastrophic mutations could be avoided. First, if the pentamer packets of the earliest mRNAs were read "loosely" by the earliest tRNA anticodon, that is, if early tRNAs could read either 123, 234, or 345 out of each pentamer, then the loose reading would result in evolutionary pressure to select mRNAs containing packets which would not introduce harmful amino acids into protein primary structures when the packets were read differently, e.g., when the packets were read in frame 1 rather than in frame 2. Second, if the mRNAs were so selected from the start, then a systemic frameshift would not necessarily introduce harmful amino acids into protein primary structures in numbers sufficient to damage structure and/or function of the protein, and in fact might permit the introduction of novel amino acid sequences with beneficial effects on protein secondary and tertiary structures.

This suggests that if a systemic frameshift occurred, some codon distributions would have remained essentially unchanged ("constant" codons) while other codon distributions would have changed ("wild card" codons), which could have a beneficial effect on protein structure. In this case, the evolutionary distinction between "wild card" and "constant" codons might classify amino acids in such a way as to enable the construction of a predictive amino acid alphabet. Accordingly, a binary choice alphabet was created in which the "constant" vs. "wild card" distinction was one binary choice parameter (Figure 2).

Table IIIA shows possible enthalpic groups of leading and final triplets in mRNA pentamers with the 64 codons as centers. An example is shown in Figure 2, in which the codon UUA is the center triplet. The first column of Figure 2 shows the four possible leading L triplets together with the classification group from Table IIA in the second column. The fourth column of Figure 11 shows the classification group of the final (F) triplets shown in the last column of Figure 11.

As shown in Table IIIA, doublets can be classified as "constant codon doublets" or "wild card codon doublets". A constant codon doublet is a doublet XX of a codon XXY or XXR (Y and R stand for a pyrimidine base or a purine base respectively), in which XX is UU, CC, GG, or AA, for which codon, as shown in Table IIIA, the leading (NXX) and final (XYN or XRN) triplets of all possible pentamers (N is any base) belong to the same enthalpic groups of Tables IIA and IIB. For example, for the codon UUA (boxed line at upper left of Table IIIA), the four possible leading triplets (NUU) all belong to the groups Z, W, and X. The four possible final triplets (UAN) also all belong to the groups Z, W, and X. Because U is a pyrimidine (Y) and A is a purine (R), UUA is a constant codon doublet of class YXR. A "wild card codon doublet", in contrast, shows an alternation between enthalpic groups of Tables IIA and IIB as the leading and final triplets are analyzed over all pentamers. For example, for the codon UUU (top line at upper left of Table IIIA), the four possible leading triplets (NUU) belong to the groups Z, W, and X, as noted above. The four possible final triplets (UUN) belong to the groups Z, V, Y, and U, differing from the leading triplets. Because U is a pyrimidine (Y), UUU is a wild card codon doublet of class YXY. The distinction between constant codon doublets and wild card codon doublets can be used to construct a four letter amino acid alphabet. As shown in Figure 3, the 64 codons can be divided into four groups: constant YXR doublets, constant RXY doublets, wild card YXY doublets, and wild card RXR doublets.
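The four-group division can be sketched in a few lines, under one stated assumption: rather than looking up the enthalpic groups of Table IIIA (which are not reproduced here), the sketch uses the purine/pyrimidine shorthand, in which the non-symmetrical shapes YXR and RXY are treated as "constant" and the symmetrical shapes YXY and RXR as "wild card". The names `base_type` and `codon_class` are hypothetical.

```python
from collections import Counter
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"U", "C", "T"}  # "T" accepted so DNA-style input also works


def base_type(base: str) -> str:
    """Return 'R' for a purine and 'Y' for a pyrimidine."""
    b = base.upper()
    if b in PURINES:
        return "R"
    if b in PYRIMIDINES:
        return "Y"
    raise ValueError(f"unrecognized base: {base!r}")


def codon_class(codon: str) -> tuple:
    """Classify a codon by the types of its first and third bases."""
    if len(codon) != 3:
        raise ValueError("a codon has exactly 3 bases")
    shape = base_type(codon[0]) + "X" + base_type(codon[2])
    # Assumption: non-symmetrical shapes are "constant", symmetrical are "wild card".
    kind = "constant" if shape in ("YXR", "RXY") else "wild card"
    return kind, shape


# Tally the 64 codons by shape; each shape should hold the same number of codons.
tally = Counter(codon_class("".join(c))[1] for c in product("UCAG", repeat=3))
print(codon_class("UUA"), codon_class("UUU"), dict(tally))
```

Because the first and third positions each offer two bases of a given type and the middle position is unconstrained, each of the four shapes covers 2 x 4 x 2 = 16 codons.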

As shown in Figure 4, a test mRNA database was analyzed to determine the frequencies of the four codon groups in the four letter amino acid alphabet of Figure 3. The mRNA database was read in both frame 1 and frame 2. As can be seen from Figure 4, shifting from reading in frame 2 to reading in frame 1 results in the interchange of frequencies of p and s.

Example 2: Determination of Secondary and Tertiary Protein Structural Features Correlated With Message Segments Evaluated With a Binary Choice Alphabet

A binary choice alphabet of Example 1 (s, p, d, t) was used to evaluate protein structures as follows:

Test mRNA sequences were analyzed from a database of mRNAs (e.g., from GenBank). Note that in GenBank, uracil (U) is stored as "T"; this convention will be used throughout this example. Each sequence was then analyzed in reading frame 2 using the following mapping:

ATT/ATC/GTT/GTC=A
ACT/ACC/GCT/GCC=B
AAT/AAC/GAT/GAC=C
AGT/AGC/GGT/GGC=D
TTA/TTG/CTA/CTG=E
TCA/TCG/CCA/CCG=F
TAA/TAG/CAA/CAG=G
TGA/TGG/CGA/CGG=H
TTT/TTC/CTT/CTC=I
TCT/TCC/CCT/CCC=J
TAT/TAC/CAT/CAC=K
TGT/TGC/CGT/CGC=L
ATA/ATG/GTA/GTG=M
ACA/ACG/GCA/GCG=N
AAA/AAG/GAA/GAG=O
AGA/AGG/GGA/GGG=P

One binary choice parameter was whether the leading base of the triplet was purine (A or G; groups A-D and M-P) or pyrimidine (T or C; groups E-L). The other binary choice parameter was the "wild card" vs. "constant" distinction discussed in Example 1, supra. It should be noted that this parameter also corresponds to a binary choice between "symmetrical" (YXY and RXR) codons vs. "non-symmetrical" (YXR and RXY) codons (in which Y and R are pyrimidine and purine as defined above).
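The 16-letter mapping can be expressed as a lookup table, as in the following sketch. The codon groups are copied from the mapping listed in this example; the names `GROUPS`, `CODON_TO_LETTER`, and `map_frame` are hypothetical, and the sketch assumes that reading frame k begins at the k-th base of the supplied sequence (so frame 2 starts one base in).

```python
# Codon groups A-P, copied from the mapping of this example (T stands for U).
GROUPS = {
    "A": "ATT ATC GTT GTC", "B": "ACT ACC GCT GCC",
    "C": "AAT AAC GAT GAC", "D": "AGT AGC GGT GGC",
    "E": "TTA TTG CTA CTG", "F": "TCA TCG CCA CCG",
    "G": "TAA TAG CAA CAG", "H": "TGA TGG CGA CGG",
    "I": "TTT TTC CTT CTC", "J": "TCT TCC CCT CCC",
    "K": "TAT TAC CAT CAC", "L": "TGT TGC CGT CGC",
    "M": "ATA ATG GTA GTG", "N": "ACA ACG GCA GCG",
    "O": "AAA AAG GAA GAG", "P": "AGA AGG GGA GGG",
}

CODON_TO_LETTER = {
    codon: letter for letter, codons in GROUPS.items() for codon in codons.split()
}


def map_frame(sequence: str, frame: int) -> str:
    """Map a nucleotide sequence to the A-P alphabet in the given frame (1, 2, or 3).

    Assumption: frame k starts at the k-th base of `sequence`. Trailing bases
    that do not fill a triplet are ignored; non-ACGT symbols raise KeyError.
    """
    seq = sequence.upper().replace("U", "T")
    start = frame - 1
    return "".join(
        CODON_TO_LETTER[seq[i:i + 3]] for i in range(start, len(seq) - 2, 3)
    )


# Hypothetical six-base message, read in frame 1 and in frame 2.
print(map_frame("ATGGCC", 1), map_frame("ATGGCC", 2))
```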

The mapped string from reading frame 2 was then converted to the binary choice alphabet (s, p, d, t) according to the following scheme: ABCDEFGHIJKLMNOP=ssssppppddddtttt. The result is a binary choice alphabet of degree 2, dividing the genetic code into 4 classes (denoted s, p, d, t), as shown in Figure 3. The mapped string was then evaluated, over a moving window of 16 triplets (16 letters in the spdt alphabet), to determine regions in which the s:p:d:t ratio was about 3:5:3:5 in reading frame 2 (that is, s >= 2, p >= 4, d >= 2, t >= 4). When such a region was found, the mRNA sequence was translated to an amino acid sequence in frame 1 for that region of the mRNA (i.e., by reading the message resulting from adding a base at the beginning and eliminating a base at the end of the message segment). Our protein database (described infra) was then searched for proteins which included the resulting Frame 1 amino acid sequence. When a single protein was found to have two separate and distinct regions with even low homology to the derived Frame 1 amino acid sequence, the two regions were often found to have similar, or virtually identical, secondary and tertiary structural features. When two different proteins were found which each manifested one or more regions with even low homology to the derived Frame 1 amino acid sequence, these regions were often found to have very similar secondary and tertiary structural features.
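The moving-window evaluation of this example can be sketched as follows. The sketch starts from a string already mapped to the A-P alphabet, applies the frame 2 scheme, and reports every window of 16 letters meeting the stated thresholds (s >= 2, p >= 4, d >= 2, t >= 4). The names `to_spdt` and `matching_windows` and the sample string are hypothetical, and reporting window start positions is an illustrative choice rather than a description of how the original analysis was carried out.

```python
from collections import Counter

# Frame 2 scheme from this example: ABCDEFGHIJKLMNOP = ssssppppddddtttt
FRAME2_SCHEME = str.maketrans("ABCDEFGHIJKLMNOP", "ssssppppddddtttt")

# Thresholds given in the text for "about 3:5:3:5".
THRESHOLDS = {"s": 2, "p": 4, "d": 2, "t": 4}
WINDOW = 16  # triplets per window


def to_spdt(mapped: str) -> str:
    """Convert an A-P mapped string to the s/p/d/t alphabet (frame 2 scheme)."""
    return mapped.translate(FRAME2_SCHEME)


def matching_windows(spdt: str, window: int = WINDOW) -> list:
    """Return the start index (in triplets) of each window meeting the thresholds."""
    hits = []
    for start in range(len(spdt) - window + 1):
        counts = Counter(spdt[start:start + window])
        if all(counts[cls] >= minimum for cls, minimum in THRESHOLDS.items()):
            hits.append(start)
    return hits


# Hypothetical mapped string with exactly 3 s, 5 p, 3 d, and 5 t letters.
sample = "AAA" + "EEEEE" + "III" + "MMMMM"
print(to_spdt(sample), matching_windows(to_spdt(sample)))
```

A window start index i in triplets corresponds to base offset 3i + 1 of the message when frame 2 is taken to begin at the second base, which is the assumption used here.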

Example 3: Starting from a Known Protein Structure

Binary choice alphabets (s, p, d, t) were used to evaluate protein structures as follows:

Test mRNA sequences were read from a database of mRNAs (e.g., from GenBank). Each sequence was then read in reading frame 1 and in reading frame 2 using the mapping described in Example 2 for the 16-letter alphabet A-P.

The mapped string from reading frame 1 was then converted to a binary choice alphabet (s, p, d, t) according to the following scheme: ABCDEFGHIJKLMNOP=ppppssssttttdddd. The mapped string from reading frame 2 was then converted to a binary choice alphabet (s, p, d, t) according to the following scheme: ABCDEFGHIJKLMNOP=ssssppppddddtttt.

The mapped strings were then evaluated, over a moving window of 16 triplets (16 letters in the spdt alphabet), to determine regions in which the s:p:d:t ratio was about 3:5:3:5 in both frame 1 and frame 2. When such a region was found, the mRNA sequence was translated to an amino acid sequence in both frame 1 and frame 2 for that region of the mRNA. Our protein database was then searched for proteins which contain the amino acid sequence encoded by the translated region of Frame 2. The database of protein messages contained messages for three hundred proteins, those proteins being sixty to six thousand amino acids in length. The proteins included proteins with roles in protein synthesis, nucleic acid synthesis, protein or nucleic acid degradation, various "house-keeping" enzymes, and some immunoglobulins. When a protein containing the sequence was found, the structural similarity (e.g., the tertiary structure) of that portion of the protein was compared to the structure of the protein encoded by the test mRNA sequence.
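The two-frame test can be sketched by applying the frame-specific schemes and requiring both windows to qualify. Several points are assumptions made only for illustration: the thresholds of Example 2 (s >= 2, p >= 4, d >= 2, t >= 4) are reused for "about 3:5:3:5" in both frames, a region is taken to be the window with the same triplet index in each frame, and the names `near_3535` and `shared_regions` are hypothetical.

```python
from collections import Counter

# Frame-specific schemes from this example.
FRAME1_SCHEME = str.maketrans("ABCDEFGHIJKLMNOP", "ppppssssttttdddd")
FRAME2_SCHEME = str.maketrans("ABCDEFGHIJKLMNOP", "ssssppppddddtttt")


def near_3535(window: str) -> bool:
    """Assumed test for 'about 3:5:3:5', reusing the Example 2 thresholds."""
    counts = Counter(window)
    return (counts["s"] >= 2 and counts["p"] >= 4
            and counts["d"] >= 2 and counts["t"] >= 4)


def shared_regions(mapped_frame1: str, mapped_frame2: str, window: int = 16) -> list:
    """Return triplet indices where the window qualifies in both frames."""
    f1 = mapped_frame1.translate(FRAME1_SCHEME)
    f2 = mapped_frame2.translate(FRAME2_SCHEME)
    usable = min(len(f1), len(f2))
    return [
        i for i in range(usable - window + 1)
        if near_3535(f1[i:i + window]) and near_3535(f2[i:i + window])
    ]


# Hypothetical A-P strings for the same message read in frame 1 and frame 2.
frame1_mapped = "EEE" + "AAAAA" + "MMM" + "IIIII"
frame2_mapped = "AAA" + "EEEEE" + "III" + "MMMMM"
print(shared_regions(frame1_mapped, frame2_mapped))
```

Note that the same A-P string yields different s:p:d:t counts under the two schemes, since the frame 1 scheme interchanges s with p and d with t relative to the frame 2 scheme.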

It was found that for several test mRNA sequences, many of those portions of the identified proteins were structurally very similar to the comparable portions of the protein encoded by the test mRNA sequence. For example, a helix-strand transition in the protein encoded by the test mRNA sequence was structurally similar to a helix-strand transition of a protein located in the protein database according to methods of the invention. Application of the methods of the invention (e.g., the methods of Example 2 and Example 3) to a variety of test sequences identified structural similarity in at least one protein of our protein database for other structural motifs such as sheets, helix entry, helix exit, Pro-His-Pro turns, and the like.

Example 4

The function of introns (e.g., non-coding DNA sequences in genomic DNA) is generally not well understood. Methods of the invention provide knowledge which is useful for investigating intron function. The methods of the invention can include searching nucleic acid databases (e.g., of genomic DNA) for regions of nucleic acid which do not code for protein in the present-day reading frame (i.e., frame 1), but which could code for protein in an alternate reading frame (e.g., frame 2 or frame 3). Such a presently non-coding region (i.e., an intron) could correspond to a region of a nucleic acid which was a coding region prior to a frameshift. Such formerly-coding regions could encode alternate structures (i.e., protein regions which differ from the modern protein regions) which preserve the function of the protein. Thus, a nucleic acid which represents both coding and non-coding regions can be analyzed in both frames 1 and 2, as described supra for Examples 2 and 3. Where a non-coding region, such as an intron, is found in which the s:p:d:t ratio is about 3:5:3:5 in frame 2, that region may correspond to a region of the nucleic acid which coded for protein structure prior to a shift in reading frame.

Equivalents

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims. The contents of all references cited herein are hereby incorporated by reference.

What is claimed is:

Claims

1. A method of evaluating protein structure comprising: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning one or a plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying triplets of the nucleic acid sequence as members of a class of a binary choice alphabet of n degrees of freedom, and wherein the classes can be generated by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence, thereby evaluating the protein structure.

2. The method of claim 1, further comprising making a record, on a machine readable medium of the class assigned to one or more triplets.

3. The method of claim 1, wherein triplets are assigned to a first and a second class: the first class having the property that a message made of triplets drawn exclusively from the first class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets, and the second class having the property that a message made of triplets drawn exclusively from the second class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets.

4. The method of claim 1, wherein the message-level property is: a function of the UA content of a subject triplet; a function of the GC content of a subject triplet; a function of the size or molecular weight of a triplet; a function of whether the triplet is keto rich or amino rich; a function of whether the triplet is purine rich or pyrimidine rich; or a function of the enthalpy of the interaction between the triplet and a fully or partially complementary nucleic acid.

5. The method of claim 1, wherein a subject triplet 456 of a nucleic acid sequence of bases 123 456 789 is assigned into a class as a function of: (1) performing one or more of (i), (ii), and (iii)

(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or more of triplet 123, 234, or 345, to yield a leading value; (ii) applying a binary choice parameter to 456, to provide a center value;

(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one or more of triplet 567, 678, or 789, to yield a following value;

6. A class-constant table of nearest neighbor relationships for amino acid residues which provides, for each of a plurality of class constant nearest neighbors, a frequency of occurrence which is a function of the occurrence of the class constant nearest neighbor pair in a collection of at least 10 proteins.

7. A method of evaluating a protein structure comprising: providing a class-constant table of nearest neighbor relationships for amino acid residues; providing a nucleic acid which encodes a protein structure; and comparing one or a plurality of the observed nearest neighbor pairs in the protein structure with the frequencies provided by the class constant table, thereby evaluating the protein structure.

8. The method of claim 7 wherein the comparison can include: assigning an expected frequency from the class constant table to one or a plurality of the observed nearest neighbor pairs and determining how many of the observed nearest neighbor pairs fall above or below a predetermined value; determining the likelihood of occurrence, as predicted by the class constant table, for an observed nearest neighbor pair; or determining if an observed nearest neighbor pair of a first and a second amino acid residue from the protein structure is predicted by the class constant table to occur at a predetermined frequency.

9. The method of claim 7, further comprising making a record of observed class constant nearest neighbors in the protein structure on a machine-readable medium.

10. A machine-readable medium on which is recorded a class-constant nearest neighbor table.

11. A method of evaluating a protein structure for resistance to change, e.g., evolutionary or mutational change, comprising: identifying regions of a protein which are encoded by runs of a single subcode, thereby identifying regions which have been resistant to change and which are therefore predicted to be functionally or structurally significant.

12. The method of claim 11, wherein the method includes determining if the nucleic acid sequence which encodes the protein structure includes a run of triplets at least 40 triplets in length, in which at least 90% of the triplets in the run are from one class.

13. A method of evaluating a protein structure for the presence of critical amino acid residues comprising: identifying critical amino acid residues by identifying minority codons in runs encoded by codons of a single class or subcode, thereby identifying residues which have been resistant to change and which are therefore believed to be functionally important.

14. The method of claim 13, wherein the evaluation comprises identifying a triplet from a first class in a run of triplets of a second class at least 40 codons in length, in which at least 40% of the codons are from the second class, thereby identifying the triplet of the first class as encoding a critical residue.

15. A method for evaluating a protein structure comprising: providing a nucleic acid sequence which encodes the protein structure; assorting bases of the nucleic acid sequence into subject triplets; and assigning at least one of the subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets, the at least four classes of triplets being represented in at least a portion of the nucleic acid sequence in a ratio of about 3:5:3:5; thereby evaluating the protein structure.

16. The method of claim 15, wherein the method includes making a record on a machine-readable medium of the class assigned to one or more triplets.

17. A method for identifying coding regions of a nucleic acid sequence, the method comprising: providing the nucleic acid sequence; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets; assigning the plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets A, B, C, and D; determining whether the plurality of subject triplets are distributed into the at least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5; thereby identifying coding regions of the nucleic acid sequence.

19. The method of claim 17, wherein the method includes making a record on a machine-readable medium, of the class assigned to one or more triplets.

20. A method for identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of a test protein, the method comprising: providing a nucleic acid sequence which encodes all or a portion of the test protein; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame; assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of freedom by applying n first binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees of freedom by applying n second binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; and identifying a protein which includes a polypeptide portion encoded by the plurality of triplets in the second reading frame; thereby identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of the test protein.

21. A method for identifying a mutation-prone region of a viral nucleic acid sequence comprising: providing the nucleic acid sequence; assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame; assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; and assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2ⁿ classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; thereby identifying a mutation-prone region of the nucleic acid sequence.