EP1173755A2 - Amino acid sequence evaluation system

Info

Publication number: EP1173755A2
Application number: EP00926228A
Authority: EP
Inventors: Stefano Volinia; Hung-Sen Lai; Michael B. Yaffe; German G. Lepar; Lewis C. Cantley
Original assignee: Beth Israel Deaconess Medical Center Inc; Beth Israel Hospital Association
Current assignee: Beth Israel Deaconess Medical Center Inc
Priority date: 1999-04-23
Filing date: 2000-04-21
Publication date: 2002-01-23
Also published as: WO2000065358A3; JP2002543390A; CA2371238A1; AU4479200A; WO2000065358A2

Abstract

Method and apparatus for evaluating a query protein amino acid sequence with respect to an amino acid sequence motif corresponding to a target of a domain of a known protein having a known function to predict whether the query protein amino acid sequence performs the known function. In one embodiment, the prediction is performed by using a scoring system to generate a score for the query protein amino acid sequence with respect to the amino acid sequence motif. The score indicates a degree of confidence that the query protein amino acid sequence performs the known function. Several embodiments of scoring systems are described.

Description

AMINO ACID SEQUENCE EVALUATION SYSTEM

Field of the Invention

The present invention relates to computer-implemented methods for predicting the functional roles and pathways in which proteins may interact with each other.

Background

Genome sequencing initiatives such as the Human Genome project have generated thousands of protein sequences of unknown function and are expected to generate vast numbers of additional sequences in the coming years. A comparison of sequence homologies between such newly discovered proteins and proteins of known function remains the only method available for predicting functional properties for newly discovered proteins.

Many biological responses are mediated by the interaction of a binding protein with another molecule. Examples of such binding interactions include the interaction of an enzyme with its substrate (e.g., the interaction of kinases, proteases, phosphatases etc. with their substrates), the interaction of antibodies with antigens, the interaction of receptors with ligands and the interaction of SH2 domains with phosphotyrosine-containing targets. The specificity of a particular binding protein, such as an enzyme, typically has been determined by identifying a number of natural substrates for the protein, obtaining the sequence of these substrates and then comparing the sequences of these substrates to define a consensus motif for substrate binding.

In general, protein and nucleic acid homology can be calculated using various, publicly available software tools developed by NCBI (Bethesda, Maryland) that can be obtained through the Internet (ftp:/ncbi.nlm.nih.gov/pub/). Exemplary tools include the BLAST system available at http://www.ncbi.nlm.nih.gov. Pairwise and ClustalW alignments (BLOSUM30 matrix setting) as well as Kyte-Doolittle hydropathic analysis can be obtained, for example, using the MacVector sequence analysis software (Oxford Molecular Group). In addition, a compilation of motif sequences has been described (Bairoch, A., Nucl. Acids Res. 19:2241-2245 (1991)). Unfortunately, this compilation of sequences is deduced from the works of many investigators, each with varying degrees of precision and error and, accordingly, the compilation is replete with inaccuracies that are attributable to the original data entry.

In view of the foregoing, a need still exists to provide an improved method for predicting the functional and/or binding properties of newly discovered proteins. Such methods should be designed to reduce or prevent the errors introduced by comparison to individual sequences and to provide a more accurate method of predicting the functional and/or binding properties of newly discovered proteins.

Summary

The invention is directed to a system which allows a user to scan a protein sequence for novel functions, interactions, post-translational modifications and other structural characteristics. Application of the invention is not limited to proteins previously identified in the laboratory, but can also be used to deduce this information for theoretical proteins whose sequences are obtained from genome sequencing initiatives such as the Human Genome Project. Consequently, the system disclosed herein can be used to suggest the functional roles and pathways in which a novel protein may be involved, and to provide directions towards identifying other proteins that are likely to interact with the target protein of interest. In particular, in one embodiment, the system searches a query protein sequence to identify amino acid motifs that are likely to interact with proteins, e.g., as enzyme substrates or other binding domains. In another embodiment the system is a web-based system.

In general, the primary data from which the system predicts protein motifs is based upon a database containing information generated using oriented degenerate peptide libraries (ODPL). An ODPL contains a mixture of library members which differ from one another in amino acid sequence but which, in general, contain the same amino acid located at a fixed amino acid position (referred to herein as the "non-degenerate" position). A position within each library peptide that is occupied by a different amino acid in different peptides (i.e., not fixed) is referred to herein as a "degenerate position". Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577. All documents identified in this application are incorporated in their entirety herein by reference.

According to one aspect of the invention a system is provided which is useful for evaluating a query protein amino acid sequence to determine whether the query protein contains one or more defined motifs. The system comprises a database which includes a record of motifs corresponding to a protein domain of known function. The record further includes a matrix of "preference values" (alternatively referred to as "selectivity values") for amino acids at positions in the motif, the preference values indicating the relative importance of each amino acid at each position to the function of the motif (e.g., a binding function, a phosphorylation site function). The system is used as a method for evaluating a query protein amino acid sequence to determine whether the query protein contains a sequence corresponding to the motif and, therefore, likely exhibits the function attributed to other proteins which contain a sequence corresponding to this motif. Thus, in general, the method of the invention evaluates the query protein amino acid sequence with respect to the motif by calculating a score, wherein the score is based on selected preference values in the motif which correspond to amino acids in the query protein amino acid sequence.

The invention provides two different approaches to provide relative scores of candidate motifs in the query protein sequence. Each of these approaches involves a step of (A) calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence. Each method is briefly summarized below and is discussed in more detail in the Examples.

In one aspect, the invention is directed to a method for evaluating a query protein amino acid sequence with respect to a motif corresponding to a target of a domain of this or another protein having a known function, in a system including a database including a record for the motif, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions. The method includes a step of: (A) calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence. In one embodiment, the step (A) includes a step of: (A)(1) calculating the score as a logarithm of a product of the selected preference values. In another embodiment, the step (A) includes steps of: (A)(1) multiplying the selected preference values to obtain the product; and (A)(2) calculating the score as the logarithm of the product. In a further embodiment, the step (A) includes steps of: (A)(1) calculating the logarithm of each of the selected preference values; and (A)(2) calculating the score as a sum of the logarithms calculated in step (A)(1). In another embodiment, the step (A) comprises a step of: (A)(1) calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values. In a further embodiment, the motif includes a number of degenerate positions, and the step (A) includes steps of: (A)(1) generating, for each of the selected preference values, a probability value that is proportional to the selected preference value; (A)(2) calculating, for each of the probability values, a negative logarithm of the probability value; (A)(3) summing the negative logarithms; and (A)(4) calculating the score by dividing the sum by the number of degenerate positions in the known amino acid sequence.
In another embodiment, the step (A) includes steps of: (A)(1) generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids. In another embodiment, the method further includes steps of: (B) calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and (C) calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences. In one embodiment, the step (C) includes steps of: (C)(1) generating a histogram of the scores of the plurality of amino acid sequences; (C)(2) identifying a position of the score of the query protein amino acid sequence within the histogram; and (C)(3) calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences.
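The scoring embodiments summarized above can be illustrated with a short Python sketch. It is not taken from the specification: the table `PREFS`, its values, and all function names are hypothetical, and the way a probability "proportional to" a preference value is obtained (dividing by the sum of the values at that position) is only one assumed reading.

```python
import math

# Hypothetical preference values for three degenerate positions (offsets
# -1, +1, +2 around a fixed residue). Values are invented for illustration.
PREFS = {
    -1: {"A": 1.0, "R": 2.0, "E": 0.5},
    +1: {"A": 1.0, "P": 4.0, "E": 0.5},
    +2: {"A": 1.0, "L": 2.0, "E": 0.25},
}

def selected_values(prefs, residues):
    """Pick the preference value for the residue observed at each offset."""
    return [prefs[offset][aa] for offset, aa in residues.items()]

def log_product_score(prefs, residues):
    """Score as the logarithm of the product of the selected values."""
    return math.log(math.prod(selected_values(prefs, residues)))

def sum_of_logs_score(prefs, residues):
    """Same quantity computed as a sum of logarithms."""
    return sum(math.log(v) for v in selected_values(prefs, residues))

def avg_neg_log_prob_score(prefs, residues):
    """Average negative log probability over the degenerate positions.

    Assumption: each probability is the selected preference value divided
    by the sum of all preference values at that position.
    """
    total = 0.0
    for offset, aa in residues.items():
        p = prefs[offset][aa] / sum(prefs[offset].values())
        total += -math.log(p)
    return total / len(residues)

# Residues observed in a hypothetical query window, keyed by offset.
query = {-1: "R", +1: "P", +2: "L"}
print(log_product_score(PREFS, query))
print(sum_of_logs_score(PREFS, query))
print(avg_neg_log_prob_score(PREFS, query))
```

The first two functions return the same value up to floating-point rounding, since the logarithm of a product equals the sum of the logarithms. Under the assumed normalization, a smaller average negative log probability corresponds to residues with larger preference values.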

In one embodiment, the method further includes a step of: (B) generating a graphical display on a display device, the graphical display including information descriptive of the score. In one embodiment, the step (B) includes steps of: (B)(1) generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and (B)(2) generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In one embodiment, the step (B)(1) includes a step of: (B)(1)(1) generating the query protein amino acid sequence graphical element having a visible range of positions corresponding to positions within the query protein amino acid sequence; and the step (B)(2) includes a step of: (B)(2)(1) generating the motif identifier at a location that visually corresponds to the position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

In one embodiment, the step (B) includes a step of: (B)(1) generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In a further embodiment, the method further includes a step of: (B)(1) generating display information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well. In another embodiment, the step (B) includes steps of: (B)(1) generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and (B)(2) generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram. In another aspect, the invention is directed to a query sequence evaluator in a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions. The query sequence evaluator includes: a first input to receive a query sequence evaluation request indicating a query protein amino acid sequence to be evaluated with respect to the motif; a second input to receive information descriptive of the record from the database; and an output to develop a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.

In one embodiment, the score comprises a logarithm of a product of the selected preference values. In another embodiment, the score comprises an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.

In another aspect, the invention is directed to a query sequence evaluation system in a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions. The query sequence evaluation system includes: a query sequence evaluator to develop on an output a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence; and a query sequence user interface having an input to receive the score and to develop on an output display information descriptive of the score for output to a display device. In one embodiment, the display information includes information descriptive of: a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In another embodiment, the display information includes information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In a further embodiment, the display information includes information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well. 
In another embodiment, the display information includes information descriptive of: a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.
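One way to obtain the position and sub-sequence at which a motif matches a query sequence particularly well is to score every candidate site and keep the best one. The Python sketch below assumes a sliding scan over occurrences of the fixed residue, which this passage does not spell out. The motif record, the neutral default of 1.0 for residues missing from a column, and the function names are all hypothetical.

```python
import math

# Hypothetical motif record: preference values at offsets -1 and +1 around a
# fixed (non-degenerate) residue. Values are invented for illustration.
FIXED = "S"
PREFS = {
    -1: {"R": 3.0, "K": 2.0, "D": 0.5},
    +1: {"P": 4.0, "G": 0.5},
}

def scan(sequence, fixed, prefs, default=1.0):
    """Return (position, sub-sequence, score) for each candidate site.

    A candidate site is an occurrence of the fixed residue whose flanking
    offsets all fall inside the sequence. The score is the sum of the
    logarithms of the selected preference values. Residues absent from a
    column are treated as neutral (assumption: default of 1.0).
    """
    lo, hi = min(prefs), max(prefs)
    hits = []
    for i, aa in enumerate(sequence):
        if aa != fixed or i + lo < 0 or i + hi >= len(sequence):
            continue
        score = sum(
            math.log(column.get(sequence[i + offset], default))
            for offset, column in prefs.items()
        )
        hits.append((i, sequence[i + lo : i + hi + 1], score))
    return hits

def best_hit(hits):
    """Pick the highest-scoring candidate site."""
    return max(hits, key=lambda h: h[2])

hits = scan("MDSGARSPKL", FIXED, PREFS)
print(best_hit(hits))
```

Positions are 0-based here. A display layer could place a motif identifier at the returned position and highlight the returned sub-sequence.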

In another aspect, the invention is directed to a query sequence evaluation system for evaluating a query protein amino acid sequence with respect to a motif corresponding to a target of a domain of a protein having a known function. The query sequence evaluation system operates in a system including a database including a record for the motif, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions. The query sequence evaluation system includes query sequence evaluation means for calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence. In one embodiment, the query sequence evaluation means includes means for calculating the score as a logarithm of a product of the selected preference values. In another embodiment, the query sequence evaluation means includes means for calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values. In a further embodiment, the query sequence evaluation means includes means for generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids.
In another embodiment, the query sequence evaluation system further includes first calculation means for calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and second calculation means for calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences. In a further embodiment, the second calculation means includes means for generating a histogram of the scores of the plurality of amino acid sequences; means for identifying a position of the score of the query protein amino acid sequence within the histogram; and means for calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences. In another embodiment, the query sequence evaluation system further comprises graphical display generation means for generating a graphical display on a display device, the graphical display including information descriptive of the score.
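The percentile calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the specification's implementation: a sorted list stands in for the histogram, the "one side" is taken to be the scores strictly below the query score, and the reference scores are invented.

```python
from bisect import bisect_left

def percentile_score(query_score, reference_scores):
    """Fraction of reference scores lying strictly below the query score.

    Assumption: a sorted list replaces the histogram, and the scores that
    "lie to one side" are taken to be those below the query score.
    """
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, query_score)
    return below / len(ordered)

# Hypothetical scores of a reference set of sequences against one motif.
reference = [0.2, 1.5, -0.3, 2.8, 0.9, 1.1, -1.0, 0.4, 3.5, 0.0]
print(percentile_score(2.0, reference))
```

With a score convention in which lower values indicate a better match, the same function can be used by counting the other side instead, i.e. taking one minus the fraction at or below the query score.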

In one embodiment, the graphical display generation means comprises means for generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence and means for generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In another embodiment, the graphical display generation means comprises means for generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well. In a further embodiment, the graphical display generation means comprises means for generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif and means for generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.

Brief Description of Drawings

FIG. 1 is a dataflow diagram of an amino acid evaluation system according to one embodiment of the present invention.

FIG. 2 is a table including preference values for a motif corresponding to a target of a domain of a protein.

FIG. 3 is a flow chart of a method for evaluating a query amino acid sequence.

FIGS. 4A-B are flow charts of methods for generating a score for a query amino acid sequence with respect to a motif for a domain of a protein.

FIG. 5 is a flow chart of a method for generating a percentile score for a query amino acid sequence with respect to a motif for a domain of a protein.

FIGS. 6A-C are diagrams of graphical displays displaying information descriptive of an evaluation of a query amino acid sequence with respect to multiple motifs.

FIGS. 7A-C are flow charts of illustrative methods that may be used to generate the graphical displays of FIGS. 6A-C.

Detailed Description

The present invention provides a method and apparatus for evaluating a query protein amino acid sequence with respect to an amino acid sequence motif for a domain of a known protein having a known function to predict whether the query protein amino acid sequence performs the known function. In one embodiment, the prediction is performed by using a scoring system to generate a score for the query protein amino acid sequence with respect to the amino acid sequence motif. The score indicates a degree of confidence that the query protein amino acid sequence performs the known function. Several embodiments of scoring systems are described in more detail below.

The invention is directed to a system which allows a user to scan a protein sequence for novel functions, interactions, post-translational modifications and other structural characteristics. Application of the invention is not limited to proteins previously identified in the laboratory, but can also be used to deduce this information for theoretical proteins whose sequences are obtained from genome sequencing initiatives such as the Human Genome Project. Consequently, the system disclosed herein can be used to suggest the functional roles and pathways in which a novel protein may be involved, and to provide directions towards identifying other proteins that are likely to interact with the target protein of interest. In particular, the system searches a query protein sequence to identify amino acid motifs that are likely to interact with proteins, e.g., as enzyme substrates or other protein binding domains. In the preferred embodiments, the system is a web-based system.

The approaches described herein for evaluating an amino acid sequence and predicting the function of a previously uncharacterized sequence are based on three premises: (1) information contained in the linear sequence of amino acids (primary structure) is sufficient to provide many clues about the function of either known or novel proteins; (2) many proteins within the eucaryotic cell function either as components of larger molecular complexes dominated (at least in part) by protein-protein interactions, or as enzymes whose activity is regulated by post-translational modifications such as phosphorylation or by protein-protein interactions; and (3) small sequence motifs within proteins are sufficient to provide specificity for directing modular protein-protein interactions or post-translation modifications. The techniques described herein provide a highly reliable method for analyzing an uncharacterized query sequence based upon these premises. In general, the primary data from which the system predicts protein motifs is based upon a database containing information generated using oriented degenerate peptide libraries (ODPL).

An ODPL contains a mixture of library members which differ from one another in amino acid sequence but which, in general, contain the same amino acid located at a fixed amino acid position (referred to herein as the "non-degenerate" position). A position within each library peptide that is occupied by a different amino acid in different peptides (i.e., not fixed) is referred to herein as a "degenerate position".

The ODPL contain at least one fixed, non-degenerate amino acid position and several degenerate amino acid positions. In one embodiment, the amino acid residues on either side of the fixed non-degenerate amino acid residue are degenerate (e.g., immediately N-terminal and C-terminal to the non-degenerate residue), thus enabling one to determine an interaction site motif for the region surrounding the fixed amino acid residue. For example, four amino acid residues located on each side of the non-degenerate amino acid residue can be degenerate (e.g., positions -4, -3, -2, -1, +1, +2, +3, +4, relative to the non-degenerate amino acid residue at position 0, can be degenerate). The degenerate positions in the peptides of an oriented degenerate cyclic peptide library can be created such that any one of the twenty natural amino acids, as well as unnatural α-amino acids, can occupy those positions. However, in order to reduce "background" events (e.g., enzymatic events at a residue other than a fixed residue), in one embodiment the degenerate positions do not contain amino acid residues that can be acted upon by the particular binding compound being examined. Thus, for example, when the binding compound is a protein-serine/threonine kinase and the fixed residue is a serine or threonine, it is preferred that the degenerate positions not contain serine or threonine. Likewise, for a protein-tyrosine specific kinase, where the fixed residue is a tyrosine, it is preferred that the degenerate positions not contain tyrosine.
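The library layout described above (a fixed residue at position 0, degenerate positions -4 to +4, and exclusion of residues the binding compound could act upon) can be represented with a small Python sketch. The function names and the restriction to the twenty natural amino acids are assumptions made for illustration only.

```python
NATURAL = "ACDEFGHIKLMNPQRSTVWY"  # the twenty natural amino acids

def degenerate_alphabet(excluded):
    """Residues allowed at degenerate positions after removing `excluded`."""
    return [aa for aa in NATURAL if aa not in excluded]

def library_layout(fixed, flank=4, excluded=""):
    """Map each offset (-flank..+flank) to the residues allowed there."""
    allowed = degenerate_alphabet(excluded)
    layout = {}
    for offset in range(-flank, flank + 1):
        layout[offset] = [fixed] if offset == 0 else allowed
    return layout

def library_size(layout):
    """Number of distinct peptide sequences the layout can encode."""
    size = 1
    for residues in layout.values():
        size *= len(residues)
    return size

# Hypothetical Ser/Thr kinase library: fixed Ser at position 0, and no
# Ser or Thr at the eight degenerate positions.
layout = library_layout("S", flank=4, excluded="ST")
print(sorted(layout))
print(library_size(layout))
```

With eighteen residues allowed at each of eight degenerate positions, the layout encodes 18^8 (about 1.1 x 10^10) distinct sequences, which illustrates why such a library is screened as a mixture.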

Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577. Songyang et al. (Cell (1993) 72:767-778) describe a method for determining the sequence specificity of the peptide-binding sites of SH2 domains using oriented degenerate phosphopeptide libraries. In this approach a library of linear peptides containing a fixed phosphotyrosine residue is used to select the optimal phosphopeptide substrates for a particular SH2 domain. A similar approach has been applied to the determination of optimal substrates for protein kinases (see U.S. Patent No. 5,532,167). These methodologies utilized linear peptide libraries composed entirely of naturally-occurring α-amino acids. Likewise, PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577 describes methods and compositions for identifying binding motifs for binding compounds, which utilize cyclic peptides composed of natural and/or unnatural α-amino acids.

In general, the database of peptide motifs is generated by contacting an oriented degenerate peptide library (such as those described in the above-cited references) with a binding compound under conditions which allow for interaction between the binding compound and the ODPL. The binding compound interacts with the ODPL such that a complex is formed between the binding compound and a subpopulation of library members capable of interacting with the binding compound. The subpopulation of library members capable of interacting with the binding compound is then separated from the library members that are incapable of interacting with the binding compound. An amino acid sequence motif is then determined for an interaction site of the binding compound, based upon the relative abundance of different amino acid residues at each degenerate position within the linearized library members.

The term "binding compound", as used herein, refers to compounds which can interact with an interaction site on a peptide by one or more mechanisms. According to the invention, exemplary binding compounds include enzymes and binding proteins such as kinases (e.g., protein serine/threonine kinases, protein tyrosine kinases, lipid kinases), phosphatases (e.g., protein phosphatases, lipid phosphatases), proteases (e.g., serine proteases, cysteine proteases), and binding proteins containing, e.g., SH2 domains, SH3 domains, antibodies, WW domains, PTB domains, PDZ domains, LIM domains, pleckstrin homology domains, zinc finger domains, extracellular growth factors and receptors, adhesion molecules, intercellular signaling molecules, 7-transmembrane receptor proteins, ion channels, methyltransferases, ubiquitinating enzymes and peptidyl-transferases.

The term "interaction" refers to attractive forces which physically combine a binding compound with an interaction site on an oriented degenerate peptide library member. Such attractive forces include hydrophobic interactions, hydrophilic interactions, covalent binding, ionic binding, charged interactions, etc.

The phrase "an amino acid sequence motif for an interaction site" is intended to describe a composite amino acid sequence which represents a consensus sequence for an interaction site. In general, an amino acid sequence motif encompasses the region of the peptide which includes and surrounds an amino acid residue(s) which specifically and preferentially interacts with a binding compound.

The specific conditions under which the binding compound and oriented degenerate peptide library (ODPL) are contacted will vary depending upon the particular binding compound and ODPL used but are chosen such that a complex can form between the binding compound and a subpopulation of library members that are capable of interacting with the binding compound. When the binding compound is an enzyme, the enzyme and the ODPL are contacted under conditions that maintain the enzymatic activity of the enzyme (e.g., a kinase is incubated with the ODPL under conditions that allow for phosphorylation of the library members by the kinase).

After complexes have formed between the binding compound and the subpopulation of library members that are capable of interacting with the binding compound (referred to as the "binding subpopulation"), the binding subpopulation is separated from the non-binding subpopulation (e.g., those library members that do not interact with the binding compound). If applicable, the binding subpopulation is linearized.

The method for separating the binding subpopulation of library members from the non-binding subpopulation of library members will depend upon the particular binding compound used. For example, in one embodiment, the binding compound is immobilized on a solid support (e.g., a column) and the binding subpopulation of library members remains bound to the immobilized binding compound while the non-binding subpopulation is washed away. Standard methods for affinity chromatography can be used to afford such separation. Binding compounds can be immobilized to a solid support using methods known in the art. For example, the binding compound can be prepared as a glutathione-S-transferase (GST) fusion protein and immobilized by binding the GST fusion protein to glutathione agarose beads. Such an approach is suitable for many types of binding compounds but is particularly preferred for binding domains that mediate protein-protein interactions (such as SH2 and SH3 domains) but that do not have enzymatic activity.

Alternatively, the binding compound has an enzymatic activity and separation of the binding subpopulation of library members from the non-binding subpopulation is based upon enzymatic modification of the binding subpopulation by the binding compound. For example, when the binding compound is a kinase, the binding subpopulation of library members becomes phosphorylated while the non-binding subpopulation remains nonphosphorylated (discussed in the Examples). Accordingly, phosphorylated peptides can be separated from nonphosphorylated peptides to achieve separation of the binding subpopulation from the non-binding subpopulation. Similarly, when the binding compound is a phosphatase, the binding subpopulation of library members becomes dephosphorylated while the non-binding subpopulation remains phosphorylated. Accordingly, nonphosphorylated peptides can be separated from phosphorylated peptides to achieve separation of the binding subpopulation from the non-binding subpopulation.

In general, selected binding library members are sequenced by standard amino acid sequencing techniques (e.g., Edman degradation). Automated peptide sequencers can be used to determine the amino acid sequence of the library members. Preferably, the ODPL used is a soluble synthetic peptide library and the subpopulation of peptides is sequenced as a bulk population using an automated peptide sequencer. This approach provides information on the abundance of each amino acid residue at a given cycle in the sequence of the complexed mixture, most importantly at the degenerate positions. For each degenerate position in the selected peptides (e.g., the binding subpopulation), a relative abundance value can then be calculated by dividing the abundance of a particular amino acid residue at that position after library screening (e.g., after peptide complex formation and separation) by the abundance of the same amino acid residue at that position in the starting library.
Thus, the relative abundance (RA) of an amino acid residue Xaa at a degenerate position in the peptide library is defined as:

$$\mathrm{RA} = \frac{\text{amount of Xaa in the population of selected peptides}}{\text{amount of Xaa in the original oriented degenerate peptide library}}$$

The relative abundance value may be corrected for background contamination as described in, for example, PCT/US98/10876.

Amino acid residues which are neither enriched for nor selected against in the population of peptides which can serve as substrates for, or bind to, the binding compound will have a relative abundance of 1.0. Those amino acid residues which are preferred at a particular degenerate position (e.g., residues which are enriched at that position in the complexed peptides) will have a relative abundance greater than 1.0. Those amino acid residues which are not preferred (e.g., residues which are selected against at that position in the complexed peptides) will have a relative abundance less than 1.0. Based upon the relative abundance values for each amino acid residue at a degenerate position, preferred amino acid residues, e.g., amino acid residues with a relative abundance greater than 1.0, can be identified at that position.
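By way of non-limiting illustration, the relative abundance calculation and the interpretation of values above and below 1.0 might be sketched in Python as follows. The function name `relative_abundance` and the sample amounts are hypothetical and are not taken from this disclosure; background correction is omitted.

```python
def relative_abundance(selected, library):
    """Return RA for each residue at one degenerate position.

    RA = amount in the selected peptides / amount in the starting library.
    """
    ra = {}
    for residue, start_amount in library.items():
        if start_amount <= 0:
            continue  # RA is undefined when the residue is absent from the library
        ra[residue] = selected.get(residue, 0.0) / start_amount
    return ra


# Hypothetical sequencer yields (arbitrary units) at a single position.
library_amounts = {"Gly": 10.0, "Lys": 10.0, "Glu": 10.0}
selected_amounts = {"Gly": 15.0, "Lys": 6.0, "Glu": 10.0}

ra = relative_abundance(selected_amounts, library_amounts)
enriched = sorted(res for res, value in ra.items() if value > 1.0)
depleted = sorted(res for res, value in ra.items() if value < 1.0)
print(ra, enriched, depleted)
```

In this made-up data set, Gly is enriched, Lys is selected against, and Glu is neutral with a relative abundance of exactly 1.0.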

Based upon the relative abundance of different amino acid residues at each degenerate position within the population of selected peptides, e.g., phosphorylated peptides, an amino acid sequence motif for an interaction site, e.g., a phosphorylation site for a protein kinase, can be determined. The amino acid sequence motif encompasses the degenerate region of the peptides. The particular amino acid residues chosen for the motif at each degenerate position are those which are most abundant at each position. Thus, an amino acid residue(s) with a relative abundance value greater than a predetermined threshold (e.g., 1.0) at a particular position may be chosen as the amino acid residue(s) at that position within the amino acid sequence motif. The amino acid sequence motif may, therefore, contain any number of amino acids at each position. The predetermined threshold may be any value. Alternatively, all amino acid residues may be included in the motif at each position.
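The threshold-based selection of residues for the motif might, for example, be sketched as follows. This is an illustrative sketch only; `build_motif`, the dictionary layout, and the relative abundance values are assumptions introduced here.

```python
def build_motif(ra_table, threshold=1.0):
    """Keep, at each position, the residues whose RA exceeds the threshold."""
    motif = {}
    for position, ra_values in ra_table.items():
        motif[position] = {
            residue: value
            for residue, value in ra_values.items()
            if value > threshold
        }
    return motif


# Hypothetical relative abundance values at two degenerate positions.
ra_table = {
    -1: {"Asp": 1.8, "Glu": 1.4, "Lys": 0.6},
    +1: {"Met": 2.1, "Gly": 0.9, "Pro": 1.0},
}

motif = build_motif(ra_table, threshold=1.0)
print(motif)
```

Setting the threshold to zero in this sketch retains every residue with a positive relative abundance, which corresponds to the alternative in which all amino acid residues are included at each position.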

These and other aspects of the invention are described in greater detail in reference to the following figures and description. Although the following description illustrates the application of the system to the analysis of a query protein for the presence of an amino acid sequence motif for an interaction site that is phosphorylated by a protein kinase, it is to be understood that the system can be used for the analysis of virtually any type of binding compound as discussed above.

Referring to FIG. 1, in one embodiment, an amino acid sequence evaluation system 100 is provided to evaluate query protein amino acid sequences with respect to amino acid sequence motifs for domains of proteins having known functions and to generate, for each of the motifs, a quantitative score or scores for the presence of the motif in the query protein amino acid sequence. The quantitative nature of the scores indicates a degree of confidence that the query protein amino acid sequence performs the same function or functions as the proteins having known functions. The system 100 includes a peptide library database 102 including records 104a-n descriptive of motifs for domains of proteins having known functions. In one embodiment, the records 104a-n in the peptide library database 102 are developed according to the oriented peptide library approach described above. The contents and function of the peptide library database 102 are described in more detail below with respect to FIG. 2.

The amino acid sequence evaluation system 100 also includes a query sequence user interface 106 that provides an interface between the system 100 and users of the system 100. The query sequence user interface 106 accepts input from the user and generates display information 110 that is displayed to the user on a display device 112, such as a computer monitor. To perform a search of a query protein amino acid sequence, a user submits user query sequence input 108 to the query sequence user interface 106. The user query sequence input 108 describes the query protein amino acid sequence that the user wishes to search. The user query sequence input 108 may take any form. For example, the user may provide a complete amino acid sequence in single-letter format, or may provide the name of a protein stored in a database, such as the publicly-available Swiss Prot database. The user query sequence input 108 may also include additional information, such as user preferences indicating how the search is to be performed. For example, the user query sequence input 108 may indicate that only selected motifs in the peptide library database 102 should be included in the search.

Once the user submits the user query sequence input 108 to the query sequence user interface 106, the query sequence user interface 106 generates and submits a query sequence evaluation request 114 to a query sequence evaluator 116. The query sequence evaluation request 114 may, for example, include a description of the query protein amino acid sequence to be searched as well as information descriptive of the user's preferences (e.g., which motifs represented in the peptide library database 102 are to be included in the search).

After receiving the query sequence evaluation request 114, the query sequence evaluator 116 evaluates the query protein amino acid sequence contained in the query sequence evaluation request 114. For example, the query sequence evaluator 116 may evaluate the query protein amino acid sequence against the motifs 104a-n represented in the peptide library database 102 to generate query sequence evaluation results 118 indicative of the presence of the motifs 104a-n in the query protein amino acid sequence. Examples of methods for generating such evaluation results 118 are described in more detail below with respect to FIGS. 4A-B. The query sequence evaluator 116 transmits the query sequence evaluation results 118 to the query sequence user interface 106, which generates and transmits display information 110 descriptive of the query sequence evaluation results 118 to the display device 112 in a format suitable for browsing by the user. The user may browse the display information 110 using user display navigation commands 120.

A query sequence score for a motif corresponding to a target of a domain of a known protein having a known function may be used to predict whether the query protein amino acid sequence performs the same or a similar function. For example, the range of possible scores may be ordered such that an amino acid sequence having a higher score is predicted to be more likely to perform the same function as the known protein than an amino acid sequence having a lower score.

The various elements shown in FIG. 1 may be implemented in any of numerous ways. For example, each of the query sequence evaluator 116 and the query sequence user interface 106 may be implemented as a computer program residing in a computer-readable memory, such as a random-access memory (RAM). Such computer programs include, for example, standalone applications, background processes, plug-ins, and dynamic link libraries, either alone or in combination. The query sequence user interface 106 and the query sequence evaluator 116 may be implemented as programs executing on a single computer or on different computers, or may be combined into a single computer program executing on a single computer or distributed over a network.

The query sequence user interface 106 may, for example, be implemented as a web page displayable by a standard web browser. The query sequence evaluator 116 may be implemented as a web-compatible server accessible to the query sequence user interface 106 over a network, such as an intranet or internet (e.g., the public Internet).

The query sequence evaluation results 118 may include any information indicative of the presence of motifs from the peptide library database 102 in the query protein amino acid sequence. For example, in one embodiment, the query sequence evaluation results 118 include quantitative scores for the query protein amino acid sequence with respect to motifs represented in the peptide library database 102. A score for the query protein amino acid sequence with respect to a particular motif may, for example, indicate a degree of confidence that the query protein amino acid sequence contains the motif. Such a score may therefore indicate whether the query protein amino acid sequence performs a function that is the same as or similar to a function performed by proteins containing the motif for the particular domain.

The query sequence evaluation results 118 may include multiple scores for the query protein amino acid sequence with respect to a single motif, each of the scores corresponding to a different subsequence within the query protein amino acid sequence. For example, in one embodiment (described in more detail below with respect to FIG. 3), subsequences of the query protein amino acid sequence are evaluated with respect to each of the motifs in the peptide library database 102. In such an embodiment, the query sequence evaluation results 118 may include an evaluation result (e.g., a quantitative score) for each subsequence with respect to each motif. In a further embodiment, the query sequence evaluation results 118 include, for each evaluation result (e.g., quantitative score), an identifier identifying the subsequence of the query protein amino acid sequence to which the evaluation result corresponds. The identifier may, for example, indicate the position of the beginning of the subsequence within the query protein amino acid sequence.

In another embodiment, the query sequence evaluation results 118 include only selected evaluation results. For example, the query sequence evaluation results 118 may include only evaluation results for motifs that match the query protein amino acid sequence (or subsequences of it) particularly well. For example, if the query sequence evaluation results 118 include quantitative scores, the query sequence evaluation results 118 may include only those quantitative scores that satisfy a predetermined threshold. In a further embodiment, the query sequence user interface 106 generates display information 110 only for selected ones of the query sequence evaluation results 118. For example, the query sequence user interface 106 may generate display information 110 only for motifs that match the query protein amino acid sequence (or subsequences of it) particularly well. For example, if the query sequence evaluation results 118 include quantitative scores, the query sequence user interface 106 may generate display information 110 only for those quantitative scores that satisfy a predetermined threshold.

Each of the records 104a-n in the peptide library database 102 contains information descriptive of a motif for a domain of a protein. The peptide library database 102 may be any kind of database capable of storing information descriptive of motifs. Although, as described herein, the motif records 104a-n in the peptide library database 102 may be generated according to the oriented peptide library approach, this is not a limitation of the present invention. Rather, the records 104a-n may be generated in any way. The oriented peptide library approach is described in detail in Cantley et al., U.S. Pat. No. 5,532,167, entitled "Substrate Specificity of Protein Kinases," and incorporated herein by reference in its entirety. Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577.

As described in more detail below as an illustrative example, the amino acid sequence motifs determined by the oriented peptide library approach are useful for predicting whether a query protein is a substrate for a particular protein kinase. The primary amino acid sequence of a query protein can be examined for the presence of the determined amino acid sequence motif. If the same or a very similar motif is present in the protein, it can be predicted that the protein could function as a substrate for that protein kinase.

Although the motifs may correspond to domains for proteins having known functions, the functions of the proteins to which the motifs correspond need not be known. For example, if the functions performed by the proteins corresponding to the motif records 104a-n in the database are not known, the techniques described herein may still be used to determine degrees of correspondence between a query protein amino acid sequence and motifs represented in the peptide library database 102. Such degrees of correspondence may subsequently become useful if the functions of peptides corresponding to the motifs are later discovered, or if the function performed by the query protein amino acid sequence is known.

In one embodiment, the records 104a-n in the peptide library database 102 are represented as tables. An example of such a table 300 is shown in FIG. 2. Assume for purposes of the following description that the table 300 corresponds to the record 104a (Motif 1) in the peptide library database 102 (FIG. 1). The columns in the table 300 correspond to positions in Motif 1, and the rows in the table 300 correspond to amino acids. In this example, Motif 1 includes nine positions numbered -4 through +4, in which positions -4 through -1 and +1 through +4 are degenerate positions, and in which position zero is a non-degenerate position. Although Motif 1 includes both degenerate and non-degenerate positions, the peptide library database 102 may include records corresponding to motifs having any number of degenerate and non-degenerate positions. As shown in FIG. 2, the single non-degenerate position (position zero) in Motif 1 corresponds to Tyrosine. Each cell of the table 300 at a particular row (corresponding to an amino acid) and column (corresponding to a position) contains a preference value corresponding to the relative abundance of the amino acid at the position. For example, the preference value of Lysine at position -1 is 0.60033, which is the value stored at the row corresponding to Lysine (Lys) and the column numbered -1. Thus, the preference value of any amino acid with respect to any position of Motif 1 can be readily determined by reference to the table 300. The table 300 need not include preference values for all amino acids or within all cells. In the case that the table 300 does not include a preference value for a particular amino acid at a particular location in the corresponding motif, a suitable default value may be substituted. For example, in one embodiment, in which the motif includes only selected amino acid residues at each position (e.g., amino acid residues exceeding a predetermined threshold), amino acids which are not in the motif at a particular position are assigned a default preference value at that position, such as one or zero.

The user may provide the user query sequence input 108 using any suitable input device, such as a standard keyboard or mouse.
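As a non-limiting sketch, a motif record of the kind represented by the table 300, together with the substitution of a default preference value for missing cells, might be modeled as a nested dictionary. The names `MOTIF_1_EXCERPT` and `preference_value` are hypothetical, and the excerpt holds only three preference values quoted in this description rather than the full contents of FIG. 2.

```python
# Excerpt of a motif record: residue -> {position -> preference value}.
# Only three values quoted in this description are included; a complete
# record would hold a value for each residue at each position.
MOTIF_1_EXCERPT = {
    "Gly": {-4: 1.5284},
    "Lys": {-1: 0.60033},
    "Tyr": {0: 16.0},
}


def preference_value(record, residue, position, default=1.0):
    """Look up a preference value, falling back to a default for empty cells."""
    return record.get(residue, {}).get(position, default)


print(preference_value(MOTIF_1_EXCERPT, "Lys", -1))
print(preference_value(MOTIF_1_EXCERPT, "Lys", +2))             # missing cell
print(preference_value(MOTIF_1_EXCERPT, "Trp", 0, default=0.0))  # missing row
```

The choice between a default of one and a default of zero is left to the caller in this sketch, mirroring the two defaults mentioned above.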

Referring to FIG. 3, in one embodiment, the query sequence evaluator 116 evaluates a query protein amino acid sequence according to a process 301. The query sequence evaluator 116 receives the query sequence evaluation request 114 from the query sequence user interface 106 (step 302). As described above, the query sequence evaluation request 114 includes a description of the query protein amino acid sequence to be searched, and may further include additional information such as information indicating which of the records 104a-n in the peptide library database 102 are to be included in the search. For each subsequence s in the query protein amino acid sequence beginning at position p_s within the query protein amino acid sequence (step 304), and for each motif m represented in the peptide library database 102 (step 306), the query sequence evaluator 116 evaluates the subsequence s with respect to motif m (step 308). Examples of ways in which the query sequence evaluator 116 may evaluate the query protein amino acid sequence are described in more detail below with respect to FIGS. 4A-C.

Evaluation of the query protein amino acid sequence (step 308) produces a query sequence evaluation result (e.g., a quantitative score) that is stored for future use (step 310). The evaluation result may, for example, be stored in a two-dimensional Results array at column s and row m. The evaluation result may be stored in any manner, as long as the evaluation result is associated with the amino acid subsequence and the motif to which it corresponds.

Steps 308-310 are repeated for the remaining motifs (step 312) and the remaining subsequences of the query protein amino acid sequence (step 314). Although the illustrative process 301 shown in FIG. 3 evaluates all subsequences of the query protein amino acid sequence with respect to all motifs within the peptide library database 102, fewer than all subsequences of the query protein amino acid sequence may be evaluated with respect to fewer than all of the motifs represented in the peptide library database 102. After all of the query sequence evaluation results 118 have been generated, they are transmitted to the query sequence user interface 106 (step 316).
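The nested loop over subsequences and motifs (steps 304 through 314) might, purely by way of example, be sketched as follows. The function names, the representation of a motif as a list of allowed-residue sets, and the placeholder scoring function `count_matches` are assumptions made for this sketch; any scoring method could be passed in its place. Results are keyed by the starting index of the subsequence and the motif name so that each result stays associated with both.

```python
def count_matches(window, motif):
    """Placeholder score: number of positions whose residue the motif allows."""
    return sum(1 for residue, allowed in zip(window, motif) if residue in allowed)


def evaluate_query(sequence, motifs, score_fn):
    """Score each full-length window of the query against each motif."""
    results = {}
    for start in range(len(sequence)):                # step 304: each subsequence
        for name, motif in motifs.items():            # step 306: each motif
            window = sequence[start:start + len(motif)]
            if len(window) < len(motif):
                continue                              # window runs off the end
            results[(start, name)] = score_fn(window, motif)  # steps 308-310
    return results


# Hypothetical three-position motif, written as one set of residues per position.
motifs = {"toy": [{"D"}, {"Y"}, {"M"}]}
results = evaluate_query("GNGDYMPMS", motifs, count_matches)
best = max(results, key=results.get)
print(best, results[best])
```

Here the zero-based starting index serves as the subsequence identifier; a different convention (for example, the index of the residue aligned with position zero of the motif) could equally be used.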

In one embodiment of the present invention, a method referred to as the "log-sum method" is used to evaluate the query protein amino acid sequence with respect to a motif. In particular, the log-sum method may be used to implement step 308 (FIG. 3), as shown by the process 400 in FIG. 4A. The log-sum method assigns a score to a query protein amino acid sequence based on the logarithm of the product of each amino acid's preference value in the motif against which the query protein amino acid sequence is being scored. This is mathematically equivalent to summing the logarithms of each amino acid's preference value. Note also that the logarithm of the preference value may be considered a reflection of the chemical binding energy. The preference values may be normalized before being used in the log-sum method. For example, the values in each of the columns in the table 300 (FIG. 2) are normalized to a sum of fifteen.

Referring to FIG. 4A, the query sequence evaluator 116 may generate a score for a subsequence beginning at position p_s of the query protein amino acid sequence with respect to a motif m using the process 400 according to the log-sum method as follows. The process 400 is described with respect to an illustrative example in which the query protein amino acid sequence is GNGDYMPMS and the record in the peptide library database 102 corresponding to the motif m contains the preference values shown in table 300 (FIG. 2). The query sequence evaluator 116 initializes the score to a value of zero (step 402), and identifies the record r in the peptide library database 102 corresponding to the motif m for which the score is being generated (step 404). In this example, the contents of the record r that is retrieved are shown in the table 300 (FIG. 2). The query sequence evaluator 116 then enters a loop over each position p_m in the motif (step 406). In this example, the positions of the motif are numbered -4 through +4.

For each such position p_m, the query sequence evaluator 116 identifies the amino acid in the query sequence at position (p_s + p_m) (step 408). In this example, the amino acid in the query protein amino acid sequence (GNGDYMPMS) at the first position (position -4) is G. The query sequence evaluator 116 retrieves, from the record r, the preference value pv of the identified amino acid at position p_m of the motif (step 410). In this example, the preference value of G (Gly) at position -4 is 1.5284. This is represented by the contents of the table 300 at the row labeled "Gly" and the column numbered -4. If there is no preference value in the motif at position p_m for the identified amino acid, the query sequence evaluator 116 may substitute an appropriate default value, such as 1. Note that, in this example, there is an amino acid (Tyr) at a non-degenerate position in the motif, which contains a preference value of 16. The preference value for an amino acid at a non-degenerate position may, however, be assigned any preference value. Preference values less than one may be neglected. An alternative approach is to only score subsequences containing the non-degenerate amino acid in the correct position within the subsequence.

The query sequence evaluator 116 updates the score for the motif by adding to the score the logarithm (e.g., the natural logarithm or the base ten logarithm) of the preference value pv (step 412). The preference values at a particular position within the motif m may be normalized before calculation of the logarithm at step 412. In this example, the base ten logarithm of the preference value 1.5284 is approximately 0.1842. Steps 408-412 are repeated for the remaining amino acids in the query protein amino acid sequence. At the end of the loop (step 414), the score is equal to the sum of the logarithms of the preference values of each of the amino acids in the query protein amino acid sequence. In this example, the score is approximately equal to 2.42. It should be appreciated that the score may be alternatively and equivalently calculated by initializing the score to a value of one at step 402, by multiplying the score by the preference value pv at step 412, and by calculating the log of the score after step 414. Substitution of these steps produces an equivalent result because the sum of logarithms of a plurality of values is equal to the logarithm of the product of the values. In one embodiment, each preference value pv retrieved in step 410 is increased by a constant value c before the logarithm of the preference value pv is calculated in step 412. This addition of the constant c may be used to shift the retrieved preference values toward a region of the logarithm function in which lower preference values are less heavily weighted. The constant value c may be chosen in any way and may be any value. Preference values that are less than one may be neglected or raised to be equal to one.
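The log-sum method might, as one possible sketch, be written as follows. The names `log_sum_score`, `log_product_score`, and `MOTIF_EXCERPT` are hypothetical, base ten logarithms are assumed, and the excerpt contains only the two preference values quoted above (Gly at position -4 and Tyr at position zero), with every other cell defaulting to one. The resulting score therefore differs from the value of approximately 2.42 obtained with the complete table 300.

```python
import math

POSITIONS = range(-4, 5)  # motif positions -4 through +4

# position -> {one-letter residue -> preference value}; sparse excerpt only.
MOTIF_EXCERPT = {-4: {"G": 1.5284}, 0: {"Y": 16.0}}


def log_sum_score(window, motif, default=1.0, c=0.0):
    """Sum of log10(preference value + c) over the motif positions."""
    score = 0.0                                        # step 402
    for p_m, residue in zip(POSITIONS, window):        # steps 406-408
        pv = motif.get(p_m, {}).get(residue, default)  # step 410
        score += math.log10(pv + c)                    # step 412
    return score


def log_product_score(window, motif, default=1.0):
    """Equivalent form: multiply the preference values, then take one log."""
    product = 1.0
    for p_m, residue in zip(POSITIONS, window):
        product *= motif.get(p_m, {}).get(residue, default)
    return math.log10(product)


score = log_sum_score("GNGDYMPMS", MOTIF_EXCERPT)
print(score, log_product_score("GNGDYMPMS", MOTIF_EXCERPT))
```

The optional argument `c` corresponds to the constant added to each preference value before the logarithm is taken; it is zero by default in this sketch.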

In one embodiment of the present invention, a method referred to as the "entropy method" is used to score for the presence of a motif in a query sequence. According to the entropy method, the preference value of each amino acid at each position in the motif is translated to a probability p_i (with the sum of all of the probabilities at a position being equal to one), the -log₂ of p_i is used as a measure of the relative "entropic density" of the amino acid in that position, and the cumulative score is calculated. The value at each position ranges from infinity (for the worst match, i.e., a residue that is never present in the motif) to zero for a perfect match (p_i = 1). The resulting score is then averaged over the number of the degenerate positions in the motif. A lower score indicates a higher degree of confidence that the query protein amino acid sequence includes the motif, while a higher score indicates a lower degree of confidence.

Referring to FIG. 4B, the query sequence evaluator 116 may evaluate an amino acid subsequence with respect to a motif m (step 308 of FIG. 3) using a process 420 according to the entropy method as follows. The process 420 is described with respect to an illustrative example in which the query protein amino acid sequence is GSEEYMNMD and the record in the peptide library database 102 corresponding to the motif contains the preference values shown in table 300 (FIG. 2). The query sequence evaluator 116 initializes the score to a value of zero (step 422), and identifies the record r in the peptide library database 102 corresponding to the motif for which the score is being generated (step 404). In this example, the contents of the record r that is retrieved are shown in the table 300 (FIG. 2). For each position p_m in the motif m, the query sequence evaluator 116 normalizes the preference values at position p_m by translating the preference values into probabilities, with the sum of the probabilities being equal to one (step 426). This may be performed by, for example, summing all of the preference values at a particular position and dividing each of the preference values by the sum. These probabilities are used by all subsequent steps of the process 420 described below. The query sequence evaluator 116 then enters a loop over each position p_m in the motif m (step 406). In this example, the positions of the motif are numbered -4 through +4. For each such position p_m, the query sequence evaluator 116 identifies the amino acid in the query sequence at position (p_s + p_m) (step 430). In this example, the amino acid in the query protein amino acid sequence (GSEEYMNMD) at the first position (position -4) is G. The query sequence evaluator 116 retrieves, from the record r, the probability value of the identified amino acid at position p_m of the motif m (step 432). In this example, the probability value of G (Gly) at position -4 is 0.1019.

The query sequence evaluator 116 updates the score for the motif m by adding the negative logarithm (e.g., the natural logarithm or the base ten logarithm) of the retrieved probability value to the score (step 434). Steps 430-434 are repeated for the remaining amino acids in the query amino acid subsequence. After the loop completes (step 436), the score is divided by the number of positions in the motif m (in this example, nine) to obtain a final score (step 438). In this example, the final score is approximately equal to 3.3691.
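A minimal sketch of the entropy method, offered for illustration only, is given below. The function names and the two-position toy motif are hypothetical, base two logarithms are assumed, and a residue with no preference value at a position is treated as having probability zero so that the score becomes infinite. The values of the table 300 are not reproduced here.

```python
import math


def to_probabilities(column):
    """Normalize the preference values at one position so they sum to one."""
    total = sum(column.values())
    return {residue: value / total for residue, value in column.items()}


def entropy_score(window, motif_columns):
    """Average of -log2(probability) over the motif positions."""
    score = 0.0                                    # step 422
    for residue, column in zip(window, motif_columns):
        probs = to_probabilities(column)           # step 426
        p = probs.get(residue, 0.0)                # steps 430-432
        if p == 0.0:
            return math.inf                        # residue never seen at this position
        score += -math.log2(p)                     # step 434
    return score / len(motif_columns)              # step 438


# Hypothetical two-position motif: one column of preference values per position.
toy_motif = [{"A": 3.0, "G": 1.0}, {"Y": 4.0}]

for window in ("AY", "GY", "CY"):
    print(window, entropy_score(window, toy_motif))
```

In this toy example the better-matching window "AY" receives a lower score than "GY", consistent with a lower score indicating a higher degree of confidence. The same normalization applied to a column summing to fifteen maps the preference value 1.5284 to approximately 0.1019.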

In one embodiment of the present invention, a method that takes advantage of quantitative structure-activity relationships (QSAR) is used to score for the presence of a motif in a query sequence. According to QSAR, quantitative scales are assigned to physicochemical properties of amino acids. For example, in one embodiment, a first scale (labeled z₁) is assigned to amino acid hydrophilicity, a second scale (labeled z₂) is assigned to size, and a third scale (labeled z₃) is assigned to polarity (electronic effects). For purposes of the following discussion, these scales are referred to as "z scales." QSAR is described in more detail in "Minimum analogue peptide sets (MAPS) for quantitative structure-activity relationships," Sven Hellberg et al., Int. J. Peptide Protein Res. 37 (1991), pp. 414-424. Each amino acid has particular values for each z scale. Such values may be obtained experimentally or from pre-existing sources and stored in a database for future use.

A score for a query protein amino acid sequence or other chemical structure with respect to a motif may be calculated using the preference values stored in the peptide library database 102 for the motif and using the z scale values of the amino acids in the motif. Given the known preference values of amino acids at each position within the motif and the known z scale values of each amino acid, an equation can be derived that relates the preference value p_{m,x} of a particular chemical structure at a particular position x of a motif m to the z scale values z₁, z₂, and z₃ of the chemical structure:

$$p_{m,x} = a z_1 + b z_2 + c z_3 \qquad (2)$$

The values of the coefficients a, b, and c are set appropriately to characterize the relationship between the z scale values (z₁, z₂, and z₃) and the preference value p_{m,x}. Equation (2) can therefore be used to calculate a preference value for an amino acid or other chemical structure based on the chemical structure's known z scale values. To calculate a preference value p_{m,x} for a chemical structure at position x of a motif m, the query sequence evaluator 116 merely substitutes the z scale values of the chemical structure into a form of Equation (2) having coefficients a, b, and c with appropriate values. Use of Equation (2) to calculate a preference value may be useful when, for example, the peptide library database 102 does not contain a preference value for a particular amino acid or other chemical structure. Once a preference value has been calculated using Equation (2), the preference value may be stored in the peptide library database 102 and/or used in the calculation of a score for a query protein amino acid sequence using any appropriate method, such as the log-sum method or entropy method, described above.
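For illustration only, the following sketch assumes that Equation (2) is the linear form a·z₁ + b·z₂ + c·z₃ and shows one way the coefficients for a single motif position might be obtained from three structures with known z scale values and known preference values, here by Cramer's rule. The z scale values and preference values below are invented placeholders, not published z scale data, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_coefficients(z_rows, prefs):
    """Solve p = a*z1 + b*z2 + c*z3 for (a, b, c) from three known structures."""
    d = det3(z_rows)
    if d == 0:
        raise ValueError("z scale rows are linearly dependent")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in z_rows]
        for r in range(3):
            m[r][col] = prefs[r]      # Cramer's rule: swap in the preference values
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


def predict_preference(coeffs, z):
    """Apply the assumed linear form of Equation (2) to one structure."""
    a, b, c = coeffs
    return a * z[0] + b * z[1] + c * z[2]


# Placeholder (z1, z2, z3) rows and preference values for three structures.
z_rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
prefs = [2.0, 3.0, 3.5]

coeffs = fit_coefficients(z_rows, prefs)
print(coeffs, predict_preference(coeffs, (2.0, 1.0, 2.0)))
```

With more than three structures of known preference value, a least-squares fit would be a natural alternative to an exact solve; that variation is not shown.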

For purposes of computational efficiency, the coefficients a, b, and c of Equation (2) for each motif m and each position x may be generated prior to evaluation of any query sequences by the query sequence evaluator 116. Such pre-generation of the coefficients allows preference values for chemical structures to be generated quickly during the evaluation process, without having to generate values for the coefficients. The coefficients a, b, and c for a particular motif m at a position x may be generated using any appropriate method. For example, if the preference values and z scale values of at least three amino acids at position x of motif m are known, the coefficients a, b, and c may be obtained by solving Equation (2) using standard algebraic techniques.

After a final score has been calculated for an amino acid sequence (or subsequence) with respect to a motif (e.g., according to the log-sum method or entropy method as described above), a "percentile score" may also be calculated (e.g., by the query sequence evaluator 116 or the query sequence user interface 106). Such a percentile score indicates where the final score for the query protein amino acid sequence ranks compared to the final scores of other amino acid sequences containing the non-degenerate residue when evaluated with respect to the same motif. The percentile score can therefore be useful for interpreting the final score for the query protein amino acid sequence. A flow chart of one example of a process 500 that may be performed (e.g., by the query sequence evaluator 116) to generate such a percentile score is shown in FIG. 5. The query sequence evaluator 116 calculates final scores (e.g., as described above with respect to FIGS. 4A-B) for all amino acid sequences in an amino acid sequence database (step 502), such as the publicly available Swiss Prot database, with respect to the motif, and generates a histogram of the final scores (step 504). The position of the query protein amino acid sequence's final score is identified within the histogram (step 506). The query protein amino acid sequence's percentile score is calculated by dividing the number of scores greater than (or less than, depending on the scoring method used) the final score for the query protein amino acid sequence by the total number of scores in the histogram (step 508).
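The percentile calculation might be sketched as follows, on the assumption that the count of better-scoring sequences is divided by the total number of background scores and expressed as a percentage. The function name `percentile_score`, the `higher_is_better` flag, and the background scores are hypothetical; in practice the background would come from scoring a sequence database against the motif.

```python
def percentile_score(query_score, background_scores, higher_is_better=True):
    """Percentage of background scores that beat the query score."""
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    if higher_is_better:   # e.g., the log-sum method
        better = sum(1 for s in background_scores if s > query_score)
    else:                  # e.g., the entropy method, where lower is better
        better = sum(1 for s in background_scores if s < query_score)
    return 100.0 * better / len(background_scores)


# Hypothetical final scores for ten database sequences against one motif.
background = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

print(percentile_score(4.2, background))                          # log-sum style
print(percentile_score(0.8, background, higher_is_better=False))  # entropy style
```

Under this convention a small percentile score means that few database sequences score better than the query.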

In one embodiment, the histogram of final scores generated in step 504 is used to determine whether to include the score for a query sequence in the query sequence evaluation results 118 or, alternatively, whether to generate display information 110 for the score. For example, in one embodiment the query sequence evaluator 116 only includes the final score for a query protein amino acid sequence in the query sequence evaluation results 118 if the final score falls within a predetermined region of the histogram, such as within the best five percent of scores in the histogram. In another embodiment, the query sequence evaluator 116 only includes the final score for a query protein amino acid sequence in the query sequence evaluation results 118 if the final score is further than two standard deviations from the mean of the histogram. It should be appreciated that other methods may be used to determine whether to include the final score for a query protein amino acid sequence in the query sequence evaluation results 118, and that components other than the query sequence evaluator 116, such as the query sequence user interface 106, may be used to filter final scores.
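The two filtering criteria mentioned above might, as a non-limiting sketch, be expressed as follows. The function names, the treatment of ties at the cutoff, the use of the population standard deviation, and the background scores are all assumptions made for this example.

```python
import statistics


def in_best_fraction(score, background, fraction=0.05, higher_is_better=True):
    """True if the score is at least as good as the cutoff of the best fraction."""
    ranked = sorted(background, reverse=higher_is_better)
    cutoff_index = max(1, int(len(ranked) * fraction)) - 1
    cutoff = ranked[cutoff_index]
    return score >= cutoff if higher_is_better else score <= cutoff


def beyond_n_sd(score, background, n_sd=2.0):
    """True if the score lies more than n_sd standard deviations from the mean."""
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    return abs(score - mean) > n_sd * sd


# Hypothetical background of one hundred final scores: 1.0, 2.0, ..., 100.0.
background = [float(i) for i in range(1, 101)]

print(in_best_fraction(97.0, background), in_best_fraction(90.0, background))
print(beyond_n_sd(120.0, background), beyond_n_sd(60.0, background))
```

Either predicate could be applied by the evaluator before results are transmitted or by the user interface before display information is generated.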

Once a query protein amino acid sequence has been evaluated with respect to one or more motifs represented in the peptide library database 102, the query sequence user interface 106 may generate and transmit display information 110, representing the results of the evaluation, to the display device 112. The display information 110 may include, for example, information to display the final score(s) of the query protein amino acid sequence with respect to one or more of the motifs. In one embodiment, the display information 110 only includes information to display selected final scores of the query protein amino acid sequence, e.g., scores that satisfy a predetermined threshold value. The predetermined threshold value may be any value and may be selected in any manner. Including only selected final scores in the display information 110 may be used to provide the user with a graphical display of the final scores of only those amino acid subsequences that are particularly likely to match the corresponding motifs in the peptide library database 102.

Referring to FIG. 6A, in one embodiment, the display information 110 includes a graphical display 600 that displays potentially matching motifs from the peptide library database 102 superimposed on the domain structure of the query protein amino acid sequence. More specifically, a query protein amino acid sequence graphical element 602 displays the structure of the query protein amino acid sequence as a horizontal strip, with the leftmost edge of the strip representing the first position in the query protein amino acid sequence and the rightmost edge of the strip representing the last position in the query protein amino acid sequence. An x-axis 604, displayed under the query protein amino acid sequence graphical element 602, provides a visible indication of the locations of positions in the query protein amino acid sequence graphical element 602. The user can thus quickly identify the position of any point in the query protein amino acid sequence graphical element 602 by reference to the x-axis 604. In one embodiment, when the query protein amino acid sequence includes known domains, the graphical element 602 is broken into sub-elements each representing a known or putative domain based on sequence homology of the query protein amino acid sequence.

Displayed above the query protein amino acid sequence graphical element 602 are motif identifiers 606a-f, indicating protein domains whose motifs from the peptide library database 102 match subsequences in the query protein amino acid sequence particularly well. For example, the motif identifier 606e indicates that an Abl kinase domain has a motif that matches a subsequence in the query protein amino acid sequence at position Y266 in the query protein amino acid sequence. Motifs may be selected for display in the graphical display 600 by, for example, selecting only those motifs for which the query protein amino acid sequence's final score satisfies a predetermined threshold, as described above. Displaying such motif identifiers 606a-f enables the user to quickly identify those motifs which most closely match the query protein amino acid sequence.

As further shown in FIG. 6A, the motif identifiers 606a-f are positioned along the x-axis 604 at the x coordinates corresponding to the first positions of the domains in the query protein amino acid sequence which they match. For example, the motif identifier 606e (identifying an Abl kinase domain at position Y266) is positioned at x coordinate 266 along the x-axis 604. Displaying the motif identifiers 606a-f at the locations to which they correspond in the query protein amino acid sequence enables the user to visually identify the locations of matching motifs quickly and easily.

Referring to FIG. 7A, in one embodiment the query sequence user interface 106 uses a process 700 to generate the graphical display 600 after the query sequence user interface 106 receives the query sequence evaluation results from the query sequence evaluator 116. The query sequence user interface 106 generates the query protein amino acid sequence element 602 (step 702). The query sequence user interface 106 generates the x-axis 604 (step 704). In one embodiment, the query sequence user interface 106 selects motifs that match the query protein amino acid sequence particularly well, such as by selecting motifs whose scores satisfy a predetermined threshold, as described above (step 706). In one embodiment, the query sequence evaluation results 118 generated by the query sequence evaluator 116 include information describing the positions at which the motifs match the query protein amino acid sequence. The query sequence user interface 106 uses this information to generate motif identifiers for well-matching motifs at the positions where they match the query protein amino acid sequence (step 708).
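Steps 706 and 708 may be sketched as follows. This is a hypothetical illustration only: the `MotifMatch` record, the function `motif_identifiers`, the pixel-based strip width, the assumption that a higher score is better, and all scores other than the 3.4710 value quoted for FIG. 6B are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifMatch:
    motif_name: str  # name of the domain whose motif matched
    position: int    # 1-based position of the match in the query sequence
    score: float     # final score of the match


def motif_identifiers(matches, threshold, sequence_length, strip_width_px):
    """Select well-matching motifs and place each along the x-axis.

    Returns (x coordinate in pixels, label) pairs sorted from left to right.
    """
    identifiers = []
    for match in matches:
        if match.score >= threshold:  # step 706: keep scores satisfying the threshold
            # Step 708: map the sequence position onto the horizontal strip,
            # with position 1 at the left edge and the last position at the right edge.
            x_px = (match.position - 1) * strip_width_px // max(sequence_length - 1, 1)
            identifiers.append((x_px, f"{match.motif_name} @ {match.position}"))
    return sorted(identifiers)


# Illustrative evaluation results; only the position/name pairs echo the figures.
results = [
    MotifMatch("Abl kinase", 266, 3.1),
    MotifMatch("SRC kinase", 141, 3.4710),
    MotifMatch("Hypothetical domain", 10, 0.2),
]
identifiers = motif_identifiers(results, threshold=3.0, sequence_length=501, strip_width_px=500)
```

Integer arithmetic is used for the coordinate mapping so that the placement is exact and reproducible; a real display would also need to handle overlapping labels.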

Referring to FIG. 6B, in one embodiment, the display information 110 includes a graphical display 620 that displays information about a particular motif that matches the query protein amino acid sequence. The graphical display 620 includes a title 622 that displays the name of the domain corresponding to the motif for which information is displayed in the graphical display 620. The graphical display 620 includes rows 624a-c, each of which corresponds to the first position of a domain within the query protein amino acid sequence which the motif matches particularly well. The graphical display includes a position column 626, a score column 628, and a sequence column 630. The value in the position column 626 indicates the position at which the motif matches the query sequence, the value in the score column 628 indicates the score of the motif with respect to the query protein amino acid sequence, and the sequence shown in the sequence column 630 displays the sub-sequence within the query protein amino acid sequence that is considered to be a particularly good match for the motif. For example, the information displayed in the row 624a indicates that the SRC kinase domain matched the query protein amino acid sequence particularly well beginning at position 141 within the query protein amino acid sequence, that the SRC kinase domain had a score of 3.4710 (according to, e.g., the log-sum method), and that the sub-sequence within the query protein amino acid sequence that matched the SRC kinase domain particularly well was DEDIYSGLS. In one embodiment, a plurality of graphical displays similar to graphical display 620 are generated, each of which displays information for a different motif. In one embodiment, the graphical display 600 (FIG. 6A) includes hyperlinks to information contained in the graphical display 620 (FIG. 6B). For example, the motif identifiers 606a-f in the graphical display 600 (FIG. 6A) may include hyperlinks to graphical displays, such as the graphical display 620 (FIG. 6B), displaying information about the corresponding motifs. Thus, by selecting (e.g., with a mouse or keyboard) one of the motif identifiers 606a-f in the graphical display 600, the user can cause the query sequence user interface 106 to generate a graphical display (e.g., graphical display 620 in FIG. 6B) for the motif corresponding to the selected motif identifier.

Referring to FIG. 7B, in one embodiment the query sequence user interface 106 uses a process 720 to generate the graphical display 620 in response to the user's selection of a particular one of the motif identifiers 606a-f in the graphical display 600 (FIG. 6A). In this embodiment, the query protein amino acid sequence includes domains at known positions and the query sequence evaluation results 118 include evaluation results for the selected motif with respect to each of the domains within the query protein amino acid sequence. It should be appreciated, however, that similar methods may be used to generate the graphical display 620 in other embodiments. For each subsequence s (at position p_s) within the query protein amino acid sequence (step 722), the score of subsequence s with respect to the selected motif is retrieved from the query sequence evaluation results (step 724). If the retrieved score satisfies a predetermined threshold (step 726), then the position p_s, the retrieved score, and the sequence of the subsequence s are displayed (step 728), as shown in FIG. 6B. Steps 724-728 are repeated for the remaining domains in the query protein amino acid sequence.
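The loop of steps 722-728 may be sketched as follows. This is a minimal sketch under stated assumptions: the function `motif_rows`, the tuple layout of the evaluation results, the column widths, and the assumption that a higher score is better are hypothetical. The first result echoes the row 624a example above; the second is invented.

```python
def motif_rows(subsequence_results, threshold):
    """Build one display row per subsequence whose score satisfies the threshold.

    subsequence_results is an iterable of (position, score, subsequence) tuples,
    standing in for the stored evaluation results for the selected motif.
    """
    rows = []
    for position, score, subsequence in subsequence_results:  # steps 722 and 724
        if score >= threshold:  # step 726
            # Step 728: position column, score column, sequence column.
            rows.append(f"{position:>8}  {score:>8.4f}  {subsequence}")
    return rows


evaluation_results = [
    (141, 3.4710, "DEDIYSGLS"),  # values quoted for row 624a
    (300, 1.2000, "AAAAYAAAA"),  # hypothetical low-scoring subsequence
]
rows = motif_rows(evaluation_results, threshold=3.0)
```

Only the first subsequence passes the example threshold, so a single row is produced for display.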

As described above with respect to FIG. 5, the query sequence evaluator 116 may generate a percentile score for a query protein amino acid sequence. Referring to FIG. 6C, in one embodiment, the display information 110 includes information descriptive of the percentile score. For example, as shown in FIG. 6C, the display information 110 may describe a graphical display 640. The graphical display 640 includes a histogram display 642 representing the histogram generated in step 504 (FIG. 5). The graphical display 640 also includes a query protein amino acid sequence marker 644 placed on the x-axis that indicates where the final score of the query protein amino acid sequence lies with respect to the final scores of other amino acid sequences represented in the histogram. Using such a display 640, the user can visually identify how the score of the query protein amino acid sequence compares to the scores of other amino acid sequences in the histogram quickly and easily.

Referring to FIG. 7C, in one embodiment the query sequence user interface 106 uses a process 740 to generate the graphical display 640 including a histogram display 642 representing the histogram generated in step 504. The query sequence user interface 106 generates the histogram display 642 (step 742). The query sequence user interface 106 then generates the query protein amino acid sequence marker 644 on the histogram display 642 at a position corresponding to the score (step 744).

A computer system for implementing the system of FIG. 1 as one or more computer programs typically includes a main unit connected to both an output device which displays information to a user and an input device which receives input from a user. The main unit generally includes a processor connected to a memory system via an interconnection mechanism. The input device and output device also are connected to the processor and memory system via the interconnection mechanism.

It should be understood that one or more output devices may be connected to the computer system. Example output devices include a cathode ray tube (CRT) display, a liquid crystal display (LCD), printers, communication devices such as a modem, and audio output. It should also be understood that one or more input devices may be connected to the computer system. Example input devices include a keyboard, keypad, track ball, mouse, pen and tablet, communication device, and data input devices such as sensors. It should be understood that the invention is not limited to the particular input or output devices used in combination with the computer system or to those described herein.

The computer system may be a general purpose computer system which is programmable using a computer programming language, such as C++, Java, or other language, such as a scripting language or assembly language. The computer system may also include specially programmed, special purpose hardware. In a general purpose computer system, the processor is typically a commercially available processor, of which the x86 series and Pentium processors, available from Intel, and similar devices from AMD and Cyrix, the 680X0 series microprocessors available from Motorola, the PowerPC microprocessor from IBM and the Alpha-series processors from Digital Equipment Corporation, are examples. Many other processors are available. Such a microprocessor executes a program called an operating system, of which Windows NT, UNIX, DOS, VMS and OS8 are examples, which controls the execution of other computer programs and provides scheduling, debugging, input/output control, accounting, compilation, storage assignment, data management and memory management, and communication control and related services. The processor and operating system define a computer platform for which application programs in high-level programming languages are written.

A memory system typically includes a computer readable and writeable nonvolatile recording medium, of which a magnetic disk, a flash memory and tape are examples. The disk may be removable, known as a floppy disk, or permanent, known as a hard drive. A disk has a number of tracks in which signals are stored, typically in binary form, i.e., a form interpreted as a sequence of ones and zeros. Such signals may define an application program to be executed by the microprocessor, or information stored on the disk to be processed by the application program. Typically, in operation, the processor causes data to be read from the nonvolatile recording medium into an integrated circuit memory element, which is typically a volatile, random access memory such as a dynamic random access memory (DRAM) or static memory (SRAM). The integrated circuit memory element allows for faster access to the information by the processor than does the disk. The processor generally manipulates the data within the integrated circuit memory and then copies the data to the disk when processing is completed. A variety of mechanisms are known for managing data movement between the disk and the integrated circuit memory element, and the invention is not limited thereto. It should also be understood that the invention is not limited to a particular memory system.

It should be understood that the invention is not limited to a particular computer platform, particular processor, or particular high-level programming language.

Additionally, the computer system may be a multiprocessor computer system or may include multiple computers connected over a computer network. It should be understood that the modules (e.g., 102, 106, 116) in FIG. 1 may be separate modules of a computer program, or may be separate computer programs. Such modules may be operable on separate computers. Data (e.g., 108 and 114) may be stored in a memory system or transmitted between computer systems. The invention is not limited to any particular implementation using software or hardware or firmware, or any combination thereof. The various elements of the system, either individually or in combination, may be implemented as a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Various steps of the process may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions by operating on input and generating output. Computer programming languages suitable for implementing such a system include procedural programming languages, object-oriented programming languages, and combinations of the two.

Having now described a few embodiments, it should be apparent to those skilled in the art that the foregoing is merely illustrative and not limiting, having been presented by way of example only. Numerous modifications and other embodiments are within the scope of one of ordinary skill in the art and are contemplated as falling within the scope of the invention.

What is claimed is:

Claims

1. In a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions, a method for evaluating a query protein amino acid sequence with respect to the motif, the method comprising a step of:

(A) calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.

2. The method of claim 1, wherein the step (A) comprises a step of:

(A)(1) calculating the score as a logarithm of a product of the selected preference values.

3. The method of claim 1, wherein the step (A) comprises steps of:

(A)(1) multiplying the selected preference values to obtain the product; and (A)(2) calculating the score as the logarithm of the product.

4. The method of claim 1, wherein the step (A) comprises steps of:

(A)(1) calculating the logarithm of each of the selected preference values; and (A)(2) calculating the score as a sum of the logarithms calculated in step (A)(1).

5. The method of claim 1, wherein the step (A) comprises a step of:

(A)(1) calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.

6. The method of claim 1, wherein the motif includes a number of degenerate positions, and wherein the step (A) comprises steps of:

(A)(1) generating, for each of the selected preference values, a probability value that is proportional to the selected preference value; (A)(2) calculating, for each of the probability values, a negative logarithm of the probability value; (A)(3) summing the negative logarithms; and (A)(4) calculating the score by dividing the sum by the number of degenerate positions in the known amino acid sequence.

7. The method of claim 1, wherein the step (A) comprises steps of:

(A)(1) generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids.

8. The method of claim 1, further comprising steps of:

(B) calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and (C) calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences.

9. The method of claim 8, wherein the step (C) comprises steps of:

(C)(1) generating a histogram of the scores of the plurality of amino acid sequences; (C)(2) identifying a position of the score of the query protein amino acid sequence within the histogram; (C)(3) calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences.

10. The method of claim 1, further comprising a step of:

(B) generating a graphical display on a display device, the graphical display including information descriptive of the score.

11. The method of claim 10, wherein the step (B) comprises steps of:

(B)(1) generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and (B)(2) generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

12. The method of claim 11, wherein the step (B)(1) comprises a step of:

(B)(1)(1) generating the query protein amino acid sequence graphical element having a visible range of positions corresponding to positions within the query protein amino acid sequence; and wherein the step (B)(2) comprises a step of: (B)(2)(1) generating the motif identifier at a location that visually corresponds to the position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

13. The method of claim 10, wherein the step (B) comprises a step of:

(B)(1) generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

14. The method of claim 10, further comprising a step of:

(B)(1) generating display information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well.

15. The method of claim 10, wherein the step (B) comprises steps of:

(B)(1) generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and (B)(2) generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.

16. In a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions, a query sequence evaluator comprising: a first input to receive a query sequence evaluation request indicating a query protein amino acid sequence to be evaluated with respect to the motif; a second input to receive information descriptive of the record from the database; and an output to develop a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.

17. The query sequence evaluator of claim 16, wherein the score comprises a logarithm of a product of the selected preference values.

18. The query sequence evaluator of claim 16, wherein the score comprises an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.

19. In a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions, a query sequence evaluation system comprising: a query sequence evaluator to develop on an output a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence; and a query sequence user interface having an input to receive the score and to develop on an output display information descriptive of the score for output to a display device.

20. The system of claim 19, wherein the display information includes information descriptive of: a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

21. The system of claim 19, wherein the display information includes information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

22. The system of claim 19, wherein the display information includes information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well.

23. The system of claim 19, wherein the display information includes information descriptive of: a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.

24. In a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions, a query sequence evaluation system for evaluating a query protein amino acid sequence with respect to the motif, the query sequence evaluation system comprising: query sequence evaluation means for calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.

25. The query sequence evaluation system of claim 24, wherein the query sequence evaluation means comprises: means for calculating the score as a logarithm of a product of the selected preference values.

26. The query sequence evaluation system of claim 24, wherein the query sequence evaluation means comprises: means for calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.

27. The query sequence evaluation system of claim 24, wherein the query system evaluation means comprises: means for generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids.

28. The query sequence evaluation system of claim 24, further comprising: first calculation means for calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and second calculation means for calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences.

29. The query sequence evaluation system of claim 28, wherein the second calculation means comprises: means for generating a histogram of the scores of the plurality of amino acid sequences; means for identifying a position of the score of the query protein amino acid sequence within the histogram; and means for calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences.

30. The query sequence evaluation system of claim 24, further comprising: graphical display generation means for generating a graphical display on a display device, the graphical display including information descriptive of the score.

31. The query sequence evaluation system of claim 30, wherein the graphical display generation means comprises: means for generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and means for generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

32. The query sequence evaluation system of claim 24, wherein the graphical display generation means comprises: means for generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.

33. The query sequence evaluation system of claim 24, wherein the graphical display generation means comprises: means for generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and means for generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.