WO2000065358A2 - Amino acid sequence evaluation system - Google Patents
Amino acid sequence evaluation system
- Publication number: WO2000065358A2 (PCT/US2000/010756)
- Authority: WO, WIPO (PCT)
- Prior art keywords: amino acid, acid sequence, query, motif, score
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- the present invention relates to computer-implemented methods for predicting the functional roles and pathways in which proteins may interact with each other.
- Genome sequencing initiatives such as the Human Genome project have generated thousands of protein sequences of unknown function and are expected to generate vast numbers of additional sequences in the coming years.
- a comparison of sequence homologies between such newly discovered proteins and proteins of known function remains the only method available for predicting functional properties for newly discovered proteins.
- binding interactions include the interaction of an enzyme with its substrate (e.g., the interaction of kinases, proteases, phosphatases, etc. with their substrates), the interaction of antibodies with antigens, the interaction of receptors with ligands, and the interaction of SH2 domains with phosphotyrosine-containing targets.
- the specificity of a particular binding protein, such as an enzyme, typically has been determined by identifying a number of natural substrates for the protein, obtaining the sequences of these substrates, and then comparing the sequences to define a consensus motif for substrate binding.
- protein and nucleic acid homology can be calculated using various publicly available software tools developed by NCBI (Bethesda, Maryland) that can be obtained through the Internet (ftp://ncbi.nlm.nih.gov/pub/).
- Exemplary tools include the BLAST system available at http://www.ncbi.nlm.nih.gov. Pairwise and ClustalW alignments (BLOSUM30 matrix setting) as well as Kyte-Doolittle hydropathic analysis can be obtained, for example, using the MacVector sequence analysis software (Oxford Molecular Group).
- a compilation of motif sequences has been described (Bairoch, A., Nucl. Acids Res. 19:2241-2245 (1991)). Unfortunately, this compilation is deduced from the works of many investigators, each with varying degrees of precision and error, and, accordingly, the compilation is replete with inaccuracies that are attributable to the original data entry.
- the invention is directed to a system which allows a user to scan a protein sequence for novel functions, interactions, post-translational modifications and other structural characteristics.
- Application of the invention is not limited to proteins previously identified in the laboratory, but can also be used to deduce this information for theoretical proteins whose sequences are obtained from genome sequencing initiatives such as the Human Genome Project. Consequently, the system disclosed herein can be used to suggest the functional roles and pathways in which a novel protein may be involved, and to provide directions towards identifying other proteins that are likely to interact with the target protein of interest.
- the system searches a query protein sequence to identify amino acid motifs that are likely to interact with proteins, e.g., as enzyme substrates or other binding domains.
- the system is a web-based system.
- ODPL: oriented degenerate peptide libraries
- An ODPL contains a mixture of library members which differ from one another in amino acid sequence but which, in general, contain the same amino acid located at a fixed amino acid position (referred to herein as the "non-degenerate" position).
- A position within each library peptide that is occupied by a different amino acid in different peptides (i.e., not fixed) is referred to herein as a "degenerate position".
- Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US9S710876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof".
- a system which is useful for evaluating a query protein amino acid sequence to determine whether the query protein contains one or more defined motifs.
- the system comprises a database which includes a record of motifs corresponding to a protein domain of known function.
- the record further includes a matrix of "preference values" (alternatively referred to as "selectivity values") for amino acids at positions in the motif, the preference values indicating the relative importance of each amino acid at each position to the function of the motif (e.g., a binding function, a phosphorylation site function).
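One way to picture such a record is as a small data structure holding one row of preference values per motif position. The following is a minimal sketch only: the class name `MotifRecord`, its field names, the helper `uniform`, and all numeric values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


@dataclass(frozen=True)
class MotifRecord:
    """Hypothetical database record for one motif (all names are illustrative)."""
    name: str
    function: str
    # One dict per motif position, mapping amino acid -> preference value.
    preferences: tuple

    def preference(self, position: int, residue: str) -> float:
        """Return the preference value of `residue` at `position` (0-based)."""
        return self.preferences[position][residue]


def uniform(**overrides):
    """Build one matrix row: 1.0 for every amino acid unless overridden."""
    row = {aa: 1.0 for aa in AMINO_ACIDS}
    row.update(overrides)
    return row


# Made-up values for a three-position motif.
example = MotifRecord(
    name="toy-motif",
    function="hypothetical binding site",
    preferences=(uniform(E=2.5), uniform(Y=20.0), uniform(I=3.0, V=2.0)),
)
```

Here `example.preference(1, "Y")` returns the made-up value 20.0, while any residue not singled out in a row falls back to 1.0.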
- the system is used as a method for evaluating a query protein amino acid sequence to determine whether the query protein contains a sequence corresponding to the motif and, therefore, likely exhibits the function attributed to other proteins which contain a sequence corresponding to this motif.
- the method of the invention evaluates the query protein amino acid sequence with respect to the motif by calculating a score, wherein the score is based on selected preference values in the motif which correspond to amino acids in the query protein amino acid sequence.
- the invention provides two different approaches to provide relative scores of candidate motifs in the query protein sequence.
- Each of these approaches involves a step of (A) calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.
- the invention is directed to a method for evaluating a query protein amino acid sequence with respect to a motif corresponding to a target of a domain of this or another protein having a known function, in a system including a database including a record for the motif, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions.
- the method includes a step of: (A) calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.
- the step (A) includes a step of: (A)(1) calculating the score as a logarithm of a product of the selected preference values.
- the step (A) includes steps of: (A)(1) multiplying the selected preference values to obtain the product; and (A)(2) calculating the score as the logarithm of the product.
- the step (A) includes steps of: (A)(1) calculating the logarithm of each of the selected preference values; and (A)(2) calculating the score as a sum of the logarithms calculated in step (A)(1).
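The two formulations of this score (logarithm of the product, and sum of the logarithms) are mathematically equivalent, which a short sketch can illustrate. The function names and sample values below are hypothetical, and the natural logarithm is an assumption, since the text does not fix a base.

```python
import math


def score_log_of_product(selected_values):
    """Multiply the selected preference values, then take the logarithm."""
    return math.log(math.prod(selected_values))


def score_sum_of_logs(selected_values):
    """Take the logarithm of each selected value, then sum the logarithms."""
    return sum(math.log(v) for v in selected_values)


# Made-up preference values selected for one candidate sub-sequence.
selected = [2.5, 20.0, 3.0]
```

Both functions give the same result up to rounding. The sum-of-logarithms form is often preferred in practice because it avoids overflow or underflow of the intermediate product for long motifs.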
- the step (A) comprises a step of: (A)(1) calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.
- the motif includes a number of degenerate positions
- the step (A) includes steps of: (A)(1) generating, for each of the selected preference values, a probability value that is proportional to the selected preference value; (A)(2) calculating, for each of the probability values, a negative logarithm of the probability value; (A)(3) summing the negative logarithms; and (A)(4) calculating the score by dividing the sum by the number of degenerate positions in the known amino acid sequence.
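Steps (A)(1) through (A)(4) can be sketched as follows. This is an illustration under stated assumptions: the probability for a residue is obtained by dividing its preference value by the sum of the values at that position (one simple way to make it "proportional"), the natural logarithm is used, and all preference values are positive. The function name and data layout are hypothetical.

```python
import math


def average_negative_log_score(rows, residues):
    """Average negative log-probability over the degenerate positions.

    rows     -- one dict per degenerate position: amino acid -> preference value
    residues -- the query residues aligned to those positions
    """
    if len(rows) != len(residues):
        raise ValueError("one residue is required per degenerate position")
    total = 0.0
    for row, aa in zip(rows, residues):
        # (A)(1): probability proportional to the preference value
        # (assumed normalization: divide by the row sum).
        probability = row[aa] / sum(row.values())
        # (A)(2) and (A)(3): accumulate the negative logarithms.
        total += -math.log(probability)
    # (A)(4): divide by the number of degenerate positions.
    return total / len(rows)
```

With this convention a lower score indicates a better match, because strongly preferred residues have probabilities closer to 1 and therefore contribute smaller negative logarithms.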
- the step (A) includes steps of: (A)(1) generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids.
- the method further includes steps of: (B) calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and (C) calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences.
- the step (C) includes steps of: (C)(1) generating a histogram of the scores of the plurality of amino acid sequences; (C)(2) identifying a position of the score of the query protein amino acid sequence within the histogram; and (C)(3) calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences.
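The percentile calculation in step (C)(3) reduces to counting how many reference scores lie on one side of the query score. The sketch below counts directly rather than building an explicit histogram, which is a simplification; the function name and the `lower_is_better` flag are hypothetical.

```python
def percentile_score(query_score, reference_scores, lower_is_better=True):
    """Fraction of reference scores that lie on the 'better' side of the query.

    Which side counts as better depends on the scoring convention, so it is
    exposed here as a flag (an assumption, not something the text specifies).
    """
    if not reference_scores:
        raise ValueError("at least one reference score is required")
    if lower_is_better:
        count = sum(1 for s in reference_scores if s < query_score)
    else:
        count = sum(1 for s in reference_scores if s > query_score)
    return count / len(reference_scores)
```

For example, with reference scores `[0.1, 0.4, 0.6, 0.9]` and a query score of 0.5, two of the four reference scores are lower, so the function returns 0.5.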
- the method further includes a step of: (B) generating a graphical display on a display device, the graphical display including information descriptive of the score.
- the step (B) includes steps of: (B)(1) generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and (B)(2) generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the step (B)(1) includes a step of: (B)(1)(1) generating the query protein amino acid sequence graphical element having a visible range of positions corresponding to positions within the query protein amino acid sequence; and the step (B)(2) includes a step of: (B)(2)(1) generating the motif identifier at a location that visually corresponds to the position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the step (B) includes a step of: (B)(1) generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the method further includes a step of: (B)(1) generating display information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well.
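Finding the position and sub-sequence that match a motif best can be sketched as a sliding-window scan over the query. This is a minimal illustration, not the patented procedure: it assumes the sum-of-logarithms score (higher is better), treats residues missing from a row as having a neutral preference of 1.0, and uses hypothetical names throughout.

```python
import math


def best_match(query, rows):
    """Slide the motif along the query and return (score, start, sub_sequence).

    rows -- one dict per motif position: amino acid -> positive preference value.
    Returns None when the query is shorter than the motif.
    """
    width = len(rows)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Assumed default of 1.0 (log 1.0 == 0) for residues absent from a row.
        score = sum(math.log(row.get(aa, 1.0)) for row, aa in zip(rows, window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


# Made-up three-position motif and query sequence.
toy_rows = [{"E": 2.5}, {"Y": 20.0}, {"I": 3.0, "V": 2.0}]
toy_hit = best_match("MAEYIGK", toy_rows)
```

In this toy case the best window starts at 0-based position 2 and covers the sub-sequence `EYI`; a display layer could use the start index to place a motif identifier along a drawing of the sequence.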
- the step (B) includes steps of: (B)(1) generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and (B)(2) generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.
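A text-only stand-in for such a histogram display might look like the sketch below. The equal-width binning, the bar rendering, and the function name are all assumptions made for illustration; the patent's figures describe a graphical display, not this format.

```python
def histogram_with_marker(scores, query_score, n_bins=5):
    """Bin the reference scores and flag the bin that holds the query score.

    Returns (counts, marker_bin, lines), where lines is a printable rendering.
    """
    values = list(scores) + [query_score]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values coincide

    def bin_of(x):
        # The maximum value is clamped into the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for s in scores:
        counts[bin_of(s)] += 1
    marker_bin = bin_of(query_score)

    lines = []
    for i, count in enumerate(counts):
        flag = " <- query" if i == marker_bin else ""
        lines.append(f"{lo + i * width:6.2f} | {'#' * count}{flag}")
    return counts, marker_bin, lines
```

Printing the returned `lines` one per row gives a rough view of where the query score falls relative to the reference distribution.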
- the invention is directed to a query sequence evaluator in a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions.
- the query sequence evaluator includes: a first input to receive a query sequence evaluation request indicating a query protein amino acid sequence to be evaluated with respect to the motif; a second input to receive information descriptive of the record from the database; and an output to develop a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.
- the score comprises a logarithm of a product of the selected preference values. In another embodiment, the score comprises an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.
- the invention is directed to a query sequence evaluation system in a system including a database including a record for a motif corresponding to a target of a domain of a protein having a known function, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions.
- the query sequence evaluation system includes: a query sequence evaluator to develop on an output a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence; and a query sequence user interface having an input to receive the score and to develop on an output display information descriptive of the score for output to a display device.
- the display information includes information descriptive of: a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence; and a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the display information includes information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the display information includes information descriptive of a sub-sequence within the query protein amino acid sequence that matches the motif particularly well.
- the display information includes information descriptive of: a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif; and a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.
- the invention is directed to a query sequence evaluation system for evaluating a query protein amino acid sequence with respect to a motif corresponding to a target of a domain of a protein having a known function.
- the query sequence evaluation system operates in a system including a database including a record for the motif, the record including preference values for amino acids at positions in the motif, the preference values indicating preferences of the amino acids to interact with the protein at the positions.
- the query sequence evaluation system includes query sequence evaluation means for calculating a score for the query protein amino acid sequence with respect to the motif based on selected preference values in the motif corresponding to amino acids in the query protein amino acid sequence.
- the query sequence evaluation means includes means for calculating the score as a logarithm of a product of the selected preference values.
- the query sequence evaluation means includes means for calculating the score as an average of a sum of negative logarithms of probabilities corresponding to the selected preference values.
- the query sequence evaluation means includes means for generating one of the selected preference values corresponding to a first amino acid at a particular position in the motif based on preference values of a plurality of other preference values corresponding to a plurality of other amino acids at the particular position in the motif and based on values corresponding to physicochemical properties of the plurality of other amino acids.
- the query sequence evaluation system further includes first calculation means for calculating scores for a plurality of amino acid sequences with respect to the motif based on selected preference values in the motif corresponding to amino acids in the plurality of amino acid sequences; and second calculation means for calculating a percentile score for the query protein amino acid sequence by comparing the score of the query protein amino acid sequence to the scores of the plurality of amino acid sequences.
- the second calculation means includes means for generating a histogram of the scores of the plurality of amino acid sequences; means for identifying a position of the score of the query protein amino acid sequence within the histogram; and means for calculating the percentile score for the query protein amino acid sequence by dividing the number of scores that lie to one side of the score of the query protein amino acid sequence by the number of the plurality of amino acid sequences.
- the query sequence evaluation system further comprises graphical display generation means for generating a graphical display on a display device, the graphical display including information descriptive of the score.
- the graphical display generation means comprises means for generating a query protein amino acid sequence graphical element that displays a structure of the query protein amino acid sequence and means for generating a motif identifier that identifies the motif, the motif identifier visually indicating a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the graphical display generation means comprises means for generating display information descriptive of a position within the query protein amino acid sequence at which the motif matches the query protein amino acid sequence particularly well.
- the graphical display generation means comprises means for generating a histogram display that displays a histogram of scores of a plurality of amino acid sequences with respect to the motif and means for generating a query protein amino acid sequence marker within the histogram display that indicates the position of the score of the query protein amino acid sequence within the histogram.
- FIG. 1 is a dataflow diagram of an amino acid evaluation system according to one embodiment of the present invention.
- FIG. 2 is a table including preference values for a motif corresponding to a target of a domain of a protein.
- FIG. 3 is a flow chart of a method for evaluating a query amino acid sequence.
- FIGS. 4A-B are flow charts of methods for generating a score for a query amino acid sequence with respect to a motif for a domain of a protein.
- FIG. 5 is a flow chart of a method for generating a percentile score for a query amino acid sequence with respect to a motif for a domain of a protein.
- FIGS. 6A-C are diagrams of graphical displays displaying information descriptive of an evaluation of a query amino acid sequence with respect to multiple motifs.
- FIGS. 7A-C are flow charts of illustrative methods that may be used to generate the graphical displays of FIGS. 6A-C.
- the present invention provides a method and apparatus for evaluating a query protein amino acid sequence with respect to an amino acid sequence motif for a domain of a known protein having a known function to predict whether the query protein amino acid sequence performs the known function.
- the prediction is performed by using a scoring system to generate a score for the query protein amino acid sequence with respect to the amino acid sequence motif. The score indicates a degree of confidence that the query protein amino acid sequence performs the known function.
- the invention is directed to a system which allows a user to scan a protein sequence for novel functions, interactions, post-translational modifications and other structural characteristics.
- Application of the invention is not limited to proteins previously identified in the laboratory, but can also be used to deduce this information for theoretical proteins whose sequences are obtained from genome sequencing initiatives such as the Human Genome Project. Consequently, the system disclosed herein can be used to suggest the functional roles and pathways in which a novel protein may be involved, and to provide directions towards identifying other proteins that are likely to interact with the target protein of interest.
- the system searches a query protein sequence to identify amino acid motifs that are likely to interact with proteins, e.g., as enzyme substrates or other protein binding domains.
- the system is a web-based system.
- The approaches described herein for evaluating an amino acid sequence and predicting the function of a previously uncharacterized sequence are based on three premises: (1) information contained in the linear sequence of amino acids (primary structure) is sufficient to provide many clues about the function of either known or novel proteins; (2) many proteins within the eucaryotic cell function either as components of larger molecular complexes dominated (at least in part) by protein-protein interactions, or as enzymes whose activity is regulated by post-translational modifications such as phosphorylation or by protein-protein interactions; and (3) small sequence motifs within proteins are sufficient to provide specificity for directing modular protein-protein interactions or post-translational modifications.
- the techniques described herein provide a highly reliable method for analyzing an uncharacterized query sequence based upon these premises.
- the primary data from which the system predicts protein motifs is based upon a database containing information generated using oriented degenerate peptide libraries (ODPL).
- An ODPL contains a mixture of library members which differ from one another in amino acid sequence but which, in general, contain the same amino acid located at a fixed amino acid position (referred to herein as the "non-degenerate" position).
- A position within each library peptide that is occupied by a different amino acid in different peptides (i.e., not fixed) is referred to herein as a "degenerate position".
- The ODPL contains at least one fixed, non-degenerate amino acid position and several degenerate amino acid positions.
- The amino acid residues on either side of the fixed non-degenerate amino acid residue are degenerate (e.g., immediately N-terminal and C-terminal to the non-degenerate residue), thus enabling one to determine an interaction site motif for the region surrounding the fixed amino acid residue.
- Four amino acid residues located on each side of the non-degenerate amino acid residue can be degenerate (e.g., positions -4, -3, -2, -1, +1, +2, +3, +4, relative to the non-degenerate amino acid residue at position 0, can be degenerate).
- the degenerate positions in the peptides of an oriented degenerate cyclic peptide library can be created such that any one of the twenty natural amino acids, as well as unnatural ⁇ -amino acids, can occupy those positions.
- the degenerate positions do not contain amino acid residues that can be acted upon by the particular binding compound being examined.
- When the binding compound is a protein-serine/threonine kinase and the fixed residue is a serine or threonine, it is preferred that the degenerate positions not contain serine or threonine.
- Similarly, where tyrosine can be acted upon by the binding compound being examined, it is preferred that the degenerate positions not contain tyrosine.
- Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577.
- Songyang et al. (Cell (1993) 72:767-778) describe a method for determining the sequence specificity of the peptide-binding sites of SH2 domains using oriented degenerate phosphopeptide libraries.
- the database of peptide motifs is generated by contacting an oriented degenerate peptide library (such as those described in the above-cited references) with a binding compound under conditions which allow for interaction between the binding compound and the ODPL.
- the binding compound interacts with the ODPL such that a complex is formed between the binding compound and a subpopulation of library members capable of interacting with the binding compound.
- the subpopulation of library members capable of interacting with the binding compound is then separated from the library members that are incapable of interacting with the binding compound.
- An amino acid sequence motif is then determined for an interaction site of the binding compound, based upon the relative abundance of different amino acid residues at each degenerate position within the linearized library members.
- The term "binding compound" refers to compounds which can interact with an interaction site on a peptide by one or more mechanisms.
- Exemplary binding compounds include enzymes and binding proteins such as kinases (e.g., protein serine/threonine kinases, protein tyrosine kinases, lipid kinases), phosphatases (e.g., protein phosphatases, lipid phosphatases), proteases (e.g., serine proteases, cysteine proteases), and binding proteins containing, e.g., SH2 domains, SH3 domains, antibodies, WW domains, PTB domains, PDZ domains, LIM domains, pleckstrin homology domains, zinc finger domains, extracellular growth factors and receptors, adhesion molecules, intercellular signaling molecules, 7-transmembrane receptor proteins, ion channels, methyltransferases, ubiquitinating enzymes and peptidyl-transferases.
- The term "interaction" refers to attractive forces which physically combine a binding compound with an interaction site on an oriented degenerate peptide library member.
- Such attractive forces include hydrophobic interactions, hydrophilic interactions, covalent binding, ionic binding, charged interactions, etc.
- an amino acid sequence motif for an interaction site is intended to describe a composite amino acid sequence which represents a consensus sequence for an interaction site.
- an amino acid sequence motif encompasses the region of the peptide which includes and surrounds an amino acid residue(s) which specifically and preferentially interacts with a binding compound.
- The conditions under which the binding compound and oriented degenerate peptide library are contacted will vary depending upon the particular binding compound and ODPL used but are chosen such that a complex can form between the binding compound and a subpopulation of library members that are capable of interacting with the binding compound.
- When the binding compound is an enzyme, the enzyme and the ODPL are contacted under conditions that maintain the enzymatic activity of the enzyme (e.g., a kinase is incubated with the ODPL under conditions that allow for phosphorylation of the library members by the kinase).
- After complexes have formed between the binding compound and the subpopulation of library members that are capable of interacting with the binding compound (referred to as the "binding subpopulation"), the binding subpopulation is separated from the non-binding subpopulation (e.g., those library members that do not interact with the binding compound). If applicable, the binding subpopulation is linearized.
- The binding compound can be immobilized on a solid support (e.g., a column) so that the binding subpopulation of library members remains bound to the immobilized binding compound while the non-binding subpopulation is washed away. Standard methods for affinity chromatography can be used to afford such separation.
- Binding compounds can be immobilized to a solid support using methods known in the art.
- the binding compound can be prepared as a glutathione-S-transferase (GST) fusion protein and immobilized by binding the GST fusion protein to glutathione agarose beads.
- Where the binding compound has an enzymatic activity, separation of the binding subpopulation of library members from the non-binding subpopulation may be based upon enzymatic modification of the binding subpopulation by the binding compound.
- For example, when the binding compound is a kinase, the binding subpopulation of library members becomes phosphorylated while the non-binding subpopulation remains nonphosphorylated (discussed in the Examples). Phosphorylated peptides can then be separated from nonphosphorylated peptides to achieve separation of the binding subpopulation from the non-binding subpopulation.
- Likewise, when the binding compound is a phosphatase, the binding subpopulation of library members becomes dephosphorylated while the non-binding subpopulation remains phosphorylated. Nonphosphorylated peptides can then be separated from phosphorylated peptides to achieve separation of the binding subpopulation from the non-binding subpopulation.
- selected binding library members are sequenced by standard amino acid sequencing techniques (e.g., Edman degradation).
- Automated peptide sequencers can be used to determine the amino acid sequence of the library members.
- the ODPL used is a soluble synthetic peptide library and the subpopulation of peptides is sequenced as a bulk population using an automated peptide sequencer. This approach provides information on the abundance of each amino acid residue at a given cycle in the sequence of the complexed mixture, most importantly at the degenerate positions.
- A relative abundance (RA) value can then be calculated by dividing the abundance of a particular amino acid residue at that position after library screening (e.g., after complex formation and separation) by the abundance of the same amino acid residue at that position in the starting library:
- $\mathrm{RA} = \dfrac{\text{amount of Xaa in the population of selected peptides}}{\text{amount of Xaa in the original oriented degenerate peptide library}}$
- the relative abundance value may be corrected for background contamination as described in, for example, PCT/US98/10876.
- Amino acid residues which are neither enriched for nor selected against in the population of peptides which can serve as substrates for, or bind to, the binding compound will have a relative abundance of 1.0.
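- The relative abundance calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_abundance` and the amounts below are hypothetical and are not taken from the specification, and no background correction is applied.

```python
def relative_abundance(selected, starting):
    """Divide the amount of each residue after selection by its amount in the starting library."""
    ra = {}
    for amino_acid, start_amount in starting.items():
        if start_amount <= 0:
            continue  # residue absent from the starting library: RA is undefined
        ra[amino_acid] = selected.get(amino_acid, 0.0) / start_amount
    return ra


# Hypothetical sequencer yields (arbitrary units) at one degenerate position.
starting = {"Gly": 10.0, "Lys": 10.0, "Glu": 10.0}
selected = {"Gly": 15.0, "Lys": 6.0, "Glu": 10.0}
ra = relative_abundance(selected, starting)
```

- In this made-up data set, Gly is enriched (RA above 1.0), Lys is selected against (RA below 1.0), and Glu is neutral (RA of 1.0).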
- Those amino acid residues which are preferred at a particular degenerate position (e.g., residues which are enriched at that position in the complexed peptides) will have a relative abundance greater than 1.0.
- Those amino acid residues which are not preferred (e.g., residues which are selected against at that position in the complexed peptides) will have a relative abundance less than 1.0.
- Based upon the relative abundance values, preferred amino acid residues (e.g., amino acid residues with a relative abundance greater than 1.0) can be identified at each degenerate position.
- From the preferred amino acid residues, an amino acid sequence motif for an interaction site (e.g., a phosphorylation site for a protein kinase) can be determined.
- the amino acid sequence motif encompasses the degenerate region of the peptides.
- the particular amino acid residues chosen for the motif at each degenerate position are those which are most abundant at each position.
- An amino acid residue(s) with a relative abundance value greater than a predetermined threshold (e.g., 1.0) at a particular position may be chosen as the amino acid residue(s) at that position within the amino acid sequence motif.
- The amino acid sequence motif may, therefore, contain any number of amino acids at each position.
- the predetermined threshold may be any value.
- all amino acid residues may be included in the motif at each position.
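- As a rough sketch of the threshold-based selection described above, the following Python function keeps, at each position, only the residues whose relative abundance exceeds a threshold. The function name `build_motif`, the dictionary layout, and the values are assumptions made for illustration only.

```python
def build_motif(ra_by_position, threshold=1.0):
    """Keep, at each position, the residues whose relative abundance exceeds the threshold."""
    motif = {}
    for position, ra in ra_by_position.items():
        motif[position] = {aa: value for aa, value in ra.items() if value > threshold}
    return motif


# Hypothetical relative abundance values at two degenerate positions.
ra_by_position = {
    -1: {"Gly": 1.5, "Lys": 0.6, "Glu": 1.0},
    +1: {"Gly": 0.8, "Lys": 2.1, "Glu": 1.2},
}
motif = build_motif(ra_by_position)               # threshold of 1.0
full_motif = build_motif(ra_by_position, 0.0)     # a low threshold keeps every residue
```

- Because the threshold may be any value, a sufficiently low threshold reproduces the case in which all amino acid residues are included at each position.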
- an amino acid sequence evaluation system 100 is provided to evaluate query protein amino acid sequences with respect to amino acid sequence motifs for domains of proteins having known functions and to generate, for each of the motifs, a quantitative score or scores for the presence of the motif in the query protein amino acid sequence.
- the quantitative nature of the scores indicates a degree of confidence that the query protein amino acid sequence performs the same function or functions as the proteins having known functions.
- The system 100 includes a peptide library database 102 including records 104a-n descriptive of motifs for domains of proteins having known functions.
- The records 104a-n in the peptide library database 102 are developed according to the oriented peptide library approach described above. The contents and function of the peptide library database 102 are described in more detail below with respect to FIG. 2.
- the amino acid sequence evaluation system 100 also includes a query sequence user interface 106 that provides an interface between the system 100 and users of the system 100.
- The query sequence user interface 106 accepts input from the user and generates display information 110 that is displayed to the user on a display device 112, such as a computer monitor.
- a user submits user query sequence input 108 to the query sequence user interface 106.
- the user query sequence input 108 describes the query protein amino acid sequence that the user wishes to search.
- The user query sequence input 108 may take any form. For example, the user may provide a complete amino acid sequence in single-letter format, or may provide the name of a protein stored in a database, such as the publicly-available Swiss-Prot database.
- the user query sequence input 108 may also include additional information, such as user preferences indicating how the search is to be performed. For example, the user query sequence input 108 may indicate that only selected motifs in the peptide library database should be included in the search.
- The query sequence user interface 106 generates and submits a query sequence evaluation request 114 to a query sequence evaluator 116.
- The query sequence evaluation request 114 may, for example, include a description of the query protein amino acid sequence to be searched as well as information descriptive of the user's preferences (e.g., which motifs represented in the peptide library database 102 are to be included in the search).
- After receiving the query sequence evaluation request 114, the query sequence evaluator 116 evaluates the query protein amino acid sequence contained in the query sequence evaluation request 114. For example, the query sequence evaluator 116 may evaluate the query protein amino acid sequence with respect to the motifs 104a-n represented in the peptide library database 102 to generate query sequence evaluation results 118 indicative of the presence of the motifs 104a-n in the query protein amino acid sequence. Examples of methods for generating such evaluation results 118 are described in more detail below with respect to FIGS. 4A-B.
- The query sequence evaluator 116 transmits the query sequence evaluation results 118 to the query sequence user interface 106, which generates and transmits display information 110 descriptive of the query sequence evaluation results 118 to the display device 112 in a format suitable for browsing by the user.
- the user may browse the display information 110 using user display navigation commands 120.
- a query sequence score for a motif corresponding to a target of a domain of a known protein having a known function may be used to predict whether the query protein amino acid sequence performs the same or similar function.
- The range of possible scores may be ordered such that an amino acid sequence having a higher score is predicted to be more likely to perform the same function as the known protein than an amino acid sequence having a lower score.
- Each of the query sequence evaluator 116 and the query sequence user interface 106 may be implemented as a computer program residing in a computer-readable memory, such as a random-access memory (RAM).
- Such computer programs include, for example, standalone applications, background processes, plug-ins, and dynamic link libraries, either alone or in combination.
- The query sequence user interface 106 and the query sequence evaluator 116 may be implemented as programs executing on a single computer or on different computers, or may be combined into a single computer program executing on a single computer or distributed over a network.
- the query sequence user interface 106 may, for example, be implemented as a web page displayable by a standard web browser.
- the query sequence evaluator 116 may be implemented as a web-compatible server accessible to the query sequence user interface 106 over a network, such as an intranet or internet (e.g., the public Internet).
- the query sequence evaluation results 118 may include any information indicative of the presence of motifs from the peptide library database 102 in the query protein amino acid sequence.
- The query sequence evaluation results 118 include quantitative scores for the query protein amino acid sequence with respect to motifs represented in the peptide library database.
- a score for the query protein amino acid sequence with respect to a particular motif may, for example, indicate a degree of confidence that the query protein amino acid sequence contains the motif. Such a score may therefore indicate whether the query protein amino acid sequence performs a function that is the same as or similar to a function performed by proteins containing the motif for the particular domain.
- The query sequence evaluation results 118 may include multiple scores for the query protein amino acid sequence with respect to a single motif, each of the scores corresponding to a different subsequence within the query protein amino acid sequence. For example, in one embodiment (described in more detail below with respect to FIG. 3), subsequences of the query protein amino acid sequence are evaluated with respect to each of the motifs in the peptide library database 102. In such an embodiment, the query sequence evaluation results 118 may include an evaluation result (e.g., a quantitative score) for each subsequence with respect to each motif.
- The query sequence evaluation results 118 include, for each evaluation result (e.g., quantitative score), an identifier identifying the subsequence of the query protein amino acid sequence to which the evaluation result corresponds.
- the identifier may, for example, indicate the position of the beginning of the subsequence within the query protein amino acid sequence.
- The query sequence evaluation results 118 may include only selected evaluation results. For example, the query sequence evaluation results 118 may include only evaluation results for motifs that match the query protein amino acid sequence (or subsequences of it) particularly well.
- Where the query sequence evaluation results 118 include quantitative scores, the query sequence evaluation results 118 may include only those quantitative scores that satisfy a predetermined threshold.
- Similarly, the query sequence user interface 106 may generate display information 110 only for selected ones of the query sequence evaluation results 118. For example, the query sequence user interface 106 may generate display information 110 only for motifs that match the query protein amino acid sequence (or subsequences of it) particularly well.
- Where the query sequence evaluation results 118 include quantitative scores, the query sequence user interface 106 may generate display information 110 only for those quantitative scores that satisfy a predetermined threshold.
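- The threshold-based selection of evaluation results can be sketched as a simple filter. In this hypothetical Python example, results are keyed by a (subsequence start position, motif name) pair; the function name `select_results`, the key layout, and the scores are illustrative assumptions. The `higher_is_better` flag is included because some scoring schemes treat a lower score as the better match.

```python
def select_results(results, threshold, higher_is_better=True):
    """Return only the results whose score satisfies the threshold."""
    if higher_is_better:
        return {key: score for key, score in results.items() if score >= threshold}
    return {key: score for key, score in results.items() if score <= threshold}


# Hypothetical scores keyed by (start position of subsequence, motif name).
results = {(0, "Motif 1"): 2.42, (1, "Motif 1"): 0.35, (0, "Motif 2"): 1.10}
kept = select_results(results, threshold=1.0)
kept_low = select_results(results, threshold=1.0, higher_is_better=False)
```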
- Each of the records 104a-n in the peptide library database 102 contains information descriptive of a motif for a domain of a protein.
- the peptide library database 102 may be any kind of database capable of storing information descriptive of motifs.
- Although the motif records 104a-n in the peptide library database 102 may be generated according to the oriented peptide library approach, this is not a limitation of the present invention. Rather, the records 104a-n may be generated in any way.
- The oriented peptide library approach is described in detail in Cantley et al., U.S. Pat. No. 5,532,167, entitled "Substrate Specificity of Protein Kinases," and incorporated herein by reference in its entirety.
- Exemplary oriented degenerate peptide libraries are described in Songyang et al. (Cell (1993) 72:767-778); U.S. Patent No. 5,532,167; and PCT Application No. PCT/US98/10876, entitled "Cyclic Peptide Libraries and Methods of Use Thereof to Identify Binding Motifs," publication no. WO 98/54577.
- the amino acid sequence motifs determined by the oriented peptide library approach are useful for predicting whether a query protein is a substrate for a particular protein kinase.
- the primary amino acid sequence of a query protein can be examined for the presence of the determined amino acid sequence motif. If the same or a very similar motif is present in the protein, it can be predicted that the protein could function as a substrate for that protein kinase.
- Although the motifs may correspond to domains for proteins having known functions, the functions of the proteins to which the motifs correspond need not be known. For example, if the functions performed by the proteins corresponding to the motif records 104a-n in the database are not known, the techniques described herein may still be used to determine degrees of correspondence between a query protein amino acid sequence and motifs represented in the peptide library database 102. Such degrees of correspondence may subsequently become useful if the functions of peptides corresponding to the motifs are later discovered, or if the function performed by the query protein amino acid sequence is known.
- The records 104a-n in the peptide library database 102 are represented as tables. An example of such a table 300 is shown in FIG. 2.
- the table 300 corresponds to the record 104a (Motif 1) in the peptide library database 102 (FIG. 1).
- The columns in the table 300 correspond to positions in Motif 1, and the rows in the table 300 correspond to amino acids.
- Motif 1 includes nine positions numbered -4 through +4, in which positions -4 through -1 and +1 through +4 are degenerate positions, and in which position zero is a non-degenerate position.
- The peptide library database may include records corresponding to motifs having any number of degenerate and non-degenerate positions. As shown in FIG. 2, the single non-degenerate position (position zero) in Motif 1 corresponds to Tyrosine.
- Each cell of the table 300 at a particular row (corresponding to an amino acid) and column (corresponding to a position) contains a preference value corresponding to the relative abundance of the amino acid at the position.
- For example, the preference value of Lysine at position -1 is 0.60033, which is the value stored at the row corresponding to Lysine (Lys) and the column numbered -1.
- the preference value of any amino acid with respect to any position of Motif 1 can be readily determined by reference to the table 300.
- The table 300 need not include preference values for all amino acids or within all cells. Where a preference value is absent, a suitable default value may be substituted.
- For example, where the motif includes only selected amino acid residues at each position (e.g., amino acid residues exceeding a predetermined threshold), amino acids which are not in the motif at a particular position are assigned a default preference value at that position, such as one or zero.
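- A table such as table 300 can be modeled as a nested mapping from position to amino acid to preference value, with a default returned for missing cells. The sketch below is illustrative only: the function name `preference` and the choice of nested dictionaries are assumptions, and only the two preference values quoted in this description (Lys at position -1 and Gly at position -4) are used.

```python
DEFAULT_PREFERENCE = 1.0  # assumed default; the description mentions one or zero


def preference(record, amino_acid, position, default=DEFAULT_PREFERENCE):
    """Look up a preference value, falling back to a default when the cell is missing."""
    return record.get(position, {}).get(amino_acid, default)


# Partial record holding only the two cells quoted in the text.
record = {-4: {"Gly": 1.5284}, -1: {"Lys": 0.60033}}

lys_minus_1 = preference(record, "Lys", -1)   # stored value
trp_minus_1 = preference(record, "Trp", -1)   # missing cell, default used
```

- Note that a default of zero would be unsuitable for any method that takes the logarithm of the preference value, so a default of one is assumed here.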
- the user may provide the user query sequence input using any suitable input device, such as a standard keyboard or mouse.
- The query sequence evaluator 116 evaluates a query protein amino acid sequence according to a process 301.
- The query sequence evaluator 116 receives the query sequence evaluation request 114 from the query sequence user interface 106 (step 302).
- The query sequence evaluation request 114 includes a description of the query protein amino acid sequence to be searched, and may further include additional information such as information indicating which of the records 104a-n in the peptide library database 102 are to be included in the search.
- The query sequence evaluator 116 evaluates the subsequence s with respect to motif m (step 308). Examples of ways in which the query sequence evaluator 116 may evaluate the query protein amino acid sequence are described in more detail below with respect to FIGS. 4A-B.
- Evaluation of the query protein amino acid sequence produces a query sequence evaluation result (e.g., a quantitative score) that is stored for future use (step 310).
- The evaluation result may, for example, be stored in a two-dimensional Results array at column s and row m.
- the evaluation result may be stored in any manner, as long as the evaluation result is associated with the amino acid subsequence and the motif to which it corresponds.
- Steps 308-310 are repeated for the remaining motifs (step 312) and the remaining subsequences of the query protein amino acid sequence (step 314).
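- The nested loop over subsequences and motifs (steps 308-314) might look like the following Python sketch. The function name `evaluate`, the fixed window width of nine residues, the toy scoring function, and the eleven-residue query are all assumptions made for illustration; any scoring method could be passed in as `score_fn`.

```python
def evaluate(query, motifs, score_fn, width=9):
    """Score every width-residue subsequence of the query against every motif."""
    results = {}
    for start in range(len(query) - width + 1):        # loop over subsequences s
        subsequence = query[start:start + width]
        for name, record in motifs.items():            # loop over motifs m
            # Associate each result with its subsequence (by start) and its motif.
            results[(start, name)] = score_fn(subsequence, record)
    return results


def count_matches(subsequence, residues):
    """Toy scoring function: count residues of the subsequence found in a set."""
    return sum(1 for aa in subsequence if aa in residues)


# Hypothetical query and "motifs" (plain residue sets, used only by the toy scorer).
query = "MGNGDYMPMSA"
motifs = {"motif_a": {"Y", "M"}, "motif_b": {"G"}}
results = evaluate(query, motifs, count_matches)
```

- Keying the results by (start position, motif name) is one way of keeping each evaluation result associated with the subsequence and the motif to which it corresponds.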
- Although the illustrative process 301 shown in FIG. 3 evaluates all subsequences of the query protein amino acid sequence with respect to all motifs within the peptide library database 102, fewer than all subsequences of the query protein amino acid sequence may be evaluated with respect to fewer than all of the motifs represented in the peptide library database 102.
- a method referred to as the "log-sum method" is used to evaluate the query protein amino acid sequence with respect to a motif.
- the log-sum method may be used to implement step 308 (FIG. 3), as shown by the process 400 in FIG. 4A.
- The log-sum method assigns a score to a query protein amino acid sequence based on the logarithm of the product of each amino acid's preference value in the motif against which the query protein amino acid sequence is being scored. This is mathematically equivalent to summing the logarithms of each amino acid's preference value. Note also that the logarithm of the preference value may be considered a reflection of the chemical binding energy.
- the preference values may be normalized before being used in the log-sum method. For example, the values in each of the columns in the table 300 (FIG. 2) are normalized to a sum of fifteen.
- The query sequence evaluator 116 may generate a score for a subsequence beginning at position p_s of the query protein amino acid sequence with respect to a motif m using the process 400 according to the log-sum method as follows.
- The process 400 is described with respect to an illustrative example in which the query protein amino acid sequence is GNGDYMPMS and the record in the peptide library database 102 corresponding to the motif m contains the preference values shown in table 300 (FIG. 2).
- The query sequence evaluator 116 initializes the score to a value of zero (step 402), and identifies the record r in the peptide library database 102 corresponding to the motif m for which the score is being generated (step 404). In this example, the contents of the record r that is retrieved are shown in the table 300 (FIG. 2). The query sequence evaluator 116 then enters a loop over each position p_m in the motif (step 406). In this example, the positions of the motif are numbered -4 through +4. For each such position p_m, the query sequence evaluator 116 identifies the amino acid in the query sequence at position (p_s + p_m) (step 408).
- the amino acid in the query protein amino acid sequence (GNGDYMPMS) at the first position (position -4) is G.
- The query sequence evaluator 116 retrieves, from the record r, the preference value pv of the identified amino acid at position p_m of the motif (step 410).
- In this example, the preference value of G (Gly) at position -4 is 1.5284. This is represented by the contents of the table 300 at the row labeled "Gly" and the column numbered -4. If there is no preference value in the motif at position p_m for the identified amino acid, the query sequence evaluator may substitute an appropriate default value, such as 1.
- The query sequence evaluator 116 updates the score for the motif by adding to the score the logarithm (e.g., the natural logarithm or the base ten logarithm) of the preference value pv (step 412).
- the preference values at a particular position within the motif m may be normalized before calculation of the logarithm at step 412.
- In this example, the base ten logarithm of the preference value 1.5284 is approximately 0.1842.
- Steps 408-412 are repeated for the remaining amino acids in the query protein amino acid sequence.
- the score is equal to the sum of the logarithms of the preference values of each of the amino acids in the query protein amino acid sequence. In this example, the score is approximately equal to 2.42.
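- The log-sum steps 402-412 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the process 400: the function name `log_sum_score`, the single-letter record layout, and the default value of 1 for missing cells are assumptions, and the record below contains only the one preference value quoted in the text (G at position -4), so the resulting score is not the full-table score of about 2.42.

```python
import math


def log_sum_score(subsequence, record, positions=range(-4, 5),
                  default=1.0, log=math.log10):
    """Sum the logarithms of the preference values of the residues at each motif position."""
    score = 0.0                                            # step 402: initialize score
    for position, amino_acid in zip(positions, subsequence):   # steps 406-408
        pv = record.get(position, {}).get(amino_acid, default)  # step 410, with default
        score += log(pv)                                   # step 412
    return score


# Partial record: only the cell quoted in the description is filled in.
record = {-4: {"G": 1.5284}}
score = log_sum_score("GNGDYMPMS", record)
```

- With every other cell defaulting to 1 (whose logarithm is zero), the score reduces to the base ten logarithm of 1.5284, which agrees with the value of approximately 0.1842 given above for the first position.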
- Optionally, each preference value pv retrieved in step 410 is increased by a constant value c before the logarithm of the preference value pv is calculated in step 412.
- This addition of the constant c may be used to shift the retrieved preference values toward a region of the logarithm function in which lower preference values are less heavily weighted.
- The constant value c may be chosen in any way and may be any value. Preference values that are less than one may be neglected or raised to be equal to one.
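- The effect of the constant c can be seen in a small Python sketch. The helper name `shifted_log`, the `floor_at_one` flag, and the choice of c = 1 are hypothetical; the base ten logarithm is used only to match the worked example above.

```python
import math


def shifted_log(pv, c=0.0, floor_at_one=False):
    """Logarithm of a preference value after an optional floor at one and an added constant c."""
    if floor_at_one and pv < 1.0:
        pv = 1.0  # raise low preference values to one
    return math.log10(pv + c)


# Penalty of a strongly disfavored residue (0.1) relative to a neutral one (1.0).
gap_unshifted = shifted_log(1.0) - shifted_log(0.1)            # no constant added
gap_shifted = shifted_log(1.0, c=1.0) - shifted_log(0.1, c=1.0)  # constant c = 1
```

- Without the constant, the disfavored residue costs a full log unit relative to a neutral residue; with c = 1 the same difference is much smaller, which is the sense in which lower preference values are weighted less heavily.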
- A method referred to as the "entropy method" is used to score for the presence of a motif in a query sequence.
- The preference value p_i of each amino acid at each position in the motif is translated to a probability (with the sum of all of the probabilities being equal to one). The negative base two logarithm of p_i (i.e., -log2 p_i) is used as a measure of the relative "entropic density" of the amino acid in that position, and the cumulative score is calculated.
- the resulting score is then averaged for the number of the degenerate positions in the motif.
- A lower score indicates a higher degree of confidence that the query protein amino acid sequence includes the motif, while a higher score indicates a lower degree of confidence.
- The query sequence evaluator 116 may evaluate an amino acid subsequence with respect to a motif m (step 308 of FIG. 3) using a process 420 according to the entropy method as follows.
- The process 420 is described with respect to an illustrative example in which the query protein amino acid sequence is GSEEYMNMD.
- The query sequence evaluator 116 initializes the score to a value of zero (step 422), and identifies the record r in the peptide library database 102 corresponding to the motif for which the score is being generated (step 424). In this example, the contents of the record r that is retrieved are shown in the table 300 (FIG. 2). For each position p_m in the motif m, the query sequence evaluator 116 normalizes the preference values at position p_m by translating the preference values into probabilities, with the sum of the probabilities being equal to one (step 426). This may be performed, for example, by dividing each preference value at position p_m by the sum of the preference values at that position.
- The query sequence evaluator 116 then enters a loop over each position p_m in the motif m (step 428). In this example, the positions of the motif are numbered -4 through +4. For each position p_m, the query sequence evaluator 116 identifies the amino acid in the query sequence at position (p_s + p_m) (step 430). In this example, the amino acid in the query protein amino acid sequence (GSEEYMNMD) at the first position (position -4) is G. The query sequence evaluator 116 retrieves, from the record r, the probability value of the identified amino acid at position p_m of the motif m (step 432). In this example, the probability value of G (Gly) at position -4 is 0.1019.
- The query sequence evaluator 116 updates the score for the motif m by adding the negative logarithm (e.g., the natural logarithm or the base ten logarithm) of the retrieved probability value to the score (step 434). Steps 430-434 are repeated for the remaining amino acids in the query amino acid subsequence. After the loop completes (step 436), the score is divided by the number of positions in the motif m (in this example, nine) to obtain a final score (step 438). In this example, the final score is approximately equal to 3.3691.
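- A compact Python sketch of the entropy method follows. It is an illustration under stated assumptions, not the patented implementation: the function names, the three-position toy motif, and its preference values are hypothetical, probabilities are obtained by dividing each preference value by its column sum, the base two logarithm is assumed, and the record is assumed to contain every residue that appears in the subsequence.

```python
import math


def to_probabilities(column):
    """Normalize the preference values at one position so that they sum to one."""
    total = sum(column.values())
    return {aa: value / total for aa, value in column.items()}


def entropy_score(subsequence, record, positions=range(-4, 5), log=math.log2):
    """Average negative log-probability of the residues found at each motif position."""
    score = 0.0
    for position, amino_acid in zip(positions, subsequence):
        probabilities = to_probabilities(record[position])
        score += -log(probabilities[amino_acid])
    return score / len(positions)     # divide by the number of motif positions


# Column check using the quoted value: Gly = 1.5284 in a column normalized to fifteen.
# "rest" is a hypothetical bucket lumping all other residues together.
p_gly = to_probabilities({"G": 1.5284, "rest": 15.0 - 1.5284})["G"]

# Hypothetical three-position motif chosen so the arithmetic is easy to follow.
toy = {-1: {"G": 1.0, "A": 1.0}, 0: {"Y": 1.0}, 1: {"M": 1.0, "L": 3.0}}
toy_score = entropy_score("GYM", toy, positions=(-1, 0, 1))
```

- In the toy motif the three residues have probabilities 0.5, 1 and 0.25, contributing 1, 0 and 2 bits respectively, for an average of 1.0. The column check reproduces the probability of about 0.1019 quoted for Gly at position -4.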
- a method that takes advantage of quantitative structure-activity relationships is used to score for the presence of a motif in a query sequence.
- QSAR quantitative structure-activity relationships
- quantitative scales are assigned to physicochemical properties of amino acids. For example, in one embodiment, a first scale (labeled z_1) is assigned to amino acid hydrophilicity, a second scale (labeled z_2) is assigned to size, and a third scale (labeled z_3) is assigned to polarity (electronic effects).
- Such values may be obtained experimentally or from pre-existing sources and stored in a database for future use.
- a score for a query protein amino acid sequence or other chemical structure with respect to a motif may be calculated using the preference values stored in the peptide library database 102 for the motif and using the z scale values of the amino acids in the motif.
- an equation, Equation (2), can be derived that relates the preference value p of a particular chemical structure at a particular position x of a motif m to the z scale values z_1, z_2, and z_3 of the chemical structure.
- the values of the coefficients a, b, and c are set appropriately to characterize the relationship between the z scale values (z_1, z_2, and z_3) and the preference value p. Equation (2) can therefore be used to calculate a preference value for an amino acid or other chemical structure based on the chemical structure's known z scale values.
- the query sequence evaluator 116 merely substitutes the z scale values of the chemical structure into a form of Equation (2) having coefficients a, b, and c with appropriate values.
- Use of Equation (2) to calculate a preference value may be useful when, for example, the peptide library database 102 does not contain a preference value for a particular amino acid or other chemical structure.
- the preference value may be stored in the peptide library database 102 and/or used in the calculation of a score for a query protein amino acid sequence using any appropriate method, such as the log-sum method or entropy method, described above.
- the coefficients a, b, and c of Equation (2) for each motif m and each position x may be generated prior to evaluation of any query sequences by the query sequence evaluator 116. Such pre-generation of the coefficients allows preference values for chemical structures to be generated quickly during the evaluation process, without having to generate values for the coefficients.
- the coefficients a, b, and c for a particular motif m at a position x may be generated using any appropriate method. For example, if the preference values and z scale values of at least three amino acids at position x of motif m are known, the coefficients a, b, and c may be obtained by solving Equation (2) using standard algebraic techniques.
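As an illustration of solving for the coefficients, the sketch below assumes that Equation (2) is linear in the z scale values with no constant term, p = a*z_1 + b*z_2 + c*z_3. That form is an assumption made here (the equation itself does not appear in this text); it is chosen because it has exactly three unknowns, which matches the statement that three amino acids with known values suffice. The function names and all numeric values are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows of three numbers."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_coefficients(z_rows, prefs):
    """Solve p = a*z1 + b*z2 + c*z3 (assumed form) for (a, b, c) by Cramer's rule.

    z_rows: (z1, z2, z3) for three amino acids; prefs: their known preference values.
    """
    d = det3(z_rows)
    if abs(d) < 1e-12:
        raise ValueError("z scale values are linearly dependent; cannot solve")
    coeffs = []
    for col in range(3):
        replaced = [list(row) for row in z_rows]
        for r in range(3):
            replaced[r][col] = prefs[r]  # swap one column for the preference values
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


def predict_preference(coeffs, z):
    """Substitute z scale values into the assumed form of Equation (2)."""
    a, b, c = coeffs
    return a * z[0] + b * z[1] + c * z[2]


# Hypothetical z scale values and preference values for three amino acids.
Z_KNOWN = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
P_KNOWN = [2.0, 3.0, 6.0]

coefficients = solve_coefficients(Z_KNOWN, P_KNOWN)
print(coefficients, predict_preference(coefficients, (2.0, 1.0, 1.0)))
```

With more than three known amino acids, a least-squares fit would be a natural alternative to an exact solve; that is a design choice beyond what the text specifies.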
- a "percentile score" may also be calculated (e.g., by the query sequence evaluator 116 or the query sequence user interface 106). Such a percentile score indicates where the final score for the query protein amino acid sequence ranks compared to the final scores of other amino acid sequences containing the non-degenerate residue when evaluated with respect to the same motif. The percentile score can therefore be useful for interpreting the final score for the query protein amino acid sequence.
- a flow chart of one example of a process 500 that may be performed (e.g...
- the query sequence evaluator 116 calculates final scores (e.g., as described above with respect to FIGS. 4A-B) for all amino acid sequences in an amino acid sequence database (step 502), such as the publicly available Swiss-Prot database, with respect to the motif, and generates a histogram of the final scores (step 504).
- the position of the query protein amino acid sequence's final score is identified within the histogram (step 506).
- the query protein amino acid sequence's percentile score is calculated by dividing the number of scores greater than (or less than, depending on the scoring method used) the final score for the query protein amino acid sequence by the total number of scores in the histogram (step 508).
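The percentile calculation can be sketched as below, assuming the divisor is the total number of database scores. The function name, the `count_greater` flag, and the example scores are hypothetical.

```python
def percentile_score(query_score, database_scores, count_greater=True):
    """Fraction of database scores greater than (or less than) the query score.

    count_greater selects which side of the query score is counted; which side
    is appropriate depends on the scoring method used.
    """
    if not database_scores:
        raise ValueError("need at least one database score")
    if count_greater:
        counted = sum(1 for s in database_scores if s > query_score)
    else:
        counted = sum(1 for s in database_scores if s < query_score)
    return counted / len(database_scores)


# Hypothetical final scores for database sequences evaluated against one motif.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(percentile_score(3.5, scores))                       # 2 of 5 scores are greater
print(percentile_score(3.5, scores, count_greater=False))  # 3 of 5 scores are less
```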
- the histogram of final scores generated in step 504 is used to determine whether to include the score for a query sequence in the query sequence evaluation results 118 or, alternatively, whether to generate display information 110 for the score.
- the query sequence evaluator 116 only includes the final score for a query protein amino acid sequence in the query sequence evaluation results 118 if the final score falls within a predetermined region of the histogram, such as within the best five percent of scores in the histogram.
- the query sequence evaluator 116 only includes the final score for a query protein amino acid sequence in the query sequence evaluation results 118 if the final score is further than two standard deviations from the mean of the histogram.
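The two inclusion criteria just described (a predetermined region such as the best five percent, or a distance of more than two standard deviations from the mean) might be checked as in the following sketch. The function names and the `lower_is_better` flag are assumptions, and the population standard deviation is used as one reasonable reading of "standard deviations ... of the histogram".

```python
import statistics


def in_best_fraction(score, database_scores, fraction=0.05, lower_is_better=True):
    """True if the score is at least as good as the cutoff for the best `fraction` of scores."""
    ordered = sorted(database_scores, reverse=not lower_is_better)
    cutoff_index = max(1, int(len(ordered) * fraction)) - 1
    cutoff = ordered[cutoff_index]
    return score <= cutoff if lower_is_better else score >= cutoff


def beyond_n_sigma(score, database_scores, n_sigma=2.0):
    """True if the score lies more than n_sigma standard deviations from the mean."""
    mean = statistics.mean(database_scores)
    sd = statistics.pstdev(database_scores)
    return abs(score - mean) > n_sigma * sd


# Hypothetical database scores: the integers 1 through 100.
database = list(range(1, 101))
print(in_best_fraction(3, database), in_best_fraction(6, database))
print(beyond_n_sigma(120, database), beyond_n_sigma(60, database))
```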
- the query sequence user interface 106 may generate and transmit display information 110, representing the results of the evaluation, to the display device 112.
- the display information 110 may include, for example, information to display the final score(s) of the query protein amino acid sequence with respect to one or more of the motifs.
- the display information 110 only includes information to display selected final scores of the query protein amino acid sequence, e.g., scores that satisfy a predetermined threshold value.
- the predetermined threshold value may be any value and may be selected in any manner. Including only selected final scores in the display information 110 may be used to provide the user with a graphical display of the final scores of only those amino acid subsequences that are particularly likely to match the corresponding motifs in the peptide library database 102.
- the display information 110 includes a graphical display 600 that displays potentially matching motifs from the peptide library database 102 superimposed on the domain structure of the query protein amino acid sequence.
- a query protein amino acid sequence graphical element 602 displays the structure of the query protein amino acid sequence as a horizontal strip, with the leftmost edge of the strip representing the first position in the query protein amino acid sequence and the rightmost edge of the strip representing the last position in the query protein amino acid sequence.
- An x-axis 604, displayed under the query protein amino acid sequence graphical element 602, provides a visible indication of the locations of positions in the query protein amino acid sequence graphical element 602.
- the user can thus quickly identify the position of any point in the query protein amino acid sequence graphical element 602 by reference to the x-axis 604.
- the graphical element 602 is broken into sub-elements each representing a known or putative domain based on sequence homology of the query protein amino acid sequence.
- Displayed above the query protein amino acid sequence graphical element are motif identifiers 606a-f, indicating protein domains whose motifs from the peptide library database 102 match subsequences in the query protein amino acid sequence particularly well.
- the motif identifier 606e indicates that an Abl kinase domain has a motif that matches a subsequence in the query protein amino acid sequence at position Y266 in the query protein amino acid sequence.
- Motifs may be selected for display in the graphical display 600 by, for example, selecting only those motifs for which the query protein amino acid sequence's final score satisfies a predetermined threshold, as described above. Displaying such motif identifiers 606a-f enables the user to quickly identify those motifs which most closely match the query protein amino acid sequence.
- the motif identifiers 606a-f are positioned along the x-axis 604 at the x coordinates corresponding to the first positions of the domains in the query protein amino acid sequence which they match.
- the motif identifier 606e (identifying an Abl kinase domain at position Y266) is positioned at x coordinate 266 along the x-axis 604. Displaying the motif identifiers 606a-f at the locations to which they correspond in the query protein amino acid sequence enables the user to visually identify the locations of matching motifs quickly and easily.
- the query sequence user interface 106 uses a process 700 to generate the graphical display 600 after the query sequence user interface 106 receives the query sequence evaluation results from the query sequence evaluator 116.
- the query sequence user interface 106 generates the query protein amino acid sequence element 602 (step 702).
- the query sequence user interface 106 generates the x-axis 604 (step 704).
- the query sequence user interface 106 selects motifs that match the query protein amino acid sequence particularly well, such as by selecting motifs whose scores satisfy a predetermined threshold, as described above (step 706).
- the query sequence evaluation results 118 generated by the query sequence evaluator 116 include information describing the positions at which the motifs match the query protein amino acid sequence.
- the query sequence user interface 106 uses this information to generate motif identifiers for well-matching motifs at the positions where they match the query protein amino acid sequence (step 708).
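Steps 702-708 of process 700 can be approximated with a plain-text rendering, as in the sketch below. It is only a stand-in for a real graphical display: the fixed-width strip, the `v` marker, the function names, the sequence length of 531, and every score are hypothetical. Only the Abl kinase match at position 266 is taken from the example in the text.

```python
def select_matches(results, threshold, lower_is_better=True):
    """Step 706: keep the matches whose score satisfies the threshold."""
    if lower_is_better:
        return [r for r in results if r["score"] <= threshold]
    return [r for r in results if r["score"] >= threshold]


def place_identifiers(matches, sequence_length, width=60):
    """Step 708: map each 1-based match position onto a column of the strip."""
    placed = []
    for match in matches:
        column = (match["position"] - 1) * (width - 1) // (sequence_length - 1)
        placed.append((match["motif"], column))
    return placed


def render(sequence_length, placed, width=60):
    """Steps 702-704: draw marker row, sequence strip, and a minimal x-axis."""
    marker_row = [" "] * width
    for _, column in placed:
        marker_row[column] = "v"
    strip = "=" * width
    end = str(sequence_length)
    axis = "1" + " " * (width - 1 - len(end)) + end
    return "\n".join(["".join(marker_row), strip, axis])


# Hypothetical evaluation results; scores and the second entry are invented.
RESULTS = [
    {"motif": "Abl kinase", "position": 266, "score": 3.1},
    {"motif": "hypothetical domain", "position": 40, "score": 5.0},
]

placed = place_identifiers(select_matches(RESULTS, threshold=3.5), sequence_length=531)
print(render(531, placed))
for motif, column in placed:
    print(f"column {column}: {motif}")
```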
- the display information 110 includes a graphical display 620 that displays information about a particular motif that matches the query protein amino acid sequence.
- the graphical display 620 includes a title 622 that displays the name of the domain corresponding to the motif for which information is displayed in the graphical display 620.
- the graphical display 620 includes rows 624a-c, each of which corresponds to the first position of a domain within the query protein amino acid sequence which the motif matches particularly well.
- the graphical display includes a position column 626, a score column 628, and a sequence column 630.
- the value in the position column 626 indicates the position at which the motif matches the query sequence.
- the value in the score column 628 indicates the score of the motif with respect to the query protein amino acid sequence.
- the sequence shown in the sequence column 630 displays the sub-sequence within the query protein amino acid sequence that is considered to be a particularly good match for the motif.
- the information displayed in the row 624a indicates that the SRC kinase domain matched the query protein amino acid sequence particularly well beginning at position 141 within the query protein amino acid sequence, that the SRC kinase domain had a score of 3.4710 (according to, e.g., the log-sum method), and that the sub-sequence within the query protein amino acid sequence that matched the SRC kinase domain particularly well was DEDIYSGLS.
- a plurality of graphical displays similar to graphical display 620 are generated, each of which displays information for a different motif.
- the graphical display 600 includes hyperlinks to information contained in the graphical display 620 (FIG. 6B).
- the motif identifiers 606a-f in the graphical display 600 may include hyperlinks to graphical displays, such as the graphical display 620 (FIG. 6B) displaying information about the corresponding motifs.
- the user can cause the query sequence user interface 106 to generate a graphical display (e.g., graphical display 620 in FIG. 6B) for the motif corresponding to the selected motif identifier.
- the query sequence user interface 106 uses a process 720 to generate the graphical display 620 in response to the user's selection of a particular one of the motif identifiers 606a-f in the graphical display 600 (FIG. 6A).
- the query protein amino acid sequence includes domains at known positions and the query sequence evaluation results 118 include evaluation results for the selected motif with respect to each of the domains within the query protein amino acid sequence. It should be appreciated, however, that similar methods may be used to generate the graphical display 620 in other embodiments.
- for each subsequence s (at position p_s) within the query protein amino acid sequence (step 722), the score of the subsequence with respect to the selected motif is retrieved from the query sequence evaluation results (step 724). If the retrieved score satisfies a predetermined threshold (step 726), then the position p_s, the retrieved score, and the sequence of the subsequence s are displayed (step 728), as shown in FIG. 6B. Steps 724-728 are repeated for the remaining domains in the query protein amino acid sequence.
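The loop of steps 722-728 might look like the following sketch, which filters subsequence scores against a threshold and formats the surviving rows as position, score, and sequence columns. The function names, the threshold of 4.0, the `lower_is_better` flag, and the second data row are hypothetical; the first row reuses the SRC kinase example values given in the text (position 141, score 3.4710, DEDIYSGLS).

```python
def matching_rows(subsequence_scores, threshold, lower_is_better=True):
    """Collect (position, score, subsequence) rows whose score satisfies the threshold."""
    rows = []
    for position, subsequence, score in subsequence_scores:  # step 722
        # Step 724 (retrieving the score) is implicit: scores arrive paired with positions.
        if lower_is_better:
            satisfies = score <= threshold  # step 726
        else:
            satisfies = score >= threshold
        if satisfies:
            rows.append((position, score, subsequence))  # step 728
    return rows


def format_table(title, rows):
    """Lay out the title and the position, score, and sequence columns as text."""
    lines = [title, f"{'Position':<10}{'Score':<10}Sequence"]
    for position, score, subsequence in rows:
        lines.append(f"{position:<10}{score:<10.4f}{subsequence}")
    return "\n".join(lines)


SUBSEQUENCE_SCORES = [
    (141, "DEDIYSGLS", 3.4710),  # values from the example in the text
    (305, "AAAAYAAAA", 5.2000),  # invented row that fails the threshold
]

rows = matching_rows(SUBSEQUENCE_SCORES, threshold=4.0)
print(format_table("SRC kinase", rows))
```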
- the query sequence evaluator 116 may generate a percentile score for a query protein amino acid sequence.
- the display information 110 includes information descriptive of the percentile score.
- the display information 110 may describe a graphical display 640.
- the graphical display 640 includes a histogram display 642 representing the histogram generated in step 504 (FIG. 5).
- the graphical display 640 also includes a query protein amino acid sequence marker 644 placed on the x-axis that indicates where the final score of the query protein amino acid sequence lies with respect to the final scores of other amino acid sequences represented in the histogram.
- the query sequence user interface 106 uses a process 740 to generate the graphical display 640 including a histogram display 642 representing the histogram generated in step 504.
- the query sequence user interface 106 generates the histogram display 642 (step 742).
- the query sequence user interface 106 then generates the query protein amino acid sequence marker 644 on the histogram display 642 at a position corresponding to the score (step 744).
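Steps 742-744 can be illustrated with a text histogram in which the bin containing the query score is flagged. This is a simplified stand-in: bins run down the page rather than along an x-axis, and the function names, bin range, and scores are all hypothetical.

```python
def bin_index(score, n_bins, lo, hi):
    """Index of the bin containing score, clamped to the valid range."""
    width = (hi - lo) / n_bins
    return max(0, min(int((score - lo) / width), n_bins - 1))


def build_histogram(scores, n_bins, lo, hi):
    """Step 742: count the database scores falling into each bin."""
    counts = [0] * n_bins
    for s in scores:
        counts[bin_index(s, n_bins, lo, hi)] += 1
    return counts


def render_histogram(scores, query_score, n_bins=5, lo=0.0, hi=5.0):
    """Step 744: draw one row per bin and flag the bin holding the query score."""
    counts = build_histogram(scores, n_bins, lo, hi)
    marker = bin_index(query_score, n_bins, lo, hi)
    width = (hi - lo) / n_bins
    lines = []
    for i, count in enumerate(counts):
        flag = " <- query" if i == marker else ""
        lines.append(f"{lo + i * width:4.1f} | {'#' * count}{flag}")
    return lines


# Hypothetical database scores and query score.
for line in render_histogram([0.5, 1.5, 1.6, 2.5, 4.9], query_score=3.4):
    print(line)
```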
- a computer system for implementing the system of FIG. 1 as one or more computer programs typically includes a main unit connected to both an output device which displays information to a user and an input device which receives input from a user.
- the main unit generally includes a processor connected to a memory system via an interconnection mechanism.
- the input device and output device also are connected to the processor and memory system via the interconnection mechanism.
- Example output devices include a cathode ray tube (CRT) display, liquid crystal displays (LCD), printers, communication devices such as a modem, and audio output.
- one or more input devices may be connected to the computer system.
- Example input devices include a keyboard, keypad, track ball, mouse, pen and tablet, communication device, and data input devices such as sensors. It should be understood the invention is not limited to the particular input or output devices used in combination with the computer system or to those described herein.
- the computer system may be a general purpose computer system which is programmable using a computer programming language, such as C++, Java, or other language, such as a scripting language or assembly language.
- the computer system may also include specially programmed, special purpose hardware.
- the processor is typically a commercially available processor, of which the series x86 and Pentium processors, available from Intel, and similar devices from AMD and Cyrix, the 680X0 series microprocessors available from Motorola, the PowerPC microprocessor from IBM and the Alpha-series processors from Digital Equipment Corporation, are examples. Many other processors are available.
- Such a microprocessor executes a program called an operating system, of which Windows NT, UNIX, DOS, VMS and OS8 are examples, which controls the execution of other computer programs and provides scheduling, debugging, input/output control, accounting, compilation, storage assignment, data management and memory management, and communication control and related services.
- the processor and operating system define a computer platform for which application programs in high-level programming languages are written.
- a memory system typically includes a computer readable and writeable nonvolatile recording medium, of which a magnetic disk, a flash memory and tape are examples.
- the disk may be removable, known as a floppy disk, or permanent, known as a hard drive.
- a disk has a number of tracks in which signals are stored, typically in binary form, i.e., a form interpreted as a sequence of ones and zeros. Such signals may define an application program to be executed by the microprocessor, or information stored on the disk to be processed by the application program.
- the processor causes data to be read from the nonvolatile recording medium into an integrated circuit memory element, which is typically a volatile, random access memory such as a dynamic random access memory (DRAM) or static memory (SRAM).
- the integrated circuit memory element allows for faster access to the information by the processor than does the disk.
- the processor generally manipulates the data within the integrated circuit memory and then copies the data to the disk when processing is completed.
- a variety of mechanisms are known for managing data movement between the disk and the integrated circuit memory element, and the invention is not limited thereto. It should also be understood that the invention is not limited to a particular memory system.
- the computer system may be a multiprocessor computer system or may include multiple computers connected over a computer network.
- each module (e.g., 102, 106, 116) in FIG. 1 may be separate modules of a computer program, or may be separate computer programs. Such modules may be operable on separate computers.
- Data (e.g., 108 and 114)
- the invention is not limited to any particular implementation using software or hardware or firmware, or any combination thereof.
- the various elements of the system either individually or in combination, may be implemented as a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00926228A EP1173755A2 (en) | 1999-04-23 | 2000-04-21 | Amino acid sequence evaluation system |
CA002371238A CA2371238A1 (en) | 1999-04-23 | 2000-04-21 | Amino acid sequence evaluation system |
AU44792/00A AU4479200A (en) | 1999-04-23 | 2000-04-21 | Amino acid sequence evaluation system |
JP2000614047A JP2002543390A (en) | 1999-04-23 | 2000-04-21 | Amino acid sequence evaluation system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US29837199A | 1999-04-23 | 1999-04-23 | |
US09/298,371 | 1999-04-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2000065358A2 (en) | 2000-11-02 |
WO2000065358A3 (en) | 2001-01-11 |
Family
ID=23150210
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2000/010756 WO2000065358A2 (en) | 1999-04-23 | 2000-04-21 | Amino acid sequence evaluation system |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1173755A2 (en) |
JP (1) | JP2002543390A (en) |
AU (1) | AU4479200A (en) |
CA (1) | CA2371238A1 (en) |
WO (1) | WO2000065358A2 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5532167A (en) * | 1994-01-07 | 1996-07-02 | Beth Israel Hospital | Substrate specificity of protein kinases |
WO1998054577A1 (en) * | 1997-05-28 | 1998-12-03 | Beth Israel Deaconess Medical Center, Inc. | Cyclic peptide libraries and methods of use thereof to identify binding motifs |
-
2000
- 2000-04-21 JP JP2000614047A patent/JP2002543390A/en not_active Withdrawn
- 2000-04-21 AU AU44792/00A patent/AU4479200A/en not_active Abandoned
- 2000-04-21 EP EP00926228A patent/EP1173755A2/en not_active Withdrawn
- 2000-04-21 CA CA002371238A patent/CA2371238A1/en not_active Abandoned
- 2000-04-21 WO PCT/US2000/010756 patent/WO2000065358A2/en not_active Application Discontinuation
Non-Patent Citations (2)
Title |
---|
BAIROCH A: "PROSITE: a dictionary of sites and patterns in proteins" NUCLEIC ACIDS RESEARCH, SUPPLEMENT, vol. 19, 25 April 1991 (1991-04-25), pages 2241-2245, XP002901216 cited in the application * |
SONGYANG Z ET AL: "SH2 Domains Recognize Specific Phosphopeptide Sequences" CELL, vol. 72, no. 5, 12 March 1993 (1993-03-12), pages 767-778, XP002901215 cited in the application * |
Also Published As
Publication number | Publication date |
---|---|
WO2000065358A3 (en) | 2001-01-11 |
EP1173755A2 (en) | 2002-01-23 |
CA2371238A1 (en) | 2000-11-02 |
JP2002543390A (en) | 2002-12-17 |
AU4479200A (en) | 2000-11-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A2 Designated state(s): AU CA JP |
 | AL | Designated countries for regional patents | Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
 | AK | Designated states | Kind code of ref document: A3 Designated state(s): AU CA JP |
 | AL | Designated countries for regional patents | Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
 | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
 | ENP | Entry into the national phase | Ref country code: JP Ref document number: 2000 614047 Kind code of ref document: A Format of ref document f/p: F |
 | ENP | Entry into the national phase | Ref document number: 2371238 Country of ref document: CA Ref country code: CA Ref document number: 2371238 Kind code of ref document: A Format of ref document f/p: F |
 | WWE | Wipo information: entry into national phase | Ref document number: 2000926228 Country of ref document: EP |
 | WWP | Wipo information: published in national office | Ref document number: 2000926228 Country of ref document: EP |
 | WWW | Wipo information: withdrawn in national office | Ref document number: 2000926228 Country of ref document: EP |