WO2002021139A2 - Automated identification of peptides - Google Patents

Automated identification of peptides

Info

Publication number
WO2002021139A2
WO2002021139A2 PCT/GB2001/004034 GB0104034W WO0221139A2 WO 2002021139 A2 WO2002021139 A2 WO 2002021139A2 GB 0104034 W GB0104034 W GB 0104034W WO 0221139 A2 WO0221139 A2 WO 0221139A2
Authority
WO
WIPO (PCT)
Prior art keywords
mass
peptide
sequences
sequence
database
Prior art date
Application number
PCT/GB2001/004034
Other languages
English (en)
Other versions
WO2002021139A3 (fr)
Inventor
Robert Reid Townsend
Andrew William Robinson
Original Assignee
Oxford Glycosciences (Uk) Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0022136A external-priority patent/GB0022136D0/en
Application filed by Oxford Glycosciences (Uk) Ltd. filed Critical Oxford Glycosciences (Uk) Ltd.
Priority to EP01965415A priority Critical patent/EP1317765A2/fr
Priority to AU2001286059A priority patent/AU2001286059A1/en
Publication of WO2002021139A2 publication Critical patent/WO2002021139A2/fr
Publication of WO2002021139A3 publication Critical patent/WO2002021139A3/fr

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6818 Sequencing of polypeptides
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching

Definitions

  • the present invention relates to a method for determining the amino acid sequence of a peptide. More particularly, the present invention relates to an amino acid sequence analysis carried out using mass spectrometry. BACKGROUND OF THE INVENTION
  • a proteome is the protein complement of a cell or tissue. Since one genome produces many proteomes (multi-cellular organisms can have hundreds of proteomes) and the number of expressed genes in a cell is generally considered to exceed 10,000, the characterisation of thousands of proteins to evaluate proteomes can best be accomplished using a high-throughput, automated process.
  • Certain methods for analyzing peptides using mass spectrometry are known in the art. Peptide molecular weights and the masses of sequencing ions can be obtained routinely to an accuracy which enables mass distinction amongst most of the 20 amino acids in the genetic code.
  • in tandem mass spectrometry, a peptide sample is introduced into the mass spectrometer and is subjected to analysis in two mass analyzers (denoted as MS1 and MS2).
  • in MS1, a narrow mass-to-charge window (typically 2-4 Da), centered around the m/z ratio of the peptide to be analyzed, is selected.
  • the ions within the selected mass window are then subjected to fragmentation via collision-induced dissociation, which typically occurs in a collision cell by applying a voltage to the cell and introducing a gas to promote fragmentation.
  • the process produces smaller peptide fragments derived from the precursor ion (termed the 'product' or 'daughter' ions).
  • the product ions, in addition to any remaining intact precursor ions, are then passed through to a second mass spectrometer (MS2) and detected to produce a fragmentation or tandem (MS/MS) spectrum.
  • the MS/MS spectrum records the m/z values and the instrument-dependent detector response for all ions exiting from the collision cell.
  • C-terminal fragments are designated as x, y or z ions
  • N-terminal fragments are designated as a, b or c ions
  • Peptides are fragmented using two general approaches, high and low energy collision-induced dissociation (CID) conditions.
  • signals assigned to y and b ions and from losses of water and ammonia are usually the most intense.
  • under high energy CID, peptide molecules with sufficient internal energy to cause cleavages of the amino acid side chains are produced. These side chain losses predominantly occur at the amino acid residue where the backbone cleavage occurs.
  • Prior art methods for automated analysis of fragmentation mass spectra are capable of generating a ranked list of candidate peptide sequences in a sequence database; however, identification of a true match from amongst multiple candidate sequences has heretofore required subjective manual assessment by one skilled in spectral interpretation.
  • the present invention relates to a user-independent method to identify and characterize a peptide sequence present in a peptide database that corresponds to an experimental peptide, for example a peptide derived by selective cleavage of a polypeptide.
  • the present method identifies the corresponding sequence if it is present in the database (or the corresponding sequences if duplicates are present in the database), without the need for a skilled observer to choose from amongst a list or ranked list of possible matches by reference to mass spectrometric or other criteria.
  • the methods can be performed with large peptide databases, including those prepared by conceptual translation of large nucleotide databases such as a database representing a eukaryotic (e.g. mammalian or higher plant) genome such as the human genome or maize genome.
  • a computer-based method for determining whether or not a first peptide sequence database contains one or more peptide sequences that correspond to an experimental peptide comprises:
  • step (d) performing a computer-mediated back-read that tests the candidate sequences, if any, against the first peak list or a second peak list derived from a fragmentation spectrum of the experimental peptide and determining whether one or more candidate sequences fit the data in the peak list according to one or more matching criteria, wherein upon satisfaction of the matching criteria, the candidate sequences, if any, that satisfy the matching criteria are identified as corresponding sequences.
  • the back-read of step (d) comprises (i) for each candidate sequence,
  • a computer-mediated program (or set of programs) performs the method described herein without the intervention of a person skilled in spectral interpretation, and preferably without the intervention of an operator.
  • the capacity for fully automated analysis of mass spectral information and searching of a peptide sequence database, coupled with computer-mediated mapping of related nucleotide or peptide databases, permits the high-throughput identification and organization of expressed segments of DNA in large polycistronic genomic databases and the rapid identification of nucleotide or peptide sequencing errors and polymorphisms.
  • the methods described herein provide the ability to sequence and/or identify peptides without any derivatization or labeling, for instance without preparing isotopically labeled peptides such as ¹⁸O-labeled peptides.
  • the present method can uniquely identify a corresponding peptide sequence in a peptide database based on identifying a single peptide sequence that is shared between the experimental peptide and the corresponding peptide in the database. This obviates the need to interpret multiple fragmentation mass spectra or to find multiple hits in order to identify the corresponding peptide in the database.
  • the present invention further provides methods for mapping mass spectral data to sequences in peptide or nucleotide databases for (1) unambiguous identification of exons within nucleotide sequences; (2) determining a correct reading frame of a nucleotide sequence; (3) identifying artefacts and errors in nucleotide or peptide sequences;
  • the invention further provides a computer-readable medium comprising instructions for causing a computer to perform any of the methods disclosed herein; a computer comprising instructions for performing any of the methods disclosed herein; a peptide or nucleic acid database comprising information obtained by performing any of the methods disclosed herein; a computer-readable file or list comprising information obtained by performing any of the methods disclosed herein; and a display comprising information obtained by performing any of the methods disclosed herein.
  • the present invention provides a computer-mediated method for determining whether or not a fragmentation mass spectrum (or a defined segment thereof) contains peaks defining a member of a set of peptides to be recognized, comprising generating a set of signature arrays collectively representing the spectral signatures of the peptides to be recognized; generating a spectral array representing a plurality of peaks detected in the fragmentation mass spectrum; and performing a series of logical AND comparisons between the spectral array and each signature array while the latter is swept across a portion of the array representing the segment of the spectrum to be inspected.
  • This method has general applicability to interpretation of fragmentation mass spectra.
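The sweep of a signature array across a spectral array with logical AND comparisons can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: peaks are binned at a hypothetical 1 Da resolution, Python integers stand in for the bit arrays, and the names `to_bits` and `sweep` are invented for the example.

```python
def to_bits(mz_values, bin_width=1.0):
    """Encode peak positions as an int used as a bit array (bit i set = bin i occupied).

    The 1 Da bin width is an assumption made for this sketch.
    """
    bits = 0
    for mz in mz_values:
        bits |= 1 << int(round(mz / bin_width))
    return bits


def sweep(spectral_bits, signature_bits, start_bin, stop_bin):
    """Return every bin offset at which all signature bits coincide with spectral peaks."""
    hits = []
    for offset in range(start_bin, stop_bin + 1):
        shifted = signature_bits << offset
        # Logical AND: the signature matches only if every one of its bits survives.
        if (spectral_bits & shifted) == shifted:
            hits.append(offset)
    return hits


# Hypothetical signature: three peaks separated by 57 and 71 nominal mass units.
signature = to_bits([0, 57, 128])
spectrum = to_bits([175, 232, 303, 400])
matches = sweep(spectrum, signature, 0, 300)
```

Representing each array as a single integer keeps the comparison to one AND and one equality test per offset; a real instrument would need finer bins and a mass tolerance.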
  • FIG. 1 shows an embodiment of the present invention for identifying and characterizing genomic sequences that are expressed as peptides
  • FIG. 2 details an embodiment of an algorithm for constructing and editing the peak table derived from the fragmentation mass spectrum
  • FIG. 3 shows some types of ions in a fragmentation mass spectrum which may be used in practicing the present invention (nomenclature of Biemann 1990).
  • FIG. 4 shows an overview of the modules in the HOPS (Holistic Protein Sequencing) algorithm for the interpretation of fragmentation mass spectra according to a preferred embodiment of the present invention
  • FIG. 5 shows an embodiment of the main peptide sequencing module of HOPS
  • FIG. 6 is a flow chart of one embodiment for editing the peptide sequences generated by the HOPS module.
  • FIG. 7 details one embodiment of the steps for selecting HOPS sequences to construct a database search string;
  • FIG. 8 shows the TESLA algorithm (Trimer Signature Lookup Algorithm) which interprets fragmentation spectra to identify trimer peptide sequences according to a preferred embodiment of the invention
  • FIG. 9 shows a preferred embodiment of the FIREPEP algorithm (Find Related
  • FIG. 10 details a preferred embodiment of the rules for constructing a search string and preferred criteria of the retrieved sequences from the six-frame translated human genome
  • FIG. 11 shows an embodiment of the FIREPROT (Find Related Proteins) algorithm which maps the mass spectrometric sequence data onto retrieved peptide or translated nucleotide sequences that have been retrieved by searching with a search string;
  • FIG. 12 shows the mapping of observed masses and sequences onto conceptually translated genome sequences, in which the box delineates a tryptic peptide matched by FIREPEP to an experimental peptide, the underlined sequences were mass matched using peptide molecular weights, the bolded sequences were identified by spectral read, and the arrowheads delineate a sequence matched by the mapping algorithm of FIREPROT;
  • FIG. 13 shows an algorithm for mapping observed peptide masses and post-translational modifications onto the unique set of identified translated genome sequences.
  • amino acid residue means a monomer of the general structure -NH-CHR-CO- which makes up peptides, oligopeptides and polypeptides. These include the twenty basic amino acids and common derivatives listed in Table 1, and chemically or biologically modified monomers having the same general structure.
  • conceptually translated peptide sequence means a listing of a peptide sequence predicted to be encoded by a given nucleotide sequence in accordance with the universal genetic code. Preferably, the conceptually translated peptide sequence is in machine-readable form.
  • Consensus sequence means a subsequence that is shared among multiple peptide sequences deduced by interpreting a fragmentation mass spectrum of a peptide.
  • a "display” means any device or artefact that presents information in a form intelligible to a human observer and includes, without limitation, a computer terminal, a computer screen, a screen upon which information is projected, and paper or other tangible medium upon which information is temporarily or permanently recorded, whether by printing, writing or any other means.
  • a peptide sequence in a database "corresponds to" an experimental peptide when it correctly specifies the identity and order of the amino acid residues in the experimental peptide except only for substitution of amino acids that are mutually isobaric or mutually mass ambiguous within the resolution of the mass spectrometer used to identify the peptide sequence.
  • a peptide sequence in a database that corresponds to an experimental peptide is referred to herein as a "corresponding" sequence.
  • a “database” of peptide (or nucleotide) sequences means a computer-readable representation of a plurality of peptide (or nucleotide) sequences.
  • a database may be implemented as one or more computer-readable files.
  • an “experimental peptide” is a peptide that is to be identified by the present invention or that is sought to be matched with one or more peptide sequences in a database.
  • in silico digestion of a peptide means use of a computer-mediated algorithm to generate a list representing peptides that would result from selective cleavage (e.g. by digestion with a proteolytic enzyme such as trypsin) of the peptide.
  • in silico digestion may be applied to a single peptide, a plurality of peptides represented in a database, or all the peptides represented in a database.
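A minimal sketch of in silico digestion with trypsin, assuming the simple rule stated elsewhere in this description (cleavage at the carboxyl side of every Lys or Arg). The function name `tryptic_digest` and the example sequence are hypothetical, and refinements such as missed cleavages are deliberately left out.

```python
def tryptic_digest(sequence, min_length=1):
    """List the peptides produced by cleaving after every K (Lys) or R (Arg)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # The C-terminal fragment need not end in K or R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


def digest_database(sequences):
    """Apply the digestion to every sequence in a (hypothetical) in-memory database."""
    return {name: tryptic_digest(seq) for name, seq in sequences.items()}


fragments = tryptic_digest("MKTAYRGGKLE")
```

Applying the same function to one sequence or to a whole dictionary of sequences mirrors the single-peptide and whole-database cases described above.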
  • list means a computer-readable representation of data; a list may be implemented as any desired data structure, including without limitation a table, stack or array. A list may if desired be stored as a file or as a plurality of files.
  • parent ion (also known as a "precursor ion") means an ionized peptide (e.g. an ionized form of an experimental peptide) that is fragmented into a plurality of "product ions" (also known as "daughter ions").
  • a fragmentation mass spectrum can be produced by recording the mass-to-charge (m/z) ratios and intensities of the product ions.
  • peptide means an organic compound comprising two or more amino acid residues joined covalently by one or more peptide bonds; a peptide may be glycosylated or unglycosylated. A peptide containing ten or fewer amino acid residues is an “oligopeptide” and a peptide containing more than ten amino acid residues is a “polypeptide”.
  • post-translational modification means a chemical or biological modification to an amino acid residue after its insertion into a peptide chain. This may occur naturally or in the laboratory.
  • publicly available database means a database that is available in the public domain.
  • Examples of publicly available databases include, but are not limited to, the European Molecular Biology Laboratory (EMBL) human genome database and the National Center for Biotechnology Information (NCBI) database.
  • one or more mass spectra are obtained from a peptide (the "experimental peptide") that is to be identified or matched to a peptide sequence in a database.
  • the experimental peptide is obtained by selective cleavage of a mixture of polypeptides, for example a mixture containing no more than 50 (preferably no more than 20, more preferably no more than 10, still more preferably no more than 5) polypeptides; alternatively, the experimental peptide is obtained by selective cleavage of a polypeptide that has been isolated free from other polypeptides.
  • Enzymatic cleavage is suitable for this purpose; suitable enzymes include arginine endopeptidase (ArgC), aspartic acid endopeptidase N (aspN), chymotrypsin, glutamic acid endopeptidase C (gluC), lysine endopeptidase C (lysC), V8 endopeptidase and (more preferably) trypsin.
  • Other enzymes with sufficiently restrictive cleavage patterns may also be used and are known in the art.
  • one or more fragmentation mass spectra are obtained from the experimental peptide
  • ladder sequencing may be used to obtain one or more mass spectra as described in U.S. Patent No. 6,271,037, which is incorporated herein by reference.
  • Processes that produce fragmentation useful for generating a fragmentation mass spectrum include but are not limited to, collision-induced dissociation (also known as collision-activated dissociation), post-source decay from laser desorption, surface-induced dissociation, and in-source fragmentation.
  • Ionisation processes which can be used include, without limitation, electrospray ionisation, nanoflow electrospray ionisation, matrix-assisted laser desorption ionisation, plasma desorption ionisation, fast atom bombardment, and field desorption.
  • a mass spectrum can be generated using tandem mass spectrometry or multiple stages of mass spectrometry.
  • a mass spectrum is obtained by linear tandem mass spectrometry, for example using a tandem time-of-flight (TOF-TOF) mass spectrometer.
  • a mass spectrum is obtained by orthogonal mass spectrometry, for example using a quadrupole tandem time of flight (Q-TOF) or Q-STAR mass spectrometer.
  • a first aliquot of a preparation containing one or more peptides is analyzed with a matrix-assisted laser-desorption ionization mass spectrometer (MALDI-TOF) to determine the mass of one or more peptides; a second aliquot of the preparation is then analyzed with a hybrid mass spectrometer (e.g.
  • FIG. 1 shows an overview of a preferred embodiment of the invention for identification of a polypeptide (e.g. an unknown protein) and its further use for characterization of expressed genomic sequences including their post-translational modifications.
  • the polypeptide is first digested with a specific endoprotease such as trypsin, to cleave the polypeptide into one or more peptide fragments.
  • the polypeptide has been isolated from a polyacrylamide
  • the gel pieces are preferably subjected to in situ proteolysis, for instance using an OGS ChemStation robot and a modification of the manual method described in Page et al., Proc. Natl. Acad. Sci. 96:
  • digestion of the polypeptide is carried out with trypsin under conditions chosen to achieve thorough trypsinolysis, so as to maximise the number of peptide fragments that contain C-terminal Arginine or Lysine residues.
  • one or more robotically cut gel plugs are washed by adding 50 µl of 100 mM ammonium bicarbonate to each sample. After standing for 10 minutes at ambient temperature, the liquid is removed and acetonitrile (50 µl) is added to each tube. The samples are allowed to stand for 10 minutes at ambient temperature and are manually agitated for 5 minutes, then dried by centrifugal evaporation for 10 minutes with no heating. Acetonitrile (50 µl) is again added to each tube, the samples allowed to stand for 10 minutes at ambient temperature and then manually agitated for 5 minutes, followed by drying by centrifugal evaporation for 10 minutes with no heating.
  • Trypsin cleaves specifically at the carboxyl side of lysine (Lys) and arginine (Arg) residues, so that the resulting tryptic digest fragments should have a Lys or Arg as the C-terminal amino acid, unless the peptide fragment was obtained from the C-terminal end of the peptide.
  • the amino acid in the intact polypeptide that, prior to cleavage, directly preceded the N-terminal amino acid of the peptide fragment should also be a Lys or Arg, unless the peptide fragment was obtained from the N-terminal of the peptide.
  • the mixture of peptide fragments (experimental peptides) obtained from digestion of individual polypeptides (or mixtures of polypeptides) can be analysed by mass spectrometry without any prior separation (as shown in FIG. 1, step 2) or can optionally be separated into individual experimental peptides using known chromatographic methods.
  • the experimental peptides are initially analysed using matrix-assisted laser-desorption time-of-flight mass spectrometry with delayed extraction and a reflectron in the time-of-flight chamber (MALDI-TOF). This instrument configuration is used to generate a primary mass spectrum in order to determine the molecular weight of the experimental peptide, preferably with an experimental error of 100 parts-per-million (ppm) or less.
  • Accurate measurement of peptide masses in the primary mass spectrum advantageously increases the specificity of the mass-constrained database searches used in subsequent steps of a preferred embodiment of the present invention. (See, e.g., FIG. 1, step 5).
  • Other mass spectrometric techniques capable of mass measurement within an error of 100 ppm or less include, without limitation, time-of-flight, Fourier transform ion cyclotron resonance, quadrupole, ion trap, and magnetic sector mass spectrometry and compatible combinations thereof.
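The parts-per-million error used to constrain mass matches can be illustrated with a short sketch. The 100 ppm default follows the figure quoted above; the function names are hypothetical.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts-per-million relative to the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed, theoretical, tol_ppm=100.0):
    """True when the observed mass lies within tol_ppm of the theoretical mass."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


# At 1000 Da, 100 ppm corresponds to a window of +/- 0.1 Da.
close = within_tolerance(1000.05, 1000.0)   # about 50 ppm
far = within_tolerance(1000.2, 1000.0)      # about 200 ppm
```

Because the tolerance scales with mass, a tighter ppm figure shrinks the number of database peptides that fall inside the window, which is the specificity gain the passage describes.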
  • the fragmentation mass spectrum is determined for a parent ion having m/z greater than or equal to 850, e.g. as determined in a primary mass spectrum.
  • a resolution of better than 4000 (peak width at half maximum height) and an accuracy of mass measurement of at least 50 ppm (parts-per-million) is used.
  • Tandem mass spectrometry may be carried out on a doubly protonated parent ion ([M + 2H]²⁺), although the method can be performed on parent ions of other charge states, e.g., [M + H]⁺ or [M + 3H]³⁺.
  • a Q-TOF mass spectrometer is used with the quadrupole mass analyser set to allow transmission of ions with an m/z equal to that of the doubly protonated peptide ion ([M + 2H]²⁺) deduced from the singly charged peptide ion ([M + H]⁺) observed in a primary mass spectrum obtained by MALDI-TOF analysis.
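Deducing the m/z of a multiply protonated ion from the singly charged ion seen in the primary spectrum is simple arithmetic, sketched below. The proton mass constant is a standard value; the function name is hypothetical.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz_from_singly_charged(mh_plus, charge):
    """Convert the m/z of the singly protonated ion to the m/z expected at another charge state."""
    neutral_mass = mh_plus - PROTON_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge


# A singly protonated ion observed at m/z 1001.007276 has a neutral mass of 1000 Da.
doubly_charged_mz = mz_from_singly_charged(1001.007276, 2)
```

The result is the value the quadrupole would be set to transmit when selecting the doubly protonated parent ion.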
  • the transmitted ions are termed 'parent' or 'precursor' ions.
  • the peptide ion beam passes into the collision cell where the parent ions are subjected to low energy CID. This can be achieved through the application of a voltage on the collision cell and/or by the introduction of an inert gas.
  • the timed ion selector is preferably set to capture ions in a high energy collision cell at m/z equal to that of the singly charged peptide ion ([M + H]⁺).
  • fragmentation occurs both across the peptide backbone, giving rise to N-terminally charged ions (a, b and c ions) and C-terminally charged ions (x, y and z ions), and also across the side chains, giving rise to d and w ions.
  • Fragmentation (MS/MS) spectra are typically represented by a two-dimensional graph with ion intensity on the y-axis, and mass-to-charge ratio (m/z) on the x-axis.
  • One or more fragmentation spectra are subjected to computer-mediated analysis to identify one or more peaks and prepare a first peak list.
  • Techniques for computerized recognition of peaks are known in the art and include pattern recognition and linear interpolation (Sukharev, Y. N. and Nekrasov, Y. S. (1976) The computer processing and interpretation of mass spectral information. Org. Mass Spect. 11, 1232-1238; Klimowski, R. J. et al. (1970) A small on-line computer system for high resolution mass spectrometers. Org. Mass Spect. 4, 17-39, which are incorporated by reference).
  • a mass spectrum is acquired as an array of numerical values representing m/z and signal intensity (the "raw spectrum”).
  • Signals arising from sources other than the analyte(s) of interest are well documented. These include electronic disturbance (electronic noise) and signals from the sample matrix (chemical noise).
  • Methods for converting raw spectral data, represented as a series of peaks within x,y coordinates, into a subset of m/z and intensity values useful for implementing computer-mediated spectral interpretation have been described for a variety of mass spectrometers and sample types.
  • the successful implementation of an algorithm to deduce peptide sequences from fragmentation spectra of varying quality depends on the methodology which optimizes the choice of peaks from the raw mass spectrum. Preferably, only peaks having a signal-to-noise ratio > 2 are considered.
  • an intensity threshold is applied to the raw spectrum and values with intensity ⁇ 2 are removed; a median filter is applied to the intensity values using at least 3 points to identify a peak (i.e. a smoothing function is used).
  • a TOF-TOF mass spectrometer (ABI, Framingham, MA)
  • an intensity threshold is calculated which excludes a defined fraction (say 80%) of the data points in the spectrum; peak picking from the data in the spectrum is then performed for data points that lie above this intensity threshold.
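One way to compute a threshold that excludes a defined fraction of data points is sketched below. It assumes the raw spectrum is a list of (m/z, intensity) pairs and reads "lie above" as a strict comparison; both choices, and the function names, are assumptions for illustration. Peak picking itself is not shown.

```python
def intensity_cutoff(intensities, excluded_fraction=0.8):
    """Return the intensity at or below which the given fraction of points falls."""
    ordered = sorted(intensities)
    n_excluded = int(round(excluded_fraction * len(ordered)))
    if n_excluded == 0:
        return float("-inf")  # nothing is excluded
    return ordered[n_excluded - 1]


def candidate_points(points, excluded_fraction=0.8):
    """Keep only (m/z, intensity) points strictly above the cutoff."""
    cutoff = intensity_cutoff([i for _, i in points], excluded_fraction)
    return [(mz, i) for mz, i in points if i > cutoff]


# Ten data points with intensities 1..10; excluding 80% leaves the top two.
raw = [(100.0 + k, float(k)) for k in range(1, 11)]
kept = candidate_points(raw)
```

When several points share the cutoff intensity, slightly more than the requested fraction is excluded; a production routine would need to decide how to treat such ties.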
  • Median values are assigned for the m/z of each peak, using a peak top method.
  • the list of median peak values is edited by one or more of the following computer-mediated procedures.
  • the list can be computer read to identify two or three median peak m/z values that differ by 1 Dalton (atomic mass unit). Where such a cluster is found, the lowest m/z value is retained and the one or two higher m/z values (which are taken to arise from isotopic peptides containing one or two ¹³C atoms) are removed.
  • the charge (z) of the product ion is determined by analyzing the fragmentation spectrum (e.g., by determining the m/z separation of the paired ¹²C and ¹³C isotopic variants for that peak); once the charge (z) is known, the mass (m) can be readily determined from the measured mass-to-charge (m/z) ratio.
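The charge determination from isotope spacing can be sketched as follows: adjacent isotopic peaks of an ion with charge z are separated by roughly 1/z on the m/z axis. The mass difference constant is a standard value; the function names and the upper charge limit of 4 are assumptions.

```python
C13_C12_DIFF = 1.003355  # Da, mass difference between 13C and 12C


def charge_from_isotope_spacing(mz_mono, mz_next, max_charge=4):
    """Pick the charge whose expected isotope spacing best matches the observed spacing."""
    spacing = mz_next - mz_mono
    return min(range(1, max_charge + 1),
               key=lambda z: abs(spacing - C13_C12_DIFF / z))


def ion_mass(mz, z):
    """Mass (m) of the charged ion recovered from its m/z ratio and charge."""
    return mz * z


z = charge_from_isotope_spacing(500.0, 500.5)  # spacing of about 0.5 implies z = 2
m = ion_mass(500.0, z)
```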
  • the fragmentation mass spectrum is represented as a data file comprising a list of m/z values of a plurality of peaks identified in the spectrum. These may be arranged in any desired order (e.g., ascending or descending order of m/z values).
  • the data file also contains the intensities of the peaks for which m/z values are listed, as well as sufficient information to determine the mass of the parent ion (e.g. the mass of the singly or doubly protonated precursor ion and charge state of the ion).
  • a preferred method for obtaining a peak table from the peptide fragmentation spectrum is shown in FIG. 2. The peak values extracted from the fragmentation mass spectrum (e.g. paired m/z and intensity values for peaks identified according to the 'peak-picking' algorithm described above) are subjected to filtering and/or editing to generate a peak table.
  • This may be implemented by representing the raw extracted values in a Raw Peak Table, which is then subjected to computerized editing to generate an Edited Peak Table.
  • the process of peak picking may be combined with a process of filtering in accordance with previously ordained criteria, circumventing the need for a Raw Peak Table.
  • a preferred editing or filtering process may comprise one or more of the following functions in any desired combination and preferably includes any two or all three of these functions: (a) eliminating values due to ¹³C isotopes ("deisotoping"), (b) removing a 2 Da parent ion window, and (c) removing peaks falling below a threshold value.
  • An input value for the mass resolution of the mass spectrometer used to obtain the spectrum is also preferably obtained.
  • the Raw Peak Table is edited to eliminate certain peaks.
  • deisotoping is performed by identifying clusters of two or three peaks that are spaced approximately 1 Da apart, where the peak with the lowest m/z value is interpreted as arising from ions containing the more commonly occurring ¹²C isotope, with the one or two peaks with higher m/z arising from rarer ions containing ¹³C.
  • only the lowest m/z peak in such a cluster is retained and the one or two higher m/z peaks in the cluster are eliminated from the peak table if those higher m/z peaks are of lower intensity than the lowest m/z peak in the cluster.
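The deisotoping rule can be sketched as below, assuming singly charged product ions (1 Da spacing) and a hypothetical matching tolerance of 0.05 Da; neither value nor the function name comes from the source. A higher m/z peak is dropped only when it is less intense than the lowest m/z peak of its cluster.

```python
def deisotope(peaks, spacing=1.0, tol=0.05):
    """Remove up to two 13C satellite peaks per cluster from a list of (m/z, intensity)."""
    peaks = sorted(peaks)
    removed = set()
    for i, (mz0, int0) in enumerate(peaks):
        if i in removed:
            continue  # a satellite cannot start a new cluster
        expected = mz0
        for _ in range(2):  # clusters contain at most two satellites
            expected += spacing
            match = next((j for j in range(i + 1, len(peaks))
                          if abs(peaks[j][0] - expected) <= tol), None)
            if match is None:
                break
            if peaks[match][1] < int0:
                removed.add(match)
    return [p for idx, p in enumerate(peaks) if idx not in removed]


table = [(500.0, 100), (501.0, 60), (502.0, 20), (600.0, 50), (601.0, 80)]
edited = deisotope(table)
```

The peak at m/z 601.0 survives because it is more intense than its lower neighbour, so it is not treated as an isotopic satellite.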
  • the peak representing the parent ion is removed from the peak table.
  • a window of m/z 0.5 below and 1.5 above the m/z of the precursor ion is calculated and peaks falling within this window are removed.
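Removing the 2 Da parent ion window is a one-line filter, sketched here with hypothetical names. Treating the window boundaries as inclusive is an assumption, since the text does not say.

```python
def remove_precursor_window(peaks, precursor_mz, below=0.5, above=1.5):
    """Drop peaks lying between precursor_mz - below and precursor_mz + above."""
    low = precursor_mz - below
    high = precursor_mz + above
    return [(mz, inten) for mz, inten in peaks if not (low <= mz <= high)]


table = [(499.4, 1), (499.6, 2), (500.0, 3), (501.4, 4), (501.6, 5)]
edited = remove_precursor_window(table, precursor_mz=500.0)
```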
  • an intensity thresholding function is performed in order to decrease the risk of spurious spectral interpretation and to speed the computational process.
  • a preferred embodiment of the thresholding procedure is as follows: equation (6) is used to compute the integer "ideal" number of peaks (Pideal) which we would expect in a spectrum with a given precursor ion mass, M.
  • the physical justification for using this equation is that we expect on average (M/100)+1 peaks due to the primary ion series (the y-ions), estimating the mean molecular weight of an amino acid residue at about 100 Da. Since three types of sequence ions are considered (y, b, and a ions) a multiplier of 3 is placed in equation (6). In reality more peaks will be present from other types of fragments (e.g., due to internal fragmentation of the peptide and decay of ions into additional alternative decay products).
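Equation (6) itself is not reproduced in this excerpt. A form consistent with the justification given (about M/100 + 1 peaks per ion series, times three series) would be the following; the exact rounding to an integer is an assumption.

```latex
P_{\mathrm{ideal}} = 3\left(\frac{M}{100} + 1\right)
% Worked example, assuming M = 1200 Da:
% P_ideal = 3 * (12 + 1) = 39 peaks
```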
  • the initial intensity threshold is then set to any desired starting value, preferably to a desired signal to noise ratio or to the detector response or some multiple thereof.
  • the thresholding algorithm then calculates the number of peaks that would remain in the spectrum if all peaks of intensity less than or equal to the set threshold are removed. If this number of peaks is greater than the ideal number, the intensity threshold value is incremented by some arbitrary value and the calculation repeated. This procedure continues until the number of peaks remaining after application of the intensity threshold is less than or equal to the ideal number. The threshold value is then decremented by one step. The purpose of this is to produce a spectrum which has a larger number of peaks than the "ideal" number, but where most of the low intensity peaks have been removed. This threshold level is then applied to the peaks in the peak table and those peaks with intensity less than or equal to the threshold value are excluded.
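  • The iterative thresholding procedure can be sketched as below. This is a hedged illustration: the form of the ideal peak count, 3 × ((M/100) + 1) truncated to an integer, is read from the verbal description of equation (6) rather than from the equation itself, and the function names, the `start` and `step` parameters, and the guard that keeps the result from falling below the starting value are assumptions.

```python
def ideal_peak_count(precursor_mass):
    # Assumed reading of equation (6): three ion series (y, b, a), each
    # contributing about (M/100) + 1 peaks.
    return int(3 * (precursor_mass / 100.0 + 1))


def find_threshold(intensities, precursor_mass, start, step):
    """Raise the threshold until at most P_ideal peaks remain, then step back once."""
    p_ideal = ideal_peak_count(precursor_mass)

    def remaining(threshold):
        # Peaks with intensity <= threshold are treated as removed.
        return sum(1 for x in intensities if x > threshold)

    threshold = start
    while remaining(threshold) > p_ideal:
        threshold += step
    # Step back by one increment; never go below the starting value (assumption).
    return max(start, threshold - step)


def apply_threshold(peaks, threshold):
    # Exclude (mz, intensity) peaks with intensity <= threshold.
    return [(mz, inten) for mz, inten in peaks if inten > threshold]
```

  • With ten peaks of intensity 1 to 10 and a precursor mass of 100 (ideal count 6), the sketch settles on a threshold of 3, leaving seven peaks, i.e. slightly more than the ideal number.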
  • the threshold value determined by iteration may be used for filtering or editing fragmentation mass spectra obtained from the same or a similar apparatus.
  • a threshold value can be determined by calculation, by trial and error, by analysis of the relevant scientific literature, or chosen arbitrarily.
  • the invention provides a process of filtering or editing that comprises removing one or more peaks having an intensity less than a previously established threshold value.
  • the result of this editing process is to remove one or more peaks so as to produce an Edited Peak List.
  • Preferably, the editing process occurs in accordance with previously ordained criteria, without the intervention of a human operator.
  • the fragmentation spectrum and/or peak table can be displayed to a human operator for interactive editing.
  • the Edited Peak List is then subjected to further analysis in order to identify one or more candidate search strings.
  • the deduced sequence(s) may be the complete sequence of the experimental peptide, but more typically may be one or more partial amino acid sequences within the experimental peptide.
  • the spectral read analysis is preferably performed on a peak list, more preferably on an Edited Peak List or on a peak list that was produced by a process that comprises filtering to exclude one or more peaks, as described above.
  • the deduced sequence(s) can then be used to construct search strings suitable for searching a database of peptide sequences.
  • the spectral read process comprises: (1) iteratively determining mass differences between peaks in the peak table that correspond to masses of amino acid residues recognized by the algorithm ("recognized amino acid residues") in order to deduce one or more peptide sequences within the parent ion; and (2) selecting one or more of the deduced peptide sequences for further analysis, e.g., for construction of a search sequence and/or for database mapping, as described herein. If this de novo analysis yields more than one deduced peptide sequence, the spectral read algorithm preferably performs a ranking process so as to prioritize the deduced peptide sequences according to previously ordained criteria, for example by producing a ranked list of deduced peptide sequences.
  • a computer-mediated algorithm operates on a peak list (preferably an Edited Peak List) by: (a) selecting a peak as a starting point and determining one or more mass differences in the peak list corresponding to the mass of a recognized amino acid residue; (b) sequentially determining each subsequent mass difference in the peak list that corresponds to the mass of a recognized amino acid residue; and (c) repeating the process of steps (a) and (b) for additional peaks so as to obtain a set of deduced sequences.
  • the peak selected as a starting point is a high mass peak, for example the peak having the highest mass in the peak table. The sequential determination of step (b) may be repeated until every mass difference between peaks has been investigated.
  • the algorithm may establish a maximum length for the deduced peptide sequence by terminating the sequential determination process once a deduced peptide sequence contains a previously established number of amino acid residues.
  • the maximum length for the deduced peptide sequence can be set at any desired value, such as 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 amino acid residues and so on, up to the number of amino acid residues in the parent ion.
  • a maximum length of 3 is especially preferred, resulting in an algorithm that interprets the fragmentation mass spectrum to deduce tripeptide sequences.
  • the spectral read algorithm can consider peaks from the entire fragmentation mass spectrum.
  • analysis can be confined to peaks found in one or more segments of the spectrum. This can be accomplished either by excluding peaks during the editing or filtering process that generates the peak list used for the spectral read, or by excluding peaks in the peak table from analysis during the spectral read.
  • spectral read analysis is confined to peaks from that part of the spectrum representing m/z values greater than that of the doubly protonated precursor ion or includes only a predetermined number of peaks having an m/z value less than that of the doubly protonated precursor ion (e.g.
  • the spectral read algorithm preferably considers peaks with m/z from that of the molecular ion down to half that of the molecular ion, and/or peaks in a low m/z window (e.g., 300-500).
  • the product ions used for determination of the amino acid sequence are contained in the y-ion series.
  • a b-ion series or a combination of both y-ions and b-ions may be used.
  • about five to about fifty product ions are selected as starting-point peaks for the sequential determination of amino acid residue mass differences.
  • Recognized amino acid residues may include residues of the twenty naturally occurring amino acids (or any desired subset of them) and preferably also include residues of amino acids altered during sample preparation and/or analysis (e.g. oxidized methionine and carbamidomethyl cysteine). Residues of amino acids altered by one or more natural or synthetic post translational modification (PTM) may also be included in the set of recognized amino acid residues. Such PTMs preferably include alkylation, phosphorylation, sulfation, oxidation or reduction, ADP-ribosylation, hydroxylation, glycosylation, glucosylphosphatidylinositol addition, ubiquitination, and artificial modification (e.g. biotinylation, cross-linking, and photoaffinity labeling). This is readily achieved by including the modified amino acid residues in the set of masses that are considered in determining whether a mass difference between two peaks corresponds to the mass of a recognized amino acid residue.
  • the recognition process may be implemented by means of a table representing the identity and masses of recognized amino acid residues; in a preferred embodiment, the monoisotopic masses are used for this purpose. See, e.g., Table 1.
  • amino acid masses are specified up to 5 decimal places; any experimental uncertainty in their mass is then negligible compared to the uncertainty in the experimentally obtained m/z values.
  • minimum and maximum values need to be computed for the m/z differences observed in the fragmentation spectrum.
  • the experimental uncertainty is assumed to result in a normal statistical distribution.
  • the given m/z value from the peak table is then deemed to be the mean value for the peak, and the experimental error of the instrument is deemed to be equal to the standard deviation from the mean.
  • the experimental uncertainty is then assigned as being ± 2 standard deviations from the mean value, i.e. the 95% confidence limit of the normal distribution.
  • a mass range that incorporates the uncertainties in both m/z values is calculated which is dependent on the mass resolution of the type of instrument used.
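  • A minimal sketch of testing whether the difference between two peaks corresponds to a recognized residue is given below. The residue table is an abbreviated set of standard monoisotopic residue masses (Ile and Leu share one entry), and the way the two ± 2 standard deviation windows are combined (simple addition, i.e. a worst-case range) is an assumption of this sketch rather than the formula of the disclosure; the function name is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; L also stands for I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def match_residues(mz_high, mz_low, sigma):
    """Return residues whose mass lies within the uncertainty of the difference.

    Each peak is given a +/- 2*sigma window; the two windows are simply added
    to get the half-width of the allowed range (assumed combination rule).
    """
    diff = mz_high - mz_low
    half_width = 2 * sigma + 2 * sigma
    return sorted(aa for aa, mass in RESIDUE_MASS.items()
                  if abs(diff - mass) <= half_width)
```

  • With `sigma=0.01` a difference of 128.07 matches both K and Q, which illustrates why near-isobaric residues need the special handling discussed elsewhere in this description.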
  • the minimum instrument resolution desired for interpretation of peptide fragmentation spectra may be determined by the following calculation.
  • the mass spectrometer used (for example, a Q-TOF instrument)
  • each peak has a finite width and this width acts as the error for an individual peak. This error is compounded for the difference in m/z values of two peaks.
  • ΔE from the components E1 and E2 due to the two peaks:
  • if condition (5) is met for any mass, M, within a mass spectrometer of resolution, R, then the algorithms described herein will be able to unambiguously interpret the spectrum.
  • the fragmentation mass spectrum preferably has a resolution of at least 5600 (full width half peak height) for peptides with a molecular weight up to about 4000 daltons.
  • the first method is based on the total number of ions (e.g., y, b and a ions) in the fragmentation spectra which match the deduced sequences.
  • the second method uses the sum of individual ion intensities which correspond to the signals assigned to ions (e.g., y, b or a ions) in the fragmentation spectra.
  • a deduced sequence which has a higher value of summed ion intensities is considered more likely than those of lower summed intensities.
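  • The two ranking quantities described above (number of matching ions and summed intensity of matching ions) can be sketched as follows. The function name, the tolerance `tol`, and the choice of taking the most intense peak when several fall within tolerance are assumptions; the predicted ion m/z values for a deduced sequence are taken as given.

```python
def ion_scores(predicted_mzs, peaks, tol=0.05):
    """Return (matched ion count, summed matched intensity) for one sequence.

    predicted_mzs: m/z values expected for the deduced sequence (e.g. y, b, a ions).
    peaks: list of (mz, intensity) pairs from the fragmentation spectrum.
    """
    count, total = 0, 0.0
    for target in predicted_mzs:
        matches = [inten for mz, inten in peaks if abs(mz - target) <= tol]
        if matches:
            count += 1
            total += max(matches)  # assumed tie-break: most intense match
    return count, total
```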
  • the values of the whole spectrum can be transformed into a vector quantity T.
  • An array of arbitrary size (say, 32,000) is assigned and filled with the value zero.
  • the value of intensity is assigned to the
  • the scaling factor 16 being used to quantize the mass into bins 1/16 Da wide. This quantization is finer than the resolution of the instrument, so that there is no loss of data, which could occur if two similar masses were assigned to the same position in the vector array.
  • the array size of 32,000 is sufficient for a spectrum ranging from 0 to 2,000 Da. Both the array size and the scaling factor can be increased to cope with
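  • A hedged sketch of building such a spectrum vector is shown below. The array size and scaling factor come from the text; the exact index rule (truncating m/z × 16 to an integer) and the function name are assumptions, since the assignment rule is not fully reproduced here.

```python
def spectrum_to_vector(peaks, size=32000, scale=16):
    """Quantize (mz, intensity) peaks into a zero-filled vector of 1/16 Da bins."""
    vector = [0.0] * size
    for mz, intensity in peaks:
        index = int(mz * scale)  # assumed index rule
        if 0 <= index < size:    # peaks outside 0 to 2,000 Da are ignored
            vector[index] = intensity
    return vector
```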
  • a hit quality index is defined as:
  • T_m is the mean centered total spectrum, calculated from the original total spectrum vector T:
  • the quality index will have the value of 0 if there is no match between the total spectrum and the fragment spectrum, and 1 if the fragment spectrum is identical to the total spectrum.
  • the ranked deduced sequences can then be subjected to a process of selection.
  • the highest ranking deduced sequence, or a predetermined number of deduced sequences having the highest rank go forward for further analysis.
  • a predetermined number of deduced sequences having the highest rank are selected and compared with one another to determine whether a peptide sequence is shared by all (or a specified percentage) of them. If such a shared sequence (a "consensus sequence") is found, it goes forward for further analysis. Additional selection criteria can also be applied at any stage of the spectral read or selection process to tailor the sequences to the database that is to be searched.
  • deduced sequences are excluded that are incompatible with the selective cleavage procedure that was applied to the peptide under analysis; for a tryptic peptide, deduced sequences can be excluded that contain an internal lysine or arginine residue unless it is immediately followed by a proline residue towards the carboxyterminal side of the peptide.
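  • The tryptic compatibility rule can be sketched as a simple filter. The function name is hypothetical, and sequences are assumed to be written from the N-terminal to the C-terminal end in one-letter code.

```python
def compatible_with_trypsin(sequence):
    """True unless an internal K or R is not immediately followed by P.

    The last residue is not "internal", so a C-terminal K or R is allowed.
    """
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            return False
    return True
```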
  • all peptide sequences deduced by the spectral read algorithm are stored for use in obtaining search strings which preferably are tailored to suit the size, content and characteristics of the database to be searched.
  • HOPS Holistic Protein Sequencing
  • TESLA Triplemer Signature Lookup Algorithm
  • the spectral read algorithm in HOPS was designed for de novo identification of sequences from fragmentation mass spectra.
  • HOPS uses the constraint of passing only sequences in which all y, b and a ions can be accounted for in the fragmentation mass spectrum; in another embodiment, this constraint is not imposed.
  • HOPS produces a list comprising one or more identified peptide sequences, which may be of differing lengths; these are then ranked and the top-ranking sequences used to determine a consensus sequence.
  • vectorial ranking is used and the set of sequences with scores greater than or equal to the score of the top-ranked sequence minus 0.03 are used to determine a consensus sequence.
  • An overview of one embodiment of the HOPS process which produces (via consensus sequence intermediates) sequences for constructing each database search string is shown in FIG. 4.
  • the sequencing algorithm within HOPS calculates m/z differences between peaks as though they represent masses of the 20 naturally occurring amino acid residues or residues modified by post-translational processing.
  • the HOPS algorithm incorporates the following components:
  • This object incorporates an expandable array of m/z and intensity paired values sorted in order of increasing m/z.
  • This object contains several expandable arrays:
  • the walk object contains a floating point value representing the score of the sequence contained in that walk object.
  • the 'Stack': this is an expandable array of walk objects. Each walk object is identified by an index number which is its relative position on the stack.
  • Amino acid object: this contains a list of the amino acid masses applicable in any study. This may be confined to the 20 common naturally occurring amino acids (listed in Table 1), or may include masses corresponding to modifications of these amino acids caused by post-translational modifications. The object also contains an identifying symbol for each of the amino acids.
  • the algorithm maintains a variable pointing to the index number of the walk currently under consideration.
  • all sequence reads from HOPS are kept as possibilities without any pruning or rejection. All possibilities are later reviewed and ranked, and the output sequence is deduced through a consensus process.
  • the HOPS method is implemented to obtain highly specific sequence information to be used to search databases comprising proteins, polypeptides, peptides or conceptual polypeptides translated from nucleotide sequences, or any combination thereof.
  • the HOPS method is not limited to use with database searching and can also be used as a method for the interpretation of fragmentation mass spectra of peptides without any application of the resulting sequence information to database searching.
  • the sequencing loop of the program is then invoked.
  • a description of the steps involved in the process is shown in the flow chart in FIG. 5.
  • a select number of the highest m/z values remaining in the edited Peak Table object (typically twenty are selected) are then used to create Walk Objects and are placed on the Stack.
  • the set of the twenty starting m/z walk objects selected above the doubly charged ion will include y ions, and the walking process for each ion is carried out on the basis that we are starting with a y ion and walking down to a lower m/z y ion.
  • the first walk object from the stack is copied into a Walk Object known as the 'Current Walk'.
  • the 'walk down' stage then proceeds.
  • the lowest m/z value in the Current Walk, M, is determined (for the very first walk this will be the starting m/z value).
  • the program searches through the peak table for all m/z values lower than this value and tests whether the difference between the two m/z values corresponds to any of the amino acid residue masses defined in the amino acid object.
  • the two m/z ions spaced apart by the mass difference corresponding to an amino acid residue are assumed to be two consecutive y ions. If this is the case, then this value is a possible amino acid to add to the sequence defined in the Current Walk.
  • the program tracks the number of possible permutations. If there is only one possibility, then the m/z value of this possibility is added to the current walk, and the updated current walk is then resubmitted to the 'walk down' stage for further processing. If there is more than one possibility, the Current Walk is cloned as many times as necessary, and the appropriate m/z and sequence information is added to the clones, and these clones are added to the end of the stack object. If no possibilities exist, then the Current Walk has terminated at that position, and is copied back into its original position in the Stack. The stack counting index, which refers to the position of the Current Walk, is incremented.
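  • The 'walk down' stage and the stack handling described above can be sketched as follows. This is a simplified illustration, not the HOPS implementation: walk objects are plain dictionaries, the residue table is a small subset, the tolerance `tol` is an assumed value, and the function name is hypothetical. As in the text, a walk with a single possible extension is extended in place, a walk with several possibilities is cloned onto the end of the stack, and a walk with none terminates.

```python
# Small illustrative subset of monoisotopic residue masses (Da).
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
            "V": 99.06841, "L": 113.08406}


def walk_down(peak_mzs, start_mzs, tol=0.02):
    """Return all walks (m/z path plus deduced residues) without pruning."""
    peaks = sorted(peak_mzs)
    stack = [{"mzs": [m], "seq": ""} for m in start_mzs]
    index = 0  # position of the Current Walk on the stack
    while index < len(stack):
        current = stack[index]
        while True:
            lowest = current["mzs"][-1]
            # Every lower peak whose spacing matches a residue mass is a
            # possible next y ion.
            options = [(p, aa) for p in peaks if p < lowest
                       for aa, mass in RESIDUES.items()
                       if abs((lowest - p) - mass) <= tol]
            if not options:
                break  # walk terminated at this position
            if len(options) == 1:
                p, aa = options[0]
                current["mzs"].append(p)
                current["seq"] += aa
                continue  # resubmit the updated walk
            for p, aa in options:  # several possibilities: clone onto the stack
                stack.append({"mzs": current["mzs"] + [p],
                              "seq": current["seq"] + aa})
            break
        index += 1
    return stack
```

  • For example, with peaks at 500.0, 442.97854, 428.96289 and 371.94143 and a single starting peak at 500.0, the first step is ambiguous (G or A), so two clones are created and the sketch returns walks reading "GA" and "AG" alongside the unextended starting walk.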
  • the m/z values present in a particular walk object can arise from y or b ions.
  • a list of sequences derived from the fragmentation spectrum is produced; in one embodiment, the list comprises all possible sequences that can be derived by the algorithm.
  • the next step is to rank the sequences in order to identify those which are most likely to be correct, using ion count ranking, ion intensity ranking, and/or vectorial ranking as described above.
  • FIG. 7 shows an embodiment of the steps used to choose a set of peptide sequences derived from a y-ion correlation analysis of the sequences returned from the ranking method.
  • the consensus sequence is defined as the common partial sequence within the set of peptide sequences that satisfy the prescribed ranking criteria.
  • the consensus sequences are determined by calculating the frequency of occurrence of each y ion signal across the set of top ranked Walk Objects. Walk Objects which are supported by y ion signals for each residue are placed into the consensus set.
  • An amino acid consensus sequence is produced based on mass differences between the common set of sequential y ions which are related by the masses of the 20 naturally-occurring amino acids or their derivatives produced during sample preparation and/or analysis.
  • if the output of the ranking and y-ion correlation process yields a single sequence, it is used as the sole amino acid sequence for the candidate sequence set. If the consensus sequences are not the same length, in one embodiment the longest one is selected and used to construct a search string. For sequences of the same length, in one embodiment preference is given to those deduced from y ions with m/z values that are of greater value than the m/z value of the doubly-charged precursor ion.
  • spectra from different instruments and samples prepared by diverse methods vary in the levels of instrument and chemical noise, both across the full m/z range (upper limit is defined by the mass of the precursor ion) and within defined m/z regions. Further, contamination of the fragmentation spectrum with ions other than those of the precursor peptide ion of interest may contribute significantly to the ions observed at m/z values below the doubly charged precursor ion. As shown in FIG. 7, the HOPS method can use multiple criteria to select sequences for candidate sequence selection to compensate for differences among samples and instruments.
  • using the y ions above the doubly charged precursor ion for certain fragmentation spectra may increase the fidelity of spectral sequence reads since contaminating fragment ions from singly-charged, unrelated peptides having the same mass (± 2 Da) as the precursor ions would not be present.
  • HOPS produces a consensus sequence of no prescribed length.
  • TESLA interprets the fragmentation mass spectrum to deduce sequences having a uniform, previously ordained length; detection of tripeptide sequences is especially preferred.
  • Establishing a fixed length for the deduced sequences permits other constraints for a search string to be relaxed and thereby facilitates the successful construction of search strings from poorer quality fragmentation mass spectra.
  • TESLA performs spectral interpretation by pattern recognition, e.g. by performing one or more logical comparisons (preferably logical AND comparisons) between an array representing a mass spectrum and one or more arrays representing peptide signatures to be detected if present in the mass spectrum.
  • TESLA performs a spectral read by a process that comprises: (a) creating a set of signature arrays containing the mass spectral signatures of a set of peptide sequences (preferably, a set of sequences of identical length, most preferably a set of trimeric peptide sequences); (b) creating a spectral array representing peaks identified in a fragmentation mass spectrum; and (c) performing pattern recognition by comparing the spectral array to the set of signature arrays (e.g. using a logical AND function) to determine whether the mass spectrum contains the signature of a peptide sequence in the set.
  • each array comprises a plurality of bits (which may be conceptualized as bins) and each individual bin represents a defined range of assigned mass values (equated to m/z values) in a fragmentation spectrum, so that the bins collectively represent all (or any desired segment) of the x axis in a fragmentation mass spectrum.
  • each bin in an array can be full (the bit is set, i.e., is non-zero) or empty (the bit is zero).
  • a peak picking algorithm identifies one or more peaks in the array to construct a peak list, which may optionally have been filtered or edited, as described above. For each peak in the peak list, a corresponding bit is set. Thus, for example, if a peak list represents peaks having mean m/z values 112, 146, 255 and 450 and the array has bins 1 Dalton wide, then the bits with assigned numbers ("baryon numbers") 112, 146, 255 and 450 are set while the other bits are zero.
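  • The construction of the spectral bit array can be sketched with a Python integer used as an arbitrarily long bit field. The function names are hypothetical, and rounding each m/z to the nearest bin is an assumption of this sketch.

```python
def peaks_to_bits(peak_mzs, bin_width=1.0):
    """Set one bit per peak; bit n represents the bin with baryon number n."""
    bits = 0
    for mz in peak_mzs:
        bits |= 1 << int(round(mz / bin_width))
    return bits


def bit_is_set(bits, baryon_number):
    return (bits >> baryon_number) & 1 == 1
```

  • Applied to the example peak list with mean m/z values 112, 146, 255 and 450 and 1 Dalton bins, exactly four bits are set.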
  • a set of arrays (a "library" of arrays) is created in which all peptides are represented that are to be recognized if they occur in the fragmentation mass spectrum.
  • a library of arrays is needed that collectively represent all possible permutations of three amino acid residues independently chosen from a universe of 20. (To represent all possible tetramers, the library would have 20⁴ arrays.)
  • the library includes peptides comprising derivatives produced during sample preparation and/or analysis from the naturally occurring amino acids. If desired, the library can be expanded to represent trimers comprising other amino acid residues, such as those resulting from one or more post-translational modifications.
  • the size of the library is diminished by treating isobaric amino acid residues (e.g. Ile and Leu) as a single residue (this can be done for one or more sets of isobaric residues) and/or by excluding permutations that are incompatible with the endopeptidase used to produce the experimental peptide.
  • a signature is disallowed if it represents a tripeptide having an Arg or Lys in the first or second position unless the Arg or Lys is immediately followed by a Pro residue; an Arg or Lys in the third position does not result in exclusion.
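  • A minimal sketch of enumerating a constrained tripeptide set is given below. It assumes a 19-letter alphabet in which Ile is merged with Leu, applies only the tryptic rule stated above, and uses hypothetical function names; derivatives and post-translational modifications are left out for brevity.

```python
from itertools import product

# 19 residues: the 20 natural amino acids with I merged into L (assumption).
RESIDUES = "ACDEFGHKLMNPQRSTVWY"


def allowed(tripeptide):
    """Disallow K or R in position 1 or 2 unless immediately followed by P."""
    for pos in (0, 1):
        if tripeptide[pos] in "KR" and tripeptide[pos + 1] != "P":
            return False
    return True


def build_library():
    return ["".join(t) for t in product(RESIDUES, repeat=3) if allowed(t)]
```

  • Under these assumptions the unconstrained set of 19³ = 6,859 tripeptides shrinks to 5,563.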
  • Each tripeptide in the set (or constrained set) to be recognized is represented in an array by setting four bits so that each interval between successive non-zero bits represents the monoisotopic mass of a residue in the tripeptide.
  • bits with baryon numbers (c + x), (c + x + y), and (c + x + y + z) are set, where c is an arbitrary sufficiently small number.
  • the "databits" array representing peaks in the spectrum is now tested for bits corresponding to peaks that represent a first tripeptide sequence.
  • the corresponding array in the "motif" library is compared to the databits array (using a logical AND comparison) to test whether the databits array has non-zero bits exactly matching the four set bits in the motif array; if so, then the tripeptide represented by the motif is present in the spectrum, and a hit is scored for that tripeptide. Then, each full bit in the motifbits array is shifted ("rolled") up by one baryon number and the logical AND comparison with the databits array is repeated.
  • the motif is rapidly and efficiently swept through the entire spectrum represented in the databits array, using the relatively fast bit comparison and bit movement operations, and scoring each hit for that tripeptide.
  • performance is enhanced by first testing whether the first non-zero bit of the motifbits array has a matching non-zero bit in the databits array; if so, the full logical test is carried out, otherwise the motifbits array is immediately rolled to the next position. Once this rolling process (a "bit sweep") is completed through a range corresponding to a previously ordained segment of the fragmentation spectrum that is to be tested, the next array is taken from the motif library and the bit sweep process is repeated.
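  • The signature construction and bit sweep can be sketched with Python integers as bit arrays. This is an illustration under stated assumptions: bins are 1 Da wide, residue masses are given as nominal integers, the four set bits are taken to be c, c + x, c + x + y and c + x + y + z, and the function names are hypothetical.

```python
def signature(residue_bins, c=0):
    """Motif bits for a tripeptide: four bits whose spacings are the residue masses."""
    bits, position = 1 << c, c
    for width in residue_bins:
        position += width
        bits |= 1 << position
    return bits


def bit_sweep(databits, motif, max_shift):
    """Roll the motif up through the spectrum; return every shift that hits."""
    hits = []
    for shift in range(max_shift + 1):
        rolled = motif << shift
        lowest_bit = rolled & -rolled  # first non-zero bit of the rolled motif
        if not (databits & lowest_bit):
            continue  # quick reject: roll straight to the next position
        if (databits & rolled) == rolled:  # logical AND: all four bits present
            hits.append(shift)
    return hits
```

  • For instance, with nominal masses G = 57, A = 71 and S = 87 and spectrum bits set at 200, 257, 328 and 415, sweeping the motif for G, A, S reports a single hit at a shift of 200.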
  • a test is performed to determine whether the associated N-terminal mass is a valid baryon number. The difference between the starting baryon number in the signature and the parent ion mass is determined. If the N-terminal baryon number is less than an arbitrary number (preferably 306), then it is looked up in a list of allowed baryon numbers representing the sum of all combinations of 3 amino acid residues from the universe of permitted residues (e.g. the 20 naturally occurring amino acid residues and derivatives produced during sample preparation and/or analysis). The hit is accepted if, and only if, the N-terminal baryon number is valid.
  • the sum of the intensities of the 4 matching peaks in the databits array is calculated from the peak list and tracked.
  • accepted hits are scored differently depending on the region of the mass spectrum in which they occur. If all the matching peaks have m/z values that are greater than or equal to one half of the parent ion mass, the sequence is designated as a "non-straddle" sequence and accepted as a candidate for constructing a search string (optionally subject to additional constraints). If one or two peaks are below this m/z value, the sequence is designated as a "straddle" sequence and may be rejected but preferably is accepted as a candidate for constructing a search string.
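  • The straddle and non-straddle designation can be sketched as below. The function name and the label returned when more than two peaks lie below the watershed are assumptions; the text only defines the first two categories.

```python
def classify_hit(matching_mzs, parent_mass):
    """Classify a hit by how many of its matching peaks lie below half the parent mass."""
    watershed = parent_mass / 2.0
    below = sum(1 for mz in matching_mzs if mz < watershed)
    if below == 0:
        return "non-straddle"
    if below <= 2:
        return "straddle"
    return "out-of-range"  # assumed label for a case the text does not name
```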
  • the range of the spectrum to be tested is preferably set so that no more than two peaks are below the m/z watershed.
  • the consensus sequence (e.g. from HOPS)
  • the top-ranked sequence or sequences identified by spectral analysis (e.g. by TESLA)
  • the four top ranked sequences found by TESLA are used to form search strings.
  • all straddle sequences that share the two top-ranked scores in the straddle category and all non-straddle sequences that share the two top-ranked scores in the non-straddle category are used to form search strings; if multiple sequences tie for the same score, more than four sequences are used.
  • construction of search strings may be performed by a computer-mediated algorithm that comprises: (a) analyzing a deduced sequence to form a permuted set of search sequences; and (b) constraining the permuted set of search sequences according to previously ordained criteria.
  • search strings are constructed by a peptide search algorithm such as the FIREPEP module described herein (FIG. 9).
  • a "search string" comprises a "search sequence" and "associated mass data" for that search sequence.
  • a search sequence is a peptide sequence that has been deduced by interpreting a fragmentation mass spectrum or that has been derived from a deduced sequence by constraints and/or permutation, as described herein.
  • the search sequence is a tripeptide sequence.
  • the associated mass data comprise any two or more of the following: the N-terminal mass (denoted M1), the C-terminal mass (denoted M2), and the total mass.
  • the N-terminal mass (M1) is the mass that flanks the search sequence on the N-terminal side of the experimental peptide
  • the C-terminal mass (M2) is the mass that flanks the search sequence on the C-terminal end of the experimental peptide. (FIG. 10)
  • the mass of an actual, physical peptide e.g. the experimental peptide
  • the mass of a molecular ion in a mass spectrum further includes the mass of one or more protons, according to its charge state.
  • the total mass may be calculated as the sum of the individual amino acid residues, i.e. without adding the mass of a water molecule or of one or more protons.
  • M1 and M2 are calculated by using the summed masses of the amino acid residues, without including the mass of the water molecule or protons, and this convention is used herein. Alternative conventions may be adopted as a matter of design choice.
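  • Under the residue-sum convention described above, the flanking masses of a search sequence can be sketched as follows. The function name, the small residue table and the example peptide are illustrative assumptions; masses are summed residue masses with no water or proton added.

```python
# Illustrative subset of monoisotopic residue masses (Da).
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
        "V": 99.06841, "L": 113.08406, "R": 156.10111}


def flanking_masses(peptide, start, length=3):
    """Return (N-terminal mass, C-terminal mass, total mass) for a search sequence.

    The search sequence is peptide[start:start + length], written N to C.
    """
    n_terminal = sum(MASS[aa] for aa in peptide[:start])
    c_terminal = sum(MASS[aa] for aa in peptide[start + length:])
    total = sum(MASS[aa] for aa in peptide)
    return n_terminal, c_terminal, total
```

  • For the hypothetical peptide GASVLR and the search sequence SVL, the N-terminal mass is that of G + A (128.05857 Da) and the C-terminal mass is that of the Arg residue (156.10111 Da).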
  • search strings can readily be tailored to the size and other characteristics of the database to be searched and the resolution of the mass spectrometer.
  • a search string preferably contains at least three amino acid residues, but a search string consisting of a dimer or a single amino acid residue may also be used, particularly for searching smaller databases such as conceptually translated peptide databases derived from genomic databases of microorganisms.
  • a trimer sequence flanked by two masses is preferred in order to retrieve a more practical working set of sequences from peptide databases derived by conceptual translation of nucleotide sequences.
  • Sequencing algorithms (e.g. HOPS and TESLA) can construct search strings with dimers or sequences greater than three amino acid residues in length but the additional sequence length may be unnecessary and may increase the number of false positive reads from the fragmentation spectra, thereby compromising the fidelity of the overall process.
  • any sequence data that do not meet the criteria for a search string may be used in a mapping algorithm as described below, e.g. to edit the retrieved peptide or translated nucleotide database sequences and to remove errors. (FIG. 1, step 7 and FIG. 11).
  • Criteria for forming and constraining a permuted set of search sequences are illustrated in FIGS. 9 and 10 under 'Constraints for search string' and described below, and may be applied in any desired sequence. These are based on empirical observations and on database attributes. One of ordinary skill in the art can adjust these criteria for improved searching of other databases (e.g. databases representing other genomes) either empirically or from considerations such as genome size, frequency of translated amino acid residues, nucleotide sequencing error rates and gene structure.
  • certain disfavored search strings are eliminated that have been found from experience to be frequent artifacts.
  • the identity of disfavored search strings will vary according to the preparative and analytical procedures used, the instruments employed, and the nature of the sample being analyzed, and may readily be determined by experience.
  • the set of disfavored search strings comprises one or more (and preferably all) of the following: M1-GEL-878.6, M1-ELV-779.4, M1-DND-631.4, and M1-TLD-860.5.
  • the number of false hits from a large database is reduced by constraining M1 such that it cannot equal the mass of a single naturally occurring amino acid residue; under this constraint, M1 is likely to represent the combined masses of two or more amino acid residues.
  • M1 can be constrained by requiring (1) that M1 does not equal the mass of a single amino acid residue in its natural state or following post-translational modification; or (2) that M1 be greater than 186.079 Daltons (the mass of a residue of Tryptophan, the largest naturally occurring amino acid).
  • M2 can be constrained in accordance with the specific endopeptidase used to generate the experimental peptide; for tryptic digestion, the last residue of the experimental peptide must be the C-terminal Arg or Lys residue, thus M2 is required to exceed 156.10 Daltons (the mass of an arginine residue) (FIG. 10).
  • These constraints may be applied by the spectral read algorithm or may be imposed by subsequent steps of the method described herein.
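The M1 and M2 constraints above can be sketched as two small predicates. This is a minimal illustration, not the patent's implementation: the function names, the 0.02 Da tolerance and the `strict` switch are assumptions, and only the 20 unmodified residues are tabulated (post-translationally modified residues are ignored).

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
TRP_MASS = 186.079  # threshold quoted in the text for option (2)
ARG_MASS = 156.10   # threshold quoted in the text for tryptic peptides


def m1_allowed(m1, tol=0.02, strict=False):
    """Return True if M1 passes the N-terminal mass constraint.

    strict=False: option (1), M1 must not match any single residue mass.
    strict=True:  option (2), M1 must exceed the tryptophan residue mass.
    The tolerance is a hypothetical stand-in for the instrument error.
    """
    if strict:
        return m1 > TRP_MASS
    return all(abs(m1 - mass) > tol for mass in RESIDUE_MASS.values())


def m2_allowed(m2):
    """Return True if M2 exceeds the arginine residue mass (tryptic digest)."""
    return m2 > ARG_MASS
```

For instance, the flanking masses of the search string 226.15-NEN-621.34 pass both checks, whereas an M1 of 113.08 (a single Leu or Ile residue) is rejected.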
  • a search sequence is formed from a consensus sequence of HOPS by identifying the most N-terminal tripeptide within the consensus sequence that satisfies the constraints for M1, or for M1 and M2.
  • the highest ranking tripeptide sequence provided by TESLA is chosen that satisfies the constraints for M1, or for M1 and M2.
  • the amino acid asparagine (N) and two glycine residues (GG) are isobaric isomers and have identical chemical compositions and hence mass.
  • the amino acid residue, phenylalanine (F) has a mass that is similar to that of the oxidised form of methionine (M*) (147.0684 versus 147.0399 Da). Depending on signal intensity and instrument resolution, it may be difficult to achieve sufficient mass accuracy to distinguish these two residues.
  • the spectral read algorithm always specifies F for both F and oxidized methionine (M*). In constructing the permuted search set, F is permuted to M* and vice versa.
  • the amino acid residue, glutamine (Q) has a mass which may be difficult to distinguish from that of lysine (K) (128.0586 versus 128.0950 Da), depending on signal intensity and instrument resolution.
  • the spectral read algorithm reports Q, and Q is permuted to K as a single-residue change.
  • K is included in the allowed trimer sequences only if followed by a Pro residue.
  • where a search string contains W, a set of six permuted search strings is constructed; conversely, if a search string contains one of the 6 di-amino acid combinations, the two residues could be permuted to a Trp ('back permutation').
  • R is permuted to VG and/or GV, and either of these dimers is back-permuted to R.
  • all isobaric and mass-ambiguous amino acid substitutions are taken into account, and so all possible permutations are calculated from a deduced sequence to construct a permuted set of search strings.
  • a partial amino acid sequence of LCW would generate the following 14 possibilities in the Permuted Search set: LCW, ICW, LCDA, ICDA, LCAD, ICAD, LCGE, ICGE, LCEG, ICEG, LCVS, ICVS, LCSV, ICSV.
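The LCW example can be reproduced with a short sketch that expands each residue into its listed alternatives and takes the Cartesian product. The table and function name are illustrative assumptions; only forward permutations are shown (back-permutation of dimers such as AD to W is omitted), and `M*` denotes oxidized methionine as in the text.

```python
from itertools import product

# Alternatives per residue, following the ambiguities listed in the text.
# Residues not in the table are left unchanged.
ALTERNATIVES = {
    "L": ["L", "I"],
    "I": ["I", "L"],
    "F": ["F", "M*"],
    "Q": ["Q", "K"],
    "N": ["N", "GG"],
    "R": ["R", "VG", "GV"],
    "W": ["W", "DA", "AD", "GE", "EG", "VS", "SV"],
}


def permuted_search_set(sequence):
    """Return every string obtained by substituting listed alternatives."""
    options = [ALTERNATIVES.get(residue, [residue]) for residue in sequence]
    return {"".join(choice) for choice in product(*options)}


lcw = permuted_search_set("LCW")
print(len(lcw))     # 2 choices for L x 1 for C x 7 for W
print(sorted(lcw))
```

Because W expands to dimers, some members of the set are tetramers, which is consistent with the statement that the set of search sequences may include tetramers.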
  • the set of search sequences may include one or more tetramers.
  • the first three sequence permutations (L↔I, F↔M* and Q↔K), and permutations of W↔(AD or DA or VS or SV or EG or GE) are used in the construction of the search strings.
  • permutations from N↔GG are not considered.
  • only the first three sequence permutations (L↔I, F↔M* and Q↔K) are used in the construction of the search strings.
  • permutations from one amino acid to multiple amino acids (e.g. N↔GG and W↔(AD or DA or VS or SV or EG or GE)) are not considered, based on the principle that only sequences supported with observed peaks in the fragmentation mass spectrum are used for construction of the search string.
  • the mass spectrometer has sufficient resolution to dispel at least one (and preferably all) of the mass ambiguities enumerated as 3-6 above. Accordingly, only isobaric amino acid residues (I vs. L and N vs. GG) are taken into account in forming permuted search strings. In a particular embodiment, only I and L are permuted. Either before or after formation of the permuted set of search strings, the first and second masses, M1 and M2, are determined. These values are determined within the cumulative error defined by the errors associated with each individual peak (we have calculated the mass range of this error for our current analysis to be equal to [(mass resolution) × √2]).
  • the N-terminal mass (M1) may be calculated as the difference between the mass of the singly-protonated molecular ion, [M+H]+, (which may be obtained from a fragmentation mass spectrum or more preferably from a primary mass spectrum) and the mass of the highest m/z peak in the fragmentation spectrum that defines the bounds of the search sequence (e.g. trimer).
  • the C-terminal mass (M2) may be calculated as the value of the lowest m/z value that defines the bounds of the search sequence (e.g. trimer), minus the combined mass of a water molecule and a proton.
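The two flanking-mass calculations can be written out directly. The sketch below assumes a singly charged y ion series, as in the text; the function name and the worked numbers (built around the search string 226.15-NEN-621.34) are illustrative only.

```python
WATER = 18.01056   # monoisotopic mass of H2O in Daltons
PROTON = 1.00728   # mass of a proton in Daltons


def flanking_masses(mh_plus, high_mz, low_mz):
    """Compute (M1, M2) from the peaks that bound the search sequence.

    mh_plus: m/z of the singly protonated molecular ion [M+H]+
    high_mz: highest m/z y ion bounding the search sequence
    low_mz:  lowest m/z y ion bounding the search sequence
    """
    m1 = mh_plus - high_mz            # N-terminal flanking mass
    m2 = low_mz - (WATER + PROTON)    # C-terminal flanking mass
    return m1, m2


# Hypothetical peaks consistent with the string 226.15-NEN-621.34:
nen = 114.04293 + 129.04259 + 114.04293   # residue mass of N-E-N
low = 621.34 + WATER + PROTON             # y ion below the trimer
high = low + nen                          # y ion above the trimer
mh = high + 226.15                        # [M+H]+ of the whole peptide
m1, m2 = flanking_masses(mh, high, low)
print(round(m1, 2), round(m2, 2))
```

The difference between the two bounding peaks equals the residue mass of the trimer, which is what lets the spectral read assign the sequence between them.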
  • the search sequence is deduced from the spectrum as though the spectral read is based on a y ion series. If the ions considered in the spectral read are in fact b ions, the sequence orientation would be reversed (FIG. 3) and the values of M1 and M2 would change accordingly to M1' and M2'. For example, a string produced with a y ion spectral read, (NH2)-M1-Leu-Val-Ala-M2-(COOH), would become (HOOC)-M2'-Ala-Val-Leu-M1'-(NH2), if b ions are used to deduce the sequence. In one embodiment, both possibilities are taken into account in forming a set of search strings. In another embodiment, the set of search strings is constructed on the assumption that the ions detected are y ions.
  • i) the mass of M1 cannot equal the mass of a single naturally-occurring amino acid residue; ii) the mass of M2 must be greater than the mass of a protonated arginyl amino acid; iii) only a single permuted residue (L↔I or F↔M*) is allowed within the trimer; and iv) permuted sequences which are based on mass ambiguities between the mass of a single amino acid residue and residue dimers are not incorporated into the set of search strings.
  • the trimer sequence cannot contain only combinations of the high frequency residues V or A or combinations of either V or A residue and a single permuted residue.
  • IVA would not be allowed as the trimer sequence of a search string.
  • the search string or set of search strings may now be used to search a database of peptide sequences.
  • suitable databases for this purpose include: a database comprising peptide sequences derived from sequencing a plurality of peptides (e.g. by Edman sequencing or by MS analysis); or a database comprising peptide sequences derived by conceptual translation of a plurality of nucleotide sequences; or a database comprising peptide sequences derived by conceptual translation of a database of nucleotide sequences (e.g. a database comprising cDNA sequences and/or genomic sequences).
  • the peptide sequence database is derived by conceptual translation of genomic sequences representing a plant, mammalian or the human genome or a substantial portion thereof (e.g. at least 50%, 60%, 70%, 80%, 90%, 95% or 99% of a plant, mammalian or the human genome).
  • conceptual translation comprises applying the rules of the universal genetic code to obtain hypothetical peptide sequences by translating the nucleotide database in both orientations, and for all three reading frames of each orientation.
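Conceptual translation in both orientations and all three reading frames can be sketched as follows. The codon table is the standard genetic code; the function names, the use of `*` for stop codons and `X` for codons containing unknown bases are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(dna):
    """Return translations of frames 0-2 forward, then 0-2 reverse."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, reverse)
            for frame in range(3)]


print(six_frame_translate("ATGGCC"))
```

Splitting each translated frame at `*` would then give the stop-codon-delimited stretches to which the exclusion and digestion steps are applied.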
  • the conceptually translated database is constrained by excluding any peptide sequence that includes a residue encoded by a codon that appears adjacent to (alternatively after) a stop codon in the relevant reading frame of the nucleotide database that was conceptually translated. If desired, this exclusion criterion may be applied after in silico digestion.
  • sequences in a peptide database (for instance, one derived by conceptual translation) are permuted to allow for all possible sequences that could arise from one or a plurality of types of post-translational modification (e.g. all permutations are constructed that could arise from phosphorylation of Ser, Thr and Tyr residues).
  • a peptide database (e.g. one derived from conceptual translation) is subjected to in silico digestion according to one or more specific endopeptidases used for selective cleavage to generate peptide fragments for analysis by mass spectrometry.
  • a database may be constrained or edited to remove sequences according to previously established criteria, e.g., the Allowed Database Sequence constraints detailed in FIG. 10.
  • predicted peptides containing more than one Lys or Arg residue are preferably removed except those in which such a residue is immediately followed by a Pro residue on the C-terminal side.
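For trypsin, in silico digestion and the Lys/Arg filter can be sketched as below. This is a minimal sketch assuming complete cleavage C-terminal to Lys or Arg except before Pro; the function names are hypothetical and the test sequence is invented (it embeds the Apo E peptide LGADMEDVCGR).

```python
def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def passes_lys_arg_filter(peptide):
    """Allow at most one K/R that is not immediately followed by P."""
    cleavable = sum(
        1 for i, residue in enumerate(peptide)
        if residue in "KR" and peptide[i + 1:i + 2] != "P"
    )
    return cleavable <= 1


print(tryptic_digest("AAKPGGRLGADMEDVCGRLL"))
```

The filter matters when the database also contains sequences that were not produced by this digestion routine, for example entries with missed cleavages.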
  • Any peptide database may also be amplified by permuting all Met residues to oxidized and unoxidized forms; where a peptide contains multiple Met residues, the database preferably contains all such permutations.
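Expanding a peptide into all oxidized and unoxidized methionine forms is a Cartesian product over the Met positions. The sketch below is illustrative; the function name is an assumption and `M*` denotes oxidized methionine as elsewhere in the text.

```python
from itertools import product


def met_oxidation_variants(peptide):
    """Return every combination of M and M* for the peptide's Met residues."""
    options = [("M", "M*") if residue == "M" else (residue,)
               for residue in peptide]
    return ["".join(choice) for choice in product(*options)]


print(met_oxidation_variants("AMKM"))
```

A peptide with n Met residues yields 2**n entries, so the amplification stays small for typical tryptic peptides.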
  • the peptide database to be searched may comprise over 200,000 sequences, over
  • the methods described herein are capable of searching such a database with a set of search strings within 30 seconds, 20 seconds, 10 seconds, or even 5 seconds.
  • a "mass-constrained text search" is performed to identify peptide sequences that (a) have a predicted mass compatible with M1 plus M2 plus the mass of the intervening peptide sequence (e.g. tripeptide); and (b) contain the text string (e.g. tripeptide) in the search sequence or at least one text string (e.g. tripeptide) in a set of search sequences. This may be accomplished by identifying a subset of sequences having the correct predicted mass and then performing a text search on this subset, or applying these criteria in the reverse order, or applying both criteria simultaneously or in succession to peptide sequences in the database without identifying an intermediate subset of sequences. The peptide sequences that satisfy this mass-constrained text search are then tested to identify those compatible with M1 or M2.
  • the M1 is mass-matched to the characters immediately preceding the 'text-matched sequence' i.e. toward the putative N-terminal.
  • the program sequentially calculates the mass of the amino acid residues represented by the database characters starting at the residue closest to the 'N-terminal' end of the 'text matched sequence'. The process continues until the total mass of the character set exceeds the value of M1. If the value of the character set is equal to the value specified by M1 (within pre-determined error ranges dictated by the mass spectrometer resolution, as described above), then an N-terminal mass match is made and the sequence is 'passed' for C-terminal mass matching.
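The N-terminal mass-matching walk can be sketched as a loop that accumulates residue masses outward from the text-matched sequence. The function name, return convention and 0.02 Da tolerance are assumptions; the example sequence is invented.

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def n_terminal_match(sequence, match_start, m1, tol=0.02):
    """Walk toward the N-terminus from the text-matched sequence.

    Returns the index at which the matching flank begins, or None if the
    running total exceeds M1 (or the sequence ends) without a match.
    """
    total = 0.0
    for i in range(match_start - 1, -1, -1):
        total += RESIDUE_MASS[sequence[i]]
        if abs(total - m1) <= tol:
            return i
        if total > m1 + tol:
            return None
    return None


# 'NEN' starts at index 4; L + V = 212.15 Da sits immediately before it.
print(n_terminal_match("TTVLNENAK", 4, 212.15))
```

A mirror-image loop running toward the C-terminus would test M2 in the same way once the N-terminal match has passed.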
  • the present invention is capable of analyzing fragmentation mass spectral data and identifying a corresponding sequence in a peptide database based on a single subsequence match, for example a single tripeptide match. This is a significant advance over prior art methods, which required identification of two matching subsequences within a peptide for identification.
  • the result of the database search is one or more peptide sequences (typically, a single peptide sequence or a small set of peptide sequences) related by total mass, trimer sequence and mass-matched sequences for M1 and M2.
  • each of the retrieved sequence(s) is assessed by a back read algorithm to identify a sequence, if any, that truly matches the peptide represented in the fragmentation spectrum.
  • the back-read is performed by searching for ions from the relevant suite that flank the common trimer sequence. (If both y and b ions have been used for the search process, the sequence of the matching tripeptide and the values found for M1 and M2 will reveal whether the relevant suite is the y or b ions.) In a preferred embodiment, this process is implemented by searching an appropriate peak list, generated from a fragmentation mass spectrum of the experimental peptide, for flanking y ions. Each sequence is parsed, character by character. The peptide sequences that meet these criteria are used to generate a list of theoretical m/z values for the appropriate suite (preferably, the y ion series).
  • the theoretical m/z values, or corresponding assigned mass values, are compared with the observed values in a peak list from a fragmentation mass spectrum of the experimental peptide.
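Generating theoretical y ion values for a retrieved sequence and comparing them with an observed peak list can be sketched as follows. Singly charged ions are assumed; the function names and the 0.05 m/z tolerance are hypothetical, and no scoring scheme is implied.

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def y_ions(peptide):
    """Return singly charged y ion m/z values, y1 first."""
    total = WATER + PROTON
    series = []
    for residue in reversed(peptide):
        total += RESIDUE_MASS[residue]
        series.append(total)
    return series


def matched_ions(theoretical, observed, tol=0.05):
    """Return the theoretical values that have an observed peak within tol."""
    return [t for t in theoretical
            if any(abs(t - peak) <= tol for peak in observed)]


theory = y_ions("GK")
print([round(mz, 4) for mz in theory])
print(matched_ions(theory, [147.11, 300.0]))
```

Using a less heavily filtered peak list for `observed` than for the spectral read, as the text suggests, gives the back-read more peaks with which to support flanking residues.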
  • the peak list used for the back read and the peak list used for the spectral read are prepared from the same fragmentation mass spectrum, but they may less preferably be prepared from different fragmentation spectra of the experimental peptide.
  • the peak list used for the back-read contains at least one peak, more preferably a plurality of peaks, absent from the peak list used for the spectral read process (e.g., the Edited Peak List).
  • the peak list used for this comparison has not been subjected to filtering or editing or has been subjected to a less stringent filtering or editing process than that applied to obtain the peak list used for the spectral read process.
  • the matching ions e.g. y-ions
  • Peptide sequences are then processed (e.g., in text form) where the trimer sequences are identified, flanking ions which support each sequence are flagged, and the sequences are scored.
  • a preferred scoring scheme which uses one y ion signal on the N terminal end of the peptide and two y ion signals on the C terminal end is as follows for a peptide
  • the algorithm is capable of performing the back-read process according to previously determined criteria, without the intervention of an operator.
  • the back-read (and if desired the method as a whole) is performed without the intervention of a person having a doctoral degree in science, preferably without the intervention of a person having a master's or higher degree in science, more preferably without the intervention of a person having a bachelor's or higher degree in science, still more preferably without the intervention of a person skilled in mass spectral interpretation, and yet more preferably without the intervention of an operator.
  • the back-read provides the specificity needed to select or verify a true match, and further to delineate expressed gene regions in large genomic databases without using exon-prediction algorithms.
  • a search string e.g. extending a trimeric sequence to a sequence of 5, 6, or more amino acid residues through the process described above
  • the back-read can be performed by a computer-mediated algorithm that allows for gaps in the spectrum and identifies inter-peak offsets corresponding to two or more successive flanking residues in a retrieved sequence, or by a vectorial scoring method, as taught herein.
  • the result of this step of the method is one "matching" peptide sequence (within the limits of isobaric residues (Leu vs. Ile) and any mass-ambiguous residues (e.g., Phe vs. oxidized Met, and Lys vs. Gln)) for each tandem spectrum.
  • mass ambiguity is avoided by using a mass spectrometer with an appropriately high resolution, as described herein.
  • isobaric residues are distinguished by interpretation of peaks representing d and w ions arising from side chain cleavage. (Biemann, 1990, op.
  • additional information is obtained from accurately determined peptide molecular weights (e.g. as measured in primary mass spectra) and/or fragmentation mass spectra of peptide fragments obtained by selective cleavage (e.g. trypsinolysis) of a polypeptide.
  • the additional data may be mapped onto a peptide or nucleotide database after a matching peptide sequence has been identified using a search sequence deduced from a fragmentation mass spectrum, as described above.
  • Peptide molecular weights (without associated fragmentation spectra) and additional sequences identified from fragmentation spectra by a spectral read program are useful for: 1) unambiguous identification of exons, i.e. nucleotide sequences that are expressed as peptides; 2) determining a correct reading frame of a nucleotide sequence (e.g. a nucleotide sequence comprising an exon or a portion of an exon); 3) identifying artefacts and errors in nucleotide or peptide sequences; 4) identifying base changes (mutations) and protein polymorphisms; 5) identifying post-translational modifications; and 6) identifying exon-intron boundaries as well as exon-exon boundaries in splice variants.
  • deduced peptide sequences can be used for this purpose, including those that do not satisfy the criteria for a search string.
  • the deduced sequence is permuted and a text search is performed against the set of aligned sequences to identify one or more subsequences that contain a member of the permuted sequence set.
  • the tryptic (or other cleavage) peptide within which the match occurs is tested to see whether it matches the mass data associated with the deduced peptide sequences (e.g. whether M1 and M2 match).
  • the M1 match is tested by a module that begins with the molecular mass of the amino acid residue on the N-terminal side of the matching sequence and tests whether it matches M1 within the error of measurement; if not, the module iteratively moves to the next flanking residue, adds its molecular mass, and tests whether the sum matches M1. This process is repeated until a match is found or the sum exceeds M1, in which case there is no match. A similar test is performed for flanking residues on the C-terminal side of the matching sequence to test for a match to M2.
  • the value of M2 as determined from the fragmentation mass spectrum must be adjusted to conform with the algorithm by which the mass of the C-terminal flanking region is iteratively determined; either the mass of a molecule of water (in addition to the mass of a proton) must be subtracted from the value of M2 determined from the y ion in the fragmentation spectrum, or the mass of a water molecule must be added to the sum calculated for the C-terminal flanking region under consideration. If a match is found for both M1 and M2, optionally a check is made to determine whether the identified subsequence is compatible with the cleavage pattern of the endopeptidase that was used to produce the experimental peptides (e.g. the subsequence is checked to determine whether it terminates with a C-terminal Arg or Lys residue and whether its N-terminal is preceded by an Arg or Lys residue). Preferably, such a match is required to accept the subsequence for mapping.
  • Post-translational modifications (PTMs) within the aligned sequences are identified by a variant of the peptide-mapping algorithm in (b). This is done by modifying the algorithm that tests the summed masses of the flanking amino acid residues against M1 or M2 so that the summed flanking mass is calculated from the molecular mass of each flanking amino acid residue both in its unmodified state and as incremented or decremented by one or more PTMs under consideration, or fragments thereof, such as any or all those listed in Table 2.
  • the algorithm steps along the flanking sequence and upon encountering a Ser (or Thr or Tyr) keeps parallel totals of the previously summed flanking mass, incremented by the monoisotopic mass of a Ser (or Thr or Tyr) with and without the mass increment due to phosphorylation.
  • the resultant parallel summed flanking masses are then tested in each step against M1 (in the case of an N-terminal flanking sequence) or M2 (for a C-terminal flanking sequence).
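The parallel-totals idea for phosphorylation can be sketched by carrying a set of candidate flanking masses along the walk. The sketch assumes a phosphorylation increment of 79.96633 Da (HPO3) on Ser, Thr or Tyr; the function name, the tolerance and the convention that `flank` is ordered outward from the matched sequence are assumptions.

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PHOSPHO = 79.96633  # mass increment of HPO3


def phospho_flank_match(flank, target, tol=0.02):
    """Return the number of flank residues whose summed mass matches target.

    `flank` lists residues outward from the matched sequence. Each S, T or Y
    doubles the running totals (with and without phosphorylation). Returns
    None once even the smallest total exceeds the target.
    """
    totals = {0.0}
    for count, residue in enumerate(flank, start=1):
        step = set()
        for total in totals:
            step.add(total + RESIDUE_MASS[residue])
            if residue in "STY":
                step.add(total + RESIDUE_MASS[residue] + PHOSPHO)
        totals = step
        if any(abs(total - target) <= tol for total in totals):
            return count
        if min(totals) > target + tol:
            return None
    return None


print(phospho_flank_match("SA", 238.03547))  # phospho-Ser plus Ala
print(phospho_flank_match("SA", 158.06914))  # unmodified Ser plus Ala
```

Other modifications could be handled the same way by swapping the increment and the set of eligible residues, at the cost of more parallel totals per step.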
  • a back-read is performed on the fragmentation mass spectrum to identify the characteristic ions of the PTM of interest in order to accept the PTM for mapping.
  • Such characteristic ions include, without limitation: (a) modified fragmentation ions; (b) ions arising from cleavage of side chains; and (c) ions arising from cleavage within side chains. (See, e.g., Gibson, B. W. and Cohen, P. (1990) Methods Enzymol.
  • deduced sequences are mapped onto aligned sequences and the flanking sequences are matched to M1 and M2 as described in (b) above (i.e., without taking PTMs into account) or, more preferably, as described in (c) above (i.e., allowing for PTMs). If a match is found for M1 but not M2, the mapping algorithm returns to the C-terminal flanking sequence and recalculates the flanking mass, permuting one amino acid residue at a time to all other possible amino acids; if a match is found for M2 but not M1, the same operation is performed on the N-terminal flanking sequence.
  • the set of possible amino acids can exclude PTMs or can include one or more PTMs.
  • if a permutation of a given amino acid residue in the sequence produces a match for the relevant flanking mass (M1 or M2), the result is recorded as an error or polymorphism, and the sequence is corrected or the polymorphism is mapped.
  • a back-read is performed on the fragmentation mass spectrum to identify a peak corresponding to the corrected amino acid residue before recording it as an error.
  • polymorphisms can be distinguished from sequencing errors. If the uncorrected sequence matches fragmentation spectra from some individuals but not others, it is recorded as a polymorphism; if it never matches, it is recorded as a sequencing error.
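The single-residue permutation step used to propose sequence errors or polymorphisms can be sketched as an exhaustive substitution scan over a flanking region of fixed length. The function name and tolerance are assumptions, PTMs are excluded from the candidate set, and the example flank is invented.

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitution_candidates(flank, target, tol=0.02):
    """List (position, original, replacement) triples that make the flank
    mass agree with the measured flanking mass (M1 or M2)."""
    base = sum(RESIDUE_MASS[residue] for residue in flank)
    hits = []
    for position, original in enumerate(flank):
        for replacement, mass in RESIDUE_MASS.items():
            if replacement == original:
                continue
            candidate = base - RESIDUE_MASS[original] + mass
            if abs(candidate - target) <= tol:
                hits.append((position, original, replacement))
    return hits


# Database flank 'AVG' versus a measured flank mass equal to A + L + G.
print(substitution_candidates("AVG", 241.14263))
```

In this example three substitutions (V to L, V to I, and G to A) give the same flank mass, which illustrates why a back-read on the fragmentation spectrum is needed before a correction is recorded.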
  • the present invention may also be used to identify exon-intron boundaries in one or more genomic sequences that have been conceptually translated to peptide sequences or exon-exon boundaries within expressed peptide sequences. In one embodiment, this is done by using a "spanning sequence" i.e. a sequence that includes amino acid residues encoded by two distinct exons e.g. two exons separated by a single intron or a plurality of introns. In one embodiment, the spanning sequence is identified by a spectral read program (e.g. a consensus sequence from HOPS). Preferably the spanning sequence comprises more than 3 amino acid residues, e.g.
  • a text comparison is performed to match a first part of the spanning sequence with a first portion of a conceptually translated peptide sequence encoded by a first exon and to match a second part of the spanning sequence with a second portion of the conceptually translated peptide sequence encoded by a second exon, thereby revealing exon-intron boundaries in the genomic sequence.
  • the present invention is used to identify exons in a genomic sequence that encodes a protein whose expression is subject to splice variation.
  • Identifying a spanning sequence in such a protein permits identification in the genomic sequence of some or all of the exons which are co-expressed in the splice variant.
  • a plurality of aliquots of a polypeptide preparation are subjected in parallel to distinct selective cleavage procedures to generate overlapping cleavage peptides.
  • Parallel in silico digestion of translated peptide sequences is also performed to generate overlapping in silico cleaved peptide sequences, which may now be identified and aligned, as described above. Where an aligned sequence is encoded by more than one exon, the exon-intron boundaries are now revealed.
  • Use of overlapping cleavage fragments advantageously permits greater sequence coverage to be obtained, thereby facilitating the editing of nucleotide databases.
  • the FIREPROT module (FIG. 11) performs the mapping function, as exemplified in Example 1 below, using the same permutation rules that are used in FIREPEP.
  • the method described herein may be applied to identify, sequence or map a wide variety of post-translational modifications including, without limitation, those that yield: N-formyl-L-methionine; L-selenocysteine; L-cystine; L-erythro-beta-hydroxyasparagine;
  • N-acetyl-L-alanine; N-acetyl-L-aspartic acid;
  • N-acetyl-L-cysteine; N-acetyl-L-glutamic acid;
  • N-acetyl-L-glutamine; N-acetylglycine; N-acetyl-L-isoleucine; N2-acetyl-L-lysine;
  • N-methyl-L-alanine; N,N,N-trimethyl-L-alanine; N-methylglycine; N-methyl-L-methionine;
  • N6,N6-dimethyl-L-lysine; N6-methyl-L-lysine; N6-palmitoyl-L-lysine;
  • L-arginine amide; L-asparagine amide; L-aspartic acid 1-amide; L-cysteine amide;
  • L-glutamine amide; L-glutamic acid 1-amide; glycine amide; L-histidine amide; L-isoleucine amide; L-leucine amide; L-lysine amide; L-methionine amide; L-phenylalanine amide;
  • L-proline amide; L-serine amide; L-threonine amide; L-tryptophan amide; L-tyrosine amide;
  • L-valine amide; L-cysteine methyl disulfide;
  • S-farnesyl-L-cysteine;
  • S-palmitoyl-L-cysteine; S-diacylglycerol-L-cysteine; S-(L-isoglutamyl)-L-cysteine; 2'-(S-L-cysteinyl)-L-histidine; L-lanthionine; meso-lanthionine; 3-methyl-L-lanthionine;
  • N6-(4-amino-2-hydroxybutyl)-L-lysine; N6-biotinyl-L-lysine; N6-lipoyl-L-lysine;
  • N6-pyridoxal phosphate-L-lysine; N6-retinal-L-lysine; L-allysine; L-lysinoalanine;
  • L-2',4',5'-topaquinone; L-tryptophyl quinone; 4'-(L-tryptophan)-L-tryptophyl quinone;
  • O-phosphopantetheine-L-serine; N4-glycosyl-L-asparagine; S-glycosyl-L-cysteine;
  • N-aspartyl-glycosylphosphatidylinositolethanolamine; N-cysteinyl-glycosylphosphatidylinositolethanolamine; N-glycyl-glycosylphosphatidylinositolethanolamine;
  • N-seryl-glycosylsphingolipidinositolethanolamine; O-(phosphoribosyl dephospho-coenzyme
  • L-3-oxoalanine; lactic acid; L-alanyl-5-imidazolinone glycine; L-cysteinyl-5-imidazolinone glycine; D-alanine; D-allo-isoleucine; D-methionine; D-phenylalanine; D-serine;
  • D-asparagine; D-leucine; D-tryptophan; L-isoglutamyl-polyglycine;
  • O4-glycosyl-L-hydroxyproline; O-(phospho-5'-RNA)-L-serine; L-citrulline;
  • L-beta-methylthioaspartic acid; 5'-(N6-L-lysine)-L-topaquinone; S-methyl-L-cysteine;
  • N-(L-glutamyl)-L-tyrosine; S-phycobiliviolin-L-cysteine; phycoerythrobilin-bis-L-cysteine; phycourobilin-bis-L-cysteine; N-L-glutamyl-poly-L-glutamic acid; L-cysteine sulfinic acid;
  • L-cysteine persulfide; 3'-(1'-L-histidyl)-L-tyrosine; heme P460-bis-L-cysteine-L-lysine;
  • PTMs may be used to detect a variety of post-translational modifications relevant to basic research or to the clinical diagnosis of disease.
  • Examples of the types of PTMs that may be analyzed using the methods described herein include, but are not limited to, alkylation, see e.g. Saragoni et al. (2000) Differential association of tau with subsets of microtubules containing posttranslationally-modified tubulin variants in neuroblastoma cells. Neurochem. Res.
  • Examples of phosphorylation include, but are not limited to, Vanmechelen et al. (2000) Quantification of tau phosphorylated at threonine 181 in human cerebrospinal fluid: a sandwich ELISA with a synthetic phosphopeptide for standardization. Neurosci. Lett. 285:49-52; Lutz et al. (1994) Characterization of protein serine/threonine phosphatases in rat pancreas and development of an endogenous substrate-specific phosphatase assay.
  • an example of sulfation includes, but is not limited to, Manzella et al. (1995) Evolutionary conservation of the sulfated oligosaccharides on vertebrate glycoprotein hormones that control circulatory half-life. J. Biol. Chem. 270:21665-71, the contents of which are hereby incorporated in their entirety.
  • Examples of post-translational modification by oxidation or reduction include, but are not limited to, Magsino et al. (2000) Effect of triiodothyronine on reactive oxygen species generation by leukocytes, indices of oxidative damage, and antioxidant reserve. Metabolism 49:799-803; or Stief et al. (2000) Singlet oxygen inactivates fibrinogen, factor V, factor VIII, factor X, and platelet aggregation of human blood. Thromb. Res. 97:473-80, the contents of which are hereby incorporated in their entirety.
  • ADP-ribosylation examples include, but are not limited to, Galluzzo et al. (1995) Involvement of CD44 variant isoforms in hyaluronate adhesion by human activated T cells. Eur. J. Immunol. 25:2932-9; or Thraves et al. (1986) Differential radiosensitization of human tumour cells by 3-aminobenzamide and benzamide: inhibitors of poly(ADP-ribosylation). Int. J. Radiat. Biol. Relat. Stud. Phys. Chem. Med. 50:961-72, the contents of which are hereby incorporated in their entirety.
  • hydroxylation includes, but is not limited to, Brinckmann et al. (1999) Overhydroxylation of lysyl residues is the initial step for altered collagen cross-links and fibril architecture in fibrotic skin. J. Invest. Dermatol. 113:617-21, the contents of which is hereby incorporated in its entirety.
  • glycosylation examples include, but are not limited to, Johnson et al. (1999) Glycan composition of serum alpha-fetoprotein in patients with hepatocellular carcinoma and non-seminomatous germ cell tumour. Br. J. Cancer 81:1188-95; Fulop et al. (1996) Species-specific alternative splicing of the epidermal growth factor-like domain 1 of cartilage aggrecan. Biochem. J. 319:935-40; Dow et al. (1994) Molecular correlates of spinal cord repair in the embryonic chick: heparan sulfate and chondroitin sulfate proteoglycans. Exp.
  • RNA polymerase II is a glycoprotein. Modification of the COOH-terminal domain by O-GlcNAc. J. Biol. Chem. 268:10416-24; Goss et al. (1995) Inhibitors of carbohydrate processing: A new class of anticancer agents. Clin. Cancer Res. 1:935-44; or Sleat et al. (1998) Specific alterations in levels of mannose 6-phosphorylated glycoproteins in different neuronal ceroid lipofuscinoses. Biochem. J. 334:547-51, the contents of which are hereby incorporated in their entirety.
  • glycosylphosphatidylinositide addition includes, but is not limited to,
  • ubiquitination includes, but is not limited to, Chu et al. (2000)
  • translocation leading to a disease state includes, but is not limited to, Reddy et al. (1999) Recent advances in understanding the pathogenesis of Huntington's disease. Trends Neurosci. 22:248-55, the contents of which is hereby incorporated in its entirety.
  • An example of detection of an artificial modification includes, but is not limited to, Romero et al. (1993)
  • proteolytic processing examples include, but are not limited to, Kurahara et al. (1999) Expression of MMPS, MT-MMP, and TIMPs in squamous cell carcinoma of the oral cavity: correlations with tumor invasion and metastasis. Head Neck 21:627-38; or Thorgeirsson et al. (1994) Tumor invasion, proteolysis, and angiogenesis. J. Neurooncol. 18:89-103, the contents of which are hereby incorporated in their entirety.
  • primary sequence variability e.g., mRNA splicing variability, gene mutation
  • a database is constructed from a combination of human genomic sequence entries in the database held at the European Molecular Biology Laboratory (EMBL), peptide entries in the non-redundant database held by the National Centre for Biotechnology Information (NCBI) which is accessible at http://www.ncbi.nlm.nih.gov/ and peptide sequence entries held at the SWISSPROT database held at the Swiss Institute of Bioinformatics (SIB).
  • the genomic sequences (including edited as well as unassembled, unordered segments of the genome) are translated, in all three reading frames and in both directions, into a computer listing of predicted amino acid sequences by applying the rules of the genetic code to form conceptually translated peptide sequences.
  • FIG. 12 shows 6 sequences which are retrieved from the translated human genome and protein database using the search string 226.15-NEN-621.34.
  • the translated genome sequence (Sequences 1-5) which show the identified peptide sequences represent nucleotide sequences which are flanked by stop codons.
  • a dimer sequence and associated M1 and M2 values (718.37-SF-274.19) are determined. This dimer string does not adhere to the set of empirical rules required for database searching using FIREPEP (FIG. 10) and is not used to search the database.
  • FIREPROT algorithm the SF dimer and associated masses are assigned to database entries 1, 3 and 6 (FIG.
  • Peptide sequences are derived by conceptual translation of genome sequences and subjected to in silico trypsinolysis. Fragmentation mass spectra are interpreted by HOPS and the conceptually translated database is searched with FIREPEP to identify matching peptide sequences. Where a peptide sequence is identified that matches the search string, a cross-referencing algorithm identifies the contig in the genomic database from which the matching peptide has been translated. A set is then formed from all in silico digested sequences encoded by that contig. A mass matching algorithm is used to assign recorded masses to sequences within the set of translated peptide sequences. Sequences deduced from spectral read of tandem spectra are mapped onto the set of translated sequences.
  • the output of database peptide sequences is read into the mass-mapping and post-translational modification module.
  • the modified peptide mass list consists of all masses, unmodified and modified, for each peptide which could result from endoprotease digestion and post-translational modifications of interest.
  • the modified mass list is then used to compare with the experimentally-determined peptide mass list. All mass agreements with translated genome peptide sequences within the mass tolerance of the instrument are used to determine sequence coverage and post-translational modifications.
  • Table 3 lists the masses matched to peptides of the transferrin receptor gene.
  • SNPs single nucleotide polymorphisms
  • Apo E (apolipoprotein E): the SNP converts the Cys residue at position 130 (the ε3 isoform) to Arg (the ε4 isoform).
  • the tryptic peptides containing the amino acid change from the SNP are (R)LGADMEDVCGR and (R)LGADMEDVR for isoforms ε3 and ε4, respectively.
  • Apolipoprotein E peptides from individuals with the ε3 and ε4 polymorphism were analysed according to the present invention.
  • the SNP search sequences were used to search the human genome peptide database with the FIREPEP module. In the case of the ε3 search string, Apo E gene sequences were returned. The search with ε4 did not return any records.
  • Pentoses (Ara, Rib, Xyl) 132.1161
  • Hexosamines (GalN, GlcN) 161.1577
  • Hexoses (Fuc, Gal, Glc, Man) 162.1424
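
The database-construction and mass-matching steps described in the excerpts above (conceptual translation in three reading frames and both directions, in silico trypsinolysis, and assignment of recorded masses within the instrument's mass tolerance) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the FIREPEP or HOPS implementation. All function names are hypothetical; the cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin convention and is not stated in the text; masses are monoisotopic residue masses; and post-translational modifications are not modeled.

```python
"""Sketch: six-frame translation, in silico tryptic digest, mass matching."""

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' is a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Monoisotopic residue masses in Da (assumption: unmodified residues only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Translate three frames on the forward strand, then three on the reverse."""
    dna = dna.upper()
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement(dna))
        for offset in range(3)
    ]


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidate_peptides(dna):
    """Digest every stop-to-stop segment of all six frames."""
    peptides = set()
    for frame in six_frame_translate(dna):
        for segment in frame.split("*"):
            for pep in tryptic_digest(segment):
                if all(aa in RESIDUE_MASS for aa in pep):
                    peptides.add(pep)
    return peptides


def match_masses(measured, peptides, tol):
    """Map each measured mass to the peptides within +/- tol Da of it."""
    return {
        m: sorted(p for p in peptides if abs(peptide_mass(p) - m) <= tol)
        for m in measured
    }
```

As a usage note, calling `candidate_peptides` on a contig and passing the result to `match_masses` together with an experimental mass list gives the candidate assignments for each recorded mass. Extending `peptide_mass` with mass offsets for modifications of interest would be one way to approximate the modified peptide mass list described above.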

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Urology & Nephrology (AREA)
  • Biomedical Technology (AREA)
  • Hematology (AREA)
  • Immunology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Cell Biology (AREA)
  • Microbiology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Biochemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The invention relates to a fully automated, computer-mediated, user-independent method for identifying and characterizing a peptide sequence in a peptide database that corresponds to an experimental peptide. The method identifies the matching sequence, if it is present in the database, without requiring a skilled observer to choose from a list of possible matches. By using automatic back-read, the method can uniquely identify a database peptide sequence on the basis of a single matched peptide sequence. The method also makes it possible to match mass spectral data from the peptide or the database in order to: unambiguously identify exons, determine the correct reading frame, identify sequence artefacts and errors, identify mutations and polymorphisms, identify post-translational modifications, and identify exon/intron boundaries. The invention also relates to a computer-readable medium comprising instructions for carrying out the method, a computer for that purpose, a peptide or nucleic acid database, a computer-readable file or list, or a display showing the information resulting from carrying out the method.
PCT/GB2001/004034 2000-09-08 2001-09-10 Identification automatisee de peptides WO2002021139A2 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP01965415A EP1317765A2 (fr) 2000-09-08 2001-09-10 Identification automatisee de peptides
AU2001286059A AU2001286059A1 (en) 2000-09-08 2001-09-10 Automated identification of peptides

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GB0022136.6 2000-09-08
GB0022136A GB0022136D0 (en) 2000-09-08 2000-09-08 A method for the sequencing indentification and characterization of peptides or proteins and uses thereof
US23227300P 2000-09-13 2000-09-13
US60/232,273 2000-09-13
US72440500A 2000-11-28 2000-11-28
US09/724,405 2000-11-28

Publications (2)

Publication Number Publication Date
WO2002021139A2 true WO2002021139A2 (fr) 2002-03-14
WO2002021139A3 WO2002021139A3 (fr) 2003-02-06

Family

ID=27255879

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/004034 WO2002021139A2 (fr) 2000-09-08 2001-09-10 Identification automatisee de peptides

Country Status (3)

Country Link
EP (1) EP1317765A2 (fr)
AU (1) AU2001286059A1 (fr)
WO (1) WO2002021139A2 (fr)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003038728A2 (fr) * 2001-11-01 2003-05-08 Biobridge Computing Ab Systeme informatique et procede utilisant des donnees de spectrometrie de masse et une base de donnees proteine pour l'identification de proteines inconnues
WO2004008371A1 (fr) * 2002-07-10 2004-01-22 Institut Suisse De Bioinformatique Procede d'identification de peptides et de proteines
WO2003102572A3 (fr) * 2002-05-30 2004-03-04 Shimadzu Res Lab Europe Ltd Spectrometrie de masse
WO2004019035A3 (fr) * 2002-08-22 2005-02-17 Applera Corp Procede de caracterisation de biomolecules au moyen d'une strategie dependant de resultats
EP1600776A2 (fr) * 2004-05-25 2005-11-30 Oxford Gene Technology Ip Limited La spectrométrie de masse du peptide
GB2419355A (en) * 2004-10-20 2006-04-26 Protagen Ag Analysis of biopolymers by mass spectrometry
EP1447833A3 (fr) * 2003-02-14 2006-08-09 Hitachi, Ltd. Méthode d' analyse de données spectrométriques.
DE102006041644A1 (de) * 2006-08-23 2008-03-20 Panatecs Gmbh Verfahren zur Detektion von Modifikationen in einem Protein oder Peptid
JP2009528517A (ja) * 2006-02-28 2009-08-06 フェノメノーム ディスカバリーズ インク 認知症及び他の神経障害の診断方法
CN103512877A (zh) * 2013-10-16 2014-01-15 长春新产业光电技术有限公司 一种拉曼光谱物质检测快速样本查找方法
US9034923B2 (en) 2007-02-08 2015-05-19 Phenomenome Discoveries Inc. Methods for the treatment of senile dementia of the alzheimer's type
WO2017027858A1 (fr) 2015-08-12 2017-02-16 The Trustees Of Columbia University In The City Of New York Procédés de traitement de la déplétion plasmatique et d'une lésion rénale
US9682123B2 (en) 2013-12-20 2017-06-20 The Trustees Of Columbia University In The City Of New York Methods of treating metabolic disease
WO2019050966A3 (fr) * 2017-09-05 2019-04-18 Discerndx, Inc. Portillonnage de flux de travail d'échantillon automatisé et analyse de données
US20200271661A1 (en) * 2013-09-23 2020-08-27 The Trustees Of Columbia University In The City Of New York High-throughput single molecule protein identification
CN112567465A (zh) * 2018-06-11 2021-03-26 默沙东有限公司 复杂分子子结构的识别系统、装置和方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999062930A2 (fr) * 1998-06-03 1999-12-09 Millennium Pharmaceuticals, Inc. Sequençage de proteines au moyen de la spectroscopie de masse en tandem

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BARTELS C: "FAST ALGORITHM FOR PEPTIDE SEQUENCING BY MASS SPECTROSCOPY" BIOMEDICAL AND ENVIRONMENTAL MASS SPECTROMETRY, WILEY, LONDON, GB, vol. 19, 1990, pages 363-368, XP001051563 ISSN: 0887-6134 cited in the application *
DONGRE A R ET AL: "Emerging tandem-mass-spectrometry techniques for the rapid identification of proteins" TRENDS IN BIOTECHNOLOGY, ELSEVIER PUBLICATIONS, CAMBRIDGE, GB, vol. 15, no. 10, 1 October 1997 (1997-10-01), pages 418-425, XP004090548 ISSN: 0167-7799 *
LINDH I ET AL: "De novo sequencing of proteolytic peptides by a combination of C-terminal derivatization and nano-electrospray/collision-induced dissociation mass spectrometry" JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY, ELSEVIER SCIENCE INC., NEW YORK, NY, US, vol. 11, no. 8, August 2000 (2000-08), pages 673-686, XP004210133 ISSN: 1044-0305 *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003038728A3 (fr) * 2001-11-01 2003-11-06 Biobridge Computing Ab Systeme informatique et procede utilisant des donnees de spectrometrie de masse et une base de donnees proteine pour l'identification de proteines inconnues
WO2003038728A2 (fr) * 2001-11-01 2003-05-08 Biobridge Computing Ab Systeme informatique et procede utilisant des donnees de spectrometrie de masse et une base de donnees proteine pour l'identification de proteines inconnues
WO2003102572A3 (fr) * 2002-05-30 2004-03-04 Shimadzu Res Lab Europe Ltd Spectrometrie de masse
US8620588B2 (en) 2002-05-30 2013-12-31 Shimadzu Research Laboratory (Europe) Limited Method, system, and computer program product for determining a putative amino acid sequence
WO2004008371A1 (fr) * 2002-07-10 2004-01-22 Institut Suisse De Bioinformatique Procede d'identification de peptides et de proteines
WO2004019035A3 (fr) * 2002-08-22 2005-02-17 Applera Corp Procede de caracterisation de biomolecules au moyen d'une strategie dependant de resultats
US6940065B2 (en) 2002-08-22 2005-09-06 Applera Corporation Method for characterizing biomolecules utilizing a result driven strategy
EP1447833A3 (fr) * 2003-02-14 2006-08-09 Hitachi, Ltd. Méthode d' analyse de données spectrométriques.
EP1600776A2 (fr) * 2004-05-25 2005-11-30 Oxford Gene Technology Ip Limited La spectrométrie de masse du peptide
EP1600776A3 (fr) * 2004-05-25 2005-12-07 Oxford Gene Technology Ip Limited La spectrométrie de masse du peptide
GB2419355A (en) * 2004-10-20 2006-04-26 Protagen Ag Analysis of biopolymers by mass spectrometry
DE102004051016A1 (de) * 2004-10-20 2006-05-04 Protagen Ag Verfahren und System zur Aufklärung der Primärstruktur von Biopolymeren
JP2014197018A (ja) * 2006-02-28 2014-10-16 フェノメノーム ディスカバリーズ インク 認知症及び他の神経障害の診断方法
JP2009528517A (ja) * 2006-02-28 2009-08-06 フェノメノーム ディスカバリーズ インク 認知症及び他の神経障害の診断方法
EP2322531A3 (fr) * 2006-02-28 2011-09-07 Phenomenome Discoveries Inc. Méthodes permettant de diagnostiquer la démence et autres troubles neurologiques
US8304246B2 (en) 2006-02-28 2012-11-06 Phenomenome Discoveries, Inc. Methods for the diagnosis of dementia and other neurological disorders
JP2012255793A (ja) * 2006-02-28 2012-12-27 Phenomenome Discoveries Inc 認知症及び他の神経障害の診断方法
DE102006041644B4 (de) * 2006-08-23 2011-12-01 Panatecs Gmbh Verfahren zur Detektion von Modifikationen in einem Protein oder Peptid
DE102006041644A1 (de) * 2006-08-23 2008-03-20 Panatecs Gmbh Verfahren zur Detektion von Modifikationen in einem Protein oder Peptid
US9034923B2 (en) 2007-02-08 2015-05-19 Phenomenome Discoveries Inc. Methods for the treatment of senile dementia of the alzheimer's type
US9517222B2 (en) 2007-02-08 2016-12-13 Phenomenome Discoveries Inc. Method for the treatment of senile dementia of the Alzheimer's type
US20200271661A1 (en) * 2013-09-23 2020-08-27 The Trustees Of Columbia University In The City Of New York High-throughput single molecule protein identification
US11927593B2 (en) * 2013-09-23 2024-03-12 The Trustees Of Columbia University In The City Of New York High-throughput single molecule protein identification
CN103512877B (zh) * 2013-10-16 2016-03-30 长春新产业光电技术有限公司 一种拉曼光谱物质检测快速样本查找方法
CN103512877A (zh) * 2013-10-16 2014-01-15 长春新产业光电技术有限公司 一种拉曼光谱物质检测快速样本查找方法
US9682123B2 (en) 2013-12-20 2017-06-20 The Trustees Of Columbia University In The City Of New York Methods of treating metabolic disease
WO2017027858A1 (fr) 2015-08-12 2017-02-16 The Trustees Of Columbia University In The City Of New York Procédés de traitement de la déplétion plasmatique et d'une lésion rénale
WO2019050966A3 (fr) * 2017-09-05 2019-04-18 Discerndx, Inc. Portillonnage de flux de travail d'échantillon automatisé et analyse de données
CN112567465A (zh) * 2018-06-11 2021-03-26 默沙东有限公司 复杂分子子结构的识别系统、装置和方法
US11854664B2 (en) 2018-06-11 2023-12-26 Merck Sharp & Dohme Llc Complex molecule substructure identification systems, apparatuses and methods
CN112567465B (zh) * 2018-06-11 2024-02-20 默沙东有限责任公司 复杂分子子结构的识别系统、装置和方法

Also Published As

Publication number Publication date
EP1317765A2 (fr) 2003-06-11
AU2001286059A1 (en) 2002-03-22
WO2002021139A3 (fr) 2003-02-06

Similar Documents

Publication Publication Date Title
US6963807B2 (en) Automated identification of peptides
US7783429B2 (en) Peptide sequencing from peptide fragmentation mass spectra
Zhang et al. ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data
Yan et al. Mass spectrometry-based quantitative proteomic profiling
Mann et al. Error-tolerant identification of peptides in sequence databases by peptide sequence tags
Aebersold A mass spectrometric journey into protein and proteome research
Shevchenko et al. Peptide sequencing by mass spectrometry for homology searches and cloning of genes
JP4818270B2 (ja) 選択されたイオンクロマトグラムを使用して先駆物質および断片イオンをグループ化するシステムおよび方法
US8909481B2 (en) Method of mass spectrometry for identifying polypeptides
JP4672614B2 (ja) 迅速かつ定量的なプロテオーム解析および関連した方法
Blueggel et al. Bioinformatics in proteomics
EP1317765A2 (fr) Identification automatisee de peptides
WO2003042774A2 (fr) Systeme de profilage de l'intensite de masse et utilisations correspondantes
JP2002505740A (ja) 新たなペプチド配列決定法
US20210239708A1 (en) Method to Map Protein Landscapes
JP2006510875A (ja) コンステレーションマッピングおよびそれらの使用
Hoffert et al. Taking aim at shotgun phosphoproteomics
BR112019025095A2 (pt) Métodos para a quantificação absoluta de polipeptídeos de baixa abundância utilizando espectrometria de massa
Merkley et al. A proteomics tutorial
Matthiesen et al. Analysis of mass spectrometry data in proteomics
Schweigert Characterisation of protein microheterogeneity and protein complexes using on-chip immunoaffinity purification-mass spectrometry
Fournier et al. Sequencing of a branched peptide using matrix‐assisted laser desorption/ionization time‐of‐flight mass spectrometry
Addona et al. De novo peptide sequencing via manual interpretation of MS/MS spectra
JP4614960B2 (ja) 同位体比によるペプチドを構成するアミノ酸配列の検定
Tschager Algorithms for Peptide Identification via Tandem Mass Spectrometry

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2001965415

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001965415

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2001965415

Country of ref document: EP