WO1999062930A2 - Protein sequencing using tandem mass spectroscopy - Google Patents

Protein sequencing using tandem mass spectroscopy

Info

Publication number
WO1999062930A2
WO1999062930A2 PCT/US1999/012221 US9912221W WO9962930A2 WO 1999062930 A2 WO1999062930 A2 WO 1999062930A2 US 9912221 W US9912221 W US 9912221W WO 9962930 A2 WO9962930 A2 WO 9962930A2
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum
mass
graph
peptide
ions
Prior art date
Application number
PCT/US1999/012221
Other languages
English (en)
Inventor
Vladimir Dancik
Original Assignee
Millennium Pharmaceuticals, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Millennium Pharmaceuticals, Inc.
Priority to AU42284/99A
Publication of WO1999062930A2

Classifications

    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J49/00 Particle spectrometers or separator tubes
    • H01J49/0027 Methods for using particle spectrometers
    • H01J49/0036 Step by step routines describing the handling of the data generated during a measurement
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/12 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length by hydrolysis, i.e. solvolysis in general
    • C07K1/128 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length by hydrolysis, i.e. solvolysis in general sequencing
    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J49/00 Particle spectrometers or separator tubes
    • H01J49/004 Combinations of spectrometers, tandem spectrometers, e.g. MS/MS, MSn

Definitions

  • a tandem mass spectrometer is capable of automatically ionizing a mixture of peptides and measuring their respective parent mass/charge ratios, then selectively fragmenting each peptide into constitutive pieces and measuring the mass/charge ratios of the fragment ions (MS/MS spectra of peptides).
  • the peptide sequencing problem is then to derive the sequences of peptides given their MS/MS spectra.
  • the sequence of the peptide could be simply determined by converting the mass differences of consecutive fragment ions in the spectrum to their corresponding amino acids.
  • de novo peptide sequencing remains an open problem, and even a simple spectrum may require tens of minutes for a trained expert to interpret.
  • the number of sequence permutations examined can be further pruned by limiting the possible amino acid composition derived either through chemical amino acid analysis or through composition measurement for ions below m/z 160 in the tandem mass spectrum.
  • the difficulty with the prefix approach is that pruning frequently discards the correct sequence if its prefixes are poorly represented in the spectrum.
  • Another intrinsic problem with the global approach is that the spectrum information is used for scoring only after the potential peptide sequences are generated.
  • the global approach de novo programs typically have running times on the order of hours.
  • the peaks in the spectrum serve as vertices in the spectrum graph while the edges of the graph correspond to linking of vertices differing by the mass of an amino acid residue.
  • Fundamental to graph theory approaches is the prior transformation of each peak in the experimental spectrum into several vertices in a spectrum graph. Each vertex represents a different possible fragment ion type assignment for the peak.
  • the de novo peptide sequencing problem is thus cast as finding the longest path in the resulting directed acyclic graph. Since the number of edges in the spectrum graph is at most quadratic in the number of ions in the spectrum, and since efficient algorithms for finding the longest paths are known, such approaches have the potential to efficiently prune the set of all peptides to the set of high-scoring paths in the spectrum graph.
  • Let $A$ be the set of amino acids with molecular masses $m(a)$, $a \in A$.
  • a (parent) peptide $P = p_1 \ldots p_n$ is a sequence of amino acids.
  • a partial peptide $P' \subset P$ is a substring $p_i \ldots p_j$ of $P$ of mass $\sum_{i \le t \le j} m(p_t)$.
  • the electronic spectrum $E(P)$ of peptide $P$ is the set of masses of its partial peptides.
  • a match $m(S,P) = \sum_{s \in S} m(s,P)$ between spectrum $S$ and peptide $P$ is the number of ions from the spectrum $S$ that match peptide $P$.
  • m(S,P) is the number of masses that experimental and electronic spectra have in common.
  • the peptide sequencing problem can be stated as follows. Given a spectrum $S$ and a parent mass $m$, find a peptide of mass $m$ with the maximal match to spectrum $S$.
  • a $\delta$-ion of a partial peptide $P' \subset P$ is a modification of $P'$ that has molecular mass $m(P') - \delta$.
  • the electronic spectrum $E$ of peptide $P$ is created by subtracting all offsets in $\Delta$ from the masses of all partial peptides of $P$ (denoted $E_{\Delta}$).
  • $W_{\Delta}(P) = W_{\Delta}(P_1) \cup \cdots \cup W_{\Delta}(P_{n-1})$.
  • the set of vertices of the spectrum graph then is $\{s_{initial}\} \cup V(s_1) \cup \cdots \cup V(s_m)$
  • a spectrum $S$ of a peptide $P$ is called "complete" if $S$ contains an ion corresponding to $P_i$ for every $1 \le i \le n$.
  • the use of the spectrum graph is based on the observation that $S$ is a complete spectrum of a peptide $P$ when there exists a path of length $n$ from $v_{initial}$ to $v_{final}$ in $G_{\Delta}(S)$ that is labeled by $P$ and
  • $\sum_{v} s(v)$, where $s(v)$ denotes the multiplicity with which vertex $v$ was created.
  • An offset frequency function is introduced that represents an important new tool for defining the ion type tendencies for particular mass-spectrometers.
  • the offset frequency function allows one to compare different mass spectrometers based on their propensity to generate different ion types, thus making our algorithm instrument-independent.
  • Peaks in a spectrum represent either random noise or $\delta$-ions of partial peptides.
  • Let $d(S)$ be the average distance between the peaks.
  • the probability of observing a peak at offset $\delta$ is approximately $\frac{1 - p(\delta)}{d(S)} + p(\delta)$, where $p(\delta)$ is the probability of a $\delta$-ion (the portion of partial peptides that produce $\delta$-ions).
  • the average d(S) for our sample spectra is 17.5; therefore the probability of a random offset is 1/17.5 ≈ 0.057.
  • the probability of an a-ion with offset -27 is 0.23.
  • the offset -27 is observed 4 times more frequently than the average offset.
  • the statistics of offsets over all ions and all partial peptides provides a reliable learning algorithm for ion types.
  • Offsets $\Delta = \{\delta_1, \ldots, \delta_k\}$ corresponding to peaks of $H(x)$ represent the ion types produced by a given mass spectrometer. Under normal circumstances we expect these offsets to correspond to the ion types that have sufficient support by chemistry.
  • Table 1 Information about terminal ion types learned from experimental spectra. The remaining offsets have average count 45 and average intensity 0.431024. When computing filtered counts, the peaks that have been identified as ions are not counted again for subsequent ion types.
  • Table 1 contains the list of offsets that have larger than expected counts and the corresponding ion types as known in chemistry. All the significant offsets we found correspond to known ion types. Surprisingly enough, some ion types turned out to be more significant than previously thought (e.g., b-H2O-H2O has a larger count than y-NH3). Also, Fig. 1 clearly shows the presence of internal b-ions in the spectra.
  • a part of the learning of ion types is to decide what interval of offsets should be considered for a particular ion type.
  • Peaks in a spectrum differ in intensity and one has to address the question of setting a threshold for distinguishing the signal from noise in a spectrum prior to transforming it to a spectrum graph. Low thresholds lead to excessive growth of the spectrum graph while high thresholds lead to fragmentation of the spectrum graph.
  • Earlier de novo sequencing algorithms set up the intensity thresholds for experimental spectra in a largely heuristic manner and have not addressed the fact that the intensity thresholds are ion-type dependent.
  • the offset frequency function allows one to set up intensity thresholds in a rigorous way.
  • Let K be the length of the underlying peptide. Since this information is usually unavailable, K may be chosen as the ratio of the peptide mass to the average mass of an amino acid.
  • the analysis of b-ions can be limited to intensity ranks 1, 2 and 3, while the analysis of b-H2O can be limited to intensity ranks 3, 4 and 5.
  • a similar analysis (Fig. 3) implies that only intensities ranked 1 and 2 (i.e., the 20-30 highest-intensity peaks) should be considered for y-ions, while intensities ranked 2, 3 and 4 represent potential y-H2O ions.
  • the merging algorithm decides which vertices in the spectrum graph are to be merged into one vertex. It is important to merge the appropriate vertices: if we do not merge vertices that correspond to the same partial peptide, we will interpret meaningful peaks of spectra as noise. On the other hand, if we merge vertices that do not correspond to the same peptide, we may interpret noise as meaningful peaks.
  • SHERENGA uses a greedy algorithm for merging vertices and introduces bridge edges in the resulting graph.
  • a gap edge in the spectrum graph is a directed edge from u to v such that v - u is - l i ⁇
  • the goal of scoring is to answer the question of how well a candidate peptide "explains" a spectrum and to choose the peptide that explains the spectrum the best.
  • Let p(P,S) be the probability that a peptide P produces spectrum S. It is appropriate to design the scoring schema so that high-scoring peptides P have a high probability p(P,S).
  • evaluate p(P,S) and derive a scoring schema for paths in the spectrum graph from the probabilities of the corresponding peptides. The longest path in the weighted spectrum graph corresponds to the peptide P that "explains" spectrum S the best.
  • the protein sequencing algorithm involves the generation of the weighted spectrum graph (as described above) and the search for the highest scoring paths in the spectrum graph.
  • Every peak in the spectrum may be interpreted either as an N-terminal ion or C-terminal ion. Therefore, every "real" vertex (corresponding to a mass m) has a
  • Let G be a graph and let T be a set of forbidden pairs of vertices of G (twins).
  • a path in G is called anti-symmetric if it contains at most one vertex from every forbidden pair.
  • The anti-symmetric longest path problem is to find a longest anti-symmetric path in G with a set of forbidden pairs T.
  • the intrinsic property of the conventional longest path algorithms is that they use only neighbors of a given vertex while computing the longest path ending in this vertex.
  • Vertices in the spectrum graph are numbers that correspond to masses of potential partial peptides.
  • Two forbidden pairs of vertices $(x_1, y_1)$ and $(x_2, y_2)$ are non-interleaving if the intervals $(x_1, y_1)$ and $(x_2, y_2)$ do not interleave, i.e., one of them is contained inside the other.
  • a graph G with a set of forbidden pairs is called proper if every two forbidden pairs of vertices are non-interleaving.
  • The tandem mass-spectrometry peptide sequencing problem corresponds to the anti-symmetric longest path problem in a proper graph. We submit that there exists an efficient algorithm for the anti-symmetric longest path problem in a proper graph.
  • C(G) is a graph having a path that corresponds to a path in the spectrum graph that is folded in the middle.
  • the vertices of the combined graph are pairs (e,x) such that edge e covers vertex x.
  • An initial vertex corresponds to the pair $(v_{initial}, v_{final})$ and a final vertex corresponds to a folding point of the spectrum graph.
  • the weight of the new vertex will be the weighted average $\frac{i(s)\,u + i(t)\,v}{i(s) + i(t)}$ of the weights of u and v.
  • the greedy algorithm for merging provides satisfying results for most spectra.
  • a peak of a spectrum is actually a mass/charge (m/z) ratio of the corresponding ion.
  • for an ion with charge 1, the m/z of the peak is the same as the mass of the corresponding ion.
  • some mass spectrometers are capable of producing ions with charge 2 or even more; in this case the observed mass is one half (one third, ...) of the ion's actual mass.
  • Let $c(S, S(x))$ be the number of peaks $s_i \in S$ and $s_j \in S(x)$ such that
  • the value of x that maximizes c(S, S(x)) would then be an appropriate choice for the parent mass. Should there be many choices for x, we can select one that minimizes the sum of distances
  • This approach significantly improves the accuracy of the parent mass determination.
  • This approach can similarly be used to correct a mis-assignment of the parent mass/charge value resulting from an incorrect charge assignment.
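
The offset statistics quoted in the definitions (average peak spacing d(S) = 17.5, random-offset probability 0.057, and the roughly fourfold enrichment of offset -27) can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the approximation reads (1 - p)/d(S) + p, with 1/d(S) as the chance that an arbitrary offset hits a noise peak.

```python
def random_offset_probability(avg_peak_distance: float) -> float:
    """Chance that an arbitrary offset lands on a peak by accident: 1 / d(S)."""
    if avg_peak_distance <= 0:
        raise ValueError("average peak distance must be positive")
    return 1.0 / avg_peak_distance


def observed_offset_probability(p_ion: float, avg_peak_distance: float) -> float:
    """Assumed approximation: (1 - p) / d(S) + p, where p is the ion-type probability."""
    return (1.0 - p_ion) / avg_peak_distance + p_ion


def enrichment(observed: float, avg_peak_distance: float) -> float:
    """How many times more often an offset is seen than a random offset."""
    return observed / random_offset_probability(avg_peak_distance)


print(round(random_offset_probability(17.5), 3))  # 0.057
print(round(enrichment(0.23, 17.5), 1))           # 4.0
```

With d(S) = 17.5, an offset observed with frequency 0.23 is about four times more frequent than a random one, which matches the figures quoted for the a-ion offset.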
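
The vertex-merging step can be sketched in the same spirit. The only part taken from the text is the intensity-weighted average (i(s)·u + i(t)·v)/(i(s) + i(t)) for the merged vertex. The greedy rule used here (sort by mass, merge neighbours closer than a tolerance), the `tolerance` parameter, and the summing of intensities after a merge are assumptions made for illustration, not the criteria SHERENGA actually uses.

```python
from typing import List, Tuple

Vertex = Tuple[float, float]  # (mass, intensity); a hypothetical representation


def merge_pair(a: Vertex, b: Vertex) -> Vertex:
    """Merge two vertices into one at the intensity-weighted average mass."""
    (u, iu), (v, iv) = a, b
    # Summing the intensities of the merged vertex is an assumption.
    return ((iu * u + iv * v) / (iu + iv), iu + iv)


def greedy_merge(vertices: List[Vertex], tolerance: float) -> List[Vertex]:
    """Assumed greedy rule: scan by increasing mass and merge close neighbours."""
    merged: List[Vertex] = []
    for vertex in sorted(vertices):
        if merged and vertex[0] - merged[-1][0] <= tolerance:
            merged[-1] = merge_pair(merged[-1], vertex)
        else:
            merged.append(vertex)
    return merged


print(merge_pair((100.0, 1.0), (101.0, 3.0)))  # (100.75, 4.0)
```

A tolerance that is too small leaves vertices for the same partial peptide unmerged, and one that is too large merges unrelated vertices, which is the trade-off described in the merging bullet.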

Landscapes

  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biochemistry (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to a novel algorithm, SHERENGA, for de novo spectral interpretation that automatically learns fragment ion types and intensity thresholds from a collected set of test spectra generated by any type of mass spectrometer. The algorithm uses a graph-theoretical approach. The test data are used to construct optimal path scoring in the graph representations of tandem mass spectra. A ranked list of high-scoring paths corresponds to potential peptide sequences. SHERENGA is particularly useful for interpreting sequences of peptides from unknown proteins not yet encountered in genomic sequencing, as well as for established text-based pattern matching to search for homology with known proteins. The algorithm also serves as an effective adjunct for validating the results of database matching algorithms in fully automated, high-throughput peptide sequencing.
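
The graph-theoretical approach summarized in the abstract can be illustrated with a deliberately simplified sketch. It is not the SHERENGA implementation. It assumes an idealized spectrum that holds only prefix masses (a single ion type with zero offset), scores every edge as 1, uses a hypothetical five-residue mass table, and ignores the anti-symmetric (forbidden pair) constraint. The function name and the `tol` parameter are likewise illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Monoisotopic residue masses for a small subset of amino acids (illustrative).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841, "L": 113.08406,
}


def longest_path_peptide(prefix_masses: List[float], parent_mass: float,
                         tol: float = 0.01) -> str:
    """Return the peptide labelling a longest path from mass 0 to parent_mass."""
    # Vertices are masses; sorting them gives a topological order, because
    # every edge goes from a smaller mass to a larger one.
    vertices = sorted(set([0.0, parent_mass] + list(prefix_masses)))
    target = vertices.index(parent_mass)
    unreachable = float("-inf")
    best = [unreachable] * len(vertices)  # best[j]: most edges on a path 0 -> j
    back: List[Optional[Tuple[int, str]]] = [None] * len(vertices)
    best[0] = 0
    for j in range(1, len(vertices)):
        for i in range(j):
            if best[i] == unreachable:
                continue
            diff = vertices[j] - vertices[i]
            # An edge exists when the mass difference matches a residue mass.
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and best[i] + 1 > best[j]:
                    best[j] = best[i] + 1
                    back[j] = (i, residue)
    if best[target] == unreachable:
        return ""  # no path explains the parent mass
    labels: List[str] = []
    j = target
    while back[j] is not None:
        i, residue = back[j]
        labels.append(residue)
        j = i
    return "".join(reversed(labels))


# Prefix masses of G and GA, one noise peak at 150.0, parent mass of GAV.
print(longest_path_peptide([57.02146, 128.05857, 150.0], 227.12698))  # GAV
```

Because the graph is a directed acyclic graph ordered by mass, a single dynamic-programming pass suffices. The noise peak at 150.0 is ignored simply because no residue mass links it to a path from 0.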
PCT/US1999/012221 1998-06-03 1999-06-02 Protein sequencing using tandem mass spectroscopy WO1999062930A2

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU42284/99A AU4228499A (en) 1998-06-03 1999-06-02 Protein sequencing using tandem mass spectroscopy

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US8778598P 1998-06-03 1998-06-03
US60/087,785 1998-06-03

Publications (1)

Publication Number Publication Date
WO1999062930A2 true WO1999062930A2 (fr) 1999-12-09

Family

ID=22207246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/012221 WO1999062930A2 Protein sequencing using tandem mass spectroscopy

Country Status (2)

Country Link
AU (1) AU4228499A (fr)
WO (1) WO1999062930A2 (fr)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002021139A2 * 2000-09-08 2002-03-14 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
WO2003046577A1 * 2001-11-30 2003-06-05 The European Molecular Biology Laboratory System and method for automatic protein sequencing by mass spectrometry
WO2003075306A1 * 2002-03-01 2003-09-12 Applera Corporation Method for protein identification using mass spectrometry data
WO2003098190A2 * 2002-05-20 2003-11-27 Purdue Research Foundation Protein identification from protein product ion spectra
EP1366360A2 * 2001-03-09 2003-12-03 Applera Corporation Methods for large-scale protein matching
WO2004008371A1 * 2002-07-10 2004-01-22 Institut Suisse De Bioinformatique Method for identifying peptides and proteins
WO2004083233A2 * 2003-02-10 2004-09-30 Battelle Memorial Institute Peptide identification
US6800449B1 (en) 2001-07-13 2004-10-05 Syngenta Participations Ag High throughput functional proteomics
DE10323917A1 * 2003-05-23 2004-12-16 Protagen Ag Method and system for elucidating the primary structure of biopolymers
US6963807B2 (en) 2000-09-08 2005-11-08 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
US7158862B2 (en) * 2000-06-12 2007-01-02 The Arizona Board Of Regents On Behalf Of The University Of Arizona Method and system for mining mass spectral data
DE102011014805A1 * 2011-03-18 2012-09-20 Friedrich-Schiller-Universität Jena Method for identifying, in particular, unknown substances by mass spectrometry

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7158862B2 (en) * 2000-06-12 2007-01-02 The Arizona Board Of Regents On Behalf Of The University Of Arizona Method and system for mining mass spectral data
WO2002021139A3 * 2000-09-08 2003-02-06 Oxford Glycosciences Uk Ltd Automated identification of peptides
WO2002021139A2 * 2000-09-08 2002-03-14 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
US6963807B2 (en) 2000-09-08 2005-11-08 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
EP1366360A2 * 2001-03-09 2003-12-03 Applera Corporation Methods for large-scale protein matching
EP1366360A4 * 2001-03-09 2005-03-16 Applera Corp Methods for large-scale protein matching
US6800449B1 (en) 2001-07-13 2004-10-05 Syngenta Participations Ag High throughput functional proteomics
WO2003046577A1 * 2001-11-30 2003-06-05 The European Molecular Biology Laboratory System and method for automatic protein sequencing by mass spectrometry
WO2003075306A1 * 2002-03-01 2003-09-12 Applera Corporation Method for protein identification using mass spectrometry data
WO2003098190A2 * 2002-05-20 2003-11-27 Purdue Research Foundation Protein identification from protein product ion spectra
WO2003098190A3 * 2002-05-20 2004-07-15 Purdue Research Foundation Protein identification from protein product ion spectra
WO2004008371A1 * 2002-07-10 2004-01-22 Institut Suisse De Bioinformatique Method for identifying peptides and proteins
WO2004083233A3 * 2003-02-10 2004-12-29 Battelle Memorial Institute Peptide identification
WO2004083233A2 * 2003-02-10 2004-09-30 Battelle Memorial Institute Peptide identification
US7979214B2 (en) 2003-02-10 2011-07-12 Battelle Memorial Institute Peptide identification
DE10323917A1 * 2003-05-23 2004-12-16 Protagen Ag Method and system for elucidating the primary structure of biopolymers
DE102011014805A1 * 2011-03-18 2012-09-20 Friedrich-Schiller-Universität Jena Method for identifying, in particular, unknown substances by mass spectrometry

Also Published As

Publication number Publication date
AU4228499A (en) 1999-12-20

Similar Documents

Publication Publication Date Title
Colinge et al. OLAV: Towards high‐throughput tandem mass spectrometry data identification
Xu et al. MassMatrix: a database search program for rapid characterization of proteins and peptides from tandem mass spectrometry data
US7409296B2 (en) System and method for scoring peptide matches
Zhang et al. ProbIDtree: an automated software program capable of identifying multiple peptides from a single collision‐induced dissociation spectrum collected by a tandem mass spectrometer
EP1047108A2 Method and apparatus for identifying peptides and proteins by mass spectrometry
Colinge et al. High‐performance peptide identification by tandem mass spectrometry allows reliable automatic data processing in proteomics
Bafna et al. On de novo interpretation of tandem mass spectra for peptide identification
WO1999062930A2 Protein sequencing using tandem mass spectroscopy
WO2008008919A2 Methods and systems for sequence-based design of multiple reaction monitoring transitions and experiments
Razumovskaya et al. A computational method for assessing peptide‐identification reliability in tandem mass spectrometry analysis with SEQUEST
Lu et al. A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications
Ahrné et al. An improved method for the construction of decoy peptide MS/MS spectra suitable for the accurate estimation of false discovery rates
US7230235B2 (en) Automatic detection of quality spectra
US20020046002A1 (en) Method to evaluate the quality of database search results and the performance of database search algorithms
Zou et al. Charge state determination of peptide tandem mass spectra using support vector machine (SVM)
CN114639445B Peptidomics identification method based on Bayesian evaluation and sequence database searching
Park et al. Human plasma proteome analysis by reversed sequence database search and molecular weight correlation based on a bacterial proteome analysis
Fei Novel Peptide Sequencing With Deep Reinforcement Learning
Zhang et al. A new strategy to filter out false positive identifications of peptides in SEQUEST database search results
Fei et al. GameTag: A New Sequence Tag Generation Algorithm Based on Cooperative Game Theory
Sanders et al. A transformer model for de novo sequencing of data-independent acquisition mass spectrometry data
Colinge et al. A systematic statistical analysis of ion trap tandem mass spectra in view of peptide scoring
Liu et al. PRIMA: peptide robust identification from MS/MS spectra
Hubbard Computational approaches to peptide identification via tandem MS
Dančík et al. De novo peptide sequencing via tandem mass spectrometry: A graph-theoretical approach

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
WA Withdrawal of international application
122 Ep: pct application non-entry in european phase