EP1272657A2 - Method and system for microorganism identification by mass spectrometry-based proteome database searching
- Publication number
- EP1272657A2 (application number EP01928435A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- spectral peaks
- microorganisms
- known microorganisms
- probability
- unknown source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H01—ELECTRIC ELEMENTS
- H01J—ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
- H01J49/00—Particle spectrometers or separator tubes
- H01J49/0027—Methods for using particle spectrometers
- H01J49/0036—Step by step routines describing the handling of the data generated during a measurement
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/53—Immunoassay; Biospecific binding assay; Materials therefor
- G01N33/569—Immunoassay; Biospecific binding assay; Materials therefor for microorganisms, e.g. protozoa, bacteria, viruses
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
- G01N33/6803—General methods of protein analysis not limited to specific proteins or families of proteins
- G01N33/6848—Methods of protein analysis involving mass spectrometry
- G01N33/6851—Methods of protein analysis involving laser desorption ionisation mass spectrometry
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N2570/00—Omics, e.g. proteomics, glycomics or lipidomics; Methods of analysis focusing on the entire complement of classes of biological molecules or subsets thereof, i.e. focusing on proteomes, glycomes or lipidomes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Definitions
- the present invention relates to microorganism identification. More specifically, the present invention relates to a method and system for identifying microorganisms by mass spectrometry-based proteome database searching.
- Proteins expressed in microorganisms can be used as biomarkers for microorganism identification.
- mass spectra obtained by matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) instruments have been employed for rapid microorganism differentiation and classification.
- the identification is based on differences in the observed "fingerprint" protein profiles for different organisms, typically in the mass range 4-20 kDa.
- a crucial requirement for successful identification via fingerprint techniques is spectral reproducibility.
- mass spectra of complex protein mixtures depend in an intricate and oftentimes poorly characterized fashion on a number of factors including sample preparation and ionization technique (e.g., MALDI matrixes, laser fluence), bacterial culture growth times and media, etc.
- a need exists to develop, validate and apply an algorithmic model of the matching and measurement processes and use it to estimate the likelihood of misidentification and to gain insight into the nature of the microorganism identification problem.
- a need also exists to decrease the number of false matches by restricting the number of known proteins in the proteomic database.
- the present invention provides a system and method of quantifying the significance of microorganism identification by mass spectrometry-based proteome database searching through the use of a statistical model of false matches.
- the key to the false match model is the simplifying assumption that the proteins in a microorganism's proteome are uniformly distributed in the mass range of interest. This allows one to calculate the expected number of matches between the peaks in a mass spectrum and the peaks in a proteome. Thus, one can immediately test the null hypothesis that the mass spectrum was not generated by the microorganism in question.
- the present invention provides a system for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms.
- the system includes a proteomic database for storing data of known microorganisms; a processing module for determining the spectral peaks of known microorganisms using the proteomic database; and a scoring algorithm for comparing the spectral peaks of the unknown source with the spectral peaks as determined by the processing module for the known microorganisms.
- the scoring algorithm derives a score for the unknown source based on the number of spectral peaks of the unknown source that match spectral peaks of known microorganisms.
- the system further includes a probability module using at least the derived score and proteomes corresponding to the known microorganisms to determine the probability of observing false matches between the spectral peaks of the unknown source and the spectral peaks of the known microorganisms.
- FIG. 1 is a block diagram of a system for identifying an unknown source having a proteome database, a processing module and a scoring algorithm according to the present invention
- FIG. 2 is a chart illustrating a probability density function (p.d.f.) of protein masses for bacterial proteins in the SWISS-PROT proteome database;
- the present invention derives a model-based distribution of scores due to false matches.
- the inventive model denotes this distribution as $P_K(k)$, where $K$ is the number of peaks in the spectrum of the unknown and $k$ is the number of these peaks that match proteins in the proteome.
- the distribution derived is based on the approximation that the proteins in the underlying proteome are uniformly distributed. This approximation amounts to characterizing the true distribution of proteins by its first moment.
- the derived distribution $P_K(k)$ is compared to histograms obtained from simulated experiments which are performed by sampling simulated spectra from the true protein distributions contained in the proteome database.
- the distribution $P_K(k)$ allows testing of the significance of the scores via hypothesis testing and allows for quantifying the scalability of the approach by establishing limits on the size of the database (number of individual proteomes) and on the size of the proteomes in the database.
- the null hypothesis $H_0$, that the unknown and the known microorganisms are not the same, is tested.
- This section derives and justifies an approximate probability distribution for observing exactly k false matches when a spectrum from an unknown microorganism is compared to the proteome of a known microorganism according to the invention.
- the spectrum is assumed to have $K$ peaks and the proteome is assumed to have $n$ proteins.
- A preferred system setting according to the present invention is illustrated in FIG. 1 and contains three primary components: 1) a database 10; 2) a processing module 20; and 3) a scoring algorithm 30.
- the database 10 contains a label and the corresponding proteome for each potentially observable microorganism. It is understood that the proteomes in the database 10 are neither necessarily complete nor error free. Proteomes may be incomplete because the microorganism in question has not been fully sequenced, or because the proteome has been pruned of low abundance proteins to reduce the likelihood of false matches. Proteomes may have errors due to genetic variability, i.e., strain differences, and because the process of annotation is itself imperfect. Nevertheless, the inventive system and method assume that each proteome is sufficiently inclusive and sufficiently accurate that it is reasonable to expect that some of the proteins in the proteomes will be found in a physical mass spectrum. In such a setting it is reasonable to compare a spectrum to a proteome.
- the processing module 20 includes a biochemical module 22 and a measurement module 24.
- the proteome of a microorganism is not directly observable. Instead, proteomes are inferred from measurements.
- a measurement is a random process that starts with the proteome and generates an observable spectrum through a set of stochastic transformations that account for complex biochemical and measurement, i.e., physical, processes.
- Examples of biochemical processes 42 are posttranslational modifications and RNA edits.
- Examples of measurement processes 44 are multiple charge states, adduct ion formation, prompt and metastable ion fragmentation.
- Noise processes that create spurious peaks also contribute to the complexity of the measurement process.
- the scoring algorithm 30 is simple and known by one ordinarily skilled in the art. For example, the scoring algorithm is used in Demirev et al. The spectrum from an unknown source is compared to a known proteome by matching spectral peaks against proteins in proteomes. A database hit occurs when the mass of a protein in the database 10 differs from the mass of a spectral peak by at most $\Delta m / 2$.
- a spectral peak with one or more database hits is said to be a "matched peak".
- the number of spectral peaks that match proteins in a microorganism's proteome is said to be the "score" of the microorganism.
- Equation (3) we refer to
- If Equation (3) is approximated by the standard normal approximation, then, in terms of the fraction of matched peaks, $f = k/K$, we obtain
- the model predicts the following: 1) for sparse proteomes, linear dependence of matched fraction as a function of proteome size, 2) for dense proteomes, saturation of matched fraction at 100%, and 3) transition from linear dependence to saturation at a proteome size that is inversely proportional to the matching tolerance, $\Delta m$.
- These general features are easily derived from the theoretical form, but they can also be understood intuitively.
- linear behavior of the matched fraction follows from considering a small number of proteins, randomly distributed throughout the mass range $[m_{\min}, m_{\max}]$. The likelihood of at least one database hit is proportional to the number of proteins in $[m_{\min}, m_{\max}]$.
- the database 10 is further restricted to a mass range of 4000 to 20000 Da, since this is the mass range used in previously conducted experiments (Demirev et al.). This leaves a working database of 17652 proteins distributed among 219 microorganisms.
- FIGS. 4A and 4B compare the observed and predicted error bars. Simulated spectra were generated with exactly 15 peaks. The mass range was 4000-20000 Da. For larger proteome sizes, a systematic deviation of approximately 10% is apparent at a resolution of $m/\Delta m = 400$ (FIG. 4A), whereas the agreement at $m/\Delta m = 4000$ is better (FIG. 4B). The discrepancy is attributed to the non-uniformity of the actual proteome distributions. This hypothesis was tested by repeating the simulation with an artificially generated database consisting of uniformly distributed proteomes. In this case, excellent agreement between the theory and the simulation data is observed.
- The fact that microorganisms with dense proteomes have a high probability of matching all the peaks in an unknown spectrum implies that simple ranking algorithms are likely to fail when used with databases that contain such microorganisms.
- simple ranking algorithms will be biased towards incorrectly identifying an arbitrary spectrum as belonging to the microorganism with the densest proteome.
- E. coli contains (in SWISS-PROT, release 37) by far the largest number of proteins in the 4-20 kDa mass range (2124, against 1464 for B. subtilis, currently the next largest microorganism proteome). Accordingly, mass accuracy of approximately 7.5 Da or better is needed for the mass spectral data to be useful for microorganism identification via a simple ranking algorithm. This corresponds to $m/\Delta m \approx 2 \times 10^3$, or a mass resolution of approximately 500 ppm. This relatively modest mass accuracy requirement enhances the prospects for small and inexpensive laboratory instruments for microorganism identification, since such mass accuracy may be achieved in the near future in field-portable instruments.
- the inventive system, e.g., the processing module or another module, uses the derived probability distribution of false matches to test $H_0$ (the null hypothesis that the unknown and the known proteomes are not the same) by calculating the probability that the score exceeds the observed score, $k_{\mathrm{obs}}$,
- the significance values imply quite tight restrictions on the size of the database 10 that can be used for microorganism identification with the full proteome.
- If the database 10 contained three or more microorganisms whose proteome sizes were comparable to that of E. coli (2124 proteins), it would have been likely for at least one of these other microorganisms to have accidentally achieved a score exceeding the E. coli score. This would have resulted in a misidentification.
- a database containing 10 or more microorganisms with proteomes whose sizes were comparable to that of B. subtilis would be likely to yield a microorganism that would exceed the observed number of matches against the B. subtilis proteome.
- the computed significance levels are sufficient to demonstrate the ability to identify microorganisms if the number of microorganisms under consideration is limited. It is clear from the relatively modest significance levels that there is considerable room for improvement in both experimental and data processing techniques. In particular, the identification accuracy can be improved by maximizing true matches and minimizing false matches. True matches could be increased by: 1) improving measurement techniques so that more proteins are detected and 2) accounting for biochemical processes (e.g., posttranslational modifications) and measurement processes (e.g., multiple charge states, adduct ions, etc.) that modify the molecular masses of the nominal proteomes.
- False matches could be reduced by: 1) increasing the mass-accuracy of the measurements, and 2) pruning the proteomes (e.g., excluding low abundance or unexpressed proteins) to reduce the protein density in the desired mass range.
- only ribosomal proteins are included in the proteome database 10.
- the proteomic database 10 is restricted to include only the more prevalent proteins, such as ribosomal proteins.
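The scoring and significance-testing steps described in the excerpts above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names (`score`, `false_match_probability`, `p_value`) are hypothetical, and the per-peak false-match probability $p = 1 - e^{-n \Delta m / (m_{\max} - m_{\min})}$ is an assumption that follows from treating the $n$ proteome masses as uniformly distributed over the mass range; the numbered equations of the original description do not survive in this extract. The tail sum here includes $k_{\mathrm{obs}}$ itself, which is a choice made for the sketch.

```python
import bisect
import math


def score(peaks, proteome_masses, delta_m):
    """Count spectral peaks that have at least one protein within delta_m / 2."""
    masses = sorted(proteome_masses)
    half = delta_m / 2.0
    matched = 0
    for peak in peaks:
        # Index of the first protein mass that is >= peak - half.
        i = bisect.bisect_left(masses, peak - half)
        if i < len(masses) and masses[i] <= peak + half:
            matched += 1  # one or more database hits: a "matched peak"
    return matched


def false_match_probability(n, delta_m, m_min=4000.0, m_max=20000.0):
    """Assumed chance that a random peak is hit by at least one of n uniform masses."""
    return 1.0 - math.exp(-n * delta_m / (m_max - m_min))


def p_value(k_obs, num_peaks, p):
    """P(k >= k_obs) when each of num_peaks peaks matches falsely with probability p."""
    return sum(
        math.comb(num_peaks, k) * p**k * (1.0 - p) ** (num_peaks - k)
        for k in range(k_obs, num_peaks + 1)
    )


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    peaks = [5000.0, 6000.0, 7000.0]
    proteome = [5001.0, 6500.0, 6998.0]
    k_obs = score(peaks, proteome, delta_m=5.0)
    p = false_match_probability(len(proteome), delta_m=5.0)
    print(k_obs, p_value(k_obs, len(peaks), p))
```

A small `p_value` indicates that the observed score is unlikely to arise from false matches alone, which is the basis for rejecting the null hypothesis that the unknown and the known microorganisms differ.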
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Immunology (AREA)
- Chemical & Material Sciences (AREA)
- Urology & Nephrology (AREA)
- Biomedical Technology (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Hematology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Medicinal Chemistry (AREA)
- Bioethics (AREA)
- Pathology (AREA)
- Cell Biology (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Microbiology (AREA)
- Databases & Information Systems (AREA)
- Food Science & Technology (AREA)
- General Physics & Mathematics (AREA)
- Biochemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Optics & Photonics (AREA)
- Tropical Medicine & Parasitology (AREA)
- Virology (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention relates to a method and system for deriving and validating a simple statistical model that predicts the distribution of false matches between peaks in matrix-assisted laser desorption/ionization mass spectrometry data and proteins in proteome databases. Because the data are noisy and incomplete, simple ranking and hypothesis testing are likely to be insufficient for reliable identification of microorganisms among a large number of candidate microorganisms. To improve microorganism identification, the proteome databases are restricted to include only data relating to a given set of proteins. This restriction yields a more robust model, i.e., one with fewer false matches.
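The effect of restricting the database can be made concrete with a short calculation. The sketch below compares the expected fraction of falsely matched peaks for a full proteome and for a pruned one, and estimates how likely it is that at least one of several unrelated proteomes reaches a given score by chance. It assumes uniformly distributed protein masses and independent proteomes; the proteome size of 2124 and the 4000-20000 Da range come from the description, while the pruned size of 50 proteins, the 7.5 Da tolerance as used here, the score of 10 out of 15 peaks, and all function names are hypothetical choices for illustration.

```python
import math


def matched_fraction(n, delta_m, m_min=4000.0, m_max=20000.0):
    """Expected fraction of peaks falsely matched by n uniformly distributed masses (assumed model)."""
    return 1.0 - math.exp(-n * delta_m / (m_max - m_min))


def tail(k_obs, num_peaks, p):
    """Binomial probability of at least k_obs false matches among num_peaks peaks."""
    return sum(
        math.comb(num_peaks, k) * p**k * (1.0 - p) ** (num_peaks - k)
        for k in range(k_obs, num_peaks + 1)
    )


def any_false_hit(k_obs, num_peaks, p, num_organisms):
    """Chance that at least one of num_organisms independent proteomes reaches k_obs."""
    return 1.0 - (1.0 - tail(k_obs, num_peaks, p)) ** num_organisms


if __name__ == "__main__":
    full = matched_fraction(2124, 7.5)   # full proteome size from the description
    pruned = matched_fraction(50, 7.5)   # hypothetical pruned proteome
    for label, p in (("full", full), ("pruned", pruned)):
        # Hypothetical scenario: 10 of 15 peaks matched, 3 competing proteomes.
        print(label, round(p, 4), any_false_hit(10, 15, p, 3))
```

Under these assumptions, pruning lowers the per-peak false-match probability, which in turn lowers the chance that an unrelated microorganism reaches a high score by accident.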
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US19636800P | 2000-04-12 | 2000-04-12 | |
US196368P | 2000-04-12 | ||
PCT/US2001/011649 WO2001079523A2 (fr) | 2000-04-12 | 2001-04-11 | Method and system for microorganism identification by mass spectrometry-based proteome database searching |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1272657A2 true EP1272657A2 (fr) | 2003-01-08 |
Family
ID=22725109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP01928435A Withdrawn EP1272657A2 (fr) | 2000-04-12 | 2001-04-11 | Method and system for microorganism identification by mass spectrometry-based proteome database searching |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1272657A2 (fr) |
JP (1) | JP2003530858A (fr) |
AU (1) | AU764402B2 (fr) |
WO (1) | WO2001079523A2 (fr) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10155707B4 (de) * | 2001-11-13 | 2006-11-16 | Bruker Daltonik Gmbh | Massenbestimmung für Biopolymere |
EP2157599A1 (fr) * | 2008-08-21 | 2010-02-24 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO | Procédé et appareil d'identification de matériaux biologiques |
EP2439536A1 (fr) | 2010-10-01 | 2012-04-11 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO | Nouveau procédé de classification pour données spectrales |
WO2014014353A1 (fr) | 2012-07-18 | 2014-01-23 | Nederlandse Organisatie Voor Toegepast-Natuurwetenschappelijk Onderzoek Tno | Nouveau procédé de classification pour données spectrales |
JP7151556B2 (ja) * | 2019-03-05 | 2022-10-12 | 株式会社島津製作所 | 微生物同定システム及び微生物同定用プログラム |
CN112614542B (zh) * | 2020-12-29 | 2024-02-20 | 北京携云启源科技有限公司 | 一种微生物鉴定方法、装置、设备及存储介质 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2308917B (en) * | 1996-01-05 | 2000-04-12 | Maxent Solutions Ltd | Reducing interferences in elemental mass spectrometers |
US6059724A (en) * | 1997-02-14 | 2000-05-09 | Biosignal, Inc. | System for predicting future health |
-
2001
- 2001-04-11 AU AU55293/01A patent/AU764402B2/en not_active Ceased
- 2001-04-11 WO PCT/US2001/011649 patent/WO2001079523A2/fr not_active Application Discontinuation
- 2001-04-11 EP EP01928435A patent/EP1272657A2/fr not_active Withdrawn
- 2001-04-11 JP JP2001577506A patent/JP2003530858A/ja not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
See references of WO0179523A2 * |
Also Published As
Publication number | Publication date |
---|---|
WO2001079523A2 (fr) | 2001-10-25 |
AU5529301A (en) | 2001-10-30 |
AU764402B2 (en) | 2003-08-21 |
WO2001079523A3 (fr) | 2002-03-21 |
JP2003530858A (ja) | 2003-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9354236B2 (en) | Method for identifying peptides and proteins from mass spectrometry data | |
Tsur et al. | Identification of post-translational modifications via blind search of mass-spectra | |
US7409296B2 (en) | System and method for scoring peptide matches | |
CN103245714B (zh) | 基于候选肽段区分度标记图谱的蛋白质二级质谱鉴定方法 | |
CN107480470B (zh) | 基于贝叶斯与泊松分布检验的已知变异检出方法和装置 | |
Bailey et al. | Score distributions for simultaneous matching to multiple motifs | |
US20110264377A1 (en) | Method and system for analysing data sequences | |
US7979214B2 (en) | Peptide identification | |
AU764402B2 (en) | Method and system for microorganism identification by mass spectrometry-based proteome database searching | |
Heredia-Langner et al. | Sequence optimization as an alternative to de novo analysis of tandem mass spectrometry data | |
US20030065451A1 (en) | Method and system for microorganism identification by mass spectrometry-based proteome database searching | |
Wu et al. | HMMatch: peptide identification by spectral matching of tandem mass spectra using hidden Markov models | |
JP5610347B2 (ja) | リボ核酸同定装置、リボ核酸同定方法、プログラムおよびリボ核酸同定システム | |
Fenyö et al. | Informatics development: challenges and solutions for MALDI mass spectrometry | |
CN114420213A (zh) | 一种生物信息分析方法及装置、电子设备及存储介质 | |
US20050100980A1 (en) | Method for using saddle-point approximation for the evaluation of intractable conditional probabilities in biotechnology | |
JP4651341B2 (ja) | マススペクトル測定方法 | |
Lysiak et al. | Interpreting Mass Spectra Differing from Their Peptide Models by Several Modifications | |
AL-Qurri | Improving Peptide Identification by Considering Ordered Amino Acid Usage | |
Garcia et al. | An EM‐type approach for classification of bivariate MALDI‐MS data and identification of high fertility markers | |
Kaltenbach et al. | SAMPI: protein identification with mass spectra alignments | |
US20050196811A1 (en) | Peptide identification | |
Boisson et al. | Protein sequencing with an adaptive genetic algorithm from tandem mass spectrometry | |
He et al. | Optimization-based peptide mass fingerprinting for protein mixture identification | |
Rose et al. | An information theoretic approach to rescoring peptides produced by de novo peptide sequencing |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20020831 |
AK | Designated contracting states | Kind code of ref document: A2; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
AX | Request for extension of the european patent | Free format text: AL;LT;LV;MK;RO;SI |
GRAP | Despatch of communication of intention to grant a patent | Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20050908 |