AU764402B2 - Method and system for microorganism identification by mass spectrometry-based proteome database searching - Google Patents
- Publication number
- AU764402B2, AU 55293/01 A
- Authority
- AU
- Australia
- Prior art keywords
- proteome
- microorganisms
- database
- mass
- proteins
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- H—ELECTRICITY
- H01—ELECTRIC ELEMENTS
- H01J—ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
- H01J49/00—Particle spectrometers or separator tubes
- H01J49/0027—Methods for using particle spectrometers
- H01J49/0036—Step by step routines describing the handling of the data generated during a measurement
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/53—Immunoassay; Biospecific binding assay; Materials therefor
- G01N33/569—Immunoassay; Biospecific binding assay; Materials therefor for microorganisms, e.g. protozoa, bacteria, viruses
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
- G01N33/6803—General methods of protein analysis not limited to specific proteins or families of proteins
- G01N33/6848—Methods of protein analysis involving mass spectrometry
- G01N33/6851—Methods of protein analysis involving laser desorption ionisation mass spectrometry
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N2570/00—Omics, e.g. proteomics, glycomics or lipidomics; Methods of analysis focusing on the entire complement of classes of biological molecules or subsets thereof, i.e. focusing on proteomes, glycomes or lipidomes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Immunology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Urology & Nephrology (AREA)
- Hematology (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Biochemistry (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Microbiology (AREA)
- Bioethics (AREA)
- Cell Biology (AREA)
- Pathology (AREA)
- Food Science & Technology (AREA)
- Medicinal Chemistry (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Virology (AREA)
- Tropical Medicine & Parasitology (AREA)
- Optics & Photonics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Description
WO 01/79523 PCT/US01/11649
METHOD AND SYSTEM FOR MICROORGANISM IDENTIFICATION BY MASS SPECTROMETRY-BASED PROTEOME DATABASE SEARCHING
BACKGROUND OF THE INVENTION
1. Field of the Invention
[0001] The present invention relates to microorganism identification. More specifically, the present invention relates to a method and system for identifying microorganisms by mass spectrometry-based proteome database searching.
2. Description of the Related Art
[0002] Proteins expressed in microorganisms can be used as biomarkers for microorganism identification. In particular, mass spectra obtained by matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) instruments have been employed for rapid microorganism differentiation and classification. The identification is based on differences in the observed "fingerprint" protein profiles for different organisms, typically in the mass range 4-20 kDa. A crucial requirement for successful identification via fingerprint techniques is spectral reproducibility. However, mass spectra of complex protein mixtures depend in an intricate and often poorly characterized fashion on a number of factors, including sample preparation and ionization technique (e.g., MALDI matrices, laser fluence), bacterial culture growth times and media, etc.
[0003] It has been proposed to exploit the wealth of information contained in prokaryotic genome and proteome databases to create a potentially more robust approach for mass spectrometry-based microorganism identification (see Demirev, Ho, Ryzhov, Fenselau, Anal. Chem. 1999, 71, 2732-8). This approach is independent of the chosen ionization and mass analysis model. The central idea of this proposed approach is to match the peaks in the spectrum of an unknown microorganism with the annotated proteins of known microorganisms in a proteomic database (e.g., the internet-accessible SWISS-PROT proteomic database).
[0004] The plausibility of the proposed approach was demonstrated by identifying two microorganisms whose genomes are known (B. subtilis and E. coli). The identification was performed by assigning a matching score, k, to each microorganism. This score was simply the number of spectral peaks that matched (to within a specified mass tolerance) the annotated proteins of each of the microorganisms in the database. The microorganisms were subsequently ranked according to their score, and the microorganism with the highest score was declared to be the unknown source of the spectrum.
[0005] Although this simple ranking algorithm succeeded in correctly identifying two microorganisms from a relatively small database, it was nonetheless understood from the outset that more rigorous methods would be necessary to perform robust identification of a broader range of microorganisms over more comprehensive databases. A key component of robust microorganism identification must be the ability to quantitatively assess the risk of false identification. In the present setting, false identification can occur when a large number of spectral peaks accidentally match the masses of proteins in the proteome of an unrelated microorganism. The likelihood of accidental matches, and hence the likelihood of false identification, increases if the mass tolerance is increased or if the size of the known proteome increases.
[0006] In general, it is impractical to estimate the risk of false identification by exhaustively performing a large number of proteome-spectrum comparisons with a large number of experimentally obtained spectra. Instead, it is necessary to base quantitative methods on models of the matching and measurement processes.
[0007] Accordingly, a need exists to develop, validate and apply an algorithmic model of the matching and measurement processes and use it to estimate the likelihood of misidentification and to gain insight into the nature of the microorganism identification problem. A need also exists to decrease the number of false matches by restricting the number of known proteins in the proteomic database.
SUMMARY OF THE INVENTION
[0008] The present invention provides a system and method of quantifying the significance of microorganism identification by mass spectrometry-based proteome database searching through the use of a statistical model of false matches. The key to the false match model is the simplifying assumption that the proteins in a microorganism's proteome are uniformly distributed in the mass range of interest. This allows one to calculate the expected number of matches between the peaks in a mass spectrum and the peaks in a proteome. Thus, one can immediately test the null hypothesis that the mass spectrum was not generated by the microorganism in question.
[0009] Specifically, the present invention provides a system for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms. The system includes a proteomic database for storing data of known microorganisms; a processing module for determining the spectral peaks of known microorganisms using the proteomic database; and a scoring algorithm for comparing the spectral peaks of the unknown source with the spectral peaks as determined by the processing module for the known microorganisms. The scoring algorithm derives a score for the unknown source based on the number of spectral peaks of the unknown source that match spectral peaks of known microorganisms. The system further includes a probability module using at least the derived score and proteomes corresponding to the known microorganisms to determine the probability of observing false matches between the spectral peaks of the unknown source and the spectral peaks of the known microorganisms.
According to one embodiment of this invention there is provided a system for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms, said system comprising: a proteomic database for storing data of known microorganisms; a processing module for determining the spectral peaks of known microorganisms using the proteomic database; a scoring algorithm for comparing the spectral peaks of the unknown source with the spectral peaks as determined by the processing module for the known microorganisms, said scoring algorithm deriving a score for the unknown source based on the number of spectral peaks of the unknown source that match spectral peaks of known microorganisms; and a probability module using at least the derived score and proteomes corresponding to the known microorganisms to determine the probability of observing false matches between the spectral peaks of the unknown source and the spectral peaks of the known microorganisms.
According to another embodiment of this invention there is provided a method for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms, said method comprising the steps of: providing a proteomic database for storing data of known microorganisms; determining the spectral peaks of known microorganisms using the proteomic database; comparing the spectral peaks of the unknown source with the spectral peaks of the known microorganisms and deriving a score for the unknown source based on the number of spectral peaks of the unknown source that match spectral peaks of known microorganisms; and using at least the derived score and proteomes corresponding to the known microorganisms to determine the probability of observing false matches between the spectral peaks of the unknown source and the spectral peaks of the known microorganisms.
Brief Description of the Drawings
Fig. 1 is a block diagram of a system for identifying an unknown source having a proteome database, a processing module and a scoring algorithm according to the present invention;
Fig. 2 is a chart illustrating a probability density function of protein masses for bacterial proteins in the SWISS-PROT proteome database;
Fig. 3 is a chart illustrating a fraction of incorrectly matched peaks as a function of proteome size for $\Delta m$ = {1, 3, 10, 30} Da according to the present invention; and
Figs. 4A and 4B are charts illustrating a standard error in the fraction of incorrectly matched peaks as a function of proteome size for $\Delta m$ = {30, 3} Da, respectively, using the present invention.
Description of the Preferred Embodiments
To assess the likelihood of false identification, the present invention derives a model-based distribution of scores due to false matches. For a given known microorganism with a corresponding annotated proteome, the inventive model denotes this distribution as $P_K(k)$, where $K$ is the number of peaks in the spectrum of the unknown and $k$ is the number of these peaks that match proteins in the proteome. The distribution is derived under the approximation that the proteins in the proteome are uniformly distributed. This approximation amounts to characterizing the true distribution of proteins by its first moment. To test this approximation, the derived distribution $P_K(k)$ is compared to histograms obtained from simulated experiments, which are performed by sampling simulated spectra from the true protein distributions contained in the proteome database.
[0015] The distribution $P_K(k)$ allows testing of the significance of the scores via hypothesis testing and allows for quantifying the scalability of the approach by establishing limits on the size of the database (number of individual proteomes) and on the size of the proteomes in the database. Finally, the null hypothesis that the unknown and the known microorganisms are not the same is tested.
I. Theory
I.a. The setting
[0016] This section derives and justifies an approximate probability distribution for observing exactly $k$ false matches when a spectrum from an unknown microorganism is compared to the proteome of a known microorganism according to the invention. In the mass range $[m_{\min}, m_{\max}]$, the spectrum is assumed to have $K$ peaks and the proteome is assumed to have $n$ proteins. For the purposes of statistical analysis it is useful to work within an unambiguous problem setting. A preferred system setting according to the present invention is illustrated in FIG. 1 and contains three primary components: 1) a database 10; 2) a processing module 20; and 3) a scoring algorithm 30.
[0017] The database 10 contains a label and the corresponding proteome for each potentially observable microorganism. It is understood that the proteomes in the database are neither necessarily complete nor error free. Proteomes may be incomplete because the microorganism in question has not been fully sequenced, or because the proteome has been pruned of low abundance proteins to reduce the likelihood of false matches. Proteomes may have errors due to genetic variability and strain differences, and because the process of annotation is itself imperfect. Nevertheless, the inventive system and method assume that each proteome is sufficiently inclusive and sufficiently accurate that it is reasonable to expect that some of the proteins in the proteomes will be found in a physical mass spectrum. In such a setting it is reasonable to compare a spectrum to a proteome.
[0018] The processing module 20 includes a biochemical module 22 and a measurement module 24. The proteome of a microorganism is not directly observable. Instead, proteomes are inferred from measurements. For purposes of the present invention, a measurement is a random process that starts with the proteome and generates an observable spectrum through a set of stochastic transformations that account for complex biochemical and measurement (i.e., physical) processes. Examples of biochemical processes 42 are posttranslational modification and RNA edits. Examples of measurement processes 44 are multiple charge states, adduct ion formation, and prompt and metastable ion fragmentation.
[0019] Noise processes that create spurious peaks also contribute to the complexity of the measurement process. To obtain a tractable preliminary analysis it is useful to neglect all these complexities and to model the measurement process as a simple random draw (without replacement) of the proteins in the source proteome. The mass of each randomly drawn protein is referred to as a "peak" and the set of masses is referred to as a "spectrum".
[0020] The scoring algorithm 30 is simple and known by one ordinarily skilled in the art. For example, the scoring algorithm is used in Demirev et al. The spectrum from an unknown source is compared to a known proteome by matching spectral peaks against proteins in proteomes. A database hit occurs when the mass of a protein in the database 10 differs from the mass of a spectral peak by at most $\Delta m/2$. A spectral peak with one or more database hits is said to be a "matched peak". The number of spectral peaks that match proteins in a microorganism's proteome is said to be the "score" of the microorganism.
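The scoring rule just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation referred to in the specification: the function names `score` and `rank`, and the representation of the database as a dictionary mapping an organism label to a list of protein masses, are assumptions made for the example.

```python
from bisect import bisect_left


def score(peaks, proteome_masses, delta_m):
    """Count the spectral peaks that have at least one database hit.

    A hit is a protein whose mass differs from the peak mass by at most
    delta_m / 2. Names and data layout are hypothetical.
    """
    masses = sorted(proteome_masses)
    matched = 0
    for peak in peaks:
        # Index of the first protein mass >= peak - delta_m / 2.
        i = bisect_left(masses, peak - delta_m / 2)
        # The peak is matched if that protein also lies below peak + delta_m / 2.
        if i < len(masses) and masses[i] <= peak + delta_m / 2:
            matched += 1
    return matched


def rank(peaks, database, delta_m):
    """Rank organisms (label -> list of protein masses) by descending score."""
    scores = {label: score(peaks, masses, delta_m)
              for label, masses in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A peak with several hits still counts once, which is what makes the score a count of matched peaks rather than a count of hits.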
I.b. Theoretical Distribution of False Matches
[0021] To derive the approximate distribution of false matches, assume that the unknown source and the known microorganism are distinct. Then, by definition, all matches are false matches. We make the simplifying assumption that the proteins in the proteomes are uniformly distributed throughout the mass range $[m_{\min}, m_{\max}]$. The only free parameter in a uniform distribution is the density of proteins (i.e., the number of proteins per unit mass interval). Under this assumption, it is straightforward to write down $p_{match}$, which is the probability that a given peak will be a matched peak. In particular, given any interval of width $\Delta m$ about a mass $m$, the probability $P(q)$ of obtaining exactly $q$ database hits is Poisson distributed:
$$P(q) = \frac{(\rho \Delta m)^{q} \, e^{-\rho \Delta m}}{q!} \qquad (1)$$
where $\rho = n/(m_{\max} - m_{\min})$ is the density of proteins in the proteome in the mass range $[m_{\min}, m_{\max}]$. Consequently, the probability of obtaining no database hits is $P(0) = \exp(-\rho \Delta m)$ and the probability of obtaining at least one database hit is
$$p_{match} = 1 - P(0) = 1 - e^{-\rho \Delta m} \qquad (2)$$
Taking into account the form of $p_{match}$ and the number of ways that $k$ matches can be selected from $K$ peaks yields
$$P_K(k) = \binom{K}{k} \left(1 - e^{-n/n^*}\right)^{k} \left(e^{-n/n^*}\right)^{K-k} \qquad (3)$$
In Equation (3), we refer to
$$n^* = \frac{m_{\max} - m_{\min}}{\Delta m} \qquad (4)$$
as the critical proteome size. If Equation (3) is approximated by the standard normal approximation, then, in terms of the fraction of matched peaks, $f = k/K$, we obtain
$$P_K(f) \approx \frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\!\left(-\frac{(f - f_0)^2}{2\sigma_f^2}\right) \qquad (5)$$
where
$$f_0 = 1 - \exp(-n/n^*) \qquad (6)$$
is the expected fraction of matched peaks, and
$$\sigma_f = \sqrt{\frac{\exp(-n/n^*)\left[1 - \exp(-n/n^*)\right]}{K}} \qquad (7)$$
is the standard deviation of the matched fraction. The normal approximation to the binomial distribution is generally good for $K p_{match} \ge 5$ when $p_{match} \le 0.5$, and $K(1 - p_{match}) \ge 5$ when $p_{match} > 0.5$. The expression for $f_0$ justifies our previous designation of $n^*$ as the critical proteome size, since $f_0 \approx 1$ when $n \gg n^*$ and $f_0 \approx n/n^*$ when $n \ll n^*$. Accordingly, we refer to a proteome that satisfies $n \gg n^*$ as a "dense" proteome and a proteome with $n \ll n^*$ as a "sparse" proteome.
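As a numerical illustration of the false-match model, the following Python sketch evaluates the critical proteome size, the per-peak match probability, the binomial distribution of false matches, and the mean and standard deviation of the matched fraction. The function names are hypothetical and chosen only for this example; the formulas are the ones stated in the text.

```python
import math


def critical_size(m_min, m_max, delta_m):
    """Critical proteome size n* = (m_max - m_min) / delta_m."""
    return (m_max - m_min) / delta_m


def p_match(n, n_star):
    """Probability that a single peak has at least one database hit."""
    return 1.0 - math.exp(-n / n_star)


def false_match_pmf(k, K, n, n_star):
    """Binomial probability of exactly k false matches among K peaks."""
    p = p_match(n, n_star)
    return math.comb(K, k) * p**k * (1.0 - p) ** (K - k)


def matched_fraction_stats(K, n, n_star):
    """Expected matched fraction f0 and its standard deviation sigma_f."""
    p = p_match(n, n_star)
    return p, math.sqrt(p * (1.0 - p) / K)
```

For the 4000-20000 Da range and a 3 Da tolerance, `critical_size` gives about 5333, so a proteome of 1464 proteins is expected to match roughly 24% of the peaks of an unrelated spectrum.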
[0022] The model predicts the following: 1) for sparse proteomes, linear dependence of the matched fraction as a function of proteome size; 2) for dense proteomes, saturation of the matched fraction at 100%; and 3) transition from linear dependence to saturation at a proteome size that is inversely proportional to the matching tolerance, $\Delta m$. These general features are easily derived from the theoretical form, but they can also be understood intuitively.
[0023] In particular, linear behavior of the matched fraction follows from considering a small number of proteins, randomly distributed throughout the mass range $[m_{\min}, m_{\max}]$. The likelihood of at least one database hit is proportional to the number of proteins in $[m_{\min}, m_{\max}]$. Saturation for dense proteomes occurs because in any $\Delta m$ interval there is likely to be at least one protein, so that almost every peak is likely to have at least one database hit, i.e., the fraction of matched peaks is 1. The transition between linear and saturated behavior occurs at the transition between sparse and dense proteomes. We can arbitrarily take this point as the density at which, on average, the spacing between proteins is $\Delta m$. This corresponds to a critical proteome size of $n^* = (m_{\max} - m_{\min})/\Delta m$, which is inversely proportional to the matching tolerance.
I.c. The Empirical Distribution of False Matches
[0024] The previous section derives the distribution of false matches under the assumption that the underlying distribution of proteins is uniform. Since the underlying distribution of proteins is not uniform (see FIG. 2), it is necessary to demonstrate that the derived distribution of false matches, $P_K(k)$, reproduces the observed distribution. To do this, the first two moments (mean and standard deviation) of the empirical distribution are estimated by performing simulated matching experiments, and the observed moments are then compared with those predicted by the theoretical distribution.
[0025] To perform the simulations, a subset of the SWISS-PROT proteome database (release 37) is used. At the present time, only a small fraction of the microorganisms represented in SWISS-PROT are fully sequenced. Moreover, most of the microorganisms (about 85%) are poorly characterized, in the sense that they have fewer than 10 proteins deposited in the database 10. These poorly characterized microorganisms are eliminated from the database 10, since the distribution of the deposited proteins is likely to reflect the intellectual currents of scientific investigation, rather than being representative of any natural distribution.
[0026] The database 10 is further restricted to a mass range of 4000 to 20000 Da, since this is the mass range used in previously conducted experiments (Demirev et al.). This leaves a working database of 17652 proteins distributed among 219 microorganisms. Only three fields are preserved from the SWISS-PROT database in the working database: the protein mass (mass accuracy to 1 Da), the SWISS-PROT accession number, and the name of the microorganism.
[0027] For each source microorganism, 3000 spectra were simulated in silico by randomly selecting 15 proteins (without replacement) from its proteome. Each protein was equally likely to be chosen. To assure that each of these 3000 spectra is unique, the source microorganisms were restricted to the set of 58 microorganisms that contain 50 or more proteins. Each of these microorganisms has over $2 \times 10^{12}$ distinct 15-peak spectra. Consequently, it is extremely unlikely for a spectrum to appear more than once in the simulation.
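The count of distinct spectra and the sampling step can be checked with a short Python sketch. The helper names are assumptions for illustration; the only facts taken from the text are the 15-peak spectrum size, the 50-protein minimum, and sampling without replacement with equal probability.

```python
import math
import random


def distinct_spectra(proteome_size, n_peaks=15):
    """Number of distinct n_peaks-peak spectra: 'proteome_size choose n_peaks'."""
    return math.comb(proteome_size, n_peaks)


def simulate_spectrum(proteome_masses, n_peaks=15, rng=random):
    """Draw n_peaks protein masses uniformly at random, without replacement."""
    return sorted(rng.sample(proteome_masses, n_peaks))
```

For the smallest admissible proteome of 50 proteins, `distinct_spectra(50)` is a little over 2 x 10^12, which is why repeats among 3000 draws are so improbable.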
[0028] Each simulated spectrum is compared against the proteomes of the remaining 218 microorganisms. For each source microorganism, there are $3000 \times 218 = 6.5 \times 10^{5}$ comparisons. Since there are 58 source microorganisms, the total number of spectrum-proteome comparisons is $3.8 \times 10^{7}$. The software is implemented in portable ANSI-C and runs on either PowerPC or Pentium-based machines. It requires approximately 1/2 hour to perform all the simulations reported in this section using a Pentium-II Xeon 400 MHz processor.
[0029] The theoretical distribution predicts that the expected fraction of false matches should depend simply on proteome size. Accordingly, a plot is made of the expected fraction of false matches obtained from the simulations, as a function of proteome size for $\Delta m$ = {1, 3, 10, 30} Da (FIG. 3). Simulated spectra were generated with exactly 15 peaks. The mass range was 4000-20000 Da. Proteome sizes for eight organisms in this mass range are marked. Solid lines are theoretical predictions. The data points are superimposed on the theoretically predicted curves. It is evident that there is excellent agreement between the simulation results and the theoretical prediction. The error bars in FIG. 3 are determined by the standard deviation of the empirically observed distribution and are proportional to the inverse square root of the number of random matching trials used to calculate the mean.
[0030] FIGS. 4A and 4B compare the observed and predicted error bars. Simulated spectra were generated with exactly 15 peaks. The mass range was 4000-20000 Da. For larger proteome sizes, a systematic deviation of approximately 10% is apparent at a resolution of $m/\Delta m \approx 400$ (FIG. 4A), whereas the agreement at $m/\Delta m \approx 4000$ is better (FIG. 4B). The discrepancy is attributed to the non-uniformity of the actual proteome distributions. This hypothesis was tested by repeating the simulation with an artificially generated database consisting of uniformly distributed proteomes. In this case, excellent agreement between the theory and the simulation data is observed.
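The check against an artificial, uniformly distributed database can be imitated with a small Monte Carlo sketch in Python. It is a simplified stand-in, not the ANSI-C software mentioned above: both the proteome and the unrelated spectrum are drawn uniformly from the mass range, a fresh proteome is generated for every trial, and the function names, trial count and seed are assumptions made for the example.

```python
import math
import random
from bisect import bisect_left


def matched_fraction(peaks, sorted_masses, delta_m):
    """Fraction of peaks with at least one protein within +/- delta_m / 2."""
    matched = 0
    for peak in peaks:
        i = bisect_left(sorted_masses, peak - delta_m / 2)
        if i < len(sorted_masses) and sorted_masses[i] <= peak + delta_m / 2:
            matched += 1
    return matched / len(peaks)


def simulated_mean_fraction(n, delta_m, m_min=4000.0, m_max=20000.0,
                            n_peaks=15, trials=500, seed=0):
    """Mean false-match fraction over many uniform proteome/spectrum pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Artificial proteome of n uniformly distributed protein masses.
        proteome = sorted(rng.uniform(m_min, m_max) for _ in range(n))
        # Spectrum of an unrelated source: every match is a false match.
        peaks = [rng.uniform(m_min, m_max) for _ in range(n_peaks)]
        total += matched_fraction(peaks, proteome, delta_m)
    return total / trials


def predicted_fraction(n, delta_m, m_min=4000.0, m_max=20000.0):
    """Theoretical expected matched fraction f0 = 1 - exp(-n / n*)."""
    n_star = (m_max - m_min) / delta_m
    return 1.0 - math.exp(-n / n_star)
```

Comparing `simulated_mean_fraction(n, delta_m)` with `predicted_fraction(n, delta_m)` over a range of proteome sizes gives a rough analogue of the comparison plotted in FIG. 3 for the uniform case.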
[0031] To conclude, the theory presented herein agrees well with the simulation results despite the non-uniformity of the underlying proteome mass distributions. Except for a handful of proteomes, the protein mass distributions of individual microorganisms resemble the mass distribution of all bacterial proteins in SWISS-PROT (see FIG. 2). This distribution is far from uniform, especially in the 4000-20000 Da mass range. Moreover, since the model assumes a uniform mass distribution, one can overestimate the protein density near 4000 Da and underestimate it near 20000 Da. Intuitively, overestimates near 4000 Da tend to cancel underestimates near 20000 Da, leading to a value of $P_K(k)$ that approximates the true distribution.
[0032] Strictly speaking, a large discrepancy between the actual protein distribution and the uniform distribution leads to systematic bias in expected values. For the problem at hand, these biases are small. But in the case of protein distributions that are peaked or have a wide dynamic range (e.g., the exponential mass distributions of tryptic peptides resulting from enzymatic protein digestions), these biases are not small and the empirical distribution of false matches is not well described by a model based on a uniform approximation.
II.a. Mass Accuracy and Proteome Density
[0033] The fact that microorganisms with dense proteomes have a high probability of matching all the peaks in an unknown spectrum implies that simple ranking algorithms are likely to fail when used with databases that contain such microorganisms. In particular, simple ranking algorithms will be biased towards incorrectly identifying an arbitrary spectrum as belonging to the microorganism with the densest proteome. Thus, to use simple ranking algorithms, it is necessary to use databases that exclude microorganisms with dense proteomes. This is problematic if excluded microorganisms are likely to be the sources of unknown mass spectra. Increasing the sophistication of identification algorithms by taking into account complex physical processes (e.g., posttranslational modifications, multiple charge states, adducts, etc.) can exacerbate the problem if including molecular species due to these processes effectively increases the size of the proteome beyond the critical proteome size.
[0034] The existence of a critical proteome density implies a lower limit on the mass accuracy that can be used with a simple ranking algorithm. In particular, suppose the densest proteome in the database 10 has $n_{\max}$ proteins in the mass range $[m_{\min}, m_{\max}]$. The requirement that dense proteomes be excluded from the database 10 implies that $n_{\max} < n^*$, which in turn implies a relationship between the maximum proteome size and the mass accuracy,
$$\Delta m < \frac{m_{\max} - m_{\min}}{n_{\max}} \qquad (8)$$
[0035] For example, E. coli contains (in SWISS-PROT, release 37) by far the largest number of proteins in the 4-20 kDa mass range (2124, against 1464 for B. subtilis, currently the next largest microorganism proteome). Accordingly, a mass accuracy of 7.5 Da or better is needed for the mass spectral data to be useful for microorganism identification via a simple ranking algorithm. This corresponds to $m/\Delta m \approx 2 \times 10^{3}$, or a mass resolution of ~500 ppm. This relatively modest mass accuracy requirement enhances the prospects for small and inexpensive laboratory instruments for microorganism identification, since such mass accuracy may be achieved in the near future in field-portable instruments.
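The arithmetic behind the 7.5 Da figure can be reproduced with the short Python sketch below. The function names are hypothetical, and the reference mass of 15000 Da used for the ppm conversion is an assumption: the text quotes only the ratio $m/\Delta m \approx 2 \times 10^{3}$ without naming the mass it was evaluated at.

```python
def required_mass_accuracy(n_max, m_min=4000.0, m_max=20000.0):
    """Upper bound on the matching tolerance so the densest proteome stays below n*."""
    return (m_max - m_min) / n_max


def accuracy_ppm(delta_m, mass):
    """Express a mass tolerance (Da) relative to a reference mass (Da) in ppm."""
    return delta_m / mass * 1e6


# E. coli has 2124 proteins in the 4-20 kDa range.
bound = required_mass_accuracy(2124)    # about 7.53 Da
ppm = accuracy_ppm(7.5, 15000.0)        # 15000 Da is an assumed reference mass
```

With these inputs the bound comes out at roughly 7.5 Da, and a 7.5 Da tolerance at the assumed 15000 Da reference mass corresponds to about 500 ppm.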
II.b. Significance Testing and Database Size
[0036] The inventive system (e.g., the processing module or another module) uses the derived probability distribution of false matches to test $H_0$ (the null hypothesis that the unknown and the known proteomes are not the same) by calculating the probability that the score equals or exceeds the observed score, $k_{obs}$:
$$\alpha = P(k \ge k_{obs}) = \sum_{k = k_{obs}}^{K} P_K(k) \qquad (9)$$
This sum can be evaluated exactly from Equation (3), or approximately in terms of the matched fractions from Equation (5). The test is performed with $\Delta m$ = 3 Da which, given the mass range 4-20 kDa, implies that $n^* = 5333.3$. This critical proteome size exceeds $n_{\max} = 2124$, so there are no dense proteomes in our bacterial subset of SWISS-PROT. Moreover, the database 10 is restricted to fully sequenced microorganisms only. The calculated significance levels and the scores for the B. subtilis and E. coli MALDI mass spectra published previously (see Demirev et al.) are summarized in Table 1. In both cases the correct microorganism is identified as the source of the spectrum, based on significance level. In the case of E. coli, the null hypothesis was rejected at the $\alpha = 0.311$ significance level, while in the case of B. subtilis, the null hypothesis was rejected at the $\alpha = 0.095$ significance level.
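The significance level can be evaluated directly from the binomial form of the false-match distribution. The Python sketch below does this under the stated uniform-mass assumption; the function name and argument layout are illustrative choices, not taken from the specification.

```python
import math


def significance(k_obs, n_peaks, n_proteins, delta_m,
                 m_min=4000.0, m_max=20000.0):
    """Probability of k_obs or more false matches out of n_peaks peaks."""
    n_star = (m_max - m_min) / delta_m           # critical proteome size
    p = 1.0 - math.exp(-n_proteins / n_star)     # per-peak match probability
    return sum(math.comb(n_peaks, k) * p**k * (1.0 - p) ** (n_peaks - k)
               for k in range(k_obs, n_peaks + 1))


alpha_bsub = significance(6, 14, 1464, 3.0)   # approximately 0.095
alpha_ecoli = significance(7, 17, 2124, 3.0)  # approximately 0.311
```

Both values agree with the significance levels quoted for the B. subtilis and E. coli spectra, which also confirms that the tail sum includes the observed score itself.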
Table 1. Matching scores and significance test results for two experimentally obtained MALDI mass spectra of intact organisms (see Demirev et al.).

B. subtilis spectrum ($\Delta m$ = 3 Da), 14 spectral peaks
proteome size   score   significance level ($\alpha$)   name
1464            6       0.095                           BACILLUS SUBTILIS
587             2       0.437                           BORRELIA BURGDORFERI
509             1       0.737                           HELICOBACTER PYLORI
2124            3       0.888                           ESCHERICHIA COLI

E. coli spectrum ($\Delta m$ = 3 Da), 17 spectral peaks
proteome size   score   significance level ($\alpha$)   name
2124            7       0.311                           ESCHERICHIA COLI
508             1       0.802                           HAEMOPHILUS INFLUENZAE
509             1       0.803                           HELICOBACTER PYLORI
1464            3       0.813                           BACILLUS SUBTILIS

[0037] These are not particularly significant rejections of the null hypothesis. Moreover, the significance values imply quite tight restrictions on the size of the database that can be used for microorganism identification with the full proteome. For example, in the case of E. coli, had the database 10 contained three or more microorganisms whose proteome sizes were comparable to that of E. coli (2124 proteins), it would have been likely for at least one of these other microorganisms to have accidentally achieved a score exceeding the E. coli score. This would have resulted in a misidentification. Similarly, a database containing a sufficient number of microorganisms with proteomes whose sizes were comparable to that of B. subtilis would be likely to yield a microorganism whose score exceeds the observed number of matches against the B. subtilis proteome.
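The link between a significance level and the usable database size can be made concrete with a rough Python sketch. It rests on an assumption that is not stated in the text: the scores of the unrelated organisms are treated as independent, each reaching the observed score with the same probability $\alpha$. The function names are hypothetical.

```python
def expected_false_hits(alpha, n_organisms):
    """Expected number of unrelated organisms that reach the observed score."""
    return alpha * n_organisms


def prob_any_false_hit(alpha, n_organisms):
    """Probability that at least one unrelated organism reaches the observed score,
    assuming independent comparisons (an assumption of this sketch)."""
    return 1.0 - (1.0 - alpha) ** n_organisms


# E. coli example: alpha = 0.311 and three comparably sized proteomes.
p_any = prob_any_false_hit(0.311, 3)    # about 0.67
n_hits = expected_false_hits(0.311, 3)  # about 0.93
```

Under this simplification, three proteomes of E. coli size give roughly a two-in-three chance of at least one accidental score at the E. coli level, in line with the statement above.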
[0038] Had the database 10 not been limited to fully sequenced microorganisms, the search would have turned up a large number of microorganisms with lower, yet more significant scores. One way to more firmly reject the null hypothesis is to observe more matches. In particular, one would need scores of nine matches out of 14 peaks and 10 matches out of 14 peaks to yield significance levels better than 0.05 and 0.01, respectively. Another way of more firmly rejecting the null hypothesis is to decrease the proteome sizes by pruning out proteins that are unlikely to be observed. This would reduce the likelihood of false matches.
III. Discussion
[0039] The computed significance levels are sufficient to demonstrate the ability to identify microorganisms if the number of microorganisms under consideration is limited. It is clear from the relatively modest significance levels that there is considerable room for improvement in both experimental and data processing techniques. In particular, the identification accuracy can be improved by maximizing true matches and minimizing false matches. True matches could be increased by: 1) improving measurement techniques so that more proteins are detected, and 2) accounting for biochemical processes (e.g., post-translational modifications) and measurement processes (e.g., multiple charge states, adduct ions, etc.) that modify the molecular masses of the nominal proteomes. False matches could be reduced by: 1) increasing the mass accuracy of the measurements, and 2) pruning the proteomes (e.g., excluding low abundance or unexpressed proteins) to reduce the protein density in the desired mass range. In a preferred embodiment, only ribosomal proteins are included in the proteome database.
[0040] As already pointed out, taking into account biochemical and measurement processes effectively increases the number of potential matches and thus increases the opportunity for false matches. In effect, it is equivalent to increasing the proteome size and must be done parsimoniously so as not to exceed the critical proteome size. One must begin with a pruned proteome and then limit the number of biochemical and measurement processes that one includes in the model.
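The trade-off between pruning and modelling additional processes can be made concrete with a rough calculation. The sketch below assumes a uniform mass distribution, so that the per-peak false-match probability is about 1 - exp(-effective size / critical size), and it treats every modelled modification, charge state or adduct as one extra candidate mass per protein. The pruned size of 200 proteins and the factor of 4 variants are hypothetical values chosen only for illustration.

```python
import math

# Critical proteome size for a 4-20 kDa range and a 3 Da window (about 5333.3).
N_CRIT = (20_000 - 4_000) / 3

def effective_size(num_proteins, mass_variants_per_protein=1):
    """Candidate masses after modelling extra variants of each protein."""
    return num_proteins * mass_variants_per_protein

def per_peak_false_match(size, n_crit=N_CRIT):
    """Chance that one peak accidentally matches some candidate mass."""
    return 1.0 - math.exp(-size / n_crit)

full = effective_size(2124)                       # E. coli size quoted in Table 1
pruned = effective_size(200)                      # hypothetical pruned proteome
pruned_with_variants = effective_size(200, 4)     # hypothetical: 4 masses per protein

for label, size in (
    ("full proteome", full),
    ("pruned", pruned),
    ("pruned + variants", pruned_with_variants),
):
    print(f"{label:18s} size={size:5d}  p_false={per_peak_false_match(size):.3f}")
```

In this toy setting, pruning lowers the false-match probability sharply, and adding variants gives part of that gain back, which is why the number of modelled processes has to be kept small relative to the critical proteome size.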
[0041] Finally, it is noted that to the extent that these complex processes introduce uncertainty in the observable mass of every protein in the proteome, they will have the effect of convolving the underlying distribution with a distribution whose width represents the range of biochemical and measurement uncertainties. The resulting smearing of the effective protein distribution will tend to make the effective protein distribution more uniform and thus the approximate theoretical distribution disclosed herein should become more accurate.
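The smoothing effect described here can be demonstrated numerically. The sketch below builds a deliberately lumpy, hypothetical histogram of protein masses and convolves it with a box kernel that stands in for the combined biochemical and measurement uncertainty. The wrap-around at the ends of the range and the kernel shape are simplifying assumptions made for illustration.

```python
import random
import statistics

def box_smooth(density, half_width):
    """Circular moving average: convolve a binned density with a box kernel."""
    n = len(density)
    width = 2 * half_width + 1
    return [
        sum(density[(i + j) % n] for j in range(-half_width, half_width + 1)) / width
        for i in range(n)
    ]

def relative_spread(density):
    """Coefficient of variation of the bin contents; 0 means perfectly uniform."""
    return statistics.pstdev(density) / statistics.fmean(density)

random.seed(0)
# Hypothetical lumpy protein-mass histogram: 200 bins with uneven counts.
raw = [random.choice([0, 1, 1, 2, 5]) for _ in range(200)]
smeared = box_smooth(raw, half_width=5)

print(f"relative spread before smearing: {relative_spread(raw):.3f}")
print(f"relative spread after smearing:  {relative_spread(smeared):.3f}")
```

The total number of proteins is unchanged by the convolution, while the bin-to-bin variation shrinks, so a uniform-density approximation fits the smeared distribution better than the raw one.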
[0042] To conclude, the present invention quantifies the significance of microorganism identification by mass spectrometry-based proteome database searching through the use of a statistical model of false matches. The model is a useful tool for assessing the significance of identification scores and highlights areas where improvement is necessary in both experimental and data analysis techniques. Given the cluttered and incomplete nature of the data, it is likely that neither simple ranking nor simple hypothesis testing will be sufficient for truly robust microorganism identification. Accordingly, in an effort to improve microorganism identification and to decrease the number of false matches, the proteomic database 10 is restricted to include only the more prevalent proteins, such as ribosomal proteins.
[0043] What has been described herein is merely illustrative of the application of the principles of the present invention. For example, the functions described above and implemented as the best mode for operating the present invention are for illustration purposes only. Other arrangements and methods may be implemented by those skilled in the art without departing from the scope and spirit of this invention.
Claims (3)
- 22. A system for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms, said system being substantially as herein described with reference to the Figures.
- 23. A method for determining a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms, said method being substantially as herein described with reference to the Figures.
- 24. A system of any one of claims 1 to 9 or 22 when used to determine a probability of observing false matches between spectral peaks of an unknown source and spectral peaks of known microorganisms.

Dated 25 June, 2003
The Johns Hopkins University
Patent Attorneys for the Applicant/Nominated Person
SPRUSON & FERGUSON
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US19636800P | 2000-04-12 | 2000-04-12 | |
US60/196368 | 2000-04-12 | ||
PCT/US2001/011649 WO2001079523A2 (en) | 2000-04-12 | 2001-04-11 | Method and system for microorganism identification by mass spectrometry-based proteome database searching |
Publications (2)
Publication Number | Publication Date |
---|---|
AU5529301A (en) | 2001-10-30 |
AU764402B2 (en) | 2003-08-21 |
Family
ID=22725109
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU55293/01A (Ceased) AU764402B2 (en) | 2000-04-12 | 2001-04-11 | Method and system for microorganism identification by mass spectrometry-based proteome database searching |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1272657A2 (en) |
JP (1) | JP2003530858A (en) |
AU (1) | AU764402B2 (en) |
WO (1) | WO2001079523A2 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10155707B4 (en) * | 2001-11-13 | 2006-11-16 | Bruker Daltonik Gmbh | Mass determination for biopolymers |
EP2157599A1 (en) * | 2008-08-21 | 2010-02-24 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO | Method and apparatus for identification of biological material |
EP2439536A1 (en) | 2010-10-01 | 2012-04-11 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO | New classification method for spectral data |
EP2875518A1 (en) | 2012-07-18 | 2015-05-27 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO | New classification method for spectral data |
JP7151556B2 (en) * | 2019-03-05 | 2022-10-12 | 株式会社島津製作所 | Microorganism identification system and program for identification of microorganisms |
CN112614542B (en) * | 2020-12-29 | 2024-02-20 | 北京携云启源科技有限公司 | Microorganism identification method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1998035609A1 (en) * | 1997-02-14 | 1998-08-20 | Biomar International, Inc. | A system for predicting future health |
US5910655A (en) * | 1996-01-05 | 1999-06-08 | Maxent Solutions Ltd. | Reducing interferences in elemental mass spectrometers |
- 2001
- 2001-04-11 WO PCT/US2001/011649 patent/WO2001079523A2/en not_active Application Discontinuation
- 2001-04-11 EP EP01928435A patent/EP1272657A2/en not_active Withdrawn
- 2001-04-11 JP JP2001577506A patent/JP2003530858A/en not_active Withdrawn
- 2001-04-11 AU AU55293/01A patent/AU764402B2/en not_active Ceased
Also Published As
Publication number | Publication date |
---|---|
WO2001079523A3 (en) | 2002-03-21 |
AU5529301A (en) | 2001-10-30 |
EP1272657A2 (en) | 2003-01-08 |
WO2001079523A2 (en) | 2001-10-25 |
JP2003530858A (en) | 2003-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Karpievitch et al. | Normalization and missing value imputation for label-free LC-MS analysis | |
Benson et al. | A method for fast database search for all k-nucleotide repeats | |
EP2450815B1 (en) | Method for identifying peptides and proteins according to mass spectrometry data | |
US20110264377A1 (en) | Method and system for analysing data sequences | |
CN112259167B (en) | Pathogen analysis method and device based on high-throughput sequencing and computer equipment | |
AU764402B2 (en) | Method and system for microorganism identification by mass spectrometry-based proteome database searching | |
Feng et al. | Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies | |
Heredia-Langner et al. | Sequence optimization as an alternative to de novo analysis of tandem mass spectrometry data | |
US20030065451A1 (en) | Method and system for microorganism identification by mass spectrometry-based proteome database searching | |
Wu et al. | HMMatch: peptide identification by spectral matching of tandem mass spectra using hidden Markov models | |
CN114420213B (en) | Biological information analysis method and device, electronic equipment and storage medium | |
WO2019170501A1 (en) | System and method for categorization of nucleic acid sequencing | |
JP7437310B2 (en) | Systems and methods that use local unique features to interpret transcriptional expression levels of RNA sequencing data | |
NZ533685A (en) | Improvements in and relating to interpreting DNA | |
US20050100980A1 (en) | Method for using saddle-point approximation for the evaluation of intractable conditional probabilities in biotechnology | |
Lysiak et al. | Interpreting Mass Spectra Differing from Their Peptide Models by Several Modifications | |
Kaltenbach et al. | SAMPI: protein identification with mass spectra alignments | |
CN115019892B (en) | Confidence determination method for sequence coverage in sequencing of environmental microbiota metagenome | |
JP7543431B2 (en) | FAST-NA for detection and diagnostic targeting | |
Borodinov et al. | Methodology for Assessing the Quality of Genomic Assembly Based on the Analysis of K-Mers Frequency in a Parallel Sequencing Sequencer | |
US7603240B2 (en) | Peptide identification | |
Li | Read simulator for single cell RNA sequencing | |
AL-Qurri | Improving Peptide Identification by Considering Ordered Amino Acid Usage | |
Boisson et al. | Protein sequencing with an adaptive genetic algorithm from tandem mass spectrometry | |
JP2008021260A (en) | System for identifying rna sequence on genome by mass spectrometry |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGA | Letters patent sealed or granted (standard patent) | ||
DA3 | Amendments made section 104 | Free format text: THE NATURE OF THE AMENDMENT IS: AMEND APPLICANT'S NAME TO READ: THE JOHNS HOPKINS UNIVERSITY |