Identification of Micro-organisms
Field of Invention
The field of the invention is the analysis of complex data such as that derived from mass spectrometry and, in particular, the analysis, storage and comparison of spectra derived from biological samples.
Background
The use of mass spectrometry, and in particular matrix-assisted laser desorption-ionisation time-of-flight (MALDI-ToF) mass spectrometry, in the analysis of biological macromolecules is well known. Although originally used in the analysis of semi-purified solutions of peptides, this approach has more recently been applied to the analysis of peptides and other macromolecules and fragments thereof derived directly from whole cells, especially bacterial cells, with the objective of using it to allow identification of species or strains of interest (Krishnamurthy and Ross, 1996, Rapid Commun Mass Spectrom 10: 1992; Claydon et al, 1996, Nature Biotechnology 14: 1584; Hettick et al, 2004, Anal Chem 76: 5769).
One major problem with the use of intact cell mass spectrometry (ICM) is the complexity and poor reproducibility of the spectra generated from such complex mixtures of biological macromolecules and the wide variety of contaminants. The identification of informative peaks and patterns from such heterogeneous spectra and the frequently poor and variable signal-to-noise ratios present significant technical challenges, not least in terms of the very large datasets represented by individual stored spectra. Any workable system for recognising spectra and thereby identifying particular species, strains or mutants present must involve some form of comparative database comprising stored spectra and be capable of analysing and comparing a very large number of features in order to identify matching spectra.
In addition to the problem of the sheer size of the datasets involved, the reproducibility of data generated from ICM techniques suffers from shift (especially at higher masses), uncertainty in amplitude value, and variations due to sample condition (age, culture, preparation etc) unrelated to the species itself. Such trivial or artefactual differences between spectra are difficult for any analytical system to deal with since the rules for deciding on the significance of the variations are complex and subtle.
A variety of analytical approaches have been attempted including conventional multivariate statistical analysis and pattern recognition techniques such as intelligent systems, fuzzy logic, and neural networks. Some instrument makers have used modified Jaccard methods and RMS (i.e. effective averaging of spectrum) but without effective success. Such methods involve large computational tasks, which may be practicable for small data sets but may take days for a full comparison to be made.
International patent application WO 00/29987 discloses the generation of mass spectra from whole cells, the peaks from which are compared with predicted molecular masses from sequence databases. There is no disclosure of the concept of building a database of complex spectra derived from whole cells, nor of any computing method to analyse such spectra and assign identity.
International patent application WO 01/79523 describes a scoring algorithm used in conjunction with mass spectrometric analysis. However, the application requires the use of a proteome database, rather than a database of reference spectra, and the method relies on determining a probability of observing false matches between compared spectra.
US patent application US2002/0138210 relates to a method of compensating for changes in 'fingerprint spectra' of micro-organisms resulting from environmental changes, based on comparison with observed changes in the spectra of related and unrelated organisms subjected to similar environmental changes using principal component analysis (PCA). Among the methods that may be used to detect differences in the spectra are artificial intelligence methods, including neural networks and fuzzy logic.
The problem with fuzzy logic (especially using fuzzy sets), neural networks, neurofuzzy and intelligent systems is that they cannot resolve the difference between similar micro-organisms within the required resolution and accuracy (strain level), as they are either too fuzzy or too rigid.
International application WO 00/28573 describes a method of comparing complex datasets, especially mass spectra, based on the principle of defining a plurality of datapoints to be compared across a complete range of data, converting each datapoint to a vector spatial function, said function being characteristic of the position/shape and/or relative intensity of the data at that point, assembling the vector spatial functions for the data range in question as a cluster and then determining the kernel function of said cluster. A radial base function of the cluster kernel of the sample is then determined, which is compared with the radial base function of the cluster kernel of the other data items in the database. The radial base function of the spectral data may be applied across a neural network, which may be used to analyse pattern distributions of radial base functions of the local kernel clusters using Cover's Theorem. Two important points flow from this use of Cover's Theorem.
1. A non-linear transformation Φ of input patterns, X, to a Euclidean measurement space, Φ: X → E^d, might transform a complex pattern classification problem into a linearly separable one.
2. The high dimensionality of the measurement space E^d compared to the input space means that a complex pattern classification problem cast in this high dimensional space is more likely to be linearly separable than in a low dimensional input space.
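The first point can be illustrated with a minimal sketch that is not taken from the cited application. The XOR pattern is not linearly separable in its two-dimensional input space, but becomes so after a non-linear map into three dimensions. The map `phi`, the weights `w` and the offset `b` below are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: XOR is not linearly separable in 2-D, but a
# hypothetical non-linear map phi into 3-D makes it linearly separable.

def phi(x1, x2):
    """Assumed non-linear transformation: append the product term x1 * x2."""
    return (x1, x2, x1 * x2)

# XOR input patterns with their class labels.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# A separating hyperplane in the 3-D measurement space (assumed values):
# a positive score means class 1, otherwise class 0.
w = (1.0, 1.0, -2.0)
b = -0.5

def classify(x1, x2):
    z = phi(x1, x2)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else 0

predictions = [classify(*x) for x, _ in patterns]
labels = [label for _, label in patterns]
print(predictions == labels)
```

The price of this separability is the extra dimension, which is the trade-off discussed in the paragraphs that follow.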
A search engine based on this approach was developed (Manchester Metropolitan University Search Engine - MUSE). A method of improving the quality of a database for use with such a search method was described in International Application WO 01/67295. The method comprises determining a single searchable reference point for a plurality of replicate samples of each item, establishing the co-ordinates of the replicate reference points in high dimensional space and thereafter determining a single reference co-ordinate for the cluster of replicate reference points for initial searching and/or comparison.
The use of MUSE, based on Cover's theorem and using fuzzy set overlaps for transferring data to a higher dimensional space, has had some success in dealing with the above problems. Fuzzy sets were also used in the integration of neural networks with fuzzy logic to develop a so-called 'neurofuzzy' system. However, there remain significant limitations in MUSE. It is true that increasing dimensionality helps in achieving higher discrimination in dealing with ICM data, but this is at the cost of the 'curse of dimensionality', which means that the number of calculations required increases rapidly with the increase of dimensionality. The calculation time required is significant and limits the practicality of this approach. There are also implications for the error created in the transformation to a higher dimensional space and in computations in that space. For example, if an error of order 0.01 ppm (i.e. 10^-8) is created in each calculation for typical ICM data of dimension 10^4, the final accumulated error, depending on the size of the calculation, would be around 100 ppm, which is comparable with the error created by the ICM process itself. This also creates an upper limit on the use of MUSE.
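The arithmetic behind this estimate can be checked with a short sketch. It assumes, as a simplification not stated in the text, that one error of the quoted size is incurred per dimension and that the errors add linearly (a worst case); the variable names are illustrative.

```python
# Worked arithmetic for the error estimate, assuming one error per
# dimension and simple linear (worst-case) accumulation.
per_calculation_error = 0.01e-6   # 0.01 ppm expressed as a fraction (1e-8)
dimension = 10 ** 4               # typical ICM data dimension quoted above

accumulated_error = per_calculation_error * dimension
accumulated_ppm = accumulated_error / 1e-6

print(f"accumulated error is about {accumulated_ppm:.0f} ppm")
```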
A further problem with the application of ICM analysis to practical problems in health care, such as diagnosis and epidemiology, is the high probability of samples containing cells of more than one species or strain. The presence of unrelated ICM data due to impurities and mixed-culture samples leads to major difficulties even for relatively sophisticated analytical approaches such as MUSE, since, by transforming all data to a higher dimensional space, there is then no means of separating the misleading data.
In MUSE the whole of the data is taken into account and transformed into a higher dimensional space, including noise and disturbance created by the instrument. Therefore the final results in MUSE are also dependent on the quality of the instrument.
Because of the way that MUSE works, no particular region of interest is selected, and a whole map or universe of discourse must be magnified and focused. This demands a higher dimensional space, a higher number of calculations and the overall performance and discrimination power of the search engine is limited by errors created by the high number of calculations and required time for performing these calculations. Therefore the search engine is not always able to deliver the required magnification for the whole map or required discrimination power for typing of some micro-organisms.
It is an object of the invention to be able to identify and detect micro-organisms at strain level. Pathology labs are well equipped to rapidly detect at species level, but clinical diagnosis often requires identification at strain level. For example, the genus Salmonella has over 2000 different strains, and the correct identification of Escherichia coli and Staphylococcus aureus has crucial importance for effective treatment planning and public health policy.
For example, it presently takes 2-7 days to identify a meningococcal infection to strain level, following sub-culturing, which itself can take between 6 and 48 hours. It is an object of the invention to provide a method of analysing spectral data, generated from bombarding whole bacterial cells (after 18 hour cultures) on the intact cell MALDI-ToF-MS, much more quickly.
Statement of Invention
The invention provides a novel method of analysing complex datasets, such as mass spectrometric data, applicable to detection, identification, typing and analysis.
There are four clear steps in the process as applied to identifying micro-organisms.
1. Isolating and preparing the test micro-organism for testing
2. Creating the spectra
3. Creating and using a database of reference spectra
4. Analysing the spectra for identification
Methods for steps 1 and 2 are well-known. The invention concerns steps 3 and 4. It will be clear to one of skill in the art that the methods disclosed are of more general application than the field of identification of micro-organisms alone.
In one aspect, the invention provides a method for creating a reference database of orthonormal mass spectrum data. Preferably the spectra are MALDI-ToF spectra derived from intact cells. It is also preferred that the spectra are derived from micro-organisms, more preferably from pathogenic micro-organisms, most preferably pathogenic bacteria including Staphylococcus aureus (including antibiotic resistant strains, such as methicillin-resistant Staphylococcus aureus or MRSA), Escherichia coli (including pathogenic strains such as O157), Salmonella sp., Enterococcus faecium and Listeria monocytogenes. The method comprises the following steps.
1. For each record in the database, the MALDI-ToF spectrum of a set of replicates (identical samples) is acquired. Preferably 2 to 20, and more preferably 5 to 10 replicates are acquired.
2. Optionally, the generated spectra may be inspected for any error and consistent replicates selected.
3. A series of templates of meaningful data and regions of interest (such as peaks) for each record are defined using selected replicate spectra for each record. This is done by setting a number of thresholds dividing data into non- null data (above the threshold) and null data (below the threshold). Preferably 2 to 10 thresholds are set, more preferably 5 to 7.
4. At the first (highest) threshold, the number of contiguous non-null data regions in all replicates is determined.
5. Missing non-null data in one or more replicates (as compared with the others) may be searched for by lowering the threshold.
6. Lower thresholds are selected until all replicates produce similar templates or the minimum threshold is reached.
7. The degree of similarity of the templates is examined by means of a dendrogram across the replicates and the most similar ones are selected.
As an alternative to steps 5-7, it is possible to set a different threshold, higher than the threshold above, provided that the minimum number of templates that will be selected is at least equal to, but less than 3 times, the number of records in the database.
8. The range of masses used in the expanding templates is determined by using a histogram of templates across the whole range of records in the database.
9. Common templates are selected from the population of the same strains in the database records. These common templates are taken to represent that strain in the database and are termed the 'basic PeaksCell'.
10. Templates are expanded or modelled across all obtained mass ranges and expanded models of templates are defined as PeaksCells. In this way, depending on the complexity of the expanded model, the PeaksCell could have different dimensions ranging from 1 to any desired accuracy. In a database of very different micro-organisms and when identification to strain level is not required, a simple model of the contiguous non-null data, PeaksCell or basic PeaksCell (such as boxes, spikes or curvilinear co-ordinates) would be sufficient for identification purposes. For accurate identification of closely related organisms at strain level, the PeaksCells are obtained by expanding the region of meaningful data in terms of any localised functions (e.g. Gaussians).
11. The expanded model of each strain in the database is expressed as a vector.
12. The vectors are then transformed to an orthonormal (orthogonal) set of vectors (Ri), which span vector space (G), which have no projection on each other. In this way new vectors are in the directions of the components with no projection on any other vectors in old space, and are called invariants. Although it is possible to transform them in any other direction, this has no benefit. The dimension that new transformed vectors span in the new space is equal to the dimension of the old space and is less than the dimension of each vector itself. In this way, the dimension of each vector is much less than the dimension of the original spectral data.
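Step 12 does not name a particular orthonormalisation procedure. The following is a minimal sketch, assuming that a Gram-Schmidt process is an acceptable way to obtain a set of vectors with no projection on each other. The function names, the tolerance `eps` and the example `strain_vectors` are hypothetical and are not taken from the method itself.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, eps=1e-12):
    """Return an orthonormal list spanning the same space as `vectors`.

    Assumed procedure (modified Gram-Schmidt): remove from each vector its
    projection on the vectors already accepted, then normalise the rest.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for r in basis:
            c = dot(w, r)
            w = [wi - c * ri for wi, ri in zip(w, r)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis

# Hypothetical expanded-model vectors, one per strain in the database.
strain_vectors = [
    [3.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 1.0],
]
R = gram_schmidt(strain_vectors)
```

After this step every pair of distinct vectors in `R` has a zero inner product and every vector has unit length, so the projection of an unknown onto each of them can be read off independently.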
In a second aspect, the invention provides a method of comparing unknown microbes with records in the reference database for the purpose of identification.
1. For an unknown spectrum, the method as described above in steps 1 to 12 is followed, resulting in a vector (x) in vector space G.
2. x is projected onto the orthonormal set of vectors (Ri), giving the component of x along each vector Ri (x·Ri).
3. In the case of a pure culture and if the unknown microbe strain exists in the database, it is expected that all major components will be in the direction of the vector corresponding to the record (micro-organism and strain) which identifies the unknown.
However, in situations where there are variations and uncertainty resulting in very small components in the remaining projections, fuzzy calculus may be used as a means of projection. In this case, the membership function corresponding to the record in the database which attains the largest value provides the identification.
In the case of an unknown not present in the database, projection of x produces small components in any direction.
In cases of contamination or mixed sample cultures, there will be more than one major component in the direction of vectors corresponding to the records of micro-organisms in the database, which identify the unknown micro-organisms. In other words, there would be more than one major membership function corresponding to the records of the database.
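The three outcomes described above (a single major component, no major component, several major components) can be sketched as follows. This is an illustration only: the cut-off `major` for what counts as a "major" component, the normalisation of x to unit length and the example records are all assumptions, since the text does not fix them.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def identify(x, records, major=0.5):
    """Project x onto each orthonormal record vector and interpret the result.

    `records` maps a record name to its orthonormal vector Ri.
    `major` is a hypothetical cut-off for a 'major' component.
    """
    norm = math.sqrt(dot(x, x))
    if norm == 0.0:
        return "not in database", []
    # Component of the normalised unknown along each record vector.
    components = {name: dot(x, r) / norm for name, r in records.items()}
    majors = sorted(n for n, c in components.items() if abs(c) >= major)
    if not majors:
        return "not in database", []
    if len(majors) == 1:
        return "pure culture", majors
    return "mixed or contaminated", majors

# Hypothetical orthonormal record vectors.
records = {
    "strain A": [1.0, 0.0, 0.0, 0.0],
    "strain B": [0.0, 1.0, 0.0, 0.0],
    "strain C": [0.0, 0.0, 1.0, 0.0],
}

pure = [0.95, 0.05, 0.02, 0.0]
mixed = [0.70, 0.68, 0.03, 0.0]
unknown = [0.05, 0.02, 0.04, 0.99]

for sample in (pure, mixed, unknown):
    print(identify(sample, records))
```

A crisp cut-off is used here only for brevity; the fuzzy membership approach mentioned above would replace it with graded degrees of membership.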
In a third aspect the invention provides a search engine which analyses spectral data for the identification of Intact Cell MALDI (ICM) fingerprints of micro-organisms at strain level rapidly and accurately by use of one or more of the methods herein described.
In a fourth aspect, the invention provides a method for discovering similarities in sources of epidemic infections and identifying the spread of pathogenic micro-organisms within a society and for facilitating the limitation of such outbreaks and cross-infection control according to the methods herein described and comprising the additional steps of finding and selecting templates in a population of similar strains in the database records which are not common with each other. These uncommon templates represent the differences within said strains in the database. The vectors obtained by the methods described are then transformed to an orthonormal (orthogonal) set of vectors, say (R'i), which span a vector space (G') and which have no projection on each other. In this way the new vectors are in the directions of the components with no projection on any other vectors in the old space, and are called variants as they point to the differences at strain level.
In a further aspect, the invention provides a method of identifying a micro-organism comprising the methods herein described.
In a final aspect, the invention provides a computer program for performing the methods herein described, recorded on a data carrier. Also provided is a computer programmed to carry out one or more of said methods. Further provided is an apparatus for analysing mass spectra comprising a computer programmed to carry out said methods.
Detailed description of the invention
A search engine known as BioCypher™ has been developed as an intelligent search/analysis engine that has demonstrated the capability of learning and interpreting complex spectral data.
The invention will now be described in more detail, with reference to the following drawings.
Figure 1: Outline of ICM process and BioCypher analysis
Figure 2: (a) ICM and selected threshold;
(b) simple boxes are used to encapsulate selected peaks;
(c) a simplified model of (b) depicting only the centres of the selected peaks.
Figure 3: a curvilinear coordinate (r θ z) model of the spectrum of Figures 1 and 2.
Figure 4: (a) a complex peak and
(b) its internal imaginary components
Figure 5: (a) a complex peak and
(b) its external imaginary components
Figure 6: shows distinguishing biomarkers in different strains of MRSA
Figure 7: shows two replicates of the same strain of MRSA
Figure 8: shows the results of 13 isolates of 3 different strains analysed by conventional techniques, including MUSE. 1: Vancomycin res. Enterococcus, 2: Vancomycin res. Enterococcus, 3: Vancomycin res. Enterococcus, 4: Enterococcus faecium, 5: Enterococcus faecium, 6: Enterococcus faecium, 7: Enterococcus faecium, 8: Enterococcus faecium, 9: Enterococcus faecium, 10: Enterococcus faecium, 11: Enterococcus faecium, 49: Listeria monocytogenes, 50: Listeria monocytogenes
Figure 9: shows 3 strains of vancomycin resistant Enterococcus
Figure 10: shows 8 strains of Enterococcus faecium
Figure 11: shows the results of Figure 8 analysed by means of simple 'boxes' as in Figure 2(b), which is an improvement but does not yet have the required accuracy produced by BioCypher
Figure 12: BioCypher analysis of 3 strains of vancomycin resistant Enterococcus
The invention provides a new search and analysis tool designed to identify unknown spectra obtained from whole cells. BioCypher™ operates at a much lower dimension than the original ICM data, whilst at the same time retaining the higher discrimination power required to resolve complex data. BioCypher™ does not require ICM data to have the same dimensions or to be compatible with each other, since it does not need to work in a fixed dimension. It works in an adaptive way.
The basic concept behind BioCypher™ is that working in a lower dimension and achieving higher discrimination at the same time is possible by adaptively focusing on a particular region of interest and not on the whole universe of discourse. This method of focusing and magnifying a region of interest avoids the limitations imposed by Cover's theorem.
BioCypher™ examines ICM data and sets a threshold, which breaks the data into a series of regions of data surrounded by common background (null data). By doing this, BioCypher™ effectively reduces the dimensionality of ICM data and at the same time increases the reproducibility of the whole system. The threshold is a value representing the noise level in the system, and null data are all data below the threshold. The selection of the threshold is important and implies a compromise between filtering of random noise and handling the undesirable effect of shift in ICM data. Lower values result in more robust handling of shift and take more small meaningful variations into account, but at the same time accept more noise and interference. Higher values filter out noise but result in less robust handling of shift and ignore small meaningful variations (although it may be possible to analyse noise to derive meaningful information from it). Note that, to avoid spikes in ICM data, not all data above the threshold are considered as forming a new PeaksCell™. Spikes of negligible area which include 5 or fewer datapoints are difficult or impossible to model and so are not included as PeaksCells.
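The thresholding described above can be sketched as a single pass over the intensity values. This is a minimal illustration, not the BioCypher™ implementation: the function name `find_peakscells`, the treatment of values exactly equal to the threshold as null data, and the example spectrum are assumptions. The `min_points=6` default reflects the stated exclusion of spikes of 5 or fewer datapoints.

```python
def find_peakscells(intensities, threshold, min_points=6):
    """Return (start, end) index pairs, end exclusive, of contiguous runs of
    non-null data (values above `threshold`) that contain at least
    `min_points` datapoints. Shorter runs are treated as spikes and dropped.
    """
    regions = []
    start = None
    for i, value in enumerate(intensities):
        if value > threshold:
            if start is None:
                start = i          # a new non-null region begins
        elif start is not None:
            if i - start >= min_points:
                regions.append((start, i))
            start = None           # back in null data
    # Close a region that runs to the end of the spectrum.
    if start is not None and len(intensities) - start >= min_points:
        regions.append((start, len(intensities)))
    return regions

# Hypothetical spectrum: a 7-point peak, a 2-point spike, a 7-point plateau.
spectrum = [0, 0, 5, 6, 7, 8, 7, 6, 5, 0, 0, 9, 9, 0, 0, 4, 4, 4, 4, 4, 4, 4]
regions = find_peakscells(spectrum, threshold=3)
print(regions)
```

Calling the same function with a lower threshold is one way to look for regions that are missing in one replicate but present in the others.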
'PeaksCell'™ means a region of data in a mass spectrum, which lies above a defined threshold and which is delimited along the x axis by regions of null data lying below the threshold. PeaksCell™ is a term used to describe and model non-null data surrounded by null data. Note that a PeaksCell™ is neither only a peak, nor numerical data which describe ICM data. It may be considered as a dimension that contains or generates other dimensions. It is a model which describes non-null data surrounded by null data, in the sense that it can be described by a few parameters with much lower dimensionality than the non-null data, and with the possibility of visualisation, for example in a curvilinear coordinate system (see Figure 3).
'BioVariant'™ means a series of PeaksCells belonging to one orthonormal dataset from a transformed spectrum in a set of replicates.
'BioInvariant'™ means a set of PeaksCells™ common to a set of replicate data from identical samples. A histogram of all such sets, for all records in the database, defines informative mass regions that should be spanned and transformed or modelled, for example in terms of localised functions.
The problem is then how to obtain an appropriate model of each PeaksCell™. It is important to note that the complexity of the descriptive models and the precision with which PeaksCell™s can be defined by them will determine the resolution achievable. There are a number of model structures available for applying to a PeaksCell™, each with its own advantages and disadvantages. It is possible to select the most suitable, and there is no general restriction limiting all PeaksCell™s to the same selected model structure. It is within the capability of a person skilled in the art to select the appropriate model.
With a reasonable PeaksCell™ model of ICM data, it is possible to handle uncertainties and variation in ICM data, since noise is excluded or minimised in system calculations and it has been observed that shift has much less effect in PeaksCell™ sets than in the raw spectrum itself.
Further, it is possible to resolve problems regarding incompatibility and dimensionality differences in ICM data (created by using different instrument resolutions), because PeaksCell™s are not dependent on the number or dimension of data.
PeaksCells™ also provide a clearer and simpler visualisation of ICM data. It is possible to perform a rough search and analysis of an ICM database using even conventional clustering, multivariate statistical analysis or pattern recognition techniques including intelligent systems, fuzzy logic, neural networks or neurofuzzy systems, although the application of the appropriate technique depends on how PeaksCell™s are modelled. The invention discloses the following methods.
1. If PeaksCell™s are considered as a series or array of numerical parameters, then they may be modelled or presented in an appropriate curvilinear coordinate system (such as r θ z, Figure 3), and conventional multivariate statistical analysis can then be applied directly to a concatenation of all the arrays of numerical parameters of the PeaksCell™ models.
2. For pattern recognition techniques, including intelligent systems, fuzzy logic, neural networks or neurofuzzy systems, PeaksCell™s may be most efficiently used by integrating the models as a part of the above pattern recognition models. For example, PeaksCell™s can be considered as a special first layer of such models.
The discrimination power of the search engine and analysis is dependent on how PeaksCell™s are presented and interpreted. This is like focusing on a specific area of map and increasing magnification power for that area. A PeaksCell™ can be broken further into its components. Two ways are considered here for such 'diffraction' of a PeaksCell™ as follows:
1. Superposition of constituent components inside the PeaksCell™
2. Superposition of constituent components outside the PeaksCell™
The first way can be used mostly without affecting the PeaksCell™ itself. This kind of PeaksCell™ is called a primary PeaksCell™. The primary components of a PeaksCell™ are normally due to instrument accuracy. It can, however, be increased without increasing instrument accuracy by adding more primary components satisfying the boundary conditions of PeaksCell™s.
However, in the second way the PeaksCell™ itself might be concealed (partly or wholly) and as a result new primary PeaksCell™s might be created and replace it. This is because of effects such as doubly charged ions or post-source decay (PSD) that have been included in ICM data. The new PeaksCell™s might themselves undergo further diffraction of the first type.
The invention provides specific criteria for considering differences in the origin of ICM data and taking into account which parts of the data should be considered together. This includes:
a) parts of data that are related to similar effects of the same pathogens,
b) parts of data which are not related to anything or are related to everything,
c) parts of data that are related to differences within the same pathogens, and
d) parts of data that are related to different pathogens.
Therefore PeaksCell™s themselves can be divided according to the variations they may present in a population of replicates of known micro-organisms. All PeaksCell™s which show constant presentation in the population of the replicates of a known micro-organism (group a) are defined as BioInvariant™, and the remaining PeaksCell™s within that population (group c) are called BioVariant™. The BioInvariant™s in a database of mixed pathogens can themselves be divided further into those that are nearly common and those that are not common at all (group d). Similarly, BioVariant™s in a population of the same known pathogens can be divided further into those that are common in a subgroup of the population and those that are not common at all (group b).
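The basic split into constant and non-constant PeaksCell™s across replicates can be sketched as follows. This is a simplified illustration under stated assumptions: each PeaksCell™ is reduced to a single centre mass, and two PeaksCell™s are treated as the same if they fall in the same mass bin of hypothetical width `bin_width`. Neither simplification comes from the text, and fixed bins can split peaks that lie close to a bin edge.

```python
def split_peakscells(replicates, bin_width=5.0):
    """Split PeaksCell centre masses into those present in every replicate
    (BioInvariant-like) and the remainder (BioVariant-like).

    `replicates` is a list of lists of centre masses, one list per replicate.
    Masses are matched by rounding to bins of `bin_width` (an assumption).
    """
    binned = [{round(m / bin_width) for m in rep} for rep in replicates]
    all_bins = set().union(*binned)
    invariant = {b for b in all_bins if all(b in s for s in binned)}
    variant = all_bins - invariant

    def to_mass(bins):
        return sorted(b * bin_width for b in bins)

    return to_mass(invariant), to_mass(variant)

# Hypothetical centre masses from three replicates of one strain.
replicates = [
    [2001.0, 3500.5, 4120.0],
    [2000.2, 3499.0, 5200.0],
    [1999.5, 3501.8, 4121.0],
]
invariant, variant = split_peakscells(replicates)
print(invariant, variant)
```

The further subdivisions into groups b and d would apply the same idea across subgroups of a population or across different records in the database.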
These divisions are especially important since they allow for more discrimination power for handling contamination and mixed cultures, and for epidemiology applications and cross-infection control, as follows:
i. the use of BioInvariant™ (groups a, b and d) for discriminating and identifying micro-organisms in cases of contamination and mixed cultures;
ii. the use of BioVariant™ (groups b and c) for discovering similarities in the source of epidemic infections and identifying the spread of micro-organisms in a society (such as a hospital) for handling outbreaks and cross-infection control.
Microbiologists and epidemiologists normally use learning capability rather than algorithms for typing and classification problems. One possibility is for BioCypher™ to analyse PeaksCell™s by employing "human" inspired techniques for analysing, typing, and classifying biological patterns. This may be achieved as follows.
1. Definition of specific attributes of the spectra of the biological patterns as words. This involves extraction or identification of PeaksCell™ (BioInvariant™s and BioVariant™s) features and representing them as linguistic variables.
2. Each of the linguistic variables defines a single membership function of the spectral pattern under investigation.
3. When all the attributes of an unknown spectral pattern (say A) are correctly assigned to all the "PeaksCell™" of a known pattern (say B) in the database, then A is a full member of B.
R^(j): If x_1 is F_1^j and ... and x_n is F_n^j, then v is G^j
(where x_i, i = 1, 2, ..., n, are PeaksCell™s (BioInvariant™ or BioVariant™) defined by the linguistic variables F_i^j and G^j. Note that the dimension of x_i depends on how PeaksCell™s are built).
The degree of belongingness of 'A' to 'B' is determined by the membership functions of the projected patterns. As mentioned above, BioInvariant™s (groups a, b and d) may be used for discriminating and identifying micro-organisms in handling contamination and mixed-culture cases. BioVariant™s (groups b and c) are useful for discovering the source of epidemic infections, for epidemiology and for cross-infection control.
As the membership functions can have values other than zero and one, they introduce the degree of belongingness of 'A' to other biological patterns in the database, such as 'C', 'D' etc. This is common for biological patterns, which have a degree of similarity and overlap.
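A graded degree of belongingness of this kind can be sketched with a simple fuzzy rule evaluation. The text does not specify the shape of the membership functions or the operator used for "and", so the triangular membership function, the `min` operator, the `half_width` value and the example masses below are all assumptions made for illustration.

```python
def triangular(x, centre, half_width):
    """Assumed triangular membership: 1 at `centre`, falling linearly to 0
    at a distance of `half_width`."""
    return max(0.0, 1.0 - abs(x - centre) / half_width)

def belongingness(unknown, reference, half_width=10.0):
    """Degree to which the unknown pattern belongs to the reference pattern.

    Each unknown feature x_i is compared with the matching reference
    feature; the rule's 'and' is evaluated with min (an assumption).
    """
    return min(
        triangular(x, centre, half_width)
        for x, centre in zip(unknown, reference)
    )

# Hypothetical PeaksCell centre masses for unknown A and records B and C.
pattern_a = [2001.0, 3498.0, 4120.0]
record_b = [2000.0, 3500.0, 4120.0]
record_c = [2000.0, 3500.0, 4126.0]

mu_b = belongingness(pattern_a, record_b)
mu_c = belongingness(pattern_a, record_c)
print(mu_b, mu_c)
```

In this example A belongs more strongly to B than to C, while still having a non-zero degree of membership in C, which is the overlap described above.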
In analysing a database each record may be considered and the analysis above repeated. When all records have been considered individually against the rest of records in the database, the results may be presented in the form of a membership function matrix, which can be used for pictorial purposes, such as drawing a dendrogram.
If this is repeated for only an unknown (or test sample) against all of the records in an ICM database, then the highest degree of belongingness determines the appropriate candidate for the unknown pattern. A maximising decision is defined as a point in the space of alternatives at which the membership function of the decision attains its maximum value. It is also possible to use the negation of membership functions when we are interested in knowing which biological patterns should not be considered, or whether or not the unknown pattern exists in the database. In this case we will have:
R^(j): If x_1 is F_1^j and ... and x_n is F_n^j, then v is NOT G^j
Using the linguistic rules, above, and calculus of these linguistic rules, it is possible to make checks on the completeness, interaction, consistency, and generality of the database.
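The maximising decision and the negated rule can be sketched as follows. The standard fuzzy complement (one minus the membership value) is assumed for NOT, and the `floor` below which no candidate is accepted is a hypothetical parameter; neither is specified in the text.

```python
def maximising_decision(memberships, floor=0.5):
    """Return the record with the highest membership value, or None if even
    the best value is below `floor` (hypothetical cut-off suggesting that
    the unknown is absent from the database)."""
    best = max(memberships, key=memberships.get)
    if memberships[best] < floor:
        return None
    return best

def negation(memberships):
    """Assumed fuzzy complement: degree to which the unknown is NOT each record."""
    return {name: 1.0 - mu for name, mu in memberships.items()}

# Hypothetical degrees of belongingness of an unknown to records B, C and D.
memberships = {"B": 0.8, "C": 0.4, "D": 0.1}

candidate = maximising_decision(memberships)
not_memberships = negation(memberships)
print(candidate, not_memberships)
```

Repeating the membership calculation for every record against every other record would give the membership function matrix mentioned earlier, from which a dendrogram could be drawn.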