GB2520758A - Method of mass spectral data analysis - Google Patents

Method of mass spectral data analysis

Info

Publication number
GB2520758A
GB2520758A (application GB1321149.5A)
Authority
GB
United Kingdom
Prior art keywords
node
data
layer
nodes
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1321149.5A
Other versions
GB201321149D0 (en)
Inventor
Oliver Serang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thermo Fisher Scientific Bremen GmbH
Original Assignee
Thermo Fisher Scientific Bremen GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thermo Fisher Scientific Bremen GmbH filed Critical Thermo Fisher Scientific Bremen GmbH
Priority to GB1321149.5A priority Critical patent/GB2520758A/en
Publication of GB201321149D0 publication Critical patent/GB201321149D0/en
Publication of GB2520758A publication Critical patent/GB2520758A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J 49/00 Particle spectrometers or separator tubes
    • H01J 49/0027 Methods for using particle spectrometers
    • H01J 49/0036 Step by step routines describing the handling of the data generated during a measurement
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

A method of analysing mass spectrometer data to determine the presence of one or more sample compounds comprises receiving mass spectrometer data comprising shared data and unique data. The shared data are associated with each one of a plurality of candidates for one or more sample compounds, and each of the unique data is associated with only one respective candidate from the plurality of candidates. The plurality of candidates is determined from the mass spectrometer data, and a convolution tree is formed based on the plurality of candidates. The unique data and the shared data are used with the convolution tree to calculate a respective candidate probability for each of the plurality of candidates. The invention, which may be applied to proteomics, may reduce the computation time for posterior probabilities by using a tree-shaped Bayesian network wherein the probabilities are calculated for each of the nodes by considering the shared and unique data in separate nodes.

Description

Method of mass spectral data analysis
Field of the invention
The invention relates to a method of probabilistic determination of the species present in a sample based on mass spectral data. In particular, the method uses Bayesian inference to determine the probabilities of the presence and/or the abundance of candidate species in light of the measured mass spectral data.
Background to the invention
In mass spectrometry, a sample is ionised and the resulting ions categorised according to their mass to charge ratio (m/z). The output of a typical mass spectrometer is a spectrum showing the distribution of m/z detected by the system, which must be carefully interpreted to provide information about the sample.
Figure 1 shows a schematic diagram of a mass spectrometer that can be used to carry out the invention. Such a mass spectrometer is described in detail in WO2012/160001. The mass spectrometer 2 is shown in which ions are generated from a sample in an ion source (not shown), which may be a conventional ion source such as an electrospray. Ions may be generated as a continuous stream in the ion source as in electrospray, or in a pulsed manner as in a MALDI source. The sample which is ionised in the ion source may come from an interfaced instrument such as a liquid chromatograph (not shown). The ions pass through a heated capillary 4, are transferred by an RF only S-lens 6, and pass the S-lens exit lens 8. The ions in the ion beam are next transmitted through an injection flatapole 10 and a bent flatapole 12, which are RF only devices to transmit the ions, the RF amplitude being set mass dependent. The ions then pass through a pair of lenses and enter a mass resolving quadrupole 18.
The differential RF and DC voltages of the quadrupole 18 are controlled to either transmit all ions (RF only mode) or select ions of a particular m/z range for transmission by applying RF and DC according to the Mathieu stability diagram. It will be appreciated that, in other embodiments, instead of the mass resolving quadrupole 18, an RF only quadrupole or multipole may be used as an ion guide, but the spectrometer would lack the capability of mass selection before analysis. In still other embodiments, an alternative mass resolving device may be employed instead of quadrupole 18, such as a linear ion trap, magnetic sector or a time-of-flight analyser. Such a mass resolving device could be used for mass selection and/or ion fragmentation. Turning back to the shown arrangement, the ion beam which is transmitted through quadrupole 18 exits from the quadrupole through a quadrupole exit lens 20 and is switched on and off by a split lens 22.
Then the ions are transferred through a transfer multipole 24 (RF only, RF amplitude may be set mass dependent) and collected in a curved linear ion trap (C-trap) 26. The ions are trapped radially in the C-trap by applying RF voltage to the curved rods of the trap in a known manner. The C-trap is elongated in an axial direction (thereby defining a trap axis) in which the ions enter the trap.
Voltage on the C-trap exit lens 28 can be set in such a way that ions cannot pass and thereby get stored within the C-trap 26. Similarly, after the desired ion fill time (or number of ion pulses e.g. with MALDI) into the C-trap has been reached, the voltage on C-trap entrance lens 30 is set such that ions cannot pass out of the trap and ions are no longer injected into the C-trap. More accurate gating of the incoming ion beam is provided by the split lens 22. Ions which are stored within the C-trap 26 can be ejected orthogonally to the axis of the trap (orthogonal ejection) by pulsing DC to the C-trap in order for the ions to be injected, in this case via Z-lens 32 and deflector 33, into a mass analyser 34, which in this case is an electrostatic orbital trap, and more specifically an Orbitrap™ FT mass analyser made by Thermo Fisher Scientific.
The orbital trap 34 comprises an inner electrode 40 elongated along the orbital trap axis and a split pair of outer electrodes 42, 44 which surround the inner electrode 40 and define therebetween a trapping volume in which ions are trapped and oscillate by orbiting around the inner electrode 40, to which is applied a trapping voltage, whilst oscillating back and forth along the axis of the trap. The pair of outer electrodes 42, 44 function as detection electrodes to detect an image current induced by the oscillation of the ions in the trapping volume and thereby provide a detected signal. The outer electrodes 42, 44 thus constitute a first detector of the system. The outer electrodes 42, 44 typically function as a differential pair of detection electrodes and are coupled to respective inputs of a differential amplifier (not shown), which in turn forms part of a digital data acquisition system (not shown) to receive the detected signal. The detected signal can be processed using Fourier transformation to obtain a mass spectrum.
The mass spectrometer 2 further comprises a collision or reaction cell 50 downstream of the C-trap 26. Ions collected in the C-trap 26 can be ejected orthogonally as a pulse to the mass analyser 34 without entering the collision or reaction cell 50, or the ions can be transmitted axially to the collision or reaction cell for processing before returning the processed ions to the C-trap for subsequent orthogonal ejection to the mass analyser. The C-trap exit lens 28 in that case is set to allow ions to enter the collision or reaction cell 50 and ions can be injected into the collision or reaction cell by an appropriate voltage gradient between the C-trap and the collision or reaction cell (e.g. the collision or reaction cell may be offset to negative potential for positive ions). The collision energy can be controlled by this voltage gradient. The collision or reaction cell 50 comprises a multipole 52 to contain the ions. The collision or reaction cell 50, for example, may be pressurised with a collision gas so as to enable fragmentation (collision induced dissociation) of ions therein, or may contain a source of reactive ions for electron transfer dissociation (ETD) of ions therein. The ions are prevented from leaving the collision or reaction cell 50 axially by setting an appropriate voltage to a collision cell exit lens 54. The C-trap exit lens 28 at the other end of the collision or reaction cell 50 also acts as an entrance lens to the collision or reaction cell 50 and can be set to prevent ions leaving whilst they undergo processing in the collision or reaction cell if need be. In other embodiments, the collision or reaction cell 50 may have its own separate entrance lens.
After processing in the collision or reaction cell 50 the potential of the cell 50 may be offset so as to eject ions back into the C-trap (the C-trap exit lens 28 being set to allow the return of the ions to the C-trap) for storage, for example the voltage offset of the cell 50 may be lifted to eject positive ions back to the C-trap. The ions thus stored in the C-trap may then be injected into the mass analyser 34 as described before.
The mass spectrometer 2 further comprises an electrometer 60 which is situated downstream of the collision or reaction cell 50 and can be reached by the ion beam through an aperture 62 in the collisional cell exit lens 54. The electrometer 60 may be either a collector plate or Faraday cup and is connected to a high gain charge sensitive amplifier. It will be appreciated, however, that the electrometer 60 in other arrangements may be another type of charge measuring device. Preferably, the electrometer is of differential type, which reduces noise pick-up from other electrical sources nearby. A first input of the electrometer is arranged to receive current or charge from the ion source while another input is arranged to have similar capacitance, dimensions and orientation to the first input but receives no ion current or charge at all. The electrometer 60 thus constitutes a second detector of the system, which is independent of the first detector, namely the image current detection electrodes 42, 44 of the mass analyser 34. In some arrangements the collision or reaction cell 50 may not be present, in which case the electrometer 60 is preferably located downstream of the C-trap behind C-trap exit lens 28.
It will be appreciated that the path of the ion beam through the spectrometer and in the mass analyser is under appropriate evacuated conditions as known in the art, with different levels of vacuum appropriate for different parts of the spectrometer.
It is to be understood that any other mass spectrometer may be as well suited for use in connection with this invention. The Orbitrap mass analyser could, e.g., be replaced by a time of flight analyser, or mass spectrometer 2 could be a triple quadrupole mass spectrometer, an ion trap mass spectrometer, or a quadrupole time of flight or ion trapping quadrupole time of flight mass spectrometer.
Instead of the LC device, any other separation device, including an ion mobility device, HPLC, GC or ion chromatography, could be interfaced to the mass spectrometer. Also, any known fragmentation method (including collisionally activated dissociation, photon induced dissociation, electron capture or electron transfer dissociation) produces data suitable for use with the invention.
The mass spectrometer 2 is under the control of a control unit, such as an appropriately programmed computer (not shown), which controls the operation of various components and, for example, sets the voltages to be applied to the various components and which receives and processes data from various components including the detectors. The computer is configured to use a known algorithm to determine the settings (e.g. injection time or number of ion pulses) for the injection of ions into the C-trap for analytical scans in order to achieve the desired ion content (i.e. number of ions) therein, which avoids space charge effects whilst optimising the statistics of the collected data from the analytical scan. The algorithm may rely on preceding measurements of the mass analyser 34 or electrometer 60.
For a simple sample, the spectrum may be straightforward to analyse, and the most likely candidates for the sample easily identified. However, for complex samples, for example mixtures of organic substances, the spectra from individual complexes within the sample may overlap, making analysis more difficult.
One technique that can be applied to analysing the data from a mass spectrometer is Bayesian inference. In Bayesian inference, the probability that a set of parameters X = X1, X2, ..., Xk was responsible for a set of measured data, D, can be expressed as

P(X|D) = P(D|X)P(X)/P(D)

where P(X) is known as the prior, and represents an initial guess as to the probability of the parameters, X. P(D|X) is known as the likelihood, and represents the probability that a set of measurements, D, arises from a set of given conditions X. P(X|D) is known as the posterior, and represents the updated probability that the set X was responsible for a particular set of measured data, D. In essence, the posterior is the updated guess of the probability of the parameters X in light of the measured data. Finally, P(D) can be considered as a normalisation constant, and is often left out of the equation.
When applied to mass spectrometry, X represents the candidate species that may be responsible for a given spectrum. The value of Xi will be "1" if the species Xi is present in the sample, and will be "0" if it is not. The values of X are the parameters that are to be determined based on the data retrieved from the mass spectrometer. Therefore, the spectral data are analysed to provide a possible set of candidate species, X, that could be responsible for a spectrum or the data from a plurality of spectra.
The values of the likelihoods, P(D|X), are then calculated for the various possible combinations of the candidate species, X. An initial estimate is made of the probabilities of the various candidates, X, being present to provide the priors, P(X). This value of the priors may be based on some previous knowledge of the sample or similar samples, or it may be simply a guess. The accuracy of the technique is enhanced if an initial guess of the prior can be made based on some empirical information. However, if no such prior knowledge of the sample exists, the priors are usually chosen to reflect a uniform distribution. Finally, the posterior, P(X|D), is calculated by multiplying the likelihood, P(D|X), with the prior, P(X). The posterior, P(X|D), represents the probability that the set of candidates, X, are those present in the sample.
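As a toy illustration of this update, the sketch below computes a posterior for a single candidate species; the prior and likelihood values are purely illustrative assumptions, not values from the patent.

```python
# Minimal sketch of a Bayesian update for one candidate species X1.
# Hypothesis 0 = absent, 1 = present; all numbers are made up.

# Prior P(X1): uniform initial guess.
prior = {0: 0.5, 1: 0.5}

# Likelihood P(D | X1): probability of the observed spectrum D
# under each hypothesis (illustrative values).
likelihood = {0: 0.1, 1: 0.6}

# Unnormalised posterior: P(X1 | D) is proportional to P(D | X1) P(X1).
unnorm = {x: likelihood[x] * prior[x] for x in prior}

# P(D) acts as the normalisation constant.
evidence = sum(unnorm.values())
posterior = {x: p / evidence for x, p in unnorm.items()}

print(posterior)  # updated probability that X1 is absent/present
```

With these numbers the data favour presence, so the posterior for X1 = 1 rises from the prior of 0.5 to 6/7.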
By way of further explanation, it is useful to consider a specific example of a type of sample that can be analysed in a mass spectrometer. However, the invention should not be considered to be limited to the field of the example, since the invention can be applied to many types of mass spectral data.
Mass spectrometry is particularly useful for the identification of proteins (or, in general, other compounds hereafter referred to as "proteins") within an organic sample, a field called proteomics, which is an important aspect of biological and medical research.
In one method of analysing organic samples, sometimes known as "bottom-up" mass spectrometry, proteins are pre-digested into their constituent peptides (or, in general, any components of a compound, hereafter referred to as "peptides"), which are examined and classified in a mass spectrometer. Several databases exist that are able to provide the spectra expected for any given peptide. Therefore, if only a small number of proteins are present in the original sample, it is relatively straightforward to compare the spectrum of the peptides identified in the mass spectrometer with known spectra from various predetermined proteins and to identify the closest match, and so identify the most likely protein that was present in the sample.
The development of large protein databases has made it possible to identify many otherwise unidentified proteins by comparing information from their analysis, such as their sequences or mass spectra, with information in or from the database. Developments in high-throughput peptide analysis techniques, such as robotic gel band excision and digestion, and matrix-assisted laser desorption/ionization (MALDI) mass spectrometry, have made it possible to collect large volumes of data that characterise large numbers of experimental proteins. Such information can be compared with information in databases of known proteins in order to identify such experimental proteins.
Mass spectrometry (MS) is particularly well suited to the analysis of these peptides, especially when used in conjunction with liquid chromatography (LC).
With the use of LC/MS, the peptides of proteins that have been proteolytically digested are separated using methods of LC. A mass spectrometer then analyses the peptides according to their mass-to-charge ratio (m/z), producing a characteristic spectrum of peaks for the peptide, which may belong to one or more proteins. With the use of tandem mass spectrometry (MS/MS), a single peptide of a protein can be selected and subjected to collision-induced dissociation (CID). CID produces fragment ions that are sorted according to their mass-to-charge ratios, producing a characteristic spectrum for the selected peptide. The repeated application of liquid chromatography tandem mass spectrometry (LC-MS/MS) can produce a large number of spectra, each characterizing a plurality of different peptides.
A protein that has been characterized by methods such as LC-MS/MS can be identified by comparing its experimental data, such as the mass spectra of its peptides, with characteristic data such as theoretical mass spectra for peptides of previously identified ("known") proteins. By comparing the experimental data of an unknown peptide to theoretically derived properties of known peptide sequences, the unknown peptide, as well as the unknown protein to which the unknown peptide belongs, can be identified.
Searchable protein databases are available, e.g., at the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov). They include databases of nucleotide sequence information and amino acid sequence information for proteins.
To evaluate MS/MS data for peptides using a nucleotide or protein sequence database, sequences in the database that represent proteins can be divided into sequences representing the peptides that would result from an actual proteolytic digestion of the proteins. A theoretical spectrum can then be generated for each peptide of a protein represented in the database, based on the sequence of the peptide. The theoretical spectrum includes mass-to-charge peaks that would be expected if the protein in the database were subjected to MS/MS and the peptide of interest was selected for characterization. Each theoretical peptide spectrum for proteins represented in the database can be compared to observed peptide spectra for an unknown protein. The similarity of the theoretical peptide spectra to the unknown peptide spectra can then be used to determine the identity of the unknown protein. The SEQUEST and MASCOT search engines implement such a routine for protein identification. See, for example, Eng JK, McCormack AL, and Yates JR 3rd, "An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database", J. Am. Soc. Mass. Spectrom., 1994, 5: 976-989.
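To make the comparison step concrete, the sketch below scores an observed spectrum against two theoretical peptide spectra using a simple cosine similarity over binned intensities. This is a simplified stand-in for the correlation-style scores actually used by search engines such as SEQUEST, and all spectra here are made up.

```python
import math

def cosine_score(observed, theoretical):
    """Cosine similarity between two sparse spectra given as
    {m/z bin: intensity} dictionaries. A simple stand-in for the
    correlation-style scores used by database search engines."""
    common = set(observed) & set(theoretical)
    dot = sum(observed[k] * theoretical[k] for k in common)
    norm_o = math.sqrt(sum(v * v for v in observed.values()))
    norm_t = math.sqrt(sum(v * v for v in theoretical.values()))
    if norm_o == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_o * norm_t)

# Hypothetical binned spectra: the observed spectrum shares peaks
# with theoretical peptide A but not with peptide B.
observed = {100: 1.0, 200: 0.5, 300: 0.8}
peptide_a = {100: 1.0, 200: 0.6, 300: 0.7}
peptide_b = {150: 1.0, 250: 0.9}

print(cosine_score(observed, peptide_a))  # close match
print(cosine_score(observed, peptide_b))  # no shared peaks
```

The peptide whose theoretical spectrum scores highest would be reported as the best match for the observed spectrum.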
The matching of proteins based on their MS/MS fragmentation spectra to data from peptides extracted from databases does not necessarily identify them unambiguously or with 100% confidence. Some spectra may match very closely while others match less closely. A close match may or may not indicate the identity of the unknown peptide. The likelihood of observing a close match by chance can be influenced by a variety of aspects of the comparison and search, including the amount of experimental data, size of the database, and redundancy in the database. Ideally, the effects of this variety of aspects are evaluated probabilistically and together, but finding the exact analytical expression can be very difficult.
When the number of proteins in the original sample is large, the identification of individual proteins becomes much more difficult, and sometimes impossible. Some peptides, so-called degenerate or shared peptides, are shared by more than one protein, making an exact identification of the proteins very difficult.
One way in which the probability that a particular set of proteins (or the probability that a particular protein) is present in the original sample can be calculated is using the Bayesian inference technique described earlier. As an example, the Bayesian network depicted in figure 2 can be used in mass spectrometry-based proteomics when attempting to identify splice variants or closely homologous proteins. All proteins X1, X2, ..., Xk in the weakly connected subgraph share an identical collection of peptides matching the observed spectra. These shared peptides and spectra are denoted D. In addition, each protein also has any number of unique peptides, which are only found in a single protein. These peptides and spectra unique to protein Xi are denoted Di.
In this model, the probability that a given peptide is present is assumed to only depend on the number of present proteins that can produce that peptide.
The approach presented is compatible with any such model; see, for example, Serang, O., MacCoss, M.J., and Noble, W.S., "Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data", Journal of Proteome Research 9.10 (2010): 5346-5357, and Serang, O., and Noble, W.S., "Faster mass spectrometry-based protein inference: junction trees are more efficient than sampling and marginalization by enumeration", IEEE/ACM Transactions on Computational Biology and Bioinformatics 9.3 (2012): 809-817.
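One concrete instance of such a count-based model is a noisy-OR-style emission probability, sketched below. The functional form is in the spirit of the models in the cited work, and the parameters alpha and beta and their values are illustrative assumptions, not taken from the patent.

```python
# A noisy-OR-style model in which the probability that a peptide is
# observed depends only on the NUMBER of present proteins that can
# produce it. alpha (emission) and beta (noise) are illustrative.
def peptide_present_prob(n_present, alpha=0.9, beta=0.05):
    """Each of the n_present parent proteins independently emits the
    peptide with probability alpha; the peptide can also appear as
    noise with probability beta."""
    return 1.0 - (1.0 - beta) * (1.0 - alpha) ** n_present

# The probability depends only on the count, not on which proteins
# are present, and increases monotonically with the count:
print(peptide_present_prob(0))  # noise only, approximately 0.05
print(peptide_present_prob(1))
print(peptide_present_prob(2))
```

It is this dependence on the count alone that the convolution tree described later exploits, since a distribution over the count of present proteins is exactly what a convolution of the individual state distributions produces.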
The brute force method of calculating the posterior for a given protein, say X1, for this system would be to consider every possible combination of the proteins for which X1 = 1. For example, the posterior for a given protein, X1, is proportional to the value of its likelihood function. The likelihood function is calculated by considering the probability of the shared peptide being present given that the protein X1 is present. However, since the shared peptide data will also depend on the presence of each of the other proteins, X2, ..., Xk, the likelihood function for X1 must take into consideration each of these proteins. As there are k proteins in this set, and each protein is either present or absent, the total number of combinations of the other proteins is O(2^k), where O means "of the order of", and therefore this number of calculations is required to calculate the posterior for a single protein.
To calculate the posteriors for all k proteins, then, the total number of calculations would be O(k × 2^k). This brute force method of calculating the posteriors for the proteins by considering every possible combination of the candidate proteins is called "power-set enumeration".
For this method, the likelihood is calculated for every possible starting state of X = X1, X2, ..., Xk:

Pr(X1 = 1 | D) ∝ Pr(D, X1 = 1) = Σ_{X2, ..., Xk} Pr(D, X1 = 1, X2, ..., Xk) = Σ_{X2, ..., Xk} Pr(D | X1 = 1, X2, ..., Xk) ∏_i Pr(Di | Xi) Pr(Xi)

The power set enumeration method will calculate the exact posteriors for each of the proteins, since all combinations of the other proteins are taken into account in the calculation. However, this brute force method is particularly computationally demanding, as there are many calculations to make, even for a reasonably small set of proteins. The graph labelled "power set" in figure 5a shows the time taken to calculate the posteriors on an Intel i3 laptop as a function of the total number of proteins considered. As can be seen, the time taken very quickly becomes unfeasibly large as the number of proteins increases (note that logarithmic scales are used in the figure).
Despite its inefficient runtime in terms of computer processing requirements, power-set enumeration requires only O(k) space (enumeration of the power set can be performed by incrementing a large base-2 number, and therefore does not require storing the entire power set).
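The power-set enumeration described above can be sketched as follows. The unique and shared likelihood functions here are hypothetical stand-ins chosen only to make the example runnable; the shared likelihood is written to depend only on how many candidates are present.

```python
from itertools import product

# Brute-force "power-set enumeration": compute posteriors
# Pr(X_i = 1 | D) by summing the joint over all 2^k configurations.

k = 3
prior = 0.5  # uniform prior Pr(X_i = 1)

def unique_likelihood(i, x):
    # Hypothetical Pr(D_i | X_i = x): unique data mildly favour presence.
    return 0.7 if x == 1 else 0.3

def shared_likelihood(config):
    # Hypothetical Pr(D | X): depends only on how many candidates
    # are present in the configuration.
    n = sum(config)
    return 1.0 - 0.5 ** (n + 1)

posteriors = [0.0] * k
total = 0.0
for config in product([0, 1], repeat=k):  # all 2^k states
    joint = shared_likelihood(config)
    for i, x in enumerate(config):
        joint *= unique_likelihood(i, x) * (prior if x else 1 - prior)
    total += joint  # accumulate the normalisation constant Pr(D)
    for i, x in enumerate(config):
        if x == 1:
            posteriors[i] += joint  # marginalise: sum joints with X_i = 1

posteriors = [p / total for p in posteriors]
print(posteriors)  # O(k * 2^k) work overall
```

Because the toy model treats every candidate identically, all k posteriors come out equal; the point of the sketch is the 2^k-term sum that the convolution tree of the invention avoids.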
In light of the above discussion, it would be desirable to be able to compute the posterior functions for each of the candidates, for example the proteins X1, X2, ..., Xk in the proteomics example above, more efficiently. By reducing the number of calculations necessary to calculate the posterior functions, the total time to return the required values would be reduced, and therefore the total number of candidates, k, that can be considered in the calculation could be increased. In the proteomics example above, this would mean that more complex organic samples could be considered that contain proteins with similar peptide spectra.
Summary of the invention
According to the present invention, there is provided a method of analysing mass spectrometer data from a sample to determine one or more sample compounds therefrom according to claims 1 and 15, and a system according to claim 24 for carrying out the method.
Mass spectral data can be analysed using a convolution tree to reduce the number of calculations necessary to determine compounds in a sample.
In one embodiment, a convolution tree is used to determine the probability of the presence of a candidate compound based on data that has been identified as being unique to given candidates and shared data identified as being shared with two or more candidates. The unique and shared data may be data relating to peptides identified from the spectra or other components of a compound, and the candidate may be a protein or other compound identified as being a possible candidate associated with the unique and shared data.
In another embodiment, the convolution tree is used to determine the abundance of a compound within a sample based on peaks from the mass spectrometer data that have been identified as being unique to given compounds and shared peaks identified as being shared between two or more compounds.
In a first aspect, a method of analysing mass spectrometer data from a sample to determine one or more sample compounds therefrom according to the present invention comprises: receiving mass spectrometer data, the data comprising shared data which are associated with each one of a plurality of candidates for the one or more sample compounds, and unique data each of which is associated with only one respective candidate from the plurality of candidates; determining the plurality of candidates from the mass spectrometer data; and forming a convolution tree based on the plurality of candidates and using the unique data and the shared data with the convolution tree to calculate a respective candidate probability for each of the plurality of candidates.
The convolution tree method streamlines the number of calculations necessary to compute the Bayesian inference probabilities of the candidates. The convolution tree enables the more efficient calculation of probabilities related to the candidate, which are related to the Bayesian posterior of the candidate. For example, as stated later, this could be the probability that the candidate is present or the probability of the candidate having a particular abundance.
Preferably, the convolution tree comprises connected nodes in a layer zero, a plurality of intermediate layers, and a final layer, the layer zero comprising a candidate node for each respective candidate of the plurality of candidates, the plurality of intermediate layers each comprising a respective plurality of intermediate nodes, and the final layer comprising a single, final node, wherein pairs of the candidate nodes are connected to a respective intermediate node in a first intermediate layer, pairs of the intermediate nodes from the first intermediate layer are connected to a respective intermediate node in a second intermediate layer, and so on until a single pair of intermediate nodes is reached in a final intermediate layer, the single pair of intermediate nodes being connected to the final node in the final layer.
This structure of the convolution tree leads to an inverted triangle of nodes, with each layer having half the number of nodes of the previous layer.
This has the advantage that the joint probability of each individual node with all the data above it can be stored, so that when this joint probability is needed again it does not have to be recalculated.
Preferably, calculating a respective candidate probability comprises determining an initial probability for each of the candidates based on the unique data for that candidate and associating the initial probability with the respective candidate node for that candidate. This essentially provides a Bayesian prior of the candidates taking into account the unique data to provide an initial guess of the candidate probabilities.
Preferably, the initial probabilities for pairs of candidate nodes are convolved to provide an effective joint probability for each respective intermediate node of the first intermediate layer, and the effective joint probabilities of respective pairs of intermediate nodes from the first intermediate layer are convolved to provide an effective joint probability for each respective intermediate node of the second intermediate layer, and so on until the effective joint probabilities of the single pair of intermediate nodes in the final intermediate layer are convolved to provide an effective joint probability for the final node in the final layer. At each node, then, an effective "prior" is calculated based on the joint probabilities of the node's parents. This prior is calculated down the tree until priors for all the intermediate nodes have been calculated. The effective joint probability may be the actual joint probability or may be proportional to it (for example, as mentioned above, a normalisation constant may not be included in the calculation). Preferably, calculating a respective candidate probability comprises determining at the final node a likelihood of the final node based on the shared data. This is similar to how a Bayesian likelihood may be calculated for a system with only one unique data node, in that a "prior" can be calculated based on the joint probability of the final node.
Preferably, calculating a respective candidate probability comprises calculating the likelihood of each intermediate node and candidate node, by a deconvolution of the likelihood of the connected node in the layer below with the effective joint or initial probability of the node, starting from the final layer and proceeding upwards to the layer zero until the likelihoods of the candidate nodes have been calculated. This calculation is essentially the opposite of the way in which the joint probabilities were calculated in a downward manner. The likelihood of any node is derived from the likelihood of its child node, and the calculations are then continued up the convolution tree until the likelihoods of the candidate nodes have been calculated.
Preferably, calculating a respective candidate probability comprises calculating the product of the likelihood and initial probability of each of the candidate nodes.
Preferably, each intermediate node in each intermediate layer is connected to only three other nodes, two nodes from the layer immediately above and one node from the layer immediately below. This allows a simple convolution tree which enables the calculations to be carried out efficiently. However, it is possible that a node has more than one child, in which case, the junction tree can be considered to contain more than one convolution tree, where each child from such a node belongs to a different convolution tree.
Preferably, determining the one or more sample compounds comprises determining the respective presence of the one or more sample compounds based on the calculated respective candidate probability for each of the plurality of candidates. In this case, the candidate can be attributed a value of either "0", indicating that it is not present, or "1", indicating that it is present. The candidate probability is then related to the probability that this value is "0" or "1". Advantageously, the candidates may be proteins in the sample and the unique and shared data may be derived from the mass spectrometry data of component peptides. This is an important application of the invention, in which the probabilities of the presence of a large number of proteins can be computed, whereas with prior art techniques the computing power needed means that only samples with a small number of proteins can be analysed.
Advantageously, the shared data are peptides that could belong to more than one candidate.
Advantageously, the shared data are mass spectral peaks common to a plurality of candidates.
Preferably, determining the one or more sample compounds comprises determining a respective abundance of the one or more sample compounds based on the calculated respective candidate probability for each of the plurality of candidates and peak intensities derived from the mass spectrometer data, wherein the shared data comprises a set of peak intensities which are common to two or more of the candidates and the unique data comprises a respective peak intensity which results from only one respective candidate. This further embodiment of the invention is related to embodiments that calculate the probability that a candidate is present or not. However, in this embodiment, the candidate probability represents not just the probability that the candidate takes the value "0" or "1", indicating that the candidate is present or not, but can take intermediate values to indicate a relative abundance of the candidate. This is normally done by dividing the range of possible values into a number of bins ("binning"), and in this way discretising the value of the candidate.
Preferably one or more of the nodes is connected to a further node in the layer below, the further node not comprising part of the convolution tree, but a further convolution tree being formed which comprises the further node. In this manner, when nodes have more than one child, each child is treated as being in a different convolution tree, and the whole junction tree comprises two or more cascaded convolution trees.
Preferably, the candidates are proteins in the sample and the unique and shared data are derived from the mass spectrometry data of component peptides.
In another aspect of the invention, a method of determining a set of compounds within a sample comprises the steps of: determining a set of components, D, based on experimental data; determining a set, X = X1, X2,...Xk, of compounds from which the determined components could have originated; calculating the probabilities for each of X1, X2,...Xk using Bayesian inference based on the determined components, D, where D comprises a set of components D1, D2,...Dk, each of which is uniquely identified with a respective one of the compounds X1, X2,...Xk, and a set of components, D(S), that are common to all of X1, X2,...Xk, based on a convolution tree, the convolution tree comprising a layer 0 of nodes attributed to each of the initial states X1, X2,...Xk, with each node in layer 0 being associated with a respective one of the components D1, D2,...Dk, the convolution tree further comprising a layer 1 formed by pairing each of the nodes from layer 0 with another node from layer 0 into a respective node in layer 1, and subsequent layers formed by similar pairing of the nodes of the layer immediately above, until the final layer comprises a single node associated with the shared components, D(S).
The sample may be prepared by splitting the proteins into constituent peptides and the peptides analysed in a mass spectrometer to obtain experimental data regarding the peptides.
Preferably, the nodes have two sub-trees L and R as parents and one sub-tree as a child, D. Preferably, the joint probability of each node N having a value n is calculated from the joint probability of its parents L having a value l and R having a value r according to:

Pr(N = n) = Σ_l Pr(L = l) · Pr(R = n − l)

Preferably, messages are passed upwards through the convolution tree, wherein each node passes Pr(D, D(R) | L) to its left parent and Pr(D, D(L) | R) to its right parent, the messages being defined by:

Pr(D, D(R) | L = l) = Σ_n Pr(D | N = n) · Pr(D(R), R = n − l)

Pr(D, D(L) | R = r) = Σ_n Pr(D | N = n) · Pr(D(L), L = n − r)

Preferably, a posterior for a node N = n is calculated by multiplying the joint probability stored at the node with the likelihood calculated from its child below according to:

Pr(D, N = n) = Pr(D(L), D(R), N = n) · Pr(D | N = n)

Preferably, a posterior probability of the candidates Pr(Xi = xi | D) is proportional to the joint probability Pr(D, Xi = xi) and can be computed by normalisation so that the summation over all xi is 1, according to:

Pr(Xi = xi | D) = Pr(D, Xi = xi) / Σ_{xi'} Pr(D, Xi = xi')

Preferably, the resulting computational complexity is less than or proportional to n·log(n)·log(n), where n is the number of candidates. Computational complexity may depend on the number of calculations needed to compute the candidate probability.
In another aspect of the invention, a system for analysing a sample and providing posterior probabilities of compounds within the sample comprises: a mass spectrometer arranged to scan a sample and provide data relating to components of the compounds within the sample; a processor in communication with the mass spectrometer arranged to receive the data from the mass spectrometer; and memory in communication with the processor, in which instructions for carrying out any of the above-described methods are stored, wherein the processor is arranged to perform the instructions stored in the memory on the data provided by the mass spectrometer to provide posterior probabilities of compounds within the sample.
Other embodiments of the invention are now described.
The method of analysing mass spectrometer data from a mixed sample to determine the relative abundances of multiple compounds therefrom may comprise: receiving mass spectrometer data, the data comprising a collection of paired intensities and optionally a retention time, ion mobility or other physico-chemical property associated with the observed mass and intensity pair and the corresponding mass to charge ratios (a single paired intensity and mass to charge ratio measurement is hereafter referred to as an "observed fragment") of fragments that could originate from a plurality of possible candidate compounds; determining the plurality of candidates from the mass spectrometer data; forming a causal dependency graph that depicts the additive aggregation of abundances of each fragment possible from every candidate compound; cascading this causal dependency graph into a convolution tree based on the plurality of candidates and using the unique data as observed fragments that can only originate from one such compound, and multiple instances of shared data as observed fragments that can originate from a plurality of such compounds, where each shared data fragment is associated with a single intermediate node in the convolution tree; wherein a posterior distribution over all discretized quantities is calculated for each of the plurality of candidate compounds using this convolution tree.
The method wherein the posterior distribution calculated may be used to compute predicted relative abundances for each intermediate node in the convolution tree as well as the proportion of that abundance originating from each compound of interest, and thus to probabilistically apportion the observed fragments among the candidate compounds.
Multiple convolution trees may be cascaded to send messages (the joint probability messages sent down and the likelihood messages sent up) throughout a junction tree, so that convolution trees that are sub-trees of the junction tree can compute and send messages, and compute posterior distributions efficiently without realizing a data structure for the full multidimensional probability distribution.
Other preferred features and advantages of the invention are set out in the description and in the dependent claims which are appended hereto.
Brief description of the drawings
The invention may be put into practice in a number of ways and some embodiments will now be described by way of non-limiting example only, with reference to the following figures, in which:

Figure 1 shows a schematic diagram of a typical mass spectrometer suitable for providing the mass spectral data for the method of the present invention;

Figure 2 shows a graph representing the candidate nodes, the unique data nodes and shared data nodes;

Figure 3a is a schematic representation of how the cells are connected in the quadratic dynamic programming algorithm;

Figure 3b shows a generalisation of figure 3a where each layer has been merged into a single node, Ni, to form a decomposition tree graph;

Figure 4 shows a schematic representation of a convolution tree according to embodiments of the present invention;

Figure 5a shows experimental data comparing the performance of the quadratic dynamic programming algorithm to the power-set enumeration method;

Figure 5b shows experimental data comparing the performance of the convolution tree algorithm with the quadratic dynamic programming algorithm;

Figure 6a shows a Bayesian network in which several nodes have more than one child node;

Figure 6b shows a cascaded graph equivalent to that of figure 6a;

Figure 7a shows a spectral demixing problem from mass spectrometry; and

Figure 7b shows a cascaded convolution tree to infer compound quantities from the demixing problem in Figure 7a.

Detailed description of embodiments of the invention

The inventor of the present invention set about trying to find a method of improving the efficiency of calculating the exact posterior functions of candidates that have been identified by mass spectral data, in order to be able to analyse more complex samples than can currently be analysed.
Upon recognising that using the power-set enumeration method involves the repetition of many of the calculations, the inventor sought to reduce the total number of calculations necessary to evaluate the posteriors of the candidate nodes. In particular, when evaluating the joint probabilities of the presence of the various candidates when evaluating the likelihood functions for a given candidate, the inventor recognised that the repetition of many of the intermediate calculations could be avoided. The inventor formulated two methods of reducing the total number of calculations necessary to calculate these posteriors by essentially reusing joint probability calculations that had already been evaluated.
Both methods enable samples to be analysed in a shorter amount of time, and also enable the analysis of more complex samples, which would not be able to be analysed with the prior art method due to impossibly long calculation times.
The first of these has been called the quadratic dynamic programming method, which reduces the number of calculations to O(k²), and the second has been called the convolution tree method, which reduces the number of calculations to O(k·log₂(k)·log₂(k)).
Quadratic dynamic programming

The brute-force power-set enumeration method is inefficient at calculating the posteriors of the candidates. This is because for every posterior, the joint probabilities of all of the candidates are calculated from scratch. Since many of the calculations are identical for different combinations of present candidates, many of the intermediate calculations are carried out over and over again for different candidate posteriors. Therefore, in the quadratic dynamic programming approach, the intermediate calculations are carried out a first time, and their result stored so that when the intermediate calculation is required again, its value is simply looked up, rather than recalculated, as in the power-set enumeration method.
This is achieved by adding a series of nodes, N0, N1,...Nk, in a chain, with each candidate node, Xi, connected to a corresponding node, Ni, of the chain, as shown in figure 3b. The final node of the chain is connected to the shared data node, D(S). This graph can be decomposed into the graph shown in figure 3a having k+1 layers of cells, labelled N0 to Nk, connected by "edges" to cells in the subsequent layer.
The cells in any layer are connected to the cells in the next layer by edges. Each edge represents the value of the candidate node, Xi, at that layer. If Xi = 0, then the edge connects the node to a node on the same row in the next layer; that is, the edge is horizontal on the graph. If Xi = 1, the edge connects the node to a node that is one row higher in the next layer. As shown in the figure, if the value of Xi can be larger than 1, then further edges can be added which increase the level of the node in the next layer that the edge connects to.
However, in the proteomic example, only two edges are needed on each node, for Xi = 0 or 1.
To calculate the posteriors of Xi using this method, each edge for Xi = xi is weighted by the product of the prior that the protein has that assignment and the unique likelihood contribution that arises from that protein taking that assignment:

weight(X_{i+1} = x_{i+1}) = Pr(X_{i+1} = x_{i+1}) · Pr(D_{i+1} | X_{i+1} = x_{i+1})

These layers are built so that the final layer is the cumulative sum of all proteins: Nk = N_{k−1} + Xk = N_{k−2} + X_{k−1} + Xk = Σ_i Xi. In this manner, all possible paths to arrive at every possible cumulative sum Nk = n, 0 ≤ n ≤ k, are available in the graph shown in figure 3a. After the graph is initialized by making sure the left-most N0[0].fromLeft = 1 and Nk[j].fromRight = Pr(D(S) | Nk = j), as defined below, two passes are performed: one from the left and one from the right (these passes are sometimes denoted a "forward-backward" algorithm, a special class of message passing on a path graph commonly used by hidden Markov models (HMMs)).
Row j of layer i is denoted as Ni[j], which indicates that Σ_{m≤i} Xm = j. The pass from the left starts by initializing N0[0].fromLeft ← 1, because N0 has a 100% probability that it is 0 (the outcome for the layer is indicated by the row number).
Each node propagates to the right in the following manner:

N_{i+1}[j].fromLeft ← Σ_{x_{i+1}} weight(X_{i+1} = x_{i+1}) · N_i[j − x_{i+1}].fromLeft
The second pass, from right to left, is performed in an almost identical manner. For this pass, every node in layer k is initialized with the appropriate shared likelihood given n: Nk[j].fromRight ← Pr(D(S) | Nk = j). Then the same propagation is performed, but from right to left:

N_i[j].fromRight ← Σ_{x_{i+1}} weight(X_{i+1} = x_{i+1}) · N_{i+1}[j + x_{i+1}].fromRight
The forward-backward likelihood of being in a given cell can be computed by Ni[j].likelihood = Ni[j].fromLeft × Ni[j].fromRight. Thus, the posterior that any variable is in a certain state X_{i+1} = x is computed by the total weight of all paths that pass through edges assigning X_{i+1} = x:

Pr(X_{i+1} = x | D) ∝ Σ_j N_i[j].fromLeft · weight(X_{i+1} = x) · N_{i+1}[j + x].fromRight

For the three preceding equations, out-of-bound indices (e.g. querying Ni[−1].fromLeft and Ni[−1].fromRight) should return zero.
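The two passes and the posterior computation above can be sketched in Python for binary candidates. The function names and example structure here are illustrative assumptions, not code from the patent; the brute-force reference function mirrors the power-set enumeration method for comparison:

```python
import numpy as np
from itertools import product

def qdp_posteriors(prior, lik_unique, lik_shared):
    # prior[i]      = Pr(X_i = 1)
    # lik_unique[i] = (Pr(D_i | X_i = 0), Pr(D_i | X_i = 1))
    # lik_shared[n] = Pr(D_S | N_k = n), for n = 0..k
    k = len(prior)
    # Edge weights: prior times unique likelihood for each assignment.
    w = [((1 - prior[i]) * lik_unique[i][0], prior[i] * lik_unique[i][1])
         for i in range(k)]
    # Pass from the left: fromLeft[i][j] = Pr(D_1..D_i, N_i = j).
    fromLeft = np.zeros((k + 1, k + 1))
    fromLeft[0][0] = 1.0
    for i in range(k):
        for j in range(i + 2):
            s = w[i][0] * fromLeft[i][j]
            if j >= 1:
                s += w[i][1] * fromLeft[i][j - 1]
            fromLeft[i + 1][j] = s
    # Pass from the right: fromRight[i][j] = Pr(D_{i+1}..D_k, D_S | N_i = j).
    fromRight = np.zeros((k + 1, k + 1))
    fromRight[k, :] = lik_shared
    for i in range(k - 1, -1, -1):
        for j in range(k + 1):
            s = w[i][0] * fromRight[i + 1][j]
            if j + 1 <= k:
                s += w[i][1] * fromRight[i + 1][j + 1]
            fromRight[i][j] = s
    # Posterior: total weight of paths through edges labelled X_i = x.
    post = np.zeros(k)
    for i in range(k):
        p1 = sum(fromLeft[i][j] * w[i][1] * fromRight[i + 1][j + 1]
                 for j in range(k))
        p0 = sum(fromLeft[i][j] * w[i][0] * fromRight[i + 1][j]
                 for j in range(k + 1))
        post[i] = p1 / (p0 + p1)
    return post

def powerset_posteriors(prior, lik_unique, lik_shared):
    # Brute-force enumeration over all 2^k assignments, for comparison.
    k = len(prior)
    num, den = np.zeros(k), 0.0
    for xs in product([0, 1], repeat=k):
        p = lik_shared[sum(xs)]
        for i, x in enumerate(xs):
            p *= (prior[i] if x else 1 - prior[i]) * lik_unique[i][x]
        den += p
        for i, x in enumerate(xs):
            if x:
                num[i] += p
    return num / den

# Hypothetical three-candidate example.
prior = [0.3, 0.6, 0.5]
lik_unique = [(0.8, 0.4), (0.5, 0.9), (0.7, 0.2)]
lik_shared = [0.1, 0.5, 0.9, 0.3]                 # Pr(D_S | N = 0..3)
posteriors = qdp_posteriors(prior, lik_unique, lik_shared)
```

Both routines return the same posteriors; the dynamic programming version does so in O(k²) operations rather than the O(2^k) of enumeration.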
Figure 3b generalizes this approach and merges each layer Ni[0], Ni[1], ... into a single variable Ni. The result is a graph that is visibly simple; in fact, it closely resembles a hidden Markov model (HMM), for which the tree decomposition is trivial using a similar forward-backward algorithm. This graph generalizes the dynamic programming performed in figure 3a so that it can be applied to similar problems by transforming the graph and then performing inference. From the generalised figure 3b, it is straightforward to understand the algorithm described above: in figure 3a, passing through the cell Ni[j] indicates that the partial sum Ni = Σ_{m≤i} Xm = j. For this reason, we can see that propagation from the left accumulates the priors and unique likelihood contributions that would lead to Ni = j.
N_{i+1}[j].fromLeft = Pr(D_1, ..., D_{i+1}, N_{i+1} = j) = Σ_{x_{i+1}} Pr(X_{i+1} = x_{i+1}) · Pr(D_{i+1} | X_{i+1} = x_{i+1}) · Pr(D_1, ..., D_i, N_i = j − x_{i+1})
Likewise, propagation from the right computes the remaining likelihood terms, along with the shared likelihood:

N_i[j].fromRight = Pr(D_{i+1}, ..., D_k, D(S) | N_i = j) = Σ_{x_{i+1}} Pr(X_{i+1} = x_{i+1}) · Pr(D_{i+1} | X_{i+1} = x_{i+1}) · Pr(D_{i+2}, ..., D_k, D(S) | N_{i+1} = j + x_{i+1})

For both equations above, we exploit the fact that Pr(N_{i+1} = n_{i+1} | N_i = n_i, X_{i+1} = x_{i+1}) is 1 if and only if n_{i+1} = n_i + x_{i+1} (and is otherwise 0). For each of the two equations above, this allows us to collapse the nested sum into a single sum over the variable with the smaller domain (if all variables Xi are binary, as in the case of the example in proteomics distinguishing splice variants, then x_{i+1} will have a smaller domain than n_i).
Lastly, we see that the posterior for a protein, which is always proportional to its joint probability with the data, is computed by the forward-backward probability passing through edges labelled Xi = 1:

Pr(Xi = 1 | D) ∝ Pr(D, Xi = 1) = Σ_j Pr(D_1, ..., D_{i−1}, N_{i−1} = j) · Pr(Xi = 1) Pr(Di | Xi = 1) · Pr(D_{i+1}, ..., D_k, D(S) | N_i = j + 1)

which is given by the algorithm.
The advantage of this evaluation method is that the total number of calculations necessary to evaluate the posterior functions is greatly reduced.
Figure 5a shows a comparison of the power set enumeration method and the quadratic dynamic programming method. It shows the times taken for an i3 Intel processor to calculate the posteriors of candidate proteins in the proteomic example as a function of the number of candidates. As can be seen, the time taken increases very rapidly for the power-set enumeration method, but the quadratic dynamic programming method shows a significant improvement.
The quadratic dynamic programming approach reduces the runtime to O(k²). However, in terms of computer memory requirements, this approach uses O(k²) space to store all of the necessary intermediate calculations. However, this time-space trade-off remains favourable compared to the power-set enumeration method, since if the value of k² becomes too large to store in the RAM of a modern computer, k would also be so large that the processing power required to perform the 2^k·k calculations of the power-set enumeration method would result in an inordinately large runtime for power-set enumeration. Nevertheless, the space necessary for the quadratic dynamic programming method is considered to be a limiting factor, and so, like other O(k²) algorithms, it is not considered applicable as-is to very large problems.
Convolution tree

As a further improvement over the dynamic programming method described above, the inventor has developed a convolution tree algorithm to calculate the posteriors of a large number of candidates in greatly reduced time.
The motivation for the convolution tree, shown in figure 4, is similar to that for the dynamic programming, shown in figure 3a. However, where the quadratic dynamic programming algorithm constructs a chain N0 = 0, N1 = N0 + X1, N2 = N1 + X2, ..., Nk = N_{k−1} + Xk, the convolution tree constructs intermediate layers in which the number of nodes is reduced in each layer until the final layer has just one node. The following description relates to the simpler situation where k is a power of 2. However, this is readily generalised, as shall be shown later.
In the convolution tree method, each candidate is represented by a node in a first layer (layer 0) of the tree. In the proteomics example, these candidates would represent the individual proteins that have been identified as possible candidates that could have given rise to the measured data. As with the graph associated with the power-set enumeration method (figure 2), each candidate is associated with a unique data node that represents data evidence that has been measured that can only be associated with one of the candidates; for example, a peptide that is only observed when a specific protein is present in the sample. At the bottom of the tree is the shared data node, D(S), which represents the observed data that is associated with all of the candidates that are represented in the tree. However, unlike in figure 2, the candidate nodes are not directly connected with the shared data node. Instead, intermediate nodes are disposed between the candidate nodes and the shared data node, in a series of layers.
The number of nodes in each layer is reduced as the convolution tree is descended. In the example shown in figure 4, each intermediate node is connected to exactly two nodes (its parent nodes) in the layer above, meaning that the number of nodes in any layer is half of the number of nodes in the layer above. As the tree is descended, the number of nodes in the layers reduces until only one node is in the final layer. This final node is then connected with the shared data node. In this manner, the candidate nodes are not connected directly to the shared data node, but are instead connected via the multiple paths through the convolution tree.
The posteriors of the candidates are calculated in the following manner. It will be understood that references to probability include a probability function or distribution for the respective probabilities under consideration. First, the joint probability of each candidate node and its corresponding unique data node is calculated. This may take into consideration a previously determined probability model. For example, in the model presented in the field of proteomics in Serang et al. 2010 ("Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data", J. Proteome Res. 2010 October 1; 9(10): 5346-5357) and Serang and Noble, 2012 ("A review of statistical methods for protein identification using tandem mass spectrometry", Stat. Interface. 2012; 5(1): 3-20), the joint probability of each candidate node and its corresponding unique data node could be determined by the values of α and β, where α represents the probability that the presence of a certain protein (candidate) would lead to the measurement of its unique peptide (unique data), and β represents the probability that a unique peptide is measured due to noise, when the corresponding protein is not present. Then, at each node of the topmost layer of the intermediate nodes, the joint probability of the node's parent candidates is calculated as a convolution (as described mathematically below), and stored at that node. This procedure of calculating at each node the joint probability of that node's parent nodes is continued at each subsequent layer, until the joint probabilities of every node with the data above it have been calculated and stored.
Just as the joint probabilities of the nodes in the convolution tree with their data above them were calculated from the top down, now the likelihood of the node given the data below, which is the conditional probability of the data below given the node, is computed for every node in the tree from the bottom up.
At each node, a deconvolution is carried out, as described mathematically below, to provide the likelihood of the parent nodes. The likelihood of a parent node given its data below can be calculated by using its child node and the other parent of that child node. The likelihood of the child node (given its data below) is then deconvolved with the joint probability of the other parent node, thus marginalizing out the other parent node and computing the likelihood of the node of interest.
This calculation is carried out for each of the intermediate layers, starting at the bottom layers and working up to the top layer of intermediate nodes. Once the likelihood functions of the topmost layer of the intermediate nodes have been calculated, these likelihood functions can be multiplied by the stored values of the joint probabilities of the unique data nodes and the candidate nodes to calculate the posterior functions for each of the candidate nodes. Therefore, the probability that the candidates were present in the sample can be determined.
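As a worked sketch of the downward and upward passes just described, consider the smallest case of k = 2 candidates; all numbers here are hypothetical and chosen only for illustration:

```python
import numpy as np

# Step 1 (downward): joint probability of each candidate with its unique
# data, Pr(D_i, X_i) = Pr(X_i) * Pr(D_i | X_i), then the adder node
# N = X1 + X2 by convolution.
joint_1 = np.array([0.7 * 0.8, 0.3 * 0.4])
joint_2 = np.array([0.4 * 0.5, 0.6 * 0.9])
joint_N = np.convolve(joint_1, joint_2)        # Pr(D1, D2, N), N in {0,1,2}

lik_shared = np.array([0.1, 0.5, 0.9])         # Pr(D_S | N)

# Step 2 (upward): deconvolve the child's likelihood with the other
# parent's joint probability, marginalizing out the other parent.
def upward_message(lik_child, other_joint):
    m = len(other_joint)
    full = np.convolve(lik_child, other_joint[::-1])
    return full[m - 1 : m - 1 + len(lik_child) - m + 1]

lik_1 = upward_message(lik_shared, joint_2)    # Pr(D_S, D2 | X1)
lik_2 = upward_message(lik_shared, joint_1)    # Pr(D_S, D1 | X2)

# Posteriors: product of stored joint and likelihood from below, normalized.
post_1 = joint_1 * lik_1
post_1 = post_1 / post_1.sum()
post_2 = joint_2 * lik_2
post_2 = post_2 / post_2.sum()
```

The same posteriors can be obtained by enumerating the four joint assignments of (X1, X2); the tree formulation instead reuses the downward convolution at every node, which is what makes the method scale.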
Mathematically, this is achieved by arranging probabilistic adder nodes, sometimes called noisy adder nodes, partitioned into multiple layers: N^i = (N^i_1, N^i_2, ...), where, for example, N^1_1 = X1 + X2, as shown in figure 4a. These adder nodes are used to add two inputs together in which the values of the inputs are not single valued, but represent a distribution. In this manner, every node in the tree from layer 1 to layer log(k) has exactly one child and two parents. Layer 0 is composed of the parent-free variables X1, X2,...Xk, each node Xi in this layer having two children which depend on it: its unique data Di and a noisy adder node in layer one.
Because of this consistent structure after layer 0, the entire tree can always be viewed from a single noisy adder node N as having two sub-trees L and R as parents (with dependent data D(L) and D(R), respectively) and one sub-tree as a child, which is composed of the data below, D, as shown in figure 4b.
D includes the shared data D(S) as well as any other data nodes reachable by a downward pathway through the lower nodes. For instance, the tree as seen from the perspective of node N^1_1 has L = X1, R = X2, D(L) = D1, D(R) = D2, and D = (D3, D4, ..., Dk, D(S)). Because this ternary structure is ubiquitous in the tree, a message passing algorithm can be constructed by using only the sub-tree from figure 4b.
First, the prior probability of a node can be constructed readily using the prior probabilities of its parents:

Pr(N = n) = Σ_{l,r} Pr(L = l) · Pr(R = r) · Pr(N = n | L = l, R = r)

And because N = L + R, this can be condensed to a single summation:

Pr(N = n) = Σ_l Pr(L = l) · Pr(R = n − l)
The joint probabilities for all nodes in layer 0 could be calculated based on the unique data and knowledge of how the unique data and candidates are connected (e.g. through the parameters α and β in the case of the model presented in Serang et al. 2010 and Serang and Noble 2012, although other models could be used), and so it is clear by induction that a joint probability can be computed for every node (the first layer is the base case, and every joint probability in the next layer can be computed given the priors from the previous layer).
However, as written above, computing the joint probability for every node in the tree would be at best a constant speedup over the quadratic dynamic programming approach, because the two parents of the final intermediate node will have state spaces of roughly k/2 and k/2, and all pairs must be combined to form the joint probability on the final intermediate node (enumerating these combinations requires quadratic time). However, we have now defined Pr(N = n) as a convolution (see figure 4b inset), and so it can be computed in m·log(m) time (where m is the state space of the node N being updated). If a probabilistic statement with a single unassigned variable (e.g. Pr(N), Pr(L), or Pr(R)) is considered as a vector, then the prior vector is given by

Pr(N) = Pr(L) * Pr(R),
where * is the vector convolution operator.
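As a quick illustration of this vector form, with made-up priors over two binary parents, the convolution can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical priors over binary parents L and R.
pr_L = np.array([0.7, 0.3])   # Pr(L = 0), Pr(L = 1)
pr_R = np.array([0.4, 0.6])   # Pr(R = 0), Pr(R = 1)

# Pr(N) = Pr(L) * Pr(R): since N = L + R, N takes values 0, 1, 2.
pr_N = np.convolve(pr_L, pr_R)
```

The result is a valid distribution over the three possible sums, obtained without enumerating the four parent combinations explicitly.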
In a first step, the convolution tree algorithm computes the joint probability of each node with the data above it. This can be performed almost identically to the computation of prior probabilities in the example above. Also, like above, the base case is known for all nodes in layer 0 (it is simply the element-wise product between the node's prior and its likelihood from unique data). Thus, in the same manner presented in the example above, the joint probability of the node's state with all data above, Pr(D(L), D(R), N), can be computed using a convolution:

Pr(D(L), D(R), N = n) = Σ_l Pr(D(L), L = l) · Pr(D(R), R = n − l)
In vector form, this is Pr(D(L), D(R), N) = Pr(D(L), L) * Pr(D(R), R). Like the example above where prior probabilities are computed, each node in layer 0 has a known joint probability with the data above. And so, proceeding layer-by-layer, for each node we compute the joint probability of that node with the data above it.
The second step of the convolution tree algorithm is to pass messages upward through the tree after all messages have been propagated downward (i.e. after completion of step 1). Where the first step passes down the joint probability of each node with the data above it, the second step computes the likelihood of each node given all data that can be reached below. For the left parent, we compute Pr(D, D(R) | L) and for the right parent we compute Pr(D, D(L) | R). Note that we only need values proportional to these probabilities, and so they can be normalized before proceeding to the next layer in the tree, for better numeric stability.
These messages can be defined thus:

Pr(D, D(R) | L = l) = Σ_n Pr(D | N = n) · Pr(D(R), R = n − l)

Pr(D, D(L) | R = r) = Σ_n Pr(D | N = n) · Pr(D(L), L = n − r)

Both of these are, of course, deconvolutions, and thus can also be performed as convolutions where one vector is reversed (however, the results are shifted, and the shift must be removed after the final convolution). So the final results can be computed, for example using the notation of the Python computing language, as:

Pr(D, D(R) | L) = (Pr(D | N) * Pr(D(R), R)[::-1]) [len(R) - 1 : len(R) - 1 + len(L)]

Pr(D, D(L) | R) = (Pr(D | N) * Pr(D(L), L)[::-1]) [len(L) - 1 : len(L) - 1 + len(R)]

where len(R) and len(L) denote the number of states for R and L, vector[::-1] reverses the vector (i.e. it reflects it), and v[a:b] = (v_a, v_{a+1}, ..., v_{b−1}) slices the vector, to remove the unwanted shift mentioned earlier.
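The reversed-convolution-and-slice recipe can be checked against the explicit summation. In this sketch, `lik_child` stands for the vector Pr(D | N) and `joint_R` for Pr(D(R), R); the function names are illustrative, not from the patent:

```python
import numpy as np

def message_to_left(lik_child, joint_R):
    # Pr(D, D_R | L = l) = sum_n Pr(D | N = n) Pr(D_R, R = n - l),
    # computed as a convolution with one vector reversed, then sliced.
    len_R = len(joint_R)
    len_L = len(lik_child) - len_R + 1
    full = np.convolve(lik_child, joint_R[::-1])
    return full[len_R - 1 : len_R - 1 + len_L]

def message_explicit(lik_child, joint_R):
    # Explicit-sum reference for the same message.
    len_R = len(joint_R)
    len_L = len(lik_child) - len_R + 1
    out = np.zeros(len_L)
    for l in range(len_L):
        for n in range(len(lik_child)):
            if 0 <= n - l < len_R:
                out[l] += lik_child[n] * joint_R[n - l]
    return out
```

The slice removes exactly the shift introduced by reversing one operand, so both functions agree elementwise.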
As messages were propagated downwards before, these messages can be passed back up the tree to form the likelihood given data below for the left and right parent nodes. That is, for the left parent L, the likelihood given all data below will be initialized by the message passed back, Pr(D^(below N), D^(above R) | L). Now that node L has its likelihood given the data below, it can pass messages up the tree.
This process continues until all messages have been passed up to layer 0, which contains the proteins themselves. The posterior for protein X_i = x_i can be computed by computing the joint probability

Pr(D, X_i = x_i) = Pr(D^(above), X_i = x_i) · Pr(D^(below) | X_i = x_i),

which is proportional to Pr(X_i = x_i | D). Note that when a value is proportional to the posterior probability, then the posterior can be computed by dividing by the sum:

Pr(X_i = x_i | D) = Pr(D, X_i = x_i) / Σ_{x'} Pr(D, X_i = x')
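The final multiply-and-normalise step can be sketched with hypothetical values (none of these numbers come from the patent):

```python
# Hypothetical values for one candidate node X_i with two states (absent, present).
joint_above = [0.3, 0.2]   # Pr(D_above, X_i = x), stored during the downward pass
like_below = [0.5, 2.0]    # proportional to Pr(D_below | X_i = x), from the upward pass

# Element-wise product is proportional to Pr(D, X_i = x); dividing by the sum
# yields the posterior Pr(X_i = x | D).
unnorm = [a * b for a, b in zip(joint_above, like_below)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]
```

Because only proportionality is needed, `like_below` may be rescaled arbitrarily without changing `posterior`.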
Like the quadratic dynamic programming algorithm, the convolution tree algorithm described here is applicable to variables with more than two states. It also does not require that the left and right parents of a noisy adder node have identical state space, although of course a fast Fourier transform-based convolution will be faster when this is the case, because it will not need to pad the shorter vector with zeros.
This convolution tree algorithm shows a vast improvement in the number of calculations needed to evaluate the posteriors of the candidates. Each layer i has k/2^i nodes, and the state space needed to store the result of the convolutions at a node on layer i is ≤ 2^i.

Therefore, the runtime of the fast Fourier transform for a node on layer i is 2^i log(2^i), and the runtime to compute the posteriors can be expressed as:

Σ_{i=1}^{log(k)} (k / 2^i) · 2^i log(2^i) = k Σ_{i=1}^{log(k)} i log(2) ∈ O(k log(k) log(k))

That is, the runtime is O(k log(k) log(k)).

Similarly, it is straightforward to show that the memory needed to store the intermediate calculations is O(k log(k)), since the length of the vectors storing these calculations at each node in layer i is ≤ 2^i, and there are k/2^i such nodes in that layer. Therefore, the total memory requirement is:

Σ_{i=1}^{log(k)} (k / 2^i) · 2^i = k log(k)

Figure 5b shows a comparison of the quadratic dynamic programming method and the convolution tree method. It shows the times taken for an i3 Intel processor to calculate the posteriors of candidate proteins in the proteomic example as a function of the number of candidates. As can be seen, for a large number of candidate proteins, the convolution tree method quickly becomes more efficient at calculating the posteriors.
For the sake of illustration, a summary of the steps for carrying out a specific example of an embodiment of the invention shall now be described. The example relates to the field of proteomics, as discussed earlier, but the discussion also applies to the analysis of other types of sample in a mass spectrometer.
Firstly, a sample is provided containing a mixture of proteins that are to be identified. This sample is prepared by digesting the proteins into a characteristic solution of peptides. These peptides are then analysed in a mass spectrometer as described earlier. The mass spectrometer provides data relating to the analysed sample. In particular, the mass spectrometer will provide spectra of the m/z values of the peptide solution. The data is then passed to a computer having a processor. The computer may be connected to the mass spectrometer directly or via a network. The computer may also be remote from the mass spectrometer, with the data being transferred between them by means of a data storage device (memory device).
The processor analyses the spectra and compares the data with a database of peptide spectra so that the most probable constituent peptides can be identified. The identified peptides will then be classified according to whether they are unique to a single protein, or whether they may be formed from a set of proteins. For the shared peptides, a list is formed of candidate proteins that could have been present in the sample such that the shared peptide was produced. It is the aim of this invention to provide a probabilistic determination of the presence of these candidate proteins.
The posteriors of these candidate proteins are determined using the above described convolution tree method, by forming a hypothetical convolution tree, similar to figure 4a.
First, the joint probability of each candidate protein and its unique peptides is determined and stored in the candidate node, X_i. These joint probabilities could be determined from prior knowledge of the candidate proteins, and how each splits into its unique peptides.
Once all of the joint probabilities for the candidate nodes have been calculated and stored, the joint probabilities of the next layer of the convolution tree are calculated. Each node in the next layer is connected to two parent nodes in the layer of candidate nodes (layer 0). The joint probability of the two parent nodes is calculated as a convolution of the two probability distributions and the result stored at that node. Again, this is repeated for each node in this layer.
Exactly the same method of forming a convolution of the two parent nodes and storing the result is carried out on each node of the tree until the joint probabilities have been calculated and stored for each node of the convolution tree.
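The layer-by-layer construction just described can be sketched as follows (a simplified model assuming a power-of-two number of candidate nodes, each represented by a plain probability list; the function names are illustrative):

```python
def convolve(a, b):
    # out[n] = sum over i + j == n of a[i] * b[j]
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def forward_pass(layer0):
    # layer0: one probability vector per candidate node; len(layer0) a power of two.
    # Each new layer convolves adjacent pairs; layers[-1][0] is the final node.
    layers = [layer0]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append([convolve(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return layers

# Four hypothetical candidates, each present with probability 0.5:
layers = forward_pass([[0.5, 0.5]] * 4)
# layers[-1][0][n] is then the probability that exactly n candidates are present.
```

With these uniform priors, the final node holds the binomial distribution over the count of present candidates.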
The bottom layer of the convolution tree comprises only one node and stores the joint probability of all the nodes in the convolution tree. From this joint probability and the shared peptide data, the likelihood of the bottom node is calculated. The likelihoods of the bottom node's two parent nodes are calculated from a deconvolution of the likelihood of the bottom node, as mathematically described above.
The likelihoods of the nodes in the next layer above are calculated using the same deconvolution method to provide the likelihood of each node's parents.
This is continued at each layer up the tree until the likelihoods of the candidate nodes have been computed.
Finally, the posteriors of the candidate proteins can be calculated by multiplying the likelihoods that have been calculated with the previously stored joint probabilities of the corresponding candidate nodes.
In this manner, full use is made of the spectroscopic data, including measurement of the unique peptides and any shared peptides, to determine the probabilities of the presence of each of the candidate proteins.
Once the posteriors of the candidate proteins have been calculated, these can be outputted by the computer. The output could be to a display unit, for example a monitor, a printer, some kind of storage device (memory), or any other suitable output device.
Figures 5a and 5b show a runtime comparison between the three algorithms using a more general model, which is not restricted to noisy-or nodes and instead uses probabilistic adder nodes. Figure 5a compares power-set enumeration with the quadratic dynamic programming method on smaller problems (k ∈ {2, 4, 6, 8, ..., 20}). Figure 5b compares quadratic dynamic programming with the convolution tree method on larger problems (k ∈ {32, 64, 128, 256, ..., 2048}). Note that both axes are log-scaled, and so a growing gap between the two series represents a super-linear speedup in the runtime. On larger problems (e.g. 4096 proteins), the quadratic dynamic programming runs out of memory on a 4GB computer.
Both the quadratic dynamic programming and the convolution tree have runtimes far superior to power-set enumeration. But moreover, the convolution tree offers scalability to substantially larger problems than the quadratic dynamic programming approach. For example, computing exact posteriors for 32768 proteins takes only 28.03 seconds, while the quadratic dynamic programming cannot even be run.
As stated earlier, the above method assumes that the number of candidates, k, is a power of two. However, the method can be readily extended when k is not a power of two by adding variables X_{k+1}, X_{k+2}, ... with 0 prior probability of being present and with no unique peptides, until the total number of candidates in the set X is a power of two. Using a 0 prior is important for this approach, because it prevents |X|, and subsequently Pr(D | X), from being altered by including these dummy variables. The addition of these extra variables does not change the overall order of the number of calculations.
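A minimal sketch of this padding step (the list-of-priors representation and function name are assumptions for illustration):

```python
def pad_to_power_of_two(priors):
    # priors: one [Pr(absent), Pr(present)] vector per candidate.
    # Dummy candidates get prior [1.0, 0.0] (certainly absent), so they change
    # neither the distribution of the sum N nor Pr(D | X).
    target = 1
    while target < len(priors):
        target *= 2
    return priors + [[1.0, 0.0]] * (target - len(priors))
```

Convolving any vector with [1.0, 0.0] reproduces its values (with one trailing zero), which is why the dummy variables are harmless to the sum.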
Rather than computing posteriors where D depends on N and N = X_1 + X_2 + ... + X_k, the convolution tree method can be generalised to compute posteriors when N = s_1 X_1 + s_2 X_2 + ... + s_k X_k for fixed integer scaling factors s_1, s_2, ..., s_k. It is first observed (without loss of generality) that for any positive integer s_i, scaling X_i by s_i creates a new vector X'_i with s_i - 1 zeros padded between every entry of X_i:

Pr(X'_i = j) = Pr(X_i = j / s_i) if j / s_i is an integer, and 0 else.

It is second observed that for any two nodes L and R, the subtraction L - R can be accomplished by reversing R before convolution with L (and recording the fact that the zero index of the resulting array no longer refers to N = 0, but instead refers to the minimum value achievable by the subtraction). For this reason, scaling X_i by a negative integer s_i can be performed by reversing the vector X_i and then padding with zeros as mentioned above, and then adding normally with the convolution tree (again, in this case, each node in the convolution tree would also keep track of the minimum integer summation value corresponding to the zero index). For completeness, when s_i = 0 (and is thus neither positive nor negative), the scaled vector X'_i = [1.0], indicating a 100% probability that X'_i is zero; however, if the fixed value s_i is known to be zero ahead of time, then there is effectively no edge connecting X_i to the scaled summation N, and that input can be ignored with no consequence.
Thus scaling X_i by s_i can be accomplished by permuting the indices of X_i (into a potentially larger result vector). The vector X_i is first reversed if s_i < 0, and is then padded with zeros as described above.
For straightforward implementation, single-input single-output scaling nodes (with input X_i, output X'_i and a fixed parameter s_i) can be used to first transform any X_i into X'_i, which is then fed into a convolution tree node with N = X'_1 + X'_2 + ... + X'_k. Thus we can model an integer-scaled sum without any modification to the convolution tree data structure. Messages passed backward through these scaling nodes simply undo the deterministic permutation of the indices mapping X_i to X'_i.
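A sketch of such a scaling node (returning the vector together with the minimum-value offset is one possible design, assumed here for illustration rather than prescribed by the text):

```python
def scale_node(p, s):
    # p[x] = Pr(X = x) for x = 0 .. len(p)-1.  Returns (p_scaled, min_value),
    # where index j of p_scaled corresponds to the integer value min_value + j.
    if s == 0:
        return [1.0], 0                 # s*X is certainly zero
    q = p[::-1] if s < 0 else p         # negative scaling: reverse first
    out = [0.0] * ((len(p) - 1) * abs(s) + 1)
    for i, v in enumerate(q):
        out[i * abs(s)] = v             # pad |s| - 1 zeros between entries
    min_value = s * (len(p) - 1) if s < 0 else 0
    return out, min_value
```

For example, with p = [0.2, 0.8] and s = -2, the result places probability 0.8 at value -2 (index 0) and 0.2 at value 0 (index 2).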
In the previously described application to proteomics, the convolution tree could make it feasible to query protein databases with much greater sequence similarity than is currently possible, due to the large number of shared dependencies introduced. Moreover, the convolution tree can be used to iteratively perform protein inference and model peptide detectability, because it can offer substantially better runtimes on large data sets. The convolution tree can also be used to efficiently place arbitrary categorical priors on the number of present variables or on the sum of variables. Without this advance, such priors would not be considered because they are too inefficient for large data sets: by creating a dependency between all proteins, such a prior would render factorization impossible. Without factorization, even a runtime quadratic in the number of variables (e.g. using the quadratic dynamic programming approach) could potentially become the factor limiting efficiency (not to mention the limitations of the quadratic space requirement). A sub-quadratic method with low space complexity could be used to bring the applicability of such priors to many graphical inference problems.
It should be noted that the convolution tree method can readily be applied when including node-specific data, D^(i), which depends only on the node N_i in the tree (as long as the resulting graph is still a tree). The modified method would multiply (element-wise) the likelihood given data below, Pr(D^(below) | N_i), by the unique likelihood Pr(D^(i) | N_i) when passing messages up, and multiply (element-wise) the joint probability with data above, Pr(D^(above), N_i), by the unique likelihood Pr(D^(i) | N_i) when passing messages down. This allows nearly identical runtime (point-wise multiplication is cheaper than convolution, which is already performed by the algorithm).
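This element-wise fold of node-specific data into a message can be sketched as follows (all values are hypothetical):

```python
# Joint probability with data above, before including the node-specific data:
joint_above = [0.4, 0.6]    # Pr(D_above, N_i = n)
unique_like = [0.9, 0.1]    # Pr(D^(i) | N_i = n), the node-specific likelihood

# Element-wise product folds the node-specific data into the downward message;
# the upward message would be treated the same way.
msg_down = [a * u for a, u in zip(joint_above, unique_like)]
```

This is a single pass over the state space, so it does not change the overall asymptotic runtime.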
On graphs where data is shared in a manner such that it is cascaded (i.e. D_(1,2) depends on the sum X_1 + X_2, and D_(1,2,3) depends on the sum X_1 + X_2 + X_3, and so forth), the sums can be arranged using a greedy algorithm.
One such simple greedy algorithm would cascade proper subsets like N_(1,2) = X_1 + X_2 into a superset N_(1,2,3) = X_1 + X_2 + X_3 = N_(1,2) + X_3. An example of a more general approach would find shared computations (e.g. N_(1,2,3) = X_1 + X_2 + X_3 and N_(2,3,4) = X_2 + X_3 + X_4 could factor out the shared intermediate computation N_(2,3) = X_2 + X_3, to compute N_(1,2,3) = X_1 + N_(2,3) and N_(2,3,4) = N_(2,3) + X_4, so that a probabilistic adder node N_(1,2) has predecessors X_1, X_2 and then a second probabilistic adder node N_(1,2,3) has predecessors N_(1,2), X_3, as shown in figure 6b).
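The factoring of a shared intermediate sum can be sketched as follows (the uniform priors and variable names are hypothetical):

```python
def convolve(a, b):
    # out[n] = sum over i + j == n of a[i] * b[j]
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

x1 = x2 = x3 = x4 = [0.5, 0.5]     # hypothetical binary candidates

n23 = convolve(x2, x3)             # shared intermediate sum N_(2,3)
n123 = convolve(x1, n23)           # N_(1,2,3) reuses N_(2,3)
n234 = convolve(n23, x4)           # N_(2,3,4) reuses N_(2,3) again
```

The intermediate `n23` is computed once and reused, which is the saving the greedy factoring aims for.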
Thus, cascading makes it possible to use the convolution tree even when the shared data D^(s) do not have identical predecessors, unlike the example shown in figure 4a. Cascading probabilistic adder nodes allows inference in the same runtime when the cascaded nodes form a tree.
Furthermore, when the graph contains loops (i.e. when the cascaded graph does not form a tree), variables can be merged into the convolution tree as larger nodes composed of the joint outcomes of the variables contained.
Essentially, what this demonstrates is that the convolution tree can be performed in conjunction with tree decomposition. When inference is performed, the full joint conditional probability table would not be generated for any clique node in the junction tree containing only a probabilistic adder node and all of the probabilistic adder node's predecessors. The edges connecting these pure probabilistic adder clique nodes to other clique nodes in the junction tree would also need to carry messages of a single variable only, because the convolution tree does not allow arbitrary joint distributions of its inputs as messages to be passed in. These joint distributions would first be merged into a single intermediate probabilistic adder node, and that probabilistic adder node would be cascaded and fed into the larger probabilistic adder node. Messages passed out of a pure probabilistic adder node would be computed using the convolution tree, and would simply pass the likelihood of all data preceding the edge along which the message is passed (i.e. all data found by moving backward against the direction of message passing).
The convolution tree method can also be applied to other areas of analysing mass spectra. One example is the demixing problem (sometimes called a "deconvolution problem") in which the relative abundances of compounds within a sample can be inferred from the peaks in the data. In the preceding embodiments, the value x_i was a binary variable indicating the presence (x_i = 1) or absence (x_i = 0) of a candidate (e.g. a protein), X_i. However, in the demixing problem, the value of x_i indicates the abundance of a candidate relative to the other candidates. In this case, the value of x_i takes on a range of values representing the range of possible values of the abundances.
The values of the abundances are discretised by dividing the possible range into discrete bins. As an example, x_i may be divided into the following bins: [0.0-0.1], [0.1-0.2], ..., [0.9-1.0]. Figure 7a depicts a classic example of spectra from mass spectrometry.
In the example, four compounds of unknown relative abundances contribute to a chimeric spectrum (a spectrum arising from several different compounds). An observed chimeric spectrum with data D is composed of a linear combination of four different spectra from the compounds with unknown relative abundances W, X, Y, Z. These values, W, X, Y and Z, play an equivalent role to the values of the candidates, X_i, in earlier embodiments, but with the values associated with W, X, Y and Z having a discretised distribution, rather than being binary. Of course, the method can be extended to more than four compounds.
In figure 7a, three peaks associated with m/z values that can arise from multiple compounds are indicated by solid, dashed and dotted rectangles. Other peaks in the spectra are unique to a specific compound. In this manner, the situation is analogous to earlier embodiments, in that there are shared data (the common peaks), and unique data (the peaks associated with only one compound). Therefore, a convolution tree can be created, in a similar manner as previous embodiments, as shown in figure 7b. However, rather than having candidate nodes X_i representing the presence of a candidate and which can take only binary values, there are candidate nodes W, X, Y and Z representing the abundances of a set of compounds. As in previous embodiments, probabilistic adder nodes can be cascaded to compute posterior probability distributions on the relative abundances of each compound without jointly enumerating the four-dimensional space of all possible relative abundances.
The graph in Figure 7b shows a cascade of convolution trees that describes the data shown in Figure 7a. This graph could even be augmented with priors based on previous knowledge of the compounds present. As explained earlier, the range of the variables representing the abundances of the compounds can be discretised using bins. However, a simplification would be to use a threshold value to categorise the peaks in the chimeric spectrum into two categories ("intense" and "not intense") and then perform inference using a convolution tree whose variables are binary, similar to the formulation for the earlier embodiments. Regardless of whether binary variables or binned continuous variables are used, an arbitrary likelihood model could then be used to evaluate the match between the observed peak (observed from the actual data) and the abundance variable (W, X, Y or Z) for that peak (note that these likelihood functions can be peak-specific). Conditional probabilities individually treat each intensity as proportional to the abundance of the compound that produces it. Data unique to each compound are represented by the unique data nodes labelled D_W, D_X, D_Y, D_Z (equivalent to the unique data nodes, D_i, in the previous embodiments), and are conditionally independent given W, X, Y, Z. In figure 7b, three shared data nodes are shown using a solid, dashed and dotted line, respectively, corresponding to the peaks that can arise from multiple compounds. Probabilistic adder nodes are cascaded to build a convolution tree for probabilistic inference, in the same manner as for earlier embodiments, enabling the computation of a posterior distribution for the relative abundance of each compound.
The previously described method and mathematical algorithm can be applied with the only difference that the candidate nodes no longer represent the presence or absence of the candidates, but instead represent the discretised abundances of the given compounds.
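The discretisation of an abundance into bins could be sketched as follows (the bin count of ten follows the example above; the function name is an illustrative assumption):

```python
def abundance_bin(x, n_bins=10):
    # Maps an abundance x in [0, 1] to a bin index:
    # [0.0-0.1] -> 0, [0.1-0.2] -> 1, ..., [0.9-1.0] -> 9.
    return min(int(x * n_bins), n_bins - 1)   # keep x == 1.0 in the top bin
```

Each candidate node then holds one probability per bin instead of two probabilities for presence and absence.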
Aspects of the invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Some or all aspects of the invention can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Some or all of the method steps of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output.
Method steps can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.

Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the invention can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well.
The above disclosed subject matter is to be considered illustrative of the general principle of the invention and not restrictive on the scope of the invention.
The scope of the present invention is limited only by the appended claims which are intended to cover all such modifications, enhancements, and other embodiments of the invention.

Claims (20)

CLAIMS: 1. A method of analysing mass spectrometer data from a sample to determine one or more sample compounds therefrom, comprising: receiving mass spectrometer data, the data comprising shared data which are associated with each one of a plurality of candidates for the one or more sample compounds, and unique data each of which is associated with only one respective candidate from the plurality of candidates; determining the plurality of candidates from the mass spectrometer data; and forming a convolution tree based on the plurality of candidates and using the unique data and the shared data with the convolution tree to calculate a respective candidate probability for each of the plurality of candidates.
  2. A method according to claim 1 wherein the convolution tree comprises connected nodes in a layer zero, a plurality of intermediate layers, and a final layer, the layer zero comprising a candidate node for each respective candidate of the plurality of candidates, the plurality of intermediate layers each comprising a respective plurality of intermediate nodes, and the final layer comprising a single, final node, wherein pairs of the candidate nodes are connected to a respective intermediate node in a first intermediate layer, pairs of the intermediate nodes from the first intermediate layer are connected to a respective intermediate node in a second intermediate layer, and so on until a single pair of intermediate nodes is reached in a final intermediate layer, the single pair of intermediate nodes being connected to the final node in the final layer.
  3. A method according to any preceding claim wherein calculating a respective candidate probability comprises determining an initial probability for each of the candidates based on the unique data for that candidate and associating the initial probability with the respective candidate node for that candidate.
  4. A method according to claim 2 or claim 3 when dependent on claim 2, wherein the initial probabilities for pairs of candidate nodes are convolved to provide an effective joint probability for each respective intermediate node of the first intermediate layer, and the effective joint probabilities of respective pairs of intermediate nodes from the first intermediate layer are convolved to provide an effective joint probability for each respective intermediate node of the second intermediate layer, and so on until the effective joint probabilities of the single pair of intermediate nodes in the final intermediate layer are convolved to provide an effective joint probability for the final node in the final layer.
  5. A method according to any of claims 2 to 4, wherein calculating a respective candidate probability comprises determining at the final node a likelihood of the final node based on the shared data.
  6. A method according to claim 5, wherein calculating a respective candidate probability comprises calculating a likelihood of each intermediate node and candidate node, by a deconvolution of the likelihood of the connected node in the layer below with the effective joint or initial probability of the respective intermediate node or candidate node, starting from the final layer and proceeding upwards to the layer zero until the likelihoods of the candidate nodes have been calculated.
  7. A method according to claim 6, wherein calculating a respective candidate probability comprises calculating the product of the likelihood and initial probability of each of the candidate nodes.
  8. A method according to any of claims 2 to 7, wherein each intermediate node in each intermediate layer is connected to only three other nodes, two nodes from the layer immediately above and one node from the layer immediately below.
  9. A method according to any of claims 2 to 8, wherein one or more of the nodes is connected to a further node in the layer below, the further node not comprising part of the convolution tree, but a further convolution tree being formed which comprises the further node.
  10. A method according to any of claims 1 to 9 wherein determining the one or more sample compounds comprises determining the respective presence of the one or more sample compounds based on the calculated respective candidate probability for each of the plurality of candidates.
  11. A method according to any of claims 1 to 9, wherein determining the one or more sample compounds comprises determining a respective abundance of the one or more sample compounds based on the calculated respective candidate probability for each of the plurality of candidates and peak intensities derived from the mass spectrometer data, wherein the shared data comprises a set of peak intensities which are common to two or more of the candidates and the unique data comprises a respective peak intensity which results from only one respective candidate.
  12. A method according to any of claims 1 to 10, wherein the candidates are proteins in the sample and the unique and shared data are derived from the mass spectrometry data of component peptides.
  13. A method according to claim 12, wherein the shared data represent peptides that could belong to more than one candidate.
  14. A method according to any of claims 1 to 9 and 11, wherein the shared data comprise mass spectral peaks common to a plurality of candidates.
  15. A method of determining a set of compounds within a sample, the method comprising the steps of: determining a set of components, D, based on experimental data; determining a set, X = X_1, X_2, ..., X_k, of compounds from which the determined components could have originated; calculating the probabilities for each of X_1, X_2, ..., X_k using Bayesian inference based on the determined components, D = D^(s), D_i, where D_i = D_1, D_2, ..., D_k is a set of components, each of which is uniquely identified with a respective one of the compounds, X_1, X_2, ..., X_k, and D^(s) is a set of components that are common to all of X_1, X_2, ..., X_k, based on a convolution tree, the convolution tree comprising a layer 0 of nodes attributed to each of the initial states X_1, X_2, ..., X_k, with each node in layer 0 being associated with a respective one of the components, D_1, D_2, ..., D_k, the convolution tree further comprising a layer 1 formed by pairing each of the nodes from layer 0 with another node from layer 0 into a respective node in layer 1, and subsequent layers formed by similar pairing of the nodes of the layer immediately above, until the final layer comprises a single node associated with the shared components, D^(s).
  16. A method according to claim 15, wherein the nodes have two sub-trees L and R as parents and one sub-tree as a child, D(N).
  17. A method according to claim 16, wherein the joint probability of each node N having a value n is calculated from the joint probability of its parents, L having a value l and R having a value r, according to:

Pr(D^(above), N = n) = Σ_{l, r : l + r = n} Pr(D^(above(L)), L = l) · Pr(D^(above(R)), R = r)
  18. A method according to claim 17, wherein messages are passed upwards through the convolution tree, wherein each node passes Pr(D^(below N), D^(above R) | L) to its left parent and Pr(D^(below N), D^(above L) | R) to its right parent, the messages being defined by:

Pr(D^(below N), D^(above R) | L = l) = Σ_r Pr(D^(below N) | N = l + r) · Pr(D^(above(R)), R = r)
Pr(D^(below N), D^(above L) | R = r) = Σ_l Pr(D^(below N) | N = l + r) · Pr(D^(above(L)), L = l)
  19. A method according to claim 18, wherein a posterior for a node N = n is calculated by multiplying the joint probability stored at the node with the likelihood calculated from its child below according to:

Pr(D, N = n) = Pr(D^(above), N = n) · Pr(D^(below) | N = n)
  20. A method according to claim 19, wherein a posterior probability of the candidates Pr(X_i = x_i | D) is proportional to the joint probability Pr(D, X_i = x_i) and can be computed by normalisation so that the summation over all x_i is 1, according to:

Pr(X_i = x_i | D) = Pr(D, X_i = x_i) / Σ_{x'_i} Pr(D, X_i = x'_i)

  21. A method according to any of claims 15 to 20, wherein determining the set of compounds comprises determining the respective presence of the one or more compounds based on the calculated respective candidate probabilities for each of X_1, X_2, ..., X_k.

  22. A method according to any of claims 15 to 20, wherein determining the set of compounds comprises determining a respective abundance of the one or more compounds based on the calculated respective probabilities for each of X_1, X_2, ..., X_k and peak intensities derived from the mass spectrometer data, wherein D^(s) comprises a set of peak intensities which are common to two or more of the set, X = X_1, X_2, ..., X_k, and the D_i comprises a respective peak intensity which results from only one respective member of the set, X = X_1, X_2, ..., X_k.

  23. A method according to any of claims 15 to 21, wherein the set, X = X_1, X_2, ..., X_k, are proteins in the sample and D_i and D^(s) are derived from the mass spectrometry data of component peptides.

  24. A method according to any preceding claim, wherein the resulting computational complexity is less than or proportional to n log(n) log(n) where n is the number of candidates.

  25.
A system for analysing a sample and providing posterior probabilities of compounds within the sample, the system comprising: a mass spectrometer arranged to scan a sample and provide data relating to components of the compounds within the sample; a processor in communication with the mass spectrometer arranged to receive the data from the mass spectrometer; and memory in communication with the processor, in which instructions for carrying out the method of any preceding claim are stored, wherein the processor is arranged to perform the instructions stored in the memory on the data provided by the mass spectrometer to provide posterior probabilities of components within the sample.
GB1321149.5A 2013-11-29 2013-11-29 Method of mass spectral data analysis Withdrawn GB2520758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1321149.5A GB2520758A (en) 2013-11-29 2013-11-29 Method of mass spectral data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1321149.5A GB2520758A (en) 2013-11-29 2013-11-29 Method of mass spectral data analysis

Publications (2)

Publication Number Publication Date
GB201321149D0 GB201321149D0 (en) 2014-01-15
GB2520758A true GB2520758A (en) 2015-06-03

Family

ID=49979584

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1321149.5A Withdrawn GB2520758A (en) 2013-11-29 2013-11-29 Method of mass spectral data analysis

Country Status (1)

Country Link
GB (1) GB2520758A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069872A (en) * 2019-04-28 2019-07-30 电子科技大学 Aero-turbine rotor-support-foundation system analysis method for reliability based on joint tree
EP3155543B1 (en) * 2014-06-13 2024-08-28 Thermo Fisher Scientific (Bremen) GmbH Data processing device and method for the evaluation of mass spectrometry data

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113758989B (en) * 2021-08-26 2023-11-28 清华大学深圳国际研究生院 Method for identifying on-site mass spectrum target object and predicting derivative based on fragment tree
CN113963225B (en) * 2021-12-23 2022-04-26 季华实验室 Target type determination method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Annals of Applied Statistics, volume 4, number 2, 3 August 2010, Qunhua Li et al., "A nested mixture model for protein identification using mass spectrometry", pages 962-987. Available from: http://dx.doi.org/10.1214/09-AOAS316 *
BMC Bioinformatics, volume 13 (supplement 16), 5 November 2012, Yong Fuga Li et al., "Computational approaches to protein inference in shotgun proteomics", pages 1-19. Available from: http://dx.doi.org/10.1186/1471-2105-13-S16-S4 *
IEEE/ACM Transactions on Computational Biology and Bioinformatics, volume 9, 2012, Oliver Serang et al., "Faster Mass Spectrometry-Based Protein Inference: Junction Trees Are More Efficient than Sampling and Marginalization by Enumeration", pages 809-817. Available from: http://dx.doi.org/10.1109/TCBB.2012.26 *
Statistics and Its Interface, volume 5, number 1, 2012, Oliver Serang et al., "A review of statistical methods for protein identification using tandem mass spectrometry", pages 3-20. Available from: http://dx.doi.org/10.4310/SII.2012.v5.n1.a2 *


Also Published As

Publication number Publication date
GB201321149D0 (en) 2014-01-15

Similar Documents

Publication Publication Date Title
US8012764B2 (en) Mass spectrometer
EP2641260B1 (en) Controlling hydrogen-deuterium exchange on a spectrum by spectrum basis
JP4848454B2 (en) Mass spectrometer
US8604421B2 (en) Method and system of identifying a sample by analyising a mass spectrum by the use of a bayesian inference technique
US20140138535A1 (en) Interpreting Multiplexed Tandem Mass Spectra Using Local Spectral Libraries
KR101936838B1 (en) System, devices, and methods for sample analysis using mass spectrometry
GB2520758A (en) Method of mass spectral data analysis
EP3803936B1 (en) Identification of chemical structures
EP4102509A1 (en) Method and apparatus for identifying molecular species in a mass spectrum
JP2007121134A (en) Tandem mass analyzing system
JP2007535672A (en) Mass spectrometer
CN109964300B (en) System and method for real-time isotope identification
US20240071742A1 (en) Two frequency ion trap performance
US10784093B1 (en) Chunking algorithm for processing long scan data from a sequence of mass spectrometry ion images
JP2018119897A (en) Substance identification method using mass analysis and mass analysis data processing device
JP2022552372A (en) Threshold-Based IDA Exclusion List
Floris et al. Fundamentals of two dimensional Fourier transform mass spectrometry
WO2023233327A1 (en) Methods, mediums, and systems for targeted isotope clustering

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)