GB2392275A - Estimating a property of a structure in data employing a prior probability distribution - Google Patents

Estimating a property of a structure in data employing a prior probability distribution

Info

Publication number
GB2392275A
GB2392275A (application GB0317527A)
Authority
GB
United Kingdom
Prior art keywords
approximation
distribution
posterior
expectations
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0317527A
Other versions
GB0317527D0 (en)
Inventor
Neil Lawrence
Marta Milo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CAMBRIDGE BIOINFORMATICS Ltd
Original Assignee
CAMBRIDGE BIOINFORMATICS Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CAMBRIDGE BIOINFORMATICS Ltd filed Critical CAMBRIDGE BIOINFORMATICS Ltd
Publication of GB0317527D0
Publication of GB2392275A
Legal status: Withdrawn

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/70 Determining position or orientation of objects or cameras
                    • G06T 7/0002 Inspection of images, e.g. flaw detection
                        • G06T 7/0012 Biomedical image inspection
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10056 Microscopic image
                        • G06T 2207/10064 Fluorescence image
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20092 Interactive image processing based on input by user
                            • G06T 2207/20104 Interactive definition of region of interest [ROI]
                    • G06T 2207/30 Subject of image; Context of image processing
                        • G06T 2207/30004 Biomedical image processing
                            • G06T 2207/30072 Microarray; Biochip, DNA array; Well plate
        • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
            • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
                • G16B 25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
                • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

A method of estimating a property of a structure in data comprises the steps of establishing a prior probability distribution, selecting values which characterise at least part of an initial approximation to a posterior distribution, and updating a factor of the approximation in accordance with the other approximation factors and a likelihood function. The updating is repeated for all approximation factors. An embodiment embraces microarray DNA data which is captured by image analysis (4). Estimates are made of spot radius and the location of a spot centre (6). Expectation values of the posterior distribution are initialised at (8). The posterior approximation is updated by parameterising proposal distributions (10) and obtaining samples from the proposal distributions (12). An importance weight is determined (14); this is repeated for all samples. The importance weights are normalised (16) and used to compute estimates of expectation values (18), and subsequent estimates (20). A convergence criterion is tested at (22) which, if met, causes the spot intensities and corresponding expression levels (24) to be determined. Finally, the variance of the intensity values is calculated (26), which enables error bars to be determined.

Description

Title: Estimating properties of structures in data

Field of the Invention
The present invention relates to the field of computerized estimation of properties of
structures in data. The technique is particularly applicable to the estimation of expression levels from pixelated images of microarray spots.
Background to the Invention
The limitations of the prior art will now be illustrated with an example from the field of
microarray analysis, although the principles apply more generally to the estimation of properties of structures in data.
The basis of DNA microarray technology is the construction of high density arrays of DNA molecules on glass slides that are hybridised with fluorescently-labelled mRNA-
derived complex targets (Eisen & Brown, 1999; Schena, 2000). The pattern of hybridization to the elements is visualised by fluorescent imaging and the raw microarray data images are subsequently transformed into gene expression matrices (Brazma & Vilo, 2000). Several different software packages are available to determine the position of each element by fitting grids over the image to extract the level of the sample's expression (Yang et al., 2000).
The recent increase in microarray experiments has led the academic community to an international effort to adopt standards for DNA-array experiment annotations and data representation'. The Minimum Information About a Microarray Experiment (MIAME) (Brazma et al., 2001) standard, for example, aims to establish gene expression data repositories that will allow consistent comparison of the data collected from different sources. It is therefore imperative that the processing of microarray image data by
different researchers leads to consistent results. The experimental results below show that this is not the case; we have found that even when two researchers share the same software package to process the same images, results are not consistent.
Comparative experiments (exemplified below) show the inconsistencies between results achieved by individual researchers using standard software packages which rely on a user supplying information. Examples of the information typically supplied by a user include the layout of the spots, in the form of a grid, and the size and shape of each spot.
We have found that variation in the image analysis by different researchers leads to large discrepancies in the measured gene expression level values obtained from a single array.
Other workers who have investigated the results of carrying out the same experiment on different slides (Hughes et al, 2000) discovered variations of a similar magnitude but assumed that they resulted from biological noise.
We propose that a substantial source of variation in measured microarray spot intensities results from errors in grid placement. Thus, the aim of an embodiment of the present invention is to provide more reproducible analysis of microarray data images. Preferred embodiments also provide a measure of the uncertainty in spot intensity, allowing error bars to be calculated on the intensities and on functions of the intensities (such as the log ratio of intensities). This may also lead to more accurate calculations.
Bayesian inference (Bernardo et al., 1980; Cox, 1946) is a mathematical way of formulating the process of gaining information. We have considered the application of Bayesian inference to microarray image processing and developed a new Bayesian technique applicable to estimating properties of structures in any data.
The Bayesian approach involves encapsulating the information which is initially available about the parameters, θ, of a structure in a prior distribution p(θ). (Examples of parameters of a structure relevant to the example of microarray analysis are the size, shape
and location of a microarray spot.) A model of the observed data (X) given the parameters of a structure is also constructed, p(X|θ) (this is referred to as a likelihood function).
Bayesian inference uses Bayes's rule to determine the posterior distribution:

p(θ|X) ∝ p(X|θ) p(θ).

The posterior distribution represents belief about the structure's parameters given the image. The difficulty in determining the posterior distribution normally lies in evaluation of the constant of proportionality

Z = ∫ p(X|θ) p(θ) dθ.

For many likelihood functions and prior distributions, this constant of proportionality is intractable. Researchers have developed a number of approximate methods for obtaining the posterior distribution from the prior distribution when this constant of proportionality is intractable.
One known approximate technique for finding the posterior distribution is importance sampling. This is one of a number of known sampling techniques (reviewed by MacKay, 1998).
In importance sampling, the aim is to approximate an expectation, ⟨φ(x)⟩_{p(x)}, over a distribution, p(x) = p*(x)/Z, which is known only up to a constant Z. A set of samples {x_s} is obtained from a so-called proposal distribution, R(x). The expectation can be estimated as:

⟨φ(x)⟩_{p(x)} ≈ Σ_{s=1}^{N} w_s φ(x_s),

where the parameters w_s, known as importance weights, are given by

w_s = [p*(x_s)/R(x_s)] / Σ_{s'=1}^{N} [p*(x_{s'})/R(x_{s'})].

In importance sampling, it is important that the proposal distribution is similar to the distribution of interest if the method is to be effective. In the context of Bayesian inference, a candidate for the proposal distribution is the prior: if the prior belief is accurate, it might be hoped that the prior is similar to the posterior. This is the approach taken by, for example, particle filters (Doucet et al., 2001).
In this case, the unnormalised importance weights are simply given by the likelihood of each sample. This means that if the prior distribution is not well matched to the posterior, the importance sampler's estimate will be dominated by a small number of samples with large weights. Hence the accuracy and applicability of this form of importance sampling
is limited where the prior distribution is not well matched to the posterior. This becomes particularly relevant when sampling a distribution over many parameters, as in some cases only a small fraction of samples will have substantial likelihoods.
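By way of illustration only, the importance-sampling estimator above can be written as a few lines of Python. This is a minimal sketch: the unnormalised target density, the Gaussian proposal and the test function φ used here are illustrative stand-ins, not anything prescribed by the invention.

```python
import numpy as np

def importance_expectation(phi, log_p_star, proposal_sample, proposal_logpdf, n_samples=1000):
    """Estimate <phi(x)> under p(x) = p*(x)/Z using samples from a proposal R(x)."""
    xs = proposal_sample(n_samples)                   # draw x_s ~ R(x)
    log_w = log_p_star(xs) - proposal_logpdf(xs)      # unnormalised log importance weights
    log_w -= log_w.max()                              # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                                      # normalised importance weights
    return np.sum(w * phi(xs)), w

# Illustrative use: target is an unnormalised N(2, 1), proposal is N(0, 3^2).
rng = np.random.default_rng(0)
est, w = importance_expectation(
    phi=lambda x: x,
    log_p_star=lambda x: -0.5 * (x - 2.0) ** 2,
    proposal_sample=lambda n: rng.normal(0.0, 3.0, size=n),
    proposal_logpdf=lambda x: -0.5 * (x / 3.0) ** 2 - np.log(3.0 * np.sqrt(2 * np.pi)),
    n_samples=5000,
)
print(est)   # close to 2, the mean of the target
```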
Another approximate technique for finding the posterior distribution is variational inference. In variational inference (Jordan et al., 1998; Lawrence, 2000) an intractable posterior, p(θ|X), is approximated with an alternative tractable distribution, q(θ). The approximation is found by minimising the Kullback-Leibler (KL) divergence between the two distributions:

KL(q(θ) || p(θ|X)) = −∫ q(θ) ln [p(θ|X) / q(θ)] dθ.

If no constraints are imposed upon the form of q(θ), minimising the KL divergence simply recovers q(θ) = p(θ|X). To discover a tractable approximation we must impose constraints on the functional form of q(θ), for example the constraint that q(θ) factorises into disjoint subsets of the parameters, i.e. q(θ) = q(θ₁)q(θ₂)q(θ₃)...
In this case, as is shown below, minimising the Kullback-Leibler (KL) divergence yields:

q(θ_k) ∝ exp(⟨log p(X|θ) p(θ)⟩_{∏_{j≠k} q(θ_j)})   (Equation 1)

where the angled brackets represent an expectation under the distribution in the subscript.
To apply variational inference it is necessary to obtain the constant of proportionality in Equation 1 for each factor of the posterior distribution:

Z = ∫ exp(⟨log p(X|θ) p(θ)⟩_{∏_{j≠k} q(θ_j)}) dθ_k   (Equation 2)

Ghahramani and Beal (2001) explore a class of distributions for which this integral will be soluble, known as the conjugate-exponential family. However, if this integral is not soluble for every factor of the posterior distribution, then the technique cannot be applied.
Thus, this technique is only applicable where the prior distribution and the likelihood function are selected so that this integral is soluble for each factor of the posterior distribution. This limits the choice of distributions, and thus the flexibility of the method.
Summary of the Invention

According to a first aspect of the present invention there is provided a method of estimating a property of a structure in data (X), comprising the steps of:
i) establishing a prior probability distribution (p(θ)) of a plurality of parameters (θ₁, θ₂, etc.);
ii) selecting or obtaining values (corresponding to one or more expectations), which characterise at least in part an initial approximation to a posterior distribution (q(θ)) which is selected to be separable into factors across disjoint subsets of the parameters, q(θ) = ∏_k q(θ_k);
iii) updating a factor of the approximation to the posterior distribution (q(θ_k)), or expectations thereof, taking into account the other factors of the approximation to the posterior distribution, or expectations thereof, and a likelihood function p(X|θ), with reference to:

q(θ_k) ∝ exp(⟨log p(X|θ) p(θ)⟩_{∏_{j≠k} q(θ_j)})   (Equation 1)

either by calculation or by using sample-based approximations to approximate expectations under the other subsets of the approximation to the posterior (q(θ_j) where j ≠ k), where a sample-based approximation is used at least once;
iv) repeating the procedure of step iii) for each factor of the approximation to the posterior distribution or expectation thereof.
Preferably, the procedure of step iii) is repeated for each factor of the approximation to the posterior distribution or expectation thereof a plurality of times. More preferably, the procedure of step iii) is repeated for each factor of the approximation to the posterior distribution or expectation thereof until a convergence criterion is met.
Thus, the above method enables the approximation to the posterior distribution (q(θ)) to converge on a distribution from which the structure and/or its properties can be approximated. Importantly, the method can be used where the constant of proportionality (Z) of Equation 1 is not tractable. This constant of proportionality is:

Z = ∫ exp(⟨log p(X|θ) p(θ)⟩_{∏_{j≠k} q(θ_j)}) dθ_k   (Equation 2)

Where this constant of proportionality is intractable for a factor of the approximation to the posterior distribution, so that the factor cannot be updated by calculation, the use of a sample-based approximation instead enables expectations under the factor to be approximated. This is achieved as expectations of a factor can be calculated by sampling from the distribution of Equation 1. Furthermore, iteratively updating individual factors, or expectations thereof, allows the approximation to the posterior distribution to converge on a suitable distribution to enable properties of the structure to be calculated.
This enables greater freedom of choice in selecting a prior distribution and/or likelihood function than with conventional variational inference. In the conventional variational method, the necessity for a tractable Equation 2 constrains both the choice of prior distributions and the choice of likelihood function. In Equation 1, both p(X|θ) and p(θ) are dependent on θ. It is their dependence on θ that dictates whether the integral in Equation 2 is soluble. The intractability of that integral may be as a result of the functional dependence of the likelihood, p(X|θ), on θ or as a result of the functional dependence of the prior, p(θ), on θ. Ghahramani & Beal (2001) discuss situations where the integral is tractable. The particular example given below resolves an intractability in the likelihood.
However, the method could equally be applied when the intractability arises in the prior.
The method above works even if the constants of proportionality for each factor of the approximation to the posterior distribution are intractable, although it is preferred to select the prior distribution p(θ), the approximation to the posterior distribution q(θ) and the
likelihood function so that Equation 1 is tractable for at least some factors of the approximation to the posterior distribution.
Preferably, the sample based approximation used to approximate expectations is importance sampling. Combining importance sampling with variational inference in this way mitigates the limitation of importance sampling that samples with high importance weights may be found in only a small proportion of the proposal distribution.
Typically, the approximation to the posterior distribution is updated for each factor in turn and then the process is repeated. Preferably, this is achieved by updating the expectations of the approximation to the posterior distribution. Each time, the most recent approximation of each other factor of the approximation to the posterior distribution is used (as is required by Equation 1), and it will be seen that q(θ_k) depends on each q(θ_j) for j ≠ k. For example, if in a 3-factor system the constant of proportionality can be calculated for two factors, q(θ₁) and q(θ₂), but we must turn to sampling to compute expectations for one factor, q(θ₃), then q(θ₁) is calculated taking into account the most recent approximation of q(θ₂) and q(θ₃), and then q(θ₂) is calculated (simultaneously or subsequently) taking into account the most recent approximation of q(θ₁) and q(θ₃).
Then expectations under q(θ₃) are estimated taking into account the calculated q(θ₁) and q(θ₂). These improved approximations are used in the next iteration, and so forth.
Several calculable factors may be calculated at once rather than sequentially. It may not be necessary to have initialised one of the factors of the posterior distribution in order to update it if it can be estimated by sampling taking into account other initialised factors.
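The update schedule just described can be expressed as a small generic driver loop, sketched below. This is illustrative only: the factor names, the update callbacks and the convergence tolerance are placeholders, and any factor whose Equation 1 normalising constant is intractable is handled through a sampling-based callback rather than a closed-form one.

```python
import numpy as np

def fit_hybrid_posterior(init_expectations, closed_form_updates, sampled_updates,
                         max_iters=100, tol=1e-4):
    """Iteratively refine a factorised approximation q(theta) = prod_k q(theta_k).

    closed_form_updates: dict mapping factor name -> function(expectations) returning
        new expectations for that factor (Equation 1 solved analytically).
    sampled_updates: dict mapping factor name -> function(expectations) returning
        expectations estimated by a sampling technique (e.g. importance sampling).
    """
    expectations = dict(init_expectations)
    for _ in range(max_iters):
        previous = {k: np.asarray(v, dtype=float).copy() for k, v in expectations.items()}
        # Factors whose normalising constant is tractable are updated by calculation...
        for name, update in closed_form_updates.items():
            expectations.update(update(expectations))
        # ...while intractable factors are updated via sample-based approximations.
        for name, update in sampled_updates.items():
            expectations.update(update(expectations))
        # Simple convergence test: stop once all tracked expectations change little.
        if all(np.max(np.abs(expectations[k] - previous[k])) < tol for k in previous):
            break
    return expectations
```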
Preferably, the data is measured data. The method may comprise the step of measuring the data or the step of receiving measured data. The measured data is typically digitised and is preferably multidimensional, for example a 2D or 3D pixellated image. The data may also be multichannel, in which case different channels may be combined (for example by summation or by taking the maximum of the values of several channels) during iteration.
Structures may be any feature of the data, for example spots in a microarray, discussed
further below. Other examples include monitoring the location of e.g. a head moving in a video image, or a robot or missile to be intercepted, or determination of a trend in a share price. A structure is preferably, but not necessarily a discrete object.
In an example used to illustrate the invention, the measured data is a scanned pixellated image of fluorescent spots in a microarray. Such a microarray of cDNA spots may be constructed by techniques well known to those skilled in the art. In use, one or more fluorescently labelled cDNA probes are hybridised to the microarray. A pixellated image of spots where the cDNA has hybridised is acquired, e.g. by scanning the microarray with a laser and quantifying emitted light. Where multiple cDNA probes are used, each is typically labelled with a label that has different fluorescence properties (such as emitting at different wavelengths), enabling a multichannel scanned image to be obtained. In the microarray example, the structures in the measured data are the spots corresponding to the distribution of cDNA spots in the original microarray. An important property of these spots to be estimated is their intensity, as this indicates qualitatively or quantitatively whether hybridisation occurred. Another property is the location and size of the spots, quantified by their radii, which may be used in quantifying their intensity. The maxima of each channel for each pixel can be used to form a combined image for the purpose of locating spots, and then each channel can be used separately to establish the intensity due to each fluorescent indicator. Intensities of each channel of the scanned image may be estimated separately, to allow comparison of the extent to which different biological samples hybridised to a single microarray spot. Other properties which may be estimated include the log ratio of hybridisation of different biological samples to the same spot, or qualitative measures such as whether a biological sample has hybridised in an amount significantly above background. This enables locations where spots have not been
identified to be flagged.
The prior distribution (p(θ)) may be calculated, estimated and/or based on data input by an operator.
The prior distribution and approximation to the posterior distribution (q(θ)) are selected as a balance between several considerations. Firstly, they must be of appropriate form and with sufficient degrees of freedom to fit closely to typical data. Secondly, there will be benefits of speed and accuracy if Equation 1 is tractable for as many factors of the posterior distribution as possible. A benefit of the invention is that the method works if Equation 1 is intractable for one, more or even all factor distributions, allowing greater freedom of choice of prior distribution and likelihood function than with known variational inference techniques. In the example of microarray analysis, a suitable prior distribution has, as parameters, the centres c and radii r of each spot. The prior distribution may be established by the user specifying a grid that represents the approximate centres of the spots, μ_c, and radius values representing the expected size of the spots, μ_r. Alternatively, initial values of c and r may be calculated from another approximate algorithm.
Preferably, the method includes the step of calculating the most likely value of a property of a structure, and/or the variance or other measures of uncertainty in that value. This can be achieved because the result of the above process is a probability distribution. This provides the benefit of enabling the reliability of the values of properties to be judged.
Preferably also, the method further comprises computing functions of the estimated properties, such as ratios of properties, taking into account the uncertainty in those estimated properties.
This improves the calculation of functions of properties.
For example, in the case of microarray analysis, the intensity of a spot and the variance of the intensity of a spot can be calculated. Furthermore, the log ratio of two intensities can be calculated using a weighted mean which downweights values with high variances.
Structures (such as microarray spots) for which properties (such as centre and size) cannot be found may be flagged as not present.
Preferably, the prior distribution is a Gaussian distribution; in two dimensions, a Gaussian distribution is specified by its mean, μ ∈ ℝ², and its covariance matrix, Σ ∈ ℝ^{2×2}:

N(x|μ, Σ) = (1 / (2π|Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

It is preferred to use covariance matrices which lead to "spherical" Gaussian priors, i.e. Σ_r = α_r⁻¹ I and Σ_c = α_c⁻¹ I. The α's are then measures of the "precision": higher values of α represent increased confidence in the prior belief. Preferably also, it is assumed that there is independence in the prior between the radii and the centres, giving p(c, r) = p(c)p(r), where p(c) = N(c|μ_c, α_c⁻¹ I) and p(r) = N(r|μ_r, α_r⁻¹ I). It is preferred to express the prior distribution with a hyper-prior, i.e. a distribution over the parameters of the prior distribution. This technique is known to those skilled in the field
of variational inference and is used to make equation 1 tractable in applicable circumstances. It is preferred also to use an approximation to the posterior distribution which is a hyper-posterior, i.e. a distribution of the parameters of the approximation to the posterior distribution which it represents.
It is thus preferred to use a hyper-prior, which has parameters m_c and α_c governed by, for the centres, c:

p(c) = ∫ p(c|m_c, α_c) p(α_c) p(m_c) dm_c dα_c,
p(c|m_c, α_c) = N(c|m_c, α_c⁻¹ I),
p(m_c) = N(m_c|μ_c, β_c⁻¹ I),
p(α_c) = gam(α_c|a_c, b_c),

where gam(·) is the gamma distribution defined by gam(x|a, b) = (bᵃ / Γ(a)) x^{a−1} exp(−bx), in which Γ(·) is the gamma function.
Preferably, corresponding distributions apply for the radii, r.
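As an illustration only, the short sketch below evaluates and samples these hyper-prior components with NumPy/SciPy. The numerical settings β_c = 0.25, a_c = 0.05 and b_c = 0.1 are the preferred values quoted later in the detailed description; the grid centre μ_c at the origin is a placeholder.

```python
import numpy as np
from scipy.stats import gamma

# gam(x | a, b) = b**a / Gamma(a) * x**(a - 1) * exp(-b * x)
# SciPy parameterises the same density with shape a and scale = 1 / b.
a_c, b_c = 0.05, 0.1
alpha_c = gamma(a=a_c, scale=1.0 / b_c)

rng = np.random.default_rng(1)
precision_sample = alpha_c.rvs(random_state=rng)     # one draw of the precision alpha_c
print(alpha_c.mean(), precision_sample)              # prior mean of alpha_c is a_c / b_c

# A draw of the centre hyper-mean m_c ~ N(mu_c, beta_c^-1 I), then c ~ N(m_c, alpha_c^-1 I).
mu_c, beta_c = np.array([0.0, 0.0]), 0.25
m_c = rng.normal(mu_c, np.sqrt(1.0 / beta_c))
c = rng.normal(m_c, np.sqrt(1.0 / precision_sample))
```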
Replacing the posterior with the posterior approximation, q(θ), and minimising the Kullback-Leibler divergence with the constraint that q(θ) is separable into disjoint subsets of the parameters gives:

q(c) ∝ exp(⟨ln p(I|c) p(c|m_c, α_c)⟩_{q(m_c) q(α_c)}),
q(m_c) ∝ exp(⟨ln p(c|m_c, α_c) p(m_c)⟩_{q(c) q(α_c)}),
q(α_c) ∝ exp(⟨ln p(c|m_c, α_c) p(α_c)⟩_{q(c) q(m_c)}),

which means that the approximations to the posteriors take the form

q(m_c) = N(m_c|m̄_c, Σ_m),
q(α_c) = gam(α_c|ā_c, b̄_c),

making use of

m̄_c = Σ_m (⟨α_c⟩⟨c⟩ + β_c μ_c),
Σ_m = (⟨α_c⟩ + β_c)⁻¹ I,
ā_c = a_c + 1,
b̄_c = b_c + ½(⟨cᵀc⟩ − 2⟨c⟩ᵀ⟨m_c⟩ + ⟨m_cᵀm_c⟩),

where we have dropped the subscript on the angled brackets where the distribution under which the expectation should be taken is the factor of the approximation to the posterior for the only variable within the brackets. Some of the necessary expectations may be found as ⟨m_c⟩ = m̄_c, ⟨m_cᵀm_c⟩ = m̄_cᵀm̄_c + Tr(Σ_m) and ⟨α_c⟩ = ā_c / b̄_c. No equivalent simple functional form can be found for q(c). Thus, expectations of q(m_c) and q(α_c) can be calculated, but expectations of q(c) cannot.
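A sketch of these two closed-form updates is given below, under the assumptions that c is a two-dimensional spot centre and that the b̄_c update carries the ½ factor shown above; the function and variable names are illustrative only.

```python
import numpy as np

def update_q_mc(exp_c, exp_alpha_c, mu_c, beta_c):
    """Update q(m_c) = N(m_c | mbar_c, Sigma_m) given the current <c> and <alpha_c>."""
    sigma_m = 1.0 / (exp_alpha_c + beta_c)              # Sigma_m = (<alpha_c> + beta_c)^-1 I
    mbar_c = sigma_m * (exp_alpha_c * exp_c + beta_c * mu_c)
    exp_mc = mbar_c                                     # <m_c>
    exp_mc_mc = mbar_c @ mbar_c + 2 * sigma_m           # <m_c^T m_c> = mbar^T mbar + Tr(Sigma_m), 2-D
    return mbar_c, sigma_m, exp_mc, exp_mc_mc

def update_q_alpha_c(exp_c, exp_cc, exp_mc, exp_mc_mc, a_c, b_c):
    """Update q(alpha_c) = gam(alpha_c | abar_c, bbar_c) and return <alpha_c>."""
    abar_c = a_c + 1.0
    bbar_c = b_c + 0.5 * (exp_cc - 2.0 * exp_c @ exp_mc + exp_mc_mc)
    return abar_c, bbar_c, abar_c / bbar_c              # <alpha_c> = abar_c / bbar_c
```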
However, for this particular example, Equation 1 can be written as:

q(c) = (1/Z_c) p(I|c) N(c|⟨m_c⟩, ⟨α_c⟩⁻¹ I).

Thus, preferably, expectations under q(c) are estimated using importance sampling with a proposal distribution

R(c) = N(c|⟨m_c⟩, ⟨α_c⟩⁻¹ I).

Hence the factors q(m_c) and q(α_c) can be updated by calculation and q(c) can be updated by a sampling technique, such as importance sampling. Thus, unlike conventional variational inference, which would have required Equation 1 to be tractable for each factor, it is possible to iteratively converge on an approximation to the posterior.
Alternative sampling techniques include Markov chain Monte Carlo methods, Gibbs sampling, the Metropolis algorithm, hybrid Monte Carlo, etc.
An example convergence criterion is whether specified parameters have changed by less than a specified amount, e.g. once ⟨m_c⟩, ⟨m_r⟩, ⟨α_c⟩ and ⟨α_r⟩ change by at most a specified tolerance. In importance sampling, the quality of the samples obtained can be summarised by S_eff, known as the effective number of samples:

S_eff = 1 / Σ_{s=1}^{S} w_s²,

where S is the total number of samples and the w_s are the normalised importance weights.
This is a preferred convergence criterion and the procedure of step iii) may repeat until S_eff > S/4, for example.
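A small helper for this effective-sample-size test might look as follows; the weights used to exercise it here are random placeholders.

```python
import numpy as np

def effective_sample_size(weights):
    """S_eff = 1 / sum_s w_s^2 for normalised importance weights w_s."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                     # make sure the weights are normalised
    return 1.0 / np.sum(w ** 2)

# Example convergence test from the text: stop once S_eff exceeds a quarter of S.
w = np.random.default_rng(2).random(400)
S = w.size
converged = effective_sample_size(w) > S / 4
```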
The likelihood function p(X|θ) is a model of the probability of the data (X) given some or all of the parameters (θ). In the example of microarray analysis above, a suitable likelihood function is found by considering a "box" of pixels with associated intensities, I_box, surrounding the proposed position of the microarray spot. The positioning and size of the oval placed in this box constitutes a hypothesis, H, about the location and shape of the spot. This likelihood model is based on individual likelihoods for each pixel, taking into account the probability of the intensity of an individual pixel from the box, I_i, given that the pixel is part of a spot, p(I_i|S), and the probability of the intensity given that a pixel is not part of a spot, p(I_i|S̄). A particular hypothesis for an oval's position partitions the pixels in the box into a set which comes from the spot, S, and a set that is not from the spot, S̄. The likelihood function, due to our independence assumption, may then be written as the product of each individual pixel's likelihood:

p(I_box|H) = ∏_{i∈S} p(I_i|S) ∏_{i∈S̄} p(I_i|S̄) = p(I_box|c, r).

A typical estimate of the individual pixels' likelihood uses a uniform likelihood for pixels from the spots, p(I_i|S) = 1/I_max, where I_max is the maximum possible intensity value, and a truncated exponential distribution for the background model, p(I_i|S̄).
The rate of the background model was set such that p(I_back|S) = p(I_back|S̄), where I_back is
a threshold beyond which we would consider the pixel to be from a spot.
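A sketch of this pixel-wise likelihood is given below. Several details are assumptions of the sketch rather than the patent text: the oval-membership test used to partition pixels, the specific values of I_max and I_back, and the fact that the exponential background is used without its truncation correction for simplicity.

```python
import numpy as np
from scipy.optimize import brentq

def background_rate(I_max, I_back):
    """Pick the exponential rate so that the background density equals the uniform spot
    density 1/I_max at the threshold I_back, i.e. lam * exp(-lam * I_back) = 1 / I_max.
    (Truncation of the exponential is ignored here, a simplification of this sketch.)"""
    f = lambda lam: lam * np.exp(-lam * I_back) - 1.0 / I_max
    return brentq(f, 1e-12, 1.0 / I_back)

def log_likelihood(I_box, xs, ys, centre, radii, lam, I_max):
    """log p(I_box | c, r): pixels inside the hypothesised oval use the uniform spot model,
    the remaining pixels use the exponential background model."""
    in_spot = ((xs - centre[0]) / radii[0]) ** 2 + ((ys - centre[1]) / radii[1]) ** 2 <= 1.0
    log_spot = -np.log(I_max)                    # p(I_i | S) = 1 / I_max
    log_back = np.log(lam) - lam * I_box         # p(I_i | not S) = lam * exp(-lam * I_i)
    return np.sum(np.where(in_spot, log_spot, log_back))

# Example: a 20x20 box of pixel intensities with a hypothesised oval at its centre.
xs, ys = np.meshgrid(np.arange(20), np.arange(20))
I_box = np.random.default_rng(4).integers(0, 2000, size=(20, 20)).astype(float)
lam = background_rate(I_max=2**16 - 1, I_back=5000.0)
print(log_likelihood(I_box, xs, ys, centre=(10.0, 10.0), radii=(4.0, 5.0),
                     lam=lam, I_max=2**16 - 1))
```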
Preferably, however, the background model comprises a probability distribution which has
its highest probability at an intensity of zero, decreasing in probability as the intensity increases up to an intensity value (I_dust) above which it is likely that any signal is from a dust
spot or other source of error, and then increasing again. For example, the probability may be p = p_max at intensity 0; p = 1/I_max at I_spot_min; p = 1/I_max at I_spot_max; p = 0 at I_dust; and p = 2/I_max at I_max, where I_spot_min is a lower estimate of the lowest expected spot intensity, I_spot_max is an upper estimate of the lowest expected spot intensity, and p_max is chosen to normalise the distribution.
Although the embodiments of the invention described with reference to the figures comprise processes performed in computer apparatus, the invention also extends to computer apparatus and computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the processes according to the invention. The program may be in the form of a module or library component used with or for use with other programs and the invention extends to
such other programs. The carrier may be any entity or device capable of carrying the program. For example, the carrier may comprise a storage medium, such as a ROM, for example a CD-ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disk or hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal which may be conveyed via electrical or optical cable or by radio or other means.
When the program is embodied in a signal which may be conveyed directly by a cable or other device or means, the carrier may be constituted by such a carrier or means.
Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant processes.
Short description of the figures
Figure 1 is a flow chart of a computation process for estimating the intensity of microarray spots according to the present invention;
Figure 2 illustrates samples of p(c, r) from the specified prior distribution, represented by ovals;
Figure 3A illustrates a hypothesised oval with its radii shown as arrows over a spot from a microarray image, with a rectangle indicating which pixels belong to the set I_box;
Figure 3B illustrates the data from Figure 3A with pixels allocated as part of the spot in black and those allocated as part of the background in grey;
Figures 3C and 3D correspond to Figures 3A and 3B respectively for a case where the hypothesis is poorer;
Figures 4A and 4B illustrate the Cy3 and Cy5 channels extracted from the same microarray image by two different researchers using ScanAlyze software with the same grid parameters (each axis is linearly scaled by a factor of 1000);
Figure 4C is a graph of log2 ratios of the results illustrated in Figures 4A and 4B;
Figure 4D is a normalised histogram of the distances of the log2 ratios from the 45° line in Figure 4C;
Figures 5A to 5D correspond to Figures 4A to 4D respectively for the same data analysed by the method of the present invention;
Figure 6A illustrates an image of a National Institute of Ageing 15K slide with spot positions and sizes labelled by a researcher, without special care;
Figure 6B illustrates the same data as Figure 6A analysed by the method of the present invention;
Figure 7A shows a portion of an image of microarray spots from a further experiment;
Figure 7B shows rough placement of grids by ScanAlyze on the image of Figure 7A;
Figure 7C shows the last set of samples from the proposal distribution when the locations of the spots of Figure 7A were analysed by the method of the present invention;
Figure 8A is a plot of the differences in estimated log ratios between successive time points (one difference on each axis) for the 200 genes (from 6,000) farthest from the origin of this plot, from the experiment of Figures 7A to 7C, without taking uncertainties into account;
Figure 8B corresponds to Figure 8A, but with the values calculated taking uncertainties into account, with dotted lines around each point showing one standard deviation of error in the values;
Figure 8C illustrates arrows indicating the direction of the 15 largest movements of data points between Figures 8A and 8B; and
Figure 8D is a detail from Figure 8B.
Detailed description of an example embodiment

Figure 1 is a flow chart of a computation process for estimating the
intensity of microarray spots. The process is for execution on a computer having storage means, a microprocessor, user input means such as a keyboard and mouse, output means such as a computer monitor and printer, and means for retrieving an image, such as a scanner.
Once the process is started (2), an image, I, is retrieved (4). The image I is a two-dimensional pixellated array obtained by scanning a microarray having Cy3 and Cy5 fluorophore labelled cDNA hybridised thereto, and is separated into two channels, Cy3 and Cy5, corresponding to the fluorescence measured due to Cy3 labelled DNA and Cy5 labelled DNA respectively. (Cy3 and Cy5 are trade marks of Amersham plc.)
Such images, I, are known in the art and typically analysed with software packages such as ScanAlyze, available from http://rana.lbl.gov/EisenSoftware.htm.
For each spot in turn, an initial approximate centre of the spot, μ_c, and radius values representing the expected size of the spot, μ_r, are established (6) by the user specifying a grid and inputting typical radius values. This establishes a prior distribution (discussed below).
In the following steps, we have found that good results are obtained where β_c = β_r = 0.25, a_c = a_r = 0.05 and b_c = b_r = 0.1. Here, β_c and β_r are the precisions of the hyper-prior in the same way that α_c and α_r above are the precisions of the prior.
Next, expectations of an approximation to the posterior distribution are initialised (8). The approximation to the posterior distribution is summarised below and is separable into factors as follows: q(c, m_c, α_c) = q(c) q(m_c) q(α_c) (and similarly for radii). Relevant expectations are initialised as follows: ⟨m_c⟩ = μ_c, ⟨m_r⟩ = μ_r, ⟨α_c⟩ = a_c/b_c and ⟨α_r⟩ = a_r/b_r. This initialised posterior approximation corresponds to the hyper-prior, which is an appropriate first approximation. (The hyper-prior is discussed below.) The posterior approximation is now updated iteratively as follows.
For each iteration, a sampling technique (here importance sampling) is used to estimate the expectations ⟨c⟩, ⟨cᵀc⟩, ⟨r⟩ and ⟨rᵀr⟩.
This is carried out by parameterising (10) proposal distributions, R(c) = N(c|⟨m_c⟩, ⟨α_c⟩⁻¹ I) and R(r) = N(r|⟨m_r⟩, ⟨α_r⟩⁻¹ I), using ⟨m_c⟩, ⟨m_r⟩, ⟨α_c⟩ and ⟨α_r⟩.
Next, samples {r_s, c_s} are obtained (12) from the proposal distributions, R(r) and R(c). An importance weight, w_s = p(I|r_s, c_s), is allocated to each (14), using the likelihood function summarised below.
Next, the importance weights are normalised (16):

w_s ← w_s / Σ_{s'=1}^{S} w_{s'}.

Once a predetermined number of samples (N) have been determined, these importance weights are used to compute estimates of ⟨c⟩, ⟨cᵀc⟩, ⟨r⟩ and ⟨rᵀr⟩ (18):

⟨c⟩ ≈ Σ_{s=1}^{N} w_s c_s

(and similarly for ⟨cᵀc⟩, ⟨r⟩ and ⟨rᵀr⟩).
Thus, we have updated expectations of the factors q(c) and q(r) of the approximation to the posterior distribution. (In fact, on the first iteration we create initial expectations of q(c) and q(r).) Now, expectations of the factors q(m_c) and q(α_c) of the posterior approximation can be calculated (20) using the solutions to Equation 1 for each of q(m_c) and q(α_c):

m̄_c = Σ_m (⟨α_c⟩⟨c⟩ + β_c μ_c),
Σ_m = (⟨α_c⟩ + β_c)⁻¹ I,
ā_c = a_c + 1,
b̄_c = b_c + ½(⟨cᵀc⟩ − 2⟨c⟩ᵀ⟨m_c⟩ + ⟨m_cᵀm_c⟩),

⟨m_c⟩ = m̄_c, ⟨m_cᵀm_c⟩ = m̄_cᵀm̄_c + Tr(Σ_m), ⟨α_c⟩ = ā_c / b̄_c. (Corresponding equations apply for ⟨m_r⟩, ⟨m_rᵀm_r⟩ and ⟨α_r⟩.)
Note that Equation 1 above indicates that each factor of the posterior approximation is a function of the other two factors of the posterior approximation. For the specific prior and posterior approximation used in the present example, Equation 1 is tractable for q(m_c) and q(α_c) but not q(c). However, improved expectations of q(c) have been estimated by importance sampling, whereupon improved expectations of q(m_c) and q(α_c) have been calculated. Next, a convergence criterion is checked (22). If S_eff (as defined above) is not greater than S/4, sampling is repeated and the posterior approximation converges iteratively.
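By way of illustration, the loop of Figure 1 for a single spot might be sketched as below. This is not the complete implementation described in the patent: it reuses the illustrative log_likelihood, update_q_mc and update_q_alpha_c sketches given earlier, treats a single channel, and uses placeholder default values.

```python
import numpy as np

def fit_spot(I_box, xs, ys, mu_c, mu_r, lam, I_max=2**16 - 1,
             beta=0.25, a=0.05, b=0.1, n_samples=500, max_iters=50, rng=None):
    """Illustrative one-spot version of the Figure 1 loop."""
    rng = rng or np.random.default_rng()
    mu_c, mu_r = np.asarray(mu_c, float), np.asarray(mu_r, float)
    # Initialise expectations from the hyper-prior: <m> = mu and <alpha> = a / b.
    exp_mc, exp_mr = mu_c.copy(), mu_r.copy()
    exp_ac = exp_ar = a / b
    for _ in range(max_iters):
        # Proposals R(c) = N(<m_c>, <alpha_c>^-1 I) and R(r) = N(<m_r>, <alpha_r>^-1 I).
        cs = rng.normal(exp_mc, np.sqrt(1.0 / exp_ac), size=(n_samples, 2))
        rs = rng.normal(exp_mr, np.sqrt(1.0 / exp_ar), size=(n_samples, 2))
        # Importance weights are the likelihoods of each sampled (c, r) hypothesis.
        log_w = np.array([log_likelihood(I_box, xs, ys, c, r, lam, I_max)
                          for c, r in zip(cs, rs)])
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Sample-based expectations under q(c) and q(r).
        exp_c, exp_cc = w @ cs, w @ np.sum(cs * cs, axis=1)
        exp_r, exp_rr = w @ rs, w @ np.sum(rs * rs, axis=1)
        # Closed-form updates of q(m) and q(alpha) for the centres and the radii.
        _, _, exp_mc, exp_mcmc = update_q_mc(exp_c, exp_ac, mu_c, beta)
        _, _, exp_ac = update_q_alpha_c(exp_c, exp_cc, exp_mc, exp_mcmc, a, b)
        _, _, exp_mr, exp_mrmr = update_q_mc(exp_r, exp_ar, mu_r, beta)
        _, _, exp_ar = update_q_alpha_c(exp_r, exp_rr, exp_mr, exp_mrmr, a, b)
        if 1.0 / np.sum(w ** 2) > n_samples / 4:      # S_eff convergence test
            break
    return cs, rs, w, exp_c, exp_r
```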
However, if the convergence criterion is fulfilled, then the spot intensities and corresponding expression levels are calculated (24). This can be achieved by calculating the intensities of each of the Cy3 and Cy5 channels using standard calculations, given the values of the expectations of the posterior approximation. Alternatively, expected values can be calculated. Finally, the variance of the intensity readings is calculated (26), enabling results to be output with error bars.
The expression level for a particular spot, E, is a function of the image intensities, the spot position and shape, E = f(I, c, r). This function is known in the art, and can be calculated, e.g. by ScanAlyze.
We are given the image intensities, I, so if c and r are also precisely specified then the expression level may be computed. In the case of Bayesian inference, though, c and r are not specified exactly; our knowledge about them is encapsulated in a posterior distribution, q(c, r|I). However, we cannot solve the necessary integral, but can use the samples and their importance weights. One option is to simply compute E for the most likely values of c and r. Another option is to compute the expectation of E under the approximation to the posterior distribution,

⟨E⟩ = ∫ f(I, c, r) q(c, r|I) dc dr ≈ Σ_{s=1}^{S} w_s f(I, c_s, r_s).

It is possible to compute other quantities of interest such as "error bars" on the quantity, which may be obtained through the expression level's variance: var(E) = ⟨E²⟩ − ⟨E⟩².

Prior distribution

The prior distribution used in this example is a Gaussian distribution. In two dimensions, a Gaussian distribution is specified by its mean, μ ∈ ℝ², and its covariance matrix, Σ ∈ ℝ^{2×2}:

N(x|μ, Σ) = (1 / (2π|Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

For simplicity we consider covariance matrices which lead to "spherical" Gaussian priors, so for the prior governing the radii we take Σ_r = α_r⁻¹ I and for that governing the centres we have Σ_c = α_c⁻¹ I. The α's are then measures of the "precision": the higher their values, the more confident we are about our prior belief. Finally, we assume independence in the prior between the radii and the centres, giving p(c, r) = p(c)p(r),
where p(c) = N(c|μ_c, α_c⁻¹ I) and p(r) = N(r|μ_r, α_r⁻¹ I). Figure 2 illustrates samples of p(c, r) from this prior distribution, represented by ovals. In the illustrated example, μ_c = [0 0]ᵀ, α_c = 1, μ_r = [5 5]ᵀ and α_r = 1.
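As an illustration, samples of the kind shown in Figure 2 can be drawn and drawn as ovals with a few lines of Python, using the values stated above (μ_c = [0 0]ᵀ, α_c = 1, μ_r = [5 5]ᵀ, α_r = 1); the number of samples and the plot limits are arbitrary choices of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

rng = np.random.default_rng(3)
mu_c, alpha_c = np.array([0.0, 0.0]), 1.0
mu_r, alpha_r = np.array([5.0, 5.0]), 1.0

fig, ax = plt.subplots()
for _ in range(20):
    c = rng.normal(mu_c, np.sqrt(1.0 / alpha_c))    # centre ~ N(mu_c, alpha_c^-1 I)
    r = rng.normal(mu_r, np.sqrt(1.0 / alpha_r))    # radii  ~ N(mu_r, alpha_r^-1 I)
    ax.add_patch(Ellipse(xy=c, width=2 * r[0], height=2 * r[1], fill=False))
ax.set_xlim(-12, 12)
ax.set_ylim(-12, 12)
ax.set_aspect("equal")
plt.show()
```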
In practice, calculations are carried out on a distribution of the parameters of the above prior distribution, referred to as a hyper-prior. The hyper-prior has parameters m_c and α_c governed by, for the centres, c:

p(c) = ∫ p(c|m_c, α_c) p(α_c) p(m_c) dm_c dα_c,
p(c|m_c, α_c) = N(c|m_c, α_c⁻¹ I),
p(m_c) = N(m_c|μ_c, β_c⁻¹ I),
p(α_c) = gam(α_c|a_c, b_c),

where gam(·) is the gamma distribution defined by gam(x|a, b) = (bᵃ / Γ(a)) x^{a−1} exp(−bx), in which Γ(·) is the gamma function.
Corresponding equations govern the radii.
Likelihood function

The likelihood function used in this example models the assumption that, for an oval of particular centre and radius, pixels within the oval are from the spot and those outside are from the background. It also assumes that the probability of each pixel coming from the
background is independent. This is rarely correct, but has been found to be suitable for
the present application and has the advantage of computational simplicity.
The likelihood function can be found by considering a "box" of pixels with associated intensities, I_box, surrounding the proposed position of the microarray spot. The positioning and size of the oval placed in this box constitutes a hypothesis, H, about the location and shape of the spot. The likelihood model is based on individual likelihoods for each pixel, taking into account the intensity of an individual pixel from the box, I_i. The likelihood function takes into account the probability of the intensity given that the pixel is part of a spot, p(I_i|S), and the probability of the intensity given that a pixel is not part of a spot, p(I_i|S̄). A particular hypothesis for an oval's position partitions the pixels in the box into a set which comes from the spot, S, and a set that is not from the spot, S̄.
The likelihood function, due to our independence assumption, may then be written as the product of each individual pixel's likelihood:

p(I_box|H) = ∏_{i∈S} p(I_i|S) ∏_{i∈S̄} p(I_i|S̄) = p(I_box|c, r).

In the present example, we have found it suitable to use a uniform likelihood for pixels from the spots, p(I_i|S) = 1/I_max, where I_max is the maximum possible intensity value, and a truncated exponential distribution for the background model, p(I_i|S̄).
The rate of the background model was set such that p(I_back|S) = p(I_back|S̄), where I_back is
a threshold beyond which we would consider the pixel to be from a spot.
Figure 3A illustrates a hypothesised oval with its radii shown as arrows over a spot from a microarray image. The box shows the pixels which belong to the set I_box.
Figure 3B shows pixels allocated to the set S in black and those allocated to the set S̄ in grey. Figures 3C and 3D correspond to 3A and 3B for a case where the hypothesis is not as credible. We would expect the likelihood associated with this hypothesis to be lower than that associated with the hypothesis of Figures 3A and 3B.
Posterior approximation

We assume that the posterior approximation is separable across disjoint subsets of c, m_c and α_c, i.e. q(c, m_c, α_c) = q(c) q(m_c) q(α_c). Minimising the discrepancy between the posterior approximation and the posterior (i.e. through a free-form minimisation of the Kullback-Leibler divergence (Waterhouse et al., 1996; MacKay, 1995)) given the above constraint on the posterior approximation results in Equation 1 becoming, in this case:

q(c) ∝ exp(⟨ln p(I|c) p(c|m_c, α_c)⟩_{q(m_c) q(α_c)}),
q(m_c) ∝ exp(⟨ln p(c|m_c, α_c) p(m_c)⟩_{q(c) q(α_c)}),
q(α_c) ∝ exp(⟨ln p(c|m_c, α_c) p(α_c)⟩_{q(c) q(m_c)}),

which means that the approximations to the posteriors take the form q(m_c) = N(m_c|m̄_c, Σ_m) and q(α_c) = gam(α_c|ā_c, b̄_c), where we have made use of

m̄_c = Σ_m (⟨α_c⟩⟨c⟩ + β_c μ_c),
Σ_m = (⟨α_c⟩ + β_c)⁻¹ I,
ā_c = a_c + 1,
b̄_c = b_c + ½(⟨cᵀc⟩ − 2⟨c⟩ᵀ⟨m_c⟩ + ⟨m_cᵀm_c⟩),

and some of the necessary expectations may be found as ⟨m_c⟩ = m̄_c, ⟨m_cᵀm_c⟩ = m̄_cᵀm̄_c + Tr(Σ_m) and ⟨α_c⟩ = ā_c / b̄_c. No equivalent simple functional form can be found for q(c).
Thus q(m_c) and q(α_c) are tractable, but we use a sampling technique to obtain the required expectations under q(c), as discussed above.
Experimental Results - 1

In the following experiment, labelled cDNA was prepared from mRNA obtained from optic primordia dissected from E11.5 wildtype and aphakia (Varnum & Stevens, 1968) mouse embryos. Sufficient material was collected to hybridise 4 slides. Each slide was hybridised with 2 μg of labelled Cy3 cDNA and 2 μg of labelled Cy5 cDNA. Scans were obtained at 10 pixels/μm resolution using a Genomic Solutions GeneTac LSIV scanner. The subsequent work focuses on one of these slides, although any of these slides could have been chosen to illustrate the same point.
Figures 4A through 4D compare expression levels obtained through analysis of the single microarray slide (6,144 PCR products printed in duplicate from plates 1-16 of the National Institute of Ageing 15K cDNA Clone Set (Tanaka et al., 2000), obtained from the MRC UK Human Genome Mapping Project) using the ScanAlyze software from two different users. (The Microarray Gene Expression Database Group is coordinated by the EBI, http://www.mged.org/; ScanAlyze software is free for academic use and downloadable
from http://rana.lbl.gov/EisenSoftware.htm.) The ScanAlyze package relies on information provided by the user, including the layout grids and the size and shape of each spot. Each user provided the software with the same information, but then placed the grids independently.
Figures 4A and 4B illustrate the Cy3 and Cy5 channels extracted from the same microarray image by two different researchers using ScanAlyze software with the same grid parameters. Each axis is linearly scaled by a factor of 10,000. Figure 4C is a graph of logarithms to the base 2 of the ratios of the results illustrated in Figures 4A and 4B.
Figure 4D is a normalised histogram of the distances of the base 2 logarithms of the ratios from the 45° line in Figure 4C.
These figures show that image analysis by different researchers leads to large discrepancies in the gene expression level values obtained from a single array. Focussing on the centre of Figure 4C we can see that there is up to a two-fold variation in the ratios purely arising from grid placement errors. Interestingly, when performing the same experiment across different slides other researchers (Hughes et al., 2000) have discovered variations of a similar magnitude, but assumed that they resulted from "biological noise". Whilst slide image data and grid placement information is not publicly available for this work, it seems very plausible that a portion of this variation was due to grid placement error.
The microarray slide was also analysed by the method of the present invention. Spot positions and sizes identified by the two different researchers using ScanAlyze software were used to initialise the algorithm. After carrying out the method as in the detailed example above, new estimates for the spot position and size were obtained from the values of ⟨c⟩ and ⟨r⟩. These new values were loaded back into the ScanAlyze software and intensities were calculated by ScanAlyze.
These results are shown in Figures 5A to 5D which correspond to Figures 4A to 4D for the data analysed using the method detailed above.
It can be seen from the reduced variation in Figures 5A to 5D in comparison to Figures 4A to 4D that variability between users has been reduced. The mean distance of the data points from the x = y line decreased by 50%. The results from two different users are now much more consistent than previously.
The National Institute of Ageing 15K slide used in the experiment above was of relatively high quality in comparison to many microarray slides, in that the spots were of similar size and in matching locations on a grid. The method of the present invention has been found by us to work well on slides of much lower quality.
Figure 6A shows an example of such a slide. Spot size is more variable and the spots are not accurately fitted to a grid. In Figure 6A, spot positions and sizes have been labelled by a researcher, without special care, by means of the circles illustrated (10). Figure 6B shows the same data analysed by the example method above. Locations where no spot was found are flagged (20). Such locations are identified where the algorithm cannot find the centre and size of a spot with any precision.
These experimental results were achieved using the above example algorithm implemented in the MATLAB programming environment (MATLAB is a trade mark). The processing of the slides, containing 6,144 spots, took approximately two hours on a computer with a 1.4 GHz Pentium IV microprocessor (Pentium is a trade mark). This corresponds to about 1.2 seconds per spot, although further speed improvements should be possible by optimisation. Two hours corresponds to the time typically required by a researcher for detailed grid placement; however, no human attendance is required during processing and the reduced variability shown above was achieved.
Experimental Results - 2
In a further experiment, data was analysed from twelve bespoke manufactured cDNA microarray slides from experiments on aphakia mice. The experiments involved a comparison of eye tissue extracted from wild-type and aphakia mouse embryos for the presence or absence of 6,000 genes at three time points corresponding to 10, 11 and 12 days gestation, with four repetitions at each time point.
For each gene, i, the important property to be measured is the log2 ratio between the red, r_i, and green, g_i, channels, y_i = log2(r_i / g_i), where each of the channels is associated with one of the two different biological samples. This measure is extremely sensitive when the signal intensity, defined as a_i = r_i g_i, is low. It is of biological interest to examine genes for which the magnitude of y_i is 'high' (normally above 1), as this indicates that the genes are being expressed differently in the two samples and are therefore associated with the aphakia mutation.
We calculated estimates of both ⟨y_i⟩ and the variance of y_i, σ_i² = ⟨y_i²⟩ − ⟨y_i⟩², using the determined importance weights. Error bars equal to the square root of the variance (i.e. the standard deviation) can therefore be displayed.
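The weighted mean and variance of a quantity such as y_i can be computed directly from the importance weights, as in the short sketch below; the sample values and weights shown are purely illustrative.

```python
import numpy as np

def weighted_mean_and_variance(y_samples, weights):
    """<y> and var(y) = <y^2> - <y>^2 computed under normalised importance weights."""
    w = np.asarray(weights, float)
    w = w / w.sum()
    y = np.asarray(y_samples, float)
    mean = np.sum(w * y)
    var = np.sum(w * y ** 2) - mean ** 2
    return mean, var

# Example: log2 ratios computed for each sampled (c, r) hypothesis of one spot.
y_s = np.log2(np.array([1.8, 2.1, 1.9, 2.4]) / np.array([1.0, 0.9, 1.1, 1.0]))
w_s = np.array([0.4, 0.3, 0.2, 0.1])
y_mean, y_var = weighted_mean_and_variance(y_s, w_s)
error_bar = np.sqrt(y_var)    # one standard deviation, as displayed in Figure 8B
```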
Figure 7A shows a portion of one of these images showing a variety of spot types. Figure 7B shows a rough grid placement from ScanAlyze. Figure 7C shows the last set of samples from the proposal distribution when the location of the spots was analysed by the
above method. (The importance weights associated with each sample are not shown.)
Each gene at each time point is associated with a maximum of sixteen values of y_i (i = 1 to 16), as each slide contains the genes twice. (These are repeated next to each other on the slides, hence pairs of neighbouring dots in Figures 7A, 7B and 7C have similar intensities.) It is common, due to experimental variation, for values to be missing in some slides. In fact, two of the slides at time point t = 12 were rejected altogether as a result of their poor quality.
We are interested in an estimate of the actual log2 ratio of the expression levels, ȳ(t), at each time point, t = 10, 11, 12. In conventional microarray processing techniques, this would be estimated as the mean of all values for each time point.
However, by using the variance, we estimated ȳ(t) through a weighted mean which downweights observations with high σ_i². Figures 8A to 8D are plots, with the difference in ȳ between successive time points on the x-axis and y-axis respectively, under various circumstances, of the 200 genes (from 6,000) furthest from the origin. Figure 8A shows ȳ calculated without taking uncertainties into account. Figure 8B shows ȳ calculated taking uncertainties into account, with dotted lines around each point showing one standard deviation of error in the values of ȳ, calculated as described above. Figure 8C illustrates arrows indicating the direction of the 15 largest movements of data points between Figures 8A and 8B. This shows that data points generally move towards the origin by this process. These were genes associated with low intensities from t = 12.
Figure 8D is a detail from Figure 8B.
Thus, various genes which would have been falsely identified as relevant under the standard calculation of ȳ have moved closer to the origin, allowing researchers to focus their efforts on the most relevant genes.
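The text above says only that the per-time-point estimate is a weighted mean which downweights high-variance replicates; the exact weights are not spelt out, so the sketch below assumes the common choice of inverse-variance weighting, with missing replicates passed as NaN. This is an illustrative assumption rather than the patent's prescribed formula.

```python
import numpy as np

def combine_replicates(y, var, axis=0):
    """Inverse-variance weighted mean of replicate log-ratios (assumed weighting scheme)."""
    y, var = np.asarray(y, float), np.asarray(var, float)
    mask = np.isnan(y) | np.isnan(var)                      # ignore missing replicates
    w = np.where(mask, 0.0, 1.0 / np.where(mask, 1.0, var))
    y = np.where(mask, 0.0, y)
    ybar = np.sum(w * y, axis=axis) / np.sum(w, axis=axis)
    ybar_var = 1.0 / np.sum(w, axis=axis)                   # variance of the combined estimate
    return ybar, ybar_var

# Up to sixteen replicate measurements of one gene at one time point.
y_rep = np.array([0.9, 1.1, np.nan, 0.8])
v_rep = np.array([0.05, 0.40, np.nan, 0.10])
print(combine_replicates(y_rep, v_rep))
```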
References Bernardo, J. M., DeGroot, M. H., Lindley, D. V. & Smith, A. F. M., eds (1980) Bayesian Statistics. Valencia University Press, Valencia.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C. P., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J. & Vingron, M. (2001) Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nature Genetics, 29, 365-371.
Brazma, A. & Vilo, J. (2000) Gene expression data analysis. FEBS Letters, 480, 17-24.
Cox, R. T. (1946) Probability, frequency and reasonable expectation. American Journal of Physics, 14 (1), 1-13.
Doucet, A., de Freitas, J. F. G. & Gordon, N. J., eds (2001) Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Eisen, M. B. & Brown, P. O. (1999) DNA arrays for analysis of gene expression. Methods in Enzymology, 303.
Ghahramani, Z. & Beal, M. J. (2001) Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems, (Leen, T. K., Dietterich, T. G. & Tresp, V., eds), vol. 13, MIT Press, Cambridge, MA.
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., Kidd, M. J., King, A. M., Meyer, M. R., Slade, D., Lum, P. Y., Stepaniants, S. B., Shoemaker, D. D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M. & Friend, S. H. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109-126.
Jordan, M. I., ed. (1998) Learning in Graphical Models, vol. 89 of Series D: Behavioural and Social Sciences. Kluwer, Dordrecht, The Netherlands.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. (1998) An introduction to
variational methods for graphical models. In (Jordan, 1998) pp. 105-162.
Lawrence, N. D. (2000) Variational Inference in Probabilistic Models. PhD thesis, Computer Laboratory, University of Cambridge, New Museums Site, Pembroke Street, Cambridge, CB2 3QG, U.K. Available from http://www.elawrences.net/neil.
MacKay, D. J. C. (1995) Developments in probabilistic modelling with neural networks: ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14-15 September 1996, pp. 191-198, Springer, Berlin.
MacKay, D. J. C. (1998) Introduction to Monte Carlo methods. In (Jordan, 1998) pp.
175-204.
Schena, M., ed. (2000) DNA Microarray: A Practical Approach. Oxford University Press.
Tanaka, T. S., Jaradat, S. A., Lim, M. K., Wang, G. J. K. X., Grahovac, M. J., Pantano, S., Sano, Y., Piao, Y., Nagaraja, R., Doi, H., Wood, W. H., 3rd, Becker, K. G. & Ko, M. S. H. (2000) Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray. Proceedings of the National Academy of Sciences of the United States of America, 97, 9127-9132.
Varnum, D. S. & Stevens, L. C. (1968) Aphakia, a new mutation in the mouse. Journal of Heredity, 59 (2), 147-150.
Waterhouse, S., MacKay, D. & Robinson, T. (1996) Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems, (Touretzky, D. S., Mozer, M. C. & Hasselmo, M. E., eds), vol. 8, pp. 351-357, MIT Press, Cambridge, MA.
Yang, Y. H., Buckley, M. J., Dudoit, S. & Speed, T. P. (2000) Comparison of methods for image analysis on cDNA microarray data. Technical report, Department of Statistics, University of California, Berkeley. Available from http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html.

Claims (24)

Claims
1. A method of estimating a property of a structure in data (X), comprising the steps of: i) establishing a prior probability distribution (p(θ)) of a plurality of parameters (θ1, θ2 etc.); ii) selecting or obtaining values (corresponding to one or more expectations), which characterize at least in part an initial approximation to a posterior distribution (q(θ)) which is selected to be separable into factors across disjoint subsets of the parameters, $q(\theta) = \prod_i q(\theta_i)$; iii) updating a factor of the approximation to the posterior distribution (q(θ_k)) or expectations thereof, taking into account the other factors of the approximation to the posterior distribution or expectations thereof, and a likelihood function p(X|θ), with reference to: $q(\theta_k) = \frac{1}{Z}\exp\left\langle \log p(X\mid\theta)\,p(\theta)\right\rangle_{\prod_{j\neq k} q(\theta_j)}$ (Equation 1), either by calculation or by using sample-based approximations to approximate expectations under the other subsets of the approximation to the posterior (q(θ_j) where j ≠ k), where a sample-based approximation is used at least once;
iv) repeating the procedure of step iii) for each factor of the approximation to the posterior distribution or expectation thereof.
2. A method according to claim 1, wherein the procedure of step iii) is repeated for each factor of the approximation to the posterior distribution or expectation thereof a plurality of times.
3. A method according to any one preceding claim, wherein the approximation to the posterior distribution is updated for each factor in turn and then the process is repeated.
4. A method according to claim 3 wherein the approximation to the posterior distribution is achieved by updating the expectations of the approximation to the posterior distribution.
5. A method according to any preceding claim, wherein the procedure of step iii) is repeated for each factor of the approximation to the posterior distribution or expectation thereof until a convergence criterion is met.
6. A method according to claim 5 wherein the procedure of step iii) is repeated until successive values of specified parameters change by less than a specified amount.
7. A method according to any one preceding claim, wherein the constant of proportionality (Z) of Equation 1 is not tractable.
8. A method according to any one preceding claim, wherein the sample based approximation used to approximate expectations is importance sampling.
9. A method according to any one preceding claim wherein the data are measured data.
10. A method according to claim 9 wherein the measured data comprise a digitised multidimensional image.
11. A method according to any one preceding claim wherein the data are multichannel.
12. A method according to any one preceding claim wherein the structure is a discrete object.
13. A method according to any one preceding claim wherein the prior distribution (p(θ)) is calculated, estimated and/or based on data input by an operator.
14. A method according to any one preceding claim wherein the method includes the step of calculating the most likely value of a property of a structure, and/or the variance or other measures of uncertainty in that value.
15. A method according to claim 14 wherein the method further comprises the step of computing functions of the estimated properties, such as ratios of properties, taking into account the uncertainty in those estimated properties.
16. A method according to any one preceding claim further comprising the step of flagging as not present any structures for which properties cannot be found.
17. A method according to any one preceding claim wherein the prior distribution is a Gaussian distribution.
18. A method according to any one preceding claim wherein the prior distribution is specified by a hyper-prior.
19. A method according to any one preceding claim wherein the measured data is a scanned pixellated image of fluorescent spots in a microarray.
20. A method according to claim 19 wherein the centres c and radii r of each spot are the parameters of the prior distribution.
21. A method according to claim 20, wherein, for the centres c, the posterior approximation is considered to be separated into disjoint subsets such that: $q(c) \propto \exp\left\langle \ln p(I\mid c)\,p(c\mid m_c,\alpha_c)\right\rangle_{q(m_c)\,q(\alpha_c)}$, $q(m_c) \propto \exp\left\langle \ln p(c\mid m_c,\alpha_c)\,p(m_c)\right\rangle_{q(c)\,q(\alpha_c)}$, $q(\alpha_c) \propto \exp\left\langle \ln p(c\mid m_c,\alpha_c)\,p(\alpha_c)\right\rangle_{q(c)\,q(m_c)}$, which means that the approximations to the posteriors take the form $q(m_c) = \mathcal{N}(m_c\mid \bar{m}_c, \Sigma_c)$, $q(\alpha_c) = \mathrm{gam}(\alpha_c\mid a_c, b_c)$, wherein the expectations of q(m_c) and q(α_c) are updated by calculation using the equations: $\langle m_c\rangle = \bar{m}_c$, $\langle m_c^{\mathrm{T}} m_c\rangle = \bar{m}_c^{\mathrm{T}}\bar{m}_c + \mathrm{Tr}(\Sigma_c)$, $\langle \alpha_c\rangle = a_c/b_c$, and wherein expectations of q(c) are estimated by sampling of the distribution: $q(c) = \frac{1}{Z}\,p(I\mid c)\,\mathcal{N}\!\left(c\mid \langle m_c\rangle, \langle \alpha_c\rangle^{-1} I\right)$.
22. A method according to claim 21, wherein expectations under q(c) are estimated using importance sampling with a proposal distribution, $R(c) = \mathcal{N}\!\left(c\mid \langle m_c\rangle, \langle \alpha_c\rangle^{-1} I\right)$.
23. A computer program comprising program code means which, when executed on a computer, cause the computer to carry out the method of any one preceding claim.
24. A method of estimating a property of a structure in data substantially as described herein.
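By way of illustration only, the following Python sketch shows one way in which the update cycle of claims 1 to 6 and the sample-based approximation of expectations of claims 7 and 8 might be realised in software. The function names, argument names and the toy target at the end are assumptions introduced for this sketch and are not taken from the specification; in particular, `update_factor` stands in for whatever closed-form or sampled update Equation 1 yields for a given model.

```python
import numpy as np

def importance_expectations(log_q_unnorm, proposal_sample, proposal_logpdf,
                            stats, n_samples=1000, rng=None):
    """Approximate expectations under an intractable factor q(theta_k), known only up to
    a constant of proportionality Z (cf. Equation 1 and claim 7), by self-normalised
    importance sampling (claim 8).

    log_q_unnorm(theta):      unnormalised log density of the target factor
    proposal_sample(rng, n):  draws n samples from the proposal distribution R
    proposal_logpdf(theta):   log density of R at theta
    stats:                    dict mapping a name to a function f(theta) whose
                              expectation under q(theta_k) is required
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = proposal_sample(rng, n_samples)
    log_w = np.array([log_q_unnorm(t) - proposal_logpdf(t) for t in samples])
    log_w -= log_w.max()            # subtract the maximum for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                    # self-normalising removes the intractable Z
    return {name: sum(wi * f(t) for wi, t in zip(w, samples)) for name, f in stats.items()}


def sweep_until_converged(initial_expectations, update_factor, tol=1e-5, max_sweeps=100):
    """Update each factor of the separable approximation in turn (step iii), repeating
    the sweep over all factors (step iv, claims 2 and 3) until every tracked expectation
    changes by less than `tol` between successive sweeps (claims 5 and 6).

    initial_expectations: list of dicts, one per factor, of scalar expectations
    update_factor(k, expectations): returns refreshed expectations of factor k given
                                    the current expectations of all factors
    """
    expectations = [dict(e) for e in initial_expectations]
    for _ in range(max_sweeps):
        previous = [dict(e) for e in expectations]
        for k in range(len(expectations)):
            expectations[k] = update_factor(k, expectations)
        change = max(abs(expectations[k][name] - previous[k][name])
                     for k in range(len(expectations)) for name in expectations[k])
        if change < tol:
            break
    return expectations
```

For instance, the mean and second moment under an unnormalised standard normal target can be estimated with a broader Gaussian proposal:

```python
rng_demo = np.random.default_rng(0)
est = importance_expectations(
    log_q_unnorm=lambda t: -0.5 * t ** 2,                         # unnormalised N(0, 1)
    proposal_sample=lambda rng, n: rng.normal(0.0, 2.0, size=n),  # proposal N(0, 2^2)
    proposal_logpdf=lambda t: -0.5 * (t / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi)),
    stats={"mean": lambda t: t, "second_moment": lambda t: t ** 2},
    rng=rng_demo,
)
# est["mean"] is close to 0 and est["second_moment"] close to 1 for large n_samples.
```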
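For the microarray application of claims 19 to 22, a corresponding minimal sketch is given below. It assumes a caller-supplied `log_lik_image` function returning log p(I|c) for a candidate spot centre (hypothetical, not defined in the specification) and a shape/rate parameterisation of the gamma factor, so that the mean of q(α_c) is a_c/b_c. Because the proposal R(c) of claim 22 matches the Gaussian part of q(c), the importance weights reduce to the image likelihood alone.

```python
import numpy as np

def update_centre_factors(log_lik_image, m_bar, Sigma, a, b, n_samples=500, rng=None):
    """One illustrative pass over the centre-related factors of claims 21 and 22.

    q(m_c)     = N(m_c | m_bar, Sigma)  -> <m_c> = m_bar,
                                           <m_c^T m_c> = m_bar^T m_bar + Tr(Sigma)
    q(alpha_c) = gam(alpha_c | a, b)    -> <alpha_c> = a / b  (shape/rate assumed)
    q(c) is proportional to p(I|c) N(c | <m_c>, <alpha_c>^-1 I); its expectations are
    estimated by importance sampling with proposal R(c) = N(c | <m_c>, <alpha_c>^-1 I),
    so the Gaussian terms cancel and each weight is proportional to p(I|c) alone.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Closed-form expectations of the Gaussian and gamma factors (claim 21).
    m_mean = np.asarray(m_bar, dtype=float)
    m_sq = float(m_mean @ m_mean + np.trace(Sigma))
    alpha_mean = a / b

    # Sample candidate centres from the proposal R(c).
    cov = np.eye(m_mean.size) / alpha_mean
    candidates = rng.multivariate_normal(m_mean, cov, size=n_samples)

    # Self-normalised importance weights: log p(I|c) for each candidate centre.
    log_w = np.array([log_lik_image(c) for c in candidates])
    log_w -= log_w.max()                                  # numerical stability
    w = np.exp(log_w)
    w /= w.sum()

    # Sample-based expectations under q(c) (claim 22).
    c_mean = w @ candidates                               # <c>
    c_outer = (candidates * w[:, None]).T @ candidates    # <c c^T>
    return {"m": m_mean, "mTm": m_sq, "alpha": alpha_mean, "c": c_mean, "ccT": c_outer}


# Toy usage: an artificial image likelihood peaking near pixel (10.5, 12.0).
if __name__ == "__main__":
    toy_log_lik = lambda c: -0.5 * float(np.sum((c - np.array([10.5, 12.0])) ** 2))
    out = update_centre_factors(toy_log_lik,
                                m_bar=np.array([9.0, 11.0]),
                                Sigma=0.5 * np.eye(2),
                                a=2.0, b=1.0)
    print(out["c"])                                       # estimated spot centre <c>
```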
GB0317527A 2002-08-01 2003-07-28 Estimating a property of a structure in data employing a prior probability distribution Withdrawn GB2392275A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB0217827.5A GB0217827D0 (en) 2002-08-01 2002-08-01 Estimating properties of structures in data

Publications (2)

Publication Number Publication Date
GB0317527D0 GB0317527D0 (en) 2003-08-27
GB2392275A true GB2392275A (en) 2004-02-25

Family

ID=9941519

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB0217827.5A Ceased GB0217827D0 (en) 2002-08-01 2002-08-01 Estimating properties of structures in data
GB0317527A Withdrawn GB2392275A (en) 2002-08-01 2003-07-28 Estimating a property of a structure in data employing a prior probability distribution

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB0217827.5A Ceased GB0217827D0 (en) 2002-08-01 2002-08-01 Estimating properties of structures in data

Country Status (1)

Country Link
GB (2) GB0217827D0 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5416750A (en) * 1994-03-25 1995-05-16 Western Atlas International, Inc. Bayesian sequential indicator simulation of lithology from seismic data
US5910655A (en) * 1996-01-05 1999-06-08 Maxent Solutions Ltd. Reducing interferences in elemental mass spectrometers
EP1197899A1 (en) * 2000-05-26 2002-04-17 Ncr International Inc. Method and apparatus for determining one or more statistical estimators of customer behaviour

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Nature Genetics Vol.29, December 2001, pages 365-371. *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006097761A1 (en) * 2005-03-18 2006-09-21 Forensic Science Service Ltd Improvements in and relating to investigations
GB2439004A (en) * 2005-03-18 2007-12-12 Forensic Science Service Ltd Improvements in and relating to investigations

Also Published As

Publication number Publication date
GB0217827D0 (en) 2002-09-11
GB0317527D0 (en) 2003-08-27

Similar Documents

Publication Publication Date Title
KR102216898B1 (en) Quality prediction of sequencing results using deep neural networks
Collin et al. Extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest
AU2018350891B2 (en) Deep learning-based techniques for training deep convolutional neural networks
CN105980578B (en) Base determinator for DNA sequencing using machine learning
EP2511843B1 (en) Method and system for calling variations in a sample polynucleotide sequence with respect to a reference polynucleotide sequence
US8370079B2 (en) Algorithms for sequence determination
CA2309586C (en) Genomics via optical mapping with ordered restriction maps
EP1535232A2 (en) A system and method for snp genotype clustering
CN110010197A (en) Single nucleotide variations detection method, device and storage medium based on blood circulation Tumour DNA
Lawrence et al. Reducing the variability in cDNA microarray image processing by Bayesian inference
Panigrahi et al. Landmarks in the history of selective sweeps
CN111563549A (en) Medical image clustering method based on multitask evolutionary algorithm
CN106125074B (en) A kind of antenna aperature method for managing resource based on Fuzzy Chance Constrained Programming
CN104115190B (en) For subtracting the method and system of background in the picture
GB2392275A (en) Estimating a property of a structure in data employing a prior probability distribution
WO2021243274A1 (en) Machine learning-based analysis of process indicators to predict sample reevaluation success
Whalen et al. Evolving SNP panels for genomic prediction
CN115966259B (en) Sample homology detection and verification method and system based on logistic regression modeling
US20230316054A1 (en) Machine learning modeling of probe intensity
KR102507489B1 (en) Apparatus and method for diagnosis classification
Chen et al. A high-throughput system to resolve inconsistent reading frame predictions for expressed sequence tags
Chan EVALUATING AND CREATING GENOMIC TOOLS FOR CASSAVA BREEDING
WO2002064617A2 (en) Method and system for haplotype reconstruction
CN117091559A (en) Deformation monitoring method and system for measuring robot
Lee et al. An ant colony optimization algorithm for DNA copy number analysis in array CGH data

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)