WO2005071594A1

WO2005071594A1 - Estimation of signal thresholds for microarray data using mixture modeling

Info

Publication number: WO2005071594A1
Application number: PCT/EP2004/000586
Authority: WO
Inventors: Khalid S. Abu Khabar; Musa Hakan Asyali
Original assignee: King Faisal Specialist Hospital & Research Center; Terramark Markencreation Gmbh
Priority date: 2004-01-23
Filing date: 2004-01-23
Publication date: 2005-08-04

Abstract

The application relates to a method of classifying DNA microarray intensity data into two classes of reliable and unreliable data, comprising the steps of extracting median and mean intensities of two fluorescent signals of different wavelengths from cDNA sequences arranged in a microarray as λ1-channel and λ2λ-channel raw data; estimating normal mixture model parameters corresponding to class conditional densities and prior probabilities using expectation maximization algorithm; determining in univariate analysis an optimal signal threshold for each channel or in bivariate analysis a threshold surface for said microarray raw data using Bayesian classification theory; denoting a signal intensity value as reliable, if either of the channels is larger than its respective threshold in univariate analysis or if the 2-dimentional data point falls outside of a hyper-quadratic decision region in bivariate analysis.

Description

Estimation of Signal Thresholds for Microarray Data Using Mixture Modeling

The present invention relates to the estimation of signal thresholds for microarray data using mixture modeling and in particular to a method of classifying DNA microarray intensity data into two classes of reliable and unreliable data.

DNA microarray, or DNA chip technology, provides a powerful and efficient means for measuring the relative abundance of expressed genes in a variety of applications such as profiling of gene expression between normal and diseased subjects. Microarray data consist of observations, i.e. pairs of Cy3 (green) and Cy5 (red) channel intensity readings/measurements obtained from the complimentary DNA (cDNA) spots on the microarray. The ratio of the two channels in an observation gives the expression level for the particular gene.

Microarray gene expression data usually bear substantial number of false positives due to different possible sources of error. A common problem and probably the major source of the error are the low signal intensities due to low expression of many of the genes being investigated. Such low signals cause variability or impair reproducibility of the measured ratios between control and experimental samples. In microarray experiments, only a small fraction of the genes become expressed as a result of the investigated conditions. However, there are also other situations that can give rise to low signal values such as deposition of suboptimal amounts or poor quality of probes or incorrect identification of spot areas at the spot segmentation phase. Due to problems of this sort, the intensity signals from both channels are can become unacceptably low and their ratio may turn to be very large, which will incorrectly indicate a high level of expression for the particular genes.

A serious limitation in microarray analysis is the unreliability of the data from low signal intensities, which generally constitutes a large portion of the microarray data. Such data may produce erroneous gene expression ratios and results in unnecessary validation or post- analysis follow-up tasks. Only few studies have addressed this issue and devised thresholding methods to eliminate the data from low signal intensities. The suggested methods involve subjective thresholding or filtering mechanisms where some parameters are required to be determined by trial and error. A study suggested the use of a significance test based on the standard deviation of the signals from the repeated genes. However, this method is limited by the number and accuracy of the repeated measurements. Furthermore, none of the methods suggested so far has taken the bivariate nature of the data into account.

It is the object of the present invention to identify and eliminate such unreliable observations, i.e. pairs of low intensity readings, that may give rise to false positives.

This object is achieved by a method comprising the steps according to claim 1.

According to the inventive method the problem of identifying low or unreliable readings is formulated as an optimal, i.e. minimum error rate, classification problem. Univariate or bivariate normal mixture modeling is used to segregate the microarray data into two classes, namely classes of reliable and unreliable data, i.e. low signal intensity data. Then an expectation maximization technique is used to estimate the mixture parameters and their weights, i.e. class prior probabilities and the optimal thresholds (classifiers) are determined (or decision surfaces in the case of bivariate analysis) using Bayesian classification theory. Such a classifier assigns observations into classes based on posterior probabilities and therefore minimizes the error or misclassification rate. The term "observation" refers to a pair of Cy3 (green) and Cy5 (red) channel intensity readings or measurements obtained from the same spot on the microarray.

Once class information for the observations is obtained, observations that have a higher probability of belonging the class of unreliable data can be identified and eliminated while keeping the maximum possible number of reliable measurements.

In the case of univariate analysis, each channel of data is considered separately and optimal signal thresholds are established for each channel separately and a two-component univariate normal mixture model for each channel is estimated. Then, the classification information obtained for each channel is combined to come up with an overall decision about reliability of the observation. Bivariate analysis, on the hand, takes the bivariate nature of the data into account and directly estimates a two-component bivariate normal mixture for the data in 2-dimensions (2-D). Thus the two dimensional structure of the data is preserved and each observation or data, respectively, is directly classified using class posterior probabilities computed from the mixture components.

The inventive method comprises the steps of extracting median or mean intensities of two fluorescent signals of different wavelengths from cDNA sequences arranged in a microarray as λi -channel and λ₂-channel raw data; estimating normal mixture model parameters corresponding to class conditional densities and prior probabilities using expectation maximization algorithm; ^' determining in univariate analysis an optimal signal threshold for each channel or in bivariate analysis a threshold surface for said microarray raw data using Bayesian classification theory; and denoting a signal intensity value as reliable, if either of the channels is larger than its respective threshold in univariate analysis or if the 2-dimentional data point falls outside of a hyper-quadratic decision region in bivariate analysis.

With the inventive method data from low-signal intensities that may give rise to false positives are identified and eliminated and also false negatives, i.e. the reliable measurements tagged as incorrect or invalid, are kept at a minimum level to achieve maximum efficiency. This yields more informative and accurate gene expression data and will prevent the propagation of erroneous expression results into the subsequent phases of analysis and therefore increase the accuracy, reliability and reproducibility of the microarray experiments.

Furthermore, the model parameters that determine the optimal thresholds are as an advantage estimated from the data itself. Therefore, only the nature of the data determines the optimal thresholds, i.e. there is no external validation sets, etc. and as such this technique adaptable/applicable to any microarray data.

According to an embodiment of the invention the expectation maximization algorithm comprises the steps of: estimating initial mixture parameters μ, Σ determining or estimating the posterior probability that an observation belongs to the ith class given by _f l, 2, ,n;

determining new prior probabilities given by π_i = _y In 7=1 determining new updates of the mean vector , and the covariance matrix Σ,. according to

repeating the 2^nd to 4^th expectation and maximization steps until updates/changes in mixture parameter estimates are smaller than a preset tolerance level or a certain number of iterations is reached.

According to another embodiment the inventive method may provide that in univariate analysis the probability density for each channel is given by the sum of the weighted normal density components of the low (unreliable) and high (reliable) intensity class of data and is given by the equation f, (x) = π_tN{x; μ , σ„ ) + (1 - π, )N{x; μ_u,σ ² _h),

wherein N(x; /,,σ,²) is the univariate normal probability den-

sity function for the i-th channel, μ. is the mean and σ .2 . is the variance, x denotes the signal intensity, subscript i denotes the g- channel and the r- channel, respectively, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

The inventive method may also provide that in bivariate analysis the probability density is given by the sum of the weighted normal density components of the low (unreliable) and high (reliable) intensity class of data and is given by the equation/ (x) = π N (x; μ_1? ∑i) + (1 - π) N(x; μ₂, Σ₂), wherein N (x; μ_i5 Σ_;) = {2π | ∑j |)^"rf/2exρ {(x - μ ^∑f¹ (x - μO / 2} is the multivariate normal probability density for the i-th channel, μ. is the dxl mean vector, Σ; is the dxd covariance matrix, x is the signal intensity, and subscripts 1 and 2 denote the weighted normal den- sity components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

According to still another embodiment in univariate analysis the optimal classification threshold for each channel is obtained by equating the class posterior probabilities, i.e. is given for each channel i by the solution of π ,N{x;μ_u,σ ² ) = (1 - π₍)N{x; μ_2i, σ²2ι),

wherein N{ ;μ_i,σ_i ) = is the univariate normal probability den

sity function for the i-th channel, μ_t is the mean and σ is the variance, x denotes the signal intensity, subscript i denotes the g- channel and the r-channel, respectively, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

Yet another embodiment provides that in bivariate analysis the optimal classification surface is obtained by equating the class posterior probabilities, i.e. is given by the solution of τzN(x; μ_l,Σ_λ) = {l -π)N{x;μ,Σ₂) ,

wherein N(x; ,,∑.) is the multivariate normal probability density, μ. is the dxl mean vector, ∑. is the covariance matrix, x is the signal density, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

According to the invention also a method implemented by a computer program including the above method steps is provided, wherein the computer program may be contained in a corresponding computer program product.

For classifying DΝA microarray intensity data also a corresponding apparatus comprising a light source, a detector, a microarray holder, an interface connected to the detector and to a data processing means, and means for carrying out the inventive method according to one of the above steps is provided. The preferred light source is a laser preferably emitting at 532 nm and 635 nm. Furthermore, the data processing means preferably comprises a microprocessor and a memory.

Further details, features and advantages of the invention will become apparent from the following detailed description.

Mixture Modeling

Finite mixture modeling is a technique for probability density function estimation and found significant applications in various biological problems. Finite mixture modeling assumes that a multivariate probability density f{x) , x being a dxl data point or an observation can be expressed as a weighted sum of k densities, hence the name "finite mixture."

f{x) = f{x;Q) = ∑π_ig_i{x;β_i) . (1) (=1

Here, is a probability density with mxl parameter vector θ., and π_t > 0 represents the "weight" or "mixing coefficient" of the fth component, and θ is the kmxl cumulative set of parameters, i.e. θ = [θ_j ^r θ₂ ^r ...θ ]^r .

In the case of normal mixtures, the density component g_;.(x;θ.) is the multivariate normal probability density, i.e. g_t (x; θj) = {2π)^'d/2 | ∑i \)^' exp {- (x - μ ^τ ∑ (x - μ / 2}, where ^ is the d xl mean vector and ∑. is the dx d covar9iance matrix. The corresponding length of θ,- is m = d + d{d + 1) , because of the symmetry of the covariance matrix. One should have -_r — _\ k 2__l.₌ π_i = l , to make sure that f{x) is a valid probability density. Therefore, total number parameters to be estimated in Eq. (1) are N = km + k - 1. In order to estimate these parameters with acceptable variance/accuracy, one must also make sure that N « n, where n is the number of observations or data points, i.e. x = _;. , j = 1, 2, ... , n .

While using the normal mixture modeling technique in classification applications, it is assumed that the data come from a ^-component normal mixture, i.e. there are ^-classes, but the class membership information for the data is not known/available. Let g-.(x;θ.) be the probability density and π_i be the proportion of the tth class. Then,

Ϊ ^τAA == ll»,22,,....../kc;; yj = = ll,,22,,......,,nn (2)

is the posterior probability that an observation x . belongs to tth class, where the dominator f{X_j) is given by Eq. (1).

Assuming that the observations in the data set are independent, the likelihood function is constructed as

Likelihood Function = l{θ) = l{ ;x_l,x₂,...,x_n) = Pr[D"₌₁ x _, ; θ] = J [ f{x . ; θ) . 7=1

or the log likelihood function as

Log Likelihood = L(β) = log Z(θ) = J log f{ _} ; θ) . (3) 7=1

This is the joint probability of having a particular set of observations given θ . While estimating the parameters of a mixture model, one should select a parameter vector θ that will maximize either of these likelihood functions. Therefore, the maximum likelihood (ML) estimate of θ can be obtained by solving the likelihood equation dL{Q) = 0. (4) δθ

The Expectation Maximization (EM) method is used for estimating the mixture parameters. The EM is an iterative method for optimizing the likelihood function in situations where there is some missing information. In the present case, missing information is the class membership of the observations. The EM algorithm is now a standard tool in statistics. The EM algorithm for mixture modeling can be summarized as follows: 1. Starting with a proper initial estimate of the mixture parameters θ . Assuming that the number of^* components k is known, this initialization can be accomplished by using one of the available nonparametric classification algorithms. For instance the data can be classified into k classes using the :-means algorithm and the initial values of the parameters can be estimated from the classified data. Instead of using random initial parameter estimates, using this approach improves convergence of EM algorithm significantly.

2. Expectation step: Determining or estimating the posterior probabilities given by Eq. (2) and computing the new or updated π_i 's, i.e. class weights or prior probabilities, as:

^^{• =}Σ^" ₌ / '

3. Maximisation step: Given the estimated π_i 's from the expectation step, updating the parameter vector θ by maximising the likelihood function, i.e. finding the ML estimate of θ . The ML estimate of θ can be derived for different classes of probability densities using Eq.s (3) and (4). For the normal mixtures, the updates on the mean vector μ, and the covariance matrix Σ. are as follows:

4. Repeating expectation and maximisation steps until the updates/changes in mixture parameter estimates are smaller than a small preset tolerance level or a certain number of iterations is reached.

All the computations regarding normal mixture density estimation may be performed using a software developed under Matlab™ (The Math Works Inc., Natick, MA) environment. Before proceeding with the estimation of mixture parameters, the data may be preferably log transformed to bring the histograms closer to normality. Compressing data this way also avoids dealing with mixture components with large variances, which generally causes numerical problems. Bayesian Classification

In a two-class problem classification, let x denote an observation and i = 1,2 be the class index and suppose that the class prior probabilities, P_t = P{x e Class i) , and the class conditional densities, P{x; x e Class i) are known. Then, the Bayes rule states that the posterior probabilities, P{x e Class i;x) , can be expressed as:

P{x;x e Class i)P_; P(x <≡ Class i; x) ■

where the evidence, P(x) , is merely a normalizing factor that ensures the posteriors add up to 1. Bayesian decision rule simply assigns an observation to the class that has the highest posterior probability, i.e. decide that x e Class 1 if P{x e Class 1; x) is larger than P{x e Class 2; x) and vice versa. (This method is also known as maximum a posteriori classification.) Hence, for any given value of x, probability of enor P{e;x) is minimized, because

[P{x e Class l;x) ifx e Class 2 is decided . , , . ,

P{e;x) = \ = min {P(x e Class l;x),P(x e Class 2; x)} I P(x e Class 2;x) if x e Class 1 is decided

The optimal decision surface can be found by equating the posterior probabilities. In the case of multivariate normal densities, the optimal decision surfaces will be hyper-quadratics, whose form and position are determined by the prior probabilities and the mean vectors and the covariance matrices of the distributions. Assuming that the class conditional densities and priors are estimated through the mixture modeling approach, this optimal classification technique that minimizes the probability of error to our microanay data classification problem may be conveniently applied.

In the following results obtained by carrying out the inventive method are described by way of examples with reference to the accompanying drawings, in which:

Fig. 1. A and B show univariate normal mixture histograms and threshold plots for two different data sets. Fig. 2. A and B show bivariate normal mixture scatter plots, classification plots, and validation plots for two different data sets.

Finite Mixture Modeling

Two independent experiments of microarray gene expression generated from the same cell system, for consistency, have been prepared in order to demonstrate the utility of the present approach.

The Microarray Preparation was carried out in the following manner:

A complementary DNA (cDNA) microarray has been used, which contained -2000 cDNA probes. The cDNA probes were generated via PCR amplification of the clone-containing glycerol bacterial culture stocks using M13 primers and spotted (100 μm-diameter) at least in duplicate on poly-L-lysine- glass slides using OmniGrid robot (GeneMachines, CA). Total RNA samples (20 μg, 5 ug/ul) from THP-1 cell line or THP-1 cells line that was stimulated with 10 ug/ml endotoxin (Sigma, St. Louis) were reversed transcribed using Superscript II (Gibco) and a primer provided by the Genisphere kit (Genisphere, Inc.). The cDNAs from the total samples (e.g., control and treatment) were combined and ethanol precipitation was performed. The hybridization buffer supplied by the Genisphere kit was heated to 55°C for 10 min. and mixed with two blocking solutions, oligo dT and LNA dT for nonspecific hybridization of labeled cDNA to elements containing poly dA sequences and for poly A containing elements including spotted poly dA sequences, respectively. The cDNA pellets were incubated in the hybridization buffer with the blockers as "hybridization mix" of 40% formamide, 4X SSC, 1% SDS, and 2X Denhardt's Solution and was then applied to the microarray and covered with a coverslip for 18 hr at 55° C in a sealed, humidified chamber. After hybridization, the slides were washed briefly by a series of washes to remove unbound cDNA. Subsequently, the microarrays were subjected to dendrimer hybridization using Genisphere DNA dendrimer capture reagent labeled with Cy3 (control) or Cy5 (experiment). The hybridization mix was first incubated at 75-80°C for 10 minutes, and then at 50°C for 20 minutes, and then added to pre-warmed microarray slides. The coverslips were applied to the array and incubated 3 hours in a dark humidified chamber at 45-50°C. After hybridization the slides were washed briefly several times to remove unbound dendrimers molecules and microanays were dried in a centrifuge. Image Acquisition and Intensity Extraction:

The cDNA microanay was scanned at 10 μm resolution using ScanArray scanner that excites the Cy3 (green) and Cy5 (red) fluorophores at optimal wavelengths of 532nm and 635nm, respectively. The median intensities of green and red fluorescent signals from each spotted cDNA sequence on the microanay image were extracted using adaptive circle algorithm (QuantAnay software for the GSI Lumonics Scanner). The spot intensities with large local background were excluded if the median intensity of the spot was less than 1.4 of the median background intensity of the same spot. The negative values, if present, were also excluded from the data. The background intensities were subtracted from the spot intensities. The background conected intensities were then normalized to total signal intensities to account for different input RNA concentrations or labeling efficiency in the individual Cy3 and Cy5 reverse transcriptase reactions. The intensity ratios (Cy5/Cy3), which represent the relative expression profile of the Au (T)-rich element-containing genes (ARE-genes) in the two starting RNA samples, were calculated.

Subsequently data (fluorescence signal intensities) for the two experiments were first log transformed to bring distribution to normality; summary statistics for each experimental data was obtained (Table 1).

Table 1. Summary statistics for the three microarray expression data. Dataset 1 Dataset 2 Cy3 Cy5 Cy3 Cy5 Log Transformed Signal Intensities n (number of points) 3027 3040 Mean 6.28 6.21 Median 5.99 5.81 5.95 5.77 SD 1.08 1.16 1.05 1.15 Correlation coefficient (/?cy3,_Cys) 0.93 0.92 Background Signal Intensities Median 93.83 54.08 90.58 60.12 SD 138.92 67.33 153.13 107.36

What is also reported in Table 1 is the median and standard deviation (SD) of the raw, i.e. not log transformed, background intensities. As it will become apparent shortly, these values are needed to implement Fielden's thresholding technique (see below) that depends on these sim- pie statistics estimated from the background intensities.

Then normal mixture modeling was applied whose results are presented in Table 2, to segregate microanay data into two populations on the basis of probabilities. Normal mixture modeling using either univariate or bivariate analysis has been used, in which Cy3 (green) and Cy5 (red) channel data are either independently or dependently analyzed, respectively.

Table 2. The results of univariate and bivariate mixture modeling

Univariate analysis Bivariate analysis

Data Cy3 Cy5 set Component 1 Component 2 Component 1 Component 2 Component 1 Component 2 ^"5.826^" ^"7.530^" Mean (μ or μ) μ 5.846 7.649 5.636 7.450 μ 5.656 _7.554_ ¹ , ^'0.229 0.172^" ^"1.643 1.518^" SD (σ² or Σ) σ² 0.249 1.628 0.239 1.718 Σ 0.172 0.257 1.518 1.697 Weight (π) 0.761 0.239 0.712 0.288 0.735 0.265 ^"5.797^" ^"7.494^" Mean (μ or μ) μ 5.807 7.567 5.588 7.422 μ _5.608_ _7.541_ 2 . ^"0.235 0.197^" ^"1.634 1.502^" Var. (σ² or Σ) σ² 0.246 1.615 0.285 1.717 ∑ 0.197 0.305 1.502 1.671 Weight (π) 0.769 0.231 0.731 0.269 0.755 0.245

In the univariate approach, the log-transformed data from each channel (Cy3 and Cy5) was processed separately and assuming that their probability density functions consist of two normal components. That is, f_g{x) = π_gN{x; μ_lg,σ _g) + {l-π_g)N{x;μ_2giσ₂ ² _g) (5.a) f (^x) = π_rN{x; / ,, , σ ) + {l -π_r )N{x; μ_2r ,

(5.b) where fix) is the probability density, x is the signal intensity, and g and r subscripts denote Cy3 and Cy5 channels, respectively. The first and second terms, i.e. weighted normal density components, in each channel's density represent the posterior probability of the low (unreliable) and high (reliable) intensity class of data, respectively. In Fig. 1 histograms in 60 bins from two different data sets {A and B) representing estimated normal density components for Cy3 green channel {upper panel) and Cy5 red channel {lower panel) channel are shown. The estimated weighted normal density components for each channel, i.e. components 1 and 2 or πN{x;μ₁,σ₁) and {l-π)N{x;μ₂,σ₂) are shown by dashed lines and dotted lines; respectively. The statistical parameters for each component are shown in Table 2. The sum of the weighted densities,jg (X) andf_g (X) (solid lines in upper and lower panels) closely mimics the density or histogram for the Cy3 and Cy5 channels, respectively. The vertical dashed line shows the optimal signal intensity threshold for each channel in the log-domain as explained above. The optimal thresholds can be converted to the raw-data domain by taking their exponentials.

As seen clearly, the two-component mixture model closely approximates the histogram of the each channel data in each set. Table 2 {univariate analysis) shows the parameters, i.e. means, SDs, and weights of the normal components for each channel data in each set.

In bivariate analysis, the Cy3 and Cy5 channels were considered together and the bivariate density function (x) of the microarray data was modeled as follows:

The first and second weighted bivariate normal density components correspond to the posterior probabilities of the low (unreliable) and high (reliable) intensity classes of data, respectively.

The results of bivariate mixture estimation, i.e. mean vector μ, covariance matrix Σ, and the weights of the two bivariate density components for each data set are shown in Table 2 {bivariate analysis). As the conelation figures presented in Table 1 indicate, there is a high association between Cy3 and Cy5 channels and considering them separately is not justified. Although the analysis is slightly more complicated, bivariate mixture modeling preserves the two-dimensional nature of the data and estimates the bivariate normal components.

The results of bivariate mixture modeling are depicted in Fig. 2 (A-B). In Fig. 2 Cy3 (green channel) and Cy5 (red channel) scatter plots for two datasets of two independent experiments {A and B) are shown. The grey "+" marks denote the individual signal intensity data points, whereas the up and down triangles denote true positives and true negatives (as provided in the reference/validation sets), respectively. The large black dots and ellipses indicate the centers (means) and the variances of the two bivariate normal mixture component ellipses 1 and 2 as indicated by dashed and dotted lines, respectively (Table 2). The optimal thresholds for Cy3 and Cy5 channels obtained using univariate analysis are shown by the vertical and horizontal solid lines, respectively. The dashed vertical and horizontal lines show the optimal thresholds obtained using Fielden's method (see below) for Cy3 and Cy5, respectively. The region of unreliable signal observations in univariate analysis is the lower-left rectangular corner of the two-dimensional space, bounded by the thresholds. The thick ellipse shows the optimal decision boundary obtained by using bivariate mixture modeling approach. The region of the unreliable observations is the interior of the decision ellipse.

Classification Methodology using Univariate and Bivariate Mixture Modeling

In univariate analysis, the optimal classification threshold for each channel can be found by equating the class posterior probabilities. From Eq. (5. a), the g-threshold (Cy3 signal intensity threshold) is the solution of π_gN{x;μ_lg,σ _g) = (1 -π_g)N{x;μ_2g,σ₂ ² _g) ,

and similarly from Eq. (5.b), the r-ihreshold (Cy5 signal intensity threshold) is the solution of

= {\ -π_r)N{x;μ_2r,σ₂ ² _r) .

As may be seen in Fig. 1 (A-B), for each channel, the optimal threshold is exactly at the intersection of the two weighted normal density components that models or approximates its density. Since the mixture model parameter estimation was done on the log-transformed intensity data, the exponential of the optimal threshold that was found in the log-domain was taken and the conesponding optimal threshold on the raw intensity was obtained. The g- and r- thresholds that have been obtained in this manner for the two data sets are reported in Table 3. Table 3. Classification results and comparisons with the reference classifications.

Univariate analysis Bivariate analysis Fielden's method

Data Reliable Unreliable Total Reliable Unreliable Total Reliable Unreliable Total set True positive 26 0 26 26 0 26 26 0 26 True negative 1 26 27 0 27 27 23 4 27 Total 27 26 53 26 27 53 49 4 53 Sensitivity 100 100 100

1 Specificity 96 100 15 Observed Agreement 98 100 57 κ-(95% CI) 0.962 (0.889 - 1.036) 1.000 (1.000 - 1.000) 0.146 ( [-0.117 - 0.408) Green Ch. Threshold 1001 NA 511 Red Ch. Threshold 761 NA 256 True positive 26 1 27 27 0 27 27 0 27 True negative 2 25 27 0 27 27 7 20 27 Total 28 26 54 27 27 54 34 20 54 Specificity 96 100% 93%

2 Sensitivity 93 100% 96% Observed Agreement 94 100% 94% κ-(95% CI) 0.889 (0.767 - 1.011) 1.000 (1.000 - 1.000) 0.741 (0.562 - 0.920) Green Ch. Threshold 963 NA 550 Red Ch. Threshold 789 NA 382

An observation that the signal intensity value = [*_g x_rf should be flagged as reliable, if either of the channels is larger than its respective threshold, i.e. if x_g > g-threshold or x_r > r- threshold. Then x is identified as belonging to the class of reliable signal intensities. If this condition is not satisfied, i.e. both of the values were lower than their respective thresholds, then the observation is tagged as unreliable.

In bivariate analysis, it is not possible to talk about "optimal signal thresholds" anymore, instead we can talk about an optimal classification surface. Similar to univariate analysis, this optimal classification surface can be found by equating the two weighted mixture components or posterior probabilities that are given in Eq. (6) as, πN{x;μ Σ,_ϊ) = {Aπ)N{x;μ₂,Σ₂) The solution of this equation defines a quadratic curve and consequently the decision regions in the two-dimensional observation space. On one side of this decision curve, observations belong to one class (e.g. the population of low or unreliable signal intensities) and on the other side to the other class (e.g. the population of high or reliable signal intensities).

Validation: Comparison of Classification Performance against the Reference Sets

In order to test the classification methods, a reference or validation set was generated. The monocytic leukemia cell line, THP-1, induced by the endotoxin, LPS was used. This is a common and reproducible model in inflammation research and has been used recently in large-scale expression study in Serial Analysis of Gene Expression (SAGE) and microanay approaches. The prior knowledge about the expression of some genes facilitates construction of a good reference set against which the performance of different classification algorithms can be compared. Therefore, a reference classification or validation set for each case that was studied to test the proposed algorithms was constructed. The validation sets contained only true positives, i.e. reliable observations, and true negatives, i.e. unreliable observations, that are classified based on the prior knowledge about the expression the conesponding genes in literature and in large-scale expression analysis in SAGE and microarray experiments. The true positive observations are those for which the expression information judged from the microanay elements agree with the prior knowledge about their expressions while false observations are those for which their microarray elements (due to different problems such as failure of amplifications, other low probe concentration, or under-printing, etc.) are low in signal intensity and are not consistent with the expected expression or inducibility of these elements. An example is interleukin-8 which is highly and consistently inducible in endo- toxin-stimulated monocytes, the same cellular model which the microanay data derived from. In the anays, IL-8 probe was generated from two different sources; one source is clone- containing culture stock that failed to amplify. Another example is eukaryotic elongation factor 2 probe that is commonly expressed, and thus, act as a housekeeping gene; their elements were present both as true signals or false signals due to optimal and sub-optimal probe concentrations, respectively.

Prior knowledge about the expression of many genes also facilitates the discrimination between true and false positives in microanay data. For each data set for testing the proposed methods, a reference validation set of 54 genes, was compiled. The results agreed in large- scale expression analysis in both SAGE and microarray data.

To quantitatively assess how well our method classifies the signal intensities in the reference/validation gene set into the two components of reliable and unreliable signal intensities, we computed the kappa coefficient of agreement, along with the sensitivity, specificity, and observed agreement rates (Table 3). Both univariate and bivariate methods resulted in ability to classify almost all of the true positives (Sensitivity is 96-100%) and true negatives (Specificity, 93-100%, depending on the dataset) in the reference gene set as reliable and unreliable signal intensities, respectively (Table 3). The accuracy of the method can also be statistically evaluated using kappa coefficient of agreement in which a value of 1 indicates an excellent (e.g., 100%) agreement. As the kappa values indicated, the best agreement with the reference classification information available in the validation sets was obtained using our classification technique based on bivariate modeling, with 100% agreement rate compared to 96% and 94% of univariate modeling for dataset 1 and 2, respectively (Table 3).

Table 4 shows examples of utility of our bivariate analysis-based classification method using four expressed genes (of the ~50 genes in the reference set) that are well-known to be induced in human monocytes by endotoxin (LPS), in addition to data on one housekeeping gene.

The univariate and bivariate classification approaches have been compared with Fielden's method, a method for inclusion of reliable signal intensities. In this method, the threshold for each channel is obtained as

MedianBackG + 3xSD_BackG

where MedianBackG and SDβackG refer to the median and standard deviation of the signal intensity estimated from the spot backgrounds. These thresholds are shown with dashed lines in Figures 2 (A-B).

In the case of univariate analysis, the observations that would be identified as belonging to the class of unreliable data fall into the lower left quarter of the 2-D space, bounded on the right by the green-threshold and on the top by the red-threshold. Whereas, in the case of bivariate analysis, the region for the class of unreliable data is the interior of an ellipse, shown in Fig- ures 2 (A-B). In each case it was noted that the classification using bivariate method provides the best results in terms of agreement with the reference classification. Table 3 present 2 by 2 agreement comparisons that show how well each of these classification methods are performing with respect to the reference classification for the three sample cases that we have studied. To assess how well a method classifies the observations in the reference or test sets into groups, the kappa statistics along with the sensitivity, specificity, and observed agreement rates was calculated. The strength of agreement is considered "good" when the kappa values are close to 1. The optimal threshold values, obtained using univariate mixture modeling approach and Fielden's method, for the individual channels are also reported in the table.

Table 4 Examples of validation of^* the algorithm using reference list of genes known to be expressed in endo- toxin-stimulated human monocytes. ID Gene Cy3 Cy5 Cy5/Cy3 Expression Algorithm Status³ Validation^b Hs.624 Interleukin-8 1501 9252 6.16 True Reliable (LPS inducible Gene) 2084 10287 4.93 True Reliable 338 605 1.79 False Unreliable 492 415 0.84 False Unreliable

Hs.1722 Interleukin lα 1805 3013 1.7 True Reliable (LPS inducible Gene) 2220 3500 1.6 True Reliable 348 459 1.3 False Unreliable 291 364 1.2 False Unreliable Hs.211600 TNF- induced protein 3 1989 8891 4.47 True Reliable (LPS inducible Gene) 1554 6107 3.93 True Reliable 2585 8281 3.20 True Reliable 2652 7929 2.99 True Reliable 441 819 1.85 False Unreliable 641 829 1.29 False Unreliable Hs.196384 Cyclooxygenase 2 751 677 0.92 False Unreliable (LPS inducible Gene) 662 617 0.93 False Unreliable 5086 7166 1.41 True Reliable 751 677 0.90 False Unreliable 320 469 1.47 True Reliable 3748 5448 1.45 True Reliable 478 541 1.13 False Unreliable 320 469 1.47 True Reliable Eukaryotic translation elongation True Reliable Hs. 75309 factor 2 9808 11188 1.1 (Housekeeping Gene) 13254 13733 1.0 True Reliable 623 313 0.5 False Unreliable 358 104 0.3 False Unreliable ^a Gene expression status of known endotoxin (LPS) inducible genes: If the gene is known to be upregulated (>1.4) in LPS-stimulated human monocytes then, the status is true. ^b Results of the utility of our described algorithm, using bivariate Normal Mixture modeling, in discriminating reliable from unreliable signals and subsequently producing accurate gene expression ratios. As the kappa figures indicate, the best agreement with the reference classification information available in the validation sets is obtained using the classification technique according to the invention based on bivariate modeling. The next best performer is the thresholding technique according to the invention based on univariate mixture modeling and finally Fielden's thresholding method has the lowest agreement rate.

Although the microarray data are of bivariate nature, i.e. pairs of green and red readings obtained from the same DNA spots, both univariate and bivariate mixture modeling was tried to classify them. Both techniques, when applied to validation sets, have produced excellent results in terms of agreement with the reference classification information that was available in the validation sets. However, classification based on bivariate mixture modeling, fully exploiting the bivariate nature of the data, i.e. there is a large conelation between Cy3 and Cy5 channels of data, yielded superior results. The results of the classification method according to the invention were also compared with those of a cunently utilized simple thresholding method and it was observed that the classification algorithm according to the invention performed better in all cases.

The previous implications are very important because a solid statistical modeling and classification approach was established to tackle problem of eliminating unreliable data from microanay experiments. The result of microanay experiment, i.e. gene expression information, is very crucial; as such information is key to the important biological and clinical decisions. Therefore, the exclusion of unreliable data, i.e. minimizing false positive rate and while keeping highest possible amount of reliable data, i.e. maximizing true positive rate, in microanay experiments will not only increase the reliability of the results but will also reduce the cost by eliminating preventable validation or post-analysis follow-up tasks.

The use of the described approach in determining optimum signal thresholds and subsequently to eliminate false positives significantly improves the reliability and reproducibility of microanay data.

Claims

1. Method of classifying DNA microarray intensity data into two classes of reliable and unreliable data, comprising the steps of: a) extracting median or mean intensities of two fluorescent signals of different wavelengths from cDNA sequences ananged in a microanay as λi -channel and λ₂-channel raw data; b) estimating normal mixture model parameters conesponding to class conditional densities and prior probabilities using expectation maximization algorithm; c) determining in univariate analysis an optimal signal threshold for each channel or in bivariate analysis a threshold surface for said microanay raw data using Bayesian classification theory; d) denoting a signal intensity value as reliable, if either of the channels is larger than its respective threshold in univariate analysis or if the 2-dimentional data point falls outside of a hyper-quadratic decision region in bivariate analysis.

2. Method of claim 1, characterized in that the expectation maximization algorithm comprises the steps of: a) estimating initial mixture parameters μ, Σ b) determining or estimating the posterior probability that an observation belongs to the ith class given by

*t / ^χj T) ^{; i = 1> 2}' '^{k; j = ■}*^{» 2>} '^n; c) determining new prior probabilities given by n π, =∑f_y In d) determining new updates of the mean vector },. and the covariance matrix ∑i according to

e) repeating expectation and maximization steps b) to d) until updates/changes in mixture parameter estimates are smaller than a preset tolerance level or a certain number of iterations is reached.

3. Method according to claim 1 or 2, characterized in that in univariate analysis the probability density for each channel is given by the sum of the weighted normal density components of the low (unreliable) and high (reliable) intensity class of data and is given by the equation f (x) = π_tN{x; μ_u , σ² ) + (1 - π_t )N{x; μ_u , σ ² _1;) wherein N(x; t_;,σ_r. ) = is the univariate normal probability

density function for the i-th channel, μ. is the mean and σ ._t2 is the variance, x denotes the signal intensity, subscript i denotes the g- channel and the r- channel, respectively, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

4. Method according to claims 1 and 2, characterized in that in bivariate analysis the probability density is given by the sum of the weighted normal density components of the low (unreliable) and high (reliable) intensity class of data and is given by the equation f{x) = πN ( ; μi, ∑i) + (1 - π) N(x; μ₂, Σ₂), wherein N (x; μ_i5 ∑ = {2π)^'d/2\ Σ_; |^{"1 2}exp {-(x - μ_i)⁷∑_i ^"1 (x - μi) / 2} is the multivariate normal probability density for the i-th channel, μ. is the dxl mean vector, Σ; is the dxd covariance matrix, x is the signal intensity, subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively, and | ∑j | refers to the determination of the covariance matrix ∑i and superscript T refers to the transpose operation.

5. Method according to one of claims 1 to 3, characterized in that in univariate analysis the optimal classification threshold for each channel is obtained by equating the class posterior probabilities, i.e. is given for each channel i by the solution of π_gN[x;μ ,σ ² ) = (1 - π₍)N{x; μ₂i, σ² _2l), wherein N{x;μ_i,σ_i ) = _{1S t}he univariate normal probability

density function for the i-th channel, μ. is the mean and σ is the variance, x denotes the signal intensity, subscript i denotes the g- channel and the r-channel, respectively, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

6. Method according to one of claims 1, 2 or 4, characterized in that in bivariate analysis the optimal classification surface is obtained by equating the class posterior probabilities, i.e. is given by the solution of πN{x; μ_l,Σ_l) = Q.-π)N{x;μ,Σ₂) , wherein N(x; μ_i , Σ,) is the multivariate normal probability density, μ. is the dx 1 mean vector, Σ . is the covariance matrix, x is the signal density, and subscripts 1 and 2 denote the weighted normal density components representing the low (unreliable) and high (reliable) intensity class of data, respectively.

7. Method according to one of the preceding claims implemented by a computer program.

8. Computer program product containing a computer program controlling a method according to claim 7.

9. Apparatus for classifying DΝA microanay intensity data comprising a light source; a detector; a microarray holder; an interface connected to the detector and to a data processing means; and means for carrying out the method according to one of claims 1 to 6.

10. Apparatus according to claim 9 characterized in that the light source is a laser preferably emitting at 532 nm and 635 nm.

1. Apparatus according to claim 9 or 10 characterized in that the data processing means comprises a microprocessor and a memory.