EP1141878A2 - Normalization, scaling and difference finding between data sets - Google Patents

Normalization, scaling and difference finding between data sets

Info

Publication number
EP1141878A2
Authority
EP
European Patent Office
Prior art keywords
data set
calculation
scaling
representation
display means
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00903107A
Other languages
German (de)
English (en)
Inventor
Joel S. Bader
Yi Liu
Stephen Gold
Darius Dziuda
Vladimir Gusev
Richard S. Judson
Gregory T. Went
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CuraGen Corp
Original Assignee
CuraGen Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CuraGen Corp filed Critical CuraGen Corp
Priority claimed from PCT/US2000/000167 external-priority patent/WO2000041122A2/fr
Publication of EP1141878A2 publication Critical patent/EP1141878A2/fr

Definitions

  • This invention relates to statistical analysis of differences between at least two data sets.
  • Measurements of the expression levels of individual genes within a cell provide a wealth of information about cellular processes. This is done by extracting messenger RNA molecules (mRNA) from a cell, possibly converting these to more stable cDNA molecules, and measuring the concentrations of each individual species by methods such as differential display or hybridization.
  • mRNA: messenger RNA molecules
  • a typical analysis strategy is to identify genes whose expression levels differ between particular biological states.
  • One difficulty in performing such an analysis is that experimental measurements of expression levels include variation due to noise. Distinguishing the true differences from the false differences (those due simply to noise) has presented a challenge for gene expression analysis.
  • the relevance of this problem is that genes that are differentially regulated can be converted to commercial products, including protein therapeutics, antibody targets, therapeutic markers, as well as conventional drug targets.
  • One of the major sources of noise in such experiments is that the amount of material analyzed, such as mRNA or cDNA, can differ from experiment to experiment, or among the replicates of a single experiment.
  • Data analysis strategies typically account for this overall variation by performing a global scaling of all the measurements from such experiments. For example, if sample A has twice the overall cDNA concentration of sample B, then the expression level for a gene in sample B must be doubled before comparison with sample A. Often, however, such an overall scaling is not sufficient to discriminate between true differences and those that can be attributed to noise.
  • One source of difficulty is identifying the particular features in the data set or sets that can be used as scaling landmarks. It is not always possible to identify such unchanging features a priori.
  • the amount of gene expression is related to the amount of PCR product generated in an amplification reaction.
  • the amount of product can depend on the activity of the polymerase enzymes as well as the length of a fragment being replicated. If the enzyme functions effectively, the amount of PCR product is uniformly high from small fragments to long fragments. If the enzyme activity is less effective, however, the amount of PCR product can be relatively less for long fragments than for short fragments.
  • An overall scaling does not account for the non-uniform tapering of the signal with the size of the amplicon.
  • the present invention discloses a method of identifying a difference between at least two groups, wherein each group comprises a data set containing ordered elements.
  • the method includes the steps of: (a) providing a first group having one or more elements in a first data set; (b) applying at least one transformation to said first data set to provide a transformed data set, wherein said transformation is a calculation selected from a normalizing calculation, an averaging calculation and a scaling calculation; and (c) distinguishing differences, if present, between elements of said first transformed data set and a second group having one or more elements in a second data set; thereby identifying a difference between the data sets.
  • the method corrects the effects of the noise prior to distinguishing the differences.
  • regions in a data set that do not contain useful data may be masked, and regions that have a higher information content may be highlighted. Masked regions include those where the signal intensity in the data set is either too low (noise) or too high (saturation) for accurate measurement; highlighted regions include locations of local peaks.
  • the noise includes low frequency noise.
  • the noise includes jiggle. In the latter embodiment, the jiggle includes positional shifts of elements between different data sets and signal alignment within a data set. In a further embodiment, the jiggle is corrected. In another embodiment, correction of jiggle may be considered as signal alignment between two or more data sets.
  • the elements of a data set may represent, for example, a trace, such as a trace arising in an electrophoretogram or a chromatogram.
  • an element of a data set represents a position in a reagent array.
  • the position in the reagent array determines one extent of matching between a reagent affixed to the array at the position and a sample contacting the array position.
  • the reagent may be a first nucleic acid and the sample may include a second nucleic acid.
  • a data set is obtained in an experiment related to identifying significant differences in gene expression.
  • a group includes more than one individual.
  • a data set of the method may be subjected to a masking operation.
  • the data set for each group is obtained by applying at least one calculation, chosen from a normalizing calculation, an averaging calculation and a scaling calculation, to the data sets from each individual.
  • each individual provides at least one replicate sample that is employed to provide a trace.
  • the traces for each individual are advantageously transformed by applying at least one of a normalizing calculation, an averaging calculation and a scaling calculation to the trace from each replicate; and in additional such embodiments, the traces for each replicate are discretized prior to the normalizing, the averaging and/or the scaling.
  • the normalization includes adjusting each data set such that a subset of elements in each data set has similar or identical values.
  • the averaging includes calculating an average for a location or for a discretized position across a collection of data sets. The average may be an unweighted average or a weighted average.
  • the scaling includes a calculation that causes a first data set to resemble a second data set except that an element in the scaled first data set whose intensity differs significantly from the intensity of the element in the second data set at the same location or the same position contributes to identifying the difference between the data sets.
  • the scaling includes calculating a distance between the data sets, or calculating a similarity between the data sets.
  • the scaling calculation employs a scaling function; and in other embodiments, the scaling function is a basis set expansion, such as a piecewise linear basis set or a direct product of basis functions.
  • successive iterations of a cycle that includes at least one of a normalization calculation, an averaging calculation and a scaling calculation are carried out until a specified termination condition has been satisfied.
  • the termination condition is that the transformed data set has converged.
  • the termination condition is that a predetermined number of iterations has been reached.
  • the distinguishing of differences among the elements of the transformed data sets includes application of a difference finding algorithm.
  • the invention also discloses a display means that displays a representation of a difference between data sets, and also discloses the representation itself, wherein the representation is obtained in general by applying methods disclosed herein to the data sets.
  • FIG. 1 is a graphic representation of jiggle arising between two traces.
  • FIG. 2 is a schematic diagram illustrating the flow from different groups to the transformed data sets for those groups.
  • FIG. 3 is a schematic representation of the estimation of the experimental noise level in a data set.
  • FIG. 4 is a schematic representation of averages of three replicate raw traces for each individual animal in Example 1, prior to normalization or scaling.
  • FIG. 5 is a schematic representation of averages of 3 normalized replicate traces for each individual animal in Example 1.
  • FIG. 6 is a schematic representation of scaling factors employed to scale the phenobarbital-treated individual average to the sterile-water-treated individual average in Example 1, obtained as a result of iterative scaling.
  • FIG. 7 is a schematic representation of iteratively scaled traces for each individual animal in Example 1.
  • the invention discloses methods for normalizing, scaling, and difference finding that may be used in any experimental study in which noise and other uncontrolled variations exist between data sets, including studies of gene expression.
  • these methods have been adapted for differential display experiments, in which gene expression levels are represented by fragment intensities in, for example, an electrophoresis trace.
  • the methods are also applicable to, for example, hybridization experiments, such as those employed with nucleic acid microchip arrays, as well as other experiments not related to gene expression.
  • the present invention discloses a method of identifying a difference between at least two groups, wherein each group comprises a data set containing ordered elements.
  • the method includes the steps of: (a) providing a first group having one or more elements in a first data set; (b) applying at least one transformation to said first data set to provide a transformed data set, wherein said transformation is a calculation selected from a normalizing calculation, an averaging calculation and a scaling calculation; and (c) distinguishing differences, if present, between elements of said first transformed data set and a second group having one or more elements in a second data set; thereby identifying a difference between the data sets.
  • Each data set can be represented as a set of discretized intensity values, or elements in the data set.
  • the intensity of an element may include effects of noise, as described below, and the method operates to correct the differences for the effects of the noise.
  • the noise includes low frequency noise.
  • the noise includes differences in jiggle.
  • the jiggle includes positional phase shifts of elements between different data sets.
  • the invention also discloses a display means that displays a representation of a difference between data sets, and also discloses the representation itself, wherein the representation is obtained in general by applying methods disclosed herein to the data sets.
  • the term "representation" relates to any graphical, visual, or equivalent non-verbal display that provides an image of the results, such as differences between data sets, obtained according to the methods of the present invention. More specifically, a "representation" of the invention is obtained by transforming the quantitative results gathered by experiments underlying the invention. Examples of such data include, by way of non-limiting example, traces from differential gene expression, intensities from an array, and equivalent types of experimental parameters.
  • a representation of the invention is generated by algorithms executed in a computer and is suitable for display on a display means, such as a display screen or monitor, employed in the operation of the computer.
  • the representation is also suitable for storing in a storage module or data archive of such a computer. It is still further suitable for printing from the computer onto a medium such as paper or equivalent physical medium, and for recording it onto a portable storage medium, including, for example, magnetic media, CD ROMs and equivalent storage media.
  • display means includes any of the objects and media identified above in this paragraph, as well as equivalent apparatuses and objects suitable for displaying the results of computational processes for visual inspection.
  • normalization is defined herein as a means for standardizing or correcting elements in a data set, for example, but not by way of limitation, for correcting overall signal strength within a given data set.
  • Features of given elements to be normalized are first identified within a data set. For example, one such feature may be the median peak height of signals within a data set.
  • a summary statistic for the given feature is generated for a data set, and used to normalize the elements, as described below, to allow comparisons across data sets.
  • Algorithms that are designed to either mask or highlight chosen features identified among the elements of a data set may be applied prior to normalization, averaging or scaling. Such features include low intensity signal regions that comprise noise, high intensity signal regions that comprise saturation zones, and local maxima that comprise peaks.
  • Averaging is defined as combining multiple data sets to generate one average, representative data set. Data sets are combined into the representative data set in such a way that noise from any one data set so combined does not affect the other data sets used to generate the average.
  • scaling is defined as a correction for low frequency differences in signal strength across data sets.
  • the data sets may arise in any of a number of ways. Any experiment or study in which one group is compared with another may provide the data sets employed in the invention. Such groups may be distinguished by the experimental conditions experienced by the respective groups, or by the experimental state characterizing the respective groups. Experimental subjects may be animate or inanimate, or may be inanimate samples derived from animate subjects.
  • in preferred embodiments of the methods, display means and representations, the data sets arise from experiments conducted in investigations in which identification of the differential expression of a gene or genes between data sets from at least one experimental group and at least one control group is sought.
  • such differential expression arises in GeneCalling™ experiments. See, e.g., United States Patent No. 5,871,697; Shimkets et al., Nat. Biotechnol. 17: 798-803 (1999).
  • such differential expression is evaluated using nucleic acid microchip arrays in order to detect the presence, absence or extent of expression of a gene or gene fragment. Any alternative, equivalent differential expression formats and methods of analysis are encompassed within the scope of the present invention as well.
  • noise may arise during the course of gathering the data elements comprising the data sets.
  • Nonlimiting examples of noise include intensity noise and extension noise leading to longitudinal differences.
  • Intensity noise includes relatively high frequency noise such as that commonly associated with short-time fluctuations in the electronic and/or mechanical components of an experimental system.
  • High frequency noise may be defined as having a frequency greater than about 1 Hz. Examples of high frequency noise include shot noise in photodetectors and comparable electronic noise arising in the various electronic components and circuits of an experimental instrument employed in gathering the data elements of a data set.
  • Low frequency noise has a frequency less than about 1 Hz, and may have frequencies less than about 0.1 Hz, or less than about 0.01 Hz, or even less than about 0.001 Hz or lower.
  • Such low frequency noise may arise during an experiment, for example, by decay of activity of a reagent, catalyst or enzyme during the course of preparing a sample that is applied to generate a data set.
  • an uncompensated low frequency change in response of an electronic instrument may arise during the time in which a data set is being gathered.
  • if an array is being used to generate the data set, uncompensated variations in detection across the various positions and/or dimensions of the array may arise that behave as low frequency noise (i.e., they may be considered low frequency noise even though an array may be subjected to simultaneous detection of all the sample points on the array, since positional variations behave as if they have a long wavelength across the array).
  • Equivalent sources of low frequency noise are also encompassed in this definition. Normalization and scaling algorithms employed are particularly effective in minimizing or eliminating the effects of low frequency noise.
  • An additional detrimental effect that may arise in identifying differences between data sets is termed "jiggle".
  • Longitudinal displacement relates to variation in the location or discretized position of a particular feature in a trace even though the feature appears in the traces of more than one group.
  • uncompensated variation in the location or discretized position of the feature may occur due to variations in physical or chemical conditions during the process of accumulating the data elements of the various data sets being considered.
  • Such a variation, or jiggle, may be considered to be low frequency noise in the longitudinal, or positional, direction. Jiggle is illustrated in FIG. 1.
  • In FIG. 1, two discretized, normalized data sets, A(n) and B(n), are shown.
  • A(n) and B(n) should be thought of as each representing the same feature. Nevertheless, they are displayed with a jiggle of 1.75 units on the n axis.
  • the normalization and scaling algorithms employed are particularly effective in compensating for and/or overcoming the effects of low frequency longitudinal noise. Such procedures, as employed in the methods of the present invention, largely or completely eliminate the jiggle and, referring to FIG. 1, restore the overlap of the points for A(n) and B(n). Compensation for jiggle as shown in FIG. 1 is also termed "signal alignment."
  • Groups, Individuals, Replicates and Transformed Data. A hierarchy of notation is used herein to indicate data elements and/or the data sets. These notations are discussed below and furthermore are illustrated in the flow diagram presented in FIG. 2.
  • Raw, i.e., untreated or untransformed, data arise from carrying out experiments on actual samples obtained from experimental groups.
  • a "group” represents a particular experimental state or condition.
  • the groups are denoted herein in capital letters A, B, ... without any indices or delimiters.
  • As shown in FIG. 2, at least two groups comprise the subject matter on which the methods, display means and representations of the present invention are based. Each group gives rise to data elements comprising data sets after samples from the groups have been subjected to a given experimental method of detection or analysis.
  • a group may be initially composed of one or more individuals. The number of individuals is not fixed or constant, but may vary. Each individual in the group is subjected to the same experimental conditions or experimental state. For the case of animate groups, each individual may represent an individual animal, a plant (such as a seedling), or a set of cells grown in cell or tissue culture.
  • each individual may represent, again by way of nonlimiting example, a separate execution of a particular experimental protocol such as a synthetic or preparative procedure, or the implementation of a particular set of physical conditions on separate samples or objects.
  • Equivalent ways of designating individuals of a group are encompassed within the scope of the present invention.
  • the data sets obtained from the individuals of a group may be transformed by any one or more of the normalization, averaging and scaling calculations of this invention in arriving at the differences determined by the present methods.
  • each individual of a group may furnish one or more replicate samples for detection or analysis according to the experimental method employed. Such replicates also represent raw, or untreated, data.
  • Each replicate of an individual is designated with a second index or delimiter j, shown, for example, by a_ij, b_ij, ... (see FIG. 2).
  • the number of replicates may vary due to experimental circumstances. Commonly replicates are obtained by repetitive sampling from the same individual.
  • the data sets obtained from the replicates of a particular individual may be operated upon by any one or more of the normalization, averaging and scaling calculations of the present invention in arriving at the differences determined by the present methods.
  • the normalization, averaging and/or scaling calculations that are applied to the replicates may be applied prior to, or simultaneously with, the similar calculations applied to the individuals and discussed in the preceding paragraph.
  • each data set may be considered to be comprised of an infinite number of data elements designated using a further delimiter, a_ij(x), where x denotes the continuous longitudinal dimension of the analytical method.
  • Such data sets, therefore, in general need not carry the additional delimiter x.
  • the delimiter n replaces the delimiter x when the intensity trace has been discretized; i.e., a_ij(x) becomes a_ij(n) (see FIG. 2).
  • any data sets that have been transformed using the calculations disclosed herein are designated in upper case letters including an index and/or a delimiter.
  • the transformations may include at least one operation chosen from among normalization, averaging and scaling.
  • a transformation that operates to combine the replicates of an individual while leaving the individuals of a group intact results in a transformed data set indicated by one index and a delimiter, A_i(n), B_i(n), ... (see FIG. 2).
  • Further transformation that operates to combine the individuals of a group to provide a single data set for an entire group is designated by a delimiter only, as shown, for example, by A(n), B(n), ... (see FIG. 2).
  • the A_i(n), B_i(n), ... are obtained directly without discretization. They may still arise from replicate samples, however.
  • when the detection method is based on an array, one or more positions in the array may represent the results of one or more replicates, respectively.
  • a particular embodiment of a data set envisioned in the present invention is differential display.
  • mRNA is extracted from a sample, converted to cDNA, and digested with restriction enzymes into fragments (United States Patent No. 5,871,697; Shimkets et al., Nat. Biotechnol. 17: 798-803 (1999)). The fragments are then separated according to length using electrophoresis.
  • although nucleic acids consist of an integer number of nucleotides, their electrophoretic transport properties also depend on the nucleotide composition.
  • Electrophoresis experiments currently in use measure the length of a nucleic acid fragment determined electrophoretically to a precision of 0.1 nt, and the measured electrophoretic length is usually within 1-2 nt of the actual number of nucleotides in the fragment.
  • the intensity signal a(x) of the electrophoresis trace for sample A represents the amount of fragments of electrophoretic length x generated from the sample.
  • the intensity a(x) also depends on the particular restriction enzymes used to generate fragments; for simplicity, this dependence is suppressed in the notation.
  • the intensity at length x corresponds to a single fragment; sometimes multiple fragments have the same length and their signals are combined; sometimes no fragments are present and a(x) is a baseline signal.
  • because a(x) is a measured intensity, it should be a positive quantity.
  • the intensity a(x) from samples in group A is compared with the intensity b(x) generated using an identical protocol from samples in group B.
  • Some inadvertent differences between A and B can be attributed to underlying genetic variation in the individuals chosen.
  • samples A and B may include organisms or individuals having an allelic variation between them that generates a difference in the measured expression levels but has no biological relevance in the context of the particular experimental study.
  • a neutral single nucleotide polymorphism (SNP) can add or remove a band. For this reason it is preferable to include multiple organisms or individuals for samples A and B to control for these types of individual differences.
  • the expression profile of the i-th individual of group A is denoted a_i(x), and similarly b_i(x) is the expression profile for individual i of group B.
  • each organism or individual may have multiple experimental replicates of its expression profile.
  • the j-th expression profile, or replicate, of the i-th individual of group A is denoted a_ij(x), and similarly for group B.
  • each group may have one or more individuals, and each individual may have one or more replicates. More elaborate hierarchies are also possible and may be analyzed directly with the methods outlined below.
  • An alternative embodiment of an experimental system relates to hybridization.
  • a_ij(x) represents the intensity from the j-th experimental replicate of the i-th organism in group A measured at position x on a hybridization array or chip.
  • x is a two-dimensional coordinate that identifies the location of a particular spot on the hybridization surface.
  • the term "trace" is used herein to denote a one-dimensional data set.
  • the terms "array data" or "hybridization data" are used herein to represent a two-dimensional data set. Terms such as data set, signal, and intensity may represent one- or two-dimensional data sets.
  • repeated experiments such as hybridization experiments conducted on a series of biological samples collected over time, may have additional dimensions.
  • each time point in a time course study generates a two-dimensional plane of data, and the time coordinate adds a third dimension.
  • the methods disclosed herein are applicable to data sets such as these, as well as to those of the preceding paragraphs. In full generality the methods disclosed herein are generally applicable to any experimental study that generates multi-dimensional data sets. In particular cases, attention may be restricted to a particular dimensionality of data sets, as the specific character of the study may provide.
  • the terms "signal” and "intensity" may be considered interchangeable references to either measured, normalized, scaled, or averaged data. Data sets can have many representations in the memory of a computer.
  • each data set can be represented as a set of discretized intensity values, or elements in the data set.
  • the discretization interval Δx is preferably close to the reproducibility of the instrument. With currently available instruments, a value of 0.1 nt is preferable. For raw hybridization images, a value corresponding to a single pixel in an image is preferable. For processed hybridization images, it is preferable that each discretization point represent an individual spot with a different probe.
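For illustration, a minimal Python sketch of such a discretization, assuming the raw trace is supplied as parallel arrays of positions x and intensities; the function name and the bin-by-summation rule are assumptions, not taken from the text above:

```python
import numpy as np

def discretize_trace(x, intensity, dx=0.1):
    """Bin a continuous trace a(x) onto a uniform grid a(n) with spacing dx.

    Sketch only: intensities whose positions fall into the same bin are summed.
    dx = 0.1 nt follows the instrument reproducibility quoted in the text.
    """
    x = np.asarray(x, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    edges = np.arange(x.min(), x.max() + dx, dx)           # discretized positions n
    bins = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 1)
    a_n = np.zeros(len(edges))
    np.add.at(a_n, bins, intensity)                         # accumulate intensity per bin
    return edges, a_n
```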
  • the invention allows for the identification of a difference between data sets, wherein each data set contains data elements as described above.
  • the differences are identified by operating on at least two transformed data sets, A(n) and B(n), to discern particular discrete positions n at which differences that exceed a lower limit of distinction are found.
  • an advantage of the methods is that they use algorithms that automatically identify scaling landmarks that may exist in the data sets being evaluated.
  • the noise mask m_noise(n) depends on a noise level I_noise that characterizes the experimental uncertainty in the measured intensity. This uncertainty may be estimated, for example, as the standard deviation of the background signal obtained for a blank or control sample. I_noise may also preferably be assigned a value that is a small multiple (0.2X to 5X) of the low end of the dynamic range of the detection instrument.
  • the noise mask is calculated as follows:
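The listing of the noise-mask rule is not reproduced above. A minimal sketch of one plausible reading, in which any position whose intensity does not rise above I_noise is masked (the thresholding rule and function name are assumptions):

```python
import numpy as np

def noise_mask(a, i_noise):
    """m_noise(n) = 1 where a(n) is at or below the noise level I_noise, else 0.

    Assumed thresholding rule; the exact rule in the original document is not
    reproduced in the text above.
    """
    return (np.asarray(a, dtype=float) <= i_noise).astype(int)
```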
  • the saturation mask m_sat(n) marks data collected near the upper limit of the detection range of an instrument.
  • the mask depends on a saturation level, as follows: • For each position n:
  • the threshold I_sat is preferably close to the high end of the dynamic range of the detection instrument (0.95X to 1X).
  • points that are marked as not saturated are checked for saturation in a second pass that depends on a saturation width w_sat, as follows:
  • o m_sat(n) = 1 if a(n) is part of a plateau of constant value over a range of ±w_sat in each dimension.
  • m_sat(n) = 1 for each of these points.
  • m_sat(n) = 1 only for the center point; the remaining points require their own saturation checks.
  • o m_sat(n) = 0 otherwise.
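A hedged sketch of a one-dimensional saturation mask; the plateau interpretation in the second pass and the default w_sat value are assumptions based on the description above:

```python
import numpy as np

def saturation_mask(a, i_sat, w_sat=2):
    """Sketch of a one-dimensional saturation mask m_sat(n).

    Pass 1: mark points at or above the saturation threshold I_sat.
    Pass 2 (one assumed reading of the plateau rule): also mark a point whose
    neighbourhood n-w_sat..n+w_sat is a plateau of constant value, since a
    clipped detector produces flat-topped signals even below I_sat.
    """
    a = np.asarray(a, dtype=float)
    mask = (a >= i_sat).astype(int)
    for n in range(w_sat, len(a) - w_sat):
        window = a[n - w_sat:n + w_sat + 1]
        if mask[n] == 0 and np.all(window == a[n]):
            mask[n] = 1
    return mask
```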
  • a parameter w_peak defines the half-width for a peak as follows:
  • m_peak(n) = 1 if a(n + Δn') < a(n + Δn) for all Δn and Δn' such that Δn' is farther than Δn from n, Δn' is no farther than w_peak from n, and
  • Δn and Δn' are identical except for a single dimension in which they differ by 1.
  • distances may be calculated by any of a number of methods, including the Euclidean metric, the Manhattan metric, or the maximum absolute difference in any dimension.
  • for one-dimensional data, m_peak(n) = 1 if a(n−w_peak) ≤ ... ≤ a(n−2) ≤ a(n−1) ≤ a(n) > a(n+1) > a(n+2) > ... > a(n+w_peak).
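A direct transcription of this one-dimensional rule as a sketch; the function name and the default half-width are illustrative only:

```python
import numpy as np

def peak_mask_1d(a, w_peak=2):
    """m_peak(n) = 1 if a(n-w_peak) <= ... <= a(n-1) <= a(n) > a(n+1) > ... > a(n+w_peak)."""
    a = np.asarray(a, dtype=float)
    mask = np.zeros(len(a), dtype=int)
    for n in range(w_peak, len(a) - w_peak):
        rising = all(a[n - k - 1] <= a[n - k] for k in range(w_peak))   # non-decreasing on the left
        falling = all(a[n + k] > a[n + k + 1] for k in range(w_peak))   # strictly decreasing on the right
        if rising and falling:
            mask[n] = 1
    return mask
```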
  • a summary peak intensity I_peak is calculated from the individual values. If desired, the values can be rank-ordered, and I_peak can be defined as the 75th percentile value (75% of peaks are smaller in value; 25% of peaks are larger). Other methods include using a different percentile, for example the median, or calculating an average value. Rank-order selection methods are more robust than averages.
  • the data set is rescaled by multiplying each point a(n) by the factor (I_norm / I_peak), where I_norm is identical for each data set and sets a convenient scale.
  • since I_norm is arbitrary, a value such as 100 is convenient.
  • the noise threshold I_noise may also be subject to the same normalization.
  • alternatively, I_noise may be set to a fixed value.
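Putting the peak-based normalization together, a hedged sketch; the helper name and defaults are assumptions, while I_norm = 100 and the 75th-percentile summary follow the text above. The peak positions could come from a peak-mask routine such as the sketch shown earlier.

```python
import numpy as np

def normalize_trace(a, peak_positions, percentile=75.0, i_norm=100.0):
    """Rescale a(n) so that the summary peak intensity I_peak maps onto I_norm.

    peak_positions: indices n where the peak mask m_peak(n) = 1.
    """
    a = np.asarray(a, dtype=float)
    peak_heights = a[np.asarray(peak_positions, dtype=int)]
    if peak_heights.size == 0:
        return a.copy()                                   # no peaks found; leave the trace unscaled
    i_peak = np.percentile(peak_heights, percentile)      # rank-order summary, robust to outliers
    return a * (i_norm / i_peak)
```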
  • Averaging is an operation that is applied to a collection of data sets.
  • the average A(n) of a collection of r data sets a_1(n), a_2(n), ..., a_r(n) is calculated as a (possibly weighted) mean of the values at each position n; in the simplest, unweighted case, A(n) = (1/r) Σ_i a_i(n).
  • weighting functions may be used, and may account for characteristics of a data set such as those discussed in the following.
  • it is preferable to calculate a standard deviation SD_A(n) to describe the distribution of data points leading to the average A(n).
  • a preferable formula for SD_A(n) is the sample standard deviation of the a_i(n) about A(n) at each position n.
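A sketch of the averaging step with an optional weighting function; the pointwise sample standard deviation used here is an assumption standing in for the exact formula, which is not reproduced above:

```python
import numpy as np

def average_and_sd(traces, weights=None):
    """Average r replicate traces a_1(n)..a_r(n) into A(n) and report SD_A(n)."""
    traces = np.asarray(traces, dtype=float)            # shape (r, N)
    A = np.average(traces, axis=0, weights=weights)     # unweighted or weighted mean per position n
    r = traces.shape[0]
    sd = np.sqrt(np.sum((traces - A) ** 2, axis=0) / max(r - 1, 1))
    return A, sd
```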
  • This extent of agreement can be measured, by way of nonlimiting example, by the distance Dist[A,B] or the similarity Sim[A,B] between two data sets A(n) and B(n).
  • Dist[A,B] = Σ_n w[A(n),B(n)] dist[A(n),B(n)], and
  • Dist[A,B] = Σ_n w[A(n),B(n)] dist[A(n),B(n)] / Σ_n w[A(n),B(n)].
  • dist[A(n),B(n)] is a function that measures the distance between two values.
  • the term w[A(n),B(n)] is a mask that determines whether the data points at location n should be included in the calculation.
  • the second formula is preferable.
  • a and b must be regularized to prevent values close to 0 from causing a divergence. This can be accomplished, for example, by replacing a or b by a minimum value I_min if either is smaller than I_min, or by adding a positive constant to raise all values A(n) and B(n) above 0.
  • a preferable upper cap for the pointwise distance is D_max = 3.
  • the weight w_A(n) is preferably [1 − m_A,noise(n)][1 − m_A,sat(n)], and similarly for w_B(n).
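As an illustration, a sketch of the masked, normalized distance. The squared log-ratio form of dist(a,b) is an assumption suggested by the distance used in Example 1 below; I_min and D_max are the regularization value and cap discussed above, and combining the two per-trace weights as a product is likewise an assumption.

```python
import numpy as np

def dist_pointwise(a, b, i_min=1.0, d_max=3.0):
    """dist(a,b): squared log-ratio, regularized by I_min and capped at D_max (assumed form)."""
    a = np.clip(np.asarray(a, dtype=float), i_min, None)
    b = np.clip(np.asarray(b, dtype=float), i_min, None)
    return np.minimum(np.log(a / b) ** 2, d_max)

def Dist(A, B, m_noise_A, m_sat_A, m_noise_B, m_sat_B):
    """Dist[A,B] = sum_n w(n) dist(A(n),B(n)) / sum_n w(n), with mask-based weights."""
    w = (1 - m_noise_A) * (1 - m_sat_A) * (1 - m_noise_B) * (1 - m_sat_B)
    return np.sum(w * dist_pointwise(A, B)) / np.sum(w)
```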
  • a similarity Sim[A,B] between two data sets A and B may be defined as
  • Sim[A,B] = Σ_n sim[A(n),B(n)], where the similarity sim(a,b) between two numbers a and b is larger when the quantities are larger and also larger when a and b are closer in value.
  • sim(a,b) = F(p) G(d), where F(p) is an increasing function of p and G(d) is a decreasing function of d; here p measures the overall magnitude of a and b (for example, (a + b)/2) and d measures the difference between them.
  • This algorithm is related to one known in the literature as a method of decomposing spectra of multicomponent mixtures into separate spectra for each of the pure components.
  • Scaling is an operation that is applied to a subordinate data set a(n) to bring it in closer agreement with a master data set A(n).
  • a scaling algorithm optimizes a scaling function s(n) to minimize the distance or maximize the similarity between the scaled slave data set s(n)a(n) and the master data set A(n).
  • the scaling function s(n) may have various mathematical representations.
  • a basis set expansion may be a cosine series, a sine series, or more generally, a Fourier series. For a one-dimensional data set, a preferred choice is a piecewise linear basis.
  • the p-th basis function is zero outside the interval n_{p−1} to n_p, with n_0 taken as the left-most point of the data set and n_P as the right-most point.
  • within this interval, s(n) = c_{p−1} + (c_p − c_{p−1})(n − n_{p−1})/(n_p − n_{p−1}).
  • for a multidimensional data set, s(n) = Σ_p c_p φ_p(n), where n and p are both d-dimensional and φ_p(n) can be expressed as
  • φ_p(n) = φ_p1(n_1) φ_p2(n_2) ... φ_pd(n_d),
  • where n_j and p_j are the components of n and p in dimension j and φ_pj(n_j) is a one-dimensional basis function in dimension j.
  • a preferred choice for the one-dimensional basis sets in the multidimensional direct product is an orthogonal basis.
  • a preferred choice for an orthogonal basis is a trigonometric basis,
  • φ_pj(n) = cos[(p_j − 1) π (n − n_j0)/(n_j1 − n_j0)], where n_j0 and n_j1 are the left-most and right-most points in dimension j.
  • the coefficients Cp are selected to minimize the distance Dist[A(n),s(n)a(n)] or maximize the similarity Sim[A(n),s(n)a(n)]. Methods to perform this optimization are well-known in the art. Preferable methods are conjugate direction minimization or conjugate gradient minimization, which use linear algebra to optimize the P basis set coefficients simultaneously. See, e.g., Press et al., NUMERICAL RECIPES IN C, THE ART OF SCIENTIFIC COMPUTING, Second Edition, Cambridge Univ. Press, Cambridge UK, 1992, Chapter 10.
  • a preferable approximation that is faster computationally than a full minimization is to obtain c_p from an interval surrounding n_p, preferably the interval from n_{p−1} to n_{p+1}, by minimizing the distance Dist[A(n), c_p a(n)] or maximizing the similarity Sim[A(n), c_p a(n)].
  • the number of piecewise linear basis functions P is selected so that the low-frequency noise in the data occurs on a length scale of L/(0.3P) or longer.
  • for L = 400 nt and a noise length scale of approximately 100 nt, P ≈ 13 is preferable (interpolation points spaced every 35 nt).
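A sketch of the fast local-fit approximation for a one-dimensional trace. Each coefficient c_p is estimated only from the interval surrounding its grid point n_p; using a weighted least-squares fit of c_p·a(n) to A(n) as the local objective is an assumption (the text leaves the choice between Dist and Sim open), and the function name and P default are illustrative.

```python
import numpy as np

def piecewise_linear_scaling(a, A, P=13, w=None):
    """Scale trace a(n) toward master A(n) using a piecewise-linear s(n)."""
    a = np.asarray(a, dtype=float)
    A = np.asarray(A, dtype=float)
    N = len(a)
    w = np.ones(N) if w is None else np.asarray(w, dtype=float)
    grid = np.linspace(0, N - 1, P)                # interpolation points n_p
    coeffs = np.ones(P)
    for p in range(P):
        lo = int(grid[max(p - 1, 0)])              # local window n_{p-1}..n_{p+1}
        hi = int(grid[min(p + 1, P - 1)]) + 1
        num = np.sum(w[lo:hi] * a[lo:hi] * A[lo:hi])
        den = np.sum(w[lo:hi] * a[lo:hi] ** 2)
        if den > 0:
            coeffs[p] = num / den                  # c_p minimizing sum w (c_p a - A)^2
    s = np.interp(np.arange(N), grid, coeffs)      # piecewise-linear scaling function s(n)
    return s * a, s
```

The local fits make the cost linear in the number of grid points, which is the computational advantage mentioned above over a full simultaneous optimization of all P coefficients.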
  • a group of data sets ⁇ aj(n) ⁇ can be brought into closer agreement with each other by first normalizing each data set, then generating an average A(n), then scaling each data set aj(n) to the average A(n), then repeating these steps. If desired, the average A(n) can be re-normalized after every iteration.
  • a preferable termination condition is that A(n) has converged. This means that the distance Dist[A(n),A'(n)] between the value of A(n) after an iteration and its value A'(n) after the next iteration is smaller than some threshold value.
  • the scaling functions sj(n) for each of the slave data sets aj(n) can be checked for convergence.
  • a second preferable termination condition is that a predetermined number of iterations has been reached.
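A sketch of the normalize–average–scale cycle with both termination conditions. It reuses the average_and_sd and piecewise_linear_scaling sketches shown earlier, assumes the traces have already been normalized, and uses a simple mean-absolute-difference stand-in for Dist[A, A'] with an illustrative tolerance.

```python
import numpy as np

def iterative_scaling(traces, max_iter=10, tol=1e-3):
    """Iterate: average the traces, scale each trace to the average, repeat."""
    scaled = [np.asarray(t, dtype=float) for t in traces]
    A_prev = None
    for _ in range(max_iter):                      # termination condition 2: iteration cap
        A, _sd = average_and_sd(scaled)
        scaled = [piecewise_linear_scaling(t, A)[0] for t in scaled]
        if A_prev is not None and np.mean(np.abs(A - A_prev)) < tol:
            break                                  # termination condition 1: A(n) has converged
        A_prev = A
    return scaled, A
```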
  • the square distance measure essentially calculates the standard deviation of the data sets.
  • minimizing the square distance is essentially identical with performing scaling that minimizes the standard deviation of the scaled traces.
  • Iterative scaling may occur at a hierarchy of levels including experimental replicates, independent individuals, and groups. Recall that the data set corresponding to experimental replicate j of organism i of group A is a_ij(n). Similarly, the data sets b_ij(n) are obtained for group B, data sets c_ij(n) for group C, and so forth for each of the groupings.
  • data sets are normalized, scaled, and averaged within each organism, then within each group, and then between groups.
  • One process is as follows:
  • the scaling terms must be back-propagated to compare averages other than the final, scaled group averages. For example, if scaled group averages are required, s(n)A(n) is used. If scaled individual averages are required, then s(n)s_i(n)A_i(n) is used. If scaled data sets are required, then s(n)s_i(n)s_ij(n)a_ij(n) is used.
  • each data set from individual i is preferably given a weight proportional to 1/(number of replicates from individual i). This gives each individual equal weight and prevents an individual with many replicates from dominating the average.
  • Other preferable methods include weighting each data set equally and weighting the data sets to give each group equal weight. If each group has equal weight, one method is to weight each replicate equally. Thus, each data set from group A is given a weight proportional to 1/(number of replicates from all the individuals belonging to group A).
  • An alternate method is to weight each data set variably to give each individual within a group equal weight. Thus, each data set from individual i of group A is given a weight proportional to 1/[(number of individuals in group A)(number of replicates in individual i)].
  • a preferable threshold for differential display data is 2 iterations.
  • Jiggling. One aspect of difference finding is comparing the heights of peaks in two data sets.
  • the same peak may occur at different positions in different data sets. For example, a peak in one replicate of a data set may occur at position n, while in a second data set it may occur at position n+1 or n−1 due to experimental variability. (See FIG. 1)
  • a jiggling algorithm identifies the peak height in a data set a(n) that corresponds to a given location n'.
  • a preferred jiggling algorithm requires a parameter w_jiggle, which describes the width of the jiggling window.
  • the preferred algorithm starts at position n' and searches for the peak in a(n) closest to n' and within distance w_jiggle.
  • the height of a(n) at this peak position is termed the jiggled height of a(n) at n'. If two peaks are within equal distance, the higher value is preferably taken as the jiggled height. If there is no peak within distance w_jiggle, then the height a(n') is the jiggled height.
  • the data range for the jiggling peak search is n'−w_jiggle through n'+w_jiggle. If n' is a peak in a(n), then the value a(n') is the jiggled height of a(n) at n'. Otherwise the positions n'±1, n'±2, ..., n'±w_jiggle are tested in turn for peaks; if a peak is found at location n'', then a(n'') is the jiggled height of a(n) at n'. If no peak is found, then a(n') is the jiggled height.
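A direct sketch of this search; the function name and the default window width are illustrative, and the peak mask is expected from a routine such as the peak-mask sketch shown earlier.

```python
import numpy as np

def jiggled_height(a, peak_mask, n_prime, w_jiggle=2):
    """Jiggled height of a(n) at n', following the search rule described above."""
    a = np.asarray(a, dtype=float)
    if peak_mask[n_prime]:
        return a[n_prime]                           # n' is itself a peak
    for d in range(1, w_jiggle + 1):                # search outward: n'+/-1, n'+/-2, ...
        heights = [a[n] for n in (n_prime - d, n_prime + d)
                   if 0 <= n < len(a) and peak_mask[n]]
        if heights:
            return max(heights)                     # equidistant peaks: take the higher one
    return a[n_prime]                               # no peak within the window
```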
  • Difference Finding. Difference finding identifies locations where at least one of the groups has a peak and its value is significantly different from the other groups.
  • the group averages and individual averages produced by iterative scaling serve as inputs to difference finding.
  • a preferable method employs a parameter w_diff that defines the minimum distance between differences.
  • a preferable value is w_diff ≈ 1.1 nt.
  • a preferred algorithm is as follows: • Generate a master peak mask M_peak(n) using one of the following alternatives: o Option 1: For each group A and individual A_i, calculate a peak mask m_peak,i(n) from the individual average A_i(n). Then, for each position n, M_peak(n) is 1 if at least one of the individual peak masks is 1 and is 0 otherwise. o Option 2: Calculate the peak mask M_peak(n) directly from the grand mean M(n) of all the groups. o Option 3: If there are only two groups A and B, generate a peak mask from the difference A(n) − B(n).
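A sketch of Option 1, combining per-individual peak masks into the master mask; the input layout (one mask per individual average, stacked row-wise) is an assumption.

```python
import numpy as np

def master_peak_mask(individual_peak_masks):
    """M_peak(n) = 1 if at least one individual peak mask m_peak,i(n) is 1."""
    masks = np.asarray(individual_peak_masks, dtype=int)   # shape (individuals, N)
    return (masks.sum(axis=0) > 0).astype(int)
```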
  • a pooled variance t-test may be employed instead of an F-test.
  • a less preferred algorithm for a comparison between two groups, A and B, and one-dimensional data sets is as follows: • Perform the final step of iterative scaling by scaling A(n) to B(n).
  • LASTPOSITION = n
  • LASTDIRECTION = direction(n)
  • LASTPVALUE = p-value(n). o Otherwise, if p-value(n) < LASTPVALUE, then
  • Male Sprague-Dawley rats (Harlan Sprague Dawley, Inc., Indianapolis, Indiana) of 10-14 weeks of age were gavage-fed and dosed with phenobarbital once a day for three days at a dose of 3.81 mg/kg/day.
  • the drug was dissolved in sterile water prior to treatment. This dosage corresponds to the ED100 (the upper limit of the effective dose for humans) adjusted for the difference in metabolic rate between rats and humans.
  • Three rats were used for the drug treatment group, and an additional three rats were treated with sterile water to serve as the control group.
  • Rats were sacrificed 24 hours after the final dose and their brains were harvested. Collection of mRNA from the harvested brains, synthesis of the corresponding cDNA, and differential display protocols were done as has been described elsewhere. See, U.S. Patent No. 5,871,697; Shimkets et al., Nat. Biotechnol. 17: 798-803 (1999).
  • the basis set used was piecewise linear with 13 scaling points located every 35 nt beginning at 30 nt and ending at 450 nt.
  • the distance function was [ln(a(n)/A(n))]², and points for which
  • the first step in the scaling procedure was that the 3 normalized traces for each individual were averaged, scaled to the average, re-averaged, then re-scaled to the average for 2 rounds of iterative scaling.
  • the phenobarbital-treated individual average and the sterile-water-treated individual average were themselves averaged, the individual averages scaled to the grand average, then the process repeated for 2 rounds of iterative scaling.
  • the phenobarbital- treated individual average was scaled to the sterile-water-treated individual average.
  • the final scaling factors are shown in FIG. 6 for the two individuals.
  • the final individual averages are shown in FIG. 7.

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a method for identifying a difference between at least two data sets made up of ordered elements. The method uses internal features of the data sets to perform normalization, scaling and difference-finding calculations.
EP00903107A 1999-01-05 2000-01-05 Normalisation, changement d'echelle et recherche des differences entre des ensembles de donnees Withdrawn EP1141878A2 (fr)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US11480699P 1999-01-05 1999-01-05
US114806P 1999-01-05
US47727300A 2000-01-04 2000-01-04
US477273 2000-01-04
PCT/US2000/000167 WO2000041122A2 (fr) 1999-01-05 2000-01-05 Normalisation, changement d'echelle et recherche des differences entre des ensembles de donnees

Publications (1)

Publication Number Publication Date
EP1141878A2 true EP1141878A2 (fr) 2001-10-10

Family

ID=26812554

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00903107A Withdrawn EP1141878A2 (fr) 1999-01-05 2000-01-05 Normalisation, changement d'echelle et recherche des differences entre des ensembles de donnees

Country Status (2)

Country Link
EP (1) EP1141878A2 (fr)
JP (1) JP2002539768A (fr)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0041122A3 *

Also Published As

Publication number Publication date
JP2002539768A (ja) 2002-11-26

Similar Documents

Publication Publication Date Title
Wang et al. Confounder adjustment in multiple hypothesis testing
Gottardo et al. Bayesian robust inference for differential gene expression in microarrays with multiple samples
Reinert et al. Alignment of next-generation sequencing reads
Franks et al. Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data
Lai et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data
Eskin et al. Mismatch string kernels for SVM protein classification
Theobald et al. THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures
Abegaz et al. Principals about principal components in statistical genetics
US7805282B2 (en) Process, software arrangement and computer-accessible medium for obtaining information associated with a haplotype
Landgrebe et al. Permutation-validated principal components analysis of microarray data
Verboven et al. Sequential imputation for missing values
Mitteroecker et al. Multivariate analysis of genotype–phenotype association
Calza et al. Normalization of oligonucleotide arrays based on the least-variant set of genes
Bolstad et al. Preprocessing high-density oligonucleotide arrays
Frost et al. Principal component gene set enrichment (PCGSE)
Welle et al. Computational method for reducing variance with Affymetrix microarrays
Fuhrmann et al. Software for automated analysis of DNA fingerprinting gels
Knoll et al. cnAnalysis450k: an R package for comparative analysis of 450k/EPIC Illumina methylation array derived copy number data
WO2000041122A2 (fr) Normalisation, changement d'echelle et recherche des differences entre des ensembles de donnees
Giai Gianetto Statistical analysis of post-translational modifications quantified by label-free proteomics across multiple biological conditions with R: illustration from SARS-CoV-2 infected cells
EP1141878A2 (fr) Normalisation, changement d'echelle et recherche des differences entre des ensembles de donnees
Bayat et al. VSS: variance-stabilized signals for sequencing-based genomic signals
Malik et al. Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders
Saviozzi et al. Microarray probe expression measures, data normalization and statistical validation
Rensink et al. Statistical issues in microarray data analysis

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010706

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17Q First examination report despatched

Effective date: 20011001

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20040214