WO2000039337A1

WO2000039337A1 - Methods for robust discrimination of profiles

Info

Publication number: WO2000039337A1
Application number: PCT/US1999/030577
Authority: WO
Inventors: Stephen H. Friend; Roland Stoughton
Original assignee: Rosetta Inpharmatics, Inc.
Priority date: 1998-12-23
Filing date: 1999-12-21
Publication date: 2000-07-06
Also published as: CA2356891A1; JP2002533699A; EP1141415A1; WO2000039337A9; AU2483900A

Abstract

Methods for discriminating between the subtle effects of a first perturbation and a second perturbation on a biological sample are provided. Further, methods for identifying disease states in patients and methods for optimizing drug therapy regimens in diseased subjects are provided. Finally, improved methods for determining the subtle effects of pharmacological agents on a biological system are provided.

Description

Methods for Robust Discrimination of Profiles

This is a continuation-in-part of copending application serial number 09/220,274, by Stoughton et al. filed December 23, 1998 entitled, "Methods for Robust Discrimination of Profiles" which is incorporated by reference herein in its entirety.

1 FIELD OF THE INVENTION

The field of this invention relates to methods for discriminating between the subtle effects of a first perturbation and a second perturbation on a biological sample. The invention also relates to improved methods for identifying disease states in patients. In addition, the invention provides improved methods for optimizing drug therapy regimens in diseased subjects. The invention also generally relates to improved methods for determining the subtle effects of pharmacological agents on a biological system.

2 BACKGROUND OF THE INVENTION

2.1 Profiles of Cellular Constituents

"Cellular constituents" include gene expression levels, abundance of mRNA encoding specific genes, and protein expression levels in a biological sample. Levels of various constituents of a cell, such as mRNA encoding genes and/or protein expression levels, are known to change in response to drug treatments and other perturbations of the cell's biological state. Measurements of a plurality of such "cellular constituents" therefore contain a wealth of information about the effect of perturbations on the cell's biological state. The collection of such measurements is generally referred to as the "profile" of the cell's biological state.

There may be on the order of 100,000 different cellular constituents for mammalian cells. Consequently, the profile of a particular cell is typically complex. The profile of any given state of a biological sample is often measured after the sample has been subjected to a perturbation. Such perturbations include, for example, exposure of the sample to a drug candidate, the introduction of an exogenous gene, the deletion of a gene from the sample, or changes in culture conditions. Comprehensive measurements of cellular constituents, or profiles of gene and protein expression and their response to perturbations in the cell, therefore have a wide range of utility including the ability to compare and understand the effects of drugs, diagnose disease, and optimize patient drug regimens. In addition, they have further application in basic life science research. Within the past decade, several technological advances have made it possible to accurately measure cellular constituents and therefore derive profiles. For example, new techniques provide the ability to monitor the expression level of a large number of transcripts at any one time (see, e.g., Schena et al., 1995, Quantitative monitoring of gene expression patterns with a complementary DNA micro-array, Science 270:467-470; Lockhart et al., 1996, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Sequence to array: Probing the genome's secrets, Nature Biotechnology 14:1649; U.S. Patent 5,569,588, issued October 29, 1996 to Ashby et al. entitled "Methods for Drug Screening"). In organisms for which the complete genome is known, it is possible to analyze the transcripts of all genes within the cell. With other organisms, such as humans, for which there is an increasing knowledge of the genome, it is possible to simultaneously monitor large numbers of the genes within the cell.

On another front, the direct measurement of protein abundance has been improved by the use of microcolumn reversed-phase liquid chromatography electrospray ionization tandem mass spectrometry (LC/MS/MS) to directly identify proteins contained in mixtures. This technology promises to push the dynamic range for which protein abundance can be measured in a biological sample. Using LC/MS/MS, McCormack et al. have demonstrated that proteins present in sample mixtures can be readily identified with a 30-fold difference in molar quantity, that the identifications are reproducible, and that proteins within the mixture can be identified at low femtomole levels. McCormack et al., 1997, Direct analysis and identification of proteins in mixtures by LC/MS/MS and database searching at the low-femtomole level, Anal. Chem. 69:767-776. In a review of tandem mass spectrometry, Chait points out that an additional advantage of this technology is that it is orders of magnitude faster than more conventional approaches such as Edman sequencing. Chait, 1996, Trawling for proteins in the post-genome era, Nat. Biotech. 14:1544.

Other technological advances have provided for the ability to specifically perturb biological samples with individual genetic mutations. For example, Mortensen et al. describe a method for producing embryonic stem (ES) cell lines whereby both alleles are inactivated by homologous recombination. Using the methods of Mortensen et al., it is possible to obtain homozygous mutationally altered cells, i.e., double knockouts of ES cell lines. Mortensen et al. propose that their method may be generally applicable to other genes and to cell lines other than ES cells. Mortensen et al., 1992, Production of homozygous mutant ES cells with a single targeting construct, Cell Biol. 12:2391-2395.

In another promising technology, Wach et al. provide a dominant resistance module for selection of S. cerevisiae transformants which consists entirely of heterologous DNA. The module can also be used to provide PCR-based gene disruptions. Wach et al., 1994, New heterologous modules for classical or PCR-based gene disruptions in Saccharomyces cerevisiae, Yeast 10:1793-808.

Technological advances, such as the use of DNA microarrays, are already being used in drug discovery (see, e.g., Marton et al., 1998, Drug target validation and identification of secondary drug target effects using DNA microarrays, Nature Medicine, in press; Gray et al., 1998, Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors, Science 281:533-538).

2.2 Profile comparisons

Comparison of profiles with other profiles in a database (see, e.g., U.S. Patent 5,777,888, issued July 7, 1998 to Rine et al. entitled "Systems for generating and analyzing stimulus-response output signal matrices") or clustering of profiles by similarity can give clues to the molecular targets of drugs and related functions, efficacy and toxicity of drug candidates and/or pharmacological agents. Such comparisons may also be used to derive consensus profiles representative of ideal drug activities or disease states. Profile comparison can also help detect diseases in a patient at an early stage and provide improved clinical outcome projections for a patient diagnosed with a disease.

At the center of all these profile comparison efforts is the need for robust discrimination of subtle differences in activity of the experimental conditions ("perturbations") that are often associated with the different profiles. To date such robust discrimination has not been achieved. In a typical perturbation experiment, the responses of several thousand cellular constituents are measured, yet only a small number of constituents change significantly. Frequently none of the cellular constituents change at all. Consequently, there is frequently not enough information available in a conventional profile to provide an accurate assessment of the subtle effects of a perturbation. Figure 1 illustrates this art-recognized problem. In Figure 1, the results of 365 mRNA transcription profiling experiments are shown. The 365 experiments include experiments with/without drugs at different concentrations, with/without specific genes in the yeast strain, combinations of drug treatment and gene deletion, changes in culture density, growth temperature, medium composition, and stimulations with endogenous hormones like mating factor. Although several thousand cellular constituents are being profiled in each experiment depicted in Figure 1, typically only a small number of constituents change significantly, and often none at all. As a consequence, a profile derived from any of the 365 experiments in Figure 1 would not provide enough information to determine the subtle effects of a particular perturbation. Consequently, profile comparisons using conventional profiles suffer from a failure to provide sufficient information to discern the subtle effects of a perturbation on a biological system. According to the above background, there is a great demand in the art for robust profile comparison methods.

Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.

3 SUMMARY OF THE INVENTION

This invention provides robust profile comparison methods. These methods are used to determine a degree of similarity between an effect of a first perturbation and a second perturbation on a biological system. The methods of this invention have extensive applications in the areas of preventive health care, drug discovery, drug candidate lead selection, drug candidate validation, drug regimen optimization in a variety of patient populations, development of clinical trial protocols to satisfy United States Food and Drug Administration (FDA) requirements including those for investigative new drugs, satisfaction of related clinical trial protocol requirements in administrative agencies that are equivalent to the FDA in countries other than the United States, drug and/or drug candidate efficacy, drug and/or drug candidate toxicity, diagnostic applications such as disease monitoring in a variety of patient populations, and for the prediction of the clinical outcome of a patient.

One aspect of the invention includes a method comprising the steps of (a) determining a first set of constituent profiles, wherein each constituent profile in the set is determined by a different one of a plurality of initial states of a biological sample by measuring a response of the biological sample to the first perturbation when the biological sample is in the selected initial state; (b) determining a second set of constituent profiles, each constituent profile of the second set determined using a different one of a plurality of initial states of the biological sample by measuring a response of the biological sample to a second perturbation when the biological sample is in the selected initial state; (c) combining the first set of constituent profiles into a first augmented profile; (d) combining the second set of constituent profiles into a second augmented profile; and (e) comparing the first augmented profile with the second augmented profile to determine the degree of similarity between the first perturbation and the second perturbation.

In accord with a second aspect of the invention at least one constituent profile in the first set of constituent profiles is a first response profile and at least one constituent profile in the second set of constituent profiles is a second response profile. The first response profile is determined by at least one measurement of at least one cellular constituent in the biological sample when the biological sample is in an initial state selected from a plurality of initial states, and the second response profile is determined by at least one measurement of at least one cellular constituent in said biological sample when said biological sample is in the selected initial state. In accord with another aspect of the invention at least one constituent profile in the first set of constituent profiles is a first projected profile and at least one constituent profile in the second set of constituent profiles is a second projected profile. In this aspect of the invention, the first and second projected profiles each contain a plurality of cellular constituent set values derived according to a definition of co-varying cellular constituent sets. The first and second projected profiles could be determined by an initial state selected from said plurality of initial states of the biological sample. An augmented profile could include any combination of projected profiles and response profiles.

In accord with another aspect of the invention the biological sample is a cell line. The cell line could be of a unicellular organism and at least one initial state included in a plurality of initial states could be determined by altering the biological sample in a manner that alters cell wall permeability. In another aspect the biological sample is substantially isogenic to Saccharomyces cerevisiae.

In another aspect of the invention, the biological sample is a cell line that expresses a macromolecule that serves as a drug efflux pump. In this embodiment, some of the initial biological states are generated by selecting isogenic cell lines that do not possess macromolecules that have an ability to act as a drug efflux pump.

In another embodiment, the biological sample is a cell line and the first initial state that is selected from a plurality of initial states is determined by a first set of culture growth conditions and a second initial state that is selected from a plurality of initial states is determined by a second set of culture growth conditions. In this embodiment, the first culture growth conditions and the second culture growth conditions vary by a variable such as an amount of a nutrient that is necessary for viability of said cell line, an amount of a trace element, an amount of a mineral, a culture temperature, and/or the nature of the container the sample is cultured in. Examples of containers include but are not limited to shaker flasks, culture plates and incubators.

In another aspect of the invention, the biological sample is a cell line and a first initial state that is selected from a plurality of initial states is determined by a culture growth density of the cell line and a second initial state that is selected from a plurality of initial states is determined by a second culture growth density of the cell line, wherein the two culture growth densities vary by an amount.

In another aspect of the invention, the biological sample is a cell line and a first initial state that is selected from a plurality of initial states is determined by a first amount of a pharmacological agent that is contacted with the biological sample and a second initial state that is selected from said plurality of initial states is determined by a second amount of a pharmacological agent that is contacted with the biological sample.

In accord with another aspect of the invention a first initial state is determined by a genetic feature of the biological sample. In this aspect of the invention, the biological sample could be Saccharomyces cerevisiae having a genome and the first initial state that is selected from a plurality of initial states is determined by a genetic feature selected from the group consisting of a haploid state of the genome, a diploid state of the genome, a heterozygous state of a gene included in the genome, a homozygous state of a gene included in the genome, a mutation of a gene included in the genome, a deletion of a portion of a gene from the genome, an alteration of a regulatory sequence of a gene in the genome, an exogenous gene integrated into the genome and an exogenous oligonucleotide integrated into the genome.

In accordance with another aspect of the invention, the biological sample could be a cell line having a genome wherein the first initial state that is selected from a plurality of initial states is determined by a genetic feature selected from the group consisting of a heterozygous state of a gene included in the genome, a homozygous state of a gene included in the genome, a mutation of a gene included in the genome, a deletion of a portion of a gene from the genome, an alteration of a regulatory sequence of a gene in the genome, an exogenous gene integrated into the genome, and an exogenous oligonucleotide integrated into the genome.

In another aspect of the invention, the biological sample is a cell line and the first initial state that is selected from a plurality of initial states is determined by a state of a biological pathway that is selected from a compendium of biological pathways present in the cell line. In one aspect of the invention, the biological sample is substantially isogenic with Saccharomyces cerevisiae and the biological pathway is a mating pathway.

In yet another aspect of the invention, the first perturbation is a first amount of a first pharmacological agent that is contacted with the biological sample. In another aspect, the second perturbation is a second amount of the first pharmacological agent that is contacted with the biological sample, and the first and second amounts of pharmacological agent vary. In another aspect, the second perturbation is a second amount of a second pharmacological agent that is contacted with said biological sample.

In accordance with another aspect of the invention, the biological sample includes a genome and the first perturbation is determined by the introduction of an exogenous gene into the genome, and/or deletion of at least one gene in the genome.

In accordance with another aspect of the invention, the first perturbation is a method, the method comprising: contacting said biological sample with a hormone, a drug, a peptide, an oligonucleotide, a mineral, a composition of media, a phage, a trace element, a salt, a colony stimulating factor or a source of irradiation. In another aspect, the first perturbation is a method, the method comprising: contacting an amount of an organic compound that has a molecular weight less than 1000 Daltons with said biological sample.

In accordance with another aspect of the invention, the first augmented profile is expressed as:

$P^i = [P'_1; \ldots; P'_N]$

wherein,

$P^i$ is a first augmented profile;

$P'_1$ is a first constituent profile in a first set of constituent profiles that is determined by measuring a response of a biological sample to a first perturbation when a biological sample is in a first biological state selected from a plurality of initial states;

$P'_N$ is an $N$th constituent profile in the first set of constituent profiles that is determined by measuring a response of the biological sample to the first perturbation when the biological sample is in an $N$th biological state selected from the plurality of initial states; and

the second augmented profile is:

$P^j = [P''_1; \ldots; P''_N]$

wherein,

$P^j$ is a second augmented profile;

$P''_1$ is a first constituent profile in a second set of constituent profiles that is determined by measuring a response of the biological sample to the second perturbation when the biological sample is in said first biological state;

$P''_N$ is an $N$th constituent profile in the second set of constituent profiles that is determined by measuring a response of the biological sample to the second perturbation when the biological sample is in an $N$th biological state selected from the plurality of initial states; and

$N$ is the number of states in the plurality of initial states.

In this embodiment the step of comparing the first augmented profile with the second augmented profile to determine the correlation is performed by comparing $P^i$ to $P^j$ using a quantitative measure of similarity. In one aspect this quantitative measure of similarity is a generalized dot product:

$r_{ij} = \frac{P^i \cdot P^j}{|P^i|\,|P^j|}$

wherein $\cdot$ denotes dot product, $|\ |$ denotes vector norm and $r_{ij}$ denotes similarity. In another aspect of the invention, the quantitative measure of similarity is derived from Shannon mutual information theory.
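The concatenation and the generalized dot product described above can be illustrated with a short Python sketch. It is not part of the original disclosure: the function names `augment` and `similarity`, the use of NumPy, the handling of all-zero profiles, and the toy response values are all assumptions made for illustration.

```python
import numpy as np

def augment(constituent_profiles):
    """Concatenate N constituent profiles (one per initial state) into one augmented profile."""
    return np.concatenate([np.asarray(p, dtype=float) for p in constituent_profiles])

def similarity(p_i, p_j):
    """Generalized dot product: (P_i . P_j) / (|P_i| |P_j|)."""
    norm = np.linalg.norm(p_i) * np.linalg.norm(p_j)
    if norm == 0.0:
        return 0.0  # assumption: an all-zero profile is treated as having no similarity
    return float(np.dot(p_i, p_j) / norm)

# Hypothetical responses of three cellular constituents in two initial states.
drug_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
drug_b = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]

# Similarity using only the first initial state, then using the augmented profiles.
r_state1 = similarity(np.array(drug_a[0]), np.array(drug_b[0]))
r_augmented = similarity(augment(drug_a), augment(drug_b))
```

In this toy case the two perturbations are indistinguishable in the first initial state alone (`r_state1` is 1.0), while the augmented profiles separate them (`r_augmented` is 0.0).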

In another aspect of the invention, each constituent profile includes a plurality of elements that each represent an amount of a cellular constituent in a biological sample.

Accordingly, the cellular constituents are independently selected from the group consisting of a gene expression level, an amount of an mRNA encoding a gene, an amount of a protein, an amount of an enzymatic activity, an amount of an epitope presented by a macromolecule, an amount of a divalent cation, an amount of a phosphorylated protein, an amount of a dephosphorylated protein, an amount of a hormone, and an amount of a peptide.

Another aspect of the invention is a method of determining an effect of a first perturbation on a subject, the method comprising: (a) determining a plurality of augmented profiles, each augmented profile determined by combining a constituent profile set selected from a plurality of constituent profile sets wherein: each constituent profile set in the plurality of constituent profile sets is determined by obtaining a biological sample from the subject at a different time; and each constituent profile in the constituent profile set is determined by measuring a biological response of the biological sample to a different second perturbation selected from a plurality of perturbations; and (b) comparing the plurality of augmented profiles to determine the effect of the first perturbation on the subject. The first perturbation may be selected from the group consisting of a diseased state, introduction of an exogenous gene into the genome of the subject, and a behavioral health risk. Optionally, the first constituent profile set in the plurality of constituent profile sets represents a baseline state and all other constituent profile sets in the plurality of constituent profile sets are expressed as a ratio or logarithmic ratio of the first constituent profile set. Optionally, the first perturbation is a drug that is taken by the subject of interest at regular intervals.
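The optional baseline normalization can be sketched in Python as follows. This is only an illustration under stated assumptions: each row is taken to be an already concatenated profile set obtained at one sampling time, all values are assumed to be strictly positive abundances, and the function name `log_ratio_to_baseline` and the numbers are hypothetical.

```python
import numpy as np

def log_ratio_to_baseline(profile_sets):
    """Express each later profile set as log10(ratio) relative to the first (baseline) set."""
    sets = np.asarray(profile_sets, dtype=float)
    baseline = sets[0]
    # Assumption: all measurements are strictly positive, so the ratio and log are defined.
    return np.log10(sets[1:] / baseline)

# Hypothetical abundances: rows are sampling times, columns are cellular constituents.
samples = [[10.0, 5.0, 2.0],    # baseline sample
           [100.0, 5.0, 0.2],   # later time point
           [10.0, 50.0, 2.0]]   # still later time point
ratios = log_ratio_to_baseline(samples)
```

Here `ratios` has one row per non-baseline time point; a value of +1 corresponds to a ten-fold increase over baseline and -1 to a ten-fold decrease.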

4 BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 represents the results of 365 mRNA transcription profiling experiments. Methods were as described for a subset of these experiments in Section 6, infra. Each of the 365 rows in this image has, when printed at full resolution, 6000 gray-scale pixels representing the ratio in mRNA expression of the 6000 yeast genes between the pair of cell conditions in that experiment pair. Black denotes upregulation of a gene's transcription, white denotes downregulation, and the middle gray denotes very little or no change. The gray-scale bar at the bottom of Figure 1 indicates a scale from log10(ratio) = -1 (ten-fold downregulation) to log10(ratio) = +1 (ten-fold upregulation) for reference. The 365 condition pairs include comparisons of with/without drugs at different concentrations, with/without specific genes in the yeast strain, combinations of drug treatment and gene deletion, changes in culture density, growth temperature, medium composition, and stimulations with endogenous hormones like mating factor.

Fig. 2 represents response profiles to drugs in multiple conditions. Although the response to the drugs under starting State 1 may be small or nonexistent, the concatenated response profiles obtained in different states may provide robust discrimination of the activities of the different compounds. ↑ denotes upregulation. ↓ denotes downregulation. Absence of an arrow denotes no change for that cellular constituent.

Fig. 3A illustrates a profile for the immunosuppressant drug cyclosporin A. Fig. 3B illustrates a profile for the immunosuppressant drug FK506. In both figures, the horizontal axis is the intensity of the individual hybridized spots on the microarray, representing individual mRNA species abundance in the two cultures. The vertical axis is the log10 of the ratio of the intensity measured for one fluorescent label (Culture 1) to that measured for the other label (Culture 2). Error bars and names are displayed only for those genes which had up or down regulations due to the drug that were significant at the 95% confidence level or better.

Fig. 4 shows the high correlation (similarity) between the effects of cyclosporin A and FK506 on S. cerevisiae that had been cultured in the presence of 1 μg/ml of FK506 and 30 μg/ml of cyclosporin respectively.

Fig. 5A illustrates a response profile for the gene deletion strain FPR cultured in the presence of 1 μg/ml of FK506.

Fig. 5B illustrates a response profile for the gene deletion strain CPH1 cultured in the presence of 1 μg/ml of FK506.

Fig. 5C illustrates a response profile for the gene deletion strain FPR cultured in the presence of 50 μg/ml of cyclosporin.

Fig. 5D illustrates a response profile for the gene deletion strain CPH1 cultured in the presence of 50 μg/ml of cyclosporin.

Fig. 6 illustrates the reduced correlation between the effects of cyclosporin and FK506 in yeast when augmented profiles are used.

Fig. 7 illustrates a computer system useful for embodiments of the invention.

5 DETAILED DESCRIPTION OF THE INVENTION

A basis for the present invention is the unexpected discovery that augmented profiles provide a method for robustly discriminating between the subtle effects of a first perturbation and a second perturbation on a biological sample. Augmented profiles are derived by the combination of a plurality of response profiles and/or projection profiles that are in turn based upon the measurement of cellular constituents within a biological sample as the biological sample is placed in a series of different starting states. This section presents a detailed description of the invention and its applications.

5.1 INTRODUCTION

To appreciate the methods of the present invention, an understanding of some preliminary concepts such as biological state, response profiles, and projection profiles is necessary. After these concepts are understood, one skilled in the art will understand the concept of an augmented profile. Further, the improvements that augmented profiles provide in the field of profile comparison will be appreciated after the details of the present invention are described and an example is presented.

5.1.1 GENERAL DEFINITIONS

Biological sample and/or Biological system: As used herein, a biological sample and/or biological system includes a cell line, a culture of a cell line, a tissue sample obtained from a subject, a Homo sapiens, a mammal, a yeast substantially isogenic to Saccharomyces cerevisiae, or any other art-recognized biological system.

Perturbation: As used herein, a perturbation includes the exposure of a biological sample to a drug candidate or pharmacologic agent, the introduction of an exogenous gene into a biological sample, the deletion of a gene from the biological sample, changes in the culture conditions of the biological sample, or any other art recognized method of perturbing a biological sample.

Constituent Profile: A constituent profile is a profile used in the formation of an augmented profile. The constituent profile may, for example, be a response profile or a projected profile, which are described infra.

Behavioral Health Risk: As used herein, a behavioral health risk includes, but is not limited to, consumption of alcohol and cigarette smoking.

5.1.2 BIOLOGICAL SAMPLE

As used herein, the term "biological sample" is broadly defined to include any cell, tissue, organ or multicellular organism. A biological sample can be derived, for example, from cell or tissue cultures in vitro. Alternatively, a biological sample can be derived from a living organism or from a population of single cell organisms. The state of a biological sample can be measured by the content, activities or structures of its cellular constituents. The state of a biological sample, as used herein, is determined by the state of a collection of cellular constituents, which are sufficient to characterize the cell or organism for an intended purpose including characterizing the effects of a drug or other perturbation. The term "cellular constituent" is broadly defined herein to encompass any kind of measurable biological variable. The measurements and/or observations made on the state of these constituents can be of their abundances (i.e., amounts or concentrations in a biological sample), their activities, their states of modification (e.g., phosphorylation), or other art-recognized measurements relevant to the physiological state of a biological sample. In various embodiments, this invention includes making such measurements and/or observations on different collections of cellular constituents. These different collections of cellular constituents are also called aspects of the biological state of a biological sample.

One aspect of the biological state of a biological sample (e.g., a cell or cell culture) usefully measured in the present invention is its transcriptional state. The transcriptional state of a biological sample includes the identities and abundances of the constituent RNA species, especially mRNAs, in the cell under a given set of conditions. Often, a substantial fraction of all constituent RNA species in the biological sample are measured, but at least a sufficient fraction is measured to characterize the action of a drug or other perturbation of interest. The transcriptional state of a biological sample can be conveniently determined by measuring cDNA abundances by any of several existing gene expression technologies. DNA arrays for measuring mRNA or transcript level of a large number of genes can be employed to ascertain the biological state of a sample.

Another aspect of the biological state of a biological sample usefully measured is its translational state. The translational state of a biological sample includes the identities and abundances of the constituent protein species in the biological sample under a given set of conditions. Preferably a substantial fraction of all constituent protein species in the biological sample is measured, but at least a sufficient fraction is measured to characterize the action of a drug of interest. The transcriptional state is often representative of the translational state.

Other aspects of the biological state of a biological sample are also of use in this invention. For example, the activity state of a biological sample includes the activities of the constituent protein species (and also optionally catalytically active nucleic acid species) in the biological sample under a given set of conditions. As is known to those of skill in the art, the translational state is often representative of the activity state.

This invention is also adaptable, where relevant, to "mixed" aspects of the biological state of a biological sample in which measurements of different aspects of the biological state of a biological sample are combined. For example, in one mixed aspect, the abundances of certain RNA species and of certain protein species, are combined with measurements of the activities of certain other protein species. Further, it will be appreciated from the following that this invention is also adaptable to any other aspect of a biological state of a biological sample that is measurable.

The biological state of a biological sample (e.g., a cell or cell culture) can be represented by a profile of some number of cellular constituents. Such a profile of cellular constituents can be represented by the vector S:

$S = [S_1, \ldots, S_i, \ldots, S_k]$ (1)

where $S_i$ is the level of the i'th cellular constituent, for example, the transcript level of gene i, or alternatively, the abundance or activity level of protein i.

In some embodiments, cellular constituents are measured as continuous variables. For example, transcriptional rates are typically measured as number of molecules synthesized per unit of time. Transcriptional rate may also be measured as percentage of a control rate. However, in some other embodiments, cellular constituents may be measured as categorical variables. For example, transcriptional rates may be measured as either "on" or "off," where the value "on" indicates a transcriptional rate above a predetermined threshold and the value "off" indicates a transcriptional rate below that threshold.

5.1.3 RESPONSE PROFILES

The responses of a biological sample to a perturbation, such as a pharmacological agent, can be measured by observing the changes in the biological state of the biological sample. A response profile is a collection of changes of cellular constituents. The response profile of a biological sample (e.g., a cell or cell culture) to the perturbation m may be defined as the vector $v^{(m)}$:

$$v^{(m)} = [v_1^{(m)}, \ldots, v_i^{(m)}, \ldots, v_k^{(m)}] \qquad (2)$$

where $v_i^{(m)}$ is the amplitude of response of cellular constituent i under the perturbation m. In some embodiments of response profiles, biological response to the application of a pharmacological agent is measured by the induced change in the transcript level of at least 2 genes, preferably more than 10 genes, more preferably more than 100 genes and most preferably more than 1,000 genes.

In some embodiments, biological response profiles comprise simply the difference between biological variables before and after perturbation. In some preferred embodiments, the biological response is defined as the ratio of cellular constituents before and after a perturbation is applied.

In some preferred embodiments, $v_i^{(m)}$ is set to zero if the response of gene i is below some threshold amplitude or confidence level determined from knowledge of the measurement error behavior. In such embodiments, those cellular constituents whose measured responses are lower than the threshold are given the response value of zero, whereas those cellular constituents whose measured responses are greater than the threshold retain their measured response values. This truncation of the response vector is suitable when most of the smaller responses are expected to be greatly dominated by measurement error. After the truncation, the response vector $v^{(m)}$ also approximates a 'matched detector' (see, e.g., Van Trees, 1968, Detection, Estimation, and Modulation Theory, Vol. I, Wiley & Sons) for the existence of similar perturbations. It is apparent to those skilled in the art that the truncation levels can be set based upon the purpose of detection and the measurement errors. For example, in some embodiments, genes whose transcript level changes are lower than two fold or more preferably four fold are given the value of zero.
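By way of illustration only, the truncation described above may be sketched in Python as follows. The function names, the choice of a log10 ratio as the response measure, and the example constituent levels are assumptions made for this sketch and are not prescribed by the specification.

```python
import math

def log_ratio_profile(before, after):
    """Response of each constituent, expressed as log10(after / before)."""
    return [math.log10(a / b) for b, a in zip(before, after)]

def truncate_profile(response, fold_threshold=2.0):
    """Zero every response whose magnitude is below the fold-change threshold."""
    cutoff = math.log10(fold_threshold)  # threshold expressed on the log10 scale
    return [v if abs(v) >= cutoff else 0.0 for v in response]

# Hypothetical constituent levels before and after a perturbation.
before = [100.0, 100.0, 100.0, 100.0]
after = [110.0, 400.0, 25.0, 90.0]
truncated = truncate_profile(log_ratio_profile(before, after), fold_threshold=2.0)
```

In this sketch the first and last constituents change by less than two fold and are set to zero, while the four-fold induction and four-fold repression retain their measured values.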

In some preferred embodiments of response profiles, perturbations are applied at several levels of strength. For example, different amounts of a drug may be applied to a biological sample to observe its response. In such embodiments, the perturbation responses may be interpolated by approximating each by a single parameterized "model" function of the perturbation strength u. An exemplary model function appropriate for approximating transcriptional state data is the Hill function, which has adjustable parameters a, u₀, and n.

$$H(u) = \frac{a\,(u/u_0)^n}{1 + (u/u_0)^n} \qquad (3)$$

The adjustable parameters are selected independently for each cellular constituent of the perturbation response. Preferably, the adjustable parameters are selected for each cellular constituent so that the sum of the squares of the differences between the model function (e.g., the Hill function, Equation 3) and the corresponding experimental data at each perturbation strength is minimized. This preferable parameter adjustment method is known in the art as a least squares fit. Other possible model functions are based on polynomial fitting. A more detailed description of model fitting and biological response has been disclosed in Friend and Stoughton, Methods of Determining Protein Activity Levels Using Gene Expression Profiles, U.S. Provisional Application Serial No. 60/084,742, filed on May 8, 1998, which is incorporated herein by reference in its entirety for all purposes.
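A least squares fit of the Hill function to a single cellular constituent may be sketched as follows. This is a minimal illustration under stated assumptions: the Hill function is taken in its standard form with parameters a, u0 and n; the grid search over u0 and n (with a solved in closed form) is one simple fitting strategy chosen for this sketch, not a method taken from the specification; and the data are synthetic.

```python
def hill(u, a, u0, n):
    """Hill function: a * (u/u0)**n / (1 + (u/u0)**n)."""
    x = (u / u0) ** n
    return a * x / (1.0 + x)

def fit_hill(strengths, responses, u0_grid, n_grid):
    """Least squares fit: grid search over u0 and n, amplitude a in closed form."""
    best = None
    for u0 in u0_grid:
        for n in n_grid:
            shape = [hill(u, 1.0, u0, n) for u in strengths]
            denom = sum(h * h for h in shape)
            if denom == 0.0:
                continue
            # For fixed u0 and n the model is linear in a, so the optimum is direct.
            a = sum(y * h for y, h in zip(responses, shape)) / denom
            sse = sum((y - a * h) ** 2 for y, h in zip(responses, shape))
            if best is None or sse < best[0]:
                best = (sse, a, u0, n)
    return best

# Synthetic, noise-free responses at five perturbation strengths (assumed values).
strengths = [0.1, 0.3, 1.0, 3.0, 10.0]
responses = [hill(u, 2.0, 1.0, 2.0) for u in strengths]
u0_grid = [0.25 * i for i in range(1, 21)]  # 0.25 to 5.0
n_grid = [0.5 * i for i in range(1, 9)]     # 0.5 to 4.0
sse, a, u0, n = fit_hill(strengths, responses, u0_grid, n_grid)
```

In practice the fit would be repeated independently for each cellular constituent, and a continuous optimizer could replace the grid search where finer parameter resolution is needed.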

5.1.4 PROJECTED PROFILES

The methods of the invention are useful for comparing augmented profiles that contain any number of response profiles and/or projected profiles. Projected profiles are best understood after a discussion of genesets, which are sets of co-regulated genes. Projected profiles are useful for analyzing many types of cellular constituents including genesets.

5.1.4.1 CO-REGULATED GENES AND GENESETS

Certain genes tend to increase or decrease their expression in groups. Genes tend to increase or decrease their rates of transcription together when they possess similar regulatory sequence patterns, i.e., transcription factor binding sites. This is the mechanism for coordinated response to particular signaling inputs (see, e.g., Madhani and Fink, 1998, The riddle of MAP kinase signaling specificity, Trends in Genetics 14:151-155; Arnone and Davidson, 1997, The hardwiring of development: organization and function of genomic regulatory systems, Development 124:1851-1864). Separate genes which make different components of a necessary protein or cellular structure will tend to co-vary. Duplicated genes (see, e.g., Wagner, 1996, Genetic redundancy caused by gene duplications and its evolution in networks of transcriptional regulators, Biol. Cybern. 74:557-567) will also tend to co-vary to the extent mutations have not led to functional divergence in the regulatory regions. Further, because regulatory sequences are modular (see, e.g., Yuh et al., 1998, Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene, Science 279:1896-1902), the more modules two genes have in common, the greater the variety of conditions under which they are expected to co-vary their transcriptional rates. Separation between modules also is an important determinant since co-activators also are involved. In summary therefore, for any finite set of conditions, it is expected that genes will not all vary independently, and that there are simplifying subsets of genes and proteins that will co-vary. These co-varying sets of genes form a complete basis in the mathematical sense with which to describe all the profile changes within that finite set of conditions.

5.1.4.2 GENESET CLASSIFICATION BY CLUSTER ANALYSIS

For many applications, it is desirable to find basis genesets that are co-regulated over a wide variety of conditions. A preferred embodiment for identifying such basis genesets involves clustering algorithms (for reviews of clustering algorithms, see, e.g., Fukunaga, 1990, Statistical Pattern Recognition. 2nd Ed., Academic Press, San Diego; Everitt, 1974, Cluster Analysis. London: Heinemann Educ. Books; Hartigan, 1975, Clustering Algorithms. New York: Wiley; Sneath and Sokal, 1973, Numerical Taxonomy. Freeman; Anderberg, 1973, Cluster Analysis for Applications. Academic Press: New York).

In some embodiments employing cluster analysis, the expression of a large number of genes is monitored as biological samples are subjected to a wide variety of perturbations. A table of data containing the gene expression measurements is used for cluster analysis. In order to obtain basis genesets that contain genes which co-vary over a wide variety of conditions, multiple perturbations or conditions are employed. Cluster analysis operates on a table of data which has the dimension m x k, wherein m is the total number of conditions or perturbations and k is the number of genes measured.

A number of clustering algorithms are useful for clustering analysis. Clustering algorithms use dissimilarities or distances between objects when forming clusters. In some embodiments, the distance used is Euclidean distance in multidimensional space:

$$I(x,y) = \left\{ \sum_i (X_i - Y_i)^2 \right\}^{1/2} \qquad (4)$$

where I(x,y) is the distance between gene X and gene Y; $X_i$ and $Y_i$ are gene expression responses under perturbation i. The Euclidean distance may be squared to place progressively greater weight on objects that are further apart. Alternatively, the distance measure may be the Manhattan distance, e.g., between gene X and Y, which is provided by:

$$I(x,y) = \sum_i |X_i - Y_i| \qquad (5)$$

Again, $X_i$ and $Y_i$ are gene expression responses under perturbation i. Some other definitions of distances are Chebychev distance, power distance, and percent disagreement. Percent disagreement, defined as I(x,y) = (number of $X_i \neq Y_i$)/i, is particularly useful for the method of this invention if the data for the dimensions are categorical in nature. Another useful distance definition, which is particularly useful in the context of cellular response, is

I = 1 - r, where r is the correlation coefficient between the response vectors X, Y, also called the normalized dot product $X \cdot Y / (|X||Y|)$.
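The distance definitions discussed above may be illustrated with the following Python sketch. The function names and example vectors are assumptions for illustration; percent disagreement is computed here as the fraction of perturbations at which two categorical responses differ, and r is computed as the normalized dot product, following the wording of the text.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two response vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def percent_disagreement(x, y):
    """Fraction of perturbations at which two categorical responses differ."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

def correlation_distance(x, y):
    """1 - r, with r taken as the normalized dot product X.Y / (|X||Y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

# Hypothetical responses of two genes across four perturbations.
gene_x = [1.0, 2.0, 0.0, -1.0]
gene_y = [2.0, 4.0, 0.0, -2.0]
d_euclid = euclidean(gene_x, gene_y)
d_manhattan = manhattan(gene_x, gene_y)
d_corr = correlation_distance(gene_x, gene_y)
d_categorical = percent_disagreement(["on", "off", "on", "on"],
                                     ["on", "on", "on", "off"])
```

The two example genes respond in exact proportion, so the 1 - r distance is essentially zero even though the Euclidean and Manhattan distances are not, which is why this measure is convenient for grouping genes by the shape of their response rather than its amplitude.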

Various cluster linkage rules are useful for defining genesets. Single linkage, a nearest neighbor method, determines the distance between the two closest objects. By contrast, complete linkage methods determine distance by the greatest distance between any two objects in the different clusters. This method is particularly useful in cases when genes or other cellular constituents form naturally distinct "clumps." Alternatively, the unweighted pair-group average defines distance as the average distance between all pairs of objects in two different clusters. This method is also very useful for clustering genes or other cellular constituents to form naturally distinct "clumps." Finally, the weighted pair-group average method may also be used. This method is the same as the unweighted pair-group average method except that the size of the respective clusters is used as a weight. This method is particularly useful for embodiments where the cluster size is suspected to be greatly varied (Sneath and Sokal, 1973, Numerical taxonomy. San Francisco: W. H. Freeman & Co.). Other cluster linkage rules, such as the unweighted and weighted pair-group centroid and Ward's method, are also useful for some embodiments of the invention. See, e.g., Ward, 1963, J. Am. Stat. Assn. 58:236; Hartigan, 1975, Clustering algorithms. New York: Wiley.
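The single, complete, and unweighted pair-group average linkage rules may be sketched as follows. The helper names, the use of Manhattan distance between objects, and the example clusters are assumptions chosen for illustration.

```python
def manhattan(x, y):
    """Distance between two objects (Manhattan distance is assumed here)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pairwise(cluster_a, cluster_b, dist):
    """All distances between an object in one cluster and an object in the other."""
    return [dist(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(cluster_a, cluster_b, dist):
    """Nearest neighbor rule: distance between the two closest objects."""
    return min(pairwise(cluster_a, cluster_b, dist))

def complete_linkage(cluster_a, cluster_b, dist):
    """Greatest distance between any two objects in the different clusters."""
    return max(pairwise(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    """Unweighted pair-group average: mean over all cross-cluster pairs."""
    distances = pairwise(cluster_a, cluster_b, dist)
    return sum(distances) / len(distances)

# Two hypothetical clusters of genes, each gene described by two responses.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]
d_single = single_linkage(cluster_a, cluster_b, manhattan)
d_complete = complete_linkage(cluster_a, cluster_b, manhattan)
d_average = average_linkage(cluster_a, cluster_b, manhattan)
```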

As the diversity of perturbations in the clustering set becomes very large, the genesets which are clearly distinguishable get smaller and more numerous. However, even over very large experiment sets, there are small genesets that retain their coherence. These genesets are termed irreducible genesets. Typically, a large number of diverse perturbations are applied to obtain such irreducible genesets.

Often, the clustering of genesets is represented graphically and is termed a 'tree'. Genesets may be defined based on the many smaller branches of a tree, or a small number of larger branches by cutting across the tree at different levels. The choice of cut level may be made to match the number of distinct response pathways expected. If little or no prior information is available about the number of pathways, then the tree should be divided into as many branches as are truly distinct. 'Truly distinct' may be defined by a minimum distance value between the individual branches. Typical values are in the range 0.2 to 0.4 where 0 is perfect correlation and 1 is zero correlation, but may be larger for poorer quality data or fewer experiments in the training set, or smaller in the case of better data and more experiments in the training set.

Preferably, 'truly distinct' may be defined with an objective test of statistical significance for each bifurcation in the tree. In one aspect of the invention, the Monte Carlo randomization of the experiment index for each cellular constituent's responses across the set of experiments is used to define an objective test.

In some embodiments, the objective test is defined in the following manner: Let $p_{ki}$ be the response of constituent k in experiment i. Let $\Pi(i)$ be a random permutation of the experiment index. Then for each of a large (about 100 to 1000) number of different random permutations, construct $p_{k\Pi(i)}$. For each branching in the original tree, for each permutation:

(1) perform hierarchical clustering with the same algorithm ('hclust' in this case) used on the original unpermuted data;

(2) compute the fractional improvement f in the total scatter with respect to cluster centers in going from one cluster to two clusters

$$f = 1 - \sum D_k^{(1)} / \sum D_k^{(2)} \qquad (6)$$

where $D_k$ is the square of the distance measure for constituent k with respect to the center (mean) of its assigned cluster. Superscript 1 or 2 indicates whether it is with respect to the center of the entire branch or with respect to the center of the appropriate cluster out of the two subclusters. There is considerable freedom in the definition of the distance function D used in the clustering procedure. In these examples, D = 1 - r, where r is the correlation coefficient between the responses of one constituent across the experiment set vs. the responses of the other (or vs. the mean cluster response).

The distribution of fractional improvements obtained from the Monte Carlo procedure is an estimate of the distribution under the null hypothesis that a given branching was not significant. The actual fractional improvement for that branching with the unpermuted data is then compared to the cumulative probability distribution from the null hypothesis to assign significance. Standard deviations are derived by fitting a log normal model for the null hypothesis distribution. Using this procedure, a standard deviation greater than about 2, for example, indicates that the branching is significant at the 95% confidence level. Genesets defined by cluster analysis typically have underlying biological significance.

Another aspect of the cluster analysis method provides the definition of basis vectors for use in profile projection described in the following sections.

A set of basis vectors V has k x n dimensions, where k is the number of genes and n is the number of genesets:

$$V = \begin{bmatrix} V_1^{(1)} & \cdots & V_1^{(n)} \\ \vdots & \ddots & \vdots \\ V_k^{(1)} & \cdots & V_k^{(n)} \end{bmatrix} \qquad (7)$$

$V_k^{(n)}$ is the amplitude contribution of gene index k in basis vector n. In some embodiments, $V_k^{(n)} = 1$ if gene k is a member of geneset n, and $V_k^{(n)} = 0$ if gene k is not a member of geneset n. In some embodiments, $V_k^{(n)}$ is proportional to the response of gene k in geneset n over the training data set used to define the genesets.

In some preferred embodiments, the elements $V_k^{(n)}$ are normalized so that each $V^{(n)}$ has unit length by dividing by the square root of the number of genes in geneset n. This produces basis vectors which are not only orthogonal (the genesets derived from cutting the clustering tree are disjoint), but also orthonormal (unit length). With this choice of normalization, random measurement errors in profiles project onto the $V^{(n)}$ in such a way that the amplitudes tend to be comparable for each n. Normalization prevents large genesets from dominating the results of similarity calculations.
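The construction of normalized basis vectors from disjoint genesets may be sketched as follows. The function name and the representation of each geneset as a list of gene indices are assumptions made for this illustration.

```python
import math

def basis_vectors(genesets, num_genes):
    """Build one unit-length basis vector per geneset.

    Each geneset is a list of gene indices. Member genes receive the value
    1/sqrt(size of geneset); all other genes receive zero.
    """
    basis = []
    for members in genesets:
        weight = 1.0 / math.sqrt(len(members))
        vector = [0.0] * num_genes
        for k in members:
            vector[k] = weight
        basis.append(vector)
    return basis

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Two hypothetical disjoint genesets over six measured genes.
genesets = [[0, 1, 2, 3], [4, 5]]
V = basis_vectors(genesets, num_genes=6)
```

Because the genesets are disjoint, the dot product of any two distinct basis vectors is zero, and the normalization gives each vector unit length regardless of geneset size.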

5.1.4.3 GENESET CLASSIFICATION BASED UPON MECHANISMS OF REGULATION

Genesets can also be defined based upon the mechanism of the regulation of genes. Genes whose regulatory regions have the same transcription factor binding sites are more likely to be co-regulated. In some preferred embodiments, the regulatory regions of the genes of interest are compared using multiple alignment analysis to decipher possible shared transcription factor binding sites (Stormo and Hartzell, 1989, Identifying protein binding sites from unaligned DNA fragments, Proc Natl Acad Sci 86:1183-1187; Hertz and Stormo, 1995, Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation statistical basis for penalizing gaps, Proc of 3rd Intl Conf on Bioinformatics and Genome Research, Lim and Cantor, eds., World Scientific Publishing Co., Ltd. Singapore, pp. 201-216). For example, as Example 3, infra, shows, a common promoter sequence responsive to Gcn4 in 20 genes may be responsible for those 20 genes being co-regulated over a wide variety of perturbations. The co-regulation of genes is not limited to those with binding sites for the same transcription factor. Co-regulated (co-varying) genes may be in an upstream/downstream relationship where the products of upstream genes regulate the activity of downstream genes. It is well known to those of skill in the art that there are numerous varieties of gene regulation networks. One of skill in the art also understands that the methods of this invention are not limited to any particular kind of gene regulation mechanism. If it can be derived from the mechanism of regulation that two genes are co-regulated in terms of their activity change in response to perturbation, the two genes may be clustered into a geneset.

Because the regulation of genes of interest is often incompletely understood, it is often preferred to combine cluster analysis with regulatory mechanism knowledge to derive better defined genesets. In some embodiments, K-means clustering may be used to cluster genesets when the regulation of genes of interest is partially known. K-means clustering is particularly useful in cases where the number of genesets is predetermined by the understanding of the regulatory mechanism. In general, K-means clustering is constrained to produce exactly the number of clusters desired. Therefore, if promoter sequence comparison indicates the measured genes should fall into three genesets, K-means clustering may be used to generate exactly three genesets with greatest possible distinction between clusters.

5.1.4.4 REPRESENTING PROJECTED PROFILES

The expression value of genes can be converted into the expression value for genesets. This process is referred to as projection. In some embodiments, the projection is as follows:

$$P = [P_1, \ldots, P_i, \ldots, P_n] = p \cdot V \qquad (8)$$

wherein p is the expression profile, P is the projected profile, $P_i$ is the expression value for geneset i, and V is a predefined set of basis vectors. The basis vectors have been previously defined in Equation 7 (Section 5.1.4.2, supra) as:

$$V = \begin{bmatrix} V_1^{(1)} & \cdots & V_1^{(n)} \\ \vdots & \ddots & \vdots \\ V_k^{(1)} & \cdots & V_k^{(n)} \end{bmatrix} \qquad (9)$$

wherein $V_k^{(n)}$ is the amplitude of cellular constituent index k of basis vector n. In one preferred embodiment, the value of geneset expression is simply the average of the expression value of the genes within the geneset. In some other embodiments, the average is weighted so that highly expressed genes do not dominate the geneset value. The collection of the expression values of the genesets is the projected profile.
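The projection of Equation 8 may be sketched as follows, together with the simple-average embodiment. The function names, the storage of the basis as a list of n vectors of length k, and the example values are assumptions made for illustration.

```python
import math

def project(profile, basis):
    """Projected profile P = p . V: one dot product per basis vector."""
    return [sum(p * v for p, v in zip(profile, vector)) for vector in basis]

def geneset_averages(profile, genesets):
    """Alternative embodiment: plain average of the member genes' values."""
    return [sum(profile[k] for k in members) / len(members) for members in genesets]

# Hypothetical genesets over six genes and their unit-length basis vectors.
genesets = [[0, 1, 2, 3], [4, 5]]
w = 1.0 / math.sqrt(2.0)
basis = [[0.5, 0.5, 0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0, w, w]]

profile = [2.0, 2.0, 2.0, 2.0, -1.0, -3.0]  # expression profile p (assumed values)
projected = project(profile, basis)          # projected profile P, length n = 2
averaged = geneset_averages(profile, genesets)
```

Either form reduces a profile of k genes to n geneset values; the two differ only in how each geneset's members are weighted.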

5.2 PROFILE COMPARISON AND CLASSIFICATION

Once the basis genesets are chosen, projected profiles $P_i$ may be obtained for any set of profiles indexed by i. Similarities between the $P_i$ may be more clearly seen than between the original profiles $p_i$ for two reasons. First, measurement errors in extraneous genes have been excluded or averaged out. Second, the basis genesets tend to capture the biology of the profiles $p_i$ and so are matched detectors for their individual response components.

Classification and clustering of the profiles both are based on an objective similarity metric, call it S, where one useful definition is

$$S_{ij} = S(P_i, P_j) = P_i \cdot P_j / (|P_i||P_j|) \qquad (10)$$

This definition is the generalized angle cosine between the vectors $P_i$ and $P_j$. It is the projected version of the conventional correlation coefficient between $p_i$ and $p_j$. Profile $p_i$ is deemed most similar to that other profile $p_j$ for which $S_{ij}$ is maximum. New profiles may be classified according to their similarity to profiles of known biological significance, such as the response patterns for known drugs or perturbations in specific biological pathways. Sets of new profiles may be clustered using the distance metric

$$D_{ij} = 1 - S_{ij} \qquad (11)$$

where this clustering is analogous to clustering in the original larger space of the entire set of response measurements, but has the advantages just mentioned of reduced measurement error effects and enhanced capture of the relevant biology.
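The similarity metric of Equation 10 and the distance metric of Equation 11 may be sketched as follows. The function names and the three example projected profiles are assumptions made for this illustration.

```python
import math

def similarity(p, q):
    """Generalized angle cosine between two projected profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def distance_matrix(profiles):
    """Matrix of distances 1 - S between every pair of profiles."""
    return [[1.0 - similarity(p, q) for q in profiles] for p in profiles]

# Three hypothetical projected profiles, each over three genesets.
projected = [[1.0, 0.0, 2.0],
             [2.0, 0.0, 4.0],
             [0.0, 3.0, 0.0]]
D = distance_matrix(projected)
```

The first two profiles differ only in overall amplitude and so lie at essentially zero distance, while the third activates a different geneset and lies at distance 1 from both. A matrix of this kind can serve as the input to the clustering procedures described in Section 5.1.4.2.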

The statistical significance of any observed similarity $S_{ij}$ may be assessed using an empirical probability distribution generated under the null hypothesis of no correlation. This distribution is generated by performing the projection, Equations (9) and (10), for many different random permutations of the constituent index in the original profile p. That is, the ordered set $p_k$ is replaced by $p_{\Pi(k)}$, where $\Pi(k)$ is a permutation, for about 100 to 1000 different random permutations. The probability of the similarity $S_{ij}$ arising by chance is then the fraction of these permutations for which the similarity $S_{ij}$ (permuted) exceeds the similarity observed using the original unpermuted data.

5.3 AUGMENTED PROFILES AND ROBUST DISCRIMINATION

5.3.1 METHOD FOR ROBUST DISCRIMINATION

In the methods of this invention, a biological sample is placed in alternative states by, for example, introducing mutations or changing growth conditions, to make the biological sample more responsive to a given perturbation. This concept is illustrated in Figure 2. Under State 1, in Figure 2, the drugs have only limited responses and comparison of their effects is tenuous and based on little information. By forming augmented profiles consisting of concatenated profiles from multiple states or conditions, the profiles become much more informative. Because they are more informative, they can provide improved detail on the effects of different perturbations, such as drugs in the illustration, on a patient. The different states may be different culture growth conditions, background genetic strains, or additional drug treatments, to name a few. These additional states may be chosen based on prior biological knowledge to elicit specific responses in otherwise unresponsive cells, or they may be chosen more or less at random with the knowledge that the resulting additional diversity in the augmented response profile will tend to allow better discrimination, on average. Techniques to change the initial state and possibly elicit responses include, for example, inhibiting drug efflux pumps or enhancing cell wall permeability by genetic modification of the organism, growing in nutrient-poor media, growing on plates vs. in volume culture, adding certain trace elements or minerals to the media, using haploid, diploid, and heterozygous background strains, activating pathways such as the mating pathway which have widespread effects on cell state and are likely to change the responsiveness to the stimuli that are being compared.

Robust augmented profile comparison has wide ranging applicability, such as providing a method for robust discrimination of drug activities or disease states in vivo. In such applications, multiple conditions are provided by following a patient in time or through other environmental or medical insults and by concatenation of the multiple profiles obtained under these different host conditions. Profiles may be expressed as departure profiles from baseline states by forming the ratio or log(ratio) of constituent levels with respect to a baseline state, or any second perturbation.

Mathematically, comparisons of augmented profiles are done in a manner that is analogous to the comparison of profiles obtained in a single state as described in Section 5.2. The concatenated profile may be written P = [p1; p2; ...; pN] of length NL, where p1 is the profile in the first state, N is the number of states and L is the number of cellular constituents measured in a single state. Measures of similarity, such as a generalized dot product,

$$r_{ij} = P_i \cdot P_j / (|P_i||P_j|) \qquad (12)$$

can be defined on the concatenated profiles just as they would be defined on single-state profiles. In Equation (12), $\cdot$ denotes the dot product and $|\;|$ denotes the vector norm (length). Many other quantitative measures of similarity are possible, such as Shannon mutual information [C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949], or modifications of Equation (12) where elements of the profiles are set to "1" ("-1") if they exceed a positive (negative) threshold and "0" if they do not.

These measures of similarity then support searches of augmented-profile libraries for the profile most similar to a query profile, and clustering of sets of augmented profiles into groups that are likely to share characteristics like toxicity or effectiveness.
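The formation of augmented profiles, the similarity measure of Equation 12, its thresholded variant, and a library search may be sketched as follows. All function names, the two-state example, and the profile values are hypothetical and serve only to illustrate why concatenation across states can separate perturbations that look alike in a single state.

```python
import math

def augment(profiles_by_state):
    """Concatenate single-state profiles [p1; p2; ...; pN] into one vector."""
    return [value for profile in profiles_by_state for value in profile]

def similarity(p, q):
    """Normalized dot product of two (augmented) profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def ternarize(profile, threshold):
    """Thresholded variant: 1 above +threshold, -1 below -threshold, else 0."""
    return [1 if v > threshold else -1 if v < -threshold else 0 for v in profile]

def most_similar(query, library):
    """Name of the library profile with the highest similarity to the query."""
    return max(library, key=lambda name: similarity(query, library[name]))

# Two hypothetical drugs with identical responses in state 1 but not in state 2.
state1_a, state2_a = [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]
state1_b, state2_b = [1.0, 0.0, 0.0], [0.0, 0.0, -2.0]
library = {"drug_a": augment([state1_a, state2_a]),
           "drug_b": augment([state1_b, state2_b])}

query = augment([[0.9, 0.1, 0.0], [0.0, 1.8, 0.1]])
best_match = most_similar(query, library)
query_ternary = ternarize(query, threshold=0.5)
```

In state 1 alone the two library entries are indistinguishable, so a query could not be assigned to either; with the state 2 responses appended, the query is clearly closer to one entry than to the other.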

5.3.2 ILLUSTRATIVE DRUG DISCOVERY APPLICATIONS

Robust discrimination of augmented profiles has wide applicability to several aspects of drug discovery as outlined in the following sections.

5.3.2.1 DRUG CANDIDATE LEAD SELECTION

The methods of the present invention have applicability to the field of drug candidate lead selection. In many drug discovery efforts, a target enzyme will be screened against a large library of proprietary and/or nonproprietary compounds. Such a screening effort is referred to as a primary assay. Primary assays are often reduced to a robotic format in which thousands of compounds are screened per day. These efforts will result in a large number of compounds that produce the desired activity, which is typically the inhibition of the activity of a selected target enzyme.

Compounds that are successful in the primary assay are typically called hits or leads. Hits from the primary assay are typically screened in appropriately designed secondary assays. While the format of the secondary assay may vary depending on the scope of the drug discovery project, a typical secondary assay includes the dose response of a compound on whole cells. Thus in such a cell-based assay, the presence of some cellular constituent, such as TNF secretion, is measured as the cells are incubated in increasing concentrations of test compound.

In order to measure the suitability of a test compound, secondary assays are typically used to compare the activity of hits from the primary assay with the activity of some reference compound. The reference compound may be one that has proven efficacy in the appropriate clinical setting, a known drug or simply a prior lead. Comparison of newly developed compounds against the active reference compound serves as an excellent tool for marking progress and for determining what is to be expected of new compounds.

In one aspect, the methods of the present invention will serve as an improved secondary assay. Accordingly, the effect of dosing an appropriate cell line with a reference compound can be compared to the effect of dosing the same cell line with each of the hits from the primary screening assay. In this embodiment, appropriate cellular constituents of the cell line can be measured using any of the techniques described in this specification or known in the art. Further, these measurements can be done when the cell line is placed in a variety of different initial biological states. For example, cell response profiles can be measured when the reference compound has been contacted with the cells after they have been cultured in a variety of cell culture densities, temperatures, or other culture conditions. These response profiles are combined to form a reference augmented profile. Similar augmented profiles are created for each of the hits from the primary screening assay and these augmented profiles are compared with the reference augmented profile. By comparing augmented profiles generated from each compound of interest rather than individual response or projected profiles, subtle differences between the effects of each test compound can be detected. Even small changes in cellular constituents associated with a known toxicity or a desired physiologic event will become statistically meaningful using the methods of this invention.

5.3.2.2 DRUG CANDIDATE VALIDATION

Often in the drug discovery process, a potential drug candidate will exhibit excellent activity in the primary in vitro assay and secondary cell-based assays. Even if a compound is successful in both primary and secondary assays, there remains a need to validate the compound. Compound validation addresses the difficult issue of verifying that a test compound was successful in the primary and secondary assays because of selective effects on the desired target rather than unselective effects on multiple physiological processes. Compounds that selectively affect the desired target are preferred over compounds that unselectively affect a wide variety of cellular constituents. For example, a compound that is excessively hydrophobic may bind to the target enzyme by unselective hydrophobic interactions. The problem with such an excessively hydrophobic compound is that it is likely to unselectively bind and/or inhibit several cellular constituents as well. Compounds that nonselectively inhibit all enzymes in a class are also undesirable. For example, in addition to inhibiting a target kinase of interest, a nonselective kinase inhibitor such as staurosporine will bind and inhibit dozens of kinases. A test compound may perform well in the secondary assay because it is toxic to the cells or because the compound knocked out a biological pathway that is unrelated to the biological pathway of interest.

The methods of the present invention provide improved means for validating test compounds in a drug discovery effort. In this embodiment of the invention, augmented profiles (reference augmented profiles) based on compounds that have a known effect on a biological sample are compared with augmented profiles generated from compounds that need validation. For example, reference compounds that have a general toxic effect on the biological sample will have distinct augmented profiles. Thus a low correlation between such reference toxic compounds and test compounds of interest is desired. Similarly, a high correlation between an augmented profile derived from a previously validated compound and that of a test compound would indicate that the test compound is selectively influencing the proper biological pathway. A previously validated compound may be obtained from animal trials or from prior scientific publications.

5.3.2.3 DRUG REGIMEN OPTIMIZATION IN A VARIETY OF PATIENT POPULATIONS

The methods of the present invention provide improved methods for optimizing drug regimens in a variety of patient populations. In one embodiment, augmented profiles developed from biological samples obtained from a patient can be compared with reference augmented profiles that represent model drug responses of patients with favorable clinical outcome. Data derived from such comparisons would then be used to optimize a particular drug regimen thus maximizing the effectiveness of drug treatment and reducing its costs in terms of response time and financial expenditure.

The augmented profiles taken from patients can also be used to discover unsatisfactory therapeutic responses caused by inadequate drug exposure or undesirable side effects before they manifest in unfavorable symptoms. Robust augmented profile comparison can also be used to detect poor compliance with a dosage regimen. In another embodiment, regular comparison of augmented profiles can be used to detect and monitor interactions with co-ingested medications or the effects of changes in the physical status of the patient.

5.3.3 ILLUSTRATIVE DIAGNOSTIC APPLICATIONS

5.3.3.1 PREVENTIVE HEALTH CARE

Because of its improved ability at measuring the subtle effects of a perturbation on a biological sample, comparison of augmented profiles will provide an invaluable service in the field of preventive health care. In one embodiment of the invention, biological samples are obtained from subjects on a routine basis over time. Augmented profiles are developed based upon these biological samples. Comparison of these augmented profiles to a database that includes several model disease states provides advance warning that the subject has a particular disease before the disease manifests itself in any outward clinical symptoms.

Such a diagnostic tool is particularly valuable in diseases such as cancer because early treatment leads to improved chances of recovery and/or survival. Appropriately chosen augmented profile comparisons will also provide useful information on health risks in a subject. Thus appropriately designed augmented profiles will be used to determine if a patient should alter their diet, exercise more, take certain vitamins, or alter other behavioral aspects. As the database of reference augmented profiles is enriched, the utility of the robust profile comparison method will increase.

5.3.3.2 DISEASE MONITORING IN A VARIETY OF PATIENT POPULATIONS

Robust profile comparison has utility in the field of disease monitoring. For example, robust comparison of an augmented profile obtained from a cancer patient before and after the start of a drug therapy regimen can provide a physician with valuable information about the effects that a particular cancer drug regimen has on a patient. According to the methods of this invention, such augmented profiles are compared with a database of augmented profiles to determine if the subject's augmented profile correlates with those of patients in which the drug had a positive effect, no effect, toxic side effects, or any combination thereof. In another aspect of the invention, augmented profiles are used to track the health of an AIDS patient. Currently, biological markers such as T cell count are used as a crude indicator of the progression of the AIDS virus. However, it is difficult to determine the progression of the disease because it approaches dormancy for unknown periods of time due to the potency of drugs such as Epivir®, Crixivan®, Retrovir®, Viracept®, and Zerit®. The robust augmented profile comparison of AIDS patients while they are taking these drugs will provide an improved means for tracking disease progression. Such augmented profiles will further provide useful information to the physician on which combination of currently available drugs is optimal for any given patient.

5.3.3.3 PREDICTION OF CLINICAL OUTCOME OF A PATIENT

Related to disease monitoring is the prediction of clinical outcome of a patient. Often when a patient is diagnosed with a life-threatening disease such as cancer, the practitioner is forced to rely on crude survival rate statistics based on past survival rates of similarly afflicted patients to advise the patient. For example, a woman who has breast cancer in which the cancer has not metastasized to bone will be advised that her chances of survival are equivalent to the chances of survival of other women with breast cancer that has not metastasized to bone. However, advice based on survival statistics often fails to take into account a variety of environmental variables such as age, race, health habits, genetic susceptibility, and/or the altered expression levels of genes that are indicative of the diseased state. Studies that provide improved mortality statistics based on relevant environmental variables such as age, race, health habits, or genetic susceptibility are piecemeal at best. Further, such studies are done on a general basis. For example, a study may survey the effects of smoking or race on survival. But such studies do not develop comprehensive profiles of a smoker wherein the expression levels of thousands of genes are tracked. Thus even studies that attempt to correlate environmental variables such as race, smoking, etc. have limited utility in the field of accurate clinical outcome prediction.

In another embodiment of the present invention, augmented profiles taken from patients throughout the various stages of a disease may be stored in national and/or proprietary databases. Then, using the robust profile comparison methods of the present invention, augmented profiles obtained from new patients may be compared with profiles in the database.

5.4 METHODS FOR DETERMINING BIOLOGICAL RESPONSE PROFILES

This invention utilizes the ability to measure the responses of a biological sample to a large variety of perturbations. This section provides some exemplary methods for measuring biological responses.

5.4.1 TRANSCRIPT ASSAY USING DNA ARRAY

The transcriptional rate can be measured by techniques of hybridization to arrays of nucleic acid or nucleic acid mimic probes, described in the next subsection, or by other gene expression technologies, such as those described in the subsequent subsection. However measured, the result is either the absolute or relative amounts of transcripts, or response data including values representing RNA abundance ratios, which usually reflect DNA expression ratios (in the absence of differences in RNA degradation rates). In various alternative embodiments, aspects of the biological state other than the transcriptional state, such as the translational state, the activity state, or mixed aspects can be measured.

Preferably, measurement of the transcriptional state is made by hybridization to transcript arrays, which are described in this subsection. Certain other methods of transcriptional state measurement are described later in this subsection.

In a preferred embodiment the present invention makes use of "transcript arrays" (also called herein "microarrays"). Transcript arrays can be employed for analyzing the transcriptional state in a biological sample and especially for measuring the transcriptional states of a biological sample exposed to graded levels of a drug of interest or to graded perturbations to a biological pathway of interest.

In one embodiment, transcript arrays are produced by hybridizing detectably labeled polynucleotides representing the mRNA transcripts present in a cell (e.g., fluorescently labeled cDNA synthesized from total cell mRNA) to a microarray. A microarray is a surface with an ordered array of binding (e.g., hybridization) sites for products of many of the genes in the genome of a cell or organism, preferably most or almost all of the genes. Microarrays can be made in a number of ways, of which several are described below. However produced, microarrays share certain preferred characteristics: The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other. Preferably the microarrays are small, usually smaller than 5 cm², and they are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions. A given binding site or unique set of binding sites in the microarray will specifically bind the product of a single gene in the cell. Although there may be more than one physical binding site (hereinafter "site") per specific mRNA, for the sake of clarity the discussion below will assume that there is a single site.

It will be appreciated that when cDNA complementary to the RNA of a cell is made and hybridized to a microarray under suitable hybridization conditions, the level of hybridization to the site in the array corresponding to any particular gene will reflect the prevalence in the cell of mRNA transcribed from that gene. For example, when detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to a gene (i.e., capable of specifically binding the product of the gene) that is not transcribed in the cell will have little or no signal (e.g., fluorescent signal), and a gene for which the encoded mRNA is prevalent will have a relatively strong signal.

In preferred embodiments, cDNAs from two different cells are hybridized to the binding sites of the microarray. In the case of drug responses, one biological sample is exposed to a drug and another biological sample of the same type is not exposed to the drug. In the case of pathway responses, one cell is exposed to a pathway perturbation and another cell of the same type is not exposed to the pathway perturbation. The cDNA derived from each of the two cell types is differently labeled so that the two can be distinguished. In one embodiment, cDNA from a cell treated with a drug (or exposed to a pathway perturbation) is synthesized using a fluorescein-labeled dNTP, and cDNA from a second cell, not drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two cDNAs are mixed and hybridized to the microarray, the relative intensity of signal from each cDNA set is determined for each site on the array, and any relative difference in abundance of a particular mRNA is detected.

In the example described above, the cDNA from the drug-treated (or pathway perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from the untreated cell will fluoresce red. As a result, when the drug treatment has no effect, either directly or indirectly, on the relative abundance of a particular mRNA in a cell, the mRNA will be equally prevalent in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally prevalent. When hybridized to the microarray, the binding site(s) for that species of RNA will emit wavelengths characteristic of both fluorophores (and appear brown in combination). In contrast, when the drug-exposed cell is treated with a drug that, directly or indirectly, increases the prevalence of the mRNA in the cell, the ratio of green to red fluorescence will increase. When the drug decreases the mRNA prevalence, the ratio will decrease. The use of a two-color fluorescence labeling and detection scheme to define alterations in gene expression has been described, e.g., in Schena et al., 1995, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270:467-470. An advantage of using cDNA labeled with two different fluorophores is that a direct and internally controlled comparison of the mRNA levels corresponding to each arrayed gene in two cell states can be made, and variations due to minor differences in experimental conditions (e.g., hybridization conditions) will not affect subsequent analyses. However, it will be recognized that it is also possible to use cDNA from a single cell, and compare, for example, the absolute amount of a particular mRNA in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell.
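
As a non-limiting illustration, the green-to-red comparison described above may be expressed as a base-2 logarithm of the intensity ratio at a site, so that increases and decreases in mRNA prevalence are symmetric about zero. The logarithmic form, the function name `log_ratio`, and the `floor` parameter (a hypothetical minimum intensity that guards against division by zero) are assumptions made for this sketch.

```python
from math import log2

def log_ratio(green, red, floor=1.0):
    """Base-2 log of the green/red intensity ratio at one array site.

    green: signal from the drug-treated (fluorescein-labeled) sample
    red:   signal from the untreated (rhodamine-labeled) sample
    floor: assumed minimum intensity, applied to both channels
    """
    g = max(green, floor)
    r = max(red, floor)
    return log2(g / r)
```

Under this convention, a value of zero indicates that the mRNA is equally prevalent in both cells, a positive value indicates that the drug increased prevalence, and a negative value indicates a decrease.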

5.4.1.1 PREPARATION OF MICROARRAYS

Microarrays are known in the art and consist of a surface to which probes that correspond in sequence to gene products (e.g., cDNAs, mRNAs, cRNAs, polypeptides, and fragments thereof) can be specifically hybridized or bound at a known position. In one embodiment, the microarray is an array (i.e., a matrix) in which each position represents a discrete binding site for a product encoded by a gene (e.g., a protein or RNA), and in which binding sites are present for products of most or almost all of the genes in the organism's genome. In a preferred embodiment, the "binding site" (hereinafter, "site") is a nucleic acid or nucleic acid analogue to which a particular cognate cDNA can specifically hybridize. The nucleic acid or analogue of the binding site can be, e.g., a synthetic oligomer, a full-length cDNA, a less-than full length cDNA, or a gene fragment.

Although in a preferred embodiment the microarray contains binding sites for products of all or almost all genes in the target organism's genome, such comprehensiveness is not necessarily required. Usually the microarray will have binding sites corresponding to at least about 50% of the genes in the genome, often at least about 75%, more often at least about 85%, even more often more than about 90%, and most often at least about 99%. Preferably, the microarray has binding sites for genes relevant to the action of a drug of interest or in a biological pathway of interest. A "gene" is an open reading frame (ORF) of preferably at least 50, 75, or 99 amino acids from which a messenger RNA is transcribed in the organism (e.g., if a single cell) or in some cell in a multicellular organism. The number of genes in a genome can be estimated from the number of mRNAs expressed by the organism, or by extrapolation from a well-characterized portion of the genome. When the genome of the organism of interest has been sequenced, the number of ORFs can be determined and mRNA coding regions identified by analysis of the DNA sequence. For example, the Saccharomyces cerevisiae genome has been completely sequenced and is reported to have approximately 6275 open reading frames (ORFs) longer than 99 amino acids. Analysis of these ORFs indicates that there are 5885 ORFs that are likely to specify protein products (Goffeau et al., 1996, Life with 6000 genes, Science 274:546-567, which is incorporated by reference in its entirety for all purposes). In contrast, the human genome is estimated to contain approximately 10⁵ genes.

5.4.1.2 PREPARING NUCLEIC ACIDS FOR MICROARRAYS

As noted above, the "binding site" to which a particular cognate cDNA specifically hybridizes is usually a nucleic acid or nucleic acid analogue attached at that binding site. In one embodiment, the binding sites of the microarray are DNA polynucleotides corresponding to at least a portion of each gene in an organism's genome. These DNAs can be obtained by, e.g., polymerase chain reaction (PCR) amplification of gene segments from genomic DNA, cDNA (e.g., by RT-PCR), or cloned sequences. PCR primers are chosen, based on the known sequence of the genes or cDNA, that result in amplification of unique fragments (i.e., fragments that do not share more than 10 bases of contiguous identical sequence with any other fragment on the microarray). Computer programs are useful in the design of primers with the required specificity and optimal amplification properties. In the case of binding sites corresponding to very long genes, it will sometimes be desirable to amplify segments near the 3' end of the gene so that when oligo-dT primed cDNA probes are hybridized to the microarray, less-than-full length probes will bind efficiently. Typically each gene fragment on the microarray will be between about 50 bp and about 2000 bp, more typically between about 100 bp and about 1000 bp, and usually between about 300 bp and about 800 bp in length. PCR methods are well known and are described, for example, in Innis et al. eds., 1990, PCR Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA, which is incorporated by reference in its entirety for all purposes. An alternative means for generating the nucleic acid for the microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using N-phosphonate or phosphoramidite chemistries (Froehler et al., 1986, Nucleic Acid Res 14:5399-5407; McBride et al., 1983, Tetrahedron Lett. 24:245-248). Synthetic sequences are between about 15 and about 500 bases in length, more typically between about 20 and about 50 bases. In some embodiments, synthetic nucleic acids include non-natural bases, e.g., inosine. As noted above, nucleic acid analogues may be used as binding sites for hybridization. An example of a suitable nucleic acid analogue is peptide nucleic acid (see, e.g., Egholm et al., 1993, PNA hybridizes to complementary oligonucleotides obeying the Watson-Crick hydrogen-bonding rules, Nature 365:566-568; see also U.S. Patent No. 5,539,083).
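
The uniqueness criterion stated above, that no two fragments share more than 10 bases of contiguous identical sequence, can be checked by a computer program. The following is a minimal sketch under simplifying assumptions: sequences are given as plain strings on the same strand, reverse complements are not considered, and the function name `shares_long_run` is hypothetical.

```python
def shares_long_run(seq_a, seq_b, max_shared=10):
    """Return True if the two sequences share a contiguous identical run
    longer than `max_shared` bases (same strand only; an assumption)."""
    k = max_shared + 1  # any longer shared run contains a shared k-mer
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    return any(seq_b[i:i + k] in kmers_a for i in range(len(seq_b) - k + 1))
```

A candidate fragment for which this check returns True against any fragment already on the microarray would be rejected, and a different segment of the gene would be chosen for amplification.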

In an alternative embodiment, the binding (hybridization) sites are made from plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts therefrom (Nguyen et al., 1995, Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones, Genomics 29:207-209). In yet another embodiment, the polynucleotide of the binding sites is RNA.

5.4.1.3 ATTACHING NUCLEIC ACIDS TO THE SOLID SURFACE

The nucleic acid or analogue is attached to a solid support, which may be made from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, or other materials. A preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al., 1995, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270:467-470. This method is especially useful for preparing microarrays of cDNA. See also DeRisi et al., 1996, Use of a cDNA microarray to analyze gene expression patterns in human cancer, Nature Genetics 14:457-460; Shalon et al., 1996, A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization, Genome Res. 6:639-645; and Schena et al., 1995, Parallel human genome analysis; microarray-based expression of 1000 genes, Proc. Natl. Acad. Sci. USA 93:10539-11286.

A second preferred method for making microarrays is by making high-density oligonucleotide arrays. Techniques are known for producing arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface using photolithographic techniques for synthesis in situ (see Fodor et al., 1991, Light-directed spatially addressable parallel chemical synthesis, Science 251:767-773; Pease et al., 1994, Light-directed oligonucleotide arrays for rapid DNA sequence analysis, Proc. Natl. Acad. Sci. USA 91:5022-5026; Lockhart et al., 1996, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotech 14:1675; U.S. Patent Nos. 5,578,832; 5,556,752; and 5,510,270, each of which is incorporated by reference in its entirety for all purposes) or other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et al., 1996, High-Density Oligonucleotide arrays, Biosensors & Bioelectronics 11:687-90). When these methods are used, oligonucleotides (e.g., 20-mers) of known sequence are synthesized directly on a surface such as a derivatized glass slide. Usually, the array produced contains multiple probes against each target transcript. Oligonucleotide probes can be chosen to detect alternatively spliced mRNAs or to serve as various types of control.

Another preferred method of making microarrays is by use of an inkjet printing process to synthesize oligonucleotides directly on a solid phase, as described, e.g., in co-pending U.S. patent application Serial No. 09/008,120 filed on January 16, 1998, by Blanchard entitled "Chemical Synthesis Using Solvent Microdroplets", which is incorporated by reference herein in its entirety.

Other methods for making microarrays, e.g., by masking (Maskos and Southern, 1992, Nuc. Acids Res. 20:1679-1684), may also be used. In principle, any type of array, for example, dot blots on a nylon hybridization membrane (see Sambrook et al., Molecular Cloning - A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1989), could be used, although, as will be recognized by those of skill in the art, very small arrays will be preferred because hybridization volumes will be smaller.

5.4.1.4 GENERATING LABELED PROBES

Methods for preparing total and poly(A)+ RNA are well known and are described generally in Sambrook et al., supra. In one embodiment, RNA is extracted from cells of the various types of interest in this invention using guanidinium thiocyanate lysis followed by CsCl centrifugation (Chirgwin et al., 1979, Biochemistry 18:5294-5299). Poly(A)+ RNA is selected using oligo-dT cellulose (see Sambrook et al., supra). Cells of interest include wild-type cells, drug-exposed wild-type cells, modified cells, and drug-exposed modified cells.

Labeled cDNA is prepared from mRNA by oligo dT-primed or random-primed reverse transcription, both of which are well known in the art (see, e.g., Klug and Berger, 1987, Methods Enzymol. 152:316-325). Reverse transcription may be carried out in the presence of a dNTP conjugated to a detectable label, most preferably a fluorescently labeled dNTP. Alternatively, isolated mRNA can be converted to labeled antisense RNA synthesized by in vitro transcription of double-stranded cDNA in the presence of labeled dNTPs (Lockhart et al., 1996, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotech. 14:1675, which is incorporated by reference in its entirety for all purposes). In alternative embodiments, the cDNA or RNA probe can be synthesized in the absence of detectable label and may be labeled subsequently, e.g., by incorporating biotinylated dNTPs or rNTP, or some similar means (e.g., photo-cross-linking a psoralen derivative of biotin to RNAs), followed by addition of labeled streptavidin (e.g., phycoerythrin-conjugated streptavidin) or the equivalent.

When fluorescently-labeled probes are used, many suitable fluorophores are known, including fluorescein, lissamine, phycoerythrin, rhodamine (Perkin Elmer Cetus), Cy2, Cy3, Cy3.5, Cy5, Cy5.5, Cy7, FluorX (Amersham) and others (see, e.g., Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic Press San Diego, CA). It will be appreciated that pairs of fluorophores are chosen that have distinct emission spectra so that they can be easily distinguished.

In another embodiment, a label other than a fluorescent label is used. For example, a radioactive label, or a pair of radioactive labels with distinct emission spectra, can be used (see Zhao et al., 1995, High density cDNA filter analysis: a novel approach for large-scale, quantitative analysis of gene expression, Gene 156:207; Pietu et al., 1996, Novel gene transcripts preferentially expressed in human muscles revealed by quantitative hybridization of a high density cDNA array, Genome Res. 6:492). However, because of scattering of radioactive particles, and the consequent requirement for widely spaced binding sites, use of radioisotopes is a less-preferred embodiment.

In one embodiment, labeled cDNA is synthesized by incubating a mixture containing 0.5 mM dGTP, dATP and dCTP plus 0.1 mM dTTP plus fluorescent deoxyribonucleotides (e.g., 0.1 mM Rhodamine 110 UTP (Perkin Elmer Cetus) or 0.1 mM Cy3 dUTP (Amersham)) with reverse transcriptase (e.g., Superscript™ II, LTI Inc.) at 42° C for 60 min.

5.4.1.5 HYBRIDIZATION TO MICROARRAYS

Nucleic acid hybridization and wash conditions are optimally chosen so that the probe "specifically binds" or "specifically hybridizes" to a specific array site, i.e., the probe hybridizes, duplexes or binds to a sequence array site with a complementary nucleic acid sequence but does not hybridize to a site with a non-complementary nucleic acid sequence. One polynucleotide sequence is considered complementary to another when, if the shorter of the polynucleotides is less than or equal to 25 bases, there are no mismatches using standard base-pairing rules or, if the shorter of the polynucleotides is longer than 25 bases, there is no more than a 5% mismatch. Preferably, the polynucleotides are perfectly complementary (no mismatches). It can easily be demonstrated that specific hybridization conditions result in specific hybridization by carrying out a hybridization assay including negative controls (see, e.g., Shalon et al., supra, and Chee et al., supra).

Optimal hybridization conditions will depend on the length (e.g., oligomer versus polynucleotide greater than 200 bases) and type (e.g., RNA, DNA, PNA) of labeled probe and immobilized polynucleotide or oligonucleotide. General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al., supra, and in Ausubel et al., 1987, Current Protocols in Molecular Biology, Greene Publishing and Wiley-Interscience, New York. When the cDNA microarrays of Schena et al. are used, typical hybridization conditions are hybridization in 5 X SSC plus 0.2% SDS at 65° C for 4 hours followed by washes at 25° C in low stringency wash buffer (1 X SSC plus 0.2% SDS) followed by 10 minutes at 25° C in high stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Schena et al., 1996, Proc. Natl. Acad. Sci. USA, 93:10614). Useful hybridization conditions are also provided in, e.g., Tijssen, 1993, Hybridization With Nucleic Acid Probes, Elsevier Science Publishers B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic Press San Diego, CA.

5.4.1.6 SIGNAL DETECTION AND DATA ANALYSIS

When fluorescently labeled probes are used, the fluorescence emissions at each site of a transcript array are preferably detected by scanning confocal laser microscopy. In one embodiment, a separate scan, using the appropriate excitation line, is carried out for each of the two fluorophores used. Alternatively, a laser may be used that allows simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al., 1996, A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization, Genome Research 6:639-645, which is incorporated by reference in its entirety for all purposes). In a preferred embodiment, the arrays are scanned with a laser fluorescent scanner with a computer controlled X-Y stage and a microscope objective. Sequential excitation of the two fluorophores is achieved with a multi-line, mixed gas laser and the emitted light is split by wavelength and detected with two photomultiplier tubes. Fluorescence laser scanning devices are described in Schena et al., 1996, Genome Res. 6:639-645 and in other references cited herein. Alternatively, the fiber-optic bundle described by Ferguson et al., 1996, Nature Biotech. 14:1681-1684, may be used to monitor mRNA abundance levels at a large number of sites simultaneously.

Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 bit analog to digital board. In one embodiment the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet of the average hybridization at each wavelength at each site. If necessary, an experimentally determined correction for "cross talk" (or overlap) between the channels for the two fluors may be made. For any particular hybridization site on the transcript array, a ratio of the emission of the two fluorophores can be calculated. The ratio is independent of the absolute expression level of the cognate gene, but is useful for genes whose expression is significantly modulated by drug administration, gene deletion, or any other tested event.

Accordingly, the relative abundance of an mRNA in two biological samples is scored as a perturbation and its magnitude determined (i.e., the abundance is different in the two sources of mRNA tested), or as not perturbed (i.e., the relative abundance is the same). In various embodiments, a difference between the two sources of RNA of at least a factor of about 25% (the RNA is 25% more abundant in one source than in the other source), more usually about 50%, even more often by a factor of about 2 (twice as abundant), 3 (three times as abundant) or 5 (five times as abundant) is scored as a perturbation.
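
The scoring rule described above can be illustrated by the following sketch, in which the threshold is expressed as a minimum fold difference (for example, 1.25 for a difference of about 25%, or 2.0 for a factor of about 2). The function name `score_perturbation`, the returned labels, and the treatment of the threshold as inclusive are assumptions made for the example.

```python
def score_perturbation(abundance_a, abundance_b, min_fold=2.0):
    """Score the relative abundance of one mRNA in two sources.

    Returns (label, fold), where fold >= 1 is the magnitude of the
    difference and label is "increased" or "decreased" (source a
    relative to source b) or "not perturbed".
    """
    if abundance_a <= 0 or abundance_b <= 0:
        raise ValueError("abundances must be positive")
    if min_fold <= 1:
        raise ValueError("min_fold must be greater than 1")
    ratio = abundance_a / abundance_b
    fold = ratio if ratio >= 1 else 1 / ratio
    if fold < min_fold:
        return ("not perturbed", fold)
    return ("increased" if ratio > 1 else "decreased", fold)
```

The returned fold value carries the magnitude of the perturbation, so that the same call both classifies the site and quantifies the change.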

Preferably, in addition to identifying a perturbation as positive or negative, it is advantageous to determine the magnitude of the perturbation. This can be carried out, as noted above, by calculating the ratio of the emission of the two fluorophores used for differential labeling, or by analogous methods that will be readily apparent to those of skill in the art.

5.4.2 PATHWAY RESPONSE AND GENESETS

Genesets can be determined by observing the gene expression response to perturbation of a particular pathway. For instance, transcript arrays reflecting the transcriptional state of a biological sample of interest are made by hybridizing a mixture of two differently labeled probes, each corresponding (i.e., complementary) to the mRNA of a different sample of interest, to the microarray. The two samples may be of the same type, i.e., of the same species and strain, but differ genetically at a limited number of loci. Alternatively, they are isogenic and differ in their environmental history (e.g., exposed to a drug versus not exposed). Genes whose expression is highly correlated may belong to a geneset.

In one aspect of the invention, gene expression change in response to a large number of perturbations is used to construct a clustering tree for the purpose of defining genesets. Preferably, the perturbations should target different pathways. In order to measure expression responses to the pathway perturbation, biological samples are subjected to graded perturbations to pathways of interest. The samples exposed to the perturbation and samples not exposed to the perturbation are used to construct transcript arrays, which are measured to find the mRNAs with modified expression and the degree of modification due to exposure to the perturbation. In this way, the perturbation-response relationship is obtained.

The density of levels of the graded drug exposure and graded perturbation control parameter is governed by the sharpness and structure in the individual gene responses - the steeper the steepest part of the response, the denser the levels needed to properly resolve the response.

Further, in order to reduce experimental error, it is preferable to reverse the fluorescent labels in two-color differential hybridization experiments to reduce biases peculiar to individual genes or array spot locations. In other words, it is preferable to first measure gene expression with one labeling (e.g., labeling perturbed cells with a first fluorochrome and unperturbed cells with a second fluorochrome) of the mRNA from the two cells being measured, and then to measure gene expression from the two cells with reversed labeling (e.g., labeling perturbed cells with the second fluorochrome and unperturbed cells with the first fluorochrome). Multiple measurements over exposure levels and perturbation control parameter levels provide additional experimental error control. With adequate sampling, a trade-off between averaging of errors and loss of structure in the response functions may be made when choosing the width of the spline function S used to interpolate response data.
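
One way in which the two measurements with reversed labeling might be combined is sketched below. The text above does not prescribe a particular formula; the sketch assumes that each measurement is expressed as a base-2 log ratio of perturbed to unperturbed signal and that the two log ratios are averaged, so that a multiplicative bias attached to one fluorochrome cancels. The function name is hypothetical.

```python
from math import log2

def dye_swap_log_ratio(fwd_perturbed, fwd_unperturbed,
                       rev_perturbed, rev_unperturbed):
    """Average the perturbed/unperturbed log2 ratios of two hybridizations.

    fwd_*: intensities with perturbed cells carrying the first fluorochrome
    rev_*: intensities with the labels reversed
    All intensities must be positive.
    """
    forward = log2(fwd_perturbed / fwd_unperturbed)
    reverse = log2(rev_perturbed / rev_unperturbed)
    return 0.5 * (forward + reverse)
```

As an example under these assumptions, if the true ratio is 2 and the first fluorochrome reads 1.5 times too bright at a given spot, the forward ratio is 3 and the reversed ratio is 4/3, and their averaged log ratio recovers a value of 1, i.e., a two-fold change.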

5.4.3 MEASUREMENT OF GRADED PERTURBATION RESPONSE DATA

To measure graded response data, the cells are exposed to graded levels of the drug or drug candidate of interest, or to graded strengths of another perturbation. When the cells are grown in vitro, the compound is usually added to their nutrient medium. In the case of yeast, it is preferable to harvest the yeast in early log phase, since expression patterns are relatively insensitive to time of harvest at that time. Several levels of the drug or other compounds may be added. The particular level employed depends on the particular characteristics of the drug, but usually will be between about 1 ng/ml and 100 mg/ml. In some cases a drug will be solubilized in a solvent such as DMSO.

The cells exposed to the drug and cells not exposed to the drug are used to construct transcript arrays, which are measured to find the mRNAs with altered expression due to exposure to the drug. Thereby, the drug response is obtained.

As with measurements of pathway responses, it is preferable for drug responses, in the case of two-color differential hybridization, to measure also with reversed labeling. Also, it is preferable that the levels of drug exposure used provide sufficient resolution (e.g., by using approximately 10 levels of drug exposure) of rapidly changing regions of the drug response.
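
A series of graded exposure levels might be generated as in the following sketch. Logarithmic spacing of the levels is an assumption made for the example (the text above specifies only the approximate number of levels and the usual concentration range), and the function name is hypothetical. In practice the levels would be concentrated in the rapidly changing regions of the drug response.

```python
def log_spaced_levels(low, high, n=10):
    """Return n exposure levels from low to high, evenly spaced on a log scale.

    low and high are concentrations in the same unit (e.g., ng/ml).
    """
    if low <= 0 or high <= low or n < 2:
        raise ValueError("require 0 < low < high and n >= 2")
    step = (high / low) ** (1.0 / (n - 1))
    return [low * step ** i for i in range(n)]
```

For instance, `log_spaced_levels(1.0, 1e8, n=9)` yields nine levels spanning 1 ng/ml to 100 mg/ml (expressed in ng/ml), each approximately ten times the previous one.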

5.4.4. OTHER METHODS OF TRANSCRIPTIONAL STATE MEASUREMENT

The transcriptional state of a cell may be measured by other gene expression technologies known in the art. Several such technologies produce pools of restriction fragments of limited complexity for electrophoretic analysis, such as methods combining double restriction enzyme digestion with phasing primers (see, e.g., European Patent 0 534858 A1, filed September 24, 1992, by Zabeau et al.), or methods selecting restriction fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al., 1996, Proc. Natl. Acad. Sci. USA 93:659-663). Other methods statistically sample cDNA pools, such as by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify each cDNA, or by sequencing short tags (e.g., 9-10 bases) which are generated at known positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science 270:484-487).

5.4.5 MEASUREMENT OF OTHER ASPECTS OF BIOLOGICAL STATE

To form projection and response profiles, aspects of the biological state other than the transcriptional state, such as the translational state, the activity state, or mixed aspects, can be measured in order to obtain drug and pathway responses. Details of these embodiments are described in this section.

5.4.5.1. EMBODIMENTS BASED ON TRANSLATIONAL STATE MEASUREMENTS

Measurement of the translational state may be performed according to several methods. For example, whole genome monitoring of protein (i.e., the "proteome," Goffeau et al., supra) can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome. Preferably, antibodies are present for a substantial fraction of the encoded proteins, or at least for those proteins relevant to the action of a drug of interest. Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane, 1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, New York, which is incorporated in its entirety for all purposes). In a preferred embodiment, monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence of the cell. With such an antibody array, proteins from the cell are contacted to the array and their binding is assayed with assays known in the art.

In another embodiment, proteins can be separated by two-dimensional gel electrophoresis systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al., 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al., 1996, Proc. Natl. Acad. Sci. USA 93:1440-1445; Sagliocco et al., 1996, Yeast 12:1519-1533; Lander, 1996, Science 274:536-539. The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N-terminal micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the proteins produced under given physiological conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.

5.4.5.2 EMBODIMENTS BASED ON OTHER ASPECTS OF THE BIOLOGICAL STATE

The methods of the invention are applicable to any cellular constituent that can be monitored. For instance, where activities of proteins relevant to the characterization of a perturbation, such as drug action, can be measured, profiles can be based on such measurements. Activity measurements can be performed by any functional, biochemical, or physical means appropriate to the particular activity being characterized. Where the activity involves a chemical transformation, the cellular protein can be contacted with the natural substrate(s), and the rate of transformation measured. Where the activity involves association in multimeric units, for example association of an activated DNA binding complex with DNA, the amount of associated protein or secondary consequences of the association, such as amounts of mRNA transcribed, can be measured. Also, where only a functional activity is known, for example, as in cell cycle control, performance of the function can be observed. However known and measured, the changes in protein activities form the response data.

In alternative and non-limiting embodiments, response data may be formed of mixed aspects of the biological state of a cell. Response data can be constructed from, e.g., changes in certain mRNA abundances, changes in certain protein abundances, and changes in certain protein activities.

5.5 METHOD FOR PROBING CELLULAR STATES

Methods for targeted perturbation of cellular states at various levels of a cell are increasingly known and applied in the art. Any such methods that are capable of specifically targeting and controllably modifying (e.g., either by a graded increase or activation or by a graded decrease or inhibition) specific cellular constituents (e.g., gene expression, RNA concentrations, protein abundances, protein activities, or so forth) can be employed in performing cellular state perturbations. Controllable modifications of cellular constituents consequentially controllably perturb cellular states originating at the modified cellular constituents. Preferable modification methods are capable of individually targeting each of a plurality of cellular constituents and most preferably a substantial fraction of such cellular constituents.

The following methods are exemplary of those that can be used to modify cellular constituents and produce cellular state perturbations that generate the cellular state responses. Cellular state perturbations may be made in cell types derived from any organism for which genomic or expressed sequence information is available and for which methods are available that permit controllable modification of the expression of specific genes. Genomic sequencing information is available for several eukaryotic organisms, including humans, nematodes, Arabidopsis, and Saccharomyces cerevisiae.

The exemplary methods described in the following include use of titratable expression systems, use of transfection or viral transduction systems, direct modifications to RNA abundances or activities, direct modifications of protein abundances, and direct modification of protein activities including use of drugs (or chemical moieties in general) with specific known action.

5.5.1 TITRATABLE EXPRESSION SYSTEMS

Any of the several known titratable, or equivalently controllable, expression systems available for use in the budding yeast Saccharomyces cerevisiae are useful (Mumberg et al., 1994, Regulatable promoter of Saccharomyces cerevisiae: comparison of transcriptional activity and their use for heterologous expression, Nucl. Acids Res. 22:5767-5768). Usually, gene expression is controlled by transcriptional controls, with the promoter of the gene to be controlled replaced on its chromosome by a controllable, exogenous promoter. The most commonly used controllable promoter in yeast is the GAL1 promoter (Johnston et al., 1984, Sequences that regulate the divergent GAL1-GAL10 promoter in Saccharomyces cerevisiae, Mol. Cell. Biol. 8:1440-1448). The GAL1 promoter is strongly repressed by the presence of glucose in the growth medium, and is gradually switched on in a graded manner to high levels of expression by the decreasing abundance of glucose and the presence of galactose. The GAL1 promoter usually allows a 5-100 fold range of expression control on a gene of interest.

Other frequently used promoter systems include the MET25 promoter (Kerjan et al., 1986, Nucleotide sequence of the Saccharomyces cerevisiae MET25 gene, Nucl. Acids. Res. 14:7861-7871), which is induced by the absence of methionine in the growth medium, and the CUP1 promoter, which is induced by copper (Mascorro-Gallardo et al., 1996, Construction of a CUP1 promoter-based vector to modulate gene expression in Saccharomyces cerevisiae, Gene 172:169-170). All of these promoter systems are controllable in that gene expression can be incrementally controlled by incremental changes in the abundances of a controlling moiety in the growth medium.

One disadvantage of the above listed expression systems is that control of promoter activity (effected by, e.g., changes in carbon source or removal of certain amino acids) often causes other changes in cellular physiology which independently alter the expression levels of other genes. A recently developed system for yeast, the Tet system, alleviates this problem to a large extent (Gari et al., 1997, A set of vectors with a tetracycline-regulatable promoter system for modulated gene expression in Saccharomyces cerevisiae, Yeast 13:837-848). The Tet promoter, adopted from mammalian expression systems (Gossen et al., 1995, Transcriptional activation by tetracyclines in mammalian cells, Proc. Nat. Acad. Sci. USA 89:5547-5551), is modulated by the concentration of the antibiotic tetracycline or the structurally related compound doxycycline. Thus, in the absence of doxycycline, the promoter induces a high level of expression, and the addition of increasing levels of doxycycline causes increased repression of promoter activity. Intermediate levels of gene expression can be achieved in the steady state by addition of intermediate levels of drug. Furthermore, levels of doxycycline that give maximal repression of promoter activity (10 micrograms/ml) have no significant effect on the growth rate of wild type yeast cells (Gari et al., 1997, A set of vectors with a tetracycline-regulatable promoter system for modulated gene expression in Saccharomyces cerevisiae, Yeast 13:837-848).
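The graded repression described above can be pictured with a simple dose-response curve. The sketch below uses a Hill-type repression function, which is a modeling assumption chosen for illustration and is not given in the text. The function name `tet_off_expression` and the parameter values (`k_half`, `hill`) are hypothetical and not taken from any cited measurement.

```python
def tet_off_expression(dox, max_expr=1.0, k_half=0.1, hill=2.0):
    """Relative steady-state expression from a doxycycline-repressed promoter.

    dox:    doxycycline concentration (e.g., micrograms/ml).
    k_half: concentration giving half-maximal expression (hypothetical).
    hill:   steepness of the repression curve (hypothetical).
    """
    if dox < 0:
        raise ValueError("concentration must be non-negative")
    return max_expr / (1.0 + (dox / k_half) ** hill)


# Expression falls monotonically as doxycycline is titrated in.
for dox in (0.0, 0.01, 0.1, 1.0, 10.0):
    print(f"{dox:>5} ug/ml -> {tet_off_expression(dox):.4f}")
```

Intermediate doxycycline levels map to intermediate expression levels, which is the property that makes the system titratable.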

In mammalian cells, several means of titrating expression of genes are available (Spencer, 1996, Creating conditional mutations in mammals, Trends Genet. 12:181-187). As mentioned above, the Tet system is widely used, both in its original form, the "forward" system, in which addition of doxycycline represses transcription, and in the newer "reverse" system, in which doxycycline addition stimulates transcription (Gossen et al., 1995, Proc. Natl. Acad. Sci. USA 89:5547-5551; Hoffmann et al., 1997, Nucl. Acids. Res. 25:1078-1079; Hofmann et al., 1996, Proc. Natl. Acad. Sci. USA 83:5185-5190; Paulus et al., 1996, Journal of Virology 70:62-67). Another commonly used controllable promoter system in mammalian cells is the ecdysone-inducible system developed by Evans and colleagues (No et al., 1996, Ecdysone-inducible gene expression in mammalian cells and transgenic mice, Proc. Nat. Acad. Sci. USA 93:3346-3351), where expression is controlled by the level of muristerone added to the cultured cells. In addition, expression can be modulated using the "chemical-induced dimerization" (CID) system developed by Schreiber, Crabtree, and colleagues (Belshaw et al., 1996, Controlling protein association and subcellular localization with a synthetic ligand that induces heterodimerization of proteins, Proc. Nat. Acad. Sci. USA 93:4604-4607; Spencer, 1996, Creating conditional mutations in mammals, Trends Genet. 12:181-187) and similar systems in yeast. In this system, the gene of interest is put under the control of the CID-responsive promoter, and transfected into cells expressing two different hybrid proteins. One hybrid protein comprises a DNA-binding domain fused to FKBP12, which binds FK506. The other hybrid protein contains a transcriptional activation domain also fused to FKBP12. The CID inducing molecule is FK1012, a homodimeric version of FK506 that is able to bind simultaneously both the DNA binding and transcriptional activating hybrid proteins. In the graded presence of FK1012, graded transcription of the controlled gene is activated.

For each of the mammalian expression systems described above, as is widely known to those of skill in the art, the gene of interest is put under the control of the controllable promoter, and a plasmid harboring this construct along with an antibiotic resistance gene is transfected into cultured mammalian cells. In general, the plasmid DNA integrates into the genome, and drug resistant colonies are selected and screened for appropriate expression of the regulated gene. Alternatively, the regulated gene can be inserted into an episomal plasmid such as pCEP4 (Invitrogen, Inc.), which contains components of the Epstein-Barr virus necessary for plasmid replication.

In a preferred embodiment, titratable expression systems, such as the ones described above, are introduced for use into cells or organisms lacking the corresponding endogenous gene and/or gene activity, e.g., organisms in which the endogenous gene has been disrupted or deleted. Methods for producing such "knock outs" are well known to those of skill in the art, see e.g., Pettitt et al., 1996, Development 122:4149-4157; Spradling et al., 1995, Proc. Natl. Acad. Sci. USA, 92:10824-10830; Ramirez-Solis et al., 1993, Methods Enzymol. 225:855-878; and Thomas et al., 1987, Cell 51:503-512.

5.5.2 TRANSFECTION SYSTEMS FOR MAMMALIAN CELLS

Transfection or viral transduction of target genes can introduce controllable perturbations in biological cellular states in mammalian cells. Preferably, transfection or transduction of a target gene can be used with cells that do not naturally express the target gene of interest. Such non-expressing cells can be derived from a tissue not normally expressing the target gene or the target gene can be specifically mutated in the cell. The target gene of interest can be cloned into one of many mammalian expression plasmids, for example, the pcDNA3.1 +/- system (Invitrogen, Inc.) or retroviral vectors, and introduced into the non-expressing host cells. Transfected or transduced cells expressing the target gene may be isolated by selection for a drug resistance marker encoded by the expression vector. The level of gene transcription is monotonically related to the transfection dosage. In this way, the effects of varying levels of the target gene may be investigated.

A particular example of the use of this method is the search for drugs that target the src-family protein tyrosine kinase, lck, a key component of the T cell receptor activation cellular state (Anderson et al., 1994, Involvement of the protein tyrosine kinase p56 (lck) in T cell signaling and thymocyte development, Adv. Immunol. 56:171-178). Inhibitors of this enzyme are of interest as potential immunosuppressive drugs (Hanke, 1996, Discovery of a Novel, Potent, and src family-selective tyrosine kinase inhibitor, J. Biol. Chem. 271:695-701). A specific mutant of the Jurkat T cell line (JCaM1) is available that does not express lck kinase (Straus et al., 1992, Genetic evidence for the involvement of the lck tyrosine kinase in signal transduction through the T cell antigen receptor, Cell 70:585-593). Therefore, introduction of the lck gene into JCaM1 by transfection or transduction permits specific perturbation of cellular states of T cell activation regulated by the lck kinase. The efficiency of transfection or transduction, and thus the level of perturbation, is dose related. The method is generally useful for providing perturbations of gene expression or protein abundances in cells not normally expressing the genes to be perturbed.

5.5.3 METHODS OF MODIFYING RNA ABUNDANCES OR ACTIVITIES

Methods of modifying RNA abundances and activities currently fall within three classes: ribozymes, antisense species, and RNA aptamers (Good et al., 1997, Gene Therapy 4:45-54). Controllable application or exposure of a cell to these entities permits controllable perturbation of RNA abundances.

Ribozymes are RNAs which are capable of catalyzing RNA cleavage reactions (Cech, 1987, Science 236:1532-1539; PCT International Publication WO 90/11364, published October 4, 1990; Sarver et al., 1990, Science 247:1222-1225). "Hairpin" and "hammerhead" RNA ribozymes can be designed to specifically cleave a particular target mRNA. Rules have been established for the design of short RNA molecules with ribozyme activity, which are capable of cleaving other RNA molecules in a highly sequence specific way and can be targeted to virtually all kinds of RNA (Haseloff et al., 1988, Nature 334:585-591; Koizumi et al., 1988, FEBS Lett., 228:228-230; Koizumi et al., 1988, FEBS Lett., 239:285-288). Ribozyme methods involve, e.g., exposing a cell to such small RNA ribozyme molecules or inducing their expression in a cell (Grassi and Marini, 1996, Annals of Medicine 28:499-510; Gibson, 1996, Cancer and Metastasis Reviews 15:287-299).

Ribozymes can be routinely expressed in vivo in sufficient number to be catalytically effective in cleaving mRNA, and thereby modifying mRNA abundances in a cell (Cotten et al., 1989, Ribozyme mediated destruction of RNA in vivo, The EMBO J. 8:3861-3866). In particular, a ribozyme coding DNA sequence, designed according to the previous rules and synthesized, for example, by standard phosphoramidite chemistry, can be ligated into a restriction enzyme site in the anticodon stem and loop of a gene encoding a tRNA, which can then be transformed into and expressed in a cell of interest by methods routine in the art. Preferably, an inducible promoter (e.g., a glucocorticoid or a tetracycline response element) is also introduced into this construct so that ribozyme expression can be selectively controlled. tDNA genes (i.e., genes encoding tRNAs) are useful in this application because of their small size, high rate of transcription, and ubiquitous expression in different kinds of tissues. Therefore, ribozymes can be routinely designed to cleave virtually any mRNA sequence, and a cell can be routinely transformed with DNA coding for such ribozyme sequences such that a controllable and catalytically effective amount of the ribozyme is expressed. Accordingly, the abundance of virtually any RNA species in a cell can be perturbed.

The activity of a target RNA (preferably mRNA) species, specifically its rate of translation, may be controllably inhibited by antisense nucleic acids. An "antisense" nucleic acid refers to a nucleic acid capable of hybridizing to a sequence-specific (e.g., non-poly A) portion of the target RNA, for example its translation initiation region, by virtue of some sequence complementarity to a coding and/or non-coding region. The antisense nucleic acids of the invention can be oligonucleotides that are double-stranded or single-stranded, RNA or DNA or a modification or derivative thereof, which can be directly administered in a controllable manner to a cell or which can be produced intracellularly by transcription of exogenous, introduced sequences in controllable quantities sufficient to perturb translation of the target RNA.

Antisense nucleic acids are typically at least six nucleotides and are preferably oligonucleotides (ranging from 6 to about 200 nucleotides). The oligonucleotides can be DNA or RNA or chimeric mixtures or derivatives or modified versions thereof, single-stranded or double-stranded. The oligonucleotide can be modified at the base moiety, sugar moiety, or phosphate backbone. The oligonucleotide may include other appending groups such as peptides, or agents facilitating transport across the cell membrane (see, e.g., Letsinger et al., 1989, Proc. Natl. Acad. Sci. U.S.A. 86:6553-6556; Lemaitre et al., 1987, Proc. Natl. Acad. Sci. 84:648-652; PCT Publication No. WO 88/09810, published December 15, 1988), hybridization-triggered cleavage agents (see, e.g., Krol et al., 1988, BioTechniques 6:958-976) or intercalating agents (see, e.g., Zon, 1988, Pharm. Res. 5:539-549).

Antisense oligonucleotides are typically in the form of single-stranded DNA. The oligonucleotide may be modified at any position on its structure with constituents generally known in the art. Antisense oligonucleotides may comprise at least one modified base moiety such as 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5'-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, (acp3)w, and 2,6-diaminopurine.

Antisense oligonucleotides may contain modified sugar moieties such as arabinose, 2-fluoroarabinose, xylulose, and hexose. Antisense oligonucleotides may contain modified phosphate backbones such as a phosphorothioate, a phosphorodithioate, a phosphoramidothioate, a phosphoramidate, a phosphordiamidate, a methylphosphonate, an alkyl phosphotriester, and a formacetal or analog thereof. An antisense oligonucleotide may be a 2-α-anomeric oligonucleotide. An α-anomeric oligonucleotide forms specific double-stranded hybrids with complementary RNA in which, contrary to the usual β-units, the strands run parallel to each other (Gautier et al., 1987, Nucl. Acids Res. 15:6625-6641). The oligonucleotide may be conjugated to another molecule, e.g., a peptide, hybridization-triggered cross-linking agent, transport agent, hybridization-triggered cleavage agent, etc.

Antisense nucleic acids comprise a sequence complementary to at least a portion of a target RNA species. However, absolute complementarity is not required. A sequence "complementary to at least a portion of an RNA," as referred to herein, means a sequence having sufficient complementarity to be able to hybridize with the RNA, forming a stable duplex; in the case of double-stranded antisense nucleic acids, a single strand of the duplex DNA may thus be tested, or triplex formation may be assayed. The ability to hybridize will depend on both the degree of complementarity and the length of the antisense nucleic acid. Generally, the longer the hybridizing nucleic acid, the more base mismatches with a target RNA it may contain and still form a stable duplex (or triplex, as the case may be). One skilled in the art can ascertain a tolerable degree of mismatch by use of standard procedures to determine the melting point of the hybridized complex. The amount of antisense nucleic acid that will be effective in inhibiting translation of the target RNA can be determined by standard assay techniques.
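The relationship between complementarity, mismatches, and duplex stability can be sketched as follows. The example derives the fully complementary DNA oligonucleotide for a short RNA segment, counts mismatches in a candidate oligonucleotide, and gives a rough melting-temperature estimate using the Wallace rule (2 °C per A/T, 4 °C per G/C), a rule of thumb for short oligonucleotides. The Wallace rule is substituted here purely for illustration; the text refers to standard procedures for determining the melting point, which are typically experimental. The target sequence and function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def antisense_dna(target_rna):
    """Return the DNA oligo (5'->3') fully complementary to an RNA segment (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(target_rna.upper()))


def count_mismatches(oligo, target_rna):
    """Count positions where oligo differs from the perfect antisense sequence."""
    perfect = antisense_dna(target_rna)
    if len(oligo) != len(perfect):
        raise ValueError("oligo and target segment must have equal length")
    return sum(1 for a, b in zip(oligo.upper(), perfect) if a != b)


def wallace_tm(oligo):
    """Rough melting temperature in degrees C (Wallace rule, short oligos only)."""
    s = oligo.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


target = "AUGGCUAGCUUA"          # hypothetical mRNA segment
oligo = antisense_dna(target)    # "TAAGCTAGCCAT"
print(oligo, count_mismatches(oligo, target), wallace_tm(oligo))
```

A base-composition rule of this kind does not account for mismatches; measuring or modeling the melting point of the actual mismatched duplex is needed to judge the tolerable degree of mismatch.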

Antisense oligonucleotides may be synthesized by standard methods known in the art, e.g., by use of an automated DNA synthesizer (such as are commercially available from Biosearch, Applied Biosystems, etc.). As examples, phosphorothioate oligonucleotides may be synthesized by the method of Stein et al. (1988, Nucl. Acids Res. 16:3209), methylphosphonate oligonucleotides can be prepared by use of controlled pore glass polymer supports (Sarin et al., 1988, Proc. Natl. Acad. Sci. U.S.A. 85:7448-7451), etc. In another embodiment, the oligonucleotide is a 2'-O-methylribonucleotide (Inoue et al., 1987, Nucl. Acids Res. 15:6131-6148), or a chimeric RNA-DNA analog (Inoue et al., 1987, FEBS Lett. 215:327-330).

The synthesized antisense oligonucleotides can then be administered to a cell in a controlled manner. For example, the antisense oligonucleotides can be placed in the growth environment of the cell at controlled levels where they may be taken up by the cell. The uptake of the antisense oligonucleotides can be assisted by use of methods well known in the art.

Alternatively, antisense nucleic acids are controllably expressed intracellularly by transcription from an exogenous sequence. For example, a vector can be introduced in vivo such that it is taken up by a cell, within which cell the vector or a portion thereof is transcribed, producing an antisense nucleic acid (RNA) of the invention. Such a vector would contain a sequence encoding the antisense nucleic acid. Such a vector can remain episomal or become chromosomally integrated, as long as it can be transcribed to produce the desired antisense RNA. Such vectors can be constructed by recombinant DNA technology methods standard in the art. Vectors can be plasmid, viral, or others known in the art, used for replication and expression in mammalian cells. Expression of the sequences encoding the antisense RNAs can be by any promoter known in the art to act in a cell of interest. Such promoters can be inducible or constitutive. Most preferably, promoters are controllable or inducible by the administration of an exogenous moiety in order to achieve controlled expression of the antisense oligonucleotide. Such controllable promoters include the Tet promoter. Less preferably usable promoters for mammalian cells include, but are not limited to: the SV40 early promoter region (Bernoist and Chambon, 1981, Nature 290:304-310), the promoter contained in the 3' long terminal repeat of Rous sarcoma virus (Yamamoto et al., 1980, Cell 22:787-797), the herpes thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. U.S.A. 78:1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al., 1982, Nature 296:39-42), etc. Therefore, antisense nucleic acids can be routinely designed to target virtually any mRNA sequence, and a cell can be routinely transformed with or exposed to nucleic acids coding for such antisense sequences such that an effective and controllable amount of the antisense nucleic acid is expressed. Accordingly, the translation of virtually any RNA species in a cell can be controllably perturbed.

Finally, in a further embodiment, RNA aptamers can be introduced into or expressed in a cell. RNA aptamers are specific RNA ligands for proteins, such as for Tat and Rev RNA (Good et al., 1997, Gene Therapy 4:45-54) that can specifically inhibit their translation.

5.5.4 METHODS OF MODIFYING PROTEIN ABUNDANCES

Methods of modifying protein abundances include, inter alia, those altering protein degradation rates and those using antibodies (which bind to proteins affecting abundances or activities of native target protein species). Increasing (or decreasing) the degradation rates of a protein species decreases (or increases) the abundance of that species. Methods for controllably increasing the degradation rate of a target protein in response to elevated temperature and/or exposure to a particular drug, which are known in the art, can be employed. For example, one such method employs a heat-inducible or drug-inducible N-terminal degron, which is an N-terminal protein fragment that exposes a degradation signal promoting rapid protein degradation at a higher temperature (e.g., 37° C) and which is hidden to prevent rapid degradation at a lower temperature (e.g., 23° C) (Dohmen et al., 1994, Science 263:1273-1276). Such an exemplary degron is Arg-DHFR^ts, a variant of murine dihydrofolate reductase in which the N-terminal Val is replaced by Arg and the Pro at position 66 is replaced with Leu. According to this method, a gene for a target protein, P, is replaced by standard gene targeting methods known in the art (Lodish et al., 1995, Molecular Biology of the Cell, W.H. Freeman and Co., New York, especially chapter 8) with a gene coding for the fusion protein Ub-Arg-DHFR^ts-P ("Ub" stands for ubiquitin). The N-terminal ubiquitin is rapidly cleaved after translation, exposing the N-terminal degron. At lower temperatures, lysines internal to Arg-DHFR^ts are not exposed, ubiquitination of the fusion protein does not occur, degradation is slow, and active target protein levels are high. At higher temperatures (in the absence of methotrexate), lysines internal to Arg-DHFR^ts are exposed, ubiquitination of the fusion protein occurs, degradation is rapid, and active target protein levels are low. Heat activation of degradation is controllably blocked by exposure to methotrexate. This method is adaptable to other N-terminal degrons which are responsive to other inducing factors, such as drugs and temperature changes.
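The inverse relationship between degradation rate and abundance can be made concrete with a minimal kinetic sketch. It assumes constant synthesis and first-order degradation, dP/dt = k_syn - k_deg * P, which gives a steady-state abundance of k_syn / k_deg. This model, the function names, and the rate values are assumptions for illustration and are not taken from the text.

```python
import math


def steady_state_abundance(k_syn, k_deg):
    """Steady-state abundance for constant synthesis and first-order decay."""
    if k_deg <= 0:
        raise ValueError("degradation rate must be positive")
    return k_syn / k_deg


def abundance_after_shift(p0, k_syn, k_deg, t):
    """Abundance at time t after the rates change, starting from abundance p0.

    Closed-form solution of dP/dt = k_syn - k_deg * P.
    """
    p_ss = steady_state_abundance(k_syn, k_deg)
    return p_ss + (p0 - p_ss) * math.exp(-k_deg * t)


# Hypothetical rates: degradation is 10-fold faster at the higher temperature.
low_temp = steady_state_abundance(k_syn=10.0, k_deg=0.1)
high_temp = steady_state_abundance(k_syn=10.0, k_deg=1.0)
print(low_temp, high_temp)

# Decay toward the new steady state after a shift to the higher temperature.
for t in (0.0, 1.0, 5.0):
    print(t, abundance_after_shift(low_temp, 10.0, 1.0, t))
```

Under this model, partially blocking degradation (for example, with an intermediate methotrexate level) would correspond to an intermediate k_deg and thus an intermediate abundance.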

Target protein abundances and, directly or indirectly, their activities can be decreased by (neutralizing) antibodies. By providing for controlled exposure to such antibodies, protein abundances/activities can be controllably modified. For example, antibodies to suitable epitopes on protein surfaces may decrease the abundance, and thereby indirectly decrease the activity, of the wild-type active form of a target protein by aggregating active forms into complexes with less or minimal activity as compared to the unaggregated wild-type form. Alternately, antibodies may directly decrease protein activity by, e.g., interacting directly with active sites or by blocking access of substrates to active sites. Conversely, in certain cases, (activating) antibodies may also interact with proteins and their active sites to increase resulting activity. In either case, antibodies (of the various types to be described) can be raised against specific protein species (by the methods to be described) and their effects screened. The effects of the antibodies can be assayed and suitable antibodies selected that raise or lower the target protein species concentration and/or activity. Such assays involve introducing antibodies into a cell (see below), and assaying the concentration or activities of the wild-type target protein by standard means (such as immunoassays) known in the art. The net activity of the wild-type form can be assayed by assay means appropriate to the known activity of the target protein.

Antibodies can be introduced into cells in numerous fashions, including, for example, microinjection of antibodies into a cell (Morgan et al., 1988, Immunology Today 9:84-86) or transforming hybridoma mRNA encoding a desired antibody into a cell (Burke et al., 1984, Cell 36:847-858). In a further technique, recombinant antibodies can be engineered and ectopically expressed in a wide variety of non-lymphoid cell types to bind to target proteins as well as to block target protein activities (Biocca et al., 1995, Trends in Cell Biology 5:248-252). Preferably, expression of the antibody is under control of a controllable promoter, such as the Tet promoter. A first step is the selection of a particular monoclonal antibody with appropriate specificity to the target protein (see below). Then sequences encoding the variable regions of the selected antibody can be cloned into various engineered antibody formats, including, for example, whole antibody, Fab fragments, Fv fragments, single chain Fv fragments (V_H and V_L regions united by a peptide linker) ("ScFv" fragments), diabodies (two associated ScFv fragments with different specificities), and so forth (Hayden et al., 1997, Current Opinion in Immunology 9:210-212). Intracellularly expressed antibodies of the various formats can be targeted into cellular compartments (e.g., the cytoplasm, the nucleus, the mitochondria, etc.) by expressing them as fusions with the various known intracellular leader sequences (Bradbury et al., 1995, Antibody Engineering (vol. 2) (Borrebaeck ed.), pp 295-361, IRL Press). In particular, the ScFv format appears to be particularly suitable for cytoplasmic targeting.

Antibody types include, but are not limited to, polyclonal, monoclonal, chimeric, single chain, Fab fragments, and an Fab expression library. Various procedures known in the art may be used for the production of polyclonal antibodies to a target protein. For production of the antibody, various host animals can be immunized by injection with the target protein; such host animals include, but are not limited to, rabbits, mice, rats, etc. Various adjuvants can be used to increase the immunological response, depending on the host species, and include, but are not limited to, Freund's (complete and incomplete), mineral gels such as aluminum hydroxide, surface active substances such as lysolecithin, pluronic polyols, polyanions, peptides, oil emulsions, dinitrophenol, and potentially useful human adjuvants such as bacillus Calmette-Guerin (BCG) and Corynebacterium parvum.

For preparation of monoclonal antibodies directed towards a target protein, any technique that provides for the production of antibody molecules by continuous cell lines in culture may be used. Such techniques include, but are not restricted to, the hybridoma technique originally developed by Kohler and Milstein (1975, Nature 256:495-497), the trioma technique, the human B-cell hybridoma technique (Kozbor et al., 1983, Immunology Today 4:72), and the EBV hybridoma technique to produce human monoclonal antibodies (Cole et al., 1985, in Monoclonal Antibodies and Cancer Therapy, Alan R. Liss, Inc., pp. 77-96). In an additional embodiment of the invention, monoclonal antibodies can be produced in germ-free animals utilizing recent technology (PCT/US90/02545). According to the invention, human antibodies may be used and can be obtained by using human hybridomas (Cote et al., 1983, Proc. Natl. Acad. Sci. USA 80:2026-2030), or by transforming human B cells with EBV virus in vitro (Cole et al., 1985, in Monoclonal Antibodies and Cancer Therapy, Alan R. Liss, Inc., pp. 77-96). In fact, according to the invention, techniques developed for the production of "chimeric antibodies" (Morrison et al., 1984, Proc. Natl. Acad. Sci. USA 81:6851-6855; Neuberger et al., 1984, Nature 312:604-608; Takeda et al., 1985, Nature 314:452-454) by splicing the genes from a mouse antibody molecule specific for the target protein together with genes from a human antibody molecule of appropriate biological activity can be used; such antibodies are within the scope of this invention.

Additionally, where monoclonal antibodies are advantageous, they can be alternatively selected from large antibody libraries using the techniques of phage display (Marks et al, 1992, J. Biol. Chem. 267:16007-16010). Using this technique, libraries of up to 10¹² different antibodies have been expressed on the surface of fd filamentous phage, creating a "single pot" in vitro immune system of antibodies available for the selection of monoclonal antibodies (Griffiths et al, 1994, EMBO J. 13:3245-3260). Selection of antibodies from such libraries can be done by techniques known in the art, including contacting the phage to immobilized target protein, selecting and cloning phage bound to the target, and subcloning the sequences encoding the antibody variable regions into an appropriate vector expressing a desired antibody format.

5.5.5 METHODS OF MODIFYING PROTEIN ACTIVITIES

Methods of directly modifying protein activities include, inter alia, dominant negative mutations, specific drugs (used in the sense of this application) or chemical moieties generally, and also the use of antibodies.

Dominant negative mutations are mutations to endogenous genes or mutant exogenous genes that, when expressed in a cell, disrupt the activity of a targeted protein species. Depending on the structure and activity of the targeted protein, general rules exist that guide the selection of an appropriate strategy for constructing dominant negative mutations that disrupt activity of that target (Herskowitz, 1987, Nature 329:219-222). In the case of active monomeric forms, overexpression of an inactive form can cause competition for natural substrates or ligands sufficient to significantly reduce net activity of the target protein. Such overexpression can be achieved by, for example, associating a promoter, preferably a controllable or inducible promoter, of increased activity with the mutant gene. Alternatively, changes to active site residues can be made so that a virtually irreversible association occurs with the target ligand. Such can be achieved with certain tyrosine kinases by careful replacement of active site serine residues (Perlmutter et al, 1996, Current Opinion in Immunology 8:285-290).

In the case of active multimeric forms, several strategies can guide selection of a dominant negative mutant. Multimeric activity can be controllably decreased by expression of genes coding exogenous protein fragments that bind to multimeric association domains and prevent multimer formation. Alternatively, controllable over expression of an inactive protein unit of a particular type can tie up wild-type active units in inactive multimers, and thereby decrease multimeric activity (Nocka et al, 1990, The EMBO J. 9:1805-1813). For example, in the case of dimeric DNA binding proteins, the DNA binding domain can be deleted from the DNA binding unit, or the activation domain deleted from the activation unit. Also, in this case, the DNA binding domain unit can be expressed without the domain causing association with the activation unit. Thereby, DNA binding sites are tied up without any possible activation of expression. In the case where a particular type of unit normally undergoes a conformational change during activity, expression of a rigid unit can inactivate resultant complexes. For a further example, proteins involved in cellular mechanisms, such as cellular motility, the mitotic process, cellular architecture, and so forth, are typically composed of associations of many subunits of a few types. These structures are often highly sensitive to disruption by inclusion of a few monomeric units with structural defects. Such mutant monomers disrupt the relevant protein activities and can be controllably expressed in a cell.

In addition to dominant negative mutations, mutant target proteins that are sensitive to temperature (or other exogenous factors) can be found by mutagenesis and screening procedures that are well-known in the art.

Also, one of skill in the art will appreciate that expression of antibodies binding and inhibiting a target protein can be employed as another dominant negative strategy.

Finally, activities of certain target proteins can be controllably altered by exposure to exogenous drugs or ligands. In a preferable case, a drug is known that interacts with only one target protein in the cell and alters the activity of only that one target protein. Graded exposure of a cell to varying amounts of that drug thereby causes graded perturbations of cellular states originating at that protein. The alteration can be either a decrease or an increase of activity. Less preferably, a drug is known and used that alters the activity of only a few (e.g., 2-5) target proteins with separate, distinguishable, and non-overlapping effects. Graded exposure to such a drug causes graded perturbations to the several cellular states originating at the target proteins.

6 ROBUST DISCRIMINATION EXAMPLE

6.1 RESULTS

As an example, two profiles generated using the immunosuppressant drugs Cyclosporin A and FK506 illustrate one aspect of the present invention. The profiles were obtained with mRNA transcript arrays in the yeast S. cerevisiae as described in M. Marton, et al, supra and in the Experimental section infra. The transcriptional signatures of these drugs are illustrated in Figure 3. The horizontal axis in these plots is the intensity of the individual hybridized spots on the microarray, representing individual mRNA species abundances in the two cultures. The vertical axis is the log₁₀ of the ratio of the intensity measured for one fluorescent label (Culture 1) to that measured for the other label (Culture 2). Error bars and names are displayed only for those genes which had up or down regulations due to the drugs that were significant at the 95% confidence level or better. Figure 4 shows the high correlation (similarity) between the effects of the two drugs at these concentrations, 1 μg/ml for FK506 and 30 μg/ml for Cyclosporin, where both drugs affect primarily the calcineurin-mediated pathway which is the yeast analogue of the T-cell activation pathway in humans. The correlation coefficient of the log₁₀(expression ratio) is 0.98, where this is computed based on those genes which were significantly up or down regulated at the 95% confidence level in either experiment.

Including all genes in the correlation calculation decreases the correlation coefficient to 0.73. This is because most genes do not change, or change very little, and their contribution to the correlation coefficient is dominated by measurement errors. These random errors tend to decrease the observed correlation. For most purposes more meaningful conclusions are obtained by excluding the genes with the smaller, noise-dominated changes. This results in a correlation of 0.98, which makes the two response profiles essentially indistinguishable. Real differences between the two drugs become obvious only when they are added to cells in additional, different starting states. The two additional states here are supplied by gene deletion strains lacking the FK506 binding protein, FPR, and the cyclophilin protein which binds Cyclosporin, CPH1. These strains are chosen with prior knowledge of the intermediate binding partners of the drugs under study, to illustrate the method.

The response profiles of the drugs in these additional states are illustrated in Figure 5. In these additional states, the CPH1 mutant fails to respond to Cyclosporin, and the FPR mutant fails to respond to FK506 (at the same concentrations used for the single-state correlation above). The augmented profiles, of length 3 x 6000 = 18000, are formed as illustrated in Figure 2, and the new correlation coefficient is computed, r = 0.18. This lower correlation is illustrated in the scatter plot of Figure 6.

The lower degree of similarity between the augmented profiles than was seen between the profiles in the baseline cell state reflects the biological fact that although FK506 and Cyclosporin A both have calcineurin as their ultimate target, they reach it via completely different protein binding partners. In the original 'wild type' yeast strain, the drugs appear to have identical function. The two drugs have been discriminated more robustly by including experiments in the augmented profiles where the drugs are administered to cell states that cause the different pathways of action to be revealed.
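The effect of augmentation can be sketched with a small numerical illustration. The following Python example is a minimal sketch only: the four-gene profiles, the function names `correlation` and `augment`, and the idealized "no response" profiles are hypothetical and are not the data of Figures 3-6. It assumes the uncentered correlation form given in the Experimental section and simple end-to-end concatenation of the per-state profiles.

```python
import numpy as np

def correlation(x, y):
    """Uncentered correlation: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.sqrt(np.sum(x**2) * np.sum(y**2))
    return float(np.sum(x * y) / denom) if denom > 0 else 0.0

def augment(profiles):
    """Concatenate per-state response profiles into one augmented profile."""
    return np.concatenate([np.asarray(p, dtype=float) for p in profiles])

# Hypothetical log10 expression ratios for 4 genes (illustrative values only).
shared_response = np.array([0.8, -0.6, 0.5, 0.0])
no_response = np.zeros(4)

# Three initial states, in order: wild type, strain lacking the binding
# partner of drug A, strain lacking the binding partner of drug B.
drug_a = augment([shared_response, no_response, shared_response])
drug_b = augment([shared_response, shared_response, no_response])

# In the wild-type state alone the two drugs look identical.
r_wild_type = correlation(shared_response, shared_response)
# Over the augmented profiles the differing states lower the similarity.
r_augmented = correlation(drug_a, drug_b)
```

In this idealized case the wild-type profiles correlate perfectly, while the augmented profiles agree in only one of the three states, so the augmented correlation falls to one half. Real data would also carry measurement noise, which this sketch omits.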

6.2 EXPERIMENTAL

Construction, growth and drug-treatment of yeast strains

The strains used in this study were constructed by standard techniques. See e.g. Schiestl et al, 1993, Introducing DNA into yeast by transformation, Methods: A companion to Methods in Enzymology 5:79-85. For experiments involving FK506, cells were grown for three generations to a density of 1 x 10⁷ cells/ml in YAPD medium (YPD plus 0.004% adenine) supplemented with 10 mM calcium chloride as previously described by Garrett-Engele et al, 1995, Calcineurin, the Ca²⁺/calmodulin-dependent protein phosphatase, is essential in yeast mutants with cell integrity defects and in mutants that lack functional vacuolar H(+)-ATPase, Mol. Cell. Biol. 15:4103-4114. Where indicated, FK506 was added to a final concentration of 1 μg/ml 0.5 hr after inoculation of the culture. Cyclosporin A (CsA) was added to a concentration of 30 μg/ml. Cells were broken by standard procedures (See e.g. Ausubel et al, Current Protocols in Molecular Biology, John Wiley & Sons, Inc. (New York), 12.12.1 - 13.12.5) with the following modifications. Cell pellets were resuspended in breaking buffer (0.2M Tris HCl pH 7.6, 0.5M NaCl, 10 mM EDTA, 1% SDS), vortexed for 2 minutes on a VWR multitube vortexer at setting 8 in the presence of 60% glass beads (425-600 μm mesh; Sigma) and phenol:chloroform (50:50, v/v). Following separation, the aqueous phase was reextracted and ethanol precipitated. Poly A⁺ RNA was isolated by two sequential chromatographic purifications over oligo dT cellulose (NEB) using established protocols (See e.g. Ausubel et al, supra).

Preparation and hybridization of the labeled sample

Fluorescently-labeled cDNA was prepared, purified and hybridized essentially as described by DeRisi et al (DeRisi et al, 1997, Exploring the metabolic and genetic control of gene expression on a genomic scale, Science 278:680-686). Briefly, Cy3- or Cy5-dUTP (Amersham) was incorporated into cDNA during reverse transcription (Superscript II, LTI, Inc.) and purified by concentrating to less than 10 μl using Microcon-30 microconcentrators (Amicon). Paired cDNAs were resuspended in 20-26 μl hybridization solution (3x SSC, 0.75 μg/ml poly A DNA, 0.2% SDS) and applied to the microarray under a 22x30 mm coverslip for 6 hr at 63 °C, all according to DeRisi et al, (1997), supra.

Fabrication and scanning of microarrays

PCR products containing common 5' and 3' sequences (Research Genetics) were used as templates with amino-modified forward primer and unmodified reverse primers to PCR amplify 6065 ORFs from the S. cerevisiae genome. First pass success rate was 94%. Amplification reactions that gave products of unexpected sizes were excluded from subsequent analysis. ORFs that could not be amplified from purchased templates were amplified from genomic DNA. DNA samples from 100 μl reactions were isopropanol precipitated, resuspended in water, brought to 3x SSC in a total volume of 15 μl, and transferred to 384-well microtiter plates (Genetix). PCR products were spotted onto 1x3 inch polylysine-treated glass slides by a robot built according to specifications provided in Schena et al, supra; DeRisi et al, 1996, Discovery and analysis of inflammatory disease-related genes using cDNA microarrays, PNAS USA. 94:2150-2155; and DeRisi et al, (1997). After printing, slides were processed following published protocols. See DeRisi et al, (1997).

Microarrays were imaged on a prototype multi-frame CCD camera in development at Applied Precision, Inc. (Seattle, WA). Each CCD image frame was approximately 2 mm square. Exposures of 2 sec in the Cy5 channel (white light through Chroma 618-648 nm excitation filter, Chroma 657-727 nm emission filter) and 1 sec in the Cy3 channel (Chroma 535-560 nm excitation filter, Chroma 570-620 nm emission filter) were done consecutively in each frame before moving to the next, spatially contiguous frame. Color isolation between the Cy3 and Cy5 channels was ~100:1 or better. Frames were knitted together in software to make the complete images. The intensities of spots (~100 μm) were quantified from the 10 μm pixels by frame background subtraction and intensity averaging in each channel. Dynamic range of the resulting spot intensities was typically a ratio of 1000 between the brightest spots and the background-subtracted additive error level. Normalization between the channels was accomplished by normalizing each channel to the mean intensities of all genes. This procedure is nearly equivalent to normalization between channels using the intensity ratio of genomic DNA spots (See DeRisi et al, 1997), but is possibly more robust since it is based on the intensities of several thousand spots distributed over the array.
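The channel normalization described above can be sketched as follows. This Python example is a minimal sketch under stated assumptions: the function name `normalized_log_ratio` and the intensity values are hypothetical, the inputs are assumed to be background-subtracted and strictly positive, and the choice of Cy5 as the numerator is illustrative only.

```python
import numpy as np

def normalized_log_ratio(cy5, cy3):
    """Normalize each channel to its mean spot intensity, then return
    the per-spot log10 ratio of the normalized channels.

    Assumes background-subtracted, strictly positive intensities.
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    cy5_norm = cy5 / cy5.mean()  # divide by the mean over all spots
    cy3_norm = cy3 / cy3.mean()
    return np.log10(cy5_norm / cy3_norm)

# Hypothetical intensities: the Cy5 channel is uniformly twice as bright,
# which represents an overall channel gain difference rather than regulation.
cy3 = np.array([100.0, 200.0, 300.0, 400.0])
cy5 = 2.0 * cy3
log_ratios = normalized_log_ratio(cy5, cy3)
```

Because the only difference between the channels here is a uniform gain, normalization to the channel means removes it and every log ratio comes out at zero.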

Determination of signature correlation coefficients and their confidence limits

Correlation coefficients between the signature ORFs of various experiments were calculated using

$$\rho = \frac{\sum_k x_k y_k}{\left(\sum_k x_k^2 \, \sum_k y_k^2\right)^{1/2}}$$

where x_k is the log₁₀ of the expression ratio for the k'th gene in the x signature, and y_k is the log₁₀ of the expression ratio for the k'th gene in the y signature. The summation is over those genes that were either up- or down-regulated in either experiment at the 95% confidence level. These genes each had a less than 5% chance of being actually unregulated (having expression ratios departing from unity due to measurement errors alone). This confidence level was assigned based on an error model which assigns a lognormal probability distribution to each gene's expression ratio with characteristic width based on the observed scatter in its repeated measurements (repeated arrays at the same nominal experimental conditions) and on the individual array hybridization quality. This latter dependence was derived from control experiments in which both Cy3 and Cy5 samples were derived from the same RNA sample. For large numbers of repeated measurements the error reduces to the observed scatter. For a single measurement the error is based on the array quality and the spot intensity.
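The correlation calculation above, restricted to genes that are significant in either experiment, can be sketched as follows. This Python example is a minimal sketch: the function name `signature_correlation`, the boolean significance flags, and the numerical values are hypothetical. The flags are assumed to be supplied by a separate error model, which is not implemented here.

```python
import numpy as np

def signature_correlation(x, y, sig_x, sig_y):
    """Uncentered correlation of log10 expression ratios, summed over
    genes flagged as significant in either experiment."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep a gene if it is significant in the x OR the y signature.
    keep = np.asarray(sig_x, dtype=bool) | np.asarray(sig_y, dtype=bool)
    xs, ys = x[keep], y[keep]
    denom = np.sqrt(np.sum(xs**2) * np.sum(ys**2))
    if denom == 0.0:
        return float("nan")  # undefined when no signal remains
    return float(np.sum(xs * ys) / denom)

# Hypothetical data: two strongly regulated genes, two noise-dominated genes.
x = np.array([1.0, -1.0, 0.05, -0.02])
y = np.array([1.0, -1.0, -0.04, 0.03])
significant = np.array([True, True, False, False])

rho_sig = signature_correlation(x, y, significant, significant)
rho_all = signature_correlation(x, y, np.ones(4, bool), np.ones(4, bool))
```

With these illustrative values the significant genes agree exactly, while including the two noise-dominated genes pulls the coefficient slightly below unity.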

Random measurement errors in the x and y signatures tend to bias the correlation toward zero. In most experiments the great majority of genes are not significantly affected but do exhibit small random measurement errors. Selecting only the 95% confidence genes for the correlation calculation, rather than the entire genome, reduces this bias and makes the actual biological correlations more apparent.

Correlations between a profile and itself are unity by definition. Error limits on the correlation are 95% confidence limits based on the individual measurement error bars, and assuming uncorrelated errors. They do not include the bias mentioned above; thus, a departure of ρ from unity does not necessarily mean that the underlying biological correlation is imperfect. However, a correlation of 0.7 ± 0.1, for example, is very significantly different from zero. Small (magnitude of ρ < 0.2) but formally significant correlations in the tables and text probably are due to small systematic biases in the Cy5/Cy3 ratios which violate the assumption of independent measurement errors used to generate the 95% confidence limits. Therefore, these small correlation values should be treated as not significant. A likely source of uncorrected systematic bias is the partially corrected scanner detector nonlinearity that differently affects the Cy3 and Cy5 detection channels.

The 1 μg/ml FK506 treatment signature was compared to over 40 unrelated deletion mutant or drug signatures. These control profiles had correlation coefficients with the FK506 profile which were distributed around zero (mean ρ = -0.03) with a standard deviation of 0.16 (data not shown) and none had correlations greater than ρ = 0.38. Similarly, the calcineurin mutant signature correlated well with the CsA-treatment signature (ρ = 0.71 ± 0.04) but not with the negative control signatures (mean ρ = -0.02 with a standard deviation of 0.18).

Quality controls

End-to-end checks on expression ratio measurement accuracy were provided by analyzing the variance in repeated hybridizations using the same mRNA labeled with both Cy3 and Cy5, and also using Cy3 and Cy5 mRNA samples isolated from independent cultures of the same nominal strain and conditions. Biases undetected with this procedure, such as gene-specific biases presumably due to differential incorporation of Cy3- and Cy5-dUTP into cDNA, were minimized by performing hybridizations in fluor-reversed pairs, in which the Cy3/Cy5 labeling of the biological conditions was reversed in one experiment with respect to the other. The expression ratio for each gene is then the ratio of ratios between the two experiments in the pair. Other biases are removed by algorithmic numerical detrending. The magnitude of these biases in the absence of detrending and fluor reversal is typically on the order of 30% in the ratio, but may be as high as twofold for some ORFs.
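The way a fluor-reversed pair cancels a gene-specific dye bias can be sketched in log space. This Python example is a minimal sketch under stated assumptions: the bias is modelled as additive in log₁₀(Cy5/Cy3) and identical in both hybridizations, the values are hypothetical, and the factor of one half (the square root of the ratio of ratios) is an assumption made here to return the result to the scale of a single expression ratio.

```python
import numpy as np

def fluor_reversed_log_ratio(log_ratio_forward, log_ratio_reversed):
    """Combine a fluor-reversed pair of log10(Cy5/Cy3) measurements.

    The difference of the two logs is the log of the ratio of ratios.
    Halving it is an assumption of this sketch, not taken from the text.
    """
    fwd = np.asarray(log_ratio_forward, dtype=float)
    rev = np.asarray(log_ratio_reversed, dtype=float)
    return 0.5 * (fwd - rev)

# Hypothetical true regulation and gene-specific dye bias (log10 units).
true_log_ratio = np.array([0.30, -0.20, 0.00])
dye_bias = np.array([0.10, 0.05, -0.08])

# Forward experiment: treated sample labeled with Cy5.
forward = true_log_ratio + dye_bias
# Reversed experiment: labels swapped, so the biology changes sign
# while the dye bias keeps the same sign.
reverse = -true_log_ratio + dye_bias

combined = fluor_reversed_log_ratio(forward, reverse)
```

Under this additive model the dye bias appears with the same sign in both experiments and cancels in the difference, leaving the underlying regulation.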

Expression ratios are based on mean intensities over each spot. The occasional smaller spots have fewer image pixels in the average. This does not degrade accuracy noticeably until the number of pixels falls below ten, in which case the spot is rejected from the data set. Wander of spot positions with respect to the nominal grid is adaptively tracked in array subregions by the image processing software. Unequal spot wander within a subregion greater than half a spot spacing is problematic for the automated quantitating algorithms; in this case the spot is rejected from analysis based on human inspection of the wander. Any spots partially overlapping are excluded from the data set. Less than 1% of spots typically are rejected for these reasons.
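The automatic part of these rejection rules can be sketched as a simple mask. This Python example is a minimal sketch: the function name `accept_spots` and the input arrays are hypothetical, and the wander criterion, which the text bases on human inspection, is not modelled.

```python
import numpy as np

MIN_PIXELS = 10  # spots with fewer than ten pixels are rejected

def accept_spots(pixel_counts, overlaps):
    """Return a boolean mask of spots kept in the data set.

    A spot is kept when it has at least MIN_PIXELS pixels and does not
    partially overlap a neighboring spot.
    """
    pixel_counts = np.asarray(pixel_counts)
    overlaps = np.asarray(overlaps, dtype=bool)
    return (pixel_counts >= MIN_PIXELS) & ~overlaps

# Hypothetical spots: normal, too small, borderline, overlapping.
pixel_counts = np.array([80, 9, 10, 75])
overlaps = np.array([False, False, False, True])
accepted = accept_spots(pixel_counts, overlaps)
```

Here the second spot is rejected for having fewer than ten pixels and the fourth for overlapping a neighbor, while the borderline spot with exactly ten pixels is kept.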

7 REFERENCES CITED

All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.

8 COMPUTER IMPLEMENTATIONS

The analytic methods described in the previous sections can be implemented by use of the following computer systems and according to the following programs and methods. FIG. 7 illustrates an exemplary computer system suitable for implementation of the analytic methods of this invention. Computer system 501 is illustrated as comprising internal components and being linked to external components. The internal components of this computer system include processor element 502 interconnected with main memory 503. For example, computer system 501 can be an Intel 8086-, 80386-, 80486-, Pentium™, or Pentium™-based processor with preferably 32 MB or more of main memory.

The external components include mass storage 504. This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory). Such hard disks are preferably of 1 GB or greater storage capacity. Other external components include user interface device 505, which can be a monitor, together with inputting device 506, which can be a "mouse", or other graphic input devices (not illustrated), and/or a keyboard. A printing device 508 can also be attached to the computer 501.

Typically, computer system 501 is also linked to network link 507, which can be part of an Ethernet link to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows computer system 501 to share data and processing tasks with other computer systems.

Loaded into memory during operation of this system are several software components, which are both standard in the art and special to the instant invention. These software components collectively cause the computer system to function according to the methods of this invention. These software components are typically stored on mass storage 504. Software component 510 represents the operating system, which is responsible for managing computer system 501 and its network interconnections. This operating system can be, for example, of the Microsoft Windows® family, such as Windows 3.1, Windows 95, Windows 98, or Windows NT. Software component 511 represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention. Many high or low level computer languages can be used to program the analytic methods of this invention. Instructions can be interpreted during run-time or compiled. Preferred languages include C/C++, FORTRAN and JAVA™. Most preferably, the methods of this invention are programmed in mathematical software packages that allow symbolic entry of equations and high-level specification of processing, including algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms. Such packages include Matlab from Mathworks (Natick, MA), Mathematica® from Wolfram Research (Champaign, IL), or S-Plus® from MathSoft (Cambridge, MA). Accordingly, software component 512 and/or 513 represents the analytic methods of this invention as programmed in a procedural language or symbolic package.

In an exemplary implementation, to practice the methods of the present invention, a user first loads experimental data into the computer system 501. These data can be directly entered by the user from monitor 505, keyboard 506, or from other computer systems linked by network connection 507, or on removable storage media such as a CD-ROM, floppy disk (not illustrated), tape drive (not illustrated), ZIP® drive (not illustrated) or through the network (507). Next the user causes execution of expression profile analysis software 512 which performs the methods of the present invention.

In another exemplary implementation, a user first loads experimental data and/or databases into the computer system. This data is loaded into the memory from the storage media (504) or from a remote computer, preferably from a dynamic geneset database system, through the network (507). Next the user causes execution of software that performs the steps of the present invention.

Alternative computer systems and software for implementing the analytic methods of this invention will be apparent to one of skill in the art and are intended to be comprehended within the accompanying claims. In particular, the accompanying claims are intended to include the alternative program structures for implementing the methods of this invention that will be readily apparent to one of skill in the art.

Claims

WHAT IS CLAIMED IS:

1. A method for determining a degree of similarity between an effect of a first perturbation and an effect of a second perturbation on a biological sample, the method comprising:

(a) determining a first set of constituent profiles, each constituent profile of said first set is determined using a different one of a plurality of initial states of said biological sample by measuring a response of said biological sample to said first perturbation when said biological sample is in said different one of said initial states;

(b) determining a second set of constituent profiles, each constituent profile of said second set determined using a different one of said plurality of initial states by measuring a response of said biological sample to said second perturbation when said biological sample is in said different one of said initial states;

(c) combining said first set of constituent profiles into a first augmented profile;

(d) combining said second set of constituent profiles into a second augmented profile; and

(e) comparing said first augmented profile with said second augmented profile to determine said degree of similarity.

2. A method for determining a degree of similarity between an effect of a first perturbation and an effect of a second perturbation on a biological sample, the method comprising:

(a) combining a first set of constituent profiles into a first augmented profile; each constituent profile in said first set determined by: a different one of a plurality of initial states of said biological sample wherein a response of said biological sample to said first perturbation when said biological sample is in said different one of said initial states is measured;

(b) combining a second set of constituent profiles into a second augmented profile; each constituent profile of said second set determined by: a different one of said plurality of initial states of said biological sample; wherein a response of said biological sample to said second perturbation when said biological sample is in said different one of said initial states is measured; and

(c) comparing said first augmented profile with said second augmented profile to determine said degree of similarity.

3. A method for determining a degree of similarity between an effect of a first perturbation and an effect of a second perturbation on a biological sample by comparing a first augmented profile with a second augmented profile to determine said degree of similarity; wherein:

(i) said first augmented profile is determined by combining a first set of constituent profiles; each constituent profile of said first set determined with a different one of a plurality of initial states of said biological sample by measuring a response of said biological sample to said first perturbation when said biological sample is in said different one of said initial states; and (ii) said second augmented profile is determined by combining a second set of constituent profiles; each constituent profile of said second set is determined with said different one of said plurality of initial states of said biological sample by measuring a response of said biological sample to said second perturbation when said biological sample is in said different one of said initial states.

4. The method of claim 1, 2 or 3 wherein each said initial state is different.

5. The method of claims 1, 2 or 3 wherein two or more of said initial states are the same.

6. The method of claims 1, 2 or 3 wherein at least one constituent profile in said first set of constituent profiles is a first response profile and at least one constituent profile in said second set of constituent profiles is a second response profile, wherein said first response profile is determined by at least one measurement of at least one cellular constituent in said biological sample when said biological sample is in an initial state selected from said plurality of initial states, and said second response profile is determined by at least one measurement of at least one said cellular constituent in said biological sample when said biological sample is in said initial state.

7. The method of claim 6 wherein said first response profile and said second response profile is determined by said initial state of said biological sample at a time when said measurements are made.

8. The method of claim 1, 2 or 3 wherein at least one constituent profile in said first set of constituent profiles is a first projected profile and at least one constituent profile in said second set of constituent profiles is a second projected profile, wherein said first and said second projected profile each contain a plurality of cellular constituent set values derived according to a definition of co-varying cellular constituent sets.

9. The method of claim 8 wherein said first projected profile and said second projected profile is determined by an initial state selected from said plurality of initial states.

10. The method of claim 8 wherein said definition is based upon co-variation of said cellular constituents under a plurality of different perturbations.

11. The method of claim 8 wherein said definition of co-varying cellular constituent sets is defined by a similarity tree derived by a cluster analysis of said cellular constituents under said plurality of perturbations.

12. The method of claim 11 wherein said co-varying cellular constituent sets are defined as branches of said similarity tree.

13. The method of claim 1, 2, or 3 wherein said biological sample is an organism having a cell wall and at least one initial state selected from said plurality of initial states is determined by altering said biological sample in a manner that alters said cell wall permeability.

14. The method of claim 1, 2, or 3 wherein said biological sample is a cell line.

15. The method of claim 14 wherein said biological sample is substantially isogenic to Saccharomyces cerevisiae.

16. The method of claim 14 wherein said cell line expresses a macromolecule that has an ability to act as a drug efflux pump and an initial state that is selected from said plurality of initial states is determined by a mutant activity of said macromolecule in said cell line.

17. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by a first set of culture growth conditions and a second initial state that is selected from said plurality of initial states is determined by a second set of culture growth conditions, wherein said first culture growth conditions and said second culture growth conditions vary by an amount of a component of said culture growth conditions.

18. The method of claim 17 wherein said component of said culture growth conditions is an amount of a nutrient that is necessary for viability of said cell line.

19. The method of claim 17 wherein said component of said culture growth conditions is an amount of a trace element.

20. The method of claim 17 wherein said component is selected from the group consisting of iron, manganese, zinc, copper, molybdenum, boron, chlorine, calcium, sodium, chromium, potassium, magnesium, and selenium.

21. The method of claim 17 wherein said component of said culture growth conditions is an incubation temperature.

22. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by a culture growth density of said cell line and a second initial state that is selected from said plurality of initial states is determined by a second culture growth density of said cell line, wherein said first culture growth density and said second culture growth density vary by an amount.

23. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by a first amount of a pharmacological agent that is contacted with said biological sample and a second initial state that is selected from said plurality of initial states is determined by a second amount of a pharmacological agent that is contacted with said biological sample.

24. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by incubating said cell line on a surface.

25. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by incubating said cell line in a liquid.

26. The method of claim 14 wherein said biological sample is incubated in a container and a first initial state that is selected from said plurality of initial states is determined by the container that said biological sample is incubated in and the container is selected from the group consisting of shaker flasks, culture plates, incubators, 96-well microtiter plates, and 384-well microtiter plates.

27. The method of claim 1, 2, or 3 wherein a first initial state that is selected from said plurality of initial states is determined by a genetic feature of said biological sample.

28. The method of claim 27 wherein the biological sample is substantially isogenic to Saccharomyces cerevisiae having a genome; and a first initial state, which is selected from said plurality of initial states, is determined by a genetic feature selected from the group consisting of a haploid state of said genome, a diploid state of said genome, a heterozygous state of a gene included in said genome, a homozygous state of a gene included in said genome, a mutation of a gene included in said genome, a deletion of a portion of a gene from said genome, an alteration of a regulatory sequence of a gene in said genome, an exogenous gene integrated into said genome and an exogenous oligonucleotide integrated into said genome.

29. The method of claim 27 wherein said biological sample is a cell line having a genome; wherein a first initial state is selected from said plurality of initial states; wherein said first initial state is determined by a genetic feature selected from the group consisting of a heterozygous state of a gene included in said genome, a homozygous state of a gene included in said genome, a mutation of a gene included in said genome, a deletion of a portion of a gene from said genome, an alteration of a regulatory sequence of a gene in said genome, an exogenous gene integrated into said genome of said cell line, and an exogenous oligonucleotide integrated into said genome.

30. The method of claim 29 wherein a second initial state that is selected from said plurality of initial states is determined by contacting said biological sample with an amount of a composition; wherein said composition comprises a pharmacological agent, an endogenous hormone, a growth factor, a peptide, or an oligonucleotide.

31. The method of claim 14 wherein a first initial state that is selected from said plurality of initial states is determined by a state of a biological pathway; wherein said biological pathway is selected from a compendium of biological pathways present in said cell line.

32. The method of claim 31 wherein said biological sample is substantially isogenic to Saccharomyces cerevisiae and said biological pathway is a mating pathway.

33. The method of claim 1, 2, or 3 wherein said first perturbation is a first amount of a first pharmacological agent that is contacted with said biological sample.

34. The method of claim 33 wherein said second perturbation is a second amount of said first pharmacological agent that is contacted with said biological sample, wherein said first and said second amount of said first pharmacological agent are different.

35. The method of claim 33 wherein said second perturbation is an amount of a second pharmacological agent that is contacted with said biological sample.

36. The method of claim 1, 2, or 3 wherein said biological sample includes a genome and said first perturbation is determined by the introduction of an exogenous gene into said genome.

37. The method of claim 1, 2, or 3 wherein said biological sample includes a genome and said first perturbation includes a deletion of at least a substantial portion of one gene in said genome.

38. The method of claim 1, 2, or 3 wherein said first perturbation is a method, the method comprising: contacting said biological sample with an agent selected from the group consisting of a hormone, a drug, a peptide, an oligonucleotide, a mineral, a composition of media, a phage, a trace element, a salt, a colony stimulating factor, and a source of irradiation.

39. The method of claim 1, 2, or 3 wherein said first perturbation is a method, the method comprising: contacting an amount of an organic compound that has a molecular weight less than 1000 Daltons with said biological sample.

40. The method of claim 1, 2, or 3 wherein said first set of constituent profiles is combined into said first augmented profile by concatenating said first set of constituent profiles and said second set of constituent profiles is combined into said second augmented profile by concatenating said second set of constituent profiles.

41. The method of claim 1, 2, or 3 wherein said first augmented profile is:

$P^{i} = [P^{i}_{1}; \ldots; P^{i}_{N}]$

wherein,

$P^{i}$ is said first augmented profile;

$P^{i}_{1}$ is a first constituent profile in said first set of constituent profiles that is determined by measuring a response of said biological sample to said first perturbation when said biological sample is in said first biological state;

$P^{i}_{N}$ is an $N^{th}$ constituent profile in said first set of constituent profiles that is determined by measuring a response of said biological sample to said first perturbation when said biological sample is in an $N^{th}$ biological state selected from said plurality of initial states; and

said second augmented profile is:

$P^{j} = [P^{j}_{1}; \ldots; P^{j}_{N}]$

wherein,

$P^{j}$ is said second augmented profile;

$P^{j}_{1}$ is a first constituent profile in said second set of constituent profiles that is determined by measuring a response of said biological sample to said second perturbation when said biological sample is in said first biological state;

$P^{j}_{N}$ is an $N^{th}$ constituent profile in said second set of constituent profiles that is determined by measuring a response of said biological sample to said second perturbation when said biological sample is in an $N^{th}$ biological state selected from said plurality of initial states; and

$N$ is the number of states in said plurality of initial states; and

said step of comparing said first augmented profile with said second augmented profile to determine said correlation is performed by comparing $P^{i}$ to $P^{j}$ using a quantitative measure of similarity.

42. The method of claim 41 wherein said quantitative measure of similarity is a generalized dot product: $r_{ij} = P^{i} \cdot P^{j} / (|P^{i}| \, |P^{j}|)$ wherein $\cdot$ denotes dot product, $|P|$ denotes the vector norm of $P$, and $r_{ij}$ denotes similarity.
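As an illustration only, the concatenation of claim 40 and the normalized dot product of claim 42 can be sketched in a few lines of Python. The function names `augment` and `similarity`, and the numeric responses, are hypothetical and are not taken from the specification; the sketch assumes every constituent profile is a list of numeric responses of equal meaning across the two sets.

```python
import math
from typing import List, Sequence


def augment(constituent_profiles: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate constituent profiles (one per initial state) into one augmented profile."""
    augmented: List[float] = []
    for profile in constituent_profiles:
        augmented.extend(profile)
    return augmented


def similarity(p_i: Sequence[float], p_j: Sequence[float]) -> float:
    """Normalized dot product: (P_i . P_j) / (|P_i| |P_j|)."""
    if len(p_i) != len(p_j):
        raise ValueError("augmented profiles must have the same length")
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    if norm_i == 0.0 or norm_j == 0.0:
        raise ValueError("similarity is undefined for an all-zero profile")
    return dot / (norm_i * norm_j)


# Hypothetical responses of three cellular constituents in two initial states.
first_set = [[0.5, -1.0, 0.0], [1.0, 0.0, -0.5]]   # responses to the first perturbation
second_set = [[1.0, -2.0, 0.0], [2.0, 0.0, -1.0]]  # responses to the second perturbation

P_i = augment(first_set)
P_j = augment(second_set)
r_ij = similarity(P_i, P_j)  # near 1 here, since the second set is a scaled copy of the first
```

Because the measure is normalized, two augmented profiles that differ only in overall magnitude score as highly similar, while orthogonal profiles score near zero.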

43. The method of claim 41 wherein said quantitative measure of similarity is derived from Shannon mutual information theory.
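Claim 43 does not state which information-theoretic quantity is used. One possible reading, offered purely as an assumption, is the empirical mutual information between two augmented profiles whose elements have first been reduced to discrete levels. The helper name `mutual_information` and the example values below are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Empirical mutual information, in bits, between two equal-length discrete profiles."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of co-occurring (level_x, level_y) pairs
    marg_x = Counter(x)         # counts of each level in x
    marg_y = Counter(y)         # counts of each level in y
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marg_x[a] / n
        p_b = marg_y[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical discretized profiles with levels -1, 0 and 1.
same = mutual_information([-1, -1, 0, 0, 1, 1], [-1, -1, 0, 0, 1, 1])
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this reading, identical profiles share as much information as the profile's own entropy, and statistically unrelated profiles share none.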

44. The method of claim 1, 2 or 3 wherein each constituent profile includes a plurality of elements, each element representing an amount of a cellular constituent in said biological sample.

45. The method of claim 44 wherein each said element of at least one constitutive profile in said first set and each said element of at least one constitutive profile in said second set is assigned a

"-1", if said element exceeds a negative threshold,

"1", if said element exceeds a positive threshold, and

"0", if said element does not exceed said positive and said negative threshold; and said positive threshold corresponds to a first amount of one or more cellular constituents in said biological sample and said second threshold corresponds to a second amount of one or more cellular constituents in said biological sample.

46. The method of claim 44 wherein each said cellular constituent is independently selected from the group consisting of a gene expression level, an amount of an mRNA encoding a gene, an amount of a protein, an amount of an enzymatic activity, an amount of an epitope presented by a macromolecule, an amount of a divalent cation, an amount of a phosphorylated protein, an amount of a dephosphorylated protein, an amount of a hormone, and an amount of a peptide.
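The three-level assignment recited in claim 45 can be sketched as follows. This is a minimal illustration, assuming that "exceeds a negative threshold" means the element falls below that threshold; the function name `discretize` and the threshold values are hypothetical.

```python
from typing import List, Sequence


def discretize(profile: Sequence[float],
               positive_threshold: float,
               negative_threshold: float) -> List[int]:
    """Map each element to 1, -1 or 0 according to the two thresholds."""
    if negative_threshold > positive_threshold:
        raise ValueError("negative threshold must not exceed positive threshold")
    levels: List[int] = []
    for value in profile:
        if value > positive_threshold:
            levels.append(1)    # element exceeds the positive threshold
        elif value < negative_threshold:
            levels.append(-1)   # element lies beyond the negative threshold (assumed reading)
        else:
            levels.append(0)    # element lies between the two thresholds
    return levels


# Hypothetical responses and thresholds.
levels = discretize([0.8, -0.1, -0.9, 0.3], positive_threshold=0.5, negative_threshold=-0.5)
```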

47. The method of claim 1, 2, or 3 wherein each said initial state of said biological sample is provided by selecting said biological sample at a different time.

48. The method of claim 1, 2, or 3 wherein said second set of constituent profiles represents a baseline state of said biological sample.

49. The method of claim 1, 2, or 3 wherein said second perturbation is wild-type activity and said second set of constituent profiles represents a wild-type state of said biological sample.

50. A method of determining an effect of a first perturbation on a subject, the method comprising:

(a) determining a plurality of augmented profiles; each augmented profile determined by combining a constituent profile set selected from a plurality of constituent profile sets wherein: each said constituent profile set in said plurality of constituent profile sets is determined by obtaining a biological sample from said subject at a different time; and each constituent profile in said constituent profile set is determined by measuring a biological response of said biological sample to a different second perturbation selected from a plurality of perturbations; and

(b) comparing said plurality of augmented profiles to determine said effect of said first perturbation on said subject.

51. The method of claim 50, wherein said first perturbation is selected from the group consisting of a diseased state, introduction of an exogenous gene into the genome of said subject, and a behavioral health risk.

52. A method of determining an effect of a first perturbation on a subject, the method comprising:

(a) determining a plurality of augmented profiles; each augmented profile determined by combining a constituent profile set selected from a plurality of constituent profile sets wherein: each said constituent profile set in said plurality of constituent profile sets is determined by obtaining a biological sample from said subject at a different stage of an environmental insult; and each constituent profile in said constituent profile set is determined by measuring a biological response of said biological sample to a different second perturbation selected from a plurality of perturbations; and

53. The method of claim 52, wherein the environmental insult is a disease that has afflicted said subject.

54. The method of claim 50 or 52 wherein a first constituent profile set in said plurality of constituent profile sets represents a baseline state and all other constituent profile sets in said plurality of constituent profile sets are expressed as a ratio of said first constituent profile set.

55. The method of claim 50 or 52 wherein a first constituent profile set in said plurality of constituent profile sets represents a baseline state and all other constituent profile sets in said plurality of constituent profile sets are expressed as a logarithmic ratio of said first constituent profile set.
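The ratio and logarithmic ratio of claims 54 and 55 can be illustrated with a short sketch. The claims do not specify a logarithm base; base 10 is assumed here. The function names and the abundance values are hypothetical, and the sketch assumes strictly positive measurements.

```python
import math
from typing import List, Sequence


def ratio_profile(profile: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Express each element as a ratio to the matching baseline element."""
    if len(profile) != len(baseline):
        raise ValueError("profile and baseline must have the same length")
    if any(v <= 0.0 for v in profile) or any(b <= 0.0 for b in baseline):
        raise ValueError("ratios are only defined here for positive measurements")
    return [v / b for v, b in zip(profile, baseline)]


def log_ratio_profile(profile: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Express each element as a base-10 logarithmic ratio to the baseline (assumed base)."""
    return [math.log10(r) for r in ratio_profile(profile, baseline)]


# Hypothetical abundances for three cellular constituents.
baseline = [100.0, 100.0, 100.0]
later = [200.0, 50.0, 100.0]

ratios = ratio_profile(later, baseline)
log_ratios = log_ratio_profile(later, baseline)
```

A logarithmic ratio treats a two-fold increase and a two-fold decrease symmetrically about zero, which a plain ratio does not.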

56. The method of claim 50 or 52 wherein said first perturbation is a drug that is taken by said subject at regular intervals.

57. A method of determining a biological state of a first subject, the method comprising:

(a) determining a first set of constituent profiles, each constituent profile of said first set being determined by measuring a response of a biological sample derived from said first subject to a perturbation at a different time;

(b) determining a second set of constituent profiles, each constituent profile of said second set being determined by measuring a response of a second biological sample, which is derived from a second subject having a known biological state, to said perturbation at a different time;

(e) comparing said first augmented profile with said second augmented profile to predict the biological state of said first subject.

58. A method of diagnosing a disease state in a subject, the method comprising:

(a) determining a first set of constituent profiles, each constituent profile of said first set being determined by measuring a response of a biological sample obtained from said subject to a different perturbation selected from a plurality of perturbations;

(b) combining said first set of constituent profiles into a first augmented profile; and

(c) comparing said first augmented profile with a library of augmented profiles, wherein each augmented profile in said library of augmented profiles is derived from a different biological sample with a known biological state, to diagnose said disease state.

59. The method of claim 58 wherein the comparing step includes the step of clustering said library into groups based upon similarities to said first augmented profile.
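The library comparison of claims 58 and 59 can be sketched as below. The claims do not name a clustering algorithm, so this sketch substitutes a simple similarity ranking followed by a two-group split at a cutoff, which is an assumption and not a full clustering method. The library labels, the cutoff, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized dot product of two equal-length, non-zero profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("similarity is undefined for an all-zero profile")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_library(query: Sequence[float],
                 library: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Score every library profile against the query, most similar first."""
    scored = [(label, cosine(query, profile)) for label, profile in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def split_by_cutoff(ranked: List[Tuple[str, float]],
                    cutoff: float) -> Tuple[List[str], List[str]]:
    """Split ranked labels into a similar group and a dissimilar group."""
    similar = [label for label, score in ranked if score >= cutoff]
    dissimilar = [label for label, score in ranked if score < cutoff]
    return similar, dissimilar


# Hypothetical augmented profiles with known biological states.
query = [1.0, 0.0, -1.0, 0.5]
library = {
    "state_A": [2.0, 0.0, -2.0, 1.0],
    "state_B": [-1.0, 0.0, 1.0, -0.5],
    "state_C": [0.0, 1.0, 0.0, 0.0],
}

ranked = rank_library(query, library)
similar, dissimilar = split_by_cutoff(ranked, cutoff=0.8)
```

In this sketch the known state attached to the top-ranked library profile would be the candidate diagnosis; a real implementation might instead apply a hierarchical or other clustering method to the full similarity matrix.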

60. A method of drug discovery, the method comprising the steps of:

(a) determining a plurality of augmented profiles; each augmented profile being determined by combining a constituent profile set selected from a plurality of constituent profile sets wherein: each said constituent profile set in said plurality of constituent profile sets is determined by use of a test compound; and each constituent profile in said constituent profile set is determined by contacting said test compound with a cell line that is in a different biological state selected from a plurality of biological states; and

(b) comparing said plurality of augmented profiles to determine the effect of said test compound on said cell line.

61. A method of determining a biological state of a first subject, the method comprising comparing a first augmented profile with a second augmented profile to predict said biological state of said first subject wherein said first and said second augmented profile are derived by

(a) determining a first set of constituent profiles, each constituent profile of said first set is determined by measuring a response of a biological sample derived from said first subject to a perturbation at a different time;

(b) determining a second set of constituent profiles, each constituent profile of said second set is determined by measuring a response of a second biological sample, which is derived from a second subject having a known biological state, to said perturbation at a different time;

(c) combining said first set of constituent profiles into a first augmented profile; and

(d) combining said second set of constituent profiles into a second augmented profile.

62. A method of diagnosing a disease state in a subject, the method comprising comparing a first augmented profile with a library of augmented profiles, wherein each augmented profile in said library of augmented profiles is derived from a different biological sample with a known biological state and said first augmented profile is derived by

(a) determining a first set of constituent profiles, each constituent profile of said first set of constituent profiles is determined by measuring a response of a biological sample obtained from said subject to a different perturbation selected from a plurality of perturbations; and

(b) combining said first set of constituent profiles to derive said first augmented profile.

63. A computer system for determining a degree of similarity between an effect of a first perturbation and an effect of a second perturbation on a biological system, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising:

(a) combining a first set of constituent profiles into a first augmented profile; each constituent profile in said first set determined by: a different one of a plurality of initial states of said biological system wherein a response of said biological system to said first perturbation when said biological system is in said different one of said initial states is measured;

(b) combining a second set of constituent profiles into a second augmented profile; each constituent profile of said second set determined by: a different one of said plurality of initial states of said biological sample; wherein a response of said biological sample to said second perturbation when said biological sample is in said different one of said initial states is measured; and

64. A computer system for determining a degree of similarity between an effect of a first perturbation and an effect of a second perturbation on a biological sample, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method that comprises comparing a first augmented profile with a second augmented profile to determine said degree of similarity; wherein: (i) said first augmented profile is determined by combining a first set of constituent profiles; each constituent profile of said first set determined with a different one of a plurality of initial states of said biological sample by measuring a response of said biological sample to said first perturbation when said biological sample is in said different one of said initial states; and (ii) said second augmented profile is determined by combining a second set of constituent profiles; each constituent profile of said second set is determined with said different one of said plurality of initial states of said biological sample by measuring a response of said biological sample to said second perturbation when said biological sample is in said different one of said initial states.

65. A computer system for determining an effect of a first perturbation on a subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising:

66. A computer system for determining an effect of a first perturbation on a subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising:

67. A computer system for determining a biological state of a first subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising:

(c) combining said first set of constituent profiles into a first augmented profile;

(d) combining said second set of constituent profiles into a second augmented profile; and

68. A computer system for diagnosing a disease state in a subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising:

(c) comparing said first augmented profile with a library of augmented profiles, wherein each augmented profile in said library of augmented profiles is derived from a different biological sample with a known biological state, to diagnose said disease state.

69. A computer system for advancing drug discovery, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising the steps of:

70. A computer system for determining a biological state of a first subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising comparing a first augmented profile with a second augmented profile to predict said biological state of said first subject wherein said first and said second augmented profile are derived by:

71. A computer system for diagnosing a disease state in a subject, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising comparing a first augmented profile with a library of augmented profiles, wherein each augmented profile in said library of augmented profiles is derived from a different biological sample with a known biological state and said first augmented profile is derived by: