WO2004022575A2

WO2004022575A2 - Bioinformatics analysis of cellular effects of artificial transcription factors

Info

Publication number: WO2004022575A2
Application number: PCT/KR2003/001827
Authority: WO
Inventors: Jin-Soo Kim; Dong-Ki Lee; Jinwoo Park; Youn-Jae Kim
Original assignee: Toolgen, Inc.
Priority date: 2002-09-05
Filing date: 2003-09-05
Publication date: 2004-03-18
Also published as: AU2003260973A1; WO2004022575A3

Abstract

Disclosed is a method of evaluating a plurality of cellular genes. The method includes: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is co-regulated among the cells of the plurality.

Description

BIOINFORMATICS ANALYSIS OF CELLULAR EFFECTS OF ARTIFICIAL TRANSCRIPTION FACTORS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Applications 60/408,862, filed September 5, 2002, and 60/453,111, filed March 7, 2003, the contents of which are hereby incorporated by reference in their entirety for all purposes.

FIELD OF THE INVENTION

The invention relates to data mining methods that can be performed in computers to identify relationships among genes and proteins.

BACKGROUND

Artificial transcription factors can be produced that bind to DNA and regulate transcription. One type of artificial transcription factor includes chimeras of zinc finger domains. WO 01/60970 (Kim et al.) describes methods for constructing artificial transcription factors and for determining their specificity. One application for artificial transcription factors is to alter the expression of a particular target gene. Target sites are identified in the regulatory region of the target gene, and artificial transcription factors are engineered to recognize one or more of the target sites. When such artificial transcription factors are introduced into cells, they may bind to the corresponding target sites and modulate transcription.

SUMMARY

In one aspect, the invention features a method of evaluating a plurality of cellular genes. The method includes: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is altered by a first and a second transcription factor such that, in at least 50, 60, 70, 80, 90, 95, or 100% of the evaluated cells in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered, wherein the transcription factors are factors respectively encoded by the nucleic acids in a first and second cell of the plurality of cells. In one embodiment, a plurality of different such sets are identified.
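The percentage criterion above can be illustrated with a minimal sketch. It assumes that the expression information has already been reduced to, for each cell, the set of genes whose expression is altered; the cell labels, gene names, and function names are hypothetical and are not taken from this disclosure.

```python
def co_alteration_fraction(altered_by_cell, gene_set):
    """Among cells in which at least one gene of the set is altered,
    return the fraction in which every gene of the set is altered."""
    gene_set = set(gene_set)
    hit_cells = [genes for genes in altered_by_cell.values() if gene_set & genes]
    if not hit_cells:
        return 0.0
    fully_altered = sum(1 for genes in hit_cells if gene_set <= genes)
    return fully_altered / len(hit_cells)


def meets_criterion(altered_by_cell, gene_set, min_fraction=0.5):
    """Require alteration of the whole set by at least two different
    transcription factors (two cells) and the minimum fraction."""
    gene_set = set(gene_set)
    n_full = sum(1 for genes in altered_by_cell.values() if gene_set <= genes)
    return n_full >= 2 and co_alteration_fraction(altered_by_cell, gene_set) >= min_fraction


# Hypothetical data: each cell expresses a different artificial transcription factor.
altered_by_cell = {
    "cell_ZFP1": {"geneA", "geneB", "geneC"},
    "cell_ZFP2": {"geneA", "geneB"},
    "cell_ZFP3": {"geneC"},
    "cell_ZFP4": {"geneA"},
}
fraction = co_alteration_fraction(altered_by_cell, {"geneA", "geneB"})
```

In this hypothetical data, geneA or geneB is altered in three cells and both are altered in two of them, so the set passes a 50% cutoff but not a 70% cutoff.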

In one embodiment, the step of identifying includes grouping genes. The grouping can be performed recursively. In one embodiment, the grouping is a function of a similarity coefficient among genes in a candidate group. For example, genes are assigned to a group if the similarity coefficient is greater than a threshold value. In one embodiment, the grouping is performed recursively and the threshold value is decreased during subsequent iterations. In one embodiment, the step of identifying includes clustering genes as a function of the expression information, e.g., using hierarchical clustering, Bayesian clustering, or k-means clustering.
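One possible reading of the recursive, threshold-based grouping is sketched below. The choice of the Pearson correlation as the similarity coefficient, the greedy seeding order, and the parameter names (`threshold`, `step`, `floor`) are assumptions made for illustration; the text above does not fix any of them.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def group_genes(profiles, genes, threshold, step=0.1, floor=0.5):
    """Greedily group genes whose similarity to every current group member
    exceeds the threshold, then recurse on ungrouped genes with a lower threshold."""
    if threshold < floor or len(genes) < 2:
        return []
    groups, leftover = [], []
    remaining = list(genes)
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        for gene in list(remaining):
            if all(pearson(profiles[gene], profiles[m]) > threshold for m in group):
                group.append(gene)
                remaining.remove(gene)
        if len(group) > 1:
            groups.append(group)
        else:
            leftover.append(seed)
    return groups + group_genes(profiles, leftover, threshold - step, step, floor)


# Hypothetical expression values for four genes measured in four cells.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
    "geneD": [2.0, 3.0, 1.0, 4.0],
}
groups = group_genes(profiles, list(profiles), threshold=0.9, step=0.2, floor=0.6)
```

With these values, geneA and geneB group on the first pass (correlation 1.0), while geneC and geneD (correlation 0.8) group only after the threshold is lowered from 0.9 to 0.7.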

In one embodiment, the step of identifying includes using self-organizing maps, shortest path analysis, Boolean networking, graphical modeling, and/or individual gene grouping. In one embodiment, the step of identifying includes principal component analysis. In one embodiment, the step of identifying includes translating the expression information into binary values, and identifying similar genes using the binary values. For example, the step of identifying similar genes using the binary values includes use of the Hamming Distance, a chi squared based measure, or a Fisher Exact test.
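The binary translation and Hamming Distance comparison mentioned above could look like the following minimal sketch. The use of log2 expression ratios and the cutoff of 1.0 (a two-fold change) are assumptions chosen for illustration; the chi squared and Fisher Exact alternatives are not shown.

```python
def binarize(log2_ratios, min_abs_log2=1.0):
    """Translate expression changes into 1 (altered) or 0 (not altered)."""
    return [1 if abs(v) >= min_abs_log2 else 0 for v in log2_ratios]


def hamming(a, b):
    """Number of cells in which two binary gene vectors disagree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical log2 ratios for three genes across four cells.
gene_a = binarize([1.5, 0.2, -2.0, 0.1])
gene_b = binarize([1.1, -0.3, -1.4, 0.0])
gene_c = binarize([0.1, 1.8, 0.2, -1.2])
```

Here `gene_a` and `gene_b` have a Hamming Distance of 0 and would be treated as similar, whereas `gene_a` and `gene_c` differ in all four cells.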

In one embodiment, the step of identifying includes evaluating a metric of similarity or dissimilarity, e.g., a multivariate distance measure, e.g., a Euclidean distance, Minkowski Distance, Mahalanobis Distance, Taxi-cab Distance, Canberra Metric, or Bray-Curtis Coefficient. For example, the identifying can include a filter so that for each cell of the plurality of cells, the multivariate distance measure for expression of all of the genes in the set is within a predetermined value. Other filters for defining the set of genes can also be used. For example, it is possible to require that all the genes in the set have the same directionality alteration, e.g., if one of the genes has increased expression, then all the genes of the set have increased expression or, conversely, if one of the genes has decreased expression, then all the genes of the set have decreased expression. Another exemplary filter is a range, e.g., all the genes of the set have an alteration of expression that is at least 0.5, 1.0, 1.5, 2.0, 5, 50, or 100 fold. In another exemplary range, all the genes of the set have an alteration that is statistically significant, e.g., at least 1, 1.5, or 2 standard deviations from the mean and so forth. In one embodiment, the user varies filter conditions, e.g., until an acceptable number of genes or sets of genes are identified.
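Several of the distance measures and filters listed above can be written compactly, as in this sketch. The function names are hypothetical, the inputs are assumed to be log2 expression ratios, and the Bray-Curtis and Canberra forms follow their common textbook definitions rather than any formula given in this disclosure.

```python
import math


def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is the taxi-cab distance, p=2 the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def canberra(x, y):
    """Canberra metric; terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total


def bray_curtis(x, y):
    """Bray-Curtis coefficient: sum |a - b| divided by sum |a + b|."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(abs(a + b) for a, b in zip(x, y))
    return num / den if den else 0.0


def same_direction(log2_ratios):
    """Directionality filter: all genes increased or all genes decreased."""
    return all(v > 0 for v in log2_ratios) or all(v < 0 for v in log2_ratios)


def meets_min_fold(log2_ratios, min_fold=2.0):
    """Range filter: every gene changes by at least min_fold in either direction."""
    cutoff = math.log2(min_fold)
    return all(abs(v) >= cutoff for v in log2_ratios)


x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
taxicab = minkowski(x, y, p=1)
euclidean = minkowski(x, y, p=2)
```

A user could loosen or tighten `min_fold` or a distance cutoff until an acceptable number of gene sets remains, in line with the last sentence of the paragraph above.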

In one embodiment, the step of evaluating includes evaluating proteins encoded by the plurality of cellular genes. Proteins can be evaluated, for example, by mass spectroscopy, an immunoassay, spectroscopy, or 2-D gel electrophoresis. In another embodiment, the step of evaluating includes isolating RNA, and hybridization of the RNA or a derivative nucleic acid thereof to one or more probes.

In one embodiment, the nucleic acids encoding the different artificial, chimeric transcription factors include a randomly selected set of nucleic acids. For example, the method includes: (a) providing a library of nucleic acids, wherein each member of the library includes a sequence encoding an artificial chimeric transcription factor operably linked to a promoter and (b) introducing members of the library into the plurality of cells (e.g., at random), thus providing the plurality of cells.

In one embodiment, the nucleic acids encoding the different artificial, chimeric transcription factors include a preselected set of nucleic acids, wherein each member of the preselected set of nucleic acids encodes an artificial, chimeric transcription factor that can alter a cellular phenotype.

In one embodiment, each artificial, chimeric transcription factor includes a first zinc finger domain. Each artificial, chimeric transcription factor can further include a second, a third, and/or a fourth zinc finger domain, e.g., a total of at least three or four zinc finger domains. At least one of the zinc finger domains can be a naturally occurring zinc finger domain (i.e., a domain whose sequence is found in Nature, e.g., in a sequence database). For example, the first zinc finger domain is a mammalian zinc finger domain, e.g., murine, primate, or human.

In one embodiment, each cell of the plurality of cells is an animal cell, e.g., a mammalian cell, e.g., a human cell. The cells can be cultured cells or in an organism (e.g., grafted or transgenic cells). In one embodiment, the plurality of cells includes at least four, five, ten, 50, 100, 200, 300, 500, 1000, 2000, 3000, 5000, 8000, 10⁴, 10⁵, or 10⁶ cells. In one embodiment, each cell of the plurality of cells is derived from a common parental cell.

The method can further include creating a database record that associates the genes of the identified set. The method can further include determining whether proteins encoded by genes of the identified set physically interact, or evaluating an interaction between a test compound (e.g., a small organic, non-polymeric agent less than 2000 kDa in molecular weight) and a protein encoded by a gene of the set. Accordingly, the method can be used to direct drug screening.

In a related aspect, the invention features a method of evaluating a plurality of cellular genes. The method includes: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is co-regulated among the cells of the plurality. The method can include other features described herein.

In another aspect, the invention features a machine accessible medium having encoded thereon information that represents a plurality of database records. Each record of the plurality includes information that indicates abundance of a transcript in a cell and a reference that identifies the cell (e.g., by name, index, random identifier, physical location, or contents, e.g., zinc finger protein present within). The plurality of records includes records for the same transcripts in at least two different cells. Each cell includes an artificial, chimeric transcription factor that differs among the different cells. For example, each transcription factor includes a zinc finger domain. For at least some of the cells, the transcription factor can be randomly selected. In one embodiment, the plurality of database records includes database records for at least 10, 50, 100, 500, 1000, or 5000 different transcription factors.

The invention also features a method that includes providing such a medium; and identifying, from information encoded on the medium, a set of two or more genes whose expression is altered by at least two of the different artificial, chimeric transcription factors, such that, in each evaluated cell (or in at least 50, 60, 70, 80, 90, 95, or 100% of the cells) in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered. The method can include other features described herein.

In another aspect, the invention features a machine accessible medium that has encoded thereon information that represents a plurality of database records. Each record of the plurality includes a transcript profile that indicates the abundance of a plurality of cellular transcripts and a reference that identifies the cell. Each cell includes an artificial, chimeric transcription factor, and the transcription factor differs among at least some of the cells. The medium can be used to identify a set of two or more genes that are co-regulated, e.g., a set of two or more genes whose expression is altered by at least two of the different transcription factors, such that, in each evaluated cell in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered.

In one aspect, the invention features a method of evaluating a plurality of cellular genes. The method includes: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the nucleic acid in each cell of the plurality; evaluating transcription of a plurality of cellular genes in each cell of the plurality; and identifying, from results of the evaluating, two or more genes that are co-regulated in at least one of the cells of the plurality.

In one embodiment, each artificial, chimeric transcription factor includes at least one zinc finger domain, e.g., two, three, four, five, six or more zinc finger domains. The domains can be configured in an array with a regular linker length and uninterrupted by other types of domains. At least one zinc finger domain, e.g., two, three, or more, can be a naturally occurring zinc finger domain. Each artificial transcription factor can include domains from at least two different naturally occurring proteins. Further, an artificial transcription factor can include at least one artificial zinc finger domain (e.g., an artificial mutant of a naturally-occurring zinc finger domain).

The evaluating can include profiling gene expression in each cell of the plurality to provide a plurality of profiles. The profiling can include hybridization to a nucleic acid array. In another example, the evaluating can include comparing gene expression by another hybridization-based method, e.g., subtractive hybridization, or by differential display and so forth.

Information from the plurality of profiles can be analyzed to identify a set of genes that are co-regulated in each cell of the plurality. The analyzing can include one or more of: hierarchical clustering, Bayesian clustering, k-means clustering, self-organizing maps, shortest path analysis, Boolean networking, graphical modeling, and individual gene grouping.

The method can further include comparing annotated information for the two or more co-regulated genes, e.g., identifying common key words. In another example, the method can further include evaluating a phenotype of each cell, e.g., a phenotype other than direct evaluation of transcript abundance. The method can further include generating and storing a database record for each profile of the plurality. Thus, the results of the evaluating can be retrieved from the computer database to identify the two or more co-regulated genes.
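As a rough illustration of comparing annotated information to find common key words, the sketch below intersects the words used in free-text annotations of co-regulated genes. The gene names, annotation strings, tokenization rule, and stop-word list are all hypothetical.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "in", "to"}


def keywords(annotation):
    """Lower-case word tokens of an annotation, minus common stop words."""
    return set(re.findall(r"[a-z0-9-]+", annotation.lower())) - STOPWORDS


def common_keywords(annotations, genes):
    """Key words shared by the annotations of every listed gene."""
    keyword_sets = [keywords(annotations[g]) for g in genes]
    return set.intersection(*keyword_sets) if keyword_sets else set()


# Hypothetical annotations for two co-regulated genes.
annotations = {
    "geneA": "Cell cycle kinase",
    "geneB": "Cyclin-dependent kinase inhibitor",
}
shared = common_keywords(annotations, ["geneA", "geneB"])
```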

In one embodiment, the co-regulated genes are co-regulated in at least two, three, four, or ten cells of the plurality, e.g., at least 5, 10, 20, 50, or 90% of the cells of the plurality.

The plurality of cells can include at least 5, 10, 50, 100, 200, or 500 different cells. The cells are typically all derivatives of a common parent cell. Exemplary cells include fungal cells, plant cells, and animal cells (e.g., mammalian cells such as mouse, rat, primate, or human cells). The nucleic acids encoding the different artificial, chimeric transcription factors can constitute a random population or a preselected set. For the latter case, the nucleic acids can be preselected, for example, because they each encode a transcription factor that causes a particular phenotype or because they cause different phenotypes. In the former case, the nucleic acids can be isolated at random from a library of nucleic acids. The library itself can be randomly constructed, e.g., as described herein.

The method can similarly be implemented by profiling protein abundance or protein modification state. For example, the method can include: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the nucleic acid in each cell of the plurality; evaluating protein abundance of a plurality of proteins produced by each cell of the plurality; and identifying, from results of the evaluating, two or more proteins that are co-regulated in at least one of the cells of the plurality.

The method can also use other types of artificial chimeric proteins, e.g., artificial signaling proteins such as chimeras of different adaptor, kinase, and/or receptor intracellular domains.

In another aspect, the invention provides a method of inferring a genetic network. The method can include: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the nucleic acid in each cell of the plurality; evaluating transcription of a plurality of cellular genes in each cell of the plurality; and identifying, from results of the evaluating, two or more genes that are co-regulated in at least two cells (e.g., at least three, four, five, or any arbitrary number up to all cells) of the plurality. The identifying can include, for example, a shortest path analysis, e.g., to find a set of genes that are more similar in co-regulation than at least another set of genes. In one embodiment, the method can further include evaluating whether one of the identified genes causes expression of at least another set of genes. In another embodiment, the method includes inferring a relationship from automated analysis (e.g., keyword analysis) of an annotated database, e.g., a literature database or an annotated sequence database.

In another aspect, the invention provides a method of inferring a genetic network. The method can include: providing a cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the nucleic acid in the cell; evaluating transcription of a plurality of cellular genes in the cell under a plurality of different conditions; and identifying, from results of the evaluating, two or more genes that are co-regulated in at least two conditions (e.g., at least three, four, five, or any arbitrary number up to all conditions) of the plurality.

In another aspect, the invention features a machine readable medium having encoded thereon information including a plurality of database records, each record of the plurality indicating abundance of a transcript in a cell and a reference indicating identity of the cell, wherein the plurality of records includes records for the same transcripts in different cells, and each cell includes an artificial, chimeric transcription factor, and the transcription factor differs among at least some of the cells. The medium can enable a user to access the records, compare transcript abundance for a plurality of different cells, and identify two or more transcripts that are co-regulated in at least two of the cells. The transcription factor, for example, can include other features described herein.
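The database records described above might be modeled as in the following sketch, in which each record carries a transcript abundance together with a reference to the cell and the transcription factor it contains. The class name, field names, and example values are hypothetical; the disclosure does not prescribe a schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AbundanceRecord:
    cell_ref: str              # reference that identifies the cell
    transcription_factor: str  # artificial transcription factor in that cell
    transcript: str            # transcript identifier
    abundance: float           # measured abundance (arbitrary units)


def abundance_by_transcript(records):
    """Index records as transcript -> {cell_ref: abundance}."""
    table = defaultdict(dict)
    for rec in records:
        table[rec.transcript][rec.cell_ref] = rec.abundance
    return dict(table)


def transcripts_in_multiple_cells(records, min_cells=2):
    """Transcripts with records in at least min_cells different cells."""
    table = abundance_by_transcript(records)
    return sorted(t for t, cells in table.items() if len(cells) >= min_cells)


# Hypothetical records for two cells expressing different transcription factors.
records = [
    AbundanceRecord("cell_1", "ZFP-1", "transcript_X", 12.0),
    AbundanceRecord("cell_2", "ZFP-2", "transcript_X", 30.5),
    AbundanceRecord("cell_1", "ZFP-1", "transcript_Y", 4.2),
]
comparable = transcripts_in_multiple_cells(records)
```

Only transcripts recorded in two or more cells can be compared across cells, which is why the index is keyed by transcript first.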

Similarly, the invention features a medium having encoded thereon information including a plurality of database records, each record of the plurality indicating abundance of a protein species produced by a cell and a reference indicating identity of the cell, wherein the plurality of records includes records for the same transcripts in different cells, and each cell includes an artificial, chimeric transcription factor, and the transcription factor differs among at least some of the cells.

In one aspect, the invention features a method for identifying co-regulated genes, the method including: providing a plurality of transcript profiles wherein (1) each profile corresponds to a cell and includes a set of parameters, (2) each parameter of the set corresponds to a particular cellular transcript, (3) a plurality of different cellular transcripts are referenced by each of the profiles, and (4) the plurality of profiles includes profiles for cells that include different artificial chimeric transcription factors; and analyzing the plurality of transcript profiles to identify a set of transcripts that are co-regulated in at least two profiles of the plurality that correspond to cells having different artificial, chimeric transcription factors. The method can be a machine-based method, e.g., implemented by a computer. The method can further include outputting information about the set of transcripts, e.g., on a display for viewing by a user; storing the information about the set of transcripts on a computer readable medium; and transmitting the information to a user. For example, the information includes number of members of the set, identifiers of the transcripts, nucleic acid sequence, annotation to one or more of the transcripts. The method can further include evaluating a physical interaction between at least two proteins that include an amino acid sequence encoded by a transcript of the identified set. The step of evaluating physical interaction can include a two-hybrid assay or a cell-free assay (e.g., a fluorescence assay, an antibody-based assay, and so forth).

In another aspect, the invention features a method of providing information about a cellular network. The method can include: providing replicates of a reference cell and a plurality of nucleic acids, each nucleic acid encoding a different artificial, chimeric transcription factor; introducing at least some of the nucleic acids of the plurality into at least some of the replicate cells to yield a plurality of modified cells; expressing the nucleic acid in each modified cell of the plurality; evaluating transcription of a plurality of cellular genes in each modified cell of the plurality; and identifying, from results of the evaluating, two or more genes that are co-regulated in at least one of the modified cells of the plurality.

The method can further include evaluating a relationship between two or more of the gene products encoded by the two or more co-regulated genes. The method can include other features described herein. In another aspect, the invention features a method of identifying a target gene that is regulated by artificial transcription factors. The method includes: providing a nucleic acid library that includes a plurality of nucleic acids, each encoding a different artificial transcription factor, each transcription factor including at least two zinc finger domains; introducing each member of the plurality into a replicate of a test cell to provide a plurality of transformed cells; identifying a plurality of phenotypically altered cells from the plurality of transformed cells, wherein each phenotypically altered cell has an altered phenotype relative to the test cell; and identifying one or more transcripts or proteins whose abundance is similarly altered in at least two phenotypically altered cells of the plurality relative to the test cell. Examples of similarly altered abundances are changes in abundance relative to a reference (e.g., expression in a reference cell) that have the same directionality or a ratio of change (e.g., observed/reference) that is within a threshold value (e.g., regardless of directionality).
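The two examples of "similarly altered" abundance given above can be expressed as simple predicates. This sketch is one interpretation only: it treats the ratio test as a bound on the difference between two observed/reference ratios, and the function names, parameter `tolerance`, and abundance values are hypothetical.

```python
def change_ratio(observed, reference):
    """Ratio of change relative to the reference cell (observed/reference)."""
    return observed / reference


def same_directionality(ratio_a, ratio_b):
    """True if both abundances are increased or both are decreased."""
    return (ratio_a > 1 and ratio_b > 1) or (ratio_a < 1 and ratio_b < 1)


def ratios_within(ratio_a, ratio_b, tolerance):
    """True if the two ratios of change differ by no more than the tolerance."""
    return abs(ratio_a - ratio_b) <= tolerance


# Hypothetical abundances of one transcript in a reference cell and
# in three phenotypically altered cells.
reference = 100.0
r1 = change_ratio(250.0, reference)
r2 = change_ratio(300.0, reference)
r3 = change_ratio(50.0, reference)
```

Here the first two cells both show increased abundance (ratios 2.5 and 3.0), while the third shows a decrease (ratio 0.5) and would not count as similarly altered under the directionality test.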

In one embodiment, the method further includes, in a test cell, altering the activity of a transcript or protein whose abundance is similarly altered in at least two phenotypically altered cells, e.g., by genetic alteration (e.g., mutation or overexpression) or otherwise (e.g., RNA interference, anti-sense, or antibody binding). In some cases, the activity of a plurality of transcripts or proteins is altered.

In one embodiment, the similarly altered transcripts or proteins are identified by profiling transcripts or protein abundance in each phenotypically altered cell of the plurality to provide a profile for each phenotypically altered cell; and comparing the profiles to each other. Profiles can be obtained using, for example, a nucleic acid or protein array, SAGE tags, differential display, or subtractive hybridization.

With respect to all methods described herein, a library of nucleic acids that encode chimeric zinc finger proteins can be used. The term "library" refers to a physical collection of similar, but non-identical biomolecules. The collection can be, for example, together in one vessel or physically separated (into groups or individually) in separate vessels or on separate locations on a solid support. Duplicates of individual members of the library may be present in the collection. A first exemplary library includes a plurality of nucleic acids, each nucleic acid encoding a polypeptide including at least a first, second, and third zinc finger domain. As used herein, "first, second and third" denotes three separate domains that can occur in any order in the polypeptide: e.g., each domain can occur N-terminal or C-terminal to either or both of the others. The first zinc finger domain varies among nucleic acids of the plurality. The second zinc finger domain varies among nucleic acids of the plurality. At least 10 different first zinc finger domains are represented in the library. In one implementation, at least 0.5, 1, 2, 5%, 10%, or 25% of the members of the library have one or both of the following properties: (1) each represses or activates transcription of at least one plG reporter plasmid at least 1.25 fold in vivo; and (2) each binds at least one target site with a dissociation constant of no more than 7, 5, 3, 2, 1, 0.5, or 0.05 nM. The first and second zinc finger domains can be from different naturally-occurring proteins or are positioned in a configuration that differs from their relative positions in a naturally-occurring protein. For example, the first and second zinc finger domains may be adjacent in the polypeptide, but may be separated by one or more intervening zinc finger domains in a naturally occurring protein.

A second exemplary library includes a plurality of nucleic acids, each nucleic acid encoding a polypeptide that includes at least first and second zinc finger domains. The first and second zinc finger domains of each polypeptide (1) are identical to zinc finger domains of different naturally occurring proteins (and generally do not occur in the same naturally occurring protein or are positioned in a configuration that differs from their relative positions in a naturally-occurring protein), (2) differ by no more than four, three, two, or one amino acid residues from domains of naturally occurring proteins, or (3) are non-adjacent zinc finger domains from a naturally occurring protein. Identical zinc finger domains refer to zinc finger domains that are identical at each amino acid from the first metal coordinating residue (typically cysteine) to the last metal coordinating residue (typically histidine). The first zinc finger domain varies among nucleic acids of the plurality, and the second zinc finger domain varies among nucleic acids of the plurality. The naturally occurring protein can be any eukaryotic zinc finger protein: for example, a fungal (e.g., yeast), plant, or animal protein (e.g., a mammalian protein, such as a human or murine protein). Each polypeptide can further include a third, fourth, fifth, and/or sixth zinc finger domain. Each zinc finger domain can be a mammalian, e.g., human, zinc finger domain. Other types of libraries can also be used, e.g., including mutated zinc finger domains.

The plurality of nucleic acids can collectively encode at least 5, 10, 20, 30, or 40 different first zinc finger domains and/or at least 5, 10, 20, 30, or 40 different second zinc finger domains. The plurality of nucleic acids can include at least 10, 50, 200, 500, 1000, 5000, 10 000, 20 000, 25 000, or 40 000 different nucleic acids (i.e., with different sequences). In some cases, the plurality can include no more than 100, 500, 2000, 5000, 15 000, 30 000, or 50 000 nucleic acids. The plurality of nucleic acids can constitute at least 20%, 50%, 70%, 80%, 90%, 95%, or 100% of the library by molar ratio.

In one embodiment, the polypeptides encoded by nucleic acids of the plurality include different numbers of zinc finger domains. For example, polypeptides encoded by a first subset can include four zinc finger domains, and polypeptides encoded by a second subset can include five zinc finger domains. Another combination is one with three, four, and five domains, or four, five and six domains.

In one embodiment, the polypeptides encoded by nucleic acids of the plurality include different types of transcriptional regulatory domains. For example, polypeptides encoded by a first subset can include a transcriptional activation domain, and polypeptides encoded by a second subset can include a transcriptional repression domain. Still another subset can be devoid of a transcriptional regulatory domain. This embodiment enables, e.g., evaluating cells without bias for a particular type of transcription factor.

In one embodiment, the library can be randomly constructed from one or more sets of nucleic acids encoding zinc finger domains.

As used herein, the "dissociation constant" refers to the equilibrium dissociation constant of a polypeptide for binding a target site. For example, the dissociation constant can be assessed for a three finger protein by binding to a 28-basepair double-stranded DNA that includes one 9-basepair target site. The dissociation constant is determined by gel shift analysis using a purified protein that is bound in 20 mM Tris pH 7.7, 120 mM NaCl, 5 mM MgCl₂, 20 μM ZnSO₄, 10% glycerol, 0.1% Nonidet P-40, 5 mM DTT, and 0.10 mg/mL BSA (bovine serum albumin) at room temperature. Additional details are provided in Rebar and Pabo ((1994) Science 263:671-673) and US Published Application 2002-0061512 A1.
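For orientation, the relationship between an equilibrium dissociation constant and the fraction of target DNA shifted in a gel shift titration can be sketched as follows. This assumes simple 1:1 binding with protein in large excess over the labeled DNA, which is a standard simplification and not a procedure stated in this disclosure; the function names and concentrations are hypothetical.

```python
def fraction_bound(protein_nM, kd_nM):
    """Fraction of target DNA bound at equilibrium for 1:1 binding,
    assuming free protein is approximately equal to total protein."""
    return protein_nM / (protein_nM + kd_nM)


def kd_from_fraction(protein_nM, bound_fraction):
    """Solve the same relationship for the dissociation constant."""
    return protein_nM * (1.0 - bound_fraction) / bound_fraction


half_bound = fraction_bound(protein_nM=1.0, kd_nM=1.0)
mostly_bound = fraction_bound(protein_nM=9.0, kd_nM=1.0)
estimated_kd = kd_from_fraction(protein_nM=3.0, bound_fraction=0.75)
```

Under this simplification, the dissociation constant equals the protein concentration at which half of the target DNA is shifted.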

As used herein, the term "screen" refers to a process for evaluating members of a library to find one or more particular members that have a given property. In a direct screen, each member of the library is evaluated. For example, each cell is evaluated to determine if it is extending neurites. In another type of screen, termed a "selection," each member is not directly evaluated. Rather the evaluation is made by subjecting the members of the library to conditions in which only members having a particular property are retained. Selections may be mediated by survival (e.g., drug resistance) or binding to a surface (e.g., adhesion to a substrate). Such selective processes are encompassed by the term "screening."

The term "base contacting positions," "DNA contacting positions," or "nucleic acid contacting positions" refers to the four amino acid positions of zinc finger domains that structurally correspond to the positions of amino acids arginine 73, aspartic acid 75, glutamic acid 76, and arginine 79 of ZIF268.

Glu Arg Pro Tyr Ala Cys Pro Val Glu Ser Cys Asp Arg Arg Phe Ser
1               5                   10                  15

Arg Ser Asp Glu Leu Thr Arg His Ile Arg Ile His Thr Gly Gln Lys
            20                  25                  30

Pro Phe Gln Cys Arg Ile Cys Met Arg Asn Phe Ser Arg Ser Asp His
        35                  40                  45

Leu Thr Thr His Ile Arg Thr His Thr Gly Glu Lys Pro Phe Ala Cys
    50                  55                  60

Asp Ile Cys Gly Arg Lys Phe Ala Arg Ser Asp Glu Arg Lys Arg His
65                  70                  75                  80

Thr Lys Ile His Leu Arg Gln Lys Asp (SEQ ID NO:1)
                85

These positions are also referred to as positions -1, 2, 3, and 6, respectively. To identify positions in a query sequence that correspond to the base contacting positions, the query sequence is aligned to the zinc finger domain of interest such that the cysteine and histidine residues of the query sequence are aligned with those of finger 3 of ZIF268. The ClustalW WWW Service at the European Bioinformatics Institute (Thompson et al. (1994) Nucleic Acids Res. 22:4673-4680) provides one convenient method of aligning sequences.

Conservative amino acid substitutions refer to the interchangeability of residues having similar side chains. For example, a group of amino acids having aliphatic side chains is glycine, alanine, valine, leucine, and isoleucine; a group of amino acids having aliphatic-hydroxyl side chains is serine and threonine; a group of amino acids having amide-containing side chains is asparagine and glutamine; a group of amino acids having aromatic side chains is phenylalanine, tyrosine, and tryptophan; a group of amino acids having basic side chains is lysine, arginine, and histidine; a group of amino acids having acidic side chains is aspartic acid and glutamic acid; and a group of amino acids having sulfur-containing side chains is cysteine and methionine. Depending on circumstances, amino acids within the same group may be interchangeable. Some additional conservative amino acid substitution groups are: valine-leucine-isoleucine; phenylalanine-tyrosine; lysine-arginine; alanine-valine; aspartic acid-glutamic acid; and asparagine-glutamine.
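The numbering of the base contacting positions can be checked directly against SEQ ID NO:1, as in this short sketch. The one-letter string is a transcription of the three-letter listing, reading the scanned "He" and "Gin" as Ile and Gln; that reading, and all identifiers in the code, are assumptions of this illustration.

```python
# One-letter transcription of SEQ ID NO:1 (89 residues), 16 residues per row.
ZIF268_FRAGMENT = (
    "ERPYACPVESCDRRFS"   # residues 1-16
    "RSDELTRHIRIHTGQK"   # residues 17-32
    "PFQCRICMRNFSRSDH"   # residues 33-48
    "LTTHIRTHTGEKPFAC"   # residues 49-64
    "DICGRKFARSDERKRH"   # residues 65-80
    "TKIHLRQKD"          # residues 81-89
)

# Position label within the finger -> residue number in SEQ ID NO:1.
BASE_CONTACT_POSITIONS = {-1: 73, 2: 75, 3: 76, 6: 79}


def base_contacting_residues(sequence=ZIF268_FRAGMENT, positions=BASE_CONTACT_POSITIONS):
    """Return the residue found at each base contacting position (1-based numbering)."""
    return {label: sequence[number - 1] for label, number in positions.items()}


contacts = base_contacting_residues()
```

For the sequence as transcribed, positions -1, 2, 3, and 6 give R, D, E, and R, matching arginine 73, aspartic acid 75, glutamic acid 76, and arginine 79 in the definition above. For a query sequence, the same lookup would be applied after aligning its cysteine and histidine residues to those of finger 3.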

The term "heterologous polypeptide" refers either to a polypeptide with a non-naturally occurring sequence (e.g., a hybrid polypeptide) or a polypeptide with a sequence identical to a naturally occurring polypeptide but present in a milieu in which it does not naturally occur.

The term "hybrid" refers to a non-naturally occurring polypeptide that includes amino acid sequences derived from either (i) at least two different naturally occurring sequences; (ii) at least one artificial sequence (i.e., a sequence that does not occur naturally) and at least one naturally occurring sequence; or (iii) at least two artificial sequences (same or different). Examples of artificial sequences include mutants of a naturally occurring sequence and de novo designed sequences.

As used herein, the term "hybridizes under stringent conditions" refers to conditions for hybridization in 6X sodium chloride/sodium citrate (SSC) at 45°C, followed by two washes in 0.2X SSC, 0.1% SDS at 65°C.

The term "binding preference" refers to the discriminative property of a polypeptide for selecting one nucleic acid binding site relative to another. For example, when the polypeptide is limiting in quantity relative to two different nucleic acid binding sites, a greater amount of the polypeptide will bind the preferred site relative to the other site in an in vivo or in vitro assay described herein.

The term "abundance" can refer to binary (e.g., absent/present), qualitative (e.g., absent/low/medium/high), or quantitative information (e.g., a value proportional to concentration) indicating the presence of a particular molecular species.

The term "transcript profile" refers to a data structure that includes information about the abundance of a plurality of transcripts.

The term "co-regulated" refers to a set of two or more genes that are coordinately regulated such that, under a statistically significant set of conditions, either the expression of all genes of the set is altered or the expression of none of the genes of the set is altered. The directionality and degree (e.g., fold difference, ratio, or slope of change) in expression may vary for the set of co-regulated genes, or they may be the same (e.g., the same directionality, e.g., all increased or all decreased) or within a particular range (e.g., same standard deviation for the ratio of change). A statistically significant set of conditions can include all conditions analyzed (e.g., all cells analyzed), or a sufficient number of cells to provide a significant statistical factor (e.g., P < 0.05, or P < 0.002), or at least 70, 75, 80, 85, 90, 92, 95, 96, or 97% of the conditions. Different "conditions" can refer, e.g., to the presence of different artificial transcription factors, different environmental conditions (e.g., temperature), or different host cells. In one embodiment, the environmental conditions and host cells are substantially the same or the same for all evaluations, but the artificial transcription factor varies.
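As an illustration of the "all altered or none altered" criterion, the sketch below scores a candidate gene set across conditions. It is one possible reading only: the function names, the boolean encoding of "altered," the 90% default threshold (one of the percentages listed above), and the extra requirement that the genes be altered together in at least one condition are all assumptions made for the example.

```python
def coregulation_fraction(altered):
    """altered maps gene -> list of booleans, one per condition.

    Returns the fraction of conditions in which the genes of the set are
    either all altered or all unaltered.
    """
    profiles = list(altered.values())
    n_conditions = len(profiles[0])
    if any(len(p) != n_conditions for p in profiles):
        raise ValueError("every gene needs one value per condition")
    concordant = sum(1 for states in zip(*profiles)
                     if all(states) or not any(states))
    return concordant / n_conditions


def is_coregulated(altered, min_fraction=0.9):
    """Apply a percentage-of-conditions threshold to the candidate set.

    Assumption: the set must also be altered together at least once, so that
    genes that never respond are not reported as co-regulated.
    """
    profiles = list(altered.values())
    altered_together = any(all(states) for states in zip(*profiles))
    return altered_together and coregulation_fraction(altered) >= min_fraction


# Hypothetical data: ten conditions (e.g., ten different ZFPs), two genes.
example = {
    "geneA": [True, True, False, False, True, False, False, False, True, False],
    "geneB": [True, True, False, False, True, False, False, False, True, True],
}
```

Here the two genes agree in nine of ten conditions, so the set passes a 90% threshold but not a 95% threshold.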

The term "syn-expression group" refers to a set of co-regulated genes.

Recent advances in DNA microarray technology facilitate the high-throughput analysis of tens of thousands of genes. Information on the expression profile of a specific gene is valuable for a variety of reasons. For example, genes with similar functions may share similar expression profiles, and can be grouped together based on the shared expression profile (see, e.g., Eisen et al. PNAS 1998, 95, 14863-8; Iyer et al. Science 1999, 283, 83-7).

Genes with similar expression profiles are more likely to encode interacting proteins (see, e.g., Ge et al. Nat Genet 2001, 29, 482-6; Kemmeren et al. Mol Cell 2002, 9, 1133-43).

The profiling methods described herein avail themselves of many of the advantages of artificial transcription factors, and frequently the advantages of chimeric zinc finger proteins. For example, unlike some other methods, the effect of an artificial transcription factor on its immediate target is not limited to a particular directionality. Both transcriptional activation and repression can be achieved, e.g., by use of an appropriate transcriptional regulatory domain. Use of these ZFP-based artificial transcription factors can provide an efficient, high-throughput perturbation of the human genome. Further, activation or up-regulation can result in expression of genes that are not normally expressed in a specific tissue being studied. Thus, one can gain information on groups of genes expressed in different tissues.

The specificity of a ZFP can be precisely controlled by the number of finger domains. Depending upon whether three-, four-, five- or six-fingered ZFPs are used, the number of regulated genes in a cell varies. Three- or four-fingered proteins typically are not as specific as six-fingered proteins. Gene expression profiling has demonstrated that many three-fingered ZFPs regulate several genes. Unlike methods that directly target a single gene, artificial ZFPs (especially ones with fewer fingers) often directly affect multiple genes. This regulatory output provides an opportunity to identify many sets of co-regulated genes.

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic of an exemplary method for inducing expression of an artificial ZFP in a cultured cell.

FIG. 2 is a schematic of an exemplary method for inducing expression of an artificial ZFP in a cultured cell, profiling gene expression in the cell, and clustering results of the profiling.

FIG. 3 depicts scatter plots for different artificial ZFPs (A, B, C, D, E and F) that include a transcriptional activation domain. The x and y axes refer to expression levels in a cell that expresses the artificial ZFP and a reference cell. Thus, dots near the diagonal represent transcripts whose levels are similar, whereas dots removed from the diagonal represent transcripts whose levels are increased or decreased in the presence of the artificial ZFP.

FIG. 4 depicts scatter plots for different artificial ZFPs (A, B, C, D, E and F) that include a transcriptional repression domain.

FIG. 5 depicts scatter plots obtained at various time points after expression of a particular artificial ZFP, F104-p65.

FIGs. 6A, 6B, 7, and 8 provide information about sets of co-regulated genes.

FIG. 9A is a diagram of the multiple cloning site region of pYCT-Lib.

FIG. 9B lists an exemplary DNA sequence of the multiple cloning site region of pYCT-Lib (SEQ ID NO:2).

FIG. 10 includes scatter plots for stably transfected cells expressing different ZFPs.

FIG. 11A is a listing of the nucleotide sequence of the polylinker region of P3 (SEQ ID NO:3). The sequence outside of this region is identical to that of the parental vector, pcDNA3 (Invitrogen). Each enzyme site is italicized and the HA tag is underlined. Both initiation and stop codons are indicated by bold letters. The nuclear localization signal (NLS) is also indicated.

FIG. 11B is a schematic of one exemplary method for zinc finger protein library construction.

FIG. 12 is a flow chart of an exemplary method for target identification and validation.

FIG. 13 is a Venn diagram of sets of genes co-regulated by K5, K6, and K7 ZFPs.

FIG. 14 depicts scatter plots obtained at various time points after expression of a particular artificial ZFP, F104-p65.

FIGs. 15, 16, 17, and 18 provide information about sets of co-regulated genes.

FIG. 19 is an exemplary schematic of a method for identifying co-regulated genes.

FIG. 20 depicts characterization of a ZFP-TF-activated gene expression profile. A gene expression profile obtained with a ZFP-TF in two different cell lines. A ZFP activator, F2840-p65, was either stably expressed in 293 cells (left panel) or transiently expressed in HeLa cells (right panel). In both experiments, insulin was highly expressed (marked by arrows).

FIG. 21 depicts characterization of a ZFP-TF-activated gene expression profile. Time course analysis of gene expression driven by a ZFP-TF. A 293 cell line stably expressing a ZFP activator, F104-p65, was treated with Dox for the times stated. Two genes were activated more than two-fold throughout the course of the experiments (marked by arrows).

FIG. 22 depicts activation of the MAGE family by the Spi-B transcription factor. 293 cells were transiently transfected with either an empty vector (con) or a vector expressing Spi-B (Spi). 48 h post-transfection, cells were harvested and RNAs were prepared for real time PCR analysis. Results are the average of two experiments, each performed in duplicate.

FIG. 23 depicts several ZFP-activator-driven gene expression profiles.

FIG. 24 depicts several ZFP-repressor-driven gene expression profiles.

DETAILED DESCRIPTION

The sequences of the human genome and many other genomes are available. The functional characterization of genes in these genomes is of prime importance, and can be a critical step for drug discovery. This disclosure provides, inter alia, a method for evaluating or identifying a genetic network. For example, the method enables identification of a network in which gene A upregulates gene B and gene B downregulates gene C and so on. This type of characterization is extremely useful for selecting drug targets. For example, if gene A is known to be involved with a disease, the discovery that gene C is also part of a biological network that reacts to the disease condition identifies the protein encoded by gene C or gene C itself as a drug target.

In one embodiment, a biological network is identified based on expression data using a cell in which an artificial chimeric zinc finger protein is expressed. The expression data is analyzed to identify a set of genes that are coordinately regulated. The criterion used to identify the set of genes can depend on the application and the user's preference.

Exemplary criteria include a multivariate distance metric and a metric of binary information. For example, sets can be identified until a statistically significant set is found, e.g., using the Pearson's similarity coefficient. Exemplary distance metrics include Euclidean distance, Minkowski Distance, Mahalanobis Distance, Taxi-cab Distance, Canberra Metric, or Bray-Curtis Coefficient. Exemplary metrics for evaluating binary information include: the Hamming Distance, a chi-squared-based measure, or a Fisher Exact test. Many methods for evaluating expression data are described in Jagota, Microarray Data Analysis and Visualization, Bioinformatics By the Bay Press, 2001.
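Several of the metrics named above have compact textbook definitions. The sketch below implements a few of them in plain Python for two expression vectors of equal length; it uses common conventions (for example, Bray-Curtis as the sum of absolute differences over the sum of absolute sums, and Canberra terms with a zero denominator skipped), which are assumptions rather than definitions taken from this disclosure.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is taxi-cab, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def taxicab(x, y):
    return minkowski(x, y, 1)


def canberra(x, y):
    """Canberra metric; terms whose denominator is zero are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)


def bray_curtis(x, y):
    """Bray-Curtis coefficient: sum |a - b| divided by sum |a + b|."""
    return (sum(abs(a - b) for a, b in zip(x, y))
            / sum(abs(a + b) for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def hamming(x, y):
    """Number of positions at which two binary (on/off) profiles differ."""
    return sum(a != b for a, b in zip(x, y))
```

Mahalanobis distance, the chi-squared measure, and the Fisher Exact test are omitted because they need a covariance estimate or a statistical distribution, which would usually come from a numerical library.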

It is appreciated that for any criterion that is used to winnow the set of identified genes, the skilled artisan can select a variety of parameters, including an arbitrary parameter, observe the results produced, and modify the criterion (e.g., to make it more or less stringent). This process can be performed repeatedly, e.g., until at least one set of genes whose functions are known to be linked by annotation is identified. In another example, the process is performed until a particular quality parameter is attained for the results.

Additional methods that can be used to identify a biological network from the expression data include: a Bayesian network, Boolean network, graphical modeling, statistical analysis, energy minimization, and so forth. For example, Friedman et al. (2000) J Comput Biol. 7(3-4):601-20 describes the use of Bayesian networks to evaluate gene expression data. In another example, Shortest-Path (SP) analysis (see, e.g., Zhou et al. (2002) Proc. Nat. Acad. Sci. USA 99:12783) can be used to analyze gene expression patterns.

Prior to analysis, expression data can be pre-processed. In one embodiment, expression data is normalized relative to a reference cell, e.g., a control cell that does not include a functional artificial transcription factor or any arbitrary cell. The preprocessing can include one or more of the following: evaluating a ratio of values, evaluating an arithmetic difference in values, log-transforming ratio or difference values, scaling data, zero-centering data, dividing values by a standard deviation, weighting data, and transforming data into discrete categories (e.g., binary on/off categories).
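A few of the preprocessing steps listed above are sketched below for a single sample measured against a reference cell. The function names, the choice of a base-2 logarithm, the sample standard deviation, and the one-log-unit threshold for the binary category are illustrative assumptions.

```python
import math


def log_ratios(sample, reference):
    """Log2 of the sample/reference ratio for each gene (values must be > 0)."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]


def zero_center(values):
    """Subtract the mean so that the values are centered on zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def divide_by_sd(values):
    """Divide each value by the sample standard deviation of the values."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [v / sd for v in values]


def to_binary(values, threshold=1.0):
    """Binary on/off category: True if |log ratio| reaches the threshold."""
    return [abs(v) >= threshold for v in values]


# Hypothetical intensities for three genes in a sample and a reference cell.
ratios = log_ratios([8.0, 1.0, 2.0], [2.0, 4.0, 2.0])
```

In this toy case the first gene is four-fold up, the second four-fold down, and the third unchanged, so only the first two are flagged by `to_binary(ratios)`.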

In one aspect, a method described herein is implemented as a high-throughput approach to characterize a cellular system. For example, a large-scale expression profile database is assembled by evaluating cells that express an artificial zinc finger protein (ZFP). The cells each include a nucleic acid encoding the artificial ZFP. In one embodiment, the nucleic acids are randomly isolated from a library, e.g., a library of nucleic acids that encode a chimera of at least three zinc finger domains that are themselves randomly selected from one or more sets. In another embodiment, the nucleic acids or the cells are non-randomly selected. For example, each cell may have a particular phenotype. (See below for exemplary phenotypes).

For example, we have assembled a pool of several thousand active ZFP-TFs. The target genes regulated by these ZFPs are unknown, at least initially. Nucleic acids encoding these ZFPs are transformed into the target cells, e.g., cultured human cells. The nucleic acids are expressed (e.g., using an inducer) so that the ZFPs are produced in the cells. Then parameters of the transformed cells are evaluated and compared to each other. The evaluated parameters typically include a gene expression profile. Microarray analysis was used to identify gene expression profiles for these randomly chosen ZFP-TFs. We observed that each ZFP up- or down-regulates a distinct set of genes in the human genome. The gene expression profiles obtained were then clustered and analyzed. The analysis identified a number of co-regulated gene groups. Both global clustering and individual gene grouping identified several genes that are co-regulated and whose functional relationships are evident, thus demonstrating the validity of this approach. Notably, in this example, the clustering analysis can be applied to the ensemble of profile data for all the transformed cells. Thus, the complete ensemble of data is tapped without bias for particular phenotypes or cellular properties.

Profiling Regulatory Properties of a Chimeric Zinc Finger Protein

A variety of methods are available to evaluate (e.g., profile) a cell that expresses an artificial chimeric transcription factor. Typically, the cell is analyzed to determine the levels of transcripts or proteins present in the cell, or in the medium surrounding the cell. For example, mRNA can be harvested from the cell and analyzed using a nucleic acid microarray. Nucleic acid microarrays can be fabricated by a variety of methods, e.g., photolithographic methods (see, e.g., U.S. Patent No. 5,510,270), mechanical methods (e.g., directed-flow methods as described in U.S. Patent No. 5,384,261), and pin-based methods (e.g., as described in U.S. Pat. No. 5,288,514). The array is synthesized with a unique capture probe at each address, each capture probe being appropriate to detect a nucleic acid for a particular expressed gene.

The mRNA can be isolated by routine methods, e.g., including DNase treatment to remove genomic DNA and hybridization to an oligo-dT coupled solid substrate (e.g., as described in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y). The substrate is washed, and the mRNA is eluted. The isolated mRNA is then reverse transcribed and optionally amplified, e.g., by rtPCR, e.g., as described in U.S. Patent No. 4,683,202. The nucleic acid can be labeled during amplification or reverse transcription, e.g., by the incorporation of a labeled nucleotide. Examples of preferred labels include fluorescent labels, e.g., red-fluorescent dye Cy5 (Amersham) or green-fluorescent dye Cy3 (Amersham). Alternatively, the nucleic acid can be labeled with biotin, and detected after hybridization with labeled streptavidin, e.g., streptavidin-phycoerythrin (Molecular Probes).

The labeled nucleic acid is then contacted to the array. In addition, a control nucleic acid or a reference nucleic acid can be contacted to the same array. The control nucleic acid or reference nucleic acid can be labeled with a label other than the sample nucleic acid, e.g., one with a different emission maximum. Labeled nucleic acids are contacted to an array under hybridization conditions. The array is washed, and then imaged to detect fluorescence at each address of the array.

A general scheme for producing and evaluating profiles includes detecting hybridization at each address of the array. The extent of hybridization at an address is represented by a numerical value and stored, e.g., in a vector, a one-dimensional matrix, or one-dimensional array. The vector x has a value for each address of the array. For example, a numerical value for the extent of hybridization at a particular address is stored in variable x_a. The numerical value can be adjusted, e.g., for local background levels, sample amount, and other variations.

Nucleic acid is also prepared from a reference sample and hybridized to the same or a different array. The vector y is constructed identically to vector x. The sample expression profile and the reference profile can be compared, e.g., using a mathematical equation that is a function of the two vectors. The comparison can be evaluated as a scalar value, e.g., a score representing similarity of the two profiles. Either or both vectors can be transformed by a matrix in order to add weighting values to different genes detected by the array.
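The comparison of a sample vector x with a reference vector y can be sketched as follows. The text does not fix a particular equation, so the cosine similarity used here is only one possible choice of scalar score; the function names are hypothetical, and the weighting is applied as a per-gene factor, which is equivalent to multiplying each vector by a diagonal weighting matrix.

```python
import math


def background_adjust(raw, background):
    """Subtract the local background at each address, flooring at zero."""
    return [max(r - b, 0.0) for r, b in zip(raw, background)]


def apply_weights(vector, weights):
    """Per-gene weighting (same effect as a diagonal weighting matrix)."""
    return [w * v for w, v in zip(weights, vector)]


def similarity(x, y, weights=None):
    """Cosine similarity of two profiles, optionally weighted per gene."""
    if weights is not None:
        x, y = apply_weights(x, weights), apply_weights(y, weights)
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm
```

A score near 1 indicates profiles that point in the same direction, and setting a gene's weight to zero removes it from the comparison.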

The expression data can be stored in a database, e.g., a relational database such as a SQL database (e.g., Oracle or Sybase database environments). The database can have multiple tables. For example, raw expression data can be stored in one table, wherein each column corresponds to a gene being assayed, e.g., an address or an array, and each row corresponds to a sample. A separate table can store identifiers and sample information, e.g., the batch number of the array used, date, and other quality control information.
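The two-table layout described above can be sketched with Python's built-in `sqlite3` module standing in for an Oracle or Sybase environment. All table names, column names, and values below are hypothetical and chosen only for illustration; a real array would have one column per address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_info (
    sample_id   INTEGER PRIMARY KEY,
    zfp_label   TEXT,   -- which artificial ZFP the cell expressed
    array_batch TEXT,   -- batch number of the array used
    run_date    TEXT
);
CREATE TABLE raw_expression (
    sample_id INTEGER PRIMARY KEY REFERENCES sample_info(sample_id),
    addr_0001 REAL,     -- one column per assayed gene (array address)
    addr_0002 REAL,
    addr_0003 REAL
);
""")
conn.executemany(
    "INSERT INTO sample_info VALUES (?, ?, ?, ?)",
    [(1, "ZFP-A", "batch-01", "2003-01-15"),
     (2, "ZFP-B", "batch-01", "2003-01-16")],
)
conn.executemany(
    "INSERT INTO raw_expression VALUES (?, ?, ?, ?)",
    [(1, 1.0, 0.9, 4.2), (2, 1.1, 3.5, 3.9)],
)
conn.commit()

EXPRESSION_COLUMNS = {"addr_0001", "addr_0002", "addr_0003"}


def samples_above(column, threshold):
    """Return (zfp_label, value) rows where the given gene column exceeds threshold."""
    if column not in EXPRESSION_COLUMNS:  # whitelist: the name is put into the SQL text
        raise ValueError("unknown expression column: " + column)
    query = (
        f"SELECT s.zfp_label, r.{column} FROM raw_expression r "
        "JOIN sample_info s ON s.sample_id = r.sample_id "
        f"WHERE r.{column} > ? ORDER BY s.zfp_label"
    )
    return conn.execute(query, (threshold,)).fetchall()
```

Keeping quality-control fields in a separate table means the expression table stays purely numeric, while a join recovers which ZFP and array batch produced each row.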

Clustering. Genes that are similarly regulated can be identified, for example, by clustering expression data to identify co-regulated genes. Such a cluster may be indicative of a set of genes coordinately regulated by the chimeric zinc finger protein. Genes can be clustered using hierarchical clustering (see, e.g., Sokal and Michener (1958) Univ. Kans. Sci. Bull. 38:1409), Bayesian clustering, k-means clustering, and self-organizing maps (see, Tamayo et al. (1999) Proc. Natl. Acad. Sci. USA 96:2907). In many embodiments described herein, these clustering methods are used to identify a set of co-regulated genes, i.e., genes that are similarly regulated by different ZFPs.
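To make the clustering step concrete, the following is a small, unoptimized sketch of agglomerative hierarchical clustering with average linkage, using one minus the Pearson correlation as the distance between genes. The distance choice, the stopping rule (a fixed number of clusters), the function names, and the example values are assumptions for illustration; production analyses would normally rely on an established clustering package.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage_clusters(profiles, n_clusters):
    """profiles maps gene -> values across cells expressing different ZFPs."""
    clusters = [[gene] for gene in profiles]

    def gene_distance(a, b):
        return 1.0 - pearson(profiles[a], profiles[b])

    def cluster_distance(c1, c2):
        # Average of all pairwise gene distances between the two clusters.
        total = sum(gene_distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]],
                                                         clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical log ratios for four genes in cells expressing four ZFPs.
profiles = {
    "geneA": [2.0, 0.1, -1.5, 0.0],
    "geneB": [1.8, 0.0, -1.2, 0.2],
    "geneC": [-0.1, 1.9, 0.1, -2.0],
    "geneD": [0.0, 2.2, 0.0, -1.7],
}
groups = average_linkage_clusters(profiles, n_clusters=2)
```

With these values, geneA and geneB respond to the same ZFPs and end up in one cluster, while geneC and geneD form the other.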

The similarity of a sample expression profile to a reference expression profile (e.g., a control cell) can also be determined, e.g., by comparing the log of the expression level of the sample to the log of the predictor or reference expression value and adjusting the comparison by the weighting factor for all genes of predictive value in the profile.

Proteins can also be profiled in a cell that has a chimeric zinc finger protein within it.

One exemplary method for profiling proteins includes 2-D gel electrophoresis and mass spectroscopy to characterize individual protein species. Individual "spots" on the 2-D gel are proteolyzed and then analyzed on the mass spectrometer. This method can identify both the protein component and, in many cases, translational modifications.

Target DNA Site Identification

With respect to chimeric DNA binding proteins, a variety of methods can be used to determine the target site of a chimeric DNA binding protein that produces a phenotype of interest. Such methods can be used, alone or in combination, to find such a target site. For example, it may be useful to determine if a set of co-regulated genes each include a sequence that could be directly bound by the relevant chimeric DNA binding protein.

In one embodiment, information from an expression profile is used to identify the target site recognized by a chimeric zinc finger protein. The regulatory regions of genes that are co-regulated by the chimeric zinc finger protein are compared to identify a motif that is common to all or many of the regulatory regions. In another embodiment, biochemical means are used to determine what DNA site is bound by the chimeric zinc finger protein. For example, chromatin immunoprecipitation experiments can be used to isolate nucleic acid to which the chimeric zinc finger protein is bound. The isolated nucleic acid is PCR amplified and sequenced. See, e.g., Gogus et al. (1996) Proc. Natl. Acad. Sci. USA. 93:2159-2164. The SELEX method is another exemplary method that can be used. Further, information about the binding specificity of individual zinc finger domains in the chimeric zinc finger protein can be used to predict the target site. The prediction can be validated or can be used to guide interpretation of other results (e.g., from chromatin immunoprecipitation, in silico analysis of co-regulated genes, and SELEX).
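The in silico comparison of regulatory regions mentioned above can be approximated, in its simplest form, by counting short words (k-mers) that occur in all or most of the regions. The sketch below is deliberately naive: it scans only the given strand, requires exact matches, and uses made-up sequences; the function name and parameters are hypothetical. Dedicated motif-discovery tools handle degenerate motifs and background statistics.

```python
def shared_kmers(regions, k=6, min_fraction=1.0):
    """Return k-mers found in at least min_fraction of the regulatory regions.

    regions maps gene -> regulatory sequence (one strand, 5' to 3').
    """
    counts = {}
    for seq in regions.values():
        seq = seq.upper()
        # A set is used so each k-mer counts once per region.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in seen:
            counts[kmer] = counts.get(kmer, 0) + 1
    needed = min_fraction * len(regions)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Made-up regulatory fragments for three hypothetical co-regulated genes.
toy_regions = {
    "gene1": "TTACGGAAGAGGACTTGCA",
    "gene2": "CCGAAGAGGACATATGGT",
    "gene3": "AGTCTAGGAAGAGGACCA",
}
common = shared_kmers(toy_regions, k=9)
```

In this toy input the only 9-mer present in all three fragments is `GAAGAGGAC`, which would then be a candidate site to compare against the predicted specificity of the ZFP.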

In still another embodiment, a potential target site is inferred based on information about the binding specificity of each component zinc finger. For example, a chimera that includes the zinc finger domains, from N- to C-terminus, CSNR, RSNR, and QSNR is expected to recognize the target site 5'-GAAGAGGACC-3' (SEQ ID NO:4). The domains CSNR, RSNR, and QSNR have the following respective DNA binding specificities: GAC, GAG, and GAA. The expected target site is formed by considering the domains in C-terminal to N-terminal order and concatenating their recognition specificities to obtain one strand of the target site in 5' to 3' order.
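The concatenation rule just described is easy to express in code. The sketch below uses only the three triplet specificities stated in the text; the dictionary and function names are hypothetical. Note that concatenating three triplets yields nine bases, which correspond to the first nine bases of the ten-base site given above.

```python
# Triplet specificities stated in the text for the three example domains.
DOMAIN_SPECIFICITY = {"CSNR": "GAC", "RSNR": "GAG", "QSNR": "GAA"}


def predicted_target_site(domains_n_to_c, specificity=DOMAIN_SPECIFICITY):
    """Concatenate triplets in C-terminal to N-terminal domain order.

    domains_n_to_c lists the finger domains from N- to C-terminus; the
    returned string is one strand of the predicted site, written 5' to 3'.
    """
    return "".join(specificity[d] for d in reversed(domains_n_to_c))


site = predicted_target_site(["CSNR", "RSNR", "QSNR"])
```

Reversing the domain order before concatenation reflects the rule that the C-terminal finger contributes the 5' end of the written strand.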

Although in most cases, chimeric zinc finger proteins are likely to function as transcriptional regulators, it is possible that in some cases the chimeric zinc finger proteins mediate their phenotypic effect by binding to an RNA or protein target. Some naturally occurring ZFPs in fact bind to these macromolecules.

Screening Nucleic Acid Libraries Encoding Chimeric Proteins

Library nucleic acids can be introduced into cells by a variety of methods. In one example, the library is stored as a random pool including multiple replicates of each library nucleic acid. An aliquot of the pool is transformed into cells. In another embodiment, individual library members are stored separately (e.g., in separate wells of a microtitre plate) and are individually introduced into cells.

In still another embodiment, the library members are stored in pools that have a reduced complexity relative to the library as a whole. For example, each pool can include 10 different library members from a library of 10 or 10 different members. When a pool is identified as having a member that causes a particular effect, the pool is deconvolved to identify the individual library member that mediates the phenotypic effect. This approach is useful when recovery of the altered cell is difficult, e.g., in a screen for chimeric proteins that cause apoptosis.

Library nucleic acids can be introduced into cells by a variety of methods. Exemplary methods include electroporation (see, e.g., U.S. Pat. No. 5,384,253); microprojectile bombardment techniques (see, e.g., U.S. Pat. Nos. 5,550,318; 5,538,880; 5,610,042; and WO 94/09699); liposome-mediated transfection (e.g., using Lipofectamine™ (Gibco BRL) or Superfect™ (Qiagen); see, e.g., Nicolau et al., "Liposomes as carriers for in vivo gene transfer and expression," Methods Enzymol., 149:157-176, 1987); calcium phosphate or DEAE-Dextran mediated transformation (see, e.g., Rippe et al., "DNA-mediated gene transfer into adult rat hepatocytes in primary culture," Mol. Cell Biol., 10:689-695, 1990); direct microinjection or sonication loading; receptor mediated transfection (see, e.g., EP 273 085); and Agrobacterium-mediated transformation (see, e.g., U.S. Pat. Nos. 5,563,055 and 5,591,616).

It is also possible to use a viral particle to deliver a library nucleic acid into a cell in vitro or in vivo. Viral packaging can also deliver the library nucleic acids to cells within an organism. In another embodiment, the library nucleic acids are introduced into cells in vitro, after which the cells are transferred into an organism.

After introduction of the library nucleic acids, the library nucleic acids are expressed so that the chimeric proteins encoded by the library are produced by the cells. Constant regions of the library nucleic acid can provide necessary regulatory and supporting sequences to enable expression. Such sequences can include transcriptional promoters, transcription terminators, splice site donors and acceptors, untranslated regulatory regions (such as polyA addition sites), bacterial origins of replication, and markers for indicating the presence of the library nucleic acid or for selection of the library nucleic acid. After the nucleic acids are expressed, the cells can be profiled as described above.

Screening

In some embodiments, the cells are screened to identify ones that have an altered phenotype. This process can be adapted to the phenotype of interest. As the number of possible phenotypes is vast, so too are the possibilities for screening. Numerous genetic screens and selections have been conducted to identify mutants or overexpressed naturally occurring genes that result in particular phenotypes. Any of these methods can be adapted to identify useful members of a nucleic acid library encoding chimeric proteins.

Some screens involve particular environmental conditions. Cells that are sensitive or resistant to the condition are identified. Some screens require detection of a particular behavior of a cell (e.g., chemotaxis, morphological changes, or apoptosis), or a particular behavior of an organism (e.g., phototaxis by a plant, mating behavior by a Drosophila, and so forth).

Some screens relate to cell proliferation. Cells that proliferate at a different rate relative to a reference cell (e.g., a normal cell) are identified. In addition, cells that have an altered response to a proliferative signal (e.g., a growth factor or other mitogen) can be identified. The cells may be more or less sensitive to the signal.

Screens that relate to cell differentiation can also be used. Chimeric ZFPs can be screened and used to modulate the differentiative and proliferative capacity of a variety of cells, including stem cells, such as ES cells and somatic stem cells, both human and otherwise. ZFPs can be found that direct ES cells to differentiate into a restricted lineage, such as neuronal progenitor cells or hematopoietic stem cells. It should also be possible to screen for ZFPs that can direct differentiation of stem cells toward a defined post-mitotic cell subtype, for example, directing differentiation of ES cells and/or neural stem cells to dopaminergic or cholinergic neurons. Among other phenotypes to evaluate differentiation, it is possible to look at expression of marker genes and marker proteins. Examples of such markers include:

FLK1; endothelial cells (Cho SK et al., Blood 2001 Dec 15;98(13):3635-42; SI Nishikawa et al., Development, Vol 125, Issue 9 1747-1757);

VSMC-specific myosin heavy chain (MHC); smooth muscle (Drab M et al., FASEB J 1997 Sep;11(11):905-15, abstract);

bone-specific alkaline phosphatase (BAP) or osteocalcin; osteoblast (Demers LM et al., Cancer 2000 Jun 15;88(12 Suppl):2919-26);

CD4, CD8, CD45; white blood cell (Ody C et al., Blood 2000 Dec 1;96(12):3988-90; Martin P et al., Blood 2000 Oct 1;96(7):2511-9);

Flk-2, CD34; hematopoietic stem cells (Julie L. Christensen and Irving L. Weissman, Proc. Natl. Acad. Sci. USA, 2001, Vol. 98, Issue 25, 14541-14546; Woodward J, Jenkinson E. Eur J Immunol 2001 Nov;31(11):3329-38; George AA et al., Blood 2001 Jun 15;97(12):3925-30);

colony-forming unit; HS MSC progenitor (Frimberger AE et al., Exp Hematol 2001 May;29(5):643-52);

Muc-18 (CD146); bone marrow fibroblasts (Filshie RJ et al., Leukemia 1998 Mar;12(3):414-21);

collagen type II and IV, chondrocyte expressed protein-68; chondrocyte (Carlberg AL et al., Differentiation 2001 Jun;67(4-5):128-38; Steck E et al., Biochem J 2001 Jan 15;353(Pt 2):169-74);

adipocyte lipid-binding protein (ALBP) or fatty acid transporter; adipocyte (Amri, E. Z et al., (1995) J. Biol. Chem. 270, 2367-2371; Claire Bastie et al., J Biol Chem 1999 Jul 30;274(31):21920-5; Frohnert, B. I et al., J. Biol. Chem. 274, 3970-3977; Lydia Teboul et al., Biochem. J. (2001) 360, 305-312);

CD133; neural stem cell (Uchida N et al., PNAS 2000 Dec 19;97(26):14720-5);

GFAP; astrocyte (Chengkai Dai et al., Genes Dev 2001 Aug 1;15(15):1913-25); and

microtubule-associated protein-2; neuron (Roy NS et al., Nat Med 2000 Mar;6(3):271-7).

It is also possible to screen mammalian cells for other properties, such as anti-tumorigenesis, altered apoptosis, and anti-viral phenotypes. For example, by selecting for cells that are resistant to viral infection or virus production, it is possible to identify artificial chimeric proteins that can be used as anti-viral agents.

Similarly, changes in cell signaling pathways can be detected by the use of probes correlated with activity or inactivity of the pathway or by observable indications correlated with activity or inactivity of the pathway.

Some screens relate to production of a compound of interest. The compound can be, e.g., a metabolite or a secreted protein, e.g., a post-translationally modified protein. For example, cells can be identified that produce increased amounts of the compound. Cells of interest can be identified by a variety of means, including the use of a responder cell, microarrays, chemical detection assays, and immunoassays.

As seen above, numerous phenotypes of a cell can be observed. Such phenotypes include visual behavior, cellular and physiological properties (e.g., adherence, ruffling, microtubule turnover, actin turnover, apoptosis, mitosis, and so forth).

In some cases, it is useful to determine if a gene or gene product is directly or indirectly regulated by a ZFP. Two possible methods are as follows:

(a) Time-Course Analysis

The targets of a chimeric ZFP can be identified by characterizing changes in gene expression with respect to time after a cell is exposed to the chimeric ZFP. For example, a gene encoding the chimeric ZFP can be attached to an inducible promoter. An exemplary inducible promoter is regulated by a small molecule such as doxycycline. The gene encoding the chimeric ZFP is introduced into cells. mRNA samples are obtained from cells at various times after induction of the inducible promoter. See, e.g., FIG. 5, which depicts genes that are activated and repressed in the course of induction of the ZFP F104-p65.
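A time-course data set of this kind can be summarized per gene, for example by asking which genes stay above a fold-change threshold at every time point and when each gene first responds. The sketch below assumes fold changes (induced over uninduced) have already been computed; the two-fold threshold, the time points, the gene labels, and the values are hypothetical. Treating early responders as candidates for direct targets is an interpretive assumption, not a rule stated in the text.

```python
def consistently_activated(time_course, min_fold=2.0):
    """Genes whose fold change is at least min_fold at every time point.

    time_course maps gene -> list of fold changes ordered by time.
    """
    return sorted(gene for gene, folds in time_course.items()
                  if all(f >= min_fold for f in folds))


def first_response_time(folds, times, min_fold=2.0):
    """Earliest time at which a gene is activated or repressed min_fold or more."""
    for t, f in zip(times, folds):
        if f >= min_fold or f <= 1.0 / min_fold:
            return t
    return None  # the gene never crosses the threshold


# Hypothetical hours after induction and fold changes for three genes.
hours = [6, 12, 24, 48]
course = {
    "gene1": [2.5, 3.0, 4.1, 3.8],  # up at every time point
    "gene2": [1.0, 1.1, 2.4, 3.0],  # responds late
    "gene3": [1.0, 0.9, 1.1, 1.0],  # unaffected
}
```

With these values, only gene1 is returned by `consistently_activated(course)`, while gene2 first crosses the threshold at the 24 hour point.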

(b) Identification of primary target genes of ZFP-TFs from mammalian cells using protein transduction and cDNA microarray technologies.

It is also possible to introduce a chimeric protein into a cell by transduction. The protein is provided to the extracellular milieu and the cell transduces the protein into itself. Thus, the cell does not have to include a gene encoding the chimeric protein. This approach may obviate concerns about exogenous nucleic acid integration, propagation, and so forth. Levels of the protein can be precisely controlled. In one embodiment, the chimeric ZFP is fused to the protein transduction domain of Tat or VP22.

To analyze the effects of a transduced chimeric protein in cultured cells, the chimeric protein is added to the culture media, e.g., as a fusion to a protein transduction domain. Detection of regulated target genes can be enhanced by addition of an inhibitor of protein synthesis such as cycloheximide. Thus, translation of the primary target genes is blocked, and genes that would be regulated by proteins encoded by the primary target genes would not be detected. The identity of primary target genes can be found by DNA microarray analysis.

Still another method includes a computer-based analysis of the regulatory regions of co-regulated genes to identify

Further, it is possible to determine if a gene or gene product that is coregulated by an artificial ZFP can independently mediate a particular cellular phenotype. The candidate target can be independently over-expressed or inhibited (e.g., by genetic deletion or RNA interference). In addition, it may be possible to apply this analysis to multiple candidate targets since in at least some cases more than one candidate may need to be perturbed to cause the phenotype. An example of this approach is provided in Example 3 (Ketoconazole resistance).

Library Construction: 1. Exemplary Structural Domains

The nucleic acid library is constructed so that it includes nucleic acids that each encodes and can express an artificial protein that is a chimera of one or more structural domains. In some aspects, the structural domains are nucleic acid binding domains that vary in specificity such that the library encodes a population of proteins with different binding specificities. A variety of structural domains are known to bind nucleic acids with high affinity and high specificity. For reviews of structural motifs which recognize double stranded DNA, see, e.g., Pabo and Sauer (1992) Annu. Rev. Biochem. 61:1053-95; Patikoglou and Burley (1997) Annu. Rev. Biophys. Biomol. Struct. 26:289-325; Nelson (1995) Curr Opin Genet Dev. 5:180-9. A few non-limiting examples of nucleic acid binding domains include: zinc finger domains, homeodomains, and helix-loop-helix domains.

Zinc fingers. Zinc fingers are small polypeptide domains of approximately 30 amino acid residues in which there are four amino acids, either cysteine or histidine, appropriately spaced such that they can coordinate a zinc ion (for reviews, see, e.g., Klug and Rhodes, (1987) Trends Biochem. Sci. 12:464-469; Evans and Hollenberg, (1988) Cell 52:1-3; Payre and Vincent, (1988) FEBS Lett. 234:245-250; Miller et al., (1985) EMBO J. 4:1609-1614; Berg, (1988) Proc. Natl. Acad. Sci. U.S.A. 85:99-102; Rosenfeld and Margalit, (1993) J. Biomol. Struct. Dyn. 11:557-570). Hence, zinc finger domains can be categorized according to the identity of the residues that coordinate the zinc ion, e.g., as the Cys₂-His₂ class, the Cys₂-Cys₂ class, the Cys₂-CysHis class, and so forth. The zinc coordinating residues of Cys₂-His₂ zinc fingers are typically spaced as follows: X_a-X-C-X₂₋₅-C-X₃-X_a-X₅-ψ-X₂-H-X₃₋₅-H, where ψ (psi) is a hydrophobic residue (Wolfe et al., (1999) Annu. Rev. Biophys. Biomol. Struct. 3:183-212), wherein "X" represents any amino acid, wherein X_a is phenylalanine or tyrosine, the subscript indicates the number of amino acids, and a subscript with two hyphenated numbers indicates a typical range of intervening amino acids. Typically, the intervening amino acids fold to form an anti-parallel β-sheet that packs against an α-helix, although the anti-parallel β-sheets can be short, non-ideal, or non-existent. The fold positions the zinc-coordinating side chains so they are in a tetrahedral conformation appropriate for coordinating the zinc ion. The base contacting residues are at the N-terminus of the finger and in the preceding loop region.

For convenience, the primary DNA contacting residues of a zinc finger domain are numbered -1, 2, 3, and 6, based on the following example:

X_a-X-C-X₂₋₅-C-X₃-X_a-X-C-X-S-N-X_b-X-R-H-X₃₋₅-H,

in which the seven residues C-X-S-N-X_b-X-R occupy positions -1, 1, 2, 3, 4, 5, and 6, respectively, where X_a is typically phenylalanine or tyrosine (but may be another amino acid in some cases), and X_b is typically a hydrophobic residue. As noted in the example above, the DNA contacting residues are Cys (C), Ser (S), Asn (N), and Arg (R). The above motif can be abbreviated CSNR. As used herein, such abbreviation refers to a particular polypeptide sequence. Where two sequences have the same motif, a number may be used to indicate the sequence. In certain contexts where made explicitly apparent, the four letter abbreviation refers to the motif in general.
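The four-letter abbreviation can be derived mechanically from the seven residues at positions -1 through 6. The sketch below assumes the caller has already located that seven-residue segment; the function name and the example segments are illustrative only.

```python
# Positions -1, 1, 2, 3, 4, 5, 6 correspond to string indices 0..6.
CONTACT_INDICES = (0, 2, 3, 6)  # positions -1, 2, 3, and 6

def contact_abbreviation(segment):
    """Return the four-letter motif for a seven-residue segment that spans
    positions -1 to 6 of a zinc finger domain."""
    if len(segment) != 7:
        raise ValueError("expected exactly seven residues (positions -1 to 6)")
    return "".join(segment[i] for i in CONTACT_INDICES)

# The example motif from the text, with X standing for any amino acid.
abbreviation = contact_abbreviation("CXSNXXR")
```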

A zinc finger protein typically consists of an array of three or more zinc finger domains.

The zinc finger domain (or "ZFD") is one of the most common eukaryotic DNA-binding motifs, found in species from yeast to higher plants and to humans. By one estimate, there are at least several thousand zinc finger domains in the human genome alone, possibly at least 4,500. Zinc finger domains can be isolated from zinc finger proteins. Non-limiting examples of zinc finger proteins include CF2-II, Kruppel, WT1, basonuclin, BCL-6/LAZ-3, erythroid Kruppel-like transcription factor, Sp1, Sp2, Sp3, Sp4, transcriptional repressor YY1, EGR1/Krox24, EGR2/Krox20, EGR3/Pilot, EGR4/AT133, Evi-1, GLI1, GLI2, GLI3, HIV-EP1/ZNF40, HIV-EP2, KR1, ZfX, ZfY, and ZNF7.

Computational methods described below can be used to identify all zinc finger domains encoded in a sequenced genome or in a nucleic acid database. Any such zinc finger domain can be utilized. In addition, artificial zinc finger domains have been designed, e.g., using computational methods (e.g., Dahiyat and Mayo, (1997) Science 278:82-7). It is also noteworthy that at least some zinc finger domains bind to ligands other than DNA, e.g., RNA or protein. Thus, a chimera of zinc finger domains or of a zinc finger domain and another type of domain can be used to recognize a variety of target compounds, not just DNA.

U.S. Patent Application Serial No. 60/374,355, titled "Zinc Finger Domain Libraries," and filed April 22, 2002 and U.S. Published Application 2002-0061512 A1 and other patent applications described herein describe exemplary zinc finger domains which can be used to construct an artificial zinc finger protein.

Library Construction: 2. Identification of Structural Domains

A variety of methods can be used to identify structural domains. Nucleic acids encoding identified domains are used to construct the nucleic acid library. Further, nucleic acid encoding these domains can also be varied (e.g., mutated) to provide additional domains that are encoded by the library.

Computational Methods. To identify additional naturally-occurring structural domains, the amino acid sequence of a known structural domain can be compared to a database of known sequences, e.g., an annotated database of protein or nucleic acid sequences. In another implementation, databases of uncharacterized sequences, e.g., unannotated genomic, EST or full-length cDNA sequence; of characterized sequences, e.g., SwissProt or PDB; and of domains, e.g., Pfam, ProDom (Corpet et al. (2000) Nucleic Acids Res. 28:267-269), and SMART (Simple Modular Architecture Research Tool, Letunic et al. (2002) Nucleic Acids Res 30, 242-244) can provide a source of structural domain sequences.

Nucleic acid sequence databases can be translated in all six reading frames for the purpose of comparison to a query amino acid sequence. Nucleic acid sequences that are flagged as encoding candidate nucleic acid binding domains can be amplified from an appropriate nucleic acid source, e.g., genomic DNA or cellular RNA. Such nucleic acid sequences can be cloned into an expression vector. The procedures for computer-based domain identification can be interfaced with an oligonucleotide synthesizer and robotic systems to produce nucleic acids encoding the domains in a high-throughput platform. Cloned nucleic acids encoding the candidate domains can also be stored in a host expression vector and shuttled easily into an expression vector, e.g., into a translational fusion vector with other domains (of a similar or different type), either by restriction enzyme mediated subcloning or by site-specific, recombinase mediated subcloning (see U.S. Patent No. 5,888,732). The high-throughput platform can be used to generate multiple microtitre plates containing nucleic acids encoding different candidate chimeras.
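Six-frame translation can be sketched in a few lines of Python. The snippet below uses the standard genetic code, translates stop codons as "*", and marks any codon containing an ambiguous base as "X". The function names and the short test sequence are illustrative and are not part of the described platform.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Return {'+1', '+2', '+3', '-1', '-2', '-3'} mapped to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("ATGTGCCATTAA")
```

Each translated frame can then be compared to the query amino acid sequence or scanned with a domain pattern.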

Detailed methods for the identification of domains from a starting sequence or a profile are well known in the art. See, for example, Prosite (Hofmann et al., (1999) Nucleic Acids Res. 27:215-219), FASTA, BLAST (Altschul et al., (1990) J. Mol. Biol. 215:403-10), etc. A simple string search can be done to find amino acid sequences with identity to a query sequence or a query profile, e.g., using Perl to scan text files. Sequences so identified can be about 30%, 40%, 50%, 60%, 70%, 80%, 90%, or greater identical to an initial input sequence. Domains similar to a query domain can be identified from a public database, e.g., using the XBLAST programs (version 2.0) of Altschul et al., (1990) J. Mol. Biol. 215:403-10. For example, BLAST protein searches can be performed with the XBLAST parameters as follows: score = 50, wordlength = 3. Gaps can be introduced into the query or searched sequence as described in Altschul et al., (1997) Nucleic Acids Res. 25(17):3389-3402. Default parameters for XBLAST and Gapped BLAST programs are available at National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda MD.

The Prosite profiles PS00028 and PS50157 can be used to identify zinc finger domains. In a SWISSPROT release of 80,000 protein sequences, these profiles detected 3189 and 2316 zinc finger domains, respectively. Profiles can be constructed from a multiple sequence alignment of related proteins by a variety of different techniques. Gribskov and co-workers (Gribskov et al., (1990) Meth. Enzymol. 183:146-159) utilized a symbol comparison table to convert a multiple sequence alignment supplied with residue frequency distributions into weights for each position. See, for example, the PROSITE database and the work of Luethy et al., (1994) Protein Sci. 3:139-146.
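A simple pattern scan of the kind mentioned above can be sketched with a regular expression. The expression below is a transcription of the Cys₂-His₂ pattern as it is commonly quoted for PS00028, C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H. That pattern string is an assumption here and should be checked against the current PROSITE entry. Python is used in place of Perl, and the test sequence is illustrative.

```python
import re

# Assumed PROSITE-style C2H2 pattern rewritten as a regular expression.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

def find_c2h2_candidates(protein):
    """Return (start_index, matched_text) for non-overlapping pattern hits."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]

# Illustrative two-finger sequence used only to exercise the scan.
example_protein = (
    "YACPVESCDRRFSRSDELTRHIRIHTGQKP"
    "FQCRICMRNFSRSDHLTTHIRTHTGEKP"
)
hits = find_c2h2_candidates(example_protein)
```

A pattern scan of this kind is less sensitive than a profile or hidden Markov model search, so hits are best treated as candidates for further checking.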

Hidden Markov Models (HMM's) representing a DNA binding domain of interest can be generated or obtained from a database of such models, e.g., the Pfam database, release 2.1. A database can be searched, e.g., using the default parameters, with the HMM in order to find additional domains (see, e.g., Bateman et al. (2002) Nucleic Acids Research 30:276-280). Alternatively, the user can optimize the parameters. A threshold score can be selected to filter the database of sequences such that sequences that score above the threshold are displayed as candidate domains. A description of the Pfam database can be found in Sonnhammer et al., (1997) Proteins 28(3):405-420, and a detailed description of HMMs can be found, for example, in Gribskov et al., (1990) Meth. Enzymol. 183:146-159; Gribskov et al., (1987) Proc. Natl. Acad. Sci. USA 84:4355-4358; Krogh et al., (1994) J. Mol. Biol. 235:1501-1531; and Stultz et al., (1993) Protein Sci. 2:305-314.

The SMART database of HMM's (Simple Modular Architecture Research Tool, Schultz et al., (1998) Proc. Natl. Acad. Sci. USA 95:5857 and Schultz et al., (2000) Nucl. Acids Res. 28:231) provides a catalog of zinc finger domains (ZnF_C2H2; ZnF_C2C2; ZnF_C2HC; ZnF_C3H1; ZnF_C4; ZnF_CHCC; ZnF_GATA; and ZnF_NFX) identified by profiling with the hidden Markov models of the HMMer2 search program (Durbin et al., (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press).

Hybridization-based Methods. A collection of nucleic acids encoding various forms of a structural domain can be analyzed to profile sequences encoding conserved amino- and carboxy-terminal boundary sequences. Degenerate oligonucleotides can be designed to hybridize to sequences encoding such conserved boundary sequences. Moreover, the efficacy of such degenerate oligonucleotides can be estimated by comparing their composition to the frequency of possible annealing sites in known genomic sequences. If desired, multiple rounds of design can be used to optimize the degenerate oligonucleotides. Comparison of known Cys₂-His₂ zinc fingers, for example, revealed a common sequence in the linker region between adjacent fingers in natural sequence (Agata et al, (1998) Gene 213:55-64). Degenerate oligonucleotides that anneal to nucleic acid encoding the conserved linker region were used to amplify a plurality of zinc finger domains. The amplified nucleic acid encoding the domains can be used to construct nucleic acids that encode a chimeric array of zinc fingers.
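The estimate of how often a degenerate oligonucleotide could anneal can be approximated by counting exact matches to its IUPAC pattern in a known sequence. The sketch below scans only the strand it is given, ignores mismatches and melting temperature, and uses a made-up oligonucleotide and target, so it is a rough illustration rather than a primer design tool.

```python
import re
from math import prod

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(oligo):
    """Number of distinct sequences represented by a degenerate oligo."""
    return prod(len(IUPAC[base]) for base in oligo.upper())

def count_annealing_sites(oligo, target):
    """Count possibly overlapping exact matches on the given strand."""
    pattern = "".join(f"[{IUPAC[base]}]" for base in oligo.upper())
    # A lookahead lets overlapping sites be counted separately.
    return len(re.findall(f"(?=({pattern}))", target.upper()))

# Hypothetical oligo and target sequence.
site_count = count_annealing_sites("ACNGR", "ACTGAACCGGTTACAGA")
```

Comparing the site count for alternative designs against their degeneracy gives a crude basis for the rounds of optimization described above.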

Library Construction: 3. Nucleic Acids Encoding Structural Domains

Nucleic acids that are used to assemble the library can be obtained by a variety of methods. Some component nucleic acids of the library can encode naturally occurring domains. In addition, some component nucleic acids are variants that are obtained by mutation or other randomization methods. The component nucleic acids, typically encoding just a single domain, can be joined to each other to produce nucleic acids encoding a fusion of the different domains.

Isolation of a natural repertoire of domains. A library of domains can be constructed by isolation of nucleic acid sequences encoding domains from genomic DNA or cDNA of eukaryotic organisms such as humans. Multiple methods are available for doing this. For example, a computer search of available amino acid sequences can be used to identify the domains, as described above. A nucleic acid encoding each domain can be isolated and inserted into a vector appropriate for the expression in cells, e.g., a vector containing a promoter, an activation domain, and a selectable marker. In another example, degenerate oligonucleotides that hybridize to a conserved motif are used to amplify, e.g., by PCR, a large number of related domains containing the motif. For example, Kruppel-like Cys₂His₂ zinc fingers can be amplified by the method of Agata et al, (1998) Gene 213:55-64. This method also maintains the naturally occurring zinc finger domain linker peptide sequences, e.g., sequences with the pattern: Thr-Gly-(Glu/Gln)-(Lys/Arg)-Pro-(Tyr/Phe).
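The linker pattern given above translates directly into a regular expression, which can be used to locate linkers in a protein sequence and to cut a multi-finger array at those linkers. This is a sketch only: the function names and the three-finger test sequence are illustrative, and the trailing Tyr/Phe of each match is consumed as part of the linker because the pattern in the text includes it.

```python
import re

# Thr-Gly-(Glu/Gln)-(Lys/Arg)-Pro-(Tyr/Phe) in one-letter code.
LINKER = re.compile(r"TG[EQ][KR]P[YF]")

def find_linkers(protein):
    """Return (start_index, matched_text) for each linker occurrence."""
    return [(m.start(), m.group()) for m in LINKER.finditer(protein)]

def split_on_linkers(protein):
    """Split a protein sequence into the segments between linkers."""
    return LINKER.split(protein)

# Illustrative three-finger sequence used only to exercise the functions.
example_array = (
    "YACPVESCDRRFSRSDELTRHIRIHTGQKP"
    "FQCRICMRNFSRSDHLTTHIRTHTGEKP"
    "FACDICGRKFARSDERKRHTKIH"
)
linkers = find_linkers(example_array)
segments = split_on_linkers(example_array)
```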

Moreover, screening a collection limited to domains of interest, unlike screening a library of unselected genomic or cDNA sequences, significantly decreases library complexity and reduces the likelihood of missing a desirable sequence due to the inherent difficulty of completely screening large libraries.

The human genome contains numerous zinc finger domains, many of which are uncharacterized and unidentified. It is estimated that there are thousands of genes encoding proteins with zinc finger domains (Pellegrino and Berg, (1991) Proc. Natl. Acad. Sci. USA 88:671-675). These human zinc finger domains represent an extensive collection of diverse domains from which novel DNA-binding proteins can be constructed. Many human zinc finger domains are described in U.S. Patent Application Serial No. 60/374,355, titled "Zinc Finger Domain Libraries," filed April 22, 2002 and US Patent Application, titled "ZINC FINGER DOMAIN LIBRARIES," filed August 19, 2002, bearing Attorney Docket No. 12279-005001.

If each zinc finger domain recognizes a unique 3- to 4-bp sequence, the total number of domains required to bind every possible 3- to 4-bp sequence is only 64 to 256 (4³ to 4⁴). It is possible that the natural repertoire of the human genome contains a sufficient number of unique zinc finger domains to span all possible recognition sites. These zinc finger domains are a valuable resource for constructing artificial chimeric DNA-binding proteins. The library can include naturally occurring zinc finger domains, artificial mutants of such domains, and combinations thereof.

Mutated Domains. In one instance, the collection includes nucleic acids encoding at least one structural domain that is an artificial variant of a naturally-occurring sequence. In one embodiment, such variant domains are assembled from a degenerate patterned library. In the case of nucleic acid binding domains, positions in close proximity to the nucleic acid binding interface or adjacent to a position so located can be targeted for mutagenesis. A mutated test zinc finger domain, for example, can be constrained at any mutated position to a subset of possible amino acids by using a patterned degenerate library. Degenerate codon sets can be used to encode the profile at each position. For example, codon sets are available that encode only hydrophobic residues, aliphatic residues, or hydrophilic residues. The library can be selected for full-length clones that encode folded polypeptides. Cho et al. ((2000) J. Mol. Biol. 297(2):309-19) provides a method for producing such degenerate libraries using degenerate oligonucleotides, and also provides a method of selecting library nucleic acids that encode full-length polypeptides. Such nucleic acids can be easily inserted into an expression plasmid, e.g., using convenient restriction enzyme cleavage sites.

Selection of the appropriate codons and the relative proportions of each nucleotide at a given position can be determined by simple examination of a table representing the genetic code, or by computational algorithms. For example, Cho et al, supra, describe a computer program that accepts a desired profile of protein sequence and outputs a preferred oligonucleotide design that encodes the sequence.
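Examining the genetic code for a degenerate codon can be automated with a small lookup. The sketch below expands a degenerate codon written in IUPAC codes and reports the set of amino acids it encodes under the standard genetic code. It is not the program of Cho et al.; the function name and the example codons (NTN and NNK) are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def encoded_amino_acids(degenerate_codon):
    """Return the set of residues ('*' = stop) encoded by a degenerate codon."""
    choices = [IUPAC[base] for base in degenerate_codon.upper()]
    return {CODON_TABLE["".join(codon)] for codon in product(*choices)}

hydrophobic_set = encoded_amino_acids("NTN")  # T fixed at the second position
broad_set = encoded_amino_acids("NNK")        # third position limited to G or T
```

In this example NTN restricts a position to Phe, Leu, Ile, Met, and Val, whereas NNK keeps all twenty amino acids while leaving only one stop codon (TAG).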

Library Construction: 4. A Library of Chimeric Zinc Finger Proteins

A library of nucleic acids encoding diverse chimeric zinc finger proteins can be formed by serial ligation, e.g., as described in Example 1. The library can be constructed such that each nucleic acid encodes a protein that has at least three, four, or five zinc finger domains. In some implementations, particularly for large libraries, each zinc finger coding segment can be designed to randomly encode any one of a set of zinc finger domains. The set of zinc finger domains can be selected to represent domains with a range of specificities, e.g., covering 30, 40, 50 or more of the 64 possible 3-basepair subsites. The set can include at least about 12, 15, 20, 25, 30, 40 or 50 different zinc finger domains. Some or all of these domains can be domains isolated from naturally occurring proteins. Moreover, because there may be little or no need for more than one zinc finger domain for a given 3-basepair subsite, it may be possible to generate a library using a small number of component domains, e.g., less than 500, 200, 100, or even less than 64 total component domains.

One exemplary library includes nucleic acids that encode a chimeric zinc finger protein having three fingers and 30 possible domains at each finger position. In its fully represented form, this library includes 27,000 sequences (i.e., the result of 30³). The library can be constructed by serial ligation in which a nucleic acid from a pool of nucleic acids encoding all 30 possible domains is added at each step.

In one embodiment, the library can be stored as a random collection. In another embodiment, individual members can be isolated, stored at an addressable location (e.g., arrayed), and sequenced. After high throughput sequencing of 40 to 50 thousand constructed library members, missing chimeric combinations can be individually assembled in order to obtain complete coverage. Once arrayed, e.g., in microtitre plates, each individual member can be recovered later for further analysis, e.g., for a phenotypic screen. For example, equal amounts of each arrayed member can be pooled and then transformed into a cell. Cells with a desired phenotype are selected and characterized. In another example, each member is individually transformed into a cell, and the cell is characterized, e.g., using a nucleic acid microarray to determine if the transcription of endogenous genes is altered (see "Profiling Regulatory Properties of a Chimeric Zinc Finger Protein," below).
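The bookkeeping behind complete coverage can be sketched as follows. The snippet enumerates every finger combination for a library, simulates sequencing of randomly drawn clones, and lists the combinations that were never observed and would need to be assembled individually. The domain names, the uniform random sampling, and the fixed seed are assumptions made for illustration. For uniform sampling of n clones from N combinations, the expected unsampled fraction is roughly exp(-n/N).

```python
import random
from itertools import product

def full_library(domains, n_fingers):
    """All ordered combinations of domains across the finger positions."""
    return set(product(domains, repeat=n_fingers))

def missing_members(domains, n_fingers, sequenced_clones):
    """Combinations absent from the sequenced clones."""
    return full_library(domains, n_fingers) - set(sequenced_clones)

# Hypothetical names for 30 component domains in a three-finger library.
domains = [f"ZFD{i:02d}" for i in range(1, 31)]
library = full_library(domains, 3)  # 30**3 combinations

# Simulate sequencing 45,000 clones drawn uniformly at random.
rng = random.Random(0)
members = sorted(library)
clones = [rng.choice(members) for _ in range(45_000)]
missing = missing_members(domains, 3, clones)
print(len(library), len(missing))
```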

Small libraries, e.g., having about 6 to 200 or 50 to 2000 members, can be used to optimize the properties of an identified chimeric protein. In one embodiment, a first cycle of phenotypic selection identifies an initial chimeric protein that alters a property of a cell. A sublibrary is constructed that varies the domains encoded at a first position, but that retains the domains encoded at other positions. For example, for a three-fingered zinc finger protein, this procedure optimizes the zinc finger domain at the first position. The sublibrary is screened to identify members that encode proteins that have the desired phenotypic effect, e.g., to a greater extent than the initial chimeric protein. In another embodiment, a sublibrary is constructed that mutates a set of amino acid positions. For example, for a chimeric zinc finger protein, the set of amino acid positions may be positions in the vicinity of the DNA contacting residues, but not the DNA contacting residues themselves. In still another embodiment, the sublibrary varies each encoded domain in a chimeric protein, but to a more limited extent than the initial library. For a chimeric zinc finger protein, the nucleic acids that encode a particular domain can be varied among other zinc finger domains whose recognition specificity is known to be similar to that of the domain present in the initial chimeric protein.

Sublibraries can be synthesized by serial ligation, by pooling specific library members from a prefabricated and arrayed large library, by mutagenesis, by recombination (e.g., sexual PCR and "DNA Shuffling™" (Maxygen, Inc., CA)), or by combinations of these methods.

In one example, it was found that various phenotypes of Saccharomyces cerevisiae are altered by regulating gene expression using zinc finger protein (ZFP) expression libraries. ZFPs in our libraries consist of 3 or 4 zinc finger domains (ZFDs) and recognize 9 to 12-bp DNA sequences. The chimeric zinc finger protein is identified without a priori knowledge of the target genes that it regulates. Three different classes of transcription factors are produced from the ZFPs in the libraries: isolated ZFPs themselves function as efficient transcriptional repressors when they bind to a site near the promoter region; ZFPs are also expressed as fusion proteins to a transcriptional activation domain or to a repression domain to yield transcriptional activators or repressors, respectively. We used 40 ZFDs as modular building blocks to construct 3-finger or 4-finger ZFPs. Thus, in its fully represented form, a 3-finger ZFP library consists of 64,000 (= 40³) sequences and a 4-finger library consists of about 2.6 million (= 40⁴) sequences.

Additional Features for Chimeric Transcription Factors

With respect to a library encoding chimeric nucleic acid binding domains, the encoded polypeptides can also include one or more of the following features. These features may be constant among all members of the library or may also vary. In one example, some nucleic acids encode polypeptides that include an activation domain, whereas others include a repression domain, or no transcriptional regulatory domain.

Activation domains. Transcriptional activation domains that may be used in the present invention include but are not limited to the Gal4 activation domain from yeast and the VP16 domain from herpes simplex virus. The ability of a domain to activate transcription can be validated by fusing the domain to a known DNA binding domain and then determining if a reporter gene operably linked to sites recognized by the known DNA-binding domain is activated by the fusion protein.

An exemplary activation domain is the following domain from p65:

YLPDTDD HRIEEKRKRTYETFKSIMKKSPFSGPTDPRPPP RIAVPSRSSASVPKPAPQPY PFTSSLSTINYDEFPTMVFPSGQISQASALAPAPPQVLPQAPAPAPAPAMVSALAQAPAPVPVLAPGP PQAVAPPAPKPTQAGEGTLSEALLQLQFDDEDLGALLGNSTDPAVFTD ASVDNSEFQQLLNQGIPVA PHTTEP L EYPEAITRLVTAQRPPDPAPAPLGAPGLPNGLLSGDEDFSSIAD DFSALLSQ (SEQ ID NO:5)

The sequence of the Gal4 activation domain is as follows: NFNQSGNIADSSLSFTFTNSSNGPNLITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPL SPG TDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDTPPNPKKEISMAYPYDVPDYAS (SEQ ID NO: 6)

In bacteria, activation domain function can be emulated by fusing a domain that can recruit a wild-type RNA polymerase alpha subunit C-terminal domain or a mutant alpha subunit C-terminal domain, e.g., a C-terminal domain fused to a protein interaction domain.

Repression domains. If desired, a repression domain instead of an activation domain can be fused to the DNA binding domain. Examples of eukaryotic repression domains include ORANGE, groucho, and WRPW (Dawson et al., (1995) Mol. Cell Biol. 15:6923-31). The ability of a domain to repress transcription can be validated by fusing the domain to a known DNA binding domain and then determining if a reporter gene operably linked to sites recognized by the known DNA-binding domain is repressed by the fusion protein. Still other chimeric transcription factors include neither an activation domain nor a repression domain. Rather, such transcription factors may displace or otherwise compete with a bound endogenous transcription factor (e.g., an activator or repressor). An exemplary repression domain is the following domain from UME6 protein:

NSAS S STKLDDDLGTAAAVLSNMRS S P YRTHDKP I SNVNDMNNTNALGVPASRPHS SS FPS KGVLRP I LLRIHNSEQQPIFESNNSTACI (SEQ ID NO:7)

Another exemplary repression domain is from the Kid protein: VSVTFEDVAVLFTRDE KKLDLSQRSLYREVMLENYSNLASMAGFLFTKPKVISLLQQGEDPW (SEQ ID NO:8)

Peptide Linkers. DNA binding domains can be connected by a variety of linkers. The utility and design of linkers are well known in the art. A particularly useful linker is a peptide linker that is encoded by nucleic acid. Thus, one can construct a synthetic gene that encodes a first DNA binding domain, the peptide linker, and a second DNA binding domain. This design can be repeated in order to construct large, synthetic, multi-domain DNA binding proteins. PCT WO 99/45132 and Kim and Pabo ((1998) Proc. Natl. Acad. Sci. USA 95:2812-7) describe the design of peptide linkers suitable for joining zinc finger domains.

Additional peptide linkers are available that form random coil, α-helical or β-pleated tertiary structures. Polypeptides that form suitable flexible linkers are well known in the art (see, e.g., Robinson and Sauer (1998) Proc. Natl. Acad. Sci. USA 95:5929-34). Flexible linkers typically include glycine, because this amino acid, which lacks a side chain, is unique in its rotational freedom. Serine or threonine can be interspersed in the linker to increase hydrophilicity. In addition, amino acids capable of interacting with the phosphate backbone of DNA can be utilized in order to increase binding affinity. Judicious use of such amino acids allows for balancing increases in affinity with loss of sequence specificity. If a rigid extension is desirable as a linker, α-helical linkers, such as the helical linker described in Pantoliano et al. (1991) Biochem. 30:10117-10125, can be used. Linkers can also be designed by computer modeling (see, e.g., U.S. Pat. No. 4,946,778). Software for molecular modeling is commercially available (e.g., from Molecular Simulations, Inc., San Diego, CA). The linker is optionally optimized, e.g., to reduce antigenicity and/or to increase stability, using standard mutagenesis techniques and appropriate biophysical tests as practiced in the art of protein engineering, and functional assays as described herein.

For implementations utilizing zinc finger domains, the peptide that occurs naturally between zinc fingers can be used as a linker to join fingers together. A typical such naturally occurring linker is: Thr-Gly-(Glu or Gln)-(Lys or Arg)-Pro-(Tyr or Phe) (Agata et al., supra).

Dimerization Domains. An alternative method of linking DNA binding domains is the use of dimerization domains, especially heterodimerization domains (see, e.g., Pomerantz et al. (1998) Biochemistry 37:965-970). In this implementation, DNA binding domains are present in separate polypeptide chains. For example, a first polypeptide encodes DNA binding domain A, linker, and domain B, while a second polypeptide encodes domain C, linker, and domain D. An artisan can select a dimerization domain from the many well-characterized dimerization domains. Domains that favor heterodimerization can be used if homodimers are not desired. A particularly adaptable dimerization domain is the coiled-coil motif, e.g., a dimeric parallel or anti-parallel coiled-coil. Coiled-coil sequences that preferentially form heterodimers are also available (Lumb and Kim, (1995) Biochemistry 34:8642-8648). Another species of dimerization domain is one in which dimerization is triggered by a small molecule or by a signaling event. For example, a dimeric form of FK506 can be used to dimerize two FK506 binding protein (FKBP) domains. Such dimerization domains can be utilized to provide additional levels of regulation.

Computer-Based Analysis

Computer-based analysis can be used to store profile information (e.g., gene expression profiles or protein profiles) and to analyze the profiles, e.g., for co-regulated genes. An example of a programmable system suitable for implementing such methods can include a processor, a random access memory (RAM), a program memory (for example, a writable read-only memory (ROM) such as a flash ROM), a hard drive controller, and an input/output (I/O) controller coupled by a processor (CPU) bus. The system can be preprogrammed, in ROM, for example, or it can be programmed (and reprogrammed) by loading a program from another source (for example, from a floppy disk, a CD-ROM, or another computer). The hard drive controller can be coupled to a hard disk suitable for storing executable computer programs and/or encoded video data. The I/O controller can be coupled to an I/O interface. The I/O interface receives and transmits data in analog or digital form over a communication link such as a serial link, local area network, wireless link, or parallel link.

Programs may be implemented in a high-level procedural or object oriented programming language to communicate with a machine system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Each such program may be stored on a storage medium or device, e.g., compact disc read only memory (CD-ROM), hard disk, magnetic diskette, or similar medium or device, that is readable by a general or special purpose programmable machine for configuring and operating the machine when the storage medium or device is read by the computer to perform the procedures described in this document. The system may also be implemented as a machine-readable storage medium, configured with a program, where the storage medium so configured causes a machine to operate in a specific and predefined manner. For example, a program can be used to cluster genes that are co-regulated by a chimeric transcription factor. In one embodiment, the program uses a shortest path analysis to identify a group of regulated genes.
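One possible reading of a shortest path analysis is sketched below; it is an illustration under stated assumptions, not the specific program of any embodiment. Genes are nodes, two genes are joined when the absolute Pearson correlation of their expression across cells expressing different transcription factors meets a threshold, the edge length is one minus that correlation, and Dijkstra's algorithm finds the shortest chain linking two genes. The gene names, the expression values, and the 0.7 threshold are hypothetical.

```python
import heapq
from math import inf, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def build_graph(expression, min_abs_corr=0.7):
    """Connect genes whose |correlation| meets the threshold."""
    genes = list(expression)
    graph = {gene: {} for gene in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = abs(pearson(expression[a], expression[b]))
            if r >= min_abs_corr:
                graph[a][b] = graph[b][a] = 1.0 - r
    return graph

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of genes or None if unlinked."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, inf):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, inf):
                dist[neighbor], prev[neighbor] = nd, node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical log ratios for four genes in four cell lines, each cell line
# expressing a different chimeric transcription factor.
expression = {
    "geneA": [1, -1, 1, -1],
    "geneB": [2, 0, 0, -2],
    "geneC": [1, 1, -1, -1],
    "geneD": [0, 1, 1, 0],
}
graph = build_graph(expression)
path = shortest_path(graph, "geneA", "geneC")
```

Here geneA and geneC are uncorrelated with each other but are both linked to geneB, so the path passes through geneB, which is the kind of transitive relationship that pairwise clustering alone would miss.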

The following patent applications, WO 01/60970 (Kim et al.); USSN 60/408,862, filed September 5, 2002; USSN 60/338,441, filed December 7, 2001; USSN 60/313,402, filed August 17, 2001; USSN 60/374,355, filed April 22, 2002; USSN 60/376,053, filed April 26, 2002; U.S. Provisional Patent Application, titled "Phenotypic Screen Of Chimeric Proteins," filed August 2, 2002 and bearing Attorney Docket No. 12279-007P03; U.S. Provisional Patent Application, titled "Phenotypic Screen Of Chimeric Proteins," filed August 5, 2002 and bearing Attorney Docket No. 12279-007P04; and US Patent Application, titled "ZINC FINGER DOMAIN LIBRARIES," filed August 19, 2002, bearing Attorney Docket No. 12279-005001 are hereby expressly incorporated by reference in their entirety for all purposes. Likewise, all cited patents, patent applications (including U.S. Provisional Patent Applications 60/408,862, filed September 5, 2002 and 60/453,111, filed March 7, 2003), and references are hereby expressly incorporated by reference in their entirety. The present invention will be described in more detail through the following examples. However, it should be noted that these examples are not intended to limit the scope of the present invention.

EXAMPLE 1: CONSTRUCTION OF ZFP LIBRARIES

(1) Yeast strains and plasmids

The S. cerevisiae strain used for this experiment was YPH499a (MATa, ade2-101, ura3-52, lys2-801, trp1-Δ63, his3-Δ200, leu2-Δ1, GAL+). Transformation of yeast cells was carried out by using the lithium acetate transformation method (see, e.g., Gietz et al., 1992).

The vector, plasmid p3, was utilized for constructing libraries of zinc finger proteins. p3 was constructed by modification of the pcDNA3 vector (Invitrogen, San Diego CA). The pcDNA3 vector was digested with HindIII and XhoI. A synthetic oligonucleotide duplex having compatible overhangs was ligated into the digested pcDNA3. The duplex contains nucleic acid that encodes the hemagglutinin (HA) tag and a nuclear localization signal. The duplex also includes: restriction sites for BamHI, EcoRI, NotI, and BglII; and a stop codon. Further, the XmaI site in the SV40 origin of the resulting vector was destroyed by digestion with XmaI, filling in the overhanging ends of the digested XmaI restriction site, and religation of the ends.

To express the chimeric zinc finger proteins of the library conditionally in yeast, the library constructed in the p3 vector was transferred to pYCT-Lib. pYCT-Lib is a yeast shuttle vector that includes the inducible GAL1 promoter (FIG. 9). pYCT-Lib was constructed as follows. The yeast expression plasmid pYESTrp2 (Invitrogen, San Diego CA) was digested with NgoM4 and then partially digested with PstI to remove the 2μ ori fragment from the vector. The 5.0-kb DNA fragment from the NgoM4-PstI digested vector was purified after gel electrophoresis and ligated with a CEN-ARS fragment that was amplified from pRS313.

The resulting plasmid was further modified. The B42 activation domain was removed by digestion of the resulting plasmid with NcoI and BamHI. Then, a DNA segment encoding the V5 epitope tag and the nuclear localization signal was PCR-amplified from pYESTrp2 and ligated into the NcoI and BamHI sites. The resulting plasmid was named pYTC-Lib (FIG. 1).

To generate pYCT-Lib-Gal4, the Gal4 activation domain was PCR-amplified from yeast genomic DNA and inserted between the NotI and SphI recognition sites of pYTC-Lib to generate pYTC-Lib-Gal4.

To generate pYCT-Lib-Ume6, the 87 amino acid region (residues 508-594) of S. cerevisiae Ume6 was amplified from yeast genomic DNA and inserted between the NotI and SphI recognition sites of pYTC-Lib. This 87 amino acid region functions as a transcriptional repression domain (see, e.g., Kadosh et al. (1997) Cell 89:365-371).

Yeast cells transformed with a library in pYCT-Lib or its derivatives (pYCT-Lib-Gal4 or pYCT-Lib-Ume6) are grown in synthetic minimal media lacking tryptophan, as the vector contains the yeast TRP1 gene as a marker.

(2) Library construction

A three-fingered protein library (the "3-F library"), encoding zinc finger proteins that have an array of three ZFDs, was constructed from nucleic acids encoding 40 different, individual ZFDs or "fingers." A four-fingered protein library (the "4-F library") was constructed from nucleic acids encoding 27 different, individual ZFDs.

FIG. 11 depicts one method of constructing a diverse three finger library. Nucleic acid encoding each ZFD was cloned into the p3 vector to form "single fingered" vectors.

Equal amounts of each "single fingered" vector were combined to form a pool. The pool was separately digested with two sets of enzymes: AgeI and XhoI, and XmaI and XhoI. After phosphatase treatment for 30 minutes, the digested vector nucleic acids from the AgeI and XhoI digested pool were ligated to the nucleic acid segments released from the vector by the XmaI and XhoI digestion. These segments each encode a single zinc finger domain. The ligation of the digested vector to the nucleic acid segments forms vectors that now encode two zinc finger domains. After transformation into E. coli, approximately 1.4 x 10⁴ independent transformants were obtained, thereby forming a two-fingered library. The size of the insert region of the two-fingered library was verified by PCR analysis of 40 colonies. The correct size insert was present in 95% of the library.

Subsequently, this 2-fingered library was digested with AgeI and XhoI. The digested vector, which retains nucleic acid sequences encoding two zinc finger domains, was ligated to the pool of nucleic acid segments encoding one finger (prepared as described above by digestion with XmaI and XhoI). The products of this ligation were transformed into E. coli and yielded about 2.4 x 10⁵ independent transformants. Verification of the insert region indicated that library members were predominantly correctly constructed, i.e., they each encoded three zinc finger domains.

The two-fingered fragments prepared above by digestion with AgeI and XhoI were ligated to the pool of two-fingered fragments digested with XmaI and XhoI to produce the four zinc finger library. The products of this ligation were transformed into E. coli and yielded about 7 x 10⁶ independent transformants.
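By way of illustration only, the transformant counts reported above can be compared with the number of distinct finger combinations that each library could in principle encode. The following minimal sketch assumes that every zinc finger domain can occupy every position and that all combinations are equally represented (a Poisson sampling model); neither assumption is stated in this document, and the function names are hypothetical.

```python
import math

def library_complexity(n_domains: int, n_fingers: int) -> int:
    """Number of distinct ordered arrays of n_fingers drawn from n_domains."""
    return n_domains ** n_fingers

def expected_coverage(n_transformants: float, complexity: int) -> float:
    """Probability that a given combination is present at least once,
    assuming equal representation (Poisson approximation)."""
    return 1.0 - math.exp(-n_transformants / complexity)

# Domain counts and transformant numbers are those reported in the text.
three_f = library_complexity(40, 3)   # 3-F library built from 40 ZFDs
four_f = library_complexity(27, 4)    # 4-F library built from 27 ZFDs

coverage_3f = expected_coverage(2.4e5, three_f)
coverage_4f = expected_coverage(7e6, four_f)

print(f"3-F: {three_f} combinations, coverage {coverage_3f:.3f}")
print(f"4-F: {four_f} combinations, coverage {coverage_4f:.6f}")
```

Under these assumptions both libraries contain several-fold more independent transformants than possible combinations, so most combinations would be expected to be represented.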

3-fingered (3-F) and 4-fingered (4-F) libraries were subcloned into the EcoRI-NotI sites of pYTC-, pYTC-Gal4 and pYTC-Ume6, respectively. These cloning steps produced six different libraries, encoding, variously, three and four fingered ZFPs with and without transcriptional regulatory domains. After amplification in E. coli, each library was transformed into the yeast strain YPH499a using lithium acetate. The transformation yielded approximately 1.5 x 10⁷ colonies. The size of the insert region of the library was verified by PCR analysis of 50 colonies. The correct size insert was present in 95% of the library. The transformants were resuspended in TE buffer and stored as glycerol stock at -80°C for further analysis or screening.

EXAMPLE 2: ANTI-FUNGAL DRUG RESISTANCE

Ketoconazole is an orally absorbed antimycotic imidazole drug. It can be administered for the treatment of certain mycoses. The drug blocks the biosynthesis of ergosterol in yeasts and other fungi (Burden et al. (1989) Phytochemistry 28:1791-1804), and has additional effects on cellular metabolism which are not characterized in detail (Kelly et al., 1992, In Fernandes, P.B. (Ed.) New Approaches for Antifungal Drugs. Birkhauser, Boston, pp. 155-187).

To check the fungistatic response of the tester strain YPH499a to ketoconazole, 10⁷ cells of the YPH499a strain were plated onto synthetic medium containing different concentrations of the drug. A ketoconazole concentration of 35 μM was found to inhibit growth of YPH499a cells at the cell density used. This concentration was subsequently used for the screening of ketoconazole-resistant yeast colonies.

1 x 10 yeast cells containing plasmids from the 3-finger or 4-finger libraries were cultivated in SD synthetic liquid medium with 2% galactose for 3 hrs at 30°C to induce zinc finger protein expression and then were plated onto SD galactose agar plates containing 35 μM ketoconazole (ICN Biomedicals). After four days of incubation at 30°C, about 120 clones formed colonies on galactose media containing ketoconazole at 35 μM. These ketoconazole-resistant yeast colonies were picked and spread out on fresh SD agar plates containing 35 μM ketoconazole.

23 clones were randomly selected from the 120 isolated clones. For each of these 23 clones, the resistant phenotype was verified by plasmid rescue. The plasmids were isolated, transformed into E. coli for amplification, and retransformed into yeast strain YPH499a (Ausubel et al., 1995). Equal numbers of retransformants were spotted onto SD galactose agar plates with or without 35 μM ketoconazole. The retransformants were also spotted onto SD glucose agar plates with or without ketoconazole to verify that the drug resistance was induced by galactose-inducible expression of the zinc finger protein. The plasmids isolated from ketoconazole-resistant transformants were sequenced and their expected target sequences in the yeast genome were predicted. Transformants with the pYTC vector and transformants with a plasmid encoding a randomly picked, irrelevant 3-fingered protein were used as controls for the procedure.

Plasmids were purified from 23 randomly chosen resistant clones and retransformed into YPH499a. Retransformed cells showed resistance to ketoconazole as compared to control wild-type cells carrying the pYTC-Lib plasmid or a plasmid encoding an irrelevant ZFP that has low affinity to DNA. When plated on galactose media, up to 5% of cells transformed with plasmids isolated from ketoconazole-resistant clones survived to form colonies. In contrast, cells transformed with control plasmids did not grow to form colonies. However, no difference was observed when these cells were plated on glucose media. Thus, the drug resistance of transformed cells was observed only when the cells were plated on galactose media but not on glucose media. These results show that the zinc finger proteins encoded in the plasmid and expressed under the control of the GAL1 promoter are responsible for the phenotypic change.

Twenty-three plasmids from drug resistant transformants were sequenced and eleven unique clones were identified (Table 1). The DNA sequencing results showed that some characterized chimeric zinc finger proteins were independently isolated a number of times. Further, some of the characterized chimeric zinc finger proteins had related features. For example, fingers I and II of K2, K3, and K4 are identical to one another.

Table 1: ZFPs encoded in plasmids isolated from ketoconazole-resistant transformants

The amino acid sequences of these proteins are listed below (zinc finger domains are underscored and transcriptional regulatory domains are bolded):

K1: QSHV-QFNR-RSHR-Ume6

KP YKCHQCGIC&FJQgF.^^

DLGTAAA XSNI-RSSPYRTHDKPISNV^ EQQPIFESNNSTACI (SEQ ID NO:9)

K2: RSNR-RSNR-QSSR1-QSHT-Ume6

MGKP I PNPLLGLNSTQAMGAP PKKKRKVGIRI GEKP YICR KCGRGF8RK8NL IRHQRTHTG

EKPYICRKCGRGESRKSNIJIREQR RTGEK YKCPDCGKSFSQSSSLIRHQRTHTGEKP YKCEECGKA FiZQSSHLTTHiCTI-EAAAANSASS^'STKLDDDLGTA^ ASRPHSSSFPSKGVLRPILLRIHNSEQQPIFESNNSTACI (SEQ ID NO: 10)

K3: RSNR-RSNR-QGTR-QSHR5-Ume6

MGKP I PNPLLGLNSTQAMGAPPKKKRKVGIRI PGEKP YICRKCGRGFSRKSNL IRHQRTHTGEKP YIC RKCGRGFSRKSNLIRHQRT TGEKPΨQCRICMRNFSQRGTLTRHIRT GEKPYVCRECGRGFRQHSH VRHKRTHAAAANSASSSTO^

SSFPSKGVLRPILLRIHNSEQQPIFESNNSTACI (SEQ ID NO: 11)

K4 : RSNR-RSNR-QGTR-QTHQ-Ume6

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKPYJCj;XCG^GFgig- -? J-;HORrm'GEKP_t-'JC RKCGRGFSRKSNLIRHQRTmGEKPFQCRICMRNFSQRGTLTRHIRTE GEKPYECHDCGKSFRQSTH: fcTOH^RJHAAAANSASSSTKLDDDLGTAAAv SOT^SSPYRT EfSFPSKGVLRPILLRIHNSEQQPIFESNNSΪACI (SEQ ID NO: 12)

K5 : VSSR-DGNV-VSSR-VDYK-Gal4

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKPYTC- QCGraFgy-?g£? I^HgrrHTGEKPFQq RieMRNFSDSGNLRVHIRT GEKPYTCKQCGKAFSVSSSLRRHETTmGEKPFHCGYCEKSFSVKDY

SPGW_DQTAYNA GI_TG»E-^^ (SEQ ID NO : 13)

K6 : MHHE-QSNR1-VSSR-QGDR-Gal4

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKP YACWFgCD£i_^^ ΥGEKPFECiωCGKAFIQKSNLIRHQRT GEKPYTCKQCGKAFSVSSSLRRHETTHTGEKPFQCRICM

(SEQ ID NO: 14)

K7: DGNV-QSHT-QSSR1-DGHR-Gal4

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKP^QC ?JCM; F£?PSG-^LJ;VHIRT_--TGEKP YKC EECGKAFRQSSHLTTHKILHTGEKP YKCPDCGKSFSQSSSLIRHQRTHTGEKPFQCRICMRNFSDPGH

PLS- ^W^^OTAYN TOITTG ^^ ID NO: 15)

K8 : DGAR-RDTN-QTHQ-RDTN

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKPFQCRJCMR F-?PPGA V-?HJI?RI-TGEKPFOC RICMRNFSRSDTLSNHJRTM:GEKP\YECHDCGKSFRQSTHLTQHRRIHTGEKPFQCRICMRNFSRSDT

[ SNHXRTHAAAARGMHLEGRIM (SEQ ID NO: 16)

K9: RDHT-QTHQ-QSHT-DGNV

HDCGKSFRQSTHLTQHRRIHTGEKPYKCEECGKAFRQSSHLTTHKIi GEKPFQCRICMRNFSDSGN t j VHJJgTfiAAAARGMHLEGRIM (SEQ ID NO: 17)

K10: RDHT-QTHQ-QSHT

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKPFQCK^'rCQE-FgiggDHLJrHr-eri-TGEKP YEC HDCGKSFRQSTHLTQHRRIHΣGEKPYKCEECGKAFRQSSHLTTHKIIHΑΑΑΑRGVlΗ EGRin (SEQ ID NO:18)

K11: RDHT-QSHV-QSHV

MGKPIPNPLLGLNSTQAMGAPPKKKRKVGIRIPGEKPFQCi?rCQi?-Fgi;gD-fL rHrj;T_--rGEKP EC DHCGKSFSQSSHLlWHKRTH GEKPYECDHCGKSFSQSSHLlWHKRTHΑΑl^ΛRGMHLEGRτM (SEQ ID NO: 19)

For example, the K10 clone was isolated five times among ketoconazole resistant clones. The ZFP encoded in this clone is very closely related to those encoded in the K9 and K11 clones. Since all the QTHQ, QSHT, and QSHV fingers found in the ZFPs encoded in the K9, K10, and K11 clones recognize the same 3-bp DNA sequence, 5'-HGA-3', it is possible that these ZFPs bind to the same target site and regulate the same target gene(s).

Likewise, the ZFPs encoded by the K2, K3, and K4 clones may also bind to the same target site and regulate common target gene(s) whose expression is suppressed by the corresponding ZFP repressors. In these clones, three out of four fingers are identical to those in corresponding positions in the other proteins. It is extremely unlikely that similar ZFPs would be repeatedly isolated by chance (p < 1.6 x 10⁻⁵). We also note that within each of the two groups of related clones (i.e., K2, K3, and K4; K9, K10, and K11), all the ZFPs belong to the same class of transcription factors. The K2, K3, and K4 clones each include the Ume6 repression domain. The K9, K10, and K11 clones function without a dedicated transcriptional regulatory domain.

Synergistic or additive effects may result when two or more ZFPs are cotransformed into cells. When two ZFPs (e.g., K4 and K5) were co-expressed, yeast cells became completely resistant to ketoconazole. This was an approximately 1,000-fold enhancement in the phenotype.

ZFP mutants were constructed that altered a key amino acid residue that participates in DNA base recognition or that removed or replaced a regulatory domain. In one mutant, an asparagine residue of its second zinc finger was mutated to alanine. Gel shift assays demonstrated that this mutated ZFP had at least a 10-fold decrease in DNA-binding affinity for its expected DNA site. This mutant protein (VSSR-DGAV-VSSR-VDYK-GAL4AD) does not confer drug resistance to yeast cells. In another mutant, the Gal4 activation domain of the K5 ZFP TF was deleted by inserting a stop codon in front of the DNA sequence encoding the activation domain. This protein does not confer resistance to ketoconazole. Similar results were obtained for the other ketoconazole resistance ZFPs using similar mutations.

When the activation domain fused to the K5 ZFP was replaced with the Ume6 repression domain, expression of the Ume6-form of the protein reversed the ketoconazole resistance phenotype. The cells that express the Ume6-form were more sensitive to ketoconazole relative to control cells. This result indicates that, at least in some cases, it is possible to design transcription factors by selecting for transcription factors that exacerbate a phenotype, and then altering the attached regulatory domain (e.g., by switching the directionality of its function) to produce a transcription factor that has the desired phenotypic effect. Screening for the so-called opposite phenotype can be more tractable than screening for the desired phenotypic effect. Examples of replacements that switch functional directionality include replacing one type of regulatory domain with another type of regulatory domain, or, less typically, removing a regulatory domain. In the case of the K5 ZFP, a protein that results in increased sensitivity was obtained by screening for increased resistance and by replacing a transcriptional activation domain with a transcriptional repression domain.

We performed DNA microarray experiments to identify genes associated with the drug resistance phenotype. We reasoned that different ZFPs conferring the identical phenotype may regulate identical gene sets whose differential expression is directly or indirectly associated with the phenotype. Three ZFPs, K5, K6, and K7, were chosen for expression profiling analyses. All of these transcription factors contained the Gal4 activation domain. Out of 6,400 yeast ORFs (open reading frames), ten ORFs were activated over 2 fold by at least two different ZFP transcription factors, and four ORFs were activated by all three tested ZFP transcription factors. PDR5, a gene known to pump out ketoconazole, was activated by two ZFPs, K6 and K7, but not by K5. See FIG. 13 for a Venn diagram indicating genes that are regulated by more than one chimeric ZFP. This result suggests that K5 confers ketoconazole resistance by a PDR5-independent mechanism. Thus it appears that there may exist at least two different pathways, one involving activation of PDR5 and the other not involving PDR5, that lead to ketoconazole resistance in yeast.
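The comparison described above, in which ORFs activated over 2 fold are intersected across the K5, K6, and K7 profiles, can be sketched as a set operation. In the following minimal sketch the fold-change values and the names GENE_A and GENE_B are invented placeholders, not data from this document; only the pattern for PDR5 (activated by K6 and K7 but not K5) and YLL053C (activated by all three) follows the text.

```python
from collections import Counter

FOLD_CUTOFF = 2.0  # "activated over 2 fold"

# Hypothetical fold-change values (induced / control); illustrative only.
fold_changes = {
    "K5": {"YLL053C": 4.1, "PDR5": 1.1, "GENE_A": 2.6, "GENE_B": 0.9},
    "K6": {"YLL053C": 3.3, "PDR5": 5.0, "GENE_A": 1.2, "GENE_B": 2.4},
    "K7": {"YLL053C": 2.8, "PDR5": 3.7, "GENE_A": 2.2, "GENE_B": 1.0},
}

def activated(profile, cutoff=FOLD_CUTOFF):
    """Set of genes whose fold change exceeds the cutoff in one profile."""
    return {gene for gene, fold in profile.items() if fold > cutoff}

def shared_genes(profiles, min_factors):
    """Genes activated by at least min_factors of the transcription factors."""
    counts = Counter(g for p in profiles.values() for g in activated(p))
    return {gene for gene, n in counts.items() if n >= min_factors}

in_two_or_more = shared_genes(fold_changes, 2)
in_all_three = shared_genes(fold_changes, 3)

print(sorted(in_two_or_more))
print(sorted(in_all_three))
```

Genes in the all-three set are candidates shared by every resistance-conferring factor, whereas genes present in only some sets (such as PDR5 here) point to alternative pathways.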

In order to identify new genes associated with the drug resistant phenotype, we over-expressed each of the four genes that are activated by all three tested ZFP transcription factors and tested cells over-expressing each of these genes for the drug resistance phenotype. We found that one of the genes induced ketoconazole resistance when over-expressed on its own. This gene (GenBank accession number: YLL053C) is highly homologous to plasma membrane and water channel proteins from Candida albicans. The amino acid sequence of YLL053C is as follows:

M FPQIIAGMAAGGAASAMTPGKVLFTNALGLGCSRSRGLFLEMFGTAVLCLTVLMTAVE KRETNFMAALPIGISLFMAHMALTGYTGTGWPARSLGAAVAARYFPHYHWIYWISPLLG

AFLAWSVWQLLQILDYTTYVNAEKAAGQKKED (SEQ ID NO: 20)

The amino acid sequence of an exemplary Candida albicans homolog (so-called AQY1) is as follows:

MVAESSSIDNTPNDVEAQRPVYEPKYDDSVNVSPLKNHMIAFLGEFFGTFIFL VAFVIA QIANQDPTIPDKGSDPMQLIMISFGFGFGVMMGVFMFFRVSGGNLNPAVTLTLVLAQAVP PIRGLFMMVAQMIAGMAAAGAASAMTPGPIAFTNGLGGGASKARGVFLEAFGTCILCLTV LMMAVEKSRATFMAPFVIGISLFLGHLICVYYTGAGLNPARSFGPCVAARSFPVYH IYW VGPILGSVIAFAIWKIFKILKYETCNPGQDSDA (SEQ ID NO: 21)

The YLL053C gene product may confer resistance by pumping out ketoconazole as does the PDR5 gene product. These data demonstrate that genes associated with phenotypic changes can be identified by comparing the gene expression profiles of cells that are altered by different chimeric ZFPs that each produce the same phenotype (in this example, ketoconazole resistance).

EXAMPLE 3

Stable cell line construction

Plasmids encoding ZFP-TFs were stably introduced into FlpTRex-293 cell lines (Invitrogen) essentially as described in the manufacturer's protocol. Briefly, the HindIII-XhoI fragment from the pLFD-p65, -VP16 or -Kid vector containing DNA segments encoding ZFP-TFs was subcloned into pCDNA5/FRT/TO (Invitrogen). The resulting plasmids were cotransfected with pOG44 (Invitrogen) into FlpTRex-293 cells, and stable integrants were screened. The resulting cell lines express ZFP-p65 or ZFP-VP16 upon doxycycline induction.

DNA microarray

DNA microarrays containing 7458 human EST clones, including 215 unassigned ESTs and 20 ESTs of putative genes, were provided by Genomic Tree, Inc. (Daejeon, Korea). FlpTRex-293 cells stably expressing ZFP-TFs were grown with (+Dox) or without (-Dox) 1 μg/ml doxycycline for 48 h. Total RNA was prepared from each sample. RNA from a -Dox sample was used as the reference (Cy3), and RNA from a +Dox sample as the experimental (Cy5). Microarray experiments were performed according to the manufacturer's protocol.
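In a two-color design of this kind, each spot is commonly summarized as the logarithm of the experimental (Cy5, +Dox) to reference (Cy3, -Dox) intensity ratio. The following is a minimal sketch of that calculation; the use of log base 2, the handling of non-positive intensities, and the function name are assumptions, since the document defers to the manufacturer's protocol for processing details.

```python
import math

def log2_ratio(cy5_plus_dox, cy3_minus_dox):
    """Return log2(+Dox / -Dox) for one spot, or None if either
    background-corrected intensity is not positive."""
    if cy5_plus_dox <= 0 or cy3_minus_dox <= 0:
        return None
    return math.log2(cy5_plus_dox / cy3_minus_dox)

# Hypothetical intensities: an 8-fold induction and a 4-fold repression.
induced = log2_ratio(800.0, 100.0)
repressed = log2_ratio(50.0, 200.0)
unusable = log2_ratio(0.0, 150.0)

print(induced, repressed, unusable)
```

On this scale, up-regulation and down-regulation by the same factor are symmetric about zero, which simplifies applying a single fold-change cutoff in both directions.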

Data Analysis

The Cluster and Treeview programs were used for global hierarchical clustering (Eisen et al. (1999) Methods Enzymol. 303:179-205). The expression profile of each individual gene was compared with those of the rest of the genes analyzed, and genes whose similarity score was above a limit we set were grouped together. The process was repeated for every gene analyzed.

Results and Discussion

We demonstrated that information about human genes could be annotated by comparing transcript profiles of cells that express artificial transcription factors (FIG. 2). We randomly isolated 136 ZFP-TFs from a library of at least a hundred thousand artificial ZFP transcription factors. We established stable HEK293 cell lines that can each express an individual ZFP-TF in response to doxycycline. Thus, 136 independent cell lines, one for each ZFP-TF, were generated. The global gene expression signature of each cell line was characterized using a cDNA microarray containing 7458 known human genes.

FIGs. 3 and 4 show some of the expression patterns produced by ZFP activators (FIG. 3) or repressors (FIG. 4). We found that, in general, the global expression patterns produced by different ZFP-TFs were quite different. This observation confirms that numerous ZFP-TFs produce discrete changes in transcription even though these ZFP-TFs were randomly chosen.

Some ZFP-TFs changed the expression patterns of several hundred genes. To demonstrate that this is not due to a lack of specificity of the ZFP-TFs, we performed time course experiments that followed changes in gene expression at various times after ZFP-TF induction (FIG. 5). The example shown in FIG. 5 is one such time course for a ZFP-TF named F104-p65. Although F104-p65 changed the expression of several hundred genes 48 hr after induction, at the 3 hr time point it significantly upregulated only four genes. Therefore, we conclude that F104-p65 directly regulates these four genes. The later changes to the gene expression profile may be a consequence of these initial effects. For example, a cascade of proteins in one or more pathways may be perturbing the gene expression profiles.

After collecting the array data, we processed the data to identify gene groups sharing similar expression profiles. For that, we first used the global, hierarchical clustering method of Eisen et al. (1999). 1493 genes in 136 expression profiles passed our predetermined criteria: they were analyzed in over 70% of the expression profiles and were up- or down-regulated more than 3 fold in at least one experiment. We found that, in these expression profiles, several clusters of co-regulated genes had related cellular functions. For example, identified clusters of co-regulated genes include: major histocompatibility complex class I family (FIG. 6A), melanoma antigen family genes (FIG. 6B), RNA processing genes (FIG. 7, genes marked bold), and ribosomal genes (FIG. 8, marked bold).

For more rigorous analyses of individual gene function, we grouped genes by individual similarity scoring, rather than global hierarchical clustering. This strategy has been demonstrated to predict interacting protein pairs better than the hierarchical clustering method (Kemmeren et al. Mol Cell 2002, 9, 1133-43). A rigorous similarity cut-off value of 85% or more was applied to categorize genes into groups. We first determined whether collections of ZFP-microarray data could identify sets of genes known to have close functional relationships. A number of such gene sets were easily recognized, as shown in the Tables provided in FIGs. 15 to 18. First, some gene groups included two probes for the same gene. Typically, these two probes originated from different ESTs for identical genes. The two probes were located at separate positions on the microarray slide. Other groups in the Tables provided in FIGs. 15 to 18 are components of a gene family or protein complex, whose function could be easily identified by their gene names (e.g., metallothionein genes, histone genes, ribosomal genes). The tight clustering of genes with similar functions demonstrates that this method successfully groups genes with related biological functions.

Other gene groups at a glance did not show close functional relationships. We surveyed all syn-expression groups identified by our method using extensive search and study of scientific literature references. Strikingly, we identified numerous gene groups with functional relationships. Some of these groups are shown in Table 2.

Table 2

ID | Gene Name | Function

N62761 | fragile X mental retardation gene 1 | involved in nuclear export of target mRNA
R41973 | GLE1 | RNA export mediator

AI363200 | proenkephalin | binds to G protein-coupled opioid receptors
AI363445 | G alpha interacting protein (GAIP) | Regulator of G-protein signaling (RGS)

AA464856 | ID4 (inhibitor of DNA binding 4) | induces apoptosis
W60703 | caspase 5 | apoptosis-related cysteine protease

AA458878 | agrin | synaptic basal lamina component
H92234 | KIF1A (kinesin-like protein) | axonal transport of synaptic vesicles

AA057313 | MORF | transcription repressor
AI678222 | zinc finger protein 47 | transcription repressor
AA130717 | zinc finger protein 264 | transcription repressor

N69204 | Importin-alpha re-exporter | Importin-alpha re-exporter binds with high affinity to importin-alpha only in the presence of RanGTP. The complex is dissociated by the combined action of RanBP1 and RanGAP.
R11189 | RAN binding protein 8 | (same entry as above)

AI126424 | E2F-like protein | EGF increases E2F-1 expression
AI192302 | Eps15R | EGFR substrate

AI055825 | low affinity IgE Fc receptor | Fish odors or fumes cause IgE-mediated hypersensitivity (allergic reaction)
AI251747 | odorant-binding protein 2B | (same entry as above)

AA216528 | lathosterol oxidase | cholesterol biosynthesis
AH 39090 | syntaxin 8 | syntaxins are concentrated in 200 nm large, cholesterol-dependent clusters at which secretory vesicles preferentially dock and fuse

AA054073 | CEACAM6 | CEACAM6 markedly inhibits the apoptosis of cells when deprived of their anchorage to the extracellular matrix, a process known as anoikis
AA143331 | Matrix metalloproteinase 1 | (same entry as above)

AA452872 | GCN5L2 | Transcriptional coactivator; histone acetylation
H92201 | nucleosome assembly protein 1-like 4 | Interacts with p300/CBP, another transcription coactivator/histone acetyltransferase

AI949576 | Annexin A3 | inhibitor of PLA2
AI969825 | cutaneous T-cell lymphoma-associated tumor antigen SE20-4 | nucleolar TGF-β1 target protein
AI971049 | myocilin | trabecular meshwork inducible glucocorticoid response protein. TGF-β1 and glucocorticoid attenuate IL-1β-induced PLA2 elevation.

AA422058 | methyltransferase-like 1 | DNA methylation inactivates metastasis suppressor genes
AA496628 | PUF | found in reduced amount in tumor cells of high metastatic potential

H51419 | potassium voltage-gated channel, | CD4-CD8- T cells from mice with collagen arthritis display aberrant expression of type 1 K+ channels
H99676 | collagen, type VI, alpha 1 | (same entry as above)

AI564336 | colony stimulating factor 3 | Methotrexate (MTX) acts by inducing cellular depletion of reduced folates, which ultimately leads to an inhibition of DNA synthesis. MTX inhibits colony formation of the hematopoietic progenitor cells (CFU-C) in vitro.
R50337 | folate transporter member 1 | (same entry as above)

AA398883 | squamous cell carcinoma antigen 1 | squamous cell carcinoma marker
AA431080 | Keratin, type II cytoskeletal 6A | cytokeratin 19 is a squamous cell carcinoma marker

In one example, the gene encoding the fragile X mental retardation protein (FMR1) and the GLE1 gene are grouped together. A review of the scientific literature provided a functional link between these two genes. Both the FMR1 and GLE1 proteins participate in nuclear export of mRNA and may interact directly, or exist as a complex. Genes encoding ID4 and caspase 5 constitute still another syn-expression group. Both of these proteins have pro-apoptotic functions. One possibility is that caspase 5 may be a downstream effector of ID4.

Another syn-expression group contains GCN5-like protein 2 (GCN5L2) and Nucleosome assembly protein 1 (NAP-1)-like 4. GCN5 is a well-known transcriptional coactivator that activates transcription by acetylating histones. Recent studies showed that p300/CBP, another member of the transcription coactivator complex and a histone acetyltransferase, functionally interacts with NAP-1. Therefore, it is possible that NAP-1 may also interact with GCN5, assisting its coactivator function. Overall, this analysis of the scientific literature supports the confidence level for syn-expression groups identified by our large-scale ZFP-microarray methods.

It is known that interacting protein pairs often share similar expression patterns. In some cases, proteins encoded by genes in the same syn-expression group identified by our method may physically interact with each other. Physical interaction can be tested by a variety of methods, including the two-hybrid technique (e.g., the mammalian or yeast two-hybrid technique) and biochemical assays (e.g., co-immunoprecipitations, fluorescence assays, and surface plasmon resonance).

Example 4

We have developed a high-throughput approach to collecting randomly perturbed gene expression profiles from the human genome. A human 293 cell library that stably expresses randomly chosen zinc finger transcription factors was constructed, and the expression profile of each cell line was obtained using cDNA microarray technology. Gene expression profiles from a total of 132 cell lines were collected and analyzed by (i) a simple clustering method based on expression profile similarity, and (ii) the shortest-path analysis method. These analyses identified a number of gene groups, and further investigation revealed that genes grouped together frequently have close biological relationships. The artificial transcription factor-based random genome perturbation method thus provides a novel functional genomic tool for annotation and classification of genes in the human genome and in the genomes of other organisms.

Disclosed is a high-throughput method that can be used to acquire functional genomic data. Our approach involves the building of a large-scale gene expression profile database for the human genome. To achieve this, we have used a number of pre-assembled ZFP-transcription factors (ZFP-TFs). By performing large-scale microarray experiments with cell lines that express these ZFP-TFs, we demonstrate that each ZFP-TF regulates a distinct set of genes in the human genome, thus verifying that our method perturbs the gene expression program in an unbiased manner. The gene expression profiles obtained were then subjected to bioinformatic analyses to build a number of co-regulated gene groups. Inspection of these groups identified a number of genes whose functional relationships are evident, thus demonstrating the validity of this approach.

Methods

Construction of stable cell lines that express ZFP-TFs

We selected ZFP-TFs from our pre-made collection without any bias or preference. Human embryonic kidney (HEK) cell lines stably expressing ZFP-TFs were generated as follows. Plasmids encoding ZFP-TFs were stably introduced into FlpTRex-293 cell lines (Invitrogen) essentially as described in the manufacturer's protocol. Briefly, the HindIII-XhoI fragments from the pLFD-p65, -VP16 or -Kid vectors (Bae, K. H. et al. (2003) Nature Biotech, in press), which contain DNA segments that encode ZFP-TFs, were subcloned individually into pCDNA5/FRT/TO (Invitrogen). The resulting plasmids were co-transfected along with pOG44 (Invitrogen) into FlpTRex-293 cells to induce a site-specific integration event, and stable integrants were screened. The resulting cell lines express ZFP-TFs upon the addition of doxycycline (Dox). A total of 132 cell lines were subjected to gene expression microarray experiments.

DNA microarrays

DNA microarrays containing 7,458 human EST clones, including 215 unassigned ESTs and 20 ESTs of putative genes, were provided by Genomic Tree, Inc. (Daejeon, South Korea). FlpTRex-293 cells that stably expressed ZFP-TFs were cultured with (+Dox) or without (-Dox) 1 μg/ml Dox for 48 h. Total RNA was prepared from each sample. RNA from a -Dox sample was used as the reference (Cy3), and RNA from a +Dox sample constituted the experimental (Cy5) sample. Microarray experiments were performed according to the manufacturer's protocol. For HeLa cell microarray experiments, a pLFD-p65/F2840 plasmid, which is an expression vector encoding the F2840-p65 ZFP-TF, was transiently transfected into HeLa cells using the Lipofectamine Plus (Invitrogen) reagent, and for the control, pLFD-p65 alone without the ZFP-TF was transfected.

Data Analysis

CLUSTER and TREEVIEW programs (Eisen et al, (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868) were used for global hierarchical, average linkage clustering. Only genes that are up- or down-regulated greater than 3 fold in one or more experiments, and present in more than 70% of the experiments were subjected to further analysis.
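The two inclusion criteria stated above (regulated more than 3 fold in one or more experiments, and present in more than 70% of the experiments) can be sketched as a simple per-gene filter. This is a minimal illustration, not the code actually used: it assumes each gene is represented by a list of +Dox/-Dox ratios with None marking experiments in which the spot was not measured, and the function name is hypothetical.

```python
import math

MIN_PRESENT_FRACTION = 0.70  # must be measured in more than 70% of experiments
FOLD_CUTOFF = 3.0            # must change more than 3 fold, up or down

def passes_filter(ratios):
    """ratios: +Dox/-Dox expression ratios (> 0), or None where missing."""
    measured = [r for r in ratios if r is not None]
    if len(measured) <= MIN_PRESENT_FRACTION * len(ratios):
        return False
    log_cutoff = math.log2(FOLD_CUTOFF)
    # |log2 ratio| treats 3-fold induction and 3-fold repression alike.
    return any(abs(math.log2(r)) > log_cutoff for r in measured)

# Hypothetical genes measured across four experiments.
up_gene = passes_filter([1.0, 3.5, 1.1, None])     # 75% present, 3.5-fold up
down_gene = passes_filter([1.0, 0.25, 1.0, 1.0])   # 4-fold down
flat_gene = passes_filter([1.0, 2.0, 1.2, 0.9])    # never exceeds 3 fold
sparse_gene = passes_filter([5.0, None, None, 1.0])  # only 50% present

print(up_gene, down_gene, flat_gene, sparse_gene)
```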

To isolate gene groups with strong similarities in their expression profiles, we processed the data with the following algorithm.

(a) A gene that is not included in any group is selected to form a temporary group T.

(b) The Pearson similarity coefficient is calculated between the genes in the T group and each of the remaining genes not included in any group. If the similarity is greater than the cutoff (initially 100%), we include the compared gene in the T group.

(c) If the T group has more than two genes, we consider it to be a new group.

(d) Repeat (a) through (c) until there are no genes left to be included in a group.

(e) Next, at step (b), decrease the cutoff by 5% and repeat steps (a) through (d) until no genes are left (that is, not included in any group).
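Steps (a) through (e) can be sketched in Python as follows. This is an illustrative reading of the procedure, not the program used here, and several points are assumptions: a candidate gene must meet the cutoff against every gene already in T; "greater than the cutoff" is treated as "at least the cutoff" so that the initial 100% cutoff can be met; the minimum group size is a parameter (the text says "more than two genes", but two-gene groups are reported, so the default is 2); and the loop stops at a floor of 0.85, the criterion quoted in the results. The names `pearson` and `group_genes` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: treated as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def group_genes(profiles, start=1.00, step=0.05, floor=0.85, min_size=2):
    """Greedy grouping with a cutoff that is lowered after each pass."""
    eps = 1e-9  # tolerance for floating-point comparisons
    ungrouped = list(profiles)
    groups = []
    cutoff = start
    while ungrouped and cutoff >= floor - eps:
        for seed in list(ungrouped):
            if seed not in ungrouped:
                continue  # absorbed into a group earlier in this pass
            temp = [seed]  # step (a): temporary group T
            for gene in ungrouped:
                if gene == seed:
                    continue
                # step (b): compare the candidate with every member of T
                if all(pearson(profiles[gene], profiles[m]) >= cutoff - eps
                       for m in temp):
                    temp.append(gene)
            if len(temp) >= min_size:  # step (c): accept T as a group
                groups.append((round(cutoff, 2), temp))
                ungrouped = [g for g in ungrouped if g not in temp]
        cutoff -= step  # step (e): lower the cutoff and repeat
    return groups, ungrouped


# Hypothetical expression profiles across four experiments.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],    # perfectly correlated with g1
    "g3": [1, 2, 3, 5],
    "g4": [4, 3, 2, 1],    # anti-correlated with g1
    "g5": [1, 2, 3, 4.5],  # about 0.997 correlated with g3
}
groups, leftover = group_genes(profiles)
```

Because the procedure is greedy, a gene joins the first group it qualifies for at the highest cutoff, so the result can depend on the order in which seed genes are visited.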

SP (Shortest Path) analysis was performed essentially as described in Zhou et al. (Proc. Natl. Acad. Sci. USA 99:12783, 2002).

Experimental validation

The Spi-B transcription factor was cloned using polymerase chain reaction (PCR) and subcloned into the pcDNA3 vector (Sigma) to generate pcDNA3-SpiB. pcDNA3-SpiB was then transiently transfected into 293 cells, and 48 h after transfection, cells were harvested and total RNA was prepared. Real-time PCR was performed according to the manufacturer's protocol (Corvette Research).

Random genome perturbation using ZFP-TFs

From a number of pre-assembled ZFP-TF collections in our laboratory, we randomly picked a group of ZFP-TFs and then established stable HEK293 cell lines that express each individual ZFP-TF in a Dox-dependent manner (FIG. 19). Therefore, upon the addition of Dox, a unique ZFP-TF is expressed, and it will in turn regulate a unique set of genes. To obtain the global gene expression signature affected by each ZFP-TF, genome-scale gene analysis was performed for a total of 132 cell lines, using a cDNA microarray that contained 7,458 known human genes (FIG. 19).

Some of the expression patterns achieved by the ZFP activators (FIG. 23) and repressors (FIG. 24) are shown. Overall, different ZFP-TFs showed unique global gene expression signatures, in agreement with our hypothesis that random perturbation of the genome can be attained with randomly chosen ZFP-TFs. It should be noted that the gene expression profiles obtained by ZFP-TFs disappeared when we deleted functional domains or introduced mutations into the DNA binding domains which destroyed DNA binding ability.

We then characterized some of the basic properties of gene regulation by several ZFP-TFs. First, we tested whether a particular transcription profile obtained with a given ZFP-TF is cell type-specific. To this end, we compared gene expression profiles generated by one ZFP-TF, F2840-p65, in the following cell types: (i) 293 cells, a non-cancerous human embryonic kidney cell line that stably expressed the F2840-p65 ZFP-TF, and (ii) HeLa cells, a human cervical carcinoma cell line, in which the F2840-p65 ZFP-TF was transiently transfected. Comparison of the microarray data revealed that the insulin gene was highly up-regulated in both cell lines (FIG. 20, marked by arrows) and that similar sets of genes were regulated in both cell types (Table 3). Thus, a given ZFP-TF appears to regulate similar sets of genes when introduced into cell lines derived from different cell types.

Table 3: Genes upregulated by a ZFP activator in two different human cell lines

Name    293    HeLa
Insulin    64.37    78.22
protein tyrosine phosphatase, receptor type, N    29.38    21.65
platelet-derived growth factor alpha polypeptide    26.30    9.53
putative gene product    24.23    19.55
fibroblast growth factor receptor 3 (achondroplasia, thanatophoric    19.87    7.82
L1 cell adhesion molecule    19.03    1.74
insulin-like growth factor 2 (somatomedin A)    18.18    3.36
cyclin-dependent kinase inhibitor 1C (p57, Kip2)    17.67    7.06
FK506-binding protein 8 (38kD)    15.95    13.42
single strand of homotrimeric collagen-like tail subunit of asymmetric acetylcholinesterase    15.43    23.81
major histocompatibility complex, class II, DR alpha    14.05
protein tyrosine phosphatase, receptor type, N polypeptide 2    13.72    2.97
activity-regulated cytoskeleton-associated protein    13.18    6.14
brain-specific protein p25 alpha    13.14    5.22
cadherin 13, H-cadherin (heart)    13.01    6.90
protein phosphatase 2, regulatory subunit B (B56), alpha isoform    12.91    7.05
solute carrier family 12 (potassium/chloride transporters), member 7    12.57    1.39
keratin, hair, basic, 5    11.29
cellular retinoic acid-binding protein 2    10.76    12.38
immunoglobulin heavy constant gamma 3 (G3m marker)    10.64    24.47
Human Ig active epsilon1 5' UT, V-D-J region subgroup VH-I, gene    10.57    2.80
a disintegrin-like and metalloprotease (reprolysin type) with    10.53    14.72
keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner)    9.24    1.83
phosphofructokinase, platelet    8.64    2.33
olfactory receptor, family 7, subfamily E, member 12 pseudogene    8.22    2.65
colony stimulating factor 2 receptor, alpha, low-affinity    7.95    9.04
zinc finger protein-like 1    7.60    2.64
tubulin, beta polypeptide    7.57    2.18
coagulation factor VII (serum prothrombin conversion accelerator)    7.29    1.13
glypican 1    7.29    3.32
myosin, light polypeptide 1, alkali; skeletal, fast    6.71    3.45

Each value represents fold induction from DNA microarray experiments.

Second, we performed a time course experiment using one of our stable cell lines that expresses a ZFP transcriptional activator, F104-p65. Expression of this ZFP-TF for 48 h resulted in the regulation of several hundred genes (FIG. 21, 48 h). However, at an early time point, such as 3 h after Dox addition, it activated only two genes more than two-fold (FIG. 21, marked by arrows). These two genes were induced more than two-fold throughout the course of the experiment. This result suggests that these two genes might be the primary targets of the F104-p65 ZFP-TF. Regulation of many genes at the 48 h time point is likely to be a downstream effect of expression of the two primary genes (that is, the other genes are secondary or tertiary targets of the ZFP-TF). This result also suggests that a pathway analysis can be performed if time-course microarray experiments are performed for a number of ZFP-TF-expressing stable cell lines. Because ZFP-TF expression is tightly controlled by the addition of Dox, a rigorous time course analysis is possible using our cell library system. This will eventually help in the building of a hierarchical map or transcriptional network of gene expression.
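The reasoning used to nominate primary targets from the time course can be written as a short Python sketch. It assumes a hypothetical data layout (gene name mapped to fold induction at each time point in hours) and a hypothetical function name, `primary_target_candidates`; the sample values are invented for illustration and are not data from FIG. 21.

```python
def primary_target_candidates(timecourse, threshold=2.0):
    """Return genes induced above the threshold at the earliest time point
    and at every later time point.

    timecourse maps gene -> {hours_after_dox: fold_induction} (assumed layout).
    """
    candidates = []
    for gene, series in timecourse.items():
        earliest = min(series)
        induced_early = series[earliest] > threshold
        induced_throughout = all(v > threshold for v in series.values())
        if induced_early and induced_throughout:
            candidates.append(gene)
    return candidates


# Invented fold-induction values at 3 h, 12 h and 48 h after Dox addition.
timecourse = {
    "geneA": {3: 2.5, 12: 3.1, 48: 4.0},  # induced early and throughout
    "geneB": {3: 1.1, 12: 1.8, 48: 6.2},  # induced only late: likely downstream
    "geneC": {3: 2.2, 12: 2.4, 48: 2.1},  # induced early and throughout
}
candidates = primary_target_candidates(timecourse)
```

Genes that pass only at late time points, such as `geneB` here, correspond to the secondary or tertiary targets discussed above.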

Analysis of ZFP-TF expression profiling data set reveals a number of gene groups with functional relationships

The expression profile dataset obtained from microarray experiments with 132 cell lines was analyzed. First, we clustered genes with similar expression profiles. Several clusters containing genes with similar functions, such as ribosomal genes, histone genes, or genes involved in RNA processing, were easily recognized by inspecting the Treeview images (Eisen et al, (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868). To isolate novel groups of genes that showed a strong correlation in their expression profiles, we set a stringent criterion of a Pearson similarity coefficient of 0.85 or more between genes, and identified only gene groups that met this criterion (see Methods). By this approach, we were able to process 205 genes into 86 groups. The result of this analysis is shown in Table 4. As expected, and consistent with the global clustering method (Eisen et al, (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868), several groups consisted of components of a gene family or protein complex, whose function could be easily identified. For example, cystatin C and S (Table 4, group 9), metallothionein 1E and 1L (group 75), and sulfotransferase family 1A members 2 and 3 (group 25) each constituted a gene group. Gene groups that consisted of the melanoma antigen family A (groups 47 and 74), histone genes (groups 24 and 79), and ribosomal genes (groups 38, 40, 52, 58, and 76) were also observed.

% similarity represents the Pearson correlation coefficient among genes in the same group. A table of gene groups identified by shortest-path analysis can be reconstructed using comma-delimited data in the Appendix of USSN 60/453,111, filed March 7, 2003, the contents of which are incorporated by reference in their entirety. Asterisks represent the shortest path.

For the other gene groups, whose components did not, at first glance, show close functional relationships, we identified functional relationships using an extensive search and study of the PUBMED literature database. We identified several gene groups to which we could assign putative functional relationships based on their descriptions in the literature. Some of these groups are shown in Table 5.

Table 5

* ( ) represents reference number

For example, ID4 and caspase 5 constitute a co-regulated group with a Pearson coefficient of 0.85. By literature search, we found that ID4 can induce apoptosis (Berman et al, (1996) Cell 86, 445-452). Because caspase 5 is a well-known pro-apoptotic gene, it is possible that these two genes are functionally related, playing roles in the pro-apoptotic pathway. Another group consists of annexin A3, se20-4 tumor antigen, and myocilin. Annexin is an inhibitor of phospholipase A2 (PLA2) (Oh et al, (2000) FEBS Lett. 477, 244-248). se20-4 tumor antigen is the nucleolar TGF-β1 target protein (Ozbun et al, (2001) Genomics 73, 179-193). Myocilin is a trabecular meshwork inducible glucocorticoid response protein (Polansky et al, (1997) Ophthalmologica 211, 126-139). As TGF-β1 and glucocorticoid attenuate IL1-beta-induced PLA2 elevation (Muhl et al, (1992) FEBS Lett. 301, 190-194), these three genes might be common downstream targets of TGF-beta or glucocorticoid signaling. The tight clustering of genes with similar function indicates that this method is successful in grouping genes with close biological relationships. It should be noted that because we set a highly stringent cutoff value, many of the identified groups contained only a small number of genes.

Verification of functional relationships using shortest-path analysis

Next, we applied a recently developed shortest-path (SP) analysis method (see, e.g., Zhou et al. (2002) Proc. Natl. Acad. Sci. USA 99:12783) to analyze our gene expression profile data. This method considers transitive expression similarity among genes as an attribute to link genes within the same biological pathway. This method has an advantage over traditional clustering approaches because it can group not only functionally related genes with similar expression profiles but also those with different expression patterns (see, e.g., Zhou et al. (2002) Proc. Natl. Acad. Sci. USA 99:12783). Results of the SP analysis of our expression profile dataset are shown in Table 4. From this analysis, we were able to find information not available from simple clustering and grouping analysis.
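The idea of transitive expression similarity can be illustrated with a small Python sketch: genes are nodes, strongly correlated gene pairs are joined by an edge, and two genes with dissimilar profiles are linked if a short chain of co-expressed intermediates connects them. This is a generic illustration using Dijkstra's algorithm and is not a reproduction of the cited procedure. The edge weight `1 - |r|`, the edge threshold of 0.7, and the names `build_graph` and `shortest_path` are assumptions.

```python
import heapq
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def build_graph(profiles, min_abs_r=0.7):
    """Connect genes whose |r| meets the threshold; weight = 1 - |r| (assumed)."""
    genes = list(profiles)
    graph = {g: {} for g in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = abs(pearson(profiles[a], profiles[b]))
            if r >= min_abs_r:
                graph[a][b] = graph[b][a] = 1.0 - r
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the gene path or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical profiles: A and C are uncorrelated, but both correlate with B.
profiles = {
    "A": [1, 0, -1, 0],
    "B": [1, 1, -1, -1],
    "C": [0, 1, 0, -1],
    "D": [1, -1, 1, -1],  # uncorrelated with all of the others
}
graph = build_graph(profiles)
path_ac = shortest_path(graph, "A", "C")
path_ad = shortest_path(graph, "A", "D")
```

In this toy example genes A and C would never share a cluster under a direct correlation cutoff, yet they are linked through B, which is the kind of relationship that direct clustering cannot report.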

We found that SP analysis of our ZFP-TF driven dataset produced very precise groupings, as demonstrated by the histone clusters shown in Table 6.

Table 6: Histone clusters identified by SP analysis

SP group ID    EST ID      Gene Name
Group 7 (4)    AA047260    H2A histone family, member O
               AI095013    H2A histone family, member N
               AI209184    H2A histone family, member A
               AA452933    H2A histone family, member L
Group 26 (3)   AI076718    H2B histone family, member R
               H70775      H2B histone family, member A
               N71982      Homo sapiens mRNA for histone H2B, clone pjG4-5-14
Group 19 (3)   AA287316    H4 histone family, member I
               AA868008    H4 histone family, member G
               AI653010    H4 histone family, member D

In this study, histone genes grouped using expression data sorted according to their subclasses (Table 6). The precise subgrouping of histone genes by SP analysis of data obtained by the methods described herein demonstrates the usefulness and detail provided by this approach.

Another example of SP analysis is shown in Table 7, a gene cluster that includes insulin-like growth factor-2 (IGF-2).

Table 7: IGF-2 cluster identified by SP analysis

SP Group   EST ID    Gene Name
30         H09111    putative gene product
           N54596    insulin-like growth factor 2 (somatomedin A)
           N95418    FK506-binding protein 8 (38kD)
           R41787    cadherin 13, H-cadherin (heart)
           R45941    protein tyrosine phosphatase, receptor type, N
           R59165    protein phosphatase 2, regulatory subunit B (B56), alpha isoform
           T50498    Shc

A literature analysis confirmed that the members of this gene cluster are functionally related. First, it has been shown that protein phosphatase 2A (PP2A) is involved in the insulin/IGF-1 signal transduction pathway (Ugi et al, (2002) Mol. Cell. Biol. 22, 2375-2387). The presence of IGF-2 and PP2A in the same SP group is consistent with the observation that PP2A participates in IGF-2 signaling. It also has been reported that growth factor signaling involves PP2A and an unidentified tyrosine phosphatase for MAP kinase inactivation (Alessi et al, (1995) Curr. Biol. 5, 283-295). The presence of receptor protein tyrosine phosphatase N and PP2A in this group along with IGF-2 is consistent with the possibility that these two phosphatases act together in the IGF-2 signaling pathway. This group also contains FKBP8, a member of the FK506-binding protein (FKBP) family. FKBPs bind not only FK506 but also rapamycin (Bierer et al, (1990) Science 250, 556-559). It has been reported that rapamycin can block IGF signaling by complexing with FKBP (Dilling et al, (1994) Cancer Res. 54, 903-907). The presence of FKBP8 in the IGF-2 gene group suggests that this gene product has a potential role in mediating the effect of rapamycin in IGF-2 signaling. Physical interactions between members of this group have also been characterized. Shc is known to interact physically with PP2A and cadherin (Xu et al, (1997) J. Biol. Chem. 272, 13463-13466). Cadherins can also interact with a receptor type protein tyrosine phosphatase (Brady-Kalnay et al, (1998) J. Cell. Biol. 141, 287-296), in agreement with the presence of both cadherin and receptor type protein tyrosine phosphatase in the IGF-2 group (Table 7). This group also contains an uncharacterized gene (Table 7, putative gene product) with some similarity to the Drosophila furry gene (Cong et al, (2001) Development 128, 2793-2802). Based on the extensive relationships among the other members of this group and their role in IGF signaling, this uncharacterized gene may also play an important role in IGF signaling.

Experimental validation of gene groups: Spi-B - melanoma antigens

Another SP group contained members of the melanoma antigen family A (Table 8), suggesting that these genes are coordinately regulated.

Table 8: MAGE cluster identified by SP analysis

SP Group   EST ID      Gene Name
18         AA279188    a disintegrin and metalloprotease domain 8
           AI200443    melanoma antigen, family A, 5
           AI691089    melanoma antigen, family A, 11
           AI830281    melanoma antigen, family A, 9
           N71628      Spi-B transcription factor (Spi-1/PU.1 related)
           AA402040    tight junction protein 3 (zona occludens 3)
           AA857809    melanoma antigen, family A, 4
           AA995045    melanoma antigen, family A, 3
           AI032153    melanoma antigen, family A, 8

This coordinated regulation can be explained by the existence of a common transcriptional regulator. Spi-B, a PU.1-related transcription factor, was also included in this group. Therefore, we asked whether Spi-B is the master regulator of melanoma antigen family A expression. To test this, we cloned the Spi-B gene into a pcDNA3 mammalian expression vector, then transfected this vector into 293 cells. For the melanoma antigen family genes we analyzed (MAGE-3, 5, 8, and 9), induction of the mRNA level was observed upon transfection of the Spi-B expression vector (FIG. 22). Inspection of the promoter regions of MAGE-3, 5, 8, and 9 revealed the presence of a PU.1 box. Therefore, SP analysis of a ZFP-TF-derived gene expression dataset revealed a novel transcriptional regulator of the family of melanoma antigen A genes.

We demonstrated that the human genome can be randomly perturbed by ZFP-TFs and that the resulting expression profiles can provide novel information about the function of genes. Using conventional clustering based on expression profile similarity and the recently introduced SP analysis, we were able to identify many groups of genes with close functional relationships. SP analysis of ZFP-TF-derived expression datasets also revealed Spi-B as a novel regulator of the melanoma antigen gene family, and this prediction was verified experimentally. Some important advantages of using ZFP-TFs as genome-wide regulators of gene expression include the following. First, ZFP-TFs can be used to both down-regulate and up-regulate target genes. Therefore, ZFP-TFs allow a comprehensive analysis of co-clustered genes. Second, the number of finger domains included in the ZFP-TF can regulate the specificity of the ZFP. Depending upon whether three- or six-finger ZFPs are used, the number of regulated genes in a cell varies. Three- or four-finger proteins, as used in this study, are not highly specific; they can modulate several genes when introduced into cells. Thus the proteins can produce a large number of gene perturbations, which provide additional correlations for analysis. For example, a gene expression profile obtained from a single ZFP-TF with broad specificity can reveal information about several genetic pathways, and this information can be easily categorized with the use of bioinformatics. Third, the universality of transcription factor action makes this strategy easily applicable to many other eukaryotic and prokaryotic genomes.

More array experiments, along with the use of microarrays that cover a substantial part of the expressed human genome, can provide a comprehensive functional analysis of the human genome. In addition, time course expression experiments for each cell line can also be used to build pathway maps of human cellular gene expression.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

References

1. Tong, A. H. Y. et al. (2001) Science 294, 2364-2368

2. Giaever, G. et al. (2002) Nature 418, 387-391

3. Spradling, A. C., Stern, D., Beaton, A., Rhem, E. J., Laverty, T., Mozden, N., Misra, S., & Rubin, G. M. (1999) Genetics 153, 135-177

4. Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868

5. Iyer, V. R. et al. (1999) Science 283, 83-87

6. Niehrs, C. & Pollet, N. (1999) Nature 402, 483-487

7. Hughes, T. R. et al. (2000) Cell 102, 109-126

8. Ge, H., Liu, Z., Church, G. M., & Vidal, M. (2001) Nature Genet. 29, 482-486

9. Kemmeren, P., van Berkum, N. L., Vilo, J., Bijma, T., Donders, R., Brazma, A., & Holstege, F. C. P. (2002) Mol. Cell 9, 1133-1143

10. Zhou, X., Kao, M.-C. J., & Wong, W. H. (2002) Proc. Natl. Acad. Sci. USA 99, 12783-12788

11. Cho, Y. S., Kim, M.-K., Cheadle, C., Neary, C., Becker, K. G., & Cho-Chung, Y. S. (2001) Proc. Natl. Acad. Sci. USA 98, 9819-9823

12. Tuschl, T. (2002) Nature Biotech. 20, 446-448

13. Kawasaki, H., Onuki, R., Suyama, E., & Taira, K. (2002) Nature Biotech. 20, 376-380

14. Segal, S. J. & Barbas III, C. F. (2001) Curr. Opin. Biotech. 12, 632-637

15. Lee, D.-K., Seol, W., & Kim, J.-S. (2003) Curr. Top. Med. Chem. 3, 645-657

16. Bae, K. H. et al. (2003) Nature Biotech., in press

17. Mansour, A., Hoversten, M. T., Taylor, L. P., Watson, S. J., & Akil, H. (1995) Brain Res. 700, 89-98

18. Berman, D. M., Wilkie, T. M., & Gilman, A. G. (1996) Cell 86, 445-452

19. Andres-Barquin, P. J., Hernandez, M. C., & Israel, M. A. (1999) Exp. Cell Res. 247, 347-355

20. Krippner-Heidenreich, A., Talanian, R. V., Sekul, R., Kraft, R., Thole, H., Ottleben, H., & Luscher, B. (2001) Biochem. J. 358, 705-715

21. Li, S., & Baserga, R. (1996) Exp. Gerontol. 31, 195-206

22. Klapisz, E., Sorokina, I., Lemeer, S., Pijnenburg, M., Verkleij, A. J., & van Bergen en Henegouwen, P. M. (2002) J. Biol. Chem. 277, 30746-30753

23. Crespo, J. F., Pascual, C., Dominguez, C., Ojeda, I., Munoz, F. M., & Esteban, M. M. (1995) Allergy 50, 257-261

24. Ordonez, C., Screaton, R. A., Ilantzis, C., & Stanners, C. P. (2000) Cancer Res. 60, 3419-3424

25. Koul, D. et al. (2001) Oncogene 20, 6669-6678

26. Oh, J., Rhee, H. J., Kim, S., Kim, S. B., You, H., Kim, J. H., & Na, D. S. (2000) FEBS Lett. 477, 244-248

27. Ozbun, L. L., You, L., Kiang, S., Angdisen, J., Martinez, A., & Jakowlew, S. B. (2001) Genomics 73, 179-193

28. Polansky, J. R., Fauss, D. J., Chen, P., Chen, H., Lutjen-Drecoll, E., Johnson, D., Kurtz, R. M., Ma, Z. D., Bloom, E., & Nguyen, T. D. (1997) Ophthalmologica 211, 126-139

29. Muhl, H., Geiger, T., Pignat, W., Marki, F., van den Bosch, H., Cerletti, N., Cox, D., McMaster, G., Vosbeck, K., & Pfeilschifter, J. (1992) FEBS Lett. 301, 190-194

30. Lou, W., Krill, D., Dhir, R., Becich, M. J., Dong, J. T., Frierson, H. F. Jr., Isaacs, W. B., & Gao, A. C. (1999) Cancer Res. 59, 2329-2331

31. Steeg, P. S., Bevilacqua, G., Pozzatti, R., Liotta, L. A., & Sobel, M. E. (1988) Cancer Res. 48, 6550-6554

32. Schneider, J., Bitterlich, N., Velcovsky, H. G., Morr, H., Katz, N., & Eigenbrodt, E. (2002) Int. J. Clin. Oncol. 1, 145-151

33. Ugi, S., Imamura, T., Ricketts, W., & Olefsky, J. M. (2002) Mol. Cell. Biol. 22, 2375-2387

34. Alessi, D. R., Gomez, N., Moorhead, G., Lewis, T., Keyse, S. M., & Cohen, P. (1995) Curr. Biol. 5, 283-295

35. Bierer, B. E., Somers, P. K., Wandless, T. J., Burakoff, S. J., & Schreiber, S. L. (1990) Science 250, 556-559

36. Dilling, M. B., Dias, P., Shapiro, D. N., Germain, G. S., Johnson, R. K., & Houghton, P. J. (1994) Cancer Res. 54, 903-907

37. Xu, Y., Guo, D.-F., Davidson, M., Inagami, T., & Carpenter, G. (1997) J. Biol. Chem. 272, 13463-13466

38. Brady-Kalnay, S. M., Mourton, T., Nixon, J. P., Pietz, G. E., Kinch, M., Chen, H., Brackenbury, R., Rimm, D. L., Del Vecchio, R. L., & Tonks, N. K. (1998) J. Cell. Biol. 141, 287-296

39. Cong, J., Geng, W., He, B., Liu, J., Charlton, J., & Adler, P. N. (2001) Development 128, 2793-2802

Claims

WHAT IS CLAIMED IS:

1. A method of evaluating a plurality of cellular genes, the method comprising: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is altered by a first and a second transcription factors such that, in at least 50% of the evaluated cells in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered, wherein the transcription factors are factors respectively encoded by the nucleic acids in a first and second cell of the plurality of cells.

2. The method of claim 1, wherein the step of identifying comprises grouping genes.

3. The method of claim 2, wherein the grouping is performed recursively.

4. The method of claim 2, wherein the grouping is a function of a similarity coefficient among genes in a candidate group.

5. The method of claim 4, wherein genes are assigned to a group if the similarity coefficient is greater than a threshold value.

6. The method of claim 5, wherein the grouping is performed recursively and the threshold value is decreased during subsequent iterations.

7. The method of claim 1, wherein the step of identifying comprises clustering genes as a function of the expression information.

8. The method of claim 7, wherein the clustering comprises use of hierarchical clustering, Bayesian clustering, or k-means clustering.

9. The method of claim 1, wherein the step of identifying comprises using self-organizing maps, shortest path analysis, Boolean networking, graphical modeling, and/or individual gene grouping.

10. The method of claim 1, wherein the step of identifying comprises principal component analysis.

11. The method of claim 1, wherein the step of identifying comprises translating the expression information into binary values, and identifying similar genes using the binary values.

12. The method of claim 11, wherein the step of identifying similar genes using the binary values comprises use of the Hamming Distance, a chi squared based measure, or a Fisher Exact test.

13. The method of claim 1, wherein the step of identifying comprises evaluating a metric of similarity or dissimilarity.

14. The method of claim 13, wherein the metric of similarity or dissimilarity is a multivariate distance measure.

15. The method of claim 13, wherein the multivariate distance measure is a Euclidean distance, Minkowski Distance, Mahalanobis Distance, Taxi-cab Distance, Canberra Metric, or Bray-Curtis Coefficient.

16. The method of claim 13, wherein, for each cell of the plurality of cells, the multivariate distance measure for expression of all of the genes in the set is within a predetermined value.

17. The method of claim 1, wherein the step of evaluating expression comprises evaluating proteins encoded by the plurality of cellular genes.

18. The method of claim 1, wherein the step of evaluating expression comprises isolating RNA, and hybridization of the RNA or a derivative nucleic acid thereof to one or more probes.

19. The method of claim 19, wherein the step of providing a plurality of cells comprises: (a) providing a library of nucleic acids, wherein each member of the library comprises a sequence encoding an artificial chimeric transcription factor operably linked to a promoter and (b) introducing members of the library into the plurality of cells.

20. The method of claim 1, wherein the nucleic acids encoding the different artificial, chimeric transcription factors comprise a randomly selected set of nucleic acids.

21. The method of claim 1, wherein the nucleic acids encoding the different artificial, chimeric transcription factors comprise a preselected set of nucleic acids, wherein each member of the preselected set of nucleic acids encodes an artificial, chimeric transcription factor that can alter a cellular phenotype.

22. The method of claim 1, wherein each artificial, chimeric transcription factor comprises a first zinc finger domain.

23. The method of claim 22, wherein each artificial, chimeric transcription factor further comprises a second and a third zinc finger domains.

24. The method of claim 22, wherein the first zinc finger domain is a naturally occurring zinc finger domain.

25. The method of claim 22, wherein the first zinc finger domain is a mammalian zinc finger domain.

26. The method of claim 1, wherein each cell of the plurality of cells is an animal cell.

27. The method of claim 26, wherein each cell of the plurality of cells is a human cell.

28. The method of claim 1, wherein each cell of the plurality of cells is maintained in culture, prior to the evaluating.

29. The method of claim 1, further comprising creating a database record that associates the genes of the identified set.

30. The method of claim 1, wherein the plurality of cells includes at least five cells.

31. The method of claim 1, wherein each cell of the plurality of cells is derived from a common parental cell.

32. The method of claim 1, further comprising determining whether proteins encoded by genes of the identified set physically interact.

33. A method of evaluating a plurality of cellular genes, the method comprising: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is altered by a first and a second transcription factors such that, in each of the evaluated cells in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered, wherein the transcription factors are factors respectively encoded by the nucleic acids in a first and second cell of the plurality of cells.

34. The method of claim 33, wherein the step of identifying comprises grouping genes.

35. A method of evaluating a plurality of cellular genes, the method comprising: providing a plurality of cells, each cell containing a nucleic acid encoding a different artificial, chimeric transcription factor; expressing the respective nucleic acids in the cells of the plurality; evaluating expression of a plurality of genes in each cell of the plurality to provide expression information; and identifying, from the expression information, a set of two or more genes whose expression is co-regulated among the cells of the plurality.

36. A machine accessible medium having encoded thereon information that represents a plurality of database records, each record of the plurality comprising information that indicates abundance of a transcript in a cell and a reference that identifies the cell, wherein the plurality of records comprises records for the same transcripts in at least two different cells, each cell comprising an artificial, chimeric transcription factor that differs among the different cells.

37. The medium of claim 36, wherein each transcription factor comprises a zinc finger domain.

38. The medium of claim 36, wherein the transcription factor is randomly selected for at least some of the cells.

39. The medium of claim 36, wherein the plurality of database records includes database records for at least 10 different transcription factors.

40. A method of evaluating a plurality of cellular genes, the method comprising: providing the medium of claim 36; and identifying, from information encoded on the medium, a set of two or more genes whose expression is altered by at least two of the different artificial, chimeric transcription factors, such that, in each evaluated cell in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered.

41. The method of claim 40, wherein the step of identifying comprises grouping genes.

42. The method of claim 41, wherein the grouping is performed recursively.

43. The method of claim 41, wherein the grouping is a function of a similarity coefficient among genes in a candidate group.

44. The method of claim 43, wherein genes are assigned to a group if the similarity coefficient is greater than a threshold value.

45. The method of claim 44, wherein the grouping is performed recursively and the threshold value is decreased during subsequent iterations.

46. A machine accessible medium having encoded thereon information that represents a plurality of database records, each record of the plurality comprising a transcript profile that indicates the abundance of a plurality of cellular transcripts and a reference that identifies the cell, wherein each cell includes an artificial, chimeric transcription factor, and the transcription factor differs among at least some of the cells.

47. A method of evaluating a plurality of cellular genes, the method comprising: providing the medium of claim 46; and identifying, from information encoded on the medium, a set of two or more genes whose expression is altered by at least two of the different artificial, chimeric transcription factors, such that, in each evaluated cell in which expression of one gene of the set is altered, expression of each of the other genes of the set is altered.