AU2002343465A1

AU2002343465A1 - Biological discovery using gene regulatory networks generated from multiple-disruption expression libaries

Info

Publication number: AU2002343465A1
Application number: AU2002343465A
Authority: AU
Inventors: Michiel De Hoon; Takao Goto; Seiya Imoto; Satoru Kuhara; Satoru Miyano; Christopher J. Savoie; Kousuke Tashiro
Original assignee: GNI Ltd
Current assignee: GNI Ltd
Priority date: 2001-09-26
Filing date: 2002-09-26
Publication date: 2003-04-07

Description

BIOLOGICAL DISCOVERY USING GENE REGULATORY

NETWORKS GENERATED FROM MULTIPLE-DISRUPTION

EXPRESSION LIBRARIES

Related Applications

This application claims priority under 35 U.S.C. Section 119(e) to U.S. provisional applications serial numbers 60/325,016, filed September 26, 2001; 60/334,372, filed November 29, 2001; 60/334,255, filed November 29, 2001; 60/334,230, filed November 29, 2001; 60/370,824, filed April 8, 2002; and 60/397,458 filed July 19, 2002. Each of the above provisional patent applications is herein incorporated fully by reference.

Field of the Invention

Embodiments of this invention relate to methods for elucidating network relationships between genes. Methods include Boolean logic, Bayesian estimation, maximum likelihood analysis, spline functions and other curve- fitting methods, and methods to determine edge relationships between groups of genes that are functionally related to each other.

BACKGROUND Gene regulatory networks elucidated from strategic, genome-wide experimental data can aid in the discovery of novel gene function information and expression regulation events from observation of transcriptional regulation events among genes of known and unknown biological function.

Methods designed to elucidate gene regulation pathways have been reported previously⁽¹'²'³⁾ . The inferred networks reported in these studies were derived from gene expression data sets derived from time course, cell cycle and environmental perturbation ^(4j5). However, control relationships inferred from such data sets are suspect since they are not based on comprehensive experimental data designed to elucidate transcription-related regulatory control functions. To rigorously and precisely identify novel and complex gene regulatory networks from de novo expression data sets, a systematic and integrated strategy of expression experiments on genomic deletion mutants combined with suitable computational methods is necessary ^{( "} Of particular importance in the creation of inferred regulatory networks is the biological relevance of the expression experiments from which the putative control relationships are derived. Genome-wide expression data generated from competitive hybridization disruption experiments offer the advantages of internal control and quantitative measurement ofthe direct effects of a gene's presence or absence on the expression of other genes. Selection of disruptant experiments to maximize the elucidation of control relationships is valuable for generating useful gene regulatory information.

SUMMARY The present invention provides methods useful for establishing interrelationships among genes or groups of genes in a system, e.g., establishing gene pathways or gene networks in a system. In one embodiment, gene networks or gene pathways can be constructed based on analyzing gene disruption expression profiles using improved Boolean methods, improved Bayesian methods, or a combination of Boolean and Bayesian methods, as provided by the present invention. In another embodiment, gene networks or gene pathways can be constructed based on analyzing expression profiles of genes affected by an agent, e.g., a drug. In yet another embodiment, gene networks or gene pathways affected by an agent can be constructed by analyzing the gene networks or pathways obtained from the gene disruption expression profiles and the agent affected expression profiles.

In general, gene disruption expression profiles can be obtained based on expression profiles of a library of genes, each of which disrupted individually or in combination with others, e.g., other genes in a related function to provide an expression profile. For example, a library of genes can be selected (e.g., an entire genome or a series of genes selected based on other criteria, including polymerase chain reaction (PCR) primers). Regardless of how the library of genes is selected, once selected, each gene of the library can be disrupted individually and/or in combination with others resulting in a collection of libraries, each of which containing at least one disrupted gene along with other non-disrupted genes. Thus, if one selects 100 genes, the resulting library will consist of at least 101 different sub-libraries e.g., one sub- library of non-disrupted genes or wild-type and at least one disruptant sub- library for each of the 100 genes. Thus, libraries of disrupted genes can be created and expression profiles for each sub-library and the entire library can be obtained.

Agent affected expression profiles can be obtained by administering one or more desired agents to a system containing selected genes and collecting expression profiles of the genes at different dosage or time point of agent administration. In some cases the agent has no effect on gene expression, in other cases the agent is inhibitory (e.g., decreases gene expression) and in other cases the agent increases gene expression (e.g., is an inducer).

Gene expression profiles can be obtained in a quantified form using any suitable means, e.g., using microarray. In one embodiment, a gene expression profile can be organized into gene expression matrix, e.g., a binary matrix categorizing the genes into affected and not affected. In another embodiment, one can introduce an "equivalence set" in which data is normalized, thereby revealing the quantitative relationships in gene expression. From an equivalence set, network information relating the genes with each other can be created. Then, one can use networks to determine functional relationships between genes. One can then use the deduced functional relationships between genes to predict drug and or biological effects. Then, those predictions can be experimentally tested using, for example, mircoarray experiments. This process can result in what we call a "final common pathway" of gene expression, that is, alteration in the function of one gene results in effects on genes "downstream" from the altered gene. According to one embodiment of the present invention, one can analyzing "downstream" effect from a directly affected gene by using the improved Boolean methods of the present invention. According to another embodiment of the present mvention, one can go "upstream" from an affected gene by applying an improved Bayesian method of the present invention. The present invention provides a new approach to Bayesian gene network analysis, in which non-linear, and/or non-parametric regression models are used. Using this approach means that one need not make any apriori assumptions about causal relationships, but rather can infer causal relationships, thereby providing "upstream" or "initial pathways" by which a gene observed to be affected by a drug or treatment is induced to alter its expression by other genes in an upstream relationship to the affected, observed gene. The present invention provides an improved criterion, herein termed "BNRC" for estimating a Bayesian graph of gene networks. Thus, the gene relationships are selected that minimize the BNRC criterion.

In other embodiments, other new methods are provided that not dependent upon Gaussian or other assumed variance in the data. Rather, in certain embodiments, we can measure the variance of the data and use the observed variance to affect the BNRC criterion. Using such non-parametric regression with heterogeneous error variances and interactions, one can optimize the curve fitting of data obtained, and predict how many experiments are needed to obtain data having a desired variance. Using methods such as these, one can obtain gene network information that has a desired degree of accuracy and reliability, and provides both upstream and downstream effects. In some embodiments, the new Bayesian model and penalized maximum likelihood estimation (PMLE) yield similar results.

In still other embodiments, relationships between two genes can be elucidated by analysis of a graph of expression of one gene compared to the expression of another gene. Such graphs may be linear or non-linear. To characterize the relationships, in some embodiments, linear splines, B-splines, Fourier transforms, wavelet transforms, or other basis functions can be used. In some cases, B-splines are conveniently employed.

Determining whether a particular gene or series of genes is important in regulating other genes can be very useful. However, in certain cases, outliers in the data can complicate interpretation of results. This problem can be especially troubling where outliers are near boundaries of gene groups in the overall network. Thus, in certain embodiments, boundary effects can be elucidated, in which the edge intensity and the degree of confidence of the direction of Bayes causality can be determined. In other embodiments, time-course of gene expression can be determined using linear splines. Using splines, time-ordered data can be more reliably analyzed that prior art, "fold-change" methods, which can be relatively insensitive.

Using one or more ofthe above-described general methods, the present invention provides gene network information that is useful and validated by certain results already known for the yeast genome. However, the methods are widely applicable to any genetic network (e.g., "transcriptome"), and to interactions at the level of proteins ("proteome"). Additionally, using one or more of the new methods, we have determined novel relationships between functional genes relating to antifungal therapies. Thus, using methods of this invention, one can predict putative therapeutic targets based on an understanding ofthe initial and final common pathways of gene expression.

BRIEF DESCRIPTION OF THE DRAWINGS This invention is described with respect to specific embodiments thereof, in which:

Figure 1 depicts a schematic diagram of model-based interactive biological discovery of this invention. Figure 2 depicts construction of a gene regulatory sub-network model, using a Boolean method of this invention. Figure 2 A depicts a Gene Expression Matrix, in which numerous profiles are integrated. Each element of the matrix represents a ratio of gene expression between a gene in a column and a gene in a row. Figure 2 B depicts binary relationships between genes to produce a Binary

Matrix. If gene 'Gl ' is deleted and the intensity of gene 'G2' is significantly altered as a result, then gene 'Gl ' affects gene 'G2' . Figure 2 C depicts an adjacency matrix of identified looped regulation genes. If gene 'G3' and gene 'G4' affect each other mutually, they form a loop (strongly connected component) regulation. Figure 2D depicts a step in which genes are partitioned into an Equivalence Set, which treats a loop logically as a "virtual gene". An equivalence set is a group of genes that affect each other or as a group affect one discrete gene. Figure 2E depicts a skeleton relation between virtual genes. The shortest path relationships should be selected to build hierarchical connections. Figure 2F depicts a regulation pathway formed from the skeleton matrix.

Figure 3 depicts results of a disruptant-based study and analysis of this invention in yeast in which 552 nodes representing the included genes and 2953 putative regulatory links among these genes.

Figure 4 depicts a transcription factor regulatory network model of this invention classified by cellular functional roles (CFR) in yeast. In this model, there were 98 transcription factors grouped by cellular functional roles according to information provided in the Yeast Proteome Database. Genes inside the circles are grouped into a given cellular functional category. Regulatory control relationships are depicted with colored lines. Colors indicate the category of genes from which the control relationships emanate. Blue: Carbohydrate metabolism, Bluish purple: Chromatin/Chromosome structure, Brown: Energy generation, Dark Green: Other metabolism, Gray: DNA repair, Green: Lipid, fatty acid metabolism, Light Green: Amino acid metabolism, Orange: Cell stress, Red: Meiosis/Mating response, Pink: Differentiation, Purple: Cell cycle. The bold red lines indicate regulatory control of transcription factors related to cell cycle originating from genes in the meiosis/mating response group. The bold blue lines indicate control relationships emanating from the carbohydrate metabolism group exerting influence over genes in the lipid fatty-acid metabolism group. Several control lines emanating from "carbohydrate metabolism " are depicted. In sharp contrast, only 2 individual genes, SKN7 and HMS2 independently influence cell cycle related gene expression.

Figure 5 depicts a detailed view of a gene regulatory sub-network of transcription factors reconstructed from Boolean analysis of gene expression experiments on disruptant mutant yeast strains. Black lines indicate a regulatory relationship, with the arrows showing the direction of expression influence. The colors and shapes of the nodes denote the general categories of cellular function ofthe gene product according to their descriptions in the YPD.

Genes related to cell division mechanisms are indicated with triangular nodes and genes related to DNA repair and chromosome structure are depicted with squares. The elucidated network shows novel topological control relationships among genes related to meiosis, the mating response and DNA structure and repair mechanisms via UME6 and MET28. Genes related to meiosis and mating response genes are downstream of a cascade regulated by IN02 and a further sub-grouping of genes related to DNA repair and structure appears hierarchically downstream OΪMET28.

Figure 6 depicts a flow chart showing methods of this invention for the estimation of network relationships between genes using Bayesian inference and minimization of a BNRC criterion.

Figures 7a and 7b depict conventional methods for analyzing drug response expression data. Figure 7a includes list of Yeast genes whose expression levels are significantly affected by Griseofulvin exposure at lOOmg at 1 minute (one of 20 experiments). Figure 7b depicts hierarchical clustering of expression data from drug response and gene disruption experiments.

DETAILED DESCRIPTION The descriptions that follow include specific examples drawn from studies in the yeast, Sarchomyces. The methods for analyzing relationships between yeast genes are equally applicable to analysis of relationships among genes of different species, including eukaryotes, prokaryotes, and viruses. Thus, the descriptions and examples that follow are illustrative and are not intended to limit the scope ofthe invention.

I. Biological Discovery with Gene Regulatory Networks in Yeast Generated from Multiple-Disruption Full Genome Expression Libraries

Gene regulatory networks elucidated from strategic, genome-wide experimental data can aid in the discovery of novel gene function information and expression regulation events form observation of transcriptional regulation events among genes of known and unknown biological function. Of particular importance in the creation of inferred regulatory networks is the biological relevance of the expression experiments from which the putative control relationships are derived. Genome-wide expression data generated from competitive hybridization disruption experiments offer the advantages of internal control and quantitative measurement of the direct effects of a gene's presence or absence on the expression of other genes. Selection of disruptant experiments to maximize the elucidation of control relationships is important for generating useful gene regulatory information. To create a reliable and comprehensive data set for the elucidation of transcription regulation on 120 genes with known involvement in transcription regulation. We report several novel regulatory relationships between known transcription factors and other genes with previously unknown biological function discovered with this expression library.

We have implemented, for the yeast genome, a systematic, iterative approach that combines full-genome biological expression experiments with gene regulatory inference, as shown in Figure 1. In certain embodiments, we start with the creation of a library of hundreds of full genome expression experiments, with one gene's expression disrupted per experiment. From this data we used computational techniques to infer approximations of gene expression regulatory relationships. We then examined the biological relevance of regulatory relationships with computerized visualization and simulation software and validated our findings on novel or biologically interesting subnetworks through other databases as well as through further experimentation, including combinatorial disruption experiments.

We constructed a gene expression data library using full-genome yeast c-DNA microarrays. The library was comprised of expression experiments on

120 yeast strains, each with one gene disrupted by homologous recombination. Each ofthe genes in this library was selected for experimentation because it was reported in the Yeast Proteome Database (YPD) to be a factor involved in transcription regulation. Previously reported gene regulatory networks show that genes can interact with themselves and with other regulatory genes⁽¹⁰⁾. To reconstruct hierarchical regulatory relationships from the expression library, in certain embodiments, we developed a novel Boolean algorithm that accommodates common looped regulatory relationships. As shown in Figure 2, gene regulatory relationships modeled by this method can be represented as a directed graph of up-regulation or down-regulation of gene expression between

2 given genes of the 5871 genes measured in the each expression experiments. We constructed a 552 genes member model of regulatory control relationships and then further constrained a sub-network model composed of 98 well known transcription factors. The resultant model, shown in Figure 3, contains a total of 552 nodes representing the included genes and 2953 putative regulatory links among these genes. certain embodiments, we classified transcription factors in the network model accordmg to cellular functional roles (CFR) as defined in YPD. Figure 4 shows the control relationships among classified transcription factors in the network. Only SKN7and HMS2 directly influenced expression of cell cycle related genes. We identified several control lines emanating from "carbohydrate metabolism" genes to all other functional gene groups. This finding is consistent with the energy dependent nature of many cellular processes and metabolic pathways.

As shown in Figure 4, a distinct feature is that expression levels of lipid fatty-acid metabolism transcription factors were exclusively under control of carbohydrate metabolism transcription factors. This relation was given an account of interactions among proteins involved in phospholipid synthesis pathway with the glucose response pathway, the lipid signaling pathway and other lipid synthesis pathways have been reported ⁽¹¹-⁾.

Our new methods were employed to further explore detailed relationships between expression regulatory genes, such as transcription factors with regulatory and non-regulatory genes, from all of gene expression experimental data. One can characterize regulatory roles of genes with unreported biological function by virtue of their expression control by and/or over genes with known function. Figure 5 shows an example of newly elucidated control relationships among transcription factors involved in cell division regulation and DNA replication/repair regulation. We found two discrete functional branches in the sub-network that correspond to cell division regulation and DNA replication/repair are linked by UME6 and MET28, indicating the important role of these two transcription factors in coordinating expression regulation of these interdependent regulatory pathways. MET28, as its name suggests, was previously characterized as a transcription factor related to methionine metabolism⁽¹² The novel putative role for Met28p in regulation of chromosome segregation is supported by its reported interaction with known chromosomal segregation component Smclp, as part of a larger nexus of chromosomal segregation proteins, in mating-type two-hybrid assays ⁽ .

Through sequence analysis of coding sequences and upstream regions of genes in the above-mentioned sub network, we validated the sequence level control mechanisms between transcription factors and their target genes and DNA binding sequences. In the case of UME6, which is known as a global transcriptional regulation of many meiotic genes ⁽¹⁴'¹⁵⁾ and its control system, we performed Multiple Expectation-Maximization for motif Elicitation (MEME) analysis of regions upstream of the 34 genes controlled by UME6 in our model ⁽¹⁶⁾. We found two consensus sequences, TAGCCGCCGA (SEQ ID NO: 1) and TGGGCGGCTA (SEQ ID NO: 2) that were present in 14.7% and 32.4% of the 34 genes respectively, and that had significant P values according to the MEME search. According to the TRANSFAC database, TAGCCGCCGA is defined as the binding site of Ume6p ⁽¹⁴⁾ and

TCGGCGGCA is reported to be the binding site of a represser of CAR1 ⁽¹⁷⁾ which were repressed by a three components complex containing Umeδp (Ume6p, Sin3p and Rpd3p) ⁽¹⁸⁾. Aside from the Ume6p related binding motifs, no other MEME consensus sequence was present upstream of the 11 meiosis related genes, a finding which suggests that these 11 genes are regulated exclusively by UME6 and that Ume6p directly influences their expression. Only two other genes possessed the putative binding sequence but did not show expression influence on the count of UME6 in our experiments.

Experimentally driven discovery of network models of expression control allows for specific biological insights relevant to gene regulatory pathways that are not readily reconstructed from the available biological literature or present in pre-compiled pathway databases. We have shown here that such a system is useful in discovering novel gene function information as well as novel regulatory mechanisms. Use of this and similar strategies to elucidate hierarchical regulatory pathways from full genome expression libraries will allow for rapid insight into transcription regulation that can be applied to rational drug discovery and agrochemical targeting.

Methods Boolean Network Inference Algorithms

A gene expression matrix E is created from a set of gene disruption experiments. The value of matrix element E(a,b) indicates the expression ratio of gene 'b' to the normal condition in the disruptant in which gene 'a' which is caused by the deletion of gene 'α'. (1) Using the gene expression matrix E, if the intensity of gene 'b' is changed higher than a given threshold value θ, or is changed lower than a given threshold II θ resulting from the disruption of gene 'α', it is defined that gene ' ' affects gene 'b' in directly or indirectly and the value of element (a,b) in the binary matrix R is set to 1; R(a,b)=l. Thus the binary matrix R is created by cutting the value of each element in the gene expression matrix E at the threshold (θ or 1/θ).

(2) In the binary matrix R, if there is the relation that gene and 'b' affect each other, that is R(a,b)=R(b,ά)=l, we cannot decide which gene is located at the upper stream. This is the limitation or disadvantage of this method, however, we introduce an equivalence set, which makes a set of the group consisting of genes affecting each other and the group is assumed to be one gene.

(3) Ordering genes (topological sort) Equivalence sets have the semi-order relation and we can be drawn up equivalence set in semi-order (topological sort) to infer the network

(4) Skeleton matrix; semi-ordered accessibility matrix between equivalence sets includes indirect affections. In order to remove them and to make a skeleton matrix, we set the rank to each equivalence set defined as follows: Equivalence set belonging to rank 1 gives no indirect affection to another equivalence sets. Equivalence set belonging to rank 3 gives direct affection to the sets with rank 2 and does indirect affection to ones with rank 1. After setting the rank to each equivalence set, remove all indirect affection from the semi-ordered accessibility matrix.

(5) Draw multi-level digraph; draw the lines between nodes based on the value of each element in the skeleton matrix.

Microarray experiments

Information on gene expression can be obtained using any conventional methods known in the art. For example, microarrays can be used in which complementary DNAs (cDNA) unique to each gene whose expression is to be measured are placed on a substrate. RNA obtained from samples can be analyzed by hybridization to corresponding cDNA, and can be detected using a variety of methods. In some embodiments, it can be desirable to use a microarray detectors and/or methods as described in United States Provisional Patent Application, Serial No: 60/382,669, filed May 23, 2002, herein incorporated fully by reference.

We collected gene expression data using full-genome yeast c-DNA microarrays ⁽¹³⁾. BY4741 (MATa, HIS3O1, LEU2O0, METI5O0, URA3O0) served as the wild type strain. Gene disruptions for strain BY4741 were purchased from Research Genetics, Inc. Cells were inoculated and grown in

YPD (1% yeast extract, 1% bacto-peptone, 2% glucose) at 30°C until OD₆₀₀ reached 1.0 in the logarithmic growth phase and harvested to isolate mRNA for assay of gene expression. The parental strain was the control used for each disruptant strain.

Data Normalization We measured the quantities of 5871 mRNA species from 155 disruptants by cDNA microarray assay. A difference in fluorescent strength between Cy3, Cy5 causes bias ofthe expression quantity ratio. We normalized the expression quantity ratios of each expression profile. The ratio bias had a fixed trend in each spotted block, thus we calculated a linear regression to normalize the mean value ratio of each block to 1.0. The logarithm value of the ratio was used to indicate the standard expression level, therefore we found the logarithm value of ratio and calculated the average and standard deviation of these log values (see tablel). The Standard Deviation (SD) of expression levels of all spotted genes from the c7 E<5 (YDR207Q disruptant expression array for which UMΕ6 is defined as a "Global Regulator" in YPD⁽³³⁾ disruptant was 0.4931, therefore we recognize that there may be an unacceptable number of errors in array data whose overall SD was larger than 0.5. Thus, in certain embodiments, one can eliminate such experiments from analysis. It can be appreciated that one can select the SD of the data in different ways. By selecting a SD of less than 0.5, one can take a relatively conservative approach, in that errors of the above type are relatively unlikely to affect the overall gene network inference. However, one can appreciate that SD of less than 0.4, less than 0.3, less than 0.2, less than 0.1, less than about 0.05, less than about 0.01, less than about 0.005, or less than about 0.001 can be selected.

Selection of Genes

In YPD, 314 genes were defined as "Transcription Factors", and 98 of these have previously been studied for control mechanism. The expression profile data of 552 genes including the genes that controlled by this 98 "Transcription factors" were selected from 5871 profiles. Thus we constructed the gene regulatory network from the expression profile data set based on the values of these 552 genes in 120 gene disruption experiments.

References

1. Pilpel, Y., Sudarsanam, P. & Church, G. M. Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genetics. 29, 153- 159 (2001). 2. Friedman, N., Linial, M., Nachman, I. & Pe'er, D. Using Bayesian Networks to Analyze Expression Data. J Comput Biol. 7, 601-620 (2000).

3. Somogyi, R. & Sniegoski, C. A. Modeling the complexity of genetic networks: Understanding multigene and pleiotropic regulation. Complexity. 1, 45-63 (1996).

4. Spellman, P. T., et al. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Cell Biol. 9, 3273-3297 (1998).

5. Gasch, A. P., et al. Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Mol Cell Biol. 17, 2099-2106 (1997).

6. Akutsu, T., Miyano, S. & Kuhara, S. Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. J. Comp. Biol. 7, 331(2000)

7. Akutsu, T., Miyano, S. & Kuhara, S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727(2000)

8. Brown, M. P. et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97, 262-267 (2000).

9. Ideker, T., et al. Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science. 292, 929-934 (2001).

10. Shoudan, L. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Sym. Bio. 3, 18-29 (1998).

11. Hiesinger, M., Roth, S., Meissner, E. & Schuller, H. J. Contribution of Cat8 and Sip4 to the transcriptional activation of yeast gluconeogenic genes by carbon source-responsive elements. Cwr Genet. 39, 68-76 (2001).

12. Kuras, L., Cherest, H., Surdin-Kerjan, Y., & Thomas, D. A heteromeric complex containing the centromere cinding factor 1 and two basic leucine zipper factors, Met4 and Met28, mediate the transcription activation of yeast sulfur metabolism. Embo Journal 15, 2519-2529 (1996). 13. Newman, J. R. S., Wolf, E. & Kim, P. S. From the Cover: A computationally directed screen identifying interacting coiled coils from Saccharomyces cerevisiae Proc. Natl. Acad. Sci. 97, 13203(2000).

14. Bowdish, K. S., Yuan, H. E. & Mitchell, A. P. Mol.Cell. Biol. 15, 2955(1995)

15. Rubin-Bejerano, I., Mandel, S., Robzyk, K. & Kassir, Y. Induction of meiosis in Saccharomyces cerevisiae depends on conversion of the transcriptional represssor Ume6 to a positive regulator by its regulated association with the transcriptional activator frnel. Mol. Cell. Biol. 16, 2518(1996)

16. Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings ofthe Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994. 17. Luche, R. M., R. Sumrada, M. & Cooper, T. G. A cis-acting element present in multiple genes serves as a repressor protein binding site for the yeast

CAR1 gene. Mol. Cell. Biol. 10, 3884(1990) 18. Messenguy, F., Nierendeels, F., Scherens, B. & Dubois, E. In

Saccharomyces cerevisiae, expression of arginine catabolic genes CAR1 and CAR2 in response to exogenous nitrogen availability is mediated by the

Ume6 (CargRI)-Sin3 (CargRII)-Rpd3 (CargRIII) complex. J. Bacteriol.

182, 3158(2000).

II. Estimation of Genetic Networks and Functional Structures Between

Genes Using Bayesian Networks and Nonparametric Regression

1. Introduction

Improvement of the genetic engineering provides us enormous amount of valuable data such as microarray gene expression data. The analysis of the relationship among genes has drawn remarkable attention in the field of molecular biology and Bioinformatics. However, due to the cause of dimensionality and complexity of the data, it will be no easy task to find structures, which are buried in noise. To extract the effective information from biological data, thus, theory and methodology are expected to be developed from a statistical point of view. Our purpose is to establish a new method for understanding the relationships among genes clearer. Example 1 below provides the mathematical basis for these methods.

In certain embodiments of this invention, Bayesian networks are adopted for constructing genetic networks using microarray gene expression data from a graph-theoretic approach. Friedman and Goldszmidt (1998) proposed an interesting method for constructing genetic links by using Bayesian networks.

They discretized the expression value of genes and considered to fit the models based on multinomial distributions. However, a problem still remains in choosing the threshold value for discretizing (by not only the experiments). The threshold value assuredly gives essential changes of affects of the results and unsuitable threshold value leads to incorrect results. On the other hand, recently, Friedman et al (2000) pointed out that discretizing could result in losses of information. To use the expression data as continuous values, they used Gaussian models based on linear regression. However this model can only detect linear dependencies and cannot produce a complete picture of relationships. We developed a new method for constructing genetic networks using Bayesian networks. To capture not only linear dependencies but also nonlinear structures between genes we use nonparametric regression models with Gaussian noise. Nonparametric regression has been developed in order to explore the complex nonlinear form of the expected responses without the knowledge about the functional relationship in advance. Due to the new structure of the Bayesian networks, a suitable criterion was needed for evaluating models. Thus, this invention includes a new criterion from Bayesian statistics. By using these methods we overcame defects of previous methods and attained more information. In addition, our methods include previous methods as special cases. We validated our methods through the analysis ofthe S. cerevisiae cell cycle data.

We found that using Bayesian analysis according to these methods, we could identify upstream causative relationships between genes, thereby identifying potential therapeutic targets.

2. Bayesian Networks and Nonparametric Regression

Using a Bayesian network framework, we consider a gene as a random variable and decompose the joint probability into the product of conditional probabilities. For example, if we have a series of observations of the random vector, we can denote the probability of obtaining a given observation can depend upon the conditional probability densities. In certain embodiments, one can use nonparametric regression models for capturing the relationships between the variables. A variety of graphic tools can be used to elucidate the relationships. For example, polynomials, Fourier series, regression spline gases,

B-spline bases, wavelet bases and the like can be used for defining a graph of gene relationships. One difficulty in selecting a proper graph is to properly evaluate variance and noise in the system.

3. Criterion for Choosing a Proper Graph

In certain embodiments, we can let π(θa Iλ) be the prior distribution on the unknown parameter ΘQ with hyper parameter vector λ and let log (ΘQ Iλ) = 0( ). The marginal probability of the data X„ is obtained by integrating over the parameter space, and we choose a graph G with the largest posterior probability. Friedman and Goldszmidt (1998) considered the multinomial distribution as the Bayesian network model and also supposed the Dirichlet prior on the parameter ΘQ . In this case, the Dirichlet prior is the conjugate prior and the posterior distribution belongs to the same class of distribution. Then a closed form solution ofthe integration in (4) is obtained, and they called it BDe score for choosing graph. A BDe score is confined to the multinomial model, and we propose a criterion for choosing graph in more general and various situations.

A problem of constructing criteria based on the above model is how to compute the integration. Some methods that can be used include Markov chain, and Monte Carlo methods. In certain embodiments of this invention, we use the

Laplace approximation for integrals. Thus, the optimal graph is chosen such that the criterion BNRC in Example 1 equation (5) is minimized.

This criterion is derived under log (Θ_G Iλ) = 0(n). If log (Θ_G Iλ) = 0(1), the mode ΘQ is equivalent to the maximum likelihood estimate, MLE, and the criterion resulted in a Bayesian information criterion, known as BIC by removing the higher order terms

0 (n^"J)(j >0). Konishi (2000) provided a general framework for constructing model selection criteria based on the Kullback-Leibler information and Bayes approach. As can be seen from Example 1 below, the final graph can be selected as a minimizer ofthe BNRC and does not have to minimize each local score, BNRC_/, because the graph is constructed as acyclic.

4. Estimating Graphs and Related Structures Using a BNRC

Criterion For example, one can express our method in an illustration. Essential points of our method are the use of the nonparametric regression and the new criterion for choosing graph from Bayesian statistics. As for nonparametric regression in Section 2 of the this Example 1, we use 2?-splines as the basis functions. Figure 1 of Example 1 is an example of 5-splines of degree 3 with equidistance knots t_\ . . . , 1 ₁₀. By using a backfitting algorithm, the modes can be obtained when the values of β^ are given. The backfitting algorithm is shown in Example 1 below.

The modes depend on the hyper parameters β^ and we have to choose the optimal value of β^. In the context of our method the optimal values of β^ are chosen as the minimizers of BNRC_j. The 5-splines coefficient vectors can be estimated by maximizing equation (6) of Example 1. The modes of equation (6) are the same as the penalized likelihood estimates and we can look upon the hyper parameters λ^ or ?j_k as the smoothing parameters in penalized likelihood. Hence, the hyper parameters play an important role for fitting the curve to the data.

5. Computation Experiments

We used Monte Carlo simulations to examine the properties of our method. Data were generated from an artificial graph and structures between variables (Figure 2 of Example 1) and then the results summarized as follows:

The criterion BNRC can detect linear and nonlinear structure of the data. A BNRC score may have a tendency toward over growth ofthe graph. Therefore, we use Akaike's information criterion, known as AIC and use both methods. AIC was originally introduced as a criterion for evaluating models estimated by maximum likelihood methods. But estimates by this method are the same as the maximum penalized likelihood estimate and is not MLE. The trace of Sj_k shows the degree of freedom of the fitted curve and is a great help. That is to say, if trSj_k is nearly 2, the dependency can be considered linear. We use both BNRC and AIC for deciding whether to add a parent variable. To validate these methods, we analyzed the S. cerevisiae cell cycle data discussed by Spellman et al. and Friedman et al. The data were collected from 800 genes and 77 experiments. We set the prior probability πo as constant, because we have no information about the size of the true graph. The nonparametric regressors are constructed of 20 5-splines. The number of B- splines can be varied as desired. The hyperparameters control the smoothness of fitted curve and it can be desirable to select hyper parameters and the number of 5-splines to provide a good fit to the data with a minimum of 5-splines.

The results of the analysis can be summarized as follows: Figure 3 of Example 1 shows BNRC scores when we predicted CLN2, CDC5 and SNS1 by one gene. The genes which give smaller BRΝC scores produce a more reliable effect on the target gene. We can observe that which gene is associated with the target gene and we find the gene set which strongly depend on the expression of the target gene. In face, we can construct a brief network by using these information. We can look upon the optimal graph as a revised version of the brief network by taking account of the effect of interactions. If there is a linear dependency between genes, the score BNRC is good when the parent-child relationship is reversed. Therefore, the directions of the causal associations in the graph are not strict, especially when the dependencies are almost linear. There are some genes that mediate Friedman et al. 's result, such as MCD1, CSI2, YOX1 and so on. Most of these genes have been reported to play an important role. A large number of the relationships between genes are nearly linear. But we could find some nonlinear dependencies which linear models can rarely find.

Figure 5 of Example 1 shows the estimated graph associated with genes which were classified by their processes into cell cycle and their neighborhood.

We omitted some branches in Figure 5, but important information is shown. As for the networks given by us and Friedman et al, we confirmed parent-children relationships and observed that both two networks are similar to each other. Especially, our network includes typical relationships which were reported by Friedman et al. As for the differences between both networks, we attention the parents of SVS1. Friedman et al. employed CLN2 and CDC5 as the parent genes of SVS1. On one hand, our result gives CSI2 and YKR090W for SNS1. Our candidate parent genes were found to be more appropriate than those of Friedman et al. Our model suitably fits to both cases in Figure 4 of Example 1. Especially, we conclude that CDC5 has weak effects on SNS1 compared with other genes in Spellman's data (see also Figure 3 of Example 1) In fact, as the parent gene of SNS1, the order of BΝRC score of CDC5 is 247^th. Considering the circumstances mentioned above, our method can provide us valuable information in understandable and useful form. Figure 6 schematically depicts a method of this invention for determining gene network relationships.

6. Discussion The new methods for estimating genetic networks from microarray gene expression data by using Bayesian network and nonparametric regression provide improved accuracy. We derived a new criterion for choosing graph theoretically, and represented its effectiveness through the analysis of the cell cycle data. The advantages of our methods include (1) we can use the expression data as continuous values; (2) we can also detect nonlinear structures and can visualize their functional structures being easily understandable; and (3) automatic search can accomplish the creation of an optimal graph.

Friedman et al's method retains known parameters such as threshold value for discretizing and hyper parameters in the Dirichlet priors selected by trial and error and were not optimized in a narrow sense. On the other hand, our method can automatically and appropriately estimate any parameter based on criterion which has sound theoretical basis.

In other embodiments we can derive criterion BNRC in more general situations. By way of example, we can construct graph selection criterion based on other statistical models.

III. Nonlinear Modeling of Genetic Network by Bayesian Network and Nonparametric Regression with Heterogeneous Error Variances and Interactions In other embodiments, one can use different statistical methods for constructing genetic networks from microarray gene expression data based on Bayesian networks. In these embodiments, we estimate the conditional distribution of each random variable. We consider fitting the nonparametric regression, models with heterogeneous error variances and interactions for capturing the nonlinear structures between genes. Once we set a graph using Bayesian network and nonparametric regression, a problem still remain to be solved in selecting an optimal graph, which gives a best representation of the system among genes. A new criterion for choosing graph from Bayes approach contains the previous methods for estimating networks using Bayesian methods. We demonstrated the effectiveness of proposed method through the analysis of

Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

1. Introduction Use of Bayesian networks can be effective methods in modeling the phenomena through the joint distribution of a large number of variables. Friedman and Goldszmidt discretized the expression values and assumed the multinomial distributions as candidate statistical models. Peter et al. investigated the threshold value for discretizing. Friedman et al. pointed out that discretizing can lose some information considered to fit the linear regression models, which analyze the data as continuous. However, the assumption that the parent genes depend linearly on the objective gene is not always warranted. Imoto et al described the use of nonparametric additive regression models for capturing not only linear dependencies but also nonlinear relationships between genes.

In certain embodiments, we use Bayesian networks and the nonparametric heteroscedastic regression which can be more resistant to effect of outliers and can also capture effects ofthe interactions of parent genes.

After setting the graph, we evaluate its goodness or closeness to the true graph, which is completely unknown. The construction of a suitable criterion becomes important. Friedman and Goldszmidt derived the criterion, BDe, for choosing a graph based on multinomial models and Dirichlet priors. However, unknown hyper parameters in Dirichlet priors remain and one can only set the values experientially. We derived a new criterion for choosing a graph from Bayes approach. The criterion automatically optimizes the all parameters model and gives the optimal graph. In addition, our method includes the previous methods for constructing genetic network by Bayesian network. To show the effectiveness of the new method we analyze gene expression data of Saccharomyces cerevisiae by disrupting 100 genes.

2. Bayesian Network and Nonparametric Heteroscedastic

Regression Model With Interactions Suppose that we have n sets of data {x_ls ...,x_n}of the ^-dimensional random variable vector X = and x? denotes the transpose of x. In microarray gene expression data, n and p correspond to the numbers of arrays and genes. Under Bayesian network framework, we considered a directed acyclic graph G and Markov assumption between nodes. The joint density function is then decomposed into the conditional density. Assuming that the conditional densities are parameterized by the parameter vectors θ₃ and the effective information is extracted from these probabilistic models.

We then used nonparametric regression strategy for capturing the nonlinear relationships between x, and py. In many cases, this method can capture the objective relationships very well. When the data, however, contain outliers especially near the boundary on the domain {p,₇}, nonparametric regression models sometimes can lead to unsuitable smoothed estimates, i.e., the estimated curve exhibits some spurious waviness due to the effects of outliers. To avoid this problem, the nonparametric regression model with heterogeneous error variances as described below in Example2.

Criterion For Choosing a Graph Once we set a graph, a statistical model (equation 8 of Example 2) based on the Bayesian network and nonparametric regression can be constructed and can be estimated by a suitable method. However, to choose the optimal graph, which gives a best approximation of the system underlying the data, it is undesirable to use the likelihood function as a model selection criterion, because the value of likelihood becomes larger in a more complicated model. Hence, a statistical approach based on the generalized or predictive error, Kullback- Leibler information, Bayes approach and so on (20) can be desirable. We therefore constructed a criterion for evaluating a graph based on our model

(equation 8 of Example 2) from Bayes approach.

A criterion for evaluating the goodness of the graph can be constructed from Bayesian theoretic approach as follows:

(1) Obtain the posterior probability of the graph by the product of the prior probability of the graph Q and the marginal probability ofthe data;

(2) Remove the standardized constant, the posterior probability of the graph is proportional to equation 9 of Example 2. (3) Using Bayes approach, choose a graph such that π(G I X_n) is maximum.

(4) Determine if the computation involves a high dimensional integral (equation 9 of Example 2).

(5) If Yes to (4), use equation 11 of Example 2, [see references 17 and 26 of Example 2 for integrals.

Thus, a graph is chosen such that the criterion BNRC is minimal. An advantage of using the Laplace method is that it is not necessary to consider the use of the conjugate prior distribution. Hence modeling in larger classes of distributions ofthe model and prior is attained.

3. Estimating Genetic Network

Nonparametric Regression

We present methods for constructing a genetic network based on the method described in Section 2 above. The nonparametric regressor (equation 5 of Example 2) has the two components: the main effects component represented by the additive model of each parent gene, and the interactions component. In the additive model, we construct each smooth function /«_/*(^') by 5-splines (equation 9 of Example 2) (see referencelδ). In the interaction terms, we use a Gaussian radial basis function. In the context ofthe regression modeling based on the radial basis functions, we used two methods for estimating the centers Z_β and the widths. This method can be termed "fully supervised learning." An alternative method determines the values by using only the parent observation data in advance. The latter method can be employed, and a k-means clustering algorithm for constructing the basis functions can be used. Details ofthe radial basis functions are described further in references 7, 21 and 23 of Example 2. Hyper parameters control the amount of overlapping among basis functions.

In error variances, we consider a heterocedastic regression model and assume the structure shown in equation 6of Example 2. The design of the constants can affect the accuracy of capturing the heteroscedasticty of the data. We set the weights according to Equation 12 of Example 2. If the hyper parameter * is set to 0, then the variances are homogeneous. If we use a large value ofp, then the error variances ofthe data, which exist near the boundary on the domain of the parent variables, are large. Hence, if there are outliers near the boundary, this method can reduce their effects and thereby gain suitably smoothed estimates using appropriate values ofp. 4. Real Data Analysis We demonstrated the effectiveness of the above methods through analysis of S. cerevisiae gene expression data. We observed the changes of expression level of genes on a microarray by gene disruption. By using this method we revealed gene regulatory networks between genes of Saccharomyces cerevisiae. We collected a large number of expression profiles from gene disruption experiments to evaluate genetic regulatory networks. Over 400 mutants are stocked and gene expression profiles are accumulated. We monitored the transcriptional level of 5871 genes spotted on a microarray by a scanner. The expression profiles of over 400 disruptants were piled up in our database. The standard deviation (SD) of all genes on a microarray was evaluated and the value of SD represented roughly the error of a experiment. We needed the value of 0.5 as the critical point of the standard deviation of the expression ratio of all genes. 107 disruptants including 68 mutants where the transcription factors were disrupted were selected from 400 profiles.

We used 100 microarrays and constructed the genetic network of 521 genes from the above data. 94 transcription factors whose regulating genes are identified were found, and the profiles of 421 genes controlled by 94 factors were selected from 5871 profiles. We constructed the nonparametric regression model using 20 2?-splines and 20 radial basis functions. We confirmed that the differences of the smoothed estimates against the various number of the basis functions can not visually be found, because when we use the somewhat large number of the basis functions, the hyper parameters control the smoothness of the fitted curves. We applied the two-genes effect to the interaction components. Hence, the effects of the interactions were obtained as the fitted surfaces and can be visually understandable. We showed the roles of the hyper parameters in the prior distributions and weight constants. Figure 1(a) of Example 2 shows the scatter plot of YGL237C and YEL071W with smoothed estimates by 3 difference values of the hyper parameters. The smoothed estimate depends on the values of the hyper parameters. Figures 1(b) of Example 2 depicts the behavior ofthe BNRC criterion of the two genes in Figure 1 (a). We can choose the optimal value of the hyper parameter as the minimizers ofthe BNRC, and the optimal smoothed estimates (solid curve) can capture structure between these genes well. The dashed and dotted curves are near the maximum likelihood estimate and the parametric linear fit, respectively. The effect ofthe weight constants wlj, . . . , wnj is shown in Figure 2(a) of Example 2. If we use a homoscedastic regression model, we obtain the dashed curve, which exhibits some spurious waviness due to the effect of the data in the upper-left corner. By adjusting the hyper parameter p in equation 12 of Example 2, the estimated curve resulted in the solid curve. The optimal value of p chosen by minimizing the BNRC criterion (see Figure 2(b) of Example 2). When the smoothed estimate is properly obtained, the optimal value of p tends to zero.

We employed a two-step strategy for fitting the non-parametric regression model with interactions. First, we estimated the main effects represented by the additive _δ-spline regressors. Next, we fit the interactions components to the residuals. Figure 3 of Example 2 depicts an example of the fitted surface to the relationship between YIL094C and its parent genes, YKL152C and YER055C. The interaction ofthe two parent genes leads to over expression when both parent genes increased.

In Saccharomyces cerevisiae, the GCN4 gene encodes the transcriptional activator of the "general control" system of amino acid biosynthesis, a network of at least 12 different biosynthetic pathways [reference 6 of Example 2]. Experiments showed that the consequences of the general control response upon the signal "amino acid starvation" induced by the histidine analogue 3-aminotriazole with respect to GCN4p levels. GCN4 activates transcription of more than 30 genes involved in the biosynthesis of 11 amino acids in response to amino acid starvation or impaired activity of tRNA synthetases (see reference 24 of Example 2). Purine biosynthetic genes ADE1, ADE4, ADE5,7 and ADE8 display GCN4-dependent expression in response to amino acid starvation [24]. GCN4 activates transcription of biosynthetic genes for amino acids and purines in response to either amino acid or purine starvation

[24]. Those results of experiments show that there are strong relationships between purine metabolism and amino acid metabolism through GCN4. Our map of relationships fit well known relationships between purine and amino acid metabolism. IV. Bootstrap Nonlinear Modeling of Genetic Network by Using Bayesian Network and Nonparametric Heteroscedastic Regression

In other embodiments, we modified the above approaches. A relevant feature of Bayesian network construction is in the estimation of conditional distribution of each random variable. We fit non-parametric regression models with heterogeneous error variances to the microarray gene expression data to capture nonlinear structures between genes. A problem still remained to be solved in selecting an optimal graph, which gives the best representation ofthe system among genes. We derived a new graph selection criterion from Bayes approach in general situation. The proposed method includes previous methods based on Bayesian networks. We also used a method for measuring the edge intensity and the degree of confidence of the direction of Bayes causality in an estimated genetic network. We demonstrated the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

Once we set the graph, it is desirable to evaluate its goodness or closeness to the true graph, which is unknown. For our method we needed to establish the method for measuring the edge intensity. We used a "bootstrap" method (Efron, 1979; Efron and Tibshirani, 1993) to solve this problem. By using this method, one can measure not only the intensity of edge, but also the degree of confidence of Bayes causality. To show the effectiveness of the proposed method, we analyzed gene expression data of Saccharomyces cerevisiae newly obtained by disrupting 100 genes.

1. Bayesian Network and Nonparametric Heteroscedastic

Regression for Non-Linear Modeling of a Genetic Network 1.1 Nonlinear Baysian Network Model

As described elsewhere in this application, in the Bayesian network framework, one can consider a directed acyclic graph G and Markov assumption between nodes. The joint density function is then decomposed into the conditional density of each variable. Through formula 1 of Example 3 below, the focus of interest in statistical modeling by Bayesian networks is how one can construct the conditional densities, f. One can assume that the conditional densities, Jj , are parameterized by the parameter vectors, and information is extracted from these probabilistic models as described further in

Example 3.

In certain embodiments, one can use nonparametric regression strategies for capturing nonlinear relationships between xij and Pij and suggested that there are many nonlinear relationships between genes and the linear model hardly achieves a sufficient result. In many cases, these methods can capture the objective relationships well. When the data, however, contain outliers, especially near the boundary of the domain, standard nonparametric regression models sometimes lead to unsuitable smoothed estimates, i.e., the estimated curve exhibits some spurious waviness due to the effects of the outliers. To avoid this problem, we fit a nonparametric regression model with heterogeneous error variances. If the number of the parameters in the model is much larger than the number of observations, it has a tendency toward unstable parameter estimates.

In certain cases it can be desirable to use nonparametric regression instead of linear regression, because linear regression cannot evaluate the direction of the Bayes causality or can lead to the wrong direction in many cases. We show an advantage of the new model compared with linear regression through a simple example. Suppose that we have data of genel and gene 2 in Figure 1(a) of Example 3. We consider the two models genel>gene2 and gene2>genel, and obtain the smoothed estimates shown in Figures (b) and

(c), respectively of Example 3. We then can decide that the model (b: genel — gene2) is better than (c: gene2 - genel) by the our criterion, which is derived in a later section (the scores of the models are (b) 120.6 (c) 134.8). Since we generated this data from the true graph gene→gene2, our method yields the correct result. However, if we fit the linear regression model to this data, the mode (c) is chosen (the scores of the linear models are (b) 156.0 (c) 135.8). The method based on linear regression yields an incorrect result in this case, whereas non-linear regression analysis yields a correct result.

Even in the case where the relationship between genes is almost linear. Our method and linear regression can fit the data appropriately. However, using the linear model it is difficult to decide the direction of Bayes causality.

We have previously described the criterion BNRC. We derived the criterion, NRC_hetero, for choosing the graph in a general framework. By using the equation (8), the BNRC_het_ero score of the graph can be obtained by the sum of the local scores, BNRChetero. The optimal graph is chosen such that the criterion BNRC_/!e,_ero (equation 7 of Example 3) is minimal. An advantage ofthe Laplace method is that it is not necessary to consider the use of the conjugate prior distribution. Hence the modeling in the larger classes of distributions of the model and prior is attained. Genetic networks are estimated as described further in Example 3. However, regarding error variances, we consider a heteroscedastic regression model and assume the structure shown in equation 3 of Example 3.

Learning Network In Bayesian network literature, determining the optimal network is an

NP-hard problem. To resolve the networks, one can use a "greedy hill- climbing" algorithm as follows:

(1) Step 1 : Make the score matrix whose (i, j) -th elements is the BNR _e/ero score ofthe graph genet -» gene,-; (2) Step 2: For each gene, implement one of three procedures for an edge: add, remove, reverse, which gives the smallest BNRChetero,'

(3) Step 3: Repeat Step 2 until the BNRC eter_o does not reduce further; and

(4) Permute the computational order of genes and make many candidate learning orders in Step 3. Step 4 can be desirable in situations in which a greedy hill-climbing algorithm produces many local minima and the result depends on the computational order of variables. Another problem of the learning network is that the search space of the parent genes is wide when the number of genes is large. Is such circumstances, one can restrict the set of candidate parent genes based on the score matrix, which is given by Stepl .

One also can use this learning strategy for learning genetic network and showed the effectiveness of their method by the Monte Carlo simulation method. We also checked the efficiencies of our new model through the same Monte Carlo simulations and found improvements due to the nonparametric heteroscedastic regression model, We illustrate the effectiveness of the heteroscedastic regression model in the next subsection.

Hyper parameters Consider the nonparametric regression model defined in equation 4 of

Examρle3. The estimate θ_} is a mode of log π (θj K ) and can depend on the hyper parameters used. We constructed the nonparametric regression model using 20 .B-splines. We confirmed that the differences of the smoothed estimates against the various number of the basis functions cannot be visually detected. When we used a somewhat large number of basis functions, the hyper parameters control the smoothness ofthe fitted curves. (Figure 3(a) of Example 3 shows the scatter plot of YGL237C and YEL071W with smoothed estimates for 3 different values ofthe hyper parameters. The details ofthe data are shown in later section. The smoothed estimate strongly depended on the values ofthe parameters. Figures 3(b) of Example 3 depicts the behavior of the BNRCtø_ero criterion of the two genes in Figure 3(a). One can choose the optimal value of the hyper parameter as the minimizer of the BNRC_/,_etero and the optimal smoothed estimate (solid curve in Figure 3(a) can capture the structure between these genes well. The dashed and dotted curves are near the maximum likelihood estimate and the parametric linear fit, respectively. Effects of the weight constants wij, . . . wnj is shown in Figure 4(a) of Example 3. If one uses the momoscedastic regression model, we obtain the dashed curve, which exhibits some spurious waviness due to the effect of data in the upper-left corner. By adjusting the hyper parameter pj in equation 9 of Example 3, the estimated curve resulted in the solid curve. The optimal value of pj can be also chosen by minimizing the BN C/je/_ero criterion see Figure 4(b) of Example 3. When the smoothed estimate is properly obtained, the optimal value of pj tends to zero.

Finally, in Section 3 of Example 3, we provide an algorithm for estimating the smoothed curve and optimizing the hyper parameters.

Step 1: Fix the hyper parameter p/,

Step 2: Initialize: γ_jk ____= ₀, k - 1, ..., q_j,

Step 3: Find the optimal _j by repeating Steps 3-1 and 3-2;

Step 3-1: Compute: jk = (B^τ _jk W_jkB_jk + nβ_jkK_jh) ^"7 B^T _JkW_jk x (x_(i) - Σ B_β-r_jk)' , k'≠k for fixed _jk.

Step 3-2: Evaluate: Repeat Step 3-1 against the candidate value of β * and choose the optimal value of β _/t, which minimizes the BNRC;_!e(_ero . Step 4: Convergence: repeat Step 3 for k = 1, . . . , qj, 1, . . . , qj, 1, . . . until a suitable convergence criterion is satisfied.

Step 5: Repeat Step 1 to Step 4 against the candidate value of pj, and choose the optimal value of pj, which minimizes the

Bootstrap Edge Intensity and Degree of Confidence of Bayes

Causality

We measured the intensity of the edge and the degree of confidence of the direction of the Bayes causality in the estimated genetic network by the bootstrap method The algorithm can be expressed as follows: (1) Make a bootstrap gene expression matrix X*_n = {x*j, . . . x*„}^τ. by randomly sampling n times, with replacement, from the original gene expression data

\X , . . . , X ) } (2): Estimate the genetic network from X*„ based on the proposed method; and

(3): Repeat Stepl and Step 2 times.

From this algorithm, we obtain T genetic networks. We define the bootstrap intensity of edge and direction of Bayes causality as follows:

Edge intensity

If the edges genei —> gene_j and genβ_j —> genei exist t_/ and t₂ times in the T networks, respectively, we define the bootstrap edge intensity between gene,- and gene_j as (t; + tj)IT.

Degree of Confidence of the Bayes Causality

If ti > t , we adopt the direction gene_t - gene_j and define that the degree of confidence of causality is t/ ty + ti). However, we can use a certain threshold. For example, we set a real number δ and decide genei —> gene₂ if

At least two ways can be used to show the resulting network. First, it can be desirable to determine the edge intensity and the degree of confidence of direction of Bayes causality in the original genetic network. One can add the intensity to each edge in the original network. From this network, one can find the reliability ofthe original network. Second, one can superpose the bootstrap networks and original network. However, the superposed network contains edges that have small intensities. Therefore, we can set a certain threshold value and remove the edges whose intensities are under the threshold. While setting the threshold remains a problem, it is just a visualization problem. We note that the superposed network may not hold the acyclic assumption, but much effective information are in this network.

Real Data Analysis

We illustrate the effectiveness of our methods through the analysis of

Saccharomyces cerevisiae gene expression data. We monitored the transcriptional level of 5871 genes spotted on a microarray by a scanner. The expression profiles of over 400 dirsuptants were stored in our database. The standard deviation (SD) of the levels of all genes on a microarray was evaluated. The value of SD represents roughly the experimental error. In our data, we considered the value of 0.5 as the critical point of the accuracy of experiments. We evaluated the accuracy of those profiles based on the standard deviation of the expression ratio of all genes. 107 disruptants including 68 mutants where the transcription factors were disrupted could be selected from 400 profiles.

It can be appreciated that other values of SD can be considered critical for accuracy. For example, SD values can be 0.4, 0.3, 0.2, 0.1, about 0.05, about 0.01, about 0.005, about 0.001, or any other value considered to produce information having a desired reliability.

We used 100 microarrays and constructed a genetic network of 521 genes from the above data. The 94 transcription factors whose regulating genes have been clearly identified were found. The profiles of the 521 genes controlled by those 94 factors were selected from 5871 profiles.

Table 1 of Example 3 shows the gene pairs with high bootstrap edge intensities. "Inte. " and "Dire. " mean the bootstrap edge intensity and the degree of confidence of direction of Bayes causality, respectively. "F. " is the function of parent gene, i.e. "+" and "-" are respectively, induce and repress. If we cannot decide whether the function is to induce or repress, we enter "0" in "F. ". Figure 5 of Example 3 shows the resulting partial network by superposing 100 bootstrap networks. We denote the edge intensity by the line width, and the number next to the line is the degree of confidence of the direction of Bayes causality.

From Table 1, we conclude the following: Over 60% gene pairs in the Table 1 agree with the biological knowledge. Most of the genetic sets, whose value of edge intensity equals 1, have the relation of "Related protein" in the YPD data base. Some other genetic sets, like ARO genes and PDR genes, have the same regulatory systems. These results indicated that there were some relations between genes which were previously unknown and revealed for the first time using the methods of this invention. We also notice that the other 9 genetic sets have the relation "Functional genomics". The relation of "Functional genomics" indicated that some relations were revealed by previous information derived from large-scale, high-throughput experiments such as microarray analysis, designed to uncover properties of large groups of proteins.

Studies on the regulation of the purine biosynthesis pathway in Saccharomyces cerevisiae revealed that all the genes encoding enzyme required for AMP de novo biosynthesis are repressed at the transcriptional level by the presence of extracellular purines. Two transcriptional factors, named Baslp and

Bas2p, are required for regulated activation ofthe ADE genes (Daignan-Fornier and Fink, 1992) as well as some histidine biosynthesis genes (Denis et al., 1998). Those purine biosynthetic genes, like ADE1, ADE4, ADE5,7 and ADE8, display GCN4-dependent expression in response to amino acid starvation (Rolfes and Hinnebusch, 1993). The starvation for purines stimulates

GCN4 translation by the same mechanism that operates in amino acid-starved cells leads to a substantial increase in HIS4 expression, one of the targets of GCN4 transcriptional activation. Figure 5 indicates that those ADE genes and histidine biosynthesis genes are related with both BAS1 and GCN4. A desirable feature of the methods of this invention involve the use of non-parametric heteroscedastic regression for capturing nonlinear relationships between genes and heteroscedasticity of the expression data. Therefore, our methods can be useful in the analysis of an unknown system, such as human genome, genomes from other eukaryotic organisms, prokaryotic organisms and viruses.

V. Statistical Analysis of a Small Set of Time-Ordered Gene Expression Data Using Linear Splines

Recently, the temporal responses of genes to changes in their environment has been investigated using cDNA microarray technology by measuring the gene expression levels at a small number of time points. Conventional techniques for time series analysis are not suitable for such a short series of time-ordered data. The analysis of gene expression data has therefore usually been limited to a fold-change analysis, instead of a systematic statistical approach.

We use the maximum likelihood method together with Akaike's Information Criterion to fit linear splines to a small set of time-ordered gene expression data in order to infer statistically meaningful information from the measurements. The significance of measured gene expression data is assessed using Student's t-test.

Previous gene expression measurements of the cyanobacterium Synechocystis sp. PCC6803 were reanalyzed using linear splines. The temporal response was identified of many genes that had been missed by a fold-change analysis. Based on our statistical analysis, we found that about four gene expression measurements or more are needed at each time point. These conclusions and further description can be found in Example 4 herein below. 1. Introduction

In recent years, many cDNA microarray experiments have been performed measuring gene expression levels under different conditions. Measured gene expression data have become widely available in publicly accessible databases, such as the KEGG database (Nakao et al., 1999).

In some of these experiments, the steady-state gene express levels are measured under several environmental conditions. For instance, the expression levels of the cyanobacterium Synechocystis sp. PCC6803 and a mutant have been measured at different temperatures, leading to the identification of the gene Hik33 as a potential cold sensor in this cyanobacterium (Suzuki et al.,

2001).

In other experiments, the temporal pattern of gene expression is considered by measuring gene expression levels at a number of points in time. Gene expression levels that vary periodically have for instance been measured during the cell cycle of the yeast Saccharomyces cerevisiae (Spellman et al.

1998). The gene expression levels of the same yeast species were measured during the metabolic shift from fermentation to respiration (DeRisi et al 1997). In those experiments, environmental conditions were changing slowly over time. Conversely, gene responses to an abruptly changing environment can be measured. As an example, the gene expression levels of the cyanobacterium

Synechocystis sp. PCC 6803 were measured at several points in time after a sudden shift from low light to high light (Hihara, 2001).

In cDNA microarray experiments, gene expression levels are typically measured at a small number of time points. Conventional techniques of time series analysis, such as Fourier analysis or autoregressive or moving-average modeling, are not suitable for such a small number of data points. Instead, the gene expression data are often analyzed by clustering techniques or by considering the relative change in the gene expression level only. Such a "fold- change" analysis may miss significant changes in gene expression levels, while it may inadvertently attribute significance to measurements dominated by noise. In addition, a fold-change analysis may not be able to identify important features in the temporal gene expression response.

Several techniques to analyze gene expression data, such as deriving Boolean or Bayesian networks, have been used in the past (Liang et al., 1998 Akutsu et al, 2000; Friedman et al; 2000). Describing gene interactions in terms of a regulatory network is can be desirable, and developing a network model may benefit from gene expression data obtained for a large number of time points. However, data for large numbers of time points for a large number of genes is currently not available. Although the number of genes in a given organism may be on the order of several thousands, gene expression levels are often measured at only five or ten points in time.

So far, a systematic method has been lacking to statistically analyze gene expression measurements from a small number of time-ordered data. We developed a strategy based on fitting linear spline functions to time-ordered data using the maximum likelihood method and Akaike's Information Criterion

(Akaike, 1971). The significance of the gene expression measurements is assessed by applying Student's t-test. This allows us to infer information from gene expression measurements while taking the statistical significance of the data into consideration. This kind of analysis can be viewed as an additional step toward building gene regulatory networks. As an example, we reanalyzed the gene expression measurements of the cyanobacterium Synechocystis sp. PCC 6803 (Hihara, 2001). Using linear splines, information can be deduced and measured data that is missed by methods using the fold-change only. By repeating our analysis with a subset of the available data, we determined how many measurements are needed at each time point in order to reliably estimate the linear spline function.

2. Methods

Student's -test We assessed whether the measured gene expression rations are significantly different from unity. If for a specific gene, we can conclude that for all time points the measured expression rations are not significantly different from unity, we can eliminate that gene from further analysis. The significance level can be established by applying Student's t-test for each time point separately. Since multiple comparisons are being made for each gene, the value ofthe significance level a should be chosen carefully.

We define H₀ ^ω as the hypothesis that for a given gene the expression ratio is equal to unity at a given time point t,- and H₀ as the hypothesis that for a given gene the expression rations at all time points are equal to unity. If we denote α as the significance level for rejection of hypothesis H₀, and ' as the significance level for rejection of hypothesis Ho^(l), then α' and are related via 1-α = (1 - α')^α, in which a is the number of time points at which the gene expression ratio was measured. Note that by expanding the right hand side in a first order Taylor series, this equation reduces to Bonferoni's method for adjusting the significance levels (see also Anderson and Finn, 1996).

By performing Student's t-test at every time point for each gene using α' as the significance level, we will find whether H₀ ^(l) and therefore H₀ should be rejected. If Ho is not rejected, we can conclude that that gene is not significantly affected by the experimental manipulation, and should therefore not be included in further analyses. If for a given gene the null hypothesis H₀ is rejected, we conclude that that gene was significantly affected by the experimental manipulation.

Analyzing time-ordered data using linear splines

Next, we analyzed temporal gene expression response for genes that were found to be significantly affected. The measured gene expression ratios formed a small set of time-ordered data, to which we fit a linear spline function.

A linear spline function is a continuous function consisting of piecewise linear functions, which are connected to each other at knots (Friedman and Silverman, 1989; Higuchi, 1999). Whereas cubic splines are used more commonly, for the small number of data pints we can use linear spline functions more suitably. A conceptual example of a linear spline function with knots t _\, t ₂, t^* ₃, t^* ₄, is shown in Figure 1 of Example 4. We wish to fit a nonparametric regression model of the form xj - g(tj) + ε_j to these data, in which g is a linear spline function and εj is an independent random variable with a normal distribution with zero mean and variance σ².

We estimated the linear spline function g using the maximum likelihood method. The probability distribution of one data point xj given t j, is shown in equation 3 of Example 4. The log-likelihood function for the n data points is then given by equation 4 of Example 4. The maximum likelihood estimate of the variance σ² can be found by maximizing the log-likelihood function with respect to σ². This yields equation 5 of Example 4.

The fitted model depends on the number of knots q. The number of knots can be chosen using Akaike's Information Criterion, known as the AIC

(Akaike, 1971; see also Priestly, 1994). For each value of q, we calculated the value of the AIC after fitting the linear spline function as described in Example 4, and selected the value of q that yields the minimum value of the AIC. The case q=2 corresponds to linear regression. For the special case q=l, we effectively fit a flat line to the data. If we find that for a particular gene, the minimum AIC is achieved for the constant function (q = 1), then we can conclude that the expression level of that gene was unaffected by the experimental manipulations. Gene expression data are typically given in terms of expression rations. At time zero, the expression ration is equal to unity by definition. This fixed point can be incoφorated easily in our methodology by modifying equation 7 of Example 4. The minimum value of D² can be achieved by choosing the linear spline function shown in equation 17 of Example 4.

3. Results Student's t-test We herein illustrate using Student's t-test and fitting linear spline functions, by reanalyzing measured gene expression profile of the cyanobacterium sp. PCC 6803 after a sudden exposure to high light (HL) (Hihara et al, 2001). The expression levels of 3079 ORFs were measured at zero, fifteen minutes, hone hour, six hours, and fifteen hours both for cyanobacteria exposed to HL and cyanobacteria that remained in the low light (LL) condition. Table 1 of Example 4 shows the number of measurements at each time point. Data from the cDNA expression measurements were obtained from the KEGG database (Nakao et al., 1999). The data used for the original analysis (Hihara, 2001) may not be identical to the raw data submitted to KEGG (Hihara, personal communication). In addition, two measurement sets out of six at the t - 15 minutes time point are missing in the KEGG database. Recalculating the gene expression rations from the raw data gives numbers close to the previously published results. After subtracting the background signal intensities from the HL and LL raw data, global normalization was applied and the ration of the HL to the LL signal intensities was calculated to find the relative change in gene expression level with respect to the control (LL) condition. In the fold-change analysis, a gene was regarded as being affected by HL if its expression level changed by a factor of two or more. The statistical significance of such changes was assessed heuristically by considering the size of the standard deviation of the measurements (Hihara, 2001).

The results ofthe Student's t-test on the gene expression rations of each gene separately are shown in Table 2. At a significance level of α = 0.001, 167 genes were found to be significantly affected by HL condition. Note that we would expect about three type-1 errors among these 167 genes. In comparison, 164 ORFs were found to be affected by the HL condition in the original analysis (Hihara, 2001).

By considering the fold-change for the psbD2 gene (slr0927), it was concluded that it was not significantly induced by HL (Hihara, 2001). This was remarkable, since this gene had been reported to be inducible by HL in the cyanobacterium Synechococcus sp. PCC 7942 (Bustos and Golden, 1992; Anandan and Golden, 1997). However, performing Student's t-test on the gene expression data for the psbD2 gene at t = 6 hours yields p - 3.3 x 10^"5, suggesting that this gene was indeed affected by HL.

Analysis Using Linear Spline Functions

Next we fit linear spline functions to the measured gene expression ratios. The number of knots q were between one and five, with a fixed knot equal to unity at time zero. For q = 3 and q = 4, three possibilities exist for the placement of the knots between the linear segments of the linear spline. These are indicated in Figure 2 of Example 4, together with the cases q = l, q ~ 2, and q = 5. Notice that the number of possible knot placements increases exponentially with the maximum number of knots q_max. In the fold-change analysis, temporal gene expression patterns were classified into six categories (Hihara, 2001), listed in Table 3 of Example 4. Fitting a linear spline function to the measured gene expression data provided a more flexible way to describe the data than categorizing. In addition, a numerical description of the gene expression response pattern is an important step in deriving a gene regulatory network.

Analyzing Expression Data Using Linear Splines

We next illustrate the usage of the AIC with an example. In the fold- change analysis, the threonine synthase gene thrC (slll688) was found to be repressed at the approximately one hour. The calculated values of the AIC for the different sets of knots are listed in Table 4. The minimum AIC was achieved for knots at 0, 15 minutes, 1 hour, and 15 hours. Figure 3 of Example 4 shows the measured gene expression levels, together with the linear spline that was fitted to the data. Performing this procedure for all the gene expression measures let us classify different genes based on their time-dependent response to HL. Several choices can be made as to which genes to include in this analysis. In the original analysis, genes were removed from the calculation if their expression levels were among the 2000 lowest out of 3079 ORFs (Hihara, 2001).

Alternatively, we can eliminate genes if Student's t-test showed that they were not significantly affected by HL. Table 5 shows the number of genes whose measured expression levels correspond to each pattern for these different cases. For Student's t-test, a significance level q = 0.001 was used. We compared the 167 genes which were identified by Student's t-test as significantly affected by HL with the results from the fold-change analysis (Hihara, 2001), in which 164 ORFs were identified. First, we removed those genes from our analysis for which an outlier was present in the data. We define an outlier as a data point that deviates more than two standard deviations from the mean ofthe data at a given time point. Only one gene was found for which the measured expression data contained an outlier; the linear spline function fitted to its expression data was a flat line. None of the other gene expression levels was described by a flat line, which is consistent with the results from Student's t-test. Next, we removed those genes whose expression level was among the lowest 2000 in order to avoid using data that are dominated by noise. The same procedure had been used for the fold-change analysis (Hihara, 2001). After removal of these genes, 107 genes remained that were sigmficantly affected by HL. Ofthe 107 genes, 42 had not been identified in the fold-change analysis

(Hihara, 2001). These genes are listed in Table 6 of Example 4, together with the location of the knots that we found for each gene. For each linear spline function the percentage variance explained was calculated as a measure of the goodness of fit. As an example, Figure 4 shows the measured gene expression ration as well as the fitted linear spline function of the gene sylR (slr0329), having four knots at zero, fifteen minutes, one hour, and fifteen hours. Of the 42 genes, the gene xylR (slr0329) had the largest percentage variance explained (98.7%). Of the 164 ORFs that were identified in the fold-change analysis, 39 were not significantly affected by HL according to Student's t-test at a significance level q ~ 0.001. These ORFs are listed in Table 7 of Example 4.

Finally, we established whether the number of measurements at each time point was sufficient to reliably determine the placements of knots for the linear spline function. To do so, we repeated the estimation of the linear spline function. To do so, we repeated the estimation of the linear spline function using subsets of the measured data. We then counted for how many genes the estimated knot positions changed if a subset of the data was used instead of the complete set of data. The average and standard deviation of this number for four, three, and two data points at each time point is shown in Table 8.

Even if only two data points (at the one hour time point) are removed, and four data points are used at each time point in 15% of the cases the estimated knot positions changed. Thus, in certain embodiments, four or more data points are needed for each time point to reliably deduce information from gene expression measurements.

Discussion

We have described a strategy based on maximum likelihood methods to analyze a set of time-ordered measurements. By applying Student's t-test to the measured gene expression data, we first established which of the measured genes are significantly affected by the experimental manipulation. The expression responses of those genes were then described by fitting a linear spline function. The number of knots to be used for the linear spline function was determined using Kike's Information Criterion (AIC).

Using linear spline functions permits more flexibility in describing the measured gene expression than using a nominal classification. Also, to set up a gene regulatory network, it can be desirable that the gene response as determined from gene expression measurements is available in a numerical form. Finally, positions of the knots specify those time points at which the expression of a gene changes markedly, which can be desirable in identifying its biological function. Classification of gene expression responses based on the position of the knots can be refined by creating subcategories that take the magnitude of the linear spline function at the knots into account. For instance, for linear spline functions with three knots, we may consider creating six subcategories in which changes in the gene expression level are described by (flat, increasing), (flat decreasing), (increasing, flat), (decreasing, flat),

(increasing, decreasing), or (decreasing, increasing).

Applying the technique of linear spline functions to measured gene expression data, we can identify the temporal expression response pattern of genes that were significantly affected by the experimental manipulations. The response of 42 of those genes was not noticed in earlier fold-change analyses of expression data. Furthermore, it was shown that the expression response levels found in a fold-change analysis were not found to be significant by Student's t- test for 33 out of 164 genes some genes. In some embodiments, gene expression data may be nosy and are plagued by outliers. Whereas Student's t- test and maximum likelihood methods described here take the statistical significance of noisy data into account, the issue of outliers needs to be addressed separately. As a simple procedure to remove outliers, we calculated the mean and standard deviation ofthe data for each point in time, and removed those data that deviate more than two standard deviations or so from the mean. Finally, the number of expression measurements needed at each time point to reliably fit a linear spline function was determined by removing some data points and fitting a linear spline function anew. It was found that if four data points per time point were used, in about 15% of the cases the knot positions will not be estimated reliably. It is therefore advisable to make more than four measurements per time point.

VI. Use of Gene Networks for Identifying and Validating Drug Targets We describe new methods for identifying and validating drug targets by using gene networks, which are estimated from cDNA microarray gene expression data. We created novel gene disruption and drug response microarray gene expression data libraries for the purpose of drug target elucidation. We used two types of microarray gene expression data for estimating gene networks and then identifying drug targets. Estimated gene networks can be useful in understanding drag response data and this information is unattainable from clustering analysis methods, which are the standard for gene expression analysis. In the construction of gene networks in certain embodiments, we used both Boolean and Bayesian network models for different steps in analysis and capitalize on their relative strengths. We used an actual example from analysis of the Saccharomyces cerevisiae gene expression and drug response data to validate our strategy for the application of gene network information to drug discovery.

1. Introduction

Cluster methods have become a standard tool for analyzing microarray gene expression data. However, they cannot offer information sufficient for identifying drug targets in either a theoretical or practical sense. We provide methods to determine how estimated gene networks can be used for identifying and validating target genes for the understanding and development of new therapeutics. Gene regulation pathway information is desirable for our purpose, and we used both Boolean and Bayesian network modeling methods for inferring gene networks from gene expression profiles. The procedure for identifying drug targets can be separated into two parts. At first, identify the drug-affected genes. Second, we search for the "target" genes, which are usually upstream of the drag-affected genes in the gene network. A Boolean network model is useful for identifying the drug affected genes by using "virtual gene" gene technique, described herein and a Bayesian network model can be used for exploring the draggable gene targets related to the elucidated affected genes. We applied our methods to novel Saccharomyces cerevisiae gene expression data, comprised of expression experiments from 120 gene disruptions and several dose and time responses to a drag.

2. Gene Networks for Identifying Drug Targets

A. Clustering Methods

Clustering methods such as the hierarchical clustering and the self- organizing maps are commonly used as a standard tool for gene expression data analysis in the field of Bioinformatics. Eisen focuses on the hierarchical clustering and provides software, Cluster/TreeNew, for clustering analysis of gene expression data. De Hoon et al improved this software especially in the k- means clustering algorithm.

Clustering methods only provide the information on gene groups via the similarity of the expression patterns. However, it can be desirable to have additional hierarchical pathway information, not only cluster information to detect drug targets that are affected by an agent. We show the limitations of clustering techniques for drag targeting purposes through real data analysis.

We use two new methods for estimating gene networks from gene expression data. In this subsection, we give the brief introductions of both methods. For detailed discussions of the algorithms, please refer the papers in the references section.

B. Boolean Network

To estimate a boolean network model, we can discretize the gene expression values into two levels, 0 (not-expressed) and 1 (expressed). Suppose that uj, . . . u_k are input nodes of a node v. The state of v is determined by a Boolean functions . . . , ψ(u_k)), where ψ(uι) is the state or the expression pattern of u . If we have time series gene expression data, the state depends on the time t and the state ofthe node at time t depends on the states of inputs at time t - 1. On the other hand, suppose that we have gene expression data obtained by gene disruptions. Akutsu et al. proposed the theory and methodology for estimating a Boolean network model without time delay. Maki et al. provides a system, named AIGNET, for estimating a gene network based on the Boolean network and an S-system. We use the AIGNET system for estimating Boolean network models.

Advantages ofthe use ofthe Boolean network model includes: a) This model is simple and can be easily understood. A Boolean network model can detect the parent-child relations correctly, when the data has sufficient accuracy and information and b) the estimated Boolean network model can be directly applied to the Genome Object Net, a software tool for biopathway simulation. A disadvantage of Boolean methods is that data must be discretized into two levels, and the quantization loses information. Moreover, the threshold for discretizing is a parameter and should be chosen by a suitable criterion.

C. Bayesian Networks A Bayesian network is a graphic representation of complex relationships of a large number of random variables. We consider the directed acyclic graph with Markov relations of nodes in the context of Bayesian network. We can thus describe complex phenomena through conditional probabilities instead of the joint probability ofthe random variables. Further description can be found in Example 5 herein.

Friedman et al. proposed an approach for estimating a gene network from gene expression profiles. They discretized the expression values into three values and used the multinomial distributions as the conditional distributions of the Bayesian network. However, this did not solve the problem of choosing threshold values for discretizing.

We developed a nonparametric regression model that offers a solution that does not require quantization together with new criterion, named BNRC, for selecting an optimal graph. The BNRC is defined as the approximation of the posterior probability of the graph by using the Laplace approximation for integrals. We applied the proposed method to Saccharomyces cerevisiae gene expression data and estimated a gene network. The advantages of this method are as follows: a) one can analyze the microarry data as a continuous data set, b) this model can detect not only linear structures but also nonlinear dependencies between genes and c) the proposed criterion can optimize the parameters in the model and the structure ofthe network automatically.

A Bayesian network has some advantageous based on the mathematics of inference, and we use the method for constructing Bayesian networks. We can construct cyclic regulations and multilevel directional models of regulatory effects from data created from logical joins of expression data from disruptants and drug response experiments by combining the Bayesian and Boolean networks in analysis. Hence, the Boolean and Bayesian networks used together can cover the shortcomings one another and we can obtain more reliable information.

3. Applications to Microarray Data

We created two libraries of microarray data from Saccharomyces cerevisiae gene expression profiles. One was obtained by disrupting 120 genes, and the other was comprised of the response to an oral antifungal agent, (four concentrations and five time points). We selected 735 genes from the yeast genome for identifying drag targets. In YPD, 314 genes were defined as "Transcription factors", and 98 of these have already been studied for their control mechanism. The expression profile data for 735 genes chosen for analysis included the genes controlled by these 98 "Transcription factors" from 5871 gene species measured in addition to nuclear receptor genes which have a role in gene expression regulation and are popular drug targets. We constructed network models from the data set of 735 genes over 120 gene disruption conditions. The details ofthe disruption data are also described in Imoto et al. As for drug response microarray gene expression data, we incubated yeast cultures in dosages of 10, 25, 50, 100 mg of an antifungal medication in culture and took aliquots of the culture at 5 time points (0, 15, 30, 45 and 60 minutes) after addition of the agent. Here time 0 means the start point of this observation and just after exposure to the drug. We then extracted the total RNA from these experiments, labeled the RNA with cy5, hybridized them with cy3 labeled RNA from non-treated cells and applied them to full genome cDNA microarrays thus creating a data set of 20 microarrays for drag response data. In this paper, we use these 140 microarrays to elucidate drug targets using gene networks.

4. Results

A. Clustering Analysis

In the identification of the drag targets, a popular but problematic prior art strategy has been to use clustering analysis often even with a library of base perturbation control data to compare to ( ). We have two types of microarray data, gene disruption and drug response, allowing us to compare drug response patterns to gene expression patterns caused by disruption, hi the clustering analysis, if there is significant and strong similarity between the expression patterns of a single disruptant or group of disruptants and a given drug response microarray, we may conclude that an agent probably plays the same role as the disrupted gene(s). Moreover, if this disrupted gene has know functional role, we may obtain more information about the response to a drug.

Unfortunately, as if often the case with such experiments we cannot gain such a straightforward result from clustering our data. Figure 1 shows the image representation of the correlation matrix of our microarray data. We combine the two types of data and make the matrix Z — (X: Y), where X and Y are the drag response and the gene disruption microarray data matrix, respectively. Here, each column denotes an expression pattern obtained by one microarray, and each column is standardized with a mean of 0 and variance of 1. Therefore, Figure 1 of Example 5 depicts the information of the correlation matrix R = Z^τ z 7p, where p is the number of genes. To illustrate our methods, we focus on 735 genes for estimating gene networks and identifying drug targets.

In Figure 1 of Example 5, the light and dark colors denote the positive and negative high correlations, respectively. Drug response microarrays have high correlation amongst each other and a low correlation with any of the gene disruption microarrays. Under such a situation, it may be difficult to identify interactions between gene disruptions and drag responses from the clustering analysis that would be meaningful in the context of drag response. We further implemented hierarchical clustering of the drug response microarrays, but this produced one cluster and we could not extract any more information on the actual drug targets from this analysis. This result was essentially unchanged when we use other distance measurements or clustering techniques. Hence, it is difficult to obtain information for identifying meaningful drug targets from the clustering methods.

B. Boolean Network Analysis

To overcome shortcomings of prior art clustering methods, we estimated gene network by using the microarry data, Z, which was created by combining gene disruption and drug response microarrays. We consider the conditions of the drag responses data as "virtual genes", e.g., the condition lOOmg/ml and

30min is given an assignment as the gene YEXP100mg30min. By using a Boolean network model, we found child genes of these virtual genes, with the drag affecting these child genes in progeny generational order. We focus on genes which five or more virtual genes as the parent genes, as the putative drug- affected genes, that is, genes which are under direct influence of the virtual genes (drag affect). However, a gene which has only one virtual gene as its parent gene may be the primary drug-affected gene, depending on the mode of action for a given agent. The virtual gene technique highlights the use of Boolean network models compared with the Bayesian network model in the initial screening for genes under drug-induced expression influence.

In addition, fold-change analysis can provide similar information to the proposed virtual gene technique. We identified the affected genes under certain experimental conditions by fold-change analysis. However, our virtual gene technique can improve the result of the fold-change analysis. Suppose that we find gene A and B are affected by the drug from fold-change analysis. The fold- change analysis cannot take into account the baseline interactions between geneA and geneB. That is, if there is a regulation pathway between geneA and geneB that geneA — > geneB, the geneB may not be affected by the drug directly. Rather, an effect of the drag on geneA may result in an indirect effect on geneB.

The virtual gene technique can take into account such interaction by using the information of the gene disruption data and thus reduce the search set to more probable target genes given available interaction data.

There is not guarantee that genes that are most affected by an agent are the genes that were "drugged" by the agent, nor is there any guarantee that the drugged target represents the most biologically available and advantageous molecular target for intervention with new drugs. Thus, even after identifying probable molecular modes of action, it is desirable to find the most draggable target genes upstream of the drag-affected genes in a regulatory network and to then screen low molecular weight compounds for drag activity on those targets.

In the estimated Boolean network, virtual genes can be placed on the top of the network. Therefore, it can be difficult or sometimes impossible to find upstream information for drug-affected genes in this estimated Boolean network. In such circumstances, we use a Bayesian network model for exploring the upstream region ofthe drug- affected genes in an effective manner.

C. Bayesian Network Analysis

We found that a gene network can be estimated by a Bayesian network and nonparametric regression method together with BNRC optimization strategy. We use the Saccharomyces cerevisiae microarray gene expression data as described herein obtained by disrupting 120 genes. From Boolean network analysis, we found the candidate set of the drag-affected genes effectively. Draggable genes are drug targets related to these drug-affected genes, which we want to identify for the development of novel leads. We explored the draggable genes on the upstream region ofthe drag-affected genes in the estimated gene network by a Bayesian network method. Using a Bayesian model of network regulatory data available to us from our knockout expression library, we searched upstream of drag affected target genes, which have a known regulatory control relationship over drag affected target expression. For example, we focus on nuclear receptor genes as the draggable genes because a) nuclear receptor proteins are known to be useful drag targets and together represent over 20% of the targets for medications presently in the market, b) nuclear receptors are involved in the transcription regulatory affects that are directly measured in cDNA microarry experiments.

Figure 3 of Example 5 shows a partial network, which includes the drug- affected genes (Bottom), the draggable genes (Top) and the intermediary genes (Middle). Of course, we can find more pathways from the draggable genes to the drug-affected genes if we admit more intermediary genes. Due to the use of the Bayesian network model, we can find the intensities of the edges and can select more reliable pathway. This is an advantage of Bayesian network models in searching for suitable draggable targets. In Figure 3 of Example 5, draggable genes in the circle connect directly to the drug-affected genes and the other draggable genes have one intermediary gene per one draggable gene. From Figure 3, we identified draggable genes for each drag-affected gene, e.g., we found the draggable genes for MAL33 and CDC6 shown in Table 1 of Example 5.

5. Discussion

We describe new strategies for identifying and validating drug targets using computational models of gene networks. Boolean and Bayesian networks can be useful for estimating gene networks from microarray gene expression data. Using both methods we can obtain more reliable information than by using either method above. A Boolean network is suitable for identifying drug- affected genes by using the virtual gene technique set forth herein. Bayesian network models can provide information for upstream regions of the drug- affected genes and we can thus attain a set of candidate draggable genes. Our novel strategies are established based on the sophisticated use of a combination of two network methods. The strength of each network method can be clearly seen in this strategy and the integrated method can provide a methodological foundation for the practical application of bioinformatics techniques for gene network inference in the identification and validation of drag targets.

VII. Drug Target Discovery and Validation with Gene Regulatory

Networks

Gene regulatory networks developed with full genome expression libraries from gene perturbation variant cell lines can be used to quickly and efficiently to identify the molecular mechanism of action of drugs or lead compound molecules. We developed an extensive yeast gene expression library consisting of full-genome cDNA array data for over 500 yeast strains each with a single gene disruption. Using this data, combined with dose and time course expression experiments with the oral antifungal agent Griseofulvin, we used Boolean and Bayesian network discovery techniques to determine the genes whose expression was most profoundly affected by this drug. Griseofulvin, whose exact molecular target was previously unknown, interferes with mitotic spindle formation in yeast. Our system was used to directly discover CIK1 as the primarily affected target gene in the presence of Griseofulvin. Deletion of

OKI, the primary affected target determined by gene network discovery produces similar morphological effects on mitotic spindle formation to those of the drug. Using the base hierarchical data from the expression library and nonparametric Bayesian network modeling we were able to identify alternative ligand dependant transcription factors and other proteins upstream of CIK1 that can serve as potential alternative molecular targets to induce the same molecular response as Griseofulvin. This process for network based drag discovery can significantly decrease the time and resources necessary to make rational drug targeting decisions.

1. Introduction

Rational drug design methodologies have previously been concentrated on optimizing small molecules against a predetermined molecular target. The randomized lead to target to phenotype screening for target selection that is currently the prevailing paradigm in drag discovery has failed to offer a more efficient and accurate target selection process even with the advent of wide scale availability of genomic information and high throughput screening processes ⁽¹'²'³⁾. The availability of genomic sequences, full genome microarrays and recent advances in gene network inference computational techniques allows for a new rational paradigm for drag target selection that takes into account global networked regulatory interactions among molecules in the genome. Accurate models of gene regulatory network data can be produced from disruptant based expression data⁽⁴'⁵⁾ by using various computational inference techniques⁽⁶'⁷⁾. Here we show how this gene regulatory information can be used to quickly determine the molecular networks and gene targets affected by a given compound. The same information allows for the identification and selection of alternative draggable molecular targets upstream or downstream of a drag targeted molecule in the gene expression regulatory cascade. . We have developed a gene regulatory network-driven iterative drug target discovery process. In this methodology, first large numbers of gene expression experiments are performed on single gene disruptant cell lines. This information is used to create computationally inferred maps of hierarchical gene expression control. The hierarchical regulatory information is used as a basis for evaluation of drug response experiments and for generation of hypotheses of molecular action mechanisms. Information from the literature and further biological experimentation on the elucidated regulatory subnetworks is used to understand and validate results before selecting a candidate molecule for drag targeting. Most effective drugs in clinical use including aspirin and other popular medications have not been rationally designed to interact with a specific molecular target. Thus, even when the desired clinical effect or phenotype is achieved with these drags, the underlying molecular mechanisms of action and thus the mechanisms of the drag's side effects remain unknown. Full genome gene expression experiments have been shown to be useful in determining alternative genes and pathways affected by a drag ^(8,9,10) ₅ but determination of the primary molecular target for many drags which affect hundreds of genes with standard gene expression analysis methods such as clustering is impractical without apriori information on potential targets for the drag. Here we demonstrate the use of hierarchical gene expression regulation networks from full genome expression libraries and gene network modeling techniques together with drug response expression experiments to determine the previously underlying molecular targets for the popular generic antifungal agent, Griseofulvin. Griseofulvin is a widely prescribed oral antifungal agent that is indicated primarily for severe fungal infections of the hair and nails. While Griseofulvin' s molecular action is unknown, the drag disrupts mitotic spindle structure in fungi to lead to metaphase arrest.

2. Methods

Microarray experiments The yeast strain used in this study is BY4741. To monitor the gene expression profile, cells were pre-grown at 30 °C in YPD (2 % polypeptone, 1 % yeast extract and 2 % glucose) to mid-exponential phase and were exposed with Griseofulvin by adding it to the medium to the concentrations at 0, 10, 25, 50, and 100 mg/ml. The exposed cells were harvested at 0, 15, 30, 45 and 60 min after addition by Griseofulvin and used for RNA extraction. Total RNAs were extracted by the hot-phenol method.

Boolean Network Inference Algorithms In addition to the Boolean methods for gene disruption experiments reported elsewhere*-¹ , we created a gene expression matrix E from a set of the drug treatment experiments combined with disruptant expression data. In the case of drag generated perturbation, we created a "virtual" gene to represent the drug affect analogous to a gene disruption. We then were able to use standard disruption matrix algorithms designed for disraption based data.

Bayesian Network Inference Algorithms We performed non-parametric regulation and Bayesian networks to define regulatory subnetworks upstream of OKI using algorithms and methods reported elsewhere herein.

Data Normalization We measured the quantities of 5871 mRNA species from 20 drug treated strains by cDNA microarray assay. A difference in fluorescent strength between Cy3, Cy5 causes bias ofthe expression quantity ratio. We normalized the expression quantity ratios of each expression profile. The ratio bias had a fixed trend in each spotted block, thus we calculated a linear regression to normalize the mean value ratio of each block to 1.0. The logarithm value ofthe ratio was used to indicate the standard expression level, therefore we found the logarithm value of ratio and calculated the average and standard deviation of these log values (see tablel). The Standard Deviation (SD) of expression levels of all spotted genes from the UME6 (YDR207C) disruptant expression array for which UME6 is defined as a "Global Regulator" in YPD⁽³³⁾ disruptant was 0.4931, therefore we recognized that there are an unacceptable number of errors in array data whose overall SD was larger than 0.5 and eliminated such experiments from analysis.

Selection of Genes for Modeling In YPD, 314 genes were defined as "transcription factors", and 98 of these have previously been studied for control mechanism. The expression profile data of 552 genes including the genes that are controlled by these 98 "transcription factors" were selected from 5871 profiles. Thus we constructed the gene regulatory network from the expression profile data set based on the values of these 552 genes in 120 gene disraption experiments and 20 drag treatment experiments. 3. Results We incubated yeast cultures in dosages of 10, 50 and 100 mg in 10ml of DMF and took aliquots of the culture at 5 time points (0, 15, 30 , 45 and 60 minutes) after addition of Griseofulvin. We then extracted the total RNA from these experiments, labeled the RNA with cy5, hybridized them with cy3 labeled RNA from non-treated cells and applied them to full genome cDNA microarrays. 183 genes were downregulated by over a 2 sigma threshold among 552 genes which differed in expression between drug treated and normal yeast (Figure 7a). Standard hierarchical clustering methodologies (Figure 7b) applied to the combined expression libraries from drug response and gene disruption experiments, clustered genes into two major groups; first, genes affected by Griseofulvin and second, genes affected by disruption. Within the

Griseofulvin clusters, genes were further grouped by dosage or time course. However, clustering did not discover any gene expression patterns that significantly indicated correlated regulation by a given gene and the drug Griseofulvin. This result would be expected except in cases where the antifungal agent affects the expression of only one discrete gene and minimal

(Figure 7b).

However, the use of gene regulatory network models, combined with the drug perturbation data allows for a hierarchical gene regulatory view of the drug's interaction with genes in the transcriptome. To generate this gene network drug perturbation data, we first created a full genome expression library of 542 single-gene disraption mutants. The 120 array data selected from the library was logically joined with the array matrices generated from the Griseofulvin experiments. A Boolean methodology designed for gene network elucidation⁽¹¹'¹²⁾ was then applied to the joint expression matrices for each time course and hierarchical regulatory maps for each experiment were generated for each dose and time point. We produced joint Boolean regulatory subnetworks for each dosage and time point experiment. The Boolean algorithm was selected for its suitability for handling joint matrices, its ability to handle looped regulatory processes and the ease creation of hierarchical directed graphs with several orders of regulatory separation.

From this data we were able to identify the first order drug affects as opposed to secondary cascades initiated by those initial perturbation regulatory events.

By evaluating the Boolean data from each time and dose differential experiments we were able to identify 8 genes that were consistently and significantly suppressed as first effects at each time and drug concentration. Of these genes, CIKl exhibited the strongest suppression effects across the experiments. CIKl codes for a protein described in the yeast proteome database (YPD) as a coiled-coil protein of the spindle pole body involved in spindle formation and the congression (nuclear migration) step of karyogamy⁽¹³⁾. Since the action of Griseofulvin is known to affect mitotic spindle formation, which is in agreement with the function of CZK7 , we performed a pathological examination of a yeast strain with a disrupted CIKl gene and yeast affected by Griseofulvin. While neither the treatment with Griseofulvin at normal physiological dosage nor the disruption of CIKl are lethal, both cultures show similar morphological differences and growth characteristics Furthermore, microscopic examination ofthe mitotic spindle structure in Griseofulvin treated yeast and CIKl deleted yeast show very similar changes ofthe spindle body and surrounding organizational structures ⁽¹⁵⁾.

The methodologies described here clearly demonstrate the utility of a combined expression array and computational approach using gene network techniques to rapidly ascertain and validate the molecular mechanisms of action of a given compound on a cell. Use of such techniques will help to rationalize the target selection process of pharmaceutical development in the post-genomic era and could contribute to efficiency of discovery and a reduction in development risk for the pharmaceutical industry. The same techniques can further be applied to other biological discovery and agrochemical targeting. Our laboratories are currently replicating this discovery model in human and other biological systems. Additional descriptions can be found in United States Provisional Patent Application serial no. 60/395,756, filed July 12, 2002 incoφorated herein fully by reference.

References 1. Smith, A. Screening for drag discovery: The leading question. Nature. 418, 453-459 (2002). 2. Aherne, G. W., McDonald, E. & Workman, P. Finding the needle in the haystack: why high-throughput screening is good for your health. Breast Cancer Res. 4, 148-154 (2002).

3. Willins, D.A., Kessler, M., Walker, S.S., Reyes, G.R. & Cottarel, G. Genomics strategies for antifungal drag discovery - from gene discovery to compound screening. Curr Pharm Des. 8, 1137-1154 (2002)

4. Hughes, T. R., et al. Functional Discovery via a Compendium of Expression Profiles. Cell. 102, 109-126 (2000)

5. Glaever, G., et al. Functional profileing of the Saccharomyces cerevisiae genome. Nature. 418, 387-391 (2002)

6. Friedman, N., Linial, M., Nachman, I. & Pe'er, D. Using Bayesian Networks ' to Analyze Expression Data. J Comput Biol. 7, 601-620 (2000).

7. Somogyi, R. & Sniegoski, C. A. Modeling the complexity of genetic networks: Understanding multigene and pleiotropic regulation. Complexity. 1, 45-63 (1996).

8. Marton, M. L, et al. Drag target validation and identification of secondary drag target effects using DNA microarrays. Nature Medicine. 4, 1293-1301 (1998)

9. Heller, M. J. DNA MICROARRAY TECHNOLOGY: Devices, Systems, and Applications. Annu Rev Biomed Eng. 4, 129-153 (2002)

10. Reynolds MA. Microarray technology GEM microarrays and drug discovery ; J Ind Microbiol Biotechnol. 28, 180-185 (2002)

11. Akutsu, T., Miyano, S. & Kuhara, S. Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fmgeφrint function. J. Comp. Biol. 7, 331-343(2000)

12. Akutsu, T., Miyano, S. & Kuhara, S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727- 734(2000)

13. Rose, M. D. Nuclear fusion in the yeast Saccharomyces cerevisiae. Ann. Rev. Cell. Dev. Biol. 12, 663-695 (1996) 14. Cottingham, F. R., Gheber, L., Miller, D. L., & Hoyt, A. M. Novel Roles for Saccharomyces cerevisiae Mitotic Spindle Motors. The Rockefeller University Press. 147, 335-349 (1999)

15. Manning, D. B., Barrett, J. G., Wallace, J. A., Granok, H. & Snyder, M. Differential Regulation of the Kar3p Kinesin-ralated Protein by Two

Associated Proteins, Ciklp and Viklp. The Rockefeller University Press. 144, 1219-1233 (1999)

16. Imoto, S., Goto, T., & Miyano, S. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac Symp Biocomput 175-186(2002).

VII. Systems for Discovering Network Relationships and Uses Thereof

In other embodiments, methods for elucidating genetic networks are provided that include Example 6 described herein. Accordingly, a desirable system can include those in which:

(1) one can collect experimental data for enabling the network to be elucidated when the genome structure is elucidated;

(2) data of all the related genes can be measured; (3) many tools for enabling many experiments to be used, such as gene chips;

(4) an output is measured after applying a disturbance to obtain many standardized data; and

(5) analysis ofthe genetic relationships can be determined.

An example of such a system is presented in Figure 1 of Example 6. Figure

2 of Example 6 depicts schematically, a method for obtaining microarray data on expressed genes from an organism. Figure 3 of Example 6 depicts schematically a method for assessing and quantifying the expression of a gene by comparing the amounts of gene products (e.,g., RNA) from mutated cells (disruptants) in which a particular gene is disrupted with the amounts of gene products from normal (wild-type) cells.

Analysis of a large scale network can be accomplished, in some embodiments, using the guideline provided as Figure 4 of Example 6. Time course studies can be carried out and the expression of each gene under study can be evaluated. A Boolean network model can be provided and a dynamic model of the network can be made. Positive and negative interactions can be mapped, thereby producing a gene network. A multilevel diagraph approach can be taken to relate effects of disrupting (or altering expression) each gene with respect to other genes under study.

Incorporation by Reference

All patents, patent applications and references cited in this application are incoφorated herein fully by reference.

The foregoing description of embodiments of the present invention has been provided for the puφoses of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention and the various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalence.

EXAMPLES The following Examples are provided to illustrate embodiments of this invention. Other specific applications ofthe teachings in this patent application can be used without departing from the spirit of this invention. Other modifications of the methods can be used, and are considered to be within the scope of this invention. Example 1

ESTIMATION OF GENETIC NETWORKS AND FUNCTIONAL

STRUCTURES BETWEEN GENES BY USING BAYESIAN

NETWORKS AND NONPARAMETRIC REGRESSION

We propose a new method for constructing genetic network from gene expression data by using Baysian net-works. We use nonparametric regression for capturing nonlinear relationships between genes and derive a new criterion for choosing the network in general situations. In a theoretical sense, our proposed theory and methodology include previous methods based on Bayes approach. We applied proposed method to the S. cerevisiae cell cycle data^13,20 and showed the effectiveness of our method compared with previous methods.

1 Introduction

The improvement of the genetic engineering provides us enormous amount of valuable data such as microarray gene expression data. The analysis of the relationship among genes has drawn remarkable attention in the field of molecular biology and Bioinformatics. However, due to the cause of dimensionality and complexity of the data, it will be no easy task to find structures, which are buried in noise. To extract the effective information from biological data, thus, theory and methodology are expected to be developed from a statistical point of view. Our purpose is to establish a new method for understanding the relationships among genes clearer.

In the analysis of the microarray gene expression data, Bayesian networks are adopted for constructing genetic networks ³> >⁵>¹² from a graph-theoretic approach. Friedman and Goldszmidt (1998) ¹² proposed an interesting method for constructing genetic links by using Bayesian networks. They discretized the expression value and considered to fit the models based on multinomial distributions: However, a problem still remains to be done in choosing the threshold value for disctritizing by not only the experiments. The threshold value assuredly gives essential changes of the results and unsuitable threshold value leads to wrong results. On the other hand, recently, Friedman et al. (2000) ¹³ pointed out that discretizing was probably loosing the information- To use the expression data as continuous values, thus, they considered the use of Gaussian models based on linear regression. However this model can only detect linear dependencies and we cannot obtain sufficient results.

In this paper we propose a new method for constructing genetic networks by using Bayesian networks. To capture not only linear dependencies but also nonlinear structures between genes, we use nonparametric regression models with Gaussian noise¹¹'^15,21,22. Nonparametric regression has been developed in order to explore the complex nonlinear form ofthe expected responses without the knowledge about the functional relationship in advance. Due to the new structure of the Bayesian networks, a suitable criterion is needed for evaluating models. We derive a new criterion from Bayesian statistics. By using proposed method, we will overcome the defects of previous methods and attain more effective information. In addition, our method includes the previous methods as spacial cases- We demonstrate our proposed method through the analysis of the S. cerevisiae cell cycle data.

2 Bayesian Network and Nonparametric Regression

Let X — (Xι ,X₂, —,X_p)^τ be a p-dϊmensional random vector and let G be a directed acyclic graph. Under the Bayesian network framework, we look upon a gene as a random variable and decompose the joint probability into the product of conditional probabilities, that is

P(X_l ,X₂, ...,X_p) = P(X₁\P₁)P(X₂\P₂) x - ^• - x P(X_P\P_P), (1)

f(^Xil , Xi2, —, Xip) = fl (Zil ^• " X fp(Xip]Pip)-

Then all we need to do is to consider how construct the conditional densities nonparametric regression models for capturing the relationship between x_tJ- and p_fj- = (Pa , —_>Pi_q )^{T m} the f^oπn x_i} = πi_! (pW) + + ■ - - + m_?J (pfy^{ + e_ij} i = I, .-, «; j = 1, ~,p, where mj_t (k = 1, ..., q_j are smooth functions from Rto R and e_tj (i = 1, ...,n) depend independently and normally on mean 0 and variance <n2. For the function m_t, it is assumed that

l, .., _gi, . (2)

the number of basis functions.

Then a nonparametric regression model can be written as a probability density function in the form

where - _} = ( ^ , -, ^ )^T is a parameter vector with y_jh = (7^ , ... , 7M^J_)^T- If a variable X_j has not parent variables, we consider the model based on the normal distributions with mean μ and variance σ .

Finally we have a Bayesian network model based on the nonparametric regression model with Gaussian noise

/(a:.-; Θ_G) ^θj > ^{l ~} *> -' ⁿ> where θ_G = (θ₁ , —, θ_p )^τ is a parameter vector included in the graph G and θ_j is a parameter vector in the model ,-, i.e., θ_j = (lj,σ_j)^τ or θ_j = (μj, σ_j)^τ.

3 Proposed criterion for choosing graph

Let ιr(θ \λ) be the prior distribution on the unknown parameter vector ΘQ with hyper parameter vector A and let logτr(0σ|λ) = 0(n). The marginal probability of the data X„ is obtained by integrating over the parameter space, and we choose a graph G with the largest posterior probability

where π is a prior probability of G. Friedman and Goldszmidt (1998)¹² considered the multinomial distribution as the Bayesian network model f(xi', Θ_G), and also supposed the Dirichlet prior on the parameter Θ . In this case, the Dirichlet prior is the conjugate prior and the posterior distribution belongs to the same class of distribution. Then a closed form solution of the integration in (4) is obtained, and they called it BDe score for choosing graph ^6,lβ. Recall that the BDe score is confined to the multinomial model, and we propose a criterion for choosing graph in more general and various situations.

The essential problem of constructing criteria based on (4) is how compute the integration. Some methods can be considered for computing the integration such as Markov chain Mote Carlo, we use the Laplace approximation for integrals^7,17,23. The Laplace approximation to the marginal probability of X_n is

where r is the dimension of Θ_G,

(θ_G\X_n) = - >g/(*_t; 0_G) + i log,r(0_σ|λ),

and Θ_G is the mode of Then we have a criterion, BNRC, for selecting graph

BN

(5)

The optimal graph is chosen such that the criterion BNRC (5) is minimal.

where

(6)

Thus the BNRC (5) can be obtained by the local scores of graph as follows: We define the local BNRC for the j-th variable X₃ by

BNRC, - -2 log j π_L, J f[ (7) where τtι_} is a prior probability of the local structure associated with X₃. We also apply Laplace's method to the BNRC, and the BNRC is obtaind by p BNRC = -21ogτr_σ + J {BNRC, + 21og7rχ,,}. Notice that the final graph is selected as a minimizer of the BNRC and not have to be minimize each local score BNRC, , because the graph is constructed as acyclϊc

4 Estimating graph and related structures by using BNRC

In this Section we express our method in more concrete terms. Essential points of our proposed method are the use of the nonparametric regression and the

.B-splines of e ree 3 s^opac-ed ^■".^«• *"*^{■ Me} ? f d We assume that the prior distribution on the parameter vector θ is

Ii = Jl ^τfr -ϊjk\ *)-

Each prior distribution πj_k("f_jkW_k) s a singular Mjk variate normal distribution given by

where λ_,-jt is a hyper parameter, K_3k is an M k x Mjk matrix, ~f _kKjk _j =

∑_fcs ( ^ - ²7r-?ι^_t + 7H.₂,*)² and |K),t|₊ is the product of Mjk - 2 nonzero eigenvalues of K_3k- The score BNRC, (7) can be obtained as

BNRC,- = -21o_g7T_£3 - log _-_fc(7_'jfclλ_ifc)

+ log M_ifc + l)log(2τrn-¹), (9) where 0,- = (lj,σ])^τ is a mode of l^(θ₃\X_n) defined in (6) for fixed λ_jfc. For computational aspect, we approximate the logarithm of the determinant of the Hessian matrix in (9) by

- M_jk log(nσf)} - log(2σj), ^" where B₃k is an n x Mjk matrix defined by Bjk = (& fc( i_fc)> -->^ifc(P_nfc)) with bjkfr ) = (b[Mi⁾),~-Λ)M?))^T- ^Hence ∞πⁱbϊⁱώⁱg (3), (8) and (9), BNRC,- is resulted in

BNRC,- = Gj + (n - 2q₃ - 2)logσ?

where βjt = (T_jλjk is a hyper parameter,

C_j - -2 log τr_Xi + («-+ - 2g_i)log(2π) + n - log2 ϊi

Ii kjk = Bj_kB₃k + nβjkKj_k, Mj- ^ Mjk- k—l

By using the backfitting algorithm ¹⁵, the modes _jJt (k = l, ..., q₃) can be obtained when the values of β_jk are given. The backfitting algorithm can be expressed as follow:

Step 1 Initialize: ~f_jk = 0, k = l, ...,q_j.

Step 2 Cycle: k = l_> ..., g_i, l, ..., g_i, l₎ ...

_TiJt == (B&B_ifc + Bivljv).

Step 3 Continue Step 2 until a suitable convergence criterion is satisfied. The mode σf is given by σ = ||x_(i) - ∑'l, 5 _{)t iJt}||² .

In attention, the modes 7_jfc and σ² are depend on the hyper parameters β_jk and we have to choose the optimal values of βjk- In the context of our method, it is a natural way that the optimal values of β₃k are chosen as the minimizers of BNRC,-.

Recall that the B-splines coefficients vectors _jk are estimated by maximizing (6). The modes of (6) are the same as the penalized likelihood es- timates ^{21 ,24} and we can look upon the hyper parameters λjk or βjk as the smoothing parameters in penalized likelihood. Hence, the hyper parameters play an important role for fitting the curve to the data.

5

5 Computational experiments

Before analyzing the real data, we used Monte Carlo simulations to examine the properties of our method. The data were generated from artificial graph and structures between variables (Figure 2) and then the results can be summarized as follows: Proposed criterion BNRC can detect linear and nonlinear structure of the data very well. But BNRC score have a tendency toward, overgrow of graph. In the analysis of the real data, then, we consider the use of Akaike's information criterion known as AIC ¹'² and use both methods. AIC was originally introduced as a criterion for evaluating models estimated by maximum likelihood method. But the estimates by our method is the same as the maximum penalized likelihood estimates and is not MLE. In this case, the modified version of AIC ¹⁰ is given by

«

AIC = ^ Iog /føby. -,-, *!) + 2(∑ trS_it + 1), i=l k=\ where Sj_k = Bjk(Bj_kBjk + nβj_kKj_k)^~1 Bj_k. The trace of Sjk shows the degree of freedom of the fitted curve and is a great help. That is to say, if trS,-* is nearly 2, the dependency is looked upon linear. We use both BNRC and AIC for decision whether we add up to a parent variable. By using this method, the estimated graph and structures are closed to ture model.

We analyze the S. cerevisiae cell cycle data discussed by Spellman et al. (1998) ²⁰ and Friedman et al (2000) ¹³. The data were collected from 800 genes with 77 experiments.

Figure 3: BNRC scores for CNL2, CDC5 and SVS1.

We set the prior probability π_G is constant, because we have no reason why the large graph is unacceptable and no information about the size of the true graph. The nonparametric regressiors are constructed 20 J5-splines. In fact, the number of B-splines is also parameter. However, we use somewhat large number of B-splines, the hyper parameters control the smoothness of fitted curve and we cannot visually find differences among fitted curves corresponding with various number of B-splines. The results of the analysis can be summarized as follows: Figure 3 shows BNRC scores when we predict CLN2, CDC5 and SVS1 by one gene. The genes which give smaller BNRC scores give a better expression the target gene. We can observe that which gene is associated with target gene and we find the gene set which strongly depend on. In fact, we can construct a brief network by using these information. We can look upon the optimal graph as a revised version of the brief network by taking account of the effect of interactions. We note that if there is a linear dependency between genes, the score BNRC is also good when the parent-child relationship is reversed. Therefore, the directions of the causal associations in the graph are not strict especially when the dependencies are almost linear. Our result basically advocates the result of Friedman et al. (2000) ¹³, but, of cause, there are different points in parts. There are some genes that mediate FViedman et aVs result, such as MCD1, CSI2, YOX1 and so on. Most of these genes had been reported to play an important role. A large number of the relationships between genes are nearly linear. But we could find some nonlinear dependencies which linear models hardly find. Figure 5 shows the estimated graph associated with genes which were classfied their processes into cell cycle and their neighborhood. Here, we omit some branches in Figure 5, but important information are almost shown. As for the networks given by us and Friedman et al.¹³, we confirmed parent-children relationships and observed that both two networks are similar to each other. Especially, our network includes typical relationships which were reported by Friedman et al. ¹³. As for the differences between both networks, we attention the parents of SVS1. Friedman et al. ¹³ employed CLN2 and CDC5 as the parent genes of SVSl. On one hand, our result gives CSI2 and YKR090W for SVSl. We check up on the difference of these two results. After all, in the sense of BNRC and AIC, our candidate parent genes are more appropriate than Friedman et al. ¹³'s. The reason might be the effect of discretizing, because our model suitably fits to both cases in Figure 4. Especially we conclude that CDC5 gives just weak effects to SVSl compared with other genes only in the case in which Spellman ⁰'s data (see also Figure 3). In fact, as the parent gene of SVSl, the order of BNRC score of CDC5 is 247th. Considering the circumstances mentioned above, our method can provide us valuable information in understandable and useful form.

Figure 4: Cell cycle data and smoothed estimates.

(a) and (b) Friedman ct ol (2000) ¹³, BNRC = 57.71, A1C=167 96;

(c) and (d) Proposed method, BNRC = 32 53, AIC=140 16.

6 Discussion

We proposed the new method for estimating genetic networks from microarray gene expression data by using Bayesian network and nonparametric regression. We derived a new criterion for choosing graph theoretically, and represented its effectiveness though the analysis of the cell cycle data. The advantages of our method are mainly as follows: We can use the expression data as continuous values. Not only linear dependencies, we can also detect nonlinear structures and can visualize their functional structures being easily understandable. Fully automatic search can accomplish the creation of optimal graph.

We also pointed out that Friedman et al. ¹³'s method remained the unknown parameters such as threshold value for discretizing and hyper parameters in the Dirichlet priors which selected by trial and error and were not optimized in a narrow sense. On the other hand, our proposed method can automatically and appropriately estimate any parameters based on proposed criterion which has a sounder theoretical basis. Besides, our method includes Friedman et al.¹³'s as a special case.

We consider the following problems for our future works: (1) We used the statistical models based on Gaussian distribution. However, we derive the criterion BNRC in more general situations. In fact, we can construct the graph selection criterion based on other statistical models. (2) It is a possible case that the outliers cause strange results. Thus, the development of the robust methods and the technique for detecting the outliers are important problems. (3) The intensities of the unions are probably measured by using bootstrap method ⁹, We would like to investigate these problems in a future paper.

Figure 5 Cellcycle data result. References

1. H. Aka ke, in Petrov,B.N. and Csaki,F. eds., 2nd Inter. Symp. on Information Theory, Akademiai Kiado, Budapest, 267 (1973).

2. H. Akaike, IEEE Trans. Autom. Contr., AC-19, 716 (1974).

3. T. Akutsu, S. Miyano and S. Kuhara, Pacific Symposium on Biocomput- ing, 17, (1999).

4. T. Akutsu, S. Miyano and S. Kuhara, Bioinformatics, 16, 727 (2000).

5. T. Akutsu, S. Miyano and S. Kuhara, J. Comp. Biol, 7, 331 (2000). 6- G. F. Cooper and E. Herskovits, Machine Learning, 9, 309 (1992).

7. A. C. Davison, Biometrika, 73, 323 (1986)

8. C. de Boor, A Practical Guide to Splines. Springer, Berlin. (1978).

9. B. Efron, Ann. Stat., 7, 1 (1979).

10. P. H. C. Eilers and B. Marx, Statistical Science, 11, 89 (1996).

11. R. L. Eubank, Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York. (1988).

12. N. Friedman and M. Goldszmidt, in M. I. Jordan ed., Learning in Graphical Models, Kluwer Academic Publisher. 421 (1998).

13. N. Friedman, M. Linial, I. Nachman and D. Pe'er, /. Comp. Biol, 7, 601 (2000).

14. P. J. Green and B. W. Silverman, Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London. (1994).

15. T. Hastie and R Tibshirani, Generalized Additive Models. Chapman & Hall, London. (1990).

16. D. Heckerman, D. Geiger and D. M. Chickering, Machine Learning, 20, 274 (1995).

17. D. Heckerman, in M. I. Jordan ed., Lernϊng in Graphical Models 301, Kluwer Academic Publisher. (1998).

18. S. Konishi, (in Japanese), Sugakυ, 52, 128 (2000).

19. G. Schwarz, Ann. Stat., 6, 461 (1978).

20. P. Spelhnan, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botsteϊn and B. Futcher, Mol. Biol Cell, 9, 3273 (1998).

21. B. W. Silverman, J. R. Stat. Soc. Series B, 47, 1 (1985).

22. J. S. Simonoff, Smoothing Methods in Statistics. Springer, New York. (1996).

23. L. Tinerey and J. B. Kadane, J. Amer. Statist. Assoc., 81, 82 (1986).

24. G. Wahba, J. R. Stat. Soc. Series B, 40, 364 (1978).

Example 2

Nonlinear Modeling of Genetic Network by Bayesian

Network and Nonparametric Regression with

Heterogeneous Error Variances and Interactions

ABSTRACT 1. INTRODUCTION

We propose a new statistical method for constructUnder the development of the microarray technoling genetic networks from microarray gene expresogy, we can observe the gene expression data of sion data based on Bayesian networks. In the conthousands- now simultaneously. In the analysis text of Bayesian network, an essential point of netof gene expression data, constructing genetic network construction is in the estimation of the conwork receives a large amount of attention in the ditional distribution of each random variable. We field of molecular biology and bioinformatics (see consider fitting the nonpaxametric regression models 1³M J_>[S]_>[12],[13],[18] and [22]). However, the causes with heterogeneous error variances and interactions of dimensionality and complexity of the data disfor capturing the nonlinear Structures between genes. turb the progress of the microarray gene expression Once we set a graph by using Bayesian network arid' ^' data analysis. That is to say, the informa^'tioii tliat' nonparametrϊe regression, a problem still remains to we want is buried with a huge amount of th.e data be solved in selecting an optimal graph, which gives with noise. In this paper, we propose a new statisa. best representation of the system among genes. tical method for constructing genetic network that We theoretically derive a new criterion for chooscan capture even the nonlinear relationships between ing graph from Bayes approach in general situations. genes clearer- The proposed method contains the previous methods Bayesian network ([19]) is an effective method in for estimating genetic networks by using Bayesian modeling the phenomena through the joint distrinetworks. Wc demonstrate the effectiveness of probution of a large number of random variables. In posed method through the analysis of Saccharomyces recent years, some interesting works have been escerevisiae gene expression data newly obtained by tablished in constructing genetic networks from the disrupting 100 genes. microarray gene expression data by using Bayesian networks. Sriedman and Goldszmidt [12] discretized

Key words: microarray gene expression data; the expression values and assumed the multinomial Bayesian network; nonparametric regression; hetdistributions as candidate statistical models. Pc'cr eroscedasticity; interaction; posterior probability et al. [22] investigated the threshold value for discretizing. On the other hand, Riedxnan et ol [13] pointed out that the discretizing probably looses the information of the data and considered to fit the linear regression models, which analyze the data in the continuous. However, the assumption that the parent genes depend linearly on the objective gene is not always guaranteed, ϊmoto Gt al. [18] proposed the use of nonparametric additive regression models (see also [16]) for capturing not only linear dependencies but also nonlinear structures between genes. In this paβc "wc^"prόp∞c f Ee~meth <rfoτ constructing eling bf :; ye ϊaιϊ construct the genetic network by using Bayesian networks and the conditional densities, fj- We assume that the the nonparametric heteroscedastic regression model, conditional densities, f_j, arc parameterized by the which is more resistant to the effect of outliers and parameter vectors θj, and the effective uTtfoππation can also capture the effects ofthe interactions of paris extracted from these probabilistic models. ent genes. Imoto et al. [18] proposed the use of nonparamet¬

Once wc set the graph, we have to evaluate its ric regression strategy for capturing the nonlinear goodness or closeness to the true graph, which is relationships between XΪJ and p In many cases, completely unknown. Hence, the construction of a this method can capture the objective relationships suitable criterion becomes the center of attention very welL When the data, however, contain outliers of statistical genetic network modeling. Rriedman especially near the boundary on the domain p«}, and Goldszmidt [12] derived the criterion, BDe, for nonparametric regression models sometimes lead to choosing graph based on the multinomial models and unsuitable smoothed estimates, Le., the estimated Dirichlet priors. However, they remained she uncurve exhibits some spurious waviness due to the known hyper parameters in Dirichlet priors and we effects of the outliers- Since what is estimated is only set up the value? expericntially. We investithe system of a being, the too much complicated gate the graph selection problem as the statistical relationship is unsuitable. In fact, this inapproprimodel selection or evaluation problem and theoretiate case sometimes occurs in the real data analycally derive a new criterion for choosing graph from sis. To avoid this problem, we consider fitting the Bayes approach ([8]). The proposed criterion autononparametric regression model with heterogeneous matically optimizes the all parameters in the model error variances and gives the optimal graph. In addition, our proposed method includes the previous methods for con- x_{j = m_jføj) + Ei_j, ey ~ N(0, <τ ), (2) βtructing genetic network by Bayesian network- To show the effectiveness of the proposed method, wc where m (-) is a smooth function from W' to R and analyse gene expression d»ta of Saccharomyces cereN(μ, σ²) denotes Gaussian distribution with mean visiae by disrupting 100 genes. μ and variance σ². Here R denotes a. set of real numbers. This model includes Imoto et al. [18]'s

2. BAYESIAN NETWORK AND NONmodel and, clearly, the linear regression model as PARAMETRIC HETEROSCEDASTIC special cases. REGRESSION MODEL WITH INTEROne possible approach to the construction of the ACTIONS systematic component, mj(pi_j), is the nonparametric additive regressor [18]

2.1 Nonlinear model in Bayesian network

Suppose that we have n sets of data x\, . . - , a;_n} m (p_{j) = ,ι (p „ωg + ^{• • •} 4- m_Jv(p &\

«ϊi i)t* 00 of the p-dϊmensional random variable vector X — (Xi,-.., Xp)^T _t where xi = (xa, ..., Xip)^T and which is a straightforward extension of the mulor denotes the transpose of s. In microarray gene tiple linear regression. In. general, each smooth expression data, n and p correspond to the numbers function m_j^) is characterized by the n values of arrays and genes.- Under the Bayesian network ^mJ*(Pιi )> — 7^mjk( _nl a^d tie system (3) contains framework, we consider a directed acyclic graph n x q parameters. Then the number of the parameG and Markov assumption between nodes. The ters in the model is much larger than the number of joint density function is then decomposed into the observations and it has a tendency toward unstable conditional density (see also [10]) of each variable, parameter estimates. In this paper, wc construct the that is, smooth function m_jk(' b basis functions approach

J(xn , .. . , x_ip) = fi x_n \p_a) x ... x f_p(xϊ_p]p_ip), (1) where p^ — p_γ , .~.,p .)^τ are ^-dimensional parm=l ent observation vectors of x j in the graph G. When JF*ι = {X , Xz)^τ is the parent variable vector of Xγ where -_/ ), — ,7^.. * are unknown coefficient pawe see p_a — (xi2,X{₃)^r, (i = 1, ...,n). Through the rameters and b _k , (■), ..., b j. _k( ) are basis funcformula (l), the focus of interest in statistical modtions. Prom this representation, the n parameters = the graph, we

mM,) = 2.2 Criterion for choosing graph

Once wc set a graph, the statistical model (8) based on the Bayesian network and nonparametric regres⁽⁵⁾ sion can be constructed and be estimated by a suit able method. However, the problem that still re¬

In the error variances, σ _2, , we assume the following mains to be solved is how wc can choose the optimal structures: graph, which gives a beet approximation of the system underlying the data. Notice that we can not

' *7 _V ²> i = l, ...,τι; j = l, -,p, (6) use the likelihood function as a model selection criterion, because the value of lϋelihood becomes large where w ₃ w_nj- are constants and σ₃ is an irα- in a more complicated modeL Hence, it needs to known parameter By setting up the constants consider the statistical approach based on the generW_\3, ..., w_n} in reflecting the feature of the error varialized or predictive error, Kullback-Leibler informaances, we can represent the heteroscedasticity of the tion, Bayes approach and so on ([20]). In this secdata. Combining (2), (5) and (6), we obtain a nontion, wc construct a criterion for evaluating graph parametric regression model with heterogeneous erbased on our model (8) fro Bayes approach. ror variances and interactions When we set a graph, the criterion for evaluating the goodness of the graph can be constructed from fΛ^x i)> j>ξj,< ) Bayesian theoretic approach as follow: The posterior

W_f probability of the graph is obtained by the product i*n of the prior probability of the graph, 7r_σ, and the marginal probability of the data. By removing the τ.ξjΛi p )r] , standardized constant, the posterior probability of

(7) the graph is proportional to

(9) where X_n = (xι, ..., x_n)^τ is an n x p gene profile matrix, τr(#c_|λ) is the prior distribution on the parameter Θ_G satisfying lo τr(#_G|λ) = 0(n) and λ is

19 the hyper parameter vector. Under Bayes approach, where we can choose the optimal graph such that is maximum. A crucial problem for const a I_Λ,(0j|X„) = ~ log f_s(xij]Pij θj) + ~ criterion based on the posterior probability of the graph is the computation of the high dimensional Here X_j is the hyper parameter vector. Hence by integral (9). Iriedman and Goldszmidt [12] used the defining conjugate priors fox solving the integral and gave a closed-form solution. To compute this high dimensional integral, we use Laplace's approximation ([ll], [17] and [26]) for integrals ilities satisfying ^βcore is given by

(11)

The smoothed estimates based on nonparametric regression are obtained by replacing the parameters ~f_}- and ξ_j by ^ήf_j and ζ_j, respectively. Noticed that we

Jχ(θ_G) = derive the criterion, BNRC, under the assumption and Θ_G is the mode of Then we have a lo 7r(0c|λ) = 0[n). If wc use the prior density satcriterion, BNRC, for se h isfying logτr(0<;!λ) = 0(1), the BNRC criterion is resulted in Schwarz [25]'s criterion known as BIC oi

BNRC(G) SIC. In such case, the mode Θ is equivalent to the jEaximnm lilcelihood estimate.

3. ESTIMATING GENETIC NETWORK

3.1 Nonparametric regression In this section wc practically present the mcthoc

The optimal graph is chosen Buch that the criterion for constructing genetic network based on proposcc BNRC (10) is minimal. The merit of the use of the method described in Section 2. First wc would lik< Laplace method is that it is not necessary to consider to mention the nonparametric regression modcL Th< the use of the coηjugate prior distribution. Hence nonparametric regrcssor (5) has the two components the modeling in the larger classes of distributions of the main effects represented by the additive mode the model and prior is attained. of each parent gene and the interactions. In tlu

Suppose that the parameter vectors θj are indeadditive model, wc construct each smooth functioi pendent one another, the prior distribution then can nyj O by R-splines ([9] and [18]). be decomposed into In the interaction terms, we use Gaxissian radia basis function

homogeneous error variances. If we use the large exist near the boundary on the domain of the parent The result is summarize n t e ollow ng theorem. variables, are large. Hence, if there are the outliers near the boundary, we can reduce their effects and gain the suitable smoothed estimates by using the appropriate value of p.

3.2 Priors

Suppose that the prior distribution is fac- tor sed into BNRC_j = 2(q_j + 2s ₅ 4- 1)

*ϊ(0i|λ,-) L_jt + 1) logpT/n)

)

where Vji is the dimension of ξ j- λ_ik = Bf_kW_jB_jk + nβ_jkK_jk (M_jk κ M_jk), 27e_πe -we approximate the Hessian matrix by (a)

jVjl n) ^CX 4. REAL DATA ANALYSIS In this section we show the effectiveness of the pro¬

^■∑ ^^^ -∑^II^IΓ- posed method through the analysis of & cerevisiae fc=l 1=1 gene expression data. Our research group has installed a systematic experimental method, which ob¬

The first term in the right hand side is the log- serves the changes of expression level of genes on likehhood function and the second and third terms a microarray by gene disruption. By using this are called roughness penalties. The hyper parammethod, we have launched a project whose purpose eters Xjk and V_ji are called smoothing parameters is to reveal the gene regulatory networks between the which control the smoothness of the fitted curves. 5871 genes of Saccharomyces cerevisiae while many

Figure 2: The effect of the weight constants. the differences of he smoothed estimates against the various number of the basis functions can not visually found, because when we use the somewhat large number of the basis functions, the hyper paramelaboratories have also reported similar projects. We ters control the βmoothness of the fitted curves. Wc have already collected a large number of expression applied the two genes effect to the interaction comprofiles from gene disruption epcpcriments to evalu- ponents. Hence, the effects of the interactions are ate genetic regulatory networks. Over 400 mutants obtained as the fitted surfaces and can be visually are stocked and gene expression profiles are accumuunderstandable. lating. Before we mention the result of this analysis, we

We monitored the transcriptional revel of 5871 show the roles of the hyper parameters in the prior genes spotted on a microarray by a scanner. The exdistributions and weight constants. Figure 1 (a) pression profiles of over 400 disruptants were piled shows the scatter plot of YGL237C and YEL071W up in our database. The standard deviation (SD) with smoothed estimates by 3 different values ofthe of all genes on a microarray were eval ated^'and the hyper parameters. Clearly, the smoothed estimate value of SD represented roughly the error of a exstrongly depends on the values of the hyper parameperiment. In our data, we estimated the value of 0.5 ters. Figure 1 (b) is the behavior of he BNRC criteis the critical point of the accuracy of experiments. rion ofthe two genes in Figure 1 (a). We can choose We have evaluated the accuracy of those profiles on the optimal value ofthe hyper parameter as the min- the base of the standard deviation of the expression ϊmizcrs of the BNRC and the optimal smoothed estiratio of all genes. 107 disruptants including 68 mumate (solid curve) can capture the structure between tants where the transcription, factors were disrupted these genes welL The dashed and dotted curves could be selected from 400 profiles. arc near the maximum likelihood estimate and the

We used 100 microarrays and constructed the ge- parametric linear fit, respectively. The effect o the.

Figure 4: The resulted partial network of the analysis of 521 Saccharomyces cerevisiae genes. r ous waviness due to t e e ect o t e ata n t e eva uat on pro em an er ve t e new cr terion for ιpper-left corner- By adjusting the hyper parameselecting graph from Bayes approach. Our method ter p in (12), the estimated curve is resulted in the covers the previous methods for constructing genetic solid curve. The optimal value of p is also chosen by networks by using Bayesian networks and improves minimizing the BNRC criterion (see Figure 2 (b)). them in the theoretical and methodological sense. Of course, when the smoothed estimate is properly We showed the effectiveness of our method through obtained, the optimal value of p tends to zero. the analysis of Saccharomyces cerevisiae gene ex¬

Wc employ two step strategy for fitting the nonpression data and evaluated the resulted network parametric regression model with interactions. First, by comparing with the knowledge of biology. We wc estimate the main effects represented by the adconstruct the genetic network without the biological ditive jB-spliαe regreεsors. Next, wc fit the interinformation. Nevertheless, the resulted network inaction components to the residuals. Figure 3 is an cludes the many important connections, which agree example of the fitted surface to the relationship bewith the biological knowledge. Hence, wc expect tween YIL094C and its parent genes, YKL152C and that our method can demonstrate its power against YER055C. It is clearly, shown that the interaction of the analysis of completely unknown system like the the two parent genes leads more Overexpress when human genome. both parent genes increase.

The results of the analysis and their evaluations arc as follows: In Saccharomyces cerevisiae the References GCN4 gene encodes the transcriptional activator of the "general control" system of amino acid bioyn- [1] H. Aiaϋe, Information theory and an extension of the miu miim likelihood principle, in B.N. Petrov, _<___ F. thesis, a network of at least 12 different biosynCsa i, e 3., Akade iai Kiado, Budapest, 267 (1973). thetic pathways [6]- Experiments showed that the

[2] H, Ataike, Likelihood and the Bayea procedure, in J.M. consequences of the general control- response- upon Bernardo, M.H. DeGroot, D.V. Lindley & A.P. . Smith, . the signal "amino acid starvation" induced by the eds,, Univ. Press, Valencia (1980). histi-dinc analogue 3-aminotriazole with respect to [3] T, Akutsu, S. Miyano & S. Kuhara, Identification of Ge- Gcn4p levels. GCN4 activates transcription of more nςtfc Networks from a Small Number of Gene Expression than 30 genes involved in the biosynthesis of 11 Patterns Under the Boolean Network ModcL Proc. Paamino acids in response to amino acid starvation cific Symposium on Biocomputing 1999,. IT (1999). or impaired activity of tRNA synthctascs (sec [24]). [4] T. Akutsu, S. Miyano &: S. Knhara, Inferring qualitative relations in genetic networks and metabolic pathwaya. Purine biosynthetic genes ADEl, ADE4, ADE5,7, Bioinformatics, 16, 727 (2000). and ADE8 display GCN4-dependent expression in [5] T. Akutsu, S, Miyano & S. Kuhara, Algorithms for Idenresponse to amino acid starvation [24]. GCN4 actifying Boolean Networks and Related Biological Nettivates transcription of biosynthetic genes for amino works Based on Matrix Multiplication and Fingerprint acids and purines in response to either amino acid or Function. Λ Comp. BioL, 7, 331 (2000). purine starvation [24]- Those results of experiments [6] G. Albπκht, H.-U. ^'Mosch, B. Hoffmann, U. Rcusser & shows that there are the strong relations between G. H. Brans, Monitoring the Gcn4 Protein-mediated Response in the Yeast Sacdi rom cvi cerevisiae. J. BioL purine metabolism and amino acid metabolism by a CfitτΛ., 273, 12696 (19B8) GCN4. Our map of relation realized well the many

[7] T. Andou, S. Imoto & S. Konishi, Nonlinear regression relations among purine and amlnoacid metabolism. modeling using radial basis fonctϊoa ne work. Submitted.

5. CONCLUSION [8] J. O. Bexger, Statistical D ciaion Thtortf and Bayesian Analysis. Springer-Verlag New York (1985).

In this paper we proposed a new statistical method for estimating genetic network from microarray gene [9] C. de Boor, A Practical Guide to Splines. Springer-Verlag Berlin. (1978). expression data by using Bayesian network and non¬

[10] R. Cαwell, A. Dawid, S. aurifczen & D. Spiegelbal- parametric regression. The key idea of our method is tεr, Probabilistic Networks and. Expert Systems. Springer- the use of the nonparametric heteroscedastic regresVerlag New York (1999). sion models for capturing nonlinear relationships be[llj A. C. Davison, Ap roαάmate redi tive h&efihood- tween gene3 and heteroscedasticity of the expression Biometrika, 73, 323 (1986). . [12] N. Friedman & M. Goldszmidt, Learning Bayesian Networks with Local Structure, in . I. Jordan ed., Kluwer Academic Publisher. 421 (1998).

[13] N. Friedman, M. Linial, I. Nachman & D. Pe'er, Using Bayesian Network to Analyze Expression Data- J. Comp. Biol. 7, 601 (2000).

[14] P. J. Green fc B. W. Silverman, Nonparametric Regression and Generalized- Linear Models. Chapman & Hall (1994).

[15] I. J. Good & IL A. GasMns, Nonparametric roughness penalties for probability densities. Biomciτika, 58, 255 (1971).

[16] T. Hastie &: R. Tibshirani, Gzner lizti Additive Models. Chapman & Hall (1990).

[17] D. Heckerman, A tutorial on learning with Bayesian networks, in M.I. Jordan cd., Kluwer Academic Publisher. (1998).

[18] S- hnoto, T- Goto k. S- Miyano, Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Proc Pacific Symposium on Biocomputing £002, (2002) in Press.

[19] P. V. Jensen, An introduction to Bayesian Networks. University College London Press (1996).

[20] S. Konishi, Statistical model evaluation and information criteria, in S. Ghosh ed., Marcel Dekker (1999).

[21] J. Moody & C. J. Darken, Fast learning in. Networks of Locally-Tuned Processing Units. Ncur. Comp. 1, 281(1989).

[22] P. Pe'er, A- Regev,^' G. Elidan & N. PHedma^'n, Inferring SubncfcwOrks from Perturbed Expression Profiles. Bioinformatics, 17, Suppl.1, 215 (ISMB 2001).

[23] B. D. Riplcy, Pattern. Recognition and Neural Networks. Cambridge University Press (1996).

[24] R. J. Rolfes and A. G. Hinnebusch. Translation of the yeast transcriptional activator GCN4 is stimulated by purine limitation: implications for activation of the protein kinase GCN2. Mol. Cell Bio , 13, 5099 (1993).

[25] G- Schwarz, Estimating the dimension of a model. Ann. Statist., 6, 461 (1978).

[26] L. Tinerey & J. B. Kadane, Accurate approximations for posterior moments and marginal densities. J. Am.tr. Statist Assoc, 81, 82 (1986).

Example 3

Bayesian Network and Nonparametric Heteroscedastic Regression for Nonlmear

Modeling of Genetic Network

Abstract even the nonlinear relationships between genes clearer.

A Bayesian network [7, 23] is an effective method in

We propose a new statistical method for constructing a modeling phenomena through the joint distribution of a genetic network from microarray gene expression data by large number of random variables. In recent years, some using a Bayesian network. An essential point of Bayesian interesting works have been established in constructing genetwork construction is in the estimation ofthe conditional netic networks from microarray gene expression data by usdistribution of each random variable. We consider fitting ing Bayesian networks. Friedman and Goldszmidt [12, 13, nonparametric regression models with heterogeneous error 14] discretized the expression values and assumed multinovariances to the microarray gene expression data to capmial distributions as the candidate statistical models. Pe'er ture the nonlinear structures between genes. A problem still et al. [28] investigated the threshold value for discretizing. remains to be solved in selecting an optimal graph, which On the other hand, Friedman et al. [15] pointed out that the gives the best representation of the system among genes. discretizing probably loses information ofthe data. In fact, We theoretically derive a new graph selection criterion from the number of discretizing values and the thresholds are unBayes approach in general situations. The proposed method known parameters, which have to be estimated from the includes previous methods based on Bayesian networks. data. The resulted network strongly depends on their values. We demonstrate the effectiveness of the proposed method Then Friedman et al. [15] considered fitting linear regresthrough the analysis of Saccharomyces cerevisiae gene exsion models, which analyze the data in the continuous (see pression data newly obtained by disrupting 100 genes. also [20]). However, the assumption that the parent genes depend linearly on the objective gene is not always guaranteed. Imoto et al. [22] proposed the use of nonparametric additive regression models (see also [16, 18]) for capturing

1. Introduction not only linear dependencies but also nonlinear structures between genes. In this paper, we propose a method for con¬

Due to the development of the microarray technology, structing the genetic network by using Bayesian networks constructing genetic network receives a large amount of atand the nonparametric heteroscedastic regression, which is tention in the fields of molecular biology and bioinformatics more resistant to the effect of outliers. [3, 4, 5, 14, 15, 17, 22, 28]. However, the dimensionality Once we set the graph, we have to evaluate its goodand complexity of the data disturb the progress of the miness or closeness to the true graph, which is completely uncroarray gene expression data analysis. That is to say, the known. Hence, the construction of a suitable criterion beinformation that we want is buried in a huge amount ofthe comes the center of attention of statistical genetic network data with noise. In this paper, we propose a new statistical modeling. Friedman and Goldszmidt [14] used the BDe method for constructing a genetic network that can capture criterion, which was originally derived by [21] for choos- ing a graph. The BDe criterion only evaluates the Bayesian is the system of a living nature, a too complicated relationnetwork model based on the multinomial distributions and ship is unsuitable. In fact, this inappropriate case unforDirichlet priors. However, Friedman and Goldszmidt [14] tunately sometimes occurs in the analysis of real data. To kept the unknown hyper parameters in Dirichlet priors and avoid this problem, we consider fitting a nonparametric rewe only set up the values experimentally. We investigate gression model with heterogeneous error variances the graph selection problem as a statistical model selection or evaluation problem and theoretically derive a new cri_Xij = mji (P „0W1) + ^{■ ■ ■} + m_iq5 ( ,«V ϊ) + _£ij, (2) terion for choosing a graph using the Bayes approach (see [6]). The proposed criterion automatically optimizes all pawhere Ei_j depends independently and normally on mean rameters in the model and gives the optimal graph. In addi0 and variance σ?- and mjt(-) is a smooth function from tion, our proposed method includes the previous methods R to R. Here R denotes a set of real numbers. This for constructing genetic network based on Bayesian network. To show the effectiveness of the proposed method, we use the Monte Carlo simulation method. We also analyze gene expression data of Saccharomyces cerevisiae newly obtained by disrupting 100 genes. in the model is much larger than the number

2. Bayesian Network and Nonparametric Hetand it has a tendency toward unstable parameter estimates. eroscedastic Regression Model In this paper, we construct the smooth function rri_jk(-) by the basis functions approach

2.1. Nonlinear Bayesian network model

Suppose that we have n sets of array data {a_ι , ..., x_n} ofp genes, where Xi = (ar.-j , ..., Xi_p)^τ and x^τ denotes the transpose of x. In the Bayesian network framework, we where 7^ ', ... ,7^._fc(. are unknown coefficient parameters consider a directed acyclic graph G and Markov assumption and &ιι; (-)ι — J &M t(') ^are basis functions. From this between nodes. The joint density function is then decomposed into the conditional density of each variable, that is, representation, the n parameters τrijk Pι ), —, n.ju(j>Jb) are reparameterized by the M_jk coefficient parameters f{x_n , - ■ - , Xip) = [ strongly recommend the use of nonparametric regression instead of linear regression, because linear regression cannot decide the direction of the Bayes causality or leads to the wrong direction in many cases. We show the advantage of the proposed model compared with linear regression through a simple example. Suppose that we have data of genei and genβ₂ in Figure 1 (a). We consider the two models genei ->^■ genβ2 and gene∑ -> genei, and obtain the smoothed estimates shown in Figure 1 (b) and (c), respectively. We decide that the model (b: genei -4 genea) is better that (c: genej → genei) by the proposed criterion, which is derived in a later section (the scores ofthe models gress on strategy for capturing the nonlinear are (b) 120.6 (c) 134.8). Since we generated this data from between ary and ,- and suggested that there are many the true graph genei - genβ₂, our method yields the cornonlinear relationships between genes and the linear model rect result However, if we fit the linear regression model to hardly achieves a sufficient result. In many cases, this this data, the model (c) is chosen (the scores are (b).156.0 method can capture the objective relationships very well. (c) 135.8). The method, which is based on linear regression, When the data, however, contain outliers especially near the yields an incorrect result in this case. boundary of the domain {Pi }, nonparametric regression Consider the case that the relationship is almost linear. models sometimes lead to unsuitable smoothed estimates, Our method and linear regression can fit the data approprii.e., the estimated curve exhibits some spurious waviness ately. However, it is clearly difficult to decide the direction due to the effects of the outliers. Since what is estimated of Bayes causality. In such a case, the direction is not strict

Figure 1. Simulated data: The true causality is genei → gene₂. (a) Scatter plot of the simulated data, (b) Smoothed curve of the graph gene_x -> gene₂. (c) Smoothed curve of the graph gene₂ → genei. These curves are obtained by the proposed method.

2.2. Criterion for choosing graph

In the error variances, σ?-, we assume the structures,

Once we set a graph, the statistical model (5) based on

' «* i = l, ...,n; j = l, ...,p, (3) the Bayesian network and nonparametric regression can be constructed and be estimated by a suitable procedure. Howwhere u>χ_j, ...,w_nj are constants and σ? is an unknown paever, the problem that still remains to be solved is how we rameter. By setting up the constants wi_j, ..., w_nj in recan choose the optimal graph, which gives a best approxflecting the feature ofthe error variances, we can represent imation of the system underlying the data. Notice that we the heteroscedasticity of the data. Combining (2) and (3), cannot use the likelihood function as a model selection criwe obtain a nonparametric regression model with heterogeterion, because the value of likelihood becomes large in a neous error variances more complicated model. Hence, we need to consider the statistical approach based on the generalized or predictive error, Kullback-Leibler information, Bayes approach and so on (see e.g., [1 , 24, 25] for the statistical model selection problem). In this section, we construct a criterion for evalu¬

(4) ating a graph based on our model (5) from Bayes approach. The posterior probability of the graph is obtained by the

olution. To compute this high dimensional integration, we

ated data: The thin curves are B-splines that are

BNRChetero(G) weighted by coefficients and the thick curve is the smoothed estimate that is otained by

= -2 log G J J[ /(**; θ_G)ιr(θ_G\λ)dθ_c the linear combination of weighted B-splines.

-21ogτr_σ - rlog(2τr/n) + log \J_λ(θ_G)\ -2n (θ_G\X_n). (7) The smoothed estimates based on nonparametric het¬

The optimal graph is chosen such that the criterion eroscedastic regression are obtained by replacing the paB RC_het _ro (7) '^s minimal. The merit of the use of the rameters - by -γ .. Noticed that we derive the criterion, Laplace method is that it is not necessary to consider the BNRCh_etero, under the assumption logπ(#G|λ) = 0(τi). use ofthe conjugate prior distribution. Hence the modeling If we use the prior density satisfying log r(#_G|λ) = O(l), in the larger classes of distributions of the model and prior the BNRC_hetero score results in Schwarz's criterion known is attained. as BIC or SIC [30]. In such case, the mode Θ_G is equivalent

Suppose that the parameter vectors θ_j are to the maximum likelihood estimate.

3. Estimating Genetic Network

BNRChetero = are Wi_j = - • . -= w„ = 1 and the model has homogeneous error variances. If we use a large value of pj, the BNRC« _er = 2(» + 1) - (∑ Mj_k + 1) lo (2τr/n) error variances of the data, which exist near the boundary n on the domain of the parent variables, are large. Hence, if - ∑ log Wij + n log(2πσ?) + n there are outliers near the boundary, we can reduce their efi=l fect and gain the suitable smoothed estimates by using the appropriate value of pj. - Mj_k log(nσ?)} - log(2σ?)

3.2. Priors ?_. + ∑{(^ - 2) log (2πσ /nβjk) - log \K_jk\+

+nβ_jk ^xi_j ^T _kK_jk ^xf_jkl&}}, distribution as the prior distribution on ηf_jk, w t

,

πσ = exp{— (No. of hyper parameters)} 3.4. Learning network

= J exp{-tø + l)} = f[ π_Lj. In the Bayesian network literature, it is shown that de¬

3=1 3=1 termining the optimal network is an NP-hard problem. In

The justification of this prior is based on Akaike's Bayesian this paper, we use the greedy hill-climbing algorithm for information criterion, known as ABIC [2], and Akaike's inlearnign network as follows: formation criterion, AIC [1]. Stepl: Make the score matrix whose (i, )-th element is the BNRC^.'_tero score ofthe graph genβi - gene,-.

33. Criterion Step2: For each gene, implement one of three procedures for an edge: "add", "remove", "reverse", which gives the smallest BNRChetero- Step3: Repeat Step2 until the BNRC_het_ero does not reduce.

Generally, the greedy hill-climbing algorithm has many local minima and the result depends on the computational order of variables. To avoid this problem, we permute the computational order of genes and make many candidate learning orders in Step3. Another problem of the learning network is that the search space ofthe parent genes is enormously wide, when the number of genes is large. Then we restrict the set of the candidate parent genes based on the score matrix, which is given by Stepl.

Figure 3. The smoothed estimates by the various values of the hyper parameters. (a1): The effect of hyperparameter βj . in the prior distribution of the coefficients of B-splines. This parameter can control the smoothness of the fited curve. (b1) and (d): The effect of hyperparamter pj in the parameter of the error variances. This parameter can capture the heteroscedastisity of the data and can reduce the effects of outliers.

3.5. Hyper parameters optimal value of the hyper parameter as the minimizer of the B RC _ete _o and the optimal smoothed estimate (solid

Consider the nonparametric regression model defined in curve in Figure 3 (al)) can capture the structure between (4). The estimate θ₃ is a mode of l , (Θ₃\X„) and depends these genes well. The dashed and dotted curves are near the on the hyper parameters. In fact, the hyper parameter plays maximum likelihood estimate and the parametric linear fit, an essential role for estimating the smoothed curve. respectively.

In our model, we construct the nonparametric regresThe effect ofthe weight constants ...,w_nj is shown sion model by 20 B-splines. We confirmed that the differin Figure 3 (bl) and (cl). If we use the nonparametric ho- ences of the smoothed estimates against the various nummoscedastic regression model [22], we obtain the dashed ber of the basis functions cannot be found visually. Becurve, which exhibits some spurious waviness due to the cause when we use a somewhat large number of the basis effect ofthe data in the upper-left corner (bl . By adjusting functions, the hyper parameters control the smoothness of the hyper parameter pj in (9), the estimated curve results the fitted curves. Figure 3 (al) shows the scatter plot of in the solid curve. The optimal value of p_j is also chosen YGL237C and YEL071W with smoothed estimates for 3 by minimizing the BNRC_het_ero criterion (see Figure 3 (b2) different values ofthe hyper parameters. The details ofthe and (c2)). Of course, when the smoothed estimate is propdata are shown in later section. Clearly, the smoothed estierly obtained, the optimal value of ρ_} tends to zero. mate strongly depends on the values of the hyper parameFinally, we show the algorithm for estimating the ters. Figure 3 (a2) is the behavior ofthe BNRC_{het r} critesmoothed curve and optimizing the hyper parameters. rion ofthe two genes in Figure 3 (al). We can choose the Stepl : Fix the hyper parameter p_j. Step2: Initialize: _jfc = 0, k = 1, ..., q_j. Carlo simulation was repeated 1000 times and we focused

Step3: Find the optimal _k by repeating Step3-1 and on the number of correct estimations. Figure 4 (b) and (c)

Step3-2 are the results of the Monte Carlo simulations for s = 0.2

Step3-1: Compute: and s = 0.1, respectively.

effectiveness of our method. The data ware generated from the artificial network of Figure 4 (a) with the functional Table 1. The false positives ofthe Monte Carlo structures between nodes as follows: simulations. The number attached after node name is the number of estimated connec¬

Xt = Xi + 2 sin( ₅) - 2X₇ + εi, ε, ~ N(0, (4s)²), tion without direction information and the perx₂ = {l + e_Xp(- X₃)r¹ +e2, ε₂ ~N(Q,s²), centage is the direction information. For example, in s=0.2, the proposed method esti¬

Xz = es„ e_s ~ -V(0, l), mated the relationship "X1 → X4" or "X1 <-

Xf/3 + ε₄, ε₄ ~ N(0, (4s)²), X4" 15 times from 1000 Monte Carlo simu₆ = X₃ - Xl + εs, ε₅ ~ iV(0, (4s)²), lations and 87% of 15 times represents the χ₆ = ε₆, ε₆ ~ JV(0, l), direction from left to right ("X1 → X4").

where s is a constant. After transfoiming the observations to the real microarray data in the setting s = 0.2, we can of the parent variables to mean 0 and variance 1, then the expect that our network estimation method can work effecobservations ofthe child variable are generated. tively in the real data analysis. From Figure 4 (b) and (c)

We generate 100 observations from this artificial netand Table 1 , the number of true negatives is much less than work and our aim is to rebuild the network in Figure 4 (a) the number of false positives. We believe that this tendency from the simulated data. We use two different settings ofthe is preferred in the exploratory data analysis. In the setting noise variance, one is s = 0.2 and another is s = 0.1. The s — 0.1, our model can rebuild the target network more preobservations from the setting of the noise s — 0.2 are ex- cisely, and the number, of false positives decrease compared perientially similar to the real microarray data. The Monte with the result of s = 0.2.

Figure 4. The results of the Monte Carlo simulations, (a) True network, (b) Result for s=0.2. (c) Result for s=0.1. The number next to edge represents the number of estimated connections from 1000 Monte Carlo experiments. The percentage includes the information of the edge direction. For example, in s=0.2, the connection between X5 and X1 appeared 958 times from 1000 Monte Carlo experiments and those 97% is the correct direction (from X5 to X1).

4.2. Real Data Analysis those 94 factors were selected from 5871 profiles.

In this section we show the effectiveness of our proposed Baslp and Bas2p also activate expression of three genes method through the analysis of Saccharomyces cerevisiae in the histidine biosynthesis pathway. In a gcn4 gack- gene expression data, which is newly obtained by disrupting ground, mutations that abolish the BAS1 or BAS2 function 100 genes. Our research group has installed a systematic exlead to a histidine auxotrophy. Previous investigation indiperimental method, which observes changes in the exprescated that Baslp and Bas2p are DNA binding proteins resion levels of genes on a microarray by gene disruption. By quired for transcription of HIS4 and these ADE genes like using this method, we have launched a project whose purGCN4 [8, 11, 29]. In this paper, we made clear that both gepose is to reveal the gene regulatory networks between the netic relation. Figure 4 indicates that those ADE genes and 5871 genes of Saccharomyces cerevisiae. Many laboratohistidine biosynethesis genes are related with BAS1 more ries have also reported similar projects. We have already directly than GCN4. The ribose component of purine ricollected a large number of expression profiles from gene bonucleotides is derived from ribose 5-P, an inter mediate disruption experiments to evaluate genetic regulatory netofthe pentose phosphate cycle. The atoms of the base moiworks. Over 400 mutants are stocked and gene expression ety are contributed by many compounds. They are added profiles are accumulating. step wise to the preformed ribose. There exist striking in¬

We monitored the transcriptional level of 5871 genes terrelationships with the pathway for histidine synthesis. spotted on a microarray by a scanner. The expression profiles of over 400 disruptants were stored in our database. The standard deviation (SD) of the levels of all genes on Studies on the regulation of the purine biosiynthesis a microarray was evaluated. The value of SD represents pathway in Saccharomyces cerevisiae revealed that all the roughly the experimental error. In our data, we estimated genes encoding enzymes required for AMP de novo biosynthe value of 0.5 as the critical point of the accuracy of exthesis are repressed at transcriptional level by the presence periments. We have evaluated the accuracy of those profiles of extracellular purines. ADE genes are transcriptionally acon the base ofthe standard deviation ofthe expression ratio tivated as well as some histidine biosynthesis genes. Espeof all genes. 107 disruptants including 68 mutants where the cially the fact that expression of HIS4 is related with ADE transcription factors were disrupted could be selected from genes were known. In our regulated network, HIS4 were 400 profiles. related with some ADE genes closely, and some HIS genes

We used 100 microarrays and constructed a genetic netare related with ADE genes like HIS4. The biosynthesis of work of 521 genes from the above data. The 94 transcription the essential amino acid histidine shows in Saccharomyces factors whose regulating genes have been clearly identified cerevisiae shows close connection to purine metabolism, were found. The profiles of the 521 genes in control by and our result satisfied this fact.

Figure 5. The resulting partial network of the analysis of 521 Saccharomyces cerevisiae genes. 5. Conclusion [4] T. Akutsu, S. Miyano and S. Kuhara. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727- 734, 2000.

In this paper we proposed a new statistical method for estimating a genetic network from microarray gene expres[5] T. Akutsu, S. Miyano and S. Kuhara. Algorithms for Identifying Boolean Networks and Related Biological Networks Based on Masion data by using a Bayesian network and nonparametric trix Multiplication and Fingerprint Function. J. Comp. Biol, 7, 331- regression. The key idea of our method is the use of non₃44, 2000. parametric heteroscedastic regression models for capturing [6] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. nonlinear relationships between genes and heteroscedasticSpringer-Verlag New York, 1985. ity ofthe expression data. If we have a network that repre[7] R. Cowell, A. Dawid, S. auritzen and D. Spϊegelhalter. Probabilistic sents the causal relationship among genes, we can simulate Networks and Expert Systems. Springer-Verlag New York, 1999. the genetic system on the computer, e.g., Genomic Object [8] B. Daignan-Fornϊer and G.R. Fink. Coregulation of purine and hisNet [26, 27]. In this stage, it is required that the relationtidine biosynthesis by the transcription activator BAS1 and BAS2. Proc. Natl. Acad. Sci. USA, 89, 6746-6750, 1992. ships between genes are suitably estimated. In this sense, the proposed heteroscedastic model can give an essential [9] A.C. Davison. Approximate predictive likelihood. Biometrika, 73, 32₃-332, 1986. improvement, because the previous models sometimes lead

[10] C. De Boor. A Practical Guide to Splines. Springer-Verlag Berlin, to unsuitable estimates ofthe systems. We consider the sim1978. ulation of biological system as a future work. [11] V. Denis, H. Boucherie, C. Monribot and B. Daignan-Fomϊer. Role of

An essential problem for network construction is the the Myb-like protein Basl p in Saccharomyces cerevisiae: a proteome evaluation of the graph. We investigated this problem as analysis. Mol. Micmbiol., 30, 556-566, 1998. a statistical model selection or evaluation problem and de[12] N. Friedman and M. Goldszmidt. Discretizing Continuous Attributes rived the new criterion for selecting graph from Bayes While Learning Bayesian Networks. Proc. 13 'th International Conapproach. Our method covers the previous methods for ference on Machine Learning, 157-165, 1996. constructing genetic networks by using Bayesian networks [13] fj. Friedman and M. Goldszmidt. Learning Bayesian Networks with and improves them in the theoretical and methodological Local Structure. Proc. Twelfth Conf. on Uncertainty in Artificial Intelligence, 252-262, 1 96. senses. The proposed method successfully extracts the ef¬

[1 ] N. Friedman and M. Goldszmidt. Learning Bayesian Networks with fective information and we can find these information in Local Structure, in M.l. Jordan ed., Kluwer Academic Publisher, the resulting genetic network visually. We use the simple 1998. greedy algorithm for learning network. However, this algo[15] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian rithm needs much time for determining the optimal graph. Network to Analyze Expression Data. J. Comp. BioL, 7, 601-620, Hence, the development of a better algorithm is one of the 2000. important problems and we would like to discuss it in a fu[16] PJ. Green and B.W. Silverman. Nonparametric Regression and Genture paper. eralized Linear Models. Chapman & Hall, 1994.

We showed the effectiveness of our method through the [17] A.J. Hartemink, D.K. Gilford, T.S. Jaakkola and R.A. Young (2002). Monte Carlo simulations and the analysis of Saccharomyces Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models. Proc. Pacific Symposium on cerevisiae gene expression data and evaluated the resulting Biocomputing, 7, 437-449, 2002. network by comparing with biological knowledge. We con[18] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman struct the genetic network without using biological informa& Hall, 1990. tion. Nevertheless, the resulting network includes many im[19] D. Heckerman. A tutorial on learning with Bayesian networks, OT portant connections, which agree with the biological knowlM.I. Jordan ed., Kluwer Academic Publisher, 1998. edge. Hence, we expect that our method can demonstrate its [20] D. Heckerman and D. Geiger. Learning Bayesian networks: a unipower in the analysis of a completely unknown system, like fication for discrete and Gaussian domains. Proc. Eleventh Conf. on the human genome. Uncertainty in Artificial Intelligence, 274-284, 1995.

[21] D, Heckerman, D. Geiger and D.M. Chickering. Learning Bayesian Networks: The combination of knowledge and statistical data. Ma¬

References chine Learning, 20, 197-243, 1995.

[22] S. Imoto, T. Goto and S. Miyano. Estimation of genetic networks and

[11 H. Akaike. Information theory and an extension of the maximum functional structures between genes by using Bayesian networks and likelihood principle, in B.N. Petrov, & F. Csaki, eds., Akademiai Ki- nonparametric regression. Proc Pacific Symposium on Biocomputado, Budapest, ₂67-₂81, 1973- ing, !, 17 -186, 2002.

[2] H. Akaike. Likelihood and the Bayes procedure, in J.M. Bernardo,

[23] F.V. Jensen. An introduction to Bayesian Networks. University Col.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds., Univ. Press, lege London Press, 1996. Valencia, 41-166, 1980.

[3] T. Akutsu, S. Miyano and S. Kuhara, S. Identification of Genetic Net[24] S. Konishi. Statistical model evaluation and information criteria, in works from a Small Number of Gene Expression Patterns Under the S. Ghosh ed., Marcel Dekker, 1999. Boolean Network Model. Proc. Pacific Symposium on Biocomputing, [25] S. Konishi and G. Kitagawa. Generalised information criteria in 4, 17-28, 1999. model selection. Biometrika, 83, 875-890, 1996. [26] H. Matsuno, A- Doi, M. Nagasaki and S. Miyano. Hybrid Petri Net representation of gene regulatory network. Proc Pacific Symposium on Biocomputing, 5, 338-349, 2000.

[27] H. Matsuno, A. Doi, Y. Hirata. and S. Miyano. XML Documentation of Biopathways and Their Simulations in Genomic Object Net Genome Informatics, 1254-62, 2001.

[28] D. Pe'er, A. Regev, G. Elidan and N. Friedman. Inferring Subnetworks from Perturbed Expression Profiles. Bioinformatics, 17, Suppl.1 (ISMB 2001), 215-224, 2001.

[29] RJ. Rolfes and A.G. Hiimebusch. Translation ofthe yeast transcriptional activator GCN4 is stimulated by purine limitation: implications for activation of the protein kinase GCN2. Mol. Cell Biol, 13, 5099-5111, 1993.

[30] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6, 46M64, 1978.

[31] L. Tinerey and J.B- Kadane. Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc, 81, 82-86, 1986.

Example 4

Statistical analysis of a small set of time-ordered

gene expression data using linear splines

ABSTRACT expression is considered by measuring gene expression

Motivation: Recently, the temporal response of genes to levels at a limited number of points in time. Gene expression changes in their environment has been investigated using levels that vary periodically have for instance been cDNA microarτay technology by measuring the gene measuied during the cell cycle of the yeast Saccharomyces expression levels at a small number of time points. cerevisiae (Spellman et al, 1998). The gene expression Conventional techniques for time series analysis are not levels of the same yeast species were measured during the suitable for such a short series of time-ordered data. The metabolic shift from fermentation to respiration (DeRisi et analysis of gene expression data has therefore usually al, 1 97). In this experiment, the environmental conditions been limited to a fold-change analysis, instead of a were changing slowly over time. Conversely, the gene systematic statistical approach. response to an abrυptly changing environment can be

Methods: We use the maximum likelihood method measured. As an example, the gene expression levels of the together with Akaike's Information Criterion to fit linear cyanobacterium Synechocystis sp. PCC 6803 were measured splines to a small set of time-ordered gene expression data in order to infer statistically meaningful information at several points in time after a sudden shift from low light from the measurements. The significance of measured to high light (Hihara, 2001). gene expression data is assessed using Student's West. In cDNA microarray experiments, gene expression levels Results: Previous gene expression measurements of the are typically measured at a small number of time points. cyanobacterium Synechocystis sp. PCC6803 were Conventional techniques of time series analysis, such as reanalysed using linear splines. The temporal response Fourier analysis or autoregressive or moving-average was identified of many genes that had been missed by a modelling, are not suitable for such a small number of data fold-change analysis. Based on our statistical analysis, we points. Instead, the. gene expression data are often analysed found that about four gene expression measurements or by clustering techniques or by considering the relative more are needed at each time point change in the gene expression level only. Such a fold-change Contact: rndehoon@ims.u-tokyo.ac.jp analysis may miss significant changes in gene expression levels, while it may inadvertently attribute significance to

1 INTRODUCTION measurements dominated by noise. In addition, a fold-change analysis may not be able to identify important

In recent years, many cDNA microarray experiments have features in the temporal gene expression response. been performed measuring gene expression levels under Several techniques to analyse gene expression data, such different conditions. Measured gene expression data have as deriving Boolean or Bayesian networks, have been become widely available in publicly accessible databases, proposed in the past (Liang et al , 1 98; Akutsu et al, 2000; such as the KEGG database Nakao et αl, 1999). Friedman et al; 2000, Imoto et al, 2002). Whereas

In some of these experiments, the steady-state gene describing gene interactions in terms of a regulatory expression levels are measured under several environmental network is very important, deriving a network model conditions. For instance, the expression levels of the requires gene expression data at a large number of time cyaDobacterium Synechocystis sp. PCC6803 and a mutant points, which is currently often not yet available It should^' have been measured at different temperatures, leading to the be noted that the number of genes is on the order of several identification ofthe gene Hϊk33 as a potential cold sensor in thousands,' while the gene expression levels are often this cyanobacterium (Suzuki et αl., 2001). measured at only five or ten points in time.

In other experiments, the temporal pattern of gene So far, a systematic method has been lacking to Analysing expression data using linear splines statistically analyse gene expression measurements from a 2.2 Analysing time-ordered data vising linear splines small number of time-ordered data. In this paper, we will Next, we analyse the temporal gene expression response for outline a strategy based on fitting linear spline functions to genes that were found to be significantly affected. The time-ordered data using the maximum likelihood method measured gene expression ratios foπn a small set of and Akaike's Information Criterion (Akaike, 1971). The time-ordered data, to which we can fit a linear spline significance of the gene expression measurements is function. A linear spline function is a continuous function assessed by applying Student's t-test. This allows us to infer consisting of piecewise linear functions, which are information from gene expression measurements while connected to each other at knots (Friedman and Silverman, taking the statistical significance of the data into 1989; Higuchi, 1999). Whereas cubic splines are used more consideration This kind of analysis should be viewed as a commonly, for the small number of data points we are first step toward building gene regulatory networks. As an dealing with linear spline functions are more suitable. A example, we reanalysed the gene expression measurements conceptual example of a linear spline function with knots of the cyanobacterium Synechocystis sp. PCC 6803 (Hihara, t ^* , t] , t₄' is shown in Figure 1. 2001). It is shown that information can be inferred from the measured data that is missed when considering the Consider a set of data points («,-, JC,), / e {1,..., n}. We fold-change only. By repeating our analysis with a'subset of wish to fit a nonparametric regression model of the form the available data, we were able to determine how many measurements are needed at each time point in order to *; = g( ) + 3 (2) estimate the linear spline function reliably. to these data, in which g is the linear spline function and §is

2 METHODS an independent random variable with a normal distribution with zero mean and variance &.

2.1 Student's t-test We estimate the linear spline function g using the

First, we would like to assess if the measured gene maximum likelihood method. The probability distribution of expression ratios are si nificandy different from unity. If for one data point X_j, given t_p is a specific gene we can conclude that for all time points the measured expression ratios are not significantly different from unity, we can eliminate that gene from further analyses. (3) The significance level can be established by applying Student's f-test for each time point separately. Since multiple comparisons are being made for each gene, the value of the significance level or should be chosen carefully. The log-likelihood function for the n data points is then

We define Ho⁽ⁱ as the hypothesis that for a given gene the given by expression ratio is equal to unity at a given time point fy, and Ho as the hypothesis that for a given gene (he expression -,( ) . (4) ratios at all time points are equal to unity. If we denote a as. the significance level for rejection of hypothesis Ho, and α ' as the significance level for rejection of hypothesis Ho^(,), The maximum likelihood estimate of the variance <? can then α' and αare related via be found by maximising the log-likelihood function with respect to <?. This yields l -«= (l- α')\ (1) in which α is the number of time points at which the gene expression ratio was measured. Note that by expanding the right hand side in a first order Taylor series, this equation reduces to Bonferoni's method for adjusting significance levels (see also Anderson and Finn, 1996).

By performing Student's f-test at every time point for each gene, using ob as the significance level, we will find whether Ho⁰ and therefore Ho should be rejected. If Ho is not rejected, we can conclude that that gene is not significantly affected by the experimental manipulation, and should therefore not be included in further analyses. If for a given gene the null hypothesis Ho is rejected, we conclude '» fe ^fs '« Time r that that gene was significantly affected by the experimental manipulation. Fig. 1. A conceptual diagram of a linear spline Junction fitted to measured data. Analysing expression data using linear splines

Priestley, 1994) AlC= -2L{g, σ² )+ 2(q+Ϋj, (15)

The log-likelihood function can then be written in the form where q + 1 is the number of estimated parameters ( c ² and the q entries in the estimated vector | ). Substituting the estimated log-likelihood function from equation (6), we find

The maximum likelihood estimate J of the linear spline A/C = >ι ln[2πσ²]÷π + 2<j-!-2 _ (16) function g can now be found by minimising c ¹. It can be shown that the' minimum value of c ^z will be achieved if in which c ² is given by equation (5) after substitution of the linear spline function is chosen such that the maximum likelihood estimate g for the linear spline function g.

Ag = b , (7) For each value of q, we will calculate the value of the AIC after fitting the linear spline function as described above, in which is a vector containing the q values of the linear and select the value of q that yields the minimum value of the AIC. The case q = 2 corresponds to linear regression. For spline function at the knots A is a tridiagonal symmetric the special case q — 1, we are effectively fitting a flat line to matrix given by the data. If we find that for a particular gene, the minimum AIC is achieved for the constant function (q = 1), then we can conclude that the expression level of that gene was . = . ∑ (8) unaffected by the experimental manipulations. j: ,[ <t₁<tl Gene expression data are typically given in term of expression ratios. At time zero, the expression ratio is equal to unity by definition. This fixed point can be incorporated

A, easily in our methodology by modifying equation (7). It can be shown that the minimum value of c ²' will now be (9) achieved by choosing the linear spline function such that

and & is a vector given by in which b_z' sb₂ - a . an , = 1. t-t,

*.= Σ T^ (12)

3 RESULTS

3.1 Student's t-test

We will illustrate using Student's f-test and fitting linear

Σ ^ x,+ «m -* spline functions by reanalysing the measured gene r.'l_t trf ^tι ^tι-ι j::i ι_i<ri._t h+l *i .expression profile of the cyanobacterium sp. PCC 6803 after for l < f< (13) a sudden exposure to high light (HL) (Hihara et al, 2001). The expression levels of 3079 ORFs were measured at zero, fifteen minutes, one hour, six hours, and fifteen hours both

(14) for cyanobacteria exposed to HL and cyanobacteria that remained in the low light (LL) condition. Table 1 shows the number of measurements at each time point Data from the

The fitted model depends on the number of knots q. The cDNA expression measurements were obtained from the number of knots can be chosen using Akaike's Information KEGG database (Nakao et al., 1999). Criterion, known as the AIC (Akaike, 1971; see also Analysing expression data using linear splines

Table l.The gene expression measurements as a time-ordered set of data. one and five, with a fixed knot equal to unity at time zero. For q - 3 and q = 4, three possibilities exist for the

Time point Number of measurements placement of the knots between the linear segments of the linear spline. These are indicated in Figure 2, together with

; = Ij in^'Duies 4 (*) r = 1 hour 6 the cases q - 1, q = 2, and q - 5. Notice that the number of ; = 6 hours 4 possible knot placements increases exponentially with the 1= 15 hours maximum number of knots ^_πa, as H 2'"^"2.

(*): Two sees of measurements are missing in the KEGG database at the / = In the fold-change analysis, the temporal gene expression minutes time point. patterns were classified into six categories (Hihara, 2001), listed in Table 3. Fitting a linear spiihe function to the

Table 2. Student's f-test of gene expression measurements (3079 ORFs total). measured gene expression data provides a move flexible way to describe the data than categorising. In addition, a

Significance level Number of ORFs numerical description of the gene expression response pattern is an important initial step in deriving a gene p < 0.0003 98 regulatory network.

0.0003 <p < 0.001 69

0.001 <p < 0.01 320

0.01 <p <0.05 517 p >0.05 2075

It should be mentioned that the data used for the original analysis (Hihara, 2001) may not be identical to the raw data flat line H -I submitted to KEGG (Hihara, personal communication). In addition, two measurement sets out of six at the t = 15 minutes time point are missing in the KEGG database. q = 2 _*- -* Recalculating the gene expression ratios from the raw data gives numbers close to the previously published results though. ^•

After subtracting the background signal intensities from f*- -*- -* thd^" HL and LL raw data, global normalisation was applied and the ratio of the HL to the LL signal intensities was <7 = 3 X- -* calculated to find the relative change in the gene expression level with respect to the control (LL) condition. In the fold-change analysis, a gene was regarded as being affected by HL if its expression level changed by a factor of two or Vx- -*- -X more. The statistical significance of such changes was assessed heuristically by considering the size of the standard deviation ofthe measurements (Hihara, 2001).

The results of Student's /-test on the gene expression ratios of each gene separately are shown in Table 2. At a q = 4 significance level of a — 0.001, 167 genes were found to be significantly affected by ML condition. Note that we would expect about three type-1 errors among these 167 genes. In comparison, 164 ORFs were found to be affected by the HL condition in the original analysis (Hihara, 2001).

By considering the fold-change for the psbD2 gene q = 5 X- -*- -*r- -X (slr0927), it was concluded that it was not significantly induced by HL (Hihara, 2001). This was remarkable, since Fig. 2. Possible placement of knots for a time-ordered set of measurements at this gene had been reported to be inducible by HL in the live time points. cyanobacterium Synechococc s sp. PCC 7942 (Bustos and Golden, 1992; Anandan and Golden, 1997). However, Table 3. Six categories were used in the fold-change analysis to classify the performing Student's i-test on the gene expression data for temporal pattern of gene expression (Hihara, 2001). the psbD2 gene at t = 6 hours yields p = 3.3 X 10^"s, suggesting that this gene was indeed affected by HL. Typel Induced within 15 minutes, then decreased

Type2 Induced continuously at high levels Type 3 Induced at approximately one hour

3.2 Analysis using linear spline functions ^■__ype4 Repressed within 15 πύnutεs, then increased T p=5

Next, we fit linear spline functions to the measured gene Repressed continuously at low levels Typeβ Repressed at approximately one hour expression ratios. The number of knots q will be between Analysing expression data using linear splines

We will illustrate the usage of the AIC with an example. Table 5. The number of genes for each response pattern. In the fold-change analysis, the threorine synthase gene thrC (slU688) was found to be repressed at approximately one Placement of biots Number of genes hour. The calculated values of the AIC for the different sets of knots are listed in Table 4. The minimum AIC was Keeping Removing Uang /-test to all genes lowest 2000 remove genes achieved for knots at 0, 15 minutes, 1 hour, and 15 hours. Figure 3 shows the measured gene expression levels, Flat line 927 249 1 (')

(0, 15 hours) 280 69 3 together with the linear spline that was fitted to the data. (0, 15 miπiues, 15 hours) 663 217 61

Performing this procedure for all the gene expression (0, 1 hours, 15 hours) 386 178 7 measurements lets us classify the different genes based on (0, 6 hours, 15 hours) :6_> 64 3

(0, 15 minutes. I hours. 296 119 27 their time-dependent response to HL. Several choices can be 15 hours) made as to which genes to include in this analysis. In the (ft 15 minutes, 6 hours, 178 83 39 original analysis, genes were removed from the calculation 15 hours)

(0, 1 hour, 6 hours, 15 103 49 12 if their expression levels were among the 2000 lowest out of hours) 3079 ORFs (Hihara, 2001). Alternatively, we can eliminate (0, 15 minutes, 1 hour, 6 86 51 14 genes if Student's r-test showed that they were not hours, 15 hours)

(*): Removed because of the presence of an outlier. significantly affected by HL. Table 5 shows the number of genes whose measured expression levels correspond to each response pattern for these different cases. For Student's f-test, We can compare the 167 genes which were identified by a significance level or = 0.001 was used. Student's t-t st as significantly affected by HL with the results from the fold-change analysis (Hihara, 2001),- in which 164 ORFs were identified. First, we remove those

Table 4. The calculated AIC for the different sets of knots for the threorine synthase gene thrC slllόSS). genes from our analysis for which an outlier was present in the data. We define an outlier as a data point that deviates

Placement of knots AIC more than two standard deviations from the mean of the data at a given time point Only .one gene was found for which

Flat line 62.55

(0, 15 hours) 64.54 the measured expression data contained an outlier; the linear

(0, 15 minutes, 15 hours) 64 54 spline function fitted to its expression data was a flat line.

(0, 1 hours, 15 hours) 66.24 None of the other gene expression levels was described by a

(0, 6 hours, 15 hours) 66.38

(0, 15 minutes, 1 hounr, 15 hours) 25.24 flat line, which is consistent with the results from Student's

(0, 15 minutes, 6 hours, 15 hours) 64.69 Mest.

(0, 1 hour, 6 hours, 15 hours) 68.23 Next, we remove those genes whose expression level was

<0, 15 miputes, 1 hour, 6 hours, 15 hours) 26.45 among the lowest 2000 in order to avoid using data that are predominantly determined by noise. The same procedure had been used for the fold-change analysis (Hibara, 2001). After removal of these genes, 107 genes remained that were significantly affected by HL.

Of the 107 genes, 42 had not been identified in the fold-change analysis (Hihara, 2001). These genes are listed in Table 6, together with the location of the knots that was found for each gene. For each linear spline function the percentage variance explained was calculated as a measure of the goodness of fit. As an example, Figure 4 shows the measured gene expression ratio as well as the fitted linear spline function of the gene xylR ^fslr0329), having four knots at zero, fifteen minutes, one hour, and fifteen hours. Of the 42 genes, the gene xylR (slr0329) had the largest percentage

Fig.3. The measured gene expression levels for the threorine synthase gene ΛrC variance explained (98.7%). (Ml 688}, together with the fitted linear spline. Of the 164 ORFs that were identified in the fold-change analysis, 39 were not significantly affected by HL according to Student's /-test at a significance le el a = 0.001. These ORFs are listed in Table 7. Analysing expression data using linear splines

Table 6 ORFs that were significantly affected by HL, but had not been Tiible 7. ORFs that were identified to he affected by HL m the fold-change identified in the fold-change analysis (Hihara, 2001) analysis (Hihara, 2001), but found not significant by Student's /-test ene Placement of lenots % variance slrlϊl l

ORF G sill 867 SII1732 explained sill 733 SI1I734 slrl280 S1J0170 S110430 S1I1514 slrl641

S111330 {0, 15 nun., 1 hours) 918 slrl350 slrl516 slι 228 0, 15 nun , 15 hours) 614 s!rl604

5111797 ycl21 ( slrI675 (0, IS min , 15 hours) 787 SS13044 si! 1483 S1I1541

S1Λ121 slr0343 petD (0 15 nun., 1 hours) 935 SII20I2 Slr0394 Slr0426 (0 15 rar , ' 5 hows) 81 i slrl351 slrl594

S1Λ442 S110217 S110218 sW7S9 (0 15 -mn. . Sams) 647 S110219 S110528 SII066S S110814 S110846 slrt>906 psbB (0, 15 nun , .; Fours) 946 sill 884 slrf)007 S1Λ476 slrll28 (0, 15 mm., 1 D hours) 637 5lr0740 7 s)r0959 slrl544 slrlI52 (0, 15 nun . 15 hours) 60 εlrl674 slr!686 slrl277 gspD (0, 15 mm, 15 hours) 405 slrl963 slrlβlO (0, 15 min., 15 hours) 702 slrl813 (0, 15 mm, 15 hours) 843 slr2052 (0, 15 nun., 15 hours) 871 Table 8 Reliability of the linear spline fitting procedure as a function of number

S110144 pyrHor (0, 1 hour. 15 hours) 885 of data used Only those genes were considered which were significantly smbA affected by HL according to Student's f-test, and whose data did not contain any si! 1327 atpC (0, 1 hour, 15 boras) 703 outheis.

S11I496 (0, 1 hour, 15 lours) 774 slrl367 ElgPor (0, 1 hour, 15 hours) 882 glgY Using four data points at 15 mm , 6 hours, linear spline functions are slrl793 tαlB (0, 1 hour, 15 hours) 870 and 15 hours, and six data points at 1 hour estimated for 166 genes sill 878 (0 6 hours, 15 hours) 658 slrlδOO (0, 6 hours. 1 hours) 465 Using four data points at each time point 25 ± 8 estimated differently

SU085I psbC (0, 15 min , 1 hour, 15 hours) 91 9

S1I1471 cpcG (0, 15 nun-, 1 hour, 15 hours) 966

S111812 ιps5 (0, 35 rran I hour, 15 hours) 603 Using three data points at each time point 53 ± 10 estimaled differently slr0076 (0, 15 iron , 1 hour, 15 hours) 858 εlr0329 xylR (0, 15 min, 1 hour, 15 hours) 987

Using two data points at each time point 79± 17 estimated differently slrt>927 psWD2 (0, 15 min 1 hour, 15 hours) 489 slrl513 (0, 15 nun, 1 hour, 15 hours) 848

S110547 (0, 15 iron , 6 hours 15 hours) 793

S1I1528 (0, 15 ran , 6 houra, 15 hours) 909 slrt056 G4 (0, 15 nan , 6 hours, 15 hours) 766 Finally, we would like to establish whether the number of slrill6I pilT (0, 15 tmn , 6 hours, 15 oujs) 962 measurements at each time point was sufficient to reliably slrl239 pntA (0, 15 πmn , 6 houis, 15 hours) 967 determine the placements of knots for the linear spline slrl5 5 (0, 15 rran , 6 hours, 15 hours) 967 slrl799 (0, 15 iran , 6 hours, 15 hours) 955 function. To do so, we repeated the estimation of the linear

S110617 iπflO (0, 1 hour, 6 hours, 15 hours) 849 spline function using subsets of the measured data We then

SII0921 (0, 1 hour, 6 hours, 15 hours) 469 counted for how many genes the estimated knot positions

SII0985 (0, 1 hour, 6houπ,15 hours) 892 slrl237 codA (0, 1 hour, 6 hours, 15 hours) 807 changed if a subset of the data was used instead of the

SllOJOS amtl (0, 15 nun, 1 hoar 6 bmrs, 15 hours) 844 complete set of data. The average and standard deviation of s!10185 (0, 15 nun, 1 hour, 6 hours, 15 hours) 949 this number for four, three, and two data points at each time

5111873 (0, 15 nun, I hour, 6 hours, 15 hours) 920 slrυllδ hemD (0, 15 min, 1 hour 6 hours, 15 hours) 892 point is shown in Table 8.

Even if only two data points (at the one hour time point) are removed, and four data points are used at each time point, in 15% of the cases the estimated not positions change. This suggests that four or more data points are needed for each time point to reliably deduce information from gene expression measurements.

4 DISCUSSION

We have described a strategy based on maximum likelihood methods to analyse a set of time-ordered measurements By applying Student's t-test to the measured gene expression data, we first establish which of the measured genes are significantly affected by the experimental manipulation. TTie expression responses of those genes are then described by fitting a linear εphne function. The number of knots to be

F5g. 4. The measured gene expression levels for the gene xylR (slr0329), used for the linear spline function is determined using together with the fitted linear spline This gene was considered to be unaffected by H tn the fold-change analysis (Hihara, 2001) Akaike's Information Criterion (AIC).

Using linear spline functions allows us more flexibility in descπbing the measured gene expression than using a nominal classification Also, in order to set up a gene Analysing expression dala using linear splines regulatory network, it is important that the gene response as Bustos.S.A. and Golden.S.S. (1992) Light-regulated determined iirom gene expression measurements is available expression of the psbD gene family in Synechococcus sp. in a numerical form. Finally, the positions of the knots strain PCC 7942: Evidence for the role of duplicated specify those time points at which the expression of a gene psbD genes in cyanobacteria. Mol Gen. Genet., 232, changes markedly, which is important in identifying its 221-230. biological function. As a next step, the classification of gene DeRisU.L., Iyer.V.R. and Brown, P.O. (1997) Exploring the expression responses based on the position of the knots can Metabolic and Genetic Control of Gene Expression on a be refined by creating subcategories that take the magnitude Genomic Scale. Science. 278. 680-6S6. of the linear spline function at the knots into account. For Friedman JF H and SilvermaaB.W. (1989) F.sxible instance, for linear spline functions with three knots we may parsimonious smoothing and additive modeling (with consider creating six subcategories in which changes in the discussion). Jechnometήcs, 31, 3-39. gene expression level are described by (flat, increasing), Friedman,N., Linial.M., Nachman . and Pe'er.D. (2000) (flat, decreasing), (increasing, flat), (decreasing, flat) Using Bayesian networks to analyze expression data. J. (increasing, decreasing), or (decreasing, increasing). Comp. Biol, 7, 601-620.

Applying the technique of linear spline functions to Kguchi,T_ (1999) Automatic identification of the large scale measured gene expression data, we were able to identify the field aligned current systems as an example of knowledge temporal expression response pattern of genes that were discovery from the large database (in Japanese). significantly affected by the experimental manipulations. Proceedings ofthe Institute of Statistical Mathematics, 47, The response of 42 of those genes was not noticed in earlier 291-306. fold-change analyses of expression data. Furthermore, it was Hihara.Y, Kamei,A-. Kanehisa,M., Kaplan,A. and shown that the expression response levels found in a Ikeuchi,M. (2001) DNA microarray analysis of fold-change analysis were not found to be significant by cyanobacterial gene expression during acclimation to high Student's f-test for 33 out of 164 genes some genes. light. The Plant Cell, 13, 793-806.

Gene expression data tend to be noisy and are often Liang.S., Fuhrman.S. and SomogyiJ .. (1998) REVEAL, a plagued by outliers. Whereas Student's f-test and maximum general reverse engineering algorithm for inference of likelihood methods described here take the statistical genetic network architectures. Proc. Vac. Symp. on significance of noisy data into account, the issue of outliers Biocomputing, 3, 18-29. needs to be addressed separately. As a simple procedure to Imoto.S. Goto.T. and iyano,S. (2002) Estimation of remove outliers, we calculated the mean and standard genetic networks and functional structures between genes deviation of the data for each point in time, and removed by using Bayesian networks and nonparametric regression. those data that deviate more than two standard deviations or Proc. Pac. Symp. on Biocomputing, in press. so from the mean. Nakao. ., Bono,H_, Kawashima,S., Kamiya.T., Sato;K.,

Finally, the number of expression measurements needed Goto.S. and Kanehisa,M. (1999) Genome-scale gene at each time point to reliably fit a linear spline function was expression analysis and pathway reconstruction in KEGG determined by removing some data points and fitting a In Dunker -K., Konagaya,A., Miyano.S., Takagi.T. (eds), linear spline Junction anew. It was found that if four data Proceedings of the Tenth Workshop on Genome points per time point are used, in about 15% of the cases the Informatics. University Academy Press, Tokyo, Japan, pp. knot positions will not be estimated reliably. It is therefore 94-103. advisable to make more than four measurements per time Priestley.M.B. (1994) Spectral Analysis and Time Series. point. Academic Press, London.

Sρellmaπ,'P., Sherlock.G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B,, Brown,P.O., BotsteinJ). and

REFERENCES FutcherJB. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces

Akaike.H. (1971) Information theory and an extension of cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, the maximum likelihood principle. Research 3273-3297. Memorandum No. 46, Institute of Statistical Mathematics, Suzuki,!, Kanesaki.Y., Mikarni.K., Kanehisa,M and Tokyo. In Petrov.B. , Csaki,F. (eds.), 2nd Int. Symp. on Murata,N. (2001) Cold-regulated genes under control of Inf. Theoiy, Akademiai Kiadό, Budapest (1973), pp. the cold sensor Hik33 in Synechocystis. Mol Microbiol, 267-281. 40, 235-244.

Akutsu,T., Miyano,S. and Kυhara,S. (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727-734.

Anandan.S. and GoIden,S.S. (1997) Cis-acύng sequences Key words: gene expression data, linear splines, AIC, required . for light-responsive expression of the psbDII time-ordered data, maximum likelihood method. gene in Synechococcus sp. strain PCC 7942. /. Bacteriol 179, 6865-6870.

Anderson,T.W. and Rnn,J.D. (1996) The New Statistical Analysis of Dala. Springer-Verlag, New York. Example 5

USE OP GENE NETWORKS FOR IDENTIFYING AND VALIDATING DRUG TARGETS

We propone a new methodology for identifying and validating drug targets by using gene networks, which are estimated from cDNA microarray gene expression data. e created novel gene disruption and drug response microarray gene expression dst* libraries for the purpose of drug target elucidation. Wc use two types of -microarray gcnC expression data for estimating gene networks and then identifying drug targets. The estimated gene networks play an essential role in understanding drug response data and this information 5β unattainable from clustering analysis methods, which are the jstandard foir gene expression analysis. In the construction of gene netwoiis, β UBe both Boolean and Bayesian network models for different steps in analysis and capitalize on their relative strengths- We use an actual example from analysis of the Socc aromyctt erevisiae gene expression «w>d drug response data to suggest a concrete βtntcgy for the application of gCnc network information to drug iscovery.

1 Introduction

In recent years, the development of microarray technology has produced a large volume of gene expression data under the various experimental conditions- Constructing gαoe network from gene expression data has received considerable attention arxd several methodologies for gene network inference have been proposed in the fields of computer science and statistics, e.g., see the _paperf'^2!'?'⁴'⁹>¹¹'¹³'¹ '^lβ'²⁰'^!W in the References section. The development of new methods for constructing a gene network is import and should progress. On the other hand, real world application of estimated gene networks to solve medical or biological problems has recently become the center of interest. In this paper, we introduce a methodology for identifying a_nd validating drug targets based on gene expression data.

Clustering metho ^'⁷'⁴³ have become a standard tool for analyzing π»-

croarray gene expression data. However, they cannot offer information sufficient for identifying drug targets in either a theoretical or practical sense. In. this paper, we show how estimated gene networks can be used as a key for identifying and validating target genes for the understanding and development of new therapeutics. Gene regulation pathway information is essential for our purpose, and we use both Boolean and Bayesian networks as the methods for inferring gene networks from gene expression profiles. The procedure for identifying drug targets can be separated into two partβ. At first, we βhould identify the drug-affected genes. Second, we search for the draggable genes, which are usually upstream of the drug-affected genes in the gene network. The Boolean network model shows its most useful for identifying the drug-affected genes by using "virtual gene" technique, proposed herein, and the Bayesian network model can work effectively for exploring the druggable gene targets related to the elucidated affected genes. We apply the proposed method to novel Saccharomyces cerevisiae gene expression data, comprised of expression experi ents from 120 gene disruptions and several dose and time responses to a drug. We demonstrate a concrete strategy for identifying genes for drug targeting through this application study.

2 Gene Networks for Identifying Drug Targets

S.l Clustering Methods

Clustering methods such as the hierarchical clustering and the self-organizing maps?³ are commonly used as a standard tool for gene expression data analysis in the field of Bioinformatics. Eiseι£ focuses on the hierarchical clustering and provides software, Cluster/Tree Vew, for clustering analysis of gene expression data. De Hoon et al?^,β improves this software especially in the .fe-røeaos clustering algorithm. These clustering algorithms can work effectively in the early stage of analysis. In the clustering analysis of gene expression data, we assume that the genes, which have similar expression patterns, play the same functional role in cell. Therefore, we often use the clustering methods for predicting gene functions.

We agree with the usefulness of the clustering methods, however, they cannot provide sufficient information for our purpose. Clustering methods only provide the information on gene groups via the similarity of the expression patterns. In fact, we need additional hierarchical pathway information, not only cluster information, in order to detect drug targets that are affected by an agent. In Section 3, we show the limitations of clustering techniques for drug targeting purposes through real data analys s. 2.2 Network Methods

We use two methods for estimating gene networks from gene expression data. In this subsection, we give the brief introductions of both methods. For detailed discussions of the algorithms, please refer to the p_ap_ersi.².³.'J.⁹» »¹3,i4,l^β,20_>22 j_n the References. section.

Boolean Network

A number of group._* ^,2'^3,4,1S>²² have proposed Boolean network

Bayesϊan Network

The Bayesian network¹⁵ is a graph representation of the complex relations of a large number of random variables. We consider the directed acyclic graph with Markov relations of nodes in the context of Bayesian network. We can thus describe complex phenomena through conditional probabilities instead of the joint probability of the random variables. That is, suppose that we have a gene expression data x of f-th array and _j'-th gene for i = l,...,n and j = 1, ..., x ^{■ • •} x f_P(xip]p_ip), where

Pi_j = (Pii 7 — ι ^])^T is ^Ά gi-dimensional parent observation vector of ary.

Friedman et al? proposed an interesting approach for estimating a gene network from gene expression profiles. They discretized the expression values into three values and used the multinomial distributions as the condi¬

together with new criterion, named BNRC, for selecting an optimal graph. The BNRC is defined as the approximation of he posterior probability ofthe graph by using the Laplace approximation for integrals. Imoto et a/.^13>14 applied the proposed method to the Saccharomyces cerevisiae gene expression data and estimated a gene network. The advantages of this method are as follows: a) We can analyze the microarray data as a continuous data set, b) this model can detect not only linear structures but also nonlinear dependencies between ^" genes and e) the proposed criterion can optimize the parameters in the model and the structure of the network automatically.

The Bayesian network has theoretical advantageous base on the mathematics of inference and in this paper, we use the method proposed by Imoto et αJ.¹³>¹⁴ for constπicting Bayesian. networks. Unfortunately, the Bayesian network cannot construct cychc regulations and are not useful for creating multilevel directional models of regulatory effects from data created from logical joins of expression data from disruptants and drug response experiments. However, we can get around this problem by combining the Bayesian and Boolean networks in analysis. Hence, the Boolean and Bayesian networks used together can cover the shortcomings one another and we can obtain more reliable information by using both network methods. 3 Application

5.1 Microarray Data

We created two libraries of microarray data from the Saccharomijccs cerevisiae gene expression profiles. One was obtained by disrupting 120 genes, and the another was comprised of the responses to an oral antifungal agent (four concentrations and five time points). We selected 735 genes from the yeast genome for identifying drug targets. In YPD, 314 genes were defined as "Transcription factors", and 98 of these have already been studied for their control mechanisms. The expression profile data for 735 genes chosen for analysis included the genes controlled by these 98 "Transcription factors" from 5871 gene Species measured in addition to nuclear receptor genes which have a pivitol role in gene expression regulation and are popular drug targets. We constructed the gene network models from the data set of 735 genes over 120 gene disruption conditions. The details of the disruption data are also described in Imoto et l} . As for the drug response microarray gene expression data, we incubated yeast culures in dosages of 10, 25, 50 and 100 mg of an antifungal medication in culture and took aliquots of the culture at 5 time points (0, 15, 30, 45 and 60 minutes) after addition of the agent. Here time 0 means the start point of this observation and just after exposure to the drug. We then extracted the total RNA from these experiments, labeled the RNA with cy5, hybridized them with cy3 labeled RNA from non-treated cells and applied them to full genome cDNA microarrays thus creating a data set of 20 microarrays for drug response data. In thiβ paper, we use these 140 microarrays to elucidate drug targets using gene networks.

3.2 Result

Clustering Result .

In the identification ofthe drug targets, the popular but problematic strategy is the use of clustering analyst²'¹⁷ often even with a library of base perturbation control data, to compare to drug response data based on the correlations. We have two 1ypes of microarray data., gene disruption and drug response, allowing us to compare drug response patterns to gene expression patterns caused by disruption. In the clustering analysis, if there is significant and strong similarity between the expression patterns of a single disruptant or group of disruptants and a given drug response jnicroarray, we may conclude that an agent probably plays the same role as the disrupted gcαe(s). Moreover, if this disrupted gene has known functional role, we may obtain more information

Fϊgun; 1: THc 5m_<*ge of the correl-iiion mfttrϊx R of combined gcpc expre_rf.ϊoι-_ dnta.

about the response to a drug.

Unfortunately, as is often the case with such experiments we cannot gain such a straightforward result from clustering our data. Figure 1 shows the image representation of the correlation matrix of our microarray data. We combine the two types of data and make the matrix Z = (X -, Y), where X and Y are the drug response and the gene disruption microarray data matrix, respectively. Here, each column denotes an expression pattern obtained by one microarray, and each column is standardi_zed with a mean of 0 and variance of 1- Therefore, Figure 1 is the information of the correlation matrix R = Z^TZ/p_t where p is the number of genes and ΪB 735.

In Figure 1, the light and dark colors denote the positive and negative Hgh correlations, respectively. It is clear that the drug response microarrays have high correlation amongst each other and a low correlation with any of the gene disruption microarrays. Under such a situation, we cannot find any interactions between gene disruptions and drug responses from the clustering analysis that would be useful to identify the strongly affected genes that ate meaningful in the context of drug response. We further implemented hierarchical clustering of the gene disruption and drug response microarrays, but the drug response microarrays produce one cluster and we cannot extract any more information on what the actual drug targets from this analysis. This result is essentially ^•unchanged when we use other distance measurements or clustering techniques. Hence, we cannot obtain the information for identifying the meaningful drug targets from the clustering methods with our data.

Boolean Network Result

Wc estimated gene network by using the microarray data, _Z, which was created by the combining gene disruption and drug response microarrays. We introduce a method called virtual gene technique. We consider the conditions of the drug responses data as "virtual genes", e.g., the condition lOOmg/ml and 30min is given an assignment as the gene YEXPlOOmg30min. By using the^" Boolean network model, we can find the child genes of these virtual genes with the drug affecting these child genes in progeny generational order. We focus on genes which have five or more virtual genes as the parent genes, as the putative drug-affected genes, that is, genes which are under direct influence of the virtual geneβ (drug affect). However, a gene which has only one virtual gene as its parent gene, may be the primary drug-affected gene, depending on the mode of action for a given agent and this must be analyzed on a case by case basis. The virtual gene technique highlights the advantages of Boolean network models compared with the Bayesian network model in the initial screening for genes under drug-induced expression influence.

■ In addition, fold-change analysis can provide similar information to the proposed virtual gene technique. In fact, we can obtain the affected genes under certain experimental condition by fold-change analysis. However, our virtual gene technique can improve the result of the fold-change analysis. Suppose that we find gene A and B are affected by the drug from fold-change analysis. The fold-change analysis cannot take into account the baseline interactions between geneA and g neB. Than is, if there is a regulation pathway between geneA and geneB that geneA → geneB, the geneB is probably not affected by the drug directly. The virtual gene technique can take into account such interaction by sing the information of the gene disruption data and thus reduce the search set to more probable target genes given available interaction data.

There is not guarantee that genes that are most affected by an agent are the genes that were "drugged" by the agent, nor is there any guarantee that the drugged target represents the most biologically available and advantageous molecular target for intervention with new drugs. Thus, even after identifying probable molecular modes of action, we should find the most draggable genes upstream of the drug-affected genes in a regulatory network and to then screen low molecular weight compounds for drug activity on those targets. In the

Figure 2: The downβtmiim pathway of the virtual gene YE_tP_lO(lmg30j______o.

estimated Boolean network, the virtual genes should be placed on the top of the network. Therefore, it is difficult or sometimes impossible to find upstream information for drug-affected genes in this estimated Boolean network. At this stage, we can use the Bayesian network model for exploring the upstream region of the drug-affected genes in an effective manner.

Bayesian Network Result

• The gene network is estimated by the Bayesian network and nonparametric regression method together with BNRC optimization strategy¹³'¹⁴. We use the Saccharomyces cereviβiae microarray gene expression data obtained by disrupting 120 genes. From the Boolean network analysis, we can find the candidate set ofthe drug-affected genes effectively. The druggable genes arc the drug targets related to these drug-affccted targets, which we want to identify for the development of novel leads. We can explore the druggable genes on the upstream region of the drug-affected genes in the estimated gene network by Bayedan network method. Using the Bayesian models of network regulatory data available to us from our knockout expression library wc are able to search upstream of drug affected target for genes, which have a known regulatory

Figure 3: A f»__-rtϊ.__l network ββiong the druggable (Top), drug-affected (Bottom) and intermediary gonna (Middle).

control relationship over drug affected target expression. Here, wc focus on the nuclear receptor genes as the druggable genes because a) Nuclear receptor proteins are known to be useful drug targets and together represent over 20% of the targets for medications presently in the market- b) Nuclear receptors are involved in the transcription regulatory affects that, are directly measured in cDNA microarray experiments.

Figure 3 shows a partial network, which includes the drug-affected genes (Bottom), the druggable genes (Top) and the intermediary genes (Middle). Of course, we can find more pathways from the druggable geneβ to the drug- affected genes if we admit more intermediary genes. Due to the use of the Bayesian network model, we can find the intensities of the edges and can select more reliable pathway. This is an advantage of the Bayesian network model in searching for ideal druggable targets. Iα Figure 3, the druggable genes in the circle coimect directly to the drug-affected genes and the other druggable genes have one intermediary gene per one druggable gene. Erom Figure 3, we can find the druggable genes for each drug-affected gene, e.g., we can find the druggable genes for MAL33 and CDC6 shown in Table 1- Table 1: The dπiggabls genes of A 33 and CDC8. "Parents" means these genes connected directly to the drog-aflectcd gene. "Grandparents" means there 13 one intermediary gene between these genes and the drag-affcctcd gene.

Drug-aflectcd MAL33 (YBR287W) : βltoββ fermαitϋtloji regulatory protein

Orαjgable . arent*

HSF82 (YP.W40C) : Heat shock protein

SRB4 (YER022W) : DNΛ-dirβcted RNA pD.ymCTaβc II holoenzyme and Komberg'-j mediator (SRB) subcornplex aubvnit

Groπdporcni*

BARl (YI 015W) : Bn- ef ^poin precursor

GPA1 (VMR.O05C) : CTP-binding protcio alpha subunit of the phCTOinonc patlrnrαy KΛR2 (YJL034W) : nuclear fusion prolran

Drug-aiTcctni . CDC6 {Y_J-__.194W) ! Cell division control protein

Druggable P&rcnta

AR_P7 (YPR034W) : Component of SW1-SNF gl al transcription activator complex and RSC chromatϊn remo eling co plex BARl (YH 015W) » Barrlwrpeptnn precursor

GΛ H (YOL051W) : DNA-dϊrectεd RNA po)yπκ>ras« II hobcπzyme and

Kornbcrg'a inediator (SRB) i ibcomplwX uobunJl I'ΛRl (YJ lS7C) : Cyelin-dβpertdent kinase mhϊbiTror (OKI) SLA2 (ΥNb343W) : Cytoπkelcton nsscmbly Control protein

4 Discussion

In this paper, wc propose a new strategy for identifying and validating drug targets using computational models of gene networks. We showed that the clustering methods cannot provide the sufficient information and we that there is a need for the kind of hierarchical interaction data provided by network methods. We focus on the Boolean and Bayesian networks for estimating gene networks from microarray gene expression data. Theses two methods were originally proposed from the different points of view, however, both methods have relative strengths and eakness and we can obtain more reliable information from employing both methods in our analysis. The Boolean network is ideally used for identifying the drug-affected genes by using the virtual gene technique set forth above. We also described the theoretical advantages of the virtual gene technique over results obtained from simple fold-change analysis. This is a positive point for using the simpler Boolean networks for certain purposes. However, we-sihow that the Boolean network cannot give the valuable information when we explore the druggable genes upstream of the drug-affected genes in the estimated network. Wc solve this difficulty by applying the Bayesian network model to this problem. The Bayesian network model can provide the information of the upstream region of the drug-affected genes effectively and we can thus attain a set of candidate druggable genes for each drug-affected gene. The strategy proposed by this paper is established based on the sophisticated use of a combination of two network methods. The strength of each network method can be clearly seen in this Strategy and the proposed integrated method can provide a methodological foundation for the practical application of bioinformatics techniques for gene network inference in the identification and validation of drug targcts.

References

1. T. Akutsu, S. Kuhara, 0. Marayama and S. Miyano, A system for identifying genetic networks from gene expression patterns produce.d by gene

' disruptions and overexpressions. Genome Informatics, 9, 151, (1998).

2. T, Akutsu, S. Miyano and S. Kuhara, Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Proc. Pacific Symposium on Biocomputing, 4, 17, (1999).

3. T. Akutsu, S. Miyano and S. Kuhara, Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727, (2000),

4. - Akutsu, S. Miyano and S. Kuhara, Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function- J. Comp. Biol, 7, 331, (2000).

5. M.J.L. de Hoon, The C clustering library manual, (2002).

6. M.J. . de Hoon, S. Imoto and S. Miyano, Bioinformatics. submittcd-

7- M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863, (1998).

8. M.B. Eisen, http://ranaJbI.gov/EisenSoftwareSource.htin (2000).

9.. N. Riedman, M- Linial, I. Nachman and D. Pe'er, Using Bayesian networks to analyze expression data. J. Comp. Biol, 7, 601, (2000). ^• 10. Genomic Object Net, http://www.gcnomicobject.net/public/index.html (2002).

11, A.J. Hartemink, D K. GiHbrd, T.S. Jaakkola and I A. Young, Combining location and expression data for principled discovery of genetic regulatory network models. Proc, Pacific Symposium on Biocomputing, 7, 437, (2002).

12. T.R. Hughes et al., iunctional discovery via a compendium of expression profiles. Cell, 102, 109, (2000). 13. S- Imoto, T. Goto and S. Miyano, Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression- Proc Pacific Symposium on Biocomputing, 7, 175, (2002).

14. S. Imoto, S- Kim, T. Goto, S. Aburatani, K. Tashiro, S- Kuhara and S. Miyano, Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Proc. IEEE Computer Society Bioinformatics Conference, (2002). in preεs.

15. F.V. Jensen, An introduction to Bayesian Networks. Univeraty College London Press (1996).

16. Y. Maid, D. Tominaga, M. Okamoto, S. Watanabe and Y. Eguchi, Development of a system for the inference of large scale genetic networks. Proc. Pacific Symposium on Biocomputing, 6, 446, (2001).

17. M.J. Marton et al, Drug target validation and identification of secondary drag target effects using DNA microarrays of Expression Profiles. Nat. Med., 4, 1293, (1998).

18. H. Matsuno, A. Doi, M. Nagasaki and S. Miyano, Hybrid Petri net representation of gene regulatory network. Proc. Pacific Symposium on Biocomputing 5, 338-349, (2000).

19. H- Matsuno, A. Doi, Y. Btirata. and S. Miyano, XML documentation of biopathway3 and their simulations in Genomic Object Net. Genome Informatics, 12 54-62, (2001).

20. D. Pe'er, A. Regev, G. Elidan and N. Friedman, Inferring subnetworks from perturbed expression profiles. Bioinformatics, IT Suppll, 215, (1SMB 2001).

21. M-A. Savageau, Biochemical Systems analysis: a study of function and design in molecular biology. Addison-Wcsley, Reading, (1976).

22. I. Shmulevich, E.R. Dougherty, S. Kim and W. Zhang, Probabilistic Boolean networks: a rule-based uncertainty, model for_ gene regulatory networks. Bioinformatics, 18, 261, (2002).

23. P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovεky, E.S. Lander and T.R. Gloub, Interpreting patterns of gene eαcpression with self-organizing maps: methods and application to hematopcάetϊc differentiation. Proc. Nat Acad- Set. USA, 96, 2907, (1999).

Example 6

System for Discovering Network Relationships, and Uses Therof

1) Objects

Deteπnining network relationships is highly desirable in numerous disciplines, including drug discovery, genomics, proteomics, computational networks, human networks, and other systems in which elements interact with each other in complex fashions.

One of the most important points in drug development is to explore action sites of lead compounds. Screening of the drug involves the processes of 1) synthesis of the lead compounds and screening thereof, 2) retrieval of the action sites and direct approach to diseases, 3) improvement to more effective drugs and 4) improvement to the drugs having less side effects. The important technology not available today in the process of the drug development is an assay assist system taking advantage of genomes. It is currently very difficult to construct an assist system using human genomes, although the human genome has been already elucidated. Underlying difficulties are based on the fact that construction of models using knock-in/knock-out genes is very difficult, and the models should be established for respective organs since each organ has its own specific expression.

Under these situations, it is considered to be an effective system for most effectively developing drugs having less side effects to determine the target ate of the drug using yeast as an eucaryotic microorganism that may be easily modeled and for which abundant data have been accumulated, and to construct a system for extracting features of the network concerning the target site. Practically, differences of expression among the genes are accumulated using the kπock-in/knock-out effect of each gene, and an expression control network is constructed from the data. Then, a lead compound is added to the culture medium of yeast utilizing this network, and any action sites of the lead compound are estimated by comparing the difference of gene expression with accumulated data. GNI has developed a genome assay assist system to provide an analysis environment and services for assisting pharmaceutical companies with more accurate and efficient development of drugs.

2) Technologies for Constructing Network

As an achievement of the genome project that is considered to be one of the greatest molecular biology projects in the 20^th century, genome structures of many model animals have been elucidated, and analysis of the structure of even human genomes are now under way. Elucidation of genome functions are made possible by this rapid progress of the genome structure analysis that have been impossible by conventional biochemical and genetic methods. The most advanced results comprises transcriptome for investigating expression of total genes, and proteome for investigating expression of total proteins. In addition to these results, it has been practically available to elucidate the overall cell functions through investigation of the expression network of all genes constituting an organism by analyzing these data. Based on these facts, the next project for determining the genome structure is to elucidate genome functions involved in each organism. Studies for controlling overall functions of an organism have started by classifying and systematizing each genome depending on its nature. Consequently, a new era is now beginning when the observed data should be finally combined and connected to elucidation of the genome functions. - Under these situations, it is almost impossible to collect and analyze multiple lines of information using conventional methods by which a single gene is analyzed.

A novel analysis method is required for attaining the object as hitherto described. In contrast to the conventional model constructed by experimentally confiπning respective relations, it should be considered to totally construct a model from these large scale data. For example, a gene expression control network can be constructed by analyzing expression profiles and time course of total genes of an organism under various culture conditions, or by analyzing expression profiles in a mutant whose genes are destroyed. A study as shown in Fig. 1 is possible from a systematic point of view, and studies for assuming internal structure of the system from the data on input and output, and disturbance has became possible. When disturbance is considered to be destruction of a gene (no expression of the gene), it is a problem from the systematic point of view whether the gene expression system can be conjectured assuming that the input means expression of normal total genes and the output means expression of total genes when a part ofthe gene is destroyed.

Identification of System from Systematic Point of View

It is possible to collect experimental data for enabling the network to be assumed when the genome structure is elucidated.

-Data of all the related genes can be measured.

-Many tools for enabling many experiments to be performed - such as chips have been developed.

When an output is measured by applying a disturbance to obtain many standardized data, then:

FIG. 1 Molecular Network

The first step that enabled this expression profile to be analyzed is emergence of micro-arrays^ that enables variation of many Mnds of DNA or RNA (expression profⁱle) to be simultaneously measured, or GeneChip method*. "Nature Genetics, January, 1999" have issued a special edition", on the gene chips. This method is now beginning to be widely used in the biological and medical fields, and many venture enterprises have been established worldwide to start sales of total system involving the chips and frays (see, for example, www.inmcite.com and www.affimetrix.com). DNA Chip laboratories Co., Takara Brewery Co. and Hokkaido System Science Co. have also joined in manufacture of the chip in Japan. Since a vast amount of experimental results are expected to be obtained from these methods, novel hypotheses and analysis methods will be required. Such methods would enable to replace conventional models that have been constructed by experimentally confirming individual relations with novel models that may be constructed from large scale data at one stroke. For example, it is possible to construct a control network of gene expression by analyzing the expression profile of entire genes under a variety of environments. Search with comprehensive methods enables cause genes of diseases to be explored and to construct a disease model.

Analyzing gene expression networks

(1) Basic array technology

The same principle as the conventional Southern blot method is basically used in the array experiments. Technical developments that have brought about a rapid advance in the array technology include development of measuring instruments by which quantitative ratio between the sample and control is precisely measured by binding DNAs in high density on a slide glass having good optical characteristics. Recently, 10,000 or more of cDNAs were successfully spotted on a slide glass using an instrument (an arrayer or a spotter) developed by Dr. P. O. Brown et al¹⁾ of Stanford University in USA. Expression of the gene could be measured by hybridizing sample and reference DNAs labeled with two kinds or more of fluorescent substances with the cDNAs on the slide. In a different method, a GeneChip has been manufactured using a more industrially available photolithographic method by which oligonucleotides are directly synthesized in high density on the surface of the glass. This method is now being practically used³⁾. Otherwise, a method using an ink-jet mechanism is under development, and this method is expected to have a possibility that can form the array in higher speed with a higher density.

(2) Practice of micro-array

An example of practical experiments will be described hereinafter. In our laboratory, direct and indirect influences of genes are comprehensively collected by allowing some genes in budding yeast to be destroyed or to be excessively expressed to investigate expression profiles of the total genes. The experimental procedure of the micro-array will be described herein with reference to Figs. 2 and 3. (I) Total genes of the microorganism to be studied are prepared at first. Usually, a set of PCR (polymerase chain reaction) primers (with a length of 20 bases) against the total genes are prepared, and PCR reactions are applied against the genomes or clones to obtained amplified fragments of the total genes.

(II) DNA of each amplified gene is spotted on a slide glass using a robot called a spotter or arrayer, and the genes are fixed. Each spot has a diameter of 150 μ, and 2,500 genes are spotted per 1 square cm.

(IH) Subsequently, the sample is prepared. In the case of the budding yeast, a selected strain after the genome analysis and a mutant strain in which a specified gene has been destroyed are cultured under the same condition. mRNAs are extracted from both strains. cDNAs are prepared from these mRNAs, and are simultaneously labeled with two kinds of fluorescent substances (having different excitation wavelength and fluorescence wavelength with each other). An equal amount of cDNAs prepared from respective strains are mixed. The mixture is placed on the micro-array, and cDNAs are bound with the probes fixed on the glass surface by complementary bonds. The slide is washed to remove non-specϋϊcaUy bound cDNAs, thereby obtaining the expression profile of about 6,000 total genes in one experiment. (TV) The micro-array binding the sample is quantified with a two-wavelength fluorometer to calculate the sample to reference ratio. The fluorometer is required to have a high sensitivity since the spot size is very small.

Micro-array NA chip

FIG. 2 Outline for preparing a micro-array

Normal cell Mutant cell

suppressed gene

Expression activated gene

FIG. 3 Measurement of expression profile using micro-array

The mutant ;ells in the drawing mean the cells treated with a chemical. The gene expression profiles are prepared for comparison among the profiles, followed by identification ofthe location of the profile in the network. 3. Analysis of gene expression net work by information technology (1) Kinds of data

Data is acquired at one selected time, or the cells are sampled with a given time interval during cultivation to acquire the expression profile data at that time as a time course. Otherwise, the experiment is carried out at one point on the time axis under different experimental conditions in order to collect as many expression profile changes as possible. Table 1 shows a part of Gene Expression Matrix in which the data of the gene destroy experiments are compiled. The results of gene destroy experiments with respect to 6,000 genes are shown in the columns of the table as ratios to wild type strains, in which the data of respective destroyed strains are accumulated.

TABLE 1 Gene Expression MATRIX The columns denote the destroyed gene names, and the rows denote the gene names. The figures are represented as relative amounts of expression by taking the normal level as unity. (2) Analysis model

Determination of network from array data

According to the prediction that the functional expression of microorganisms is defined by gene expression information based on the central dogma of the molecular biology, a line of information for elucidating the functions of organisms whose expression network has been investigated are expected to be obtained by analysis of the large scale gene expression data. Generally speaking, the methods used involve elucidation of gene groups having the same gene expressing pattern from the gene expression profiles issued from the project using clustering method⁵.⁷> ^β). Apart from these methods, studies for presuming the gene expression control network have started worldwide. The basic object thereof is to develop calculation methods for forming a construction diagram of the gene expression control network as shown in Fig. 4 from the gene expression matrix or time course data.

Guideline for Analysis of Large Scale Network

Gene expression matrix

Time course data of gene expression

FIG. 4 Conceptual diagram of identification of gene network Net work analysis using such model is largely divided into two groups. A boolean model is used in one group, wherein the relation among the genes .is represented by a binary state in which one gene is controlled by the other gene or not⁹» ¹⁰). Different models deal with actual number of expression, wherein most of the models are based on differential equations related to dynamics of continuous values¹¹ _' ¹²). However, the model based on the differential equation requires some devices since the reaction constants involved in the model should be optimized in the expression control network analysis carried out against the total genome¹³). An analyses method using a Bayes network have been introduced in addition to these method¹⁴⁾.

2. Our trial

(1) Expression profile construction program

Since genome level studies have became possible today in the environment that allows multi-dimensional data to be readily measured, it is considered to be necessary to collect many expression profiles of gene destroyed or gene excess expression strains for the purpose of elucidation of the gene expression control network. We have also constructed many gene destroyed mutants for genes mainly comprising the genes related to transcription, genes related to signal transfer and genes related to cell cycles as shown in Table 2, and are trying to collect their expression profiles. One hundred and seventy three mutants mainly involving transfer factors have been constructed today, and their expression profiles are under construction with a program for disclosing the expression profile data this year. TABLE 2 Construction ro ram of ene destro ed strains

(2) Trial for identifying the network

The multi-level digraph method we are trying will be introduced below. While the network is determined by the method shown in Fig. 5, it involves the following steps. (1) A gene expression matrix GE is constructed from the gene expression profile data collected; (2) The effect of one gene on any of the other genes, if any, is inquired by an operation to assign 1 for the gene in which its expression ratio has been remarkably changed, and 0 for the others, thereby forming a binary matrix. (3) The parts in which this matrix forms a cycle is searched, and the parts are collected and defines as a new gene group B'. Since handling of the looped control structure is complex, the parts are collected together to define as a new gene group again. (4) The columns and lows of the matrix are replaced to align related genes at an upper right portion of the matrix. This makes most of network topologies of the genes that are controlled with each other to be clear. (5) Finally, the gene expression network may be further defined by removing the indirectly connected parts. The network determined by this method from the expression profile of 70 gene destroyed strains we have constructed is shown in Fig. 6.

Multi-level digraph approach

Presumption of expression control network from expression profile

Gene GI G2 G3 G4 . G4 affects G3" G3 affects G4 GI G2 G3 G4

W3d type 1 1 I 1 G2 affects G3 GI 0 1 1 1

G4 disruptant 2.3 3.5 0.1 - G2 affects G4 G2 0 0 1 1

G3 (δsruptarrt 4.1 2.8 - 0.2 GI affects G2 G3 0 0 0 1 1

02 durup aiK 3.0 - 0.3 0.3 GI affects G3 G4 0 0 1 0 !

GI tSsruptcnt - 0.4 0.3 0.2 GI affects G4

Induction of control relation among genes from the gene expression matrix © Multi-level digraph <g> Skeleton matrix J? _ζTs Gene group having a loop structure is treated as an equivalence gronp

GI ÷ T) Fro__n *\ GI G2 G3' Froni ^ Gl G2 G3^*

G2 GI 0 1 0 Gl 0 1 1 G2 0 0 1 G2 0 0 1 {G3!G4} G3' 0 0 0 G3" 0 o o

G³={G3.G4)

FIG. 5 Outline of multi-level digraph method

FIG. 6 A part of network determined from 70 destroyed strains

4. Problems in the future

The most largest problem is errors involved in the data. Since the currently available experimental technologies may give rise to about errors in the order of from 30% to 40%, developments of infbrmatex technologies is necessary based on such errors. However, the method for dealing with such large errors is not currently sufficient. Therefore, methodologies capable of experimentally reducing the errors, or methodology for estimating the errors to a certain extent are expected to be introduced.

Developments of display methods of the results are important for treating a vast number of genes. Since the network among 6,000 genes should be displayed for transcription factors were selected for analysis. We constructed an initial 552 gene-member model of regulatory control relationships and then constructed a sub-network model composed of the 98 known transcription factors. In creating these models we employed a Boolean computational

algorithm as reported in Shoudan, et. A1A An outline of our method is shown in fig.7.

Multi-level digraph approach

Reconstruct the net twwraorrfck ccllaassssiiffiieedd iinnttoo tthhee ggeenneess aanndd ssoommee eeqquuiivvalence sets using a gene expres sssiioonn mmaattririxx.. gene expression matrix G_ε @ binary matrixes ® adjacency matrixΛ

0 G1 G2 G3 G4 G4 affects G3 G, G₂ G3 G₄

^{WU wβ} 1 1 1 1 G2affecte G3 Gl 0 1 1 1

G tSsmplani 1 1 0 - G2 affects G4 G2 0 0 1 1

Q? ΛsmptHni 1 I - υ Gl affects G2 G3 0 0 0 0 11

GI Ssrtψtsm 1 - 0 υ Gl affects G3 G4 0 0 1 1 00

- 0 0 0 Gl aøects G4

Derive binaiy relations between genes using a gene expression matrix. nmlfi-level digraph © skeleton relation R ©Partition genes into equivalence sets (belonging to closed path) Gl

Gl G2 G3- Gl G2 G3'

G2 <^SS Gl 0 1 0 <^bm oi o i l

+1 G2 0 0 1 G2 0 0 1

(G3,G4) G3* 0 0 0 G3* 0 0 0

G^S={G3,G4)

Find out a partial order between equivalence sets and skeleton network, and define Boolean functions between genes.

Figure 7 Figure 7. Construction of a gene regulatory subnetwork model with the Boolean method. First, a gene expression matrix is created from a set of gene disruption experiments. Each element of the matrix indicates the ratio of gene expression and the binary relation matrix shows that if gene 'a' is deleted and intensity of gene V is largely changed, gene 'a' affects gene V. If gene a and gene ^*b' affect each other mutually, they form a loop (strongly connected component). An equivalence set is introduced which treats a loop logically as a gene. An equivalence set is a group of genes that affect each other or only one gene. Next genes are partitioned into equivalence sets. These semi-ordered accessibility matrices among equivalence sets are

describe influence as an equivalence set group even when individual genes

with the set do not directly influence each other. Connections that are also included in lower rows are -> then removed to build the hierarchical connections in the network and generating the network topology of the gene in control relationships.

The network model constructed by Boolean models of the expression experiments resulted in a well-defined regulatory transcription factor sub-network involved in mitosis and meiosis pathway. Fig. 8 shows this transcription factor sub-network in detail. As can be seen in this figure, the sub-network describes specific regulatory relationships among transcription factors involved in mating, meiosis and DNA structure and repair mechanisms. For example HAP2, which is a gene whose function has yet to be reported shows indirect regulatory influence on Alpha2, a known mating response gene via an intermediary product, THI2 whose specific function is likewise unknown. Furthermore, various elements related to chromosome structure and DNA repair thought to be unconnected to known meiosis pathways are influenced by MET28 and CIN5. Both of these results make sense biologically; the dependence on error-free DNA replication for successful sexual reproduction and the relationship between meiosis and mating are obvious.

To understand the generalized regulatory relationships among various functional groups of transcription factors, we classified transcription factors in the network model with their "cellular roles". A "cellular role" here is defined as the major biological process involving the protein in YPD. We

defined 10 groups for categorization. in our transcription factor network. Fig 8a displays graphically the interrelationships of classified transcription factors in our network, and Fig 8b. shows the numerical relationships among the groups.

As seen in Figure 8a, the cell cycle transcription factors were influenced by only a two of the 10 categories of transcription factors and only three genes in total. The three elements that independently influenced cell cycle transcription factors were related to meiosis, mating response, and cell stress, processes that for functional reasons would necessitate limited interaction with the cell cycle process. It is known that eukaryotic organisms use discreet and highly conserved cell-cycle checkpoints to ensure that nuclear division is restrained while DNA is undergoing replication or repair l³⁵-³⁷⁾.

Likewise, it is apparent that the lipid fatty-acid metabolism transcription factors were affected by the carbohydrate metabolism transcription factors. Our data demonstrates, for example, that expression of IN04, a UDFHGDHG, is influenced by RTG1, a JJHGDGGH. Also, PDP.1, a BIG HORSE is influenced by the^' expression of CAT8, a SMALL DOG. Other such interactions between proteins involved in phospholipid synthesis pathway with genes of the glucose response pathway, the lipid signaling pathway and other lipid synthesis pathways have been well described previously in the literature!²⁶). It is well known that the genes encoding many enzymes of phospholipid biosynthesis all contain variants of UASINO in their- promoters and these are regulated in response to growth phase and nutrient starvation

(G.M. Carman (1991)). Especially, Snflp/Snf4p and Irelp are reported that the key molecules that bind to UASINO sequence, which are associated with the glucose response, are required for derepression of UASi_Ho-containing genes.

To confirm validity of our models on a genome-wide basis, we searched for the existence of previously described regulatory relationships from the literature within our 552 gene-member regulatory model. Of 98 known transcription factors, the regulatory networks upstream and downstream of 26 of these can be elucidated from the interactions reported in the literature. We examined how many of these pathways for well-known genes were reconstructed in our model. As a result, we found that approximately 27% of these relations were apparent from our expression data.

Our gene regulatory network construction using gene disruptant based expression profile data along with Boolean computational models has elucidated a number of gene regulatory relationships that make specific predictions relevant to pathway biology. Further tuning of these models from new biological insights and correlation of data with other expression data such as protein expression experiments will allow for increased resolution of specific pathway interactions. In addition, we are currently investigating other computational methodologies for examination of stochastic relationships contained in expression profile data such as Bayesian Modeling. Use of this and similar strategies for the elucidation of

dynamic regulatory pathways from static expression data will allow for a new era of cost-effective, high throughput research into novel biological pathways and mechanisms.

Figure 8. Gene regulatory networks by cellular function. Figure 8a. A regulatory subnetwork of 98 transcription factors grouped by cellular function according to information provided in the YPD database. ^' Genes inside the circles are those grouped into a given cellular functional category. Lines of hierarchical relationship in regulatory control are depicted with colored lines with the lines the same color as the category of genes from which the control relationship emanate toward the genes on which they exert regulatory ^"influence. The bold red lines indicate regulatory control of transcription factors related to cell cycle originating from genes in the meiosis/mating response group. The dark blue lines indicate control relationships eminating from the carbohydrate metabolism group exerting influence over genes in the lipid fatty-acid metabolism group. This result is supported by the literature and also common sense. This well-conserved gene regulatory relationship insures that you will get fat if you eat too many sweets, b. Frequency of control relationships among functional categories of transcription factors. Row headings indicate gene control responses initiating from the category and column headings represent the categories of genes controlled. Numbers indicate the number of discreet control relationships initiated from a gene in one category and exerting control over a gene in the other category. The bold red numbers highlight the relatively sparse control relationships exerting influence over the cell cycle transcription factors. Bold blue numbers highlight the relatively sparse control relationships exerting control over lipid fatty acid metabolism, both of which emanate from the carbohydrate metabolism category.

Figure 9. A detailed view of a gene regulatory subnetwork of transcription factors reconstructed from Boolean analysis of gene expression experiments on disruptant mutant yeast strains. Black lines indicate a regulatory relationship, with the arrows showing the hierarchical direction of the expression influence. The colors and shapes of the nodes denote the general categories of cellular function of the gene product according to their descriptions in the YPD database. Genes related to cell division mechanisms are indicated with triangular nodes and genes related to DNA repair and chromosome structure are depicted with squares. The elucidated network shows novel topological control relationships among genes related to meiosis, the mating response and DNA structure and repair mechanisms. Genes related to meiosis and mating response genes are downstream of a cascade regulated by IN02 and a further sub-grouping of genes related to DNA repair and structure appears hierarchically downstream of UME6.

Figure 10 depicts a schematic diagram of a system for discovering networks of this invention. Cells (top line: "A") are grown and treated with and without interventions to produce mRNA reflecting the treatments of lack thereof. Messenger RNA is reverse transcribed into cDNA and spotted onto

arrays (second line from top). Information regarding the levels of expression is collected and stored in the Database. Information from the Database is analyzed according to the network modeling of this invention to produce a network of relationships between the different genes (in box). The output from the system is then displayed on a monitor or other visualization tool.

The above approach in determining networks can be applied to a variety of linear and non-linear analysis methods, including, but not limited to Boolean analysis, Bayesian analysis, graphical modeling, clustering analysis, maximum likelihood estimation, maximum penalized likelihood estimation and other types of analytical methods.

References

1) Schena, M. et, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270, 467-470(1995)

2) DeRisi, J.L., Exploring the metabolic and genetic control of gene expression on genomic scale, Science 278:680-686(1997)

3) Lockhart, D. J. et al., Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotechnol. 14. 1675-1680(1996)

4) The chipping forecast. Supplement to Nature Genetics. 21(1) 1999

5) pft9i , M, m ^*&, % ^χ ,

18. 1050-1056(1999)

6) Tamayo P., et al., Interpreting patterns of geno oxpression with self-organizing maps: methods and application to hematopoietic differentiation, Proc Natl Acad Sci USA. 96, 2907-2912(1999)

7) Toronen, P. et. Al., Analysis of gene expression date using self-organizing maps, FEBS left. 451, 142-146 1999

8) Arkin, A. et. Al., A Test Case of Correlation Metric Construction of a Reaction Pathway from Measurements, Scinece 277. 1257-1279(1997)

9) Liang S., et al., REVEAL, A General Reverse Engineering Algorithm for Inference of GeneticNetwork Architectures, In P&c. Symp. On Biocomputing. 18-29(1998)

10) Akutsu, T. et al., Identification of genetic networks from small number of gene expression patterns under the Boolean network model. In Pac. Symp. on Biocomputing 17-28(1999)

11) R. Somogyi, et al., The gene expression matrix: toward the extraction of genetic network architectures. In The Second World Congress of Nolinear Analysis(WCNA)(1996)

12) D. Weaver, et al., Modeling Regulatory Networks with Weight Matrices, In Pac. Symp. on Biocomputing, 112-123(1999)

13) PΦϊEg, S-system >f S Wk9&&; 165-188(2000)

14) Friedman, N., et. Al., Using Bayesian Networks to Analyze Expression Data, RECOMB 2000, 127-135(2000) Table. 3. -ecrcmsδfϊόϊana&siT^

structure (jC

THE2 IWE4 UME6

ALPHA2 YRR1

ADR1 SPT6 GAL11

■ ■ o c cc < GNI t v^^^T ^ 9—^Jf-^»Λ'^:A&ϊ-*yY9 9-^:f rMΛ

INDUSTRIAL APPLICABILITY

Methods for determining network relationships between genes is useful in the development of lead compounds in the pharmaceutical, health care and other industries in which relationships between genes is desired. Inferential methods of this invention find application in any area of statistical analysis in which complex relationships between groups of data are desired. Such areas include engineering, economics and biology.

Claims

We claim:

1. A method for constructing a gene network, comprising the steps of:

(a) providing a quantitative disruptant data library for a set of genes of an organism, said library including expression results based on disruption of each gene in said set of genes, quantifying an average effect and a measure of variability of each disruption on each other of said genes;

(b) creating a gene expression matrix from said library;

(c) generating network relationships between said genes; and (d) determining if one or more groups of genes is expressed differently from other of said groups of genes.

2. The method of claim 1, further comprising the step of:

(e) providing a Bayesian computational model, wherein said Bayesian model comprises minimizing a BNRC criterion.

3. The method of claim 2, wherein said step of minimizing a BNRC criterion comprises using a non-linear curve fitting method selected from the group consisting of polynomial bases, Fourier series, wavelet bases, regression spline bases and B-splines.

4. The method of claim 1, wherein said data library is created using a drug to alter gene expression.

5. The method of claim 2, wherein said step of minimizing said BNRC criterion further comprises selecting a Bayesian mode using a backfitting algorithm.

6. The method of claim 2, wherein said step of minimizing a BNRC criterion further comprises using Akaike's information criterion.

7. The method of claim 2, wherein said step of minimizing a BNRC criterion further comprising using maximum likelihood estimation.

8. The method of claim 1, wherein said genes are associated with a cell cycle.

9. The method of claim 2, wherein said measure of variability is variance.

10. The method of claim 3, wherein said non-linear curve fitting method is a non-parametric method.

11. The method of claim 10, wherein said non-parametric method for minimizing a BNRC criterion includes using heterogeneous error variances.

12. The method of claim 11, wherein said step of minimizing a BNR.C criterion further comprises the steps of: (1) making a score matrix whose (i, j ¹ element is the BN Chetero score ofthe graph gene, -» gene,-;

(2) implementing one or more of add, remove and reverse which provides the smallest BNRC_/ietero; and

(3) repeating step 2 until the BN C_hetero does not reduce further.

13. The method of claim 11, wherein said step of minimizing a BNRC criterion further comprises the step of applying a hill-climbing algorithm to minimize BNRC® hetero-

14. The method of claim 11 , wherein an intensity of the edge is determined using a bootstrap method.

15. The method of claim 14, wherein said bootstrap method comprises the steps of: (1) providing a bootstrap gene expression matrix by randomly sampling a number of times, with replacement, from the original gene library expression data; (2) estimating the genetic network for gene,- and gene,-; (3) repeating steps (1) and (2) T times, thereby producing T genetic networks; and

(4) calculating the bootstrap edge intensity between gene, and gene,- as (ti + t₂)/T.

16. A method for elucidating a gene network, comprising the steps of:

(a) providing a raw data library of time-course gene expression data for a plurality of genes of an organism;

(b) subtracting background signal intensities from said raw data library;

(c) calculating the relative change in gene expression for each of said plurality of genes;

(d) analyzing the statistical significance of said relative in gene expression using Student's t-test; and (e) fitting said changes in gene expression to a linear spline function.

17. The method of claim 16, further comprising the step of removing from consideration, those genes whose expression levels are sufficiently low so as to be determined predominantly by noise.

18. The method of claim 1, wherein said step of grouping comprises grouping said genes into one or more equivalence sets.

19. A method for estimating a gene network relationship and optimizing hyper parameters of said relationship, comprising the steps of:

(1) fixing hyper parameter p/,

(2) initializing γ_{jk = 0}, k - 1 , ..., q_j

(3) finding the optimal β_yv_t by repeating steps 3-1 and 3-2;

(3-1) computing: j_jk = (B^τ _jk W_jkB_jk + nβ_jkK_jt) ^'; B^T _JkW_Jk x (x_ω - Σ B_Jk-γ_jk), k'≠k for fixed ;

(3-2) repeating step 3-1 against a candidate value of β * and choosing the optimal value of j_k, which minimizes BNRC/,_eZero (4) repeat step for k= \, . . . , qj, 1, . . . , qj, 1, . . . until a suitable convergence criterion is satisfied; and

(5) repeat steps 1 to Step 4 against a candidate value of pj, and choosing the optimal value of pj, which minimizes the BN C_hete_ro-

20. A method for constructing a gene network model of a system containing a network of genes comprising using a Bayesian computational model, wherein said Bayesian computational model comprises minimizing a BNRC criterion.

21. The method of claim 20, wherein minimizing the BNRC criterion comprises using a non-linear curve fitting method selected from the group consisting of polynomial bases, Fourier series, wavelet bases, regression spline bases and B-splines.

22. The method of claim 20, wherein minimizing the BNRC criterion comprises selecting a Bayesian mode using a backfitting algorithm.

23. The method of claim 20, wherein minimizing the BNRC criterion comprises using Akaike' s information criterion.

24. The method of claim 20, wherein minimizing the BNRC criterion comprises using maximum likelihood estimation.

25. The method of claim 20, wherein minimizing the BNRC criterion comprises using a non-linear curve fitting method, wherein the nonlinear curve fitting method is a non-parametric method.

26. The method of claim 25, wherein the non-parametric method includes using heterogeneous error variances.

27. The method of claim 26, wherein minimizing the BNRC criterion further comprises the steps of: (1) making a score matrix whose (i, j)^t element is the BNRCh_ete_ro score ofthe graph gene,- -> gene,-;

(2) implementing one or more of add, remove and reverse which provides the smallest BN C/,_e,_ero; and

(3) repeating step 2 until the BNRChetero does not reduce further.

28. The method of claim 26, wherein minimizing the BNRC criterion further comprises the step of applying a hill-climbing algorithm to minimize BNRChetero-

29. The method of claim 26, wherein an intensity ofthe edge is determined using a bootstrap method.

30. The method of claim 29, wherein said bootstrap method comprises the steps of:

(1) providing a bootstrap gene expression matrix by randomly sampling a number of times, with replacement, from the original gene library expression data;

(2) estimating the genetic network for gene, and gene,-; (3) repeating steps (1) and (2) T times, thereby producing T genetic networks; and

31. The method of claim 20, wherein the Bayesian computational model is used to analyze a gene expression profile ofthe system.

32. The method of claim 31, wherein the gene expression profile comprises the level of gene expression of each gene in the system.

33. The method of claim 32, wherein at least one gene in the system is disrupted.

34. The method of claim 32, wherein the gene expression profile comprises a sub-gene expression profile and wherein the sub-gene expression profile comprises the level of gene expression of each gene in the system when at least one gene is disrupted in the system.

35. The method of claim 34, wherein the gene expression profile comprises at least two different sub-gene expression profiles.

36. The method of claim 32, wherein the system is treated with an agent.

37. A method for constructing a gene network model of a system containing a network of genes comprising using a Bayesian computational model and a Boolean method.

38. The method of claim 37, wherein said Bayesian computational model comprises minimizing a BNRC criterion.

39. The method of claim 37, wherein the Bayesian computational model and the Boolean method are used to analyze a gene expression profile ofthe system.

40. The method of claim 39, wherein the gene expression profile comprises the level of gene expression of each gene in the system.

41. The method of claim 40, wherein at least one gene in the system is disrupted.

42. The method of claim 40, wherein the gene expression profile comprises a sub-gene expression profile and wherein the sub-gene expression profile comprises the level of gene expression of each gene in the system when at least one gene is disrupted in the system.

43. The method of claim 42, wherein the gene expression profile comprises at least two different sub-gene expression profiles.

44. The method of claim 40, wherein the system is treated with an agent.

45. A data file comprising a gene network model constructed by the method of claim 20.

46. The data file of claim 45 in a computer readable form.

47. The data file of claim 45 accessible from a remote location.

48. The data file of claim 45 accessible from an internet web location.

49. A method for identifying a target gene of an agent in a system containing a gene network comprising: (a) constructing a first and second gene network model using a

Bayesian computational model, wherein said Bayesian computational model comprises minimizing a BNRC criterion,

wherein the first gene network model is obtained by analyzing a first gene expression profile and the second gene network model is obtained by analyzing a second gene expression profile,

wherein the first gene expression profile is obtained from . the system without being treated with the agent and the second gene expression profile is obtained from the system being treated with the agent, and

(b) analyzing the first and second gene network model using said

Bayesian computational model, wherein the agent is considered as a gene in the system, and wherein a target gene ofthe agent is identified.

50. The method of claim 49, wherein the target gene is a gene directly affected by the agent.

51. The method of claim 49, wherein the target gene is a gene indirectly affected by the agent.

52. A data file containing the identity of one or more target genes ofthe agent obtained according to the method of claim 49.

53. The data file of claim 52 in a computer readable form.

54. The data file of claim 52 accessible from a remote location.

55. The data file of claim 52 accessible from an internet web location.

56. A method of providing a service comprising receiving an agent from a party, and identifying a target gene ofthe agent for the party according to the method of claim 49.

57. The method of claim 56, wherein receiving an agent comprises receiving the identity ofthe agent.

58. A method of providing a service comprising receiving an agent from a party, and identifying a target gene ofthe agent for the party using the gene network model constructed according to the method of claim 20.