METHOD OF EXPANDING A BIOLOGICAL NETWORK
FIELD AND BACKGROUND OF THE INVENTION
The present invention relates to biological networks and, more particularly, to an efficient and effective method of expanding biological networks.
Many biological functions are accomplished by altering the expression of various genes at the level of RNA (e.g., through control of initiation of transcription, provision of RNA precursors, RNA processing, and nucleolysis of RNA molecules), protein (e.g., through control of translation, post translational modifications and elimination of protein molecules) or protein-activity (e.g., through control of cellular location, interactions with other proteins, further post translational modifications and proteolysis). For example, fundamental biological processes such as cell cycle regulation, cell differentiation and regulated cell death (also known as apoptosis), are characterized by variations in the expression levels of groups of genes.
Gene expression is also associated with pathogenesis. For example, the lack of sufficient expression of functional tumor suppressor genes could lead to tumorgenesis. Similarly, over expression of oncogenes and metastatic genes could lead to tumorgenesis and metastasis. Thus, changes in the level of expression of particular genes serve as signposts for the presence and progression of various diseases.
The study of gene expression has been generally concentrated on the regulatory regions of the genes of interest and on the relationships among different genes. A limited number of transcriptional factors/DNA binding proteins have so far been identified and a limited number of regulatory pathways have been discovered.
However, the expression of a particular gene is frequently regulated by the expression of a number of other genes. The expression of those regulatory genes may also be under the control of additional genes. The regulatory
relationships among genes (or other biological variables, such as proteins) form a biological network, composed of biological objects and relations among them. The biological network is an extremely complex entity, having numerous levels of redundancies on one hand and composed of multi-function factors on the other hand.
It is a common knowledge that the information for the construction and maintenance of all molecular and cellular signaling processes is stored in the genome, whereby the DNA sequence codes for the structure and molecular dynamics of proteins, which in turn determine biochemical recognition. Hence, the information stored in the DNA is responsible to a large extent to the dynamics of a genetic network.
Micro-array technology is one of several developing approaches to comparatively analyze genome-wide patterns of mRNA and protein expression. For mRNA expression studies, the ultimate goal is to develop arrays which contain every gene in a genome against which mRNA expression levels can be quantitatively assessed. Nucleic acid micro-arrays contain a wide sampling, or array, of cDNAs or oligonucleotides representing genes of a single species, such as human, on a slide. Such arrays can be used to determine the levels of mRNA species derived from a cell or tissue of interest tagged with a fluorescent label, and allowed to bind to the array of cDNAs or oligonucleotides corresponding to the different genes. The arrays are put into a scanning microscope that can measure the brightness of each fluorescent dot thus generated, which brightness reveals the amount or level of a respective mRNA. Antibody micro-arrays may include antibodies recognizing prelabeled proteins in a sample in a similar manner. Similarly, antigen micro-arrays can be used to determine the level of different antibodies in a sample.
Advances in genomics and proteomics, specifically in the development of the micro-array high throughput technologies, require new bioinformatics tools for manipulation, analysis and processing of the data. Of particular interest to
those in the bioinformatics area are systems for identifying the biological functions of genes based on their temporal (or perturbed) pattern of expression.
One system, known as clustering analysis, clusters genes according to the shape similarity of their pattern of expression, with clusters related to specific biological functions. This approach has been applied, e.g., to identify genes involved in a metabolic shift from the yeast genome [DeRisi et al, Science 278:680-686 (1997)]. Although clustering is a useful way to identify common data patterns, it is a rather crude method, as it is based on pairwise comparisons, and as such it is only a first step of analyzing the data. Based on understanding of the biological regulatory mechanisms and on theoretical examination of the evolutionary implications of the system as a whole, researchers have constructed different mathematical models to describe the behavior of biological systems [A.Arkin, J.Ross et. al, Genetics, 149:1275- 1279 (1998); T.Chen et. al, Proceedings of the Third Annual International Conference on Computational Molecular Biology, 94-103 (1999); T.Chen et. al, Proceedings of the 1999 Pacific Symposium in Biocomputing, 29-40 (1999); D.Dhaeseleer et. al, Proceedings of the 1999 Pacific Symposium in Biocomputing, 41-52 (1999)].
Also of prior art interest is the approach of reverse engineering which, based on a mathematical model, attempts to mimic a genetic network as a simple Boolean network [S.Liang et. al, Proceedings of the 1998 Pacific Symposium in Biocomputing, 18-20 (1998)]. The reverse engineering approach can potentially systematically decipher the complex circuitry of the genetic network from the temporal gene expression pattern. Yet of another prior art interest is a probabilistic approach to handle biological datasets [N. Friedman, et. al, "Using Bayesian networks to analyze expression data", published in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, pages 127-135 (2000), the contents of which are hereby incorporated by reference]. In this approach a
Bayesian genetic network learning framework is used to improve the validity of results obtained by network analysis.
However, the discipline of genetic network analysis has not become yet a practical aid to the biologist. The main reason being the complexity of a massively parallel, redundant and "evolution made" system. Compared to the numerous degrees of freedom in the space of possible molecular networks, even datasets with hundreds of well planned distinct cellular profiles cannot directly identify the perfect model out of many other possible alternatives. Another major source of difficulties is the noise inherent in high throughput experiments and the need to either give away much of the data or tolerate significant levels of "false positives". It has been found by theoretical studies that, without additional assumptions, the mathematical problem of inferring all but tiny genetic networks from experiments is impractical, since the number of experiments that would have to be performed in the worst case is out of reach. This difficulty may be phrased "experimental complexity".
Even when the models assumed are simple Boolean networks, although strong assumptions on the data reduce experimental complexity (e.g., random distribution in the attractor space), these assumptions do not hold for data gathered today. One still cannot expect enough data to support current reconstruction approaches in the foreseeable future.
Deeper inference of relations at a higher level of complexity is called for, and is done mainly manually nowadays. The research on genetic networks is presently trying to shape new methodologies that will enable inference of more complex relations from the data presently available. There is thus a widely recognized need for, and it would be highly advantageous to have a framework for analysis of genetic networks and hypothesis generation, devoid of the above limitations.
SUMMARY OF THE INVENTION
According to one aspect of the present invention there is provided a method of determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the method is executed by a data processor and comprising: (a) inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) issuing a report describing the network of biological relationships.
According to further features in preferred embodiments of the invention described below, the at least one additional object has at least one dependency relation to at least one existing object of the core biological network.
According to still further features in the described preferred embodiments the at least one additional object comprises at least one pair having a dependency relation therebetween.
According to still further features in the described preferred embodiments the dataset represents a plurality of experiments, each experiment of the plurality of experiments being characterized by a plurality of states, one state for each object.
According to still further features in the described preferred embodiments at least a portion of the plurality of states has a probabilistic property. According to still further features in the described preferred embodiments the probabilistic property is defined by a plurality of distributions each of the distributions being defined within a probability space, the probability space being represented by a plurality of probabilities, each of the plurality of probabilities corresponding to one state.
According to still further features in the described preferred embodiments each probability of the plurality of probabilities equals a fraction of a class of experiments characterized by a predetermined rule.
According to still further features in the described preferred embodiments the predetermined rule is such that a particular state of a particular object has a predetermined value.
According to still further features in the described preferred embodiments the predetermined rule is such that each state of a portion of the plurality of states which respectively corresponds to a portion of the plurality of objects, has a value which is selected from a predetermined set of values.
According to still further features in the described preferred embodiments the method further comprising repeating step (b) at least once, thereby obtaining a plurality of expanded core biological networks.
According to still further features in the described preferred embodiments the method further comprising redefining the expanded core biological network to be the core biological network, prior to step (b).
According to still further features in the described preferred embodiments the optimally expanding is by defining at least one fitness function.
According to still further features in the described preferred embodiments the fitness-function is selected so as to minimize over-fitting of the network of biological relationships to the dataset.
According to still further features in the described preferred embodiments the at least one additional object of step (b) is selected so as to maximize the fitness-function. According to still further features in the described preferred embodiments the optimally expanding is by scoring the expanded core biological network, thereby obtaining a network-score.
According to still further features in the described preferred embodiments the scoring is by: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of
the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network; thereby obtaining the network-score.
According to still further features in the described preferred embodiments the method further comprising scoring each object of the at least one additional object, thereby obtaining at least one object-score, and sorting the additional objects in accordance with the object-scores.
According to still further features in the described preferred embodiments the method further comprising scoring each of the plurality of expanded core biological networks, thereby obtaining a plurality of network-scores, and sorting the plurality of expanded core biological networks in accordance with the network-scores.
According to still further features in the described preferred embodiments the scoring each of the plurality of expanded core biological networks is by: for each of the plurality of expanded core biological networks: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network, thereby obtaining a network-score; thereby obtaining the plurality of network-scores.
According to still further features in the described preferred embodiments the method further comprising, for each object, A, having a dependency relation to at least one object B: (i) defining, for each experiment of the plurality of experiments, an ordered set of states, the ordered set of states comprises one state of object, A, and one state for each of the at least one object, B, the states corresponding to the experiment, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each experiment; and (ii) for each pair of experiments determining whether respective ordered sets of states are identical, and if so than defining each experiment of the pair of experiments to be a
consistent experiment; thereby obtaining, for each object having a dependency relation to at least one object, a set of consistent experiments.
According to still further features in the described preferred embodiments the method further comprising defining, for each object, an object consistency- function, the object consistency-function equals a maximal number of consistent experiments corresponding to the object, thereby providing a plurality of object consistency-functions .
According to still further features in the described preferred embodiments the fitness-function equals the object consistency-function. According to still further features in the described preferred embodiments the method further comprising: summing all the object consistency-functions of all the objects in the core biological network, thereby providing a core consistency-function; and summing all the object consistency-functions of all the objects in the expanded core biological network, thereby providing an expanded core consistency-function.
According to still further features in the described preferred embodiments the fitness-function equals a regulation-specificity, the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function. According to still further features in the described preferred embodiments the method further comprising calculating the regulation-specificity for each of the at least one additional object, thereby providing at least one regulation- specificity.
According to still further features in the described preferred embodiments the dataset comprises at least one perturbed state, each of the least one perturbed state corresponds to a constraint within the dataset.
According to still further features in the described preferred embodiments the method further comprising excluding each of the at least one perturbed state from the object consistency function.
According to another aspect of the present invention there is provided an apparatus for determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the apparatus comprising: (a) an input unit for inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between i pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) a core expander for optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) an output unit for issuing a report describing the network of biological relationships.
According to further features in preferred embodiments of the invention described below, the core expander is operable to expand the core biological network a plurality of times, thereby to obtain a plurality of expanded core biological networks.
According to still further features in the described preferred embodiments the core expander is operable to redefine the expanded core biological network to be the core biological network.
According to still further features in the described preferred embodiments the core expander comprises a fitness-function unit, operating at least one fitness function.
According to still further features in the described preferred embodiments the fitness-function unit operable to select the fitness-function so as to minimize over-fitting of the network of biological relationships to the dataset. According to still further features in the described preferred embodiments the core expander is operable to select the at least one additional object so that the fitness-function is maximized.
According to still further features in the described preferred embodiments the apparatus further comprising a scorer, for scoring each object, thereby to obtain a plurality of object-scores and for summing at least a portion of the
plurality of object-scores corresponding to objects of the expanded core biological network thereby to obtain a network-score.
According to still further features in the described preferred embodiments the scorer comprising: (i) electronic-calculating functionality for calculating a fitness-function for each object of the core biological network, so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby to obtain at least one fitness-function-value; and (ii) electronic-calculating functionality for averaging the at least one fitness- function-value over the core biological network; thereby obtaining the network- score.
According to still further features in the described preferred embodiments the apparatus further comprising a consistency function unit for obtaining, for each object, an object consistency-function, wherein the consistency function unit comprising: (i) electronic-calculating functionality for defining, for each object, A, having a dependency relation to at least one object B, and for each experiment of the plurality of experiment, an ordered set of states, the ordered set of states comprises states of the object, A, and of each of the at least one other object, B, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each object, A, and for each experiment; and (ii) electronic-calculating functionality for determining, for each pair of experiments, whether respective sets of states are identical, and if so than to define the pair of experiments to be a pair of consistent experiments; and (iii) electronic-calculating functionality for determining for each object, a maximal number of consistent experiments corresponding to the object, and for setting the object consistency-function to be the maximal number of consistent experiments.
According to still further features in the described preferred embodiments the fitness-function unit comprising: (i) electronic-calculating functionality for summing all the object consistency-functions of all the objects in the core biological network, thereby to provide a core consistency-function; and (ii)
electronic-calculating functionality for summing all the object consistency- functions of all the objects in the expanded core biological network, thereby to provide an expanded core consistency-function.
According to still further features in the described preferred embodiments the fitness-function unit operable to set fitness-function to be a regulation- specificity, the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function.
According to still further features in the described preferred embodiments the fitness-function unit operable to calculate the regulation-specificity for each of the at least one additional object, thereby to provide at least one regulation- specificity.
According to still further features in the described preferred embodiments the objects are selected from the group consisting of expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters and epidemiological parameters.
According to still further features in the described preferred embodiments at least one of the dependency relations represents an interaction law between to objects. According to still further features in the described preferred embodiments at least one of the dependency relations represents a control relation selected from the group consisting of control relation at a level of mRNA, protein and protein-activity.
According to still further features in the described preferred embodiments the core biological network represents a transcriptional network.
According to still further features in the described preferred embodiments the core biological network represents protein network having relations among hetrodimers, kinases and ubiqutines.
According to still further features in the described preferred embodiments the core biological network represents genes encoding proteins participating in at least one biochemical pathway.
According to still further features in the described preferred embodiments the core biological network represents biological interactions among any combination of objects selected from the group consisting of genes, proteins, chemicals and external signals.
According to still further features in the described preferred embodiments the fitness-function equals a mutual information function. According to still further features in the described preferred embodiments the fitness-function equals a Bayesian local score function.
According to still further features in the described preferred embodiments the Bayesian local score function is a Dirichlet equivalent score function.
According to still further features in the described preferred embodiments the report is selected from the group consisting of a graph, a table and a list.
According to still further features in the described preferred embodiments the table comprises a plurality of elements each representing a dependency relation between two objects.
According to still further features in the described preferred embodiments the graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
According to still further features in the described preferred embodiments the report includes the network-score.
According to still further features in the described preferred embodiments the report includes the at least one object-score.
According to still further features in the described preferred embodiments the report includes at least one of the plurality of network-scores.
According to still further features in the described preferred embodiments the report comprises at least one pair of objects having a dependency relation therebetween.
According to still further features in the described preferred embodiments the report includes at least one of the plurality of object consistency-functions.
According to still further features in the described preferred embodiments the report includes at least one of the at least one regulation-specificity. The present invention successfully addresses the shortcomings of the presently known configurations by providing a method and apparatus for determining a network of biological relationships far exceeding prior art.
BRIEF DESCRIPTION OF THE DRAWINGS The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
In the drawings:
FIG. la represents a core biological network, according to the present invention;
FIG. lb represents an expanded core biological network, according to the present invention;
FIG. 2a represents a portion of an expanded core biological network, according to the present invention;
FIG. 2b represents a first example of a portion of a dataset, according to the present invention;
FIG. 2c represents a second example of a portion of a dataset, according to the present invention;
FIG. 3 shows an apparatus for determining a network of biological relationships from a dataset, according to the present invention; FIG. 4 shows a dependency structure editor, according to the present invention;
FIG. 5 shows a variable list window, according to the present invention;
FIG. 6 shows a logic viewer, according to the present invention;
FIG. 7 shows an interaction window, according to the present invention; FIG. 8 shows a one-expansion screen result window, according to the present invention;
FIG. 9 represents a basic ergosterol metabolic pathway from farnesyl to ergosterol;
FIG. 10 shows an analysis of the promoter region of ERG 11 according to prior art; and -
FIG. 11 shows a Venn-like diagram of the ERG4 dependent genes;
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is of a method and apparatus for determining a network of biological relationships which can be used to construct a network of interactions between various biological objects. Specifically, the present invention can be used to construct a network of interactions, in the sub-cellular, cellular, organism or population level, among expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters, or among epidemiological parameters and the like. The principles and operation of the present invention may be better understood with reference to the drawings and accompanying descriptions.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the
following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting. The present invention is primarily directed at a network describing a dataset of biological objects, e.g., expressed genes or proteins, each of which is represented by a plurality of states. The states may correspond, for example, to different expression levels of a gene or a protein, or to different condition of a protein: present, absent, phosphorylated etc. According to one aspect of the present invention there is provided a method of determining a network of biological relationships from a dataset. The method according to this aspect of the present invention is executed by a data processor and is effected by implementing the following method steps, in which, in a first step, a core biological network is inputted. As is further detailed hereinunder, the core biological network consists of a portion of the biological objects and dependency relations between the objects.
As used herein, the term network refers to a network of biological relationships and the term core refers to a core network.
In a second step of the method according to this aspect of the present invention, the core is expanded by adding thereto at least one additional object of the dataset and at least one additional dependency relation to at least one existing object of the core.
According to a preferred embodiment of the present invention the second step may be executed more than once, so as to obtain at least one candidate expanded core.
A third step of the method of this aspect of the present invention comprises issuing a report describing the network. As is further described hereinunder, the report may be in any form suitable for sufficiently describing a network and may include a list, a table, a graph or the like. Preferably, the
report includes both the objects of the network (e.g., the nodes of the graph, if such is issued) and the dependency relations (e.g., the edges of the graph).
Cores can be constructed manually by an expert with a focused interest on a specific pathway or directly from a dataset containing collections of known pathways. The present invention is aimed to add objects and dependencies to the core so that the dataset is predicted better by the extended core.
Referring now to the drawings, Figure la illustrates a core 10 having a plurality of objects 12, and a plurality of dependency relations 14 between objects 12. Objects 12 are shown in Figure la as circles and dependency relations 14 as arrows interconnecting two of the circles.
Reference is now made to Figure lb which illustrates an expanded core 20. Expanded core 20 includes additional objects 16 and dependency relations which have been added to core 10, thus providing expanded core 20.
Before providing a further detailed description of the method and apparatus for determining a network of biological relationships, as delineated hereinabove and in accordance with the present invention, attention will be given to the advantages and potential applications offered thereby.
Thus, in sharp distinction to the prior art, an expansion of a given core leads to a major reduction in model space dimensionality and allows the apparatus to both spend more computational power on a target subsystem and to be less prone to over-fitting. Additionally, the core may also serve for integrating diverse sources of information into a well defined computational process.
The dataset from which the network is determined may represent a plurality of biological measurements and/or other information. Thus, for example, the dataset may represent a plurality of genes, proteins or protein activities, which have been detected in a plurality of experiments (e.g., by detecting specific mRNA molecules or by detecting the level and/or catalytic activity of specific proteins). Alternatively, the dataset may represent data
pertaining to levels or presence of antibodies, metabolites, reactants and products of enzymatic reactions, physiological parameters and/or biochemical parameters. Still alternatively, the dataset may represent a plurality of biological components generated and/or disappeared, such as symptoms occurring in various stages of one ore more pathologies.
It would be appreciated that the dataset is not limited to infra-cellular, cellular or intra-organism objects. Hence, the dataset may also include data representing an evolution process among species, which is reflected by transitions, appearances and/or disappearances of biological objects. Still alternatively, the dataset may represent data relating of an epidemiological process, again, reflected by transitions, appearances and/or disappearances of biological objects.
The dependency relations between objects of the dataset may represent control at the RNA, protein or protein-activity levels. At the level of RNA, control can take place at the initiation of transcription, provision of RNA precursors, RNA processing (including RNA splicing and RNA editing), nucleolysis of RNA molecules and the like. At the protein level, control can take place at translation, post translational modifications, elimination of protein molecules, e.g., via ubiquitination and the like. At the protein-activity level, control can take place via cellular localization, interactions with other proteins, phosphorylation/dephosphorylation and proteolysis.
As a specific, non limiting example, the dataset may comprise expression levels of genes. In this case the states of the dataset are the expression levels of the genes. Hence, the dataset may acquire the form of a gene expression matrix, where the rows of the matrix correspond to genes and the columns correspond to different conditions or experiments. Expression levels of different genes may be correlated due to mutual biological regulatory mechanisms between the corresponding genes. For example, suppose that the dataset contains information on gene G and on gene G', and by comparing their expression levels
on different experiments it was found that there is a correlation between the expression levels of G and G'. Then, it may be possible that G regulates the expression of G' and vice versa. Hence, one defines a control-function based on the correlation between expression levels of the genes. Generally, a control-function of a variable, V, has a domain which includes the states of all its independent variables, and a range which includes the sate of V. In the above example, if G regulates G' then the expression levels of G belong to the domain while the expression levels of G' belong to the range. There are cases, however, where a specific gene (say, G') is regulated by more than one gene. Thus, the domain includes the expression levels of all the regulating genes.
It should be understood that the phrase control-function is chosen for illustrative purposes and these functions may be applied to any dataset as detailed above. As further detailed hereinafter, the domain and the range may be defined in different ways. In one embodiment, the domain and the range include a discrete set of values, referred to herein as input-values and output-values, respectively. In another embodiment the domain and the range include a set of distributions over the input and/or output values, representing probabilities rather than discrete values. In this embodiment, the distributions in the domain and in the range are referred to as input-distributions and output-distributions, respectively.
A system which is subjected to a specific experiment is generally characterized by an initial state (the state of the system before the experiment) and a final state (the state of the system after the experiment). Mathematically, an experiment is defined by a set of input-values and a set of output-values (or by a set of input-distributions and a set of output-distributions), which respectively constitute an input-vector and an output-vector. The input and output vectors are referred to herein as INP and OUT, respectively.
In a specific experiment some of the objects may be artificially constrained. For example in cases in which the objects are genes, some of the genes may be "knocked-out" or "over-expressed" prior to the experiment. Hence, according to a preferred embodiment of the present invention, the mathematical description of an experiment may include an additional set, which represents the perturbed genes. This set is referred to herein as PERT. The set PERT is used for excluding from consideration states of those genes which have been perturbed in a specific experiment and for which no conclusion can be drawn. Hence, each experiment is described by a doublet {INP, OUT} or a triplet {INP, OUT, PERT}. Specifically, time-series data, providing expression levels at a series of n time points, yield π-1 experiments, where the vectors at time points i and i+\ form INP and OUT of the z'th experiment, respectively.
An experiment in which INP = OUT, is called a steady state experiment. Datasets are often either time series of samples along some synchronized biological process, or a single sample from a cell culture under some condition. Steady state experiments might contain an averaging of an underlying temporal process and so modeling them correctly entails a less detailed representation of the biological system. Mathematically, for steady state data, one must exclude models with variables regulating themselves in order to avoid the trivial self- regulation solution.
According to a preferred embodiment of the present invention, the method further includes a scoring step. The scoring step is for the purpose of quantifying the quality of the procedure. The scoring step may be executed in more than one level, e.g., in an object level or in network level. Hence, in Figure lb, each of the additional objects 16 which have been added to core 10 may be scored so as to obtain at least one object-score, and each of the candidate expanded cores (such as expanded core 20) may be scored so as to obtain at least one network-score.
According to a presently preferred embodiment of the invention, the object-scores may be used as criteria for adding or not-adding objects 16 to core 10. For example, a specific object may be added to core 10 only if its object- score is above a predetermined threshold. In addition, object-scores may be used to quantify the influence of a specific object, once added to core 10, on other objects already present in the expanded core (for example objects 12, see Figures la-b). This may be useful, for example, to identify a specific relation between two objects in expanded core 20. As further exemplified in the example section below, it has been found by the inventors of the present invention, that in some applications a specific object has a high object-score solely due to a strong relation it has to a single object in the network.
Whereas the object-score has a local nature, the network-scores are global, and as such reflect the overall goodness of the expanded core with respect to the original dataset. Network-scores may be used to classify all the candidate expanded cores (such as expanded core 20) in accordance with their network-scores, say, in decreasing order from the best expanded core to the worse expanded core. Based on the network-scores and on the relations between objects within each expanded core (expressed via objects-scores), the user may choose the optimal expanded core for a specific pathway. According to a preferred embodiment of the present invention, the object-score is calculated by a fitness function. The fitness-fiinction may be any mathematical function which scores "good" expansions high and "bad" expansions low, as is further exemplified hereinbelow.
Hence, in one embodiment, the fitness-function equals a consistency- function, Consist, which is calculated by counting a maximal number of entries in INP which are consistently represented in OUT. The consistency-function, Consist, may be better understood by considering a simple example with reference to Figures 2a-c.
Figure 2a shows a portion of an expanded core (e.g., expanded core 20), having three objects 32, 34 and 36, where object 36 directly depends on objects 32 and 34 via dependency relations 38 and 40. As above, the objects are represented as circles and the dependency relations are represented as arrows. For example, in a case in which objects 32, 34 and 36 are genes, the portion of the core in Figure 2a represent a portion of gene network in which the genes 32 and 34 regulate together gene 36. Hence, Figure 2a exemplifies a candidate core which is to be scored by the consistency function, based on the dataset.
In each experiment in the dataset, the states of objects 32, 34 and 36 define an ordered set of states. Figure 2b shows a first example of a portion of a dataset having 5 experimental results el, ..., e5, of objects 32, 34 and 36 which are represented as integers 0, 1 or 2. The corresponding order sets of the five experiments are el={0,0,0}, e2={l,2,l}, e3={0,l,l}, e4={2,2,2} and e5={ 0,1,0}. Referring to experiments e5 and e3, in both experiments the first two states (the expression levels of objects 32 and 34) are 0 and 1, respectively. However, the state of object 36 in experiment e3 differs from its state in experiment 5.
In other words, a control-function having the values of objects 32 and 34 in its domain and the value of object 36 in its range is inconsistent on experiments e3 and e5.
According to a preferred embodiment of the present invention for such cases the contribution of this pair of experiments to the value of Consist is zero. Figure 2c shows a second example of a portion of a similar dataset having 5 experimental results of the three objects as above with an exception that the value of object 36 in experiment e5, is 1 (as opposed to 0 in the first example). Unlike the first example, in the dataset of Figure 2c, there is perfect consistency between the ordered sets of experiments e3 and e5 (e3=e5={0,l,l }). For such cases, the contribution of this pair of experiments to value of Consist is a unity. As can be understood all other pairs of input values appear only once
and therefore all other ordered sets are trivially consistent. According to a preferred embodiment of the present invention the overall value of Consist is the sum of its values on each pair of input-output values.
Hence, Consist measures the total level of consistency of the output object with its (possibly) regulating objects, as measured in the dataset. In one possible embodiment, it equals the maximum number of consistent experiments, each being defined such that the output-value is determined uniquely by the input- value. When Consist equals exactly the number of experiments it is said that the consistency is perfect. As stated, an experiment may include some constraints which are represented by PERT. According to a preferred embodiment of the present invention, in computing the value of Consist of an object, experiments in which that object is perturbed are excluded
The consistency-function described above provides a sufficient tool for the scoring step. It would be of utmost advantage however, to have a fitness- function which would handle probabilistic input data. While reducing the present invention to practice it has been uncovered that the utility of the invention may be significantly enhanced by providing a fitness-function handling probabilistic datasets. Probabilistic dataset may be provided by defining the domain and the range of each control-function using input- and output-distributions. According to a preferred embodiment of the present invention, each of the input- distributions and each of the output-distributions are defined within a probability space which is related to a direct product of a first probability space Y and a second probability space X. Specifically, X is used as the domain and Y is used as the range of the states, where each of the spaces X and Y is represented by a plurality of probabilities, which are preferably calculated as detailed herein.
For each set of input-values, a probability in Y equals a fraction of experiments in which INP equals the corresponding set of input-values. For
each output-value, a probability in X equals a fraction of experiments in which OUT equals the corresponding output-value. The probability space, (YxX)n, is defined by obtaining n random samples from Y and n random samples X, where each sample is interpreted as a single experiment with a value from Y as the input-value and a value from X as the output-value.
Once the probability space (YxX)n is defined, states from the probability spaces Y and X are used as entries in the vectors INP and OUT, respectively, and the consistency-function is calculated by counting a maximal number of entries in INP which are consistently represented in OUT. When studying complicated networks using limited datasets, a major source of concern is over-fitting. Typically, over-fitting may occur in cases in which random control patterns yield a score which is as high as the real control function score. The present invention successfully addresses this problem by providing a probability function which gives information regarding the specificity of a speculated regulation pattern.
Thus, in another embodiment the fitness-function itself has a probabilistic property. In this embodiment, the fitness-function is called regulation-specificity and denoted herein rSpec.
According to a preferred embodiment of the present invention rSpec equals a probability that the value of the consistency-function of core 20 is larger than- or equal to- the value of the consistency-function of core 10. Hence, if a regulated object and a possible set of its regulators have consistency k based on a set of experiments (which consistency has been calculated using discrete or probabilistic domain and range), then the regulation specificity, rSpec, is the probability of obtaining a consistency of k or higher in a probability space, (YxX)n.
The size of the probability space defined above is exponential in the number of experiments. According to a preferred embodiment of the present invention, in cases where the difference n-k is small (i.e., almost perfect
consistency), rSpec may be calculated by the following approximation which is linear in the number of experiments, hence more efficient in terms of computation time. In the present embodiment, rSpec equals to a probability for obtaining, for each experiment, a consistency which is larger than or equal to k in a dataset with the n inputs from the INP and outputs which are randomly sampled from X. Specifically, for / input configurations with multiplicities n\,...,n then rSpec is the probability of getting «' identical values out of nh z-1,...,/, samples from X where ∑«' > k.
In an additional embodiment, the fitness-function may also be calculated based on a permutation sensitivity analysis of the observed consistency pattern. The probability spaces X and Y which were defined for rSpec, may also be used in the present embodiment. Hence, according to a preferred embodiment of the present invention, the measured Y values are fixed while the measured X values are randomly permutated. For each random permutation of X, a probability of obtaining the consistency is calculated. A skilled artisan would appreciate that this probability is high for low consistency and low for high consistency. In other words, the fitness-function may be calculated by fixing the domain and applying random permutation only on the range of the states.
It would be appreciated by one ordinarily skilled in the art that other functions may be used as fitness-functions of the present invention. Hence, in still another embodiment, the fitness-function may be a mutual information function M(X,Y). The mutual information function is known in the art. To this end see, e.g., a book by T.M. Cover and J.M. Thomas, entitled "Elements of Information Theory", published by John Wiley & Sons (1991), the contents of which are hereby incorporated by reference.
Although the mutual information function is defined for discrete distributions, it may be generalized to a probability space similarly to the above procedure. Hence, given a set E of n experiments, the input (or output) distribution of a variable set S = v1,...,v^ is calculated as a fraction of experiments in E for which the inputs (or outputs) of S were exactly c\,...,cd.
Denoting this distribution by XE S, the mutual information of a regulating set S, a variable v and a set E of experiments equals M(XE S, XE {V})-
In yet another embodiment, the fitness- function may be a Bayesian local score function. Bayesian local score functions are known in the art, and disclosed, e.g., in N. Friedman, et. al., "Using Bayesian networks to analyze expression data", Proceedings of the Fifth Annual International Conference on Computational Molecular Biology 127-135 (2000), the contents of which are hereby incorporated by reference. With respect to Bayesian score functions, the fitness function may be a Dirichlet equivalent score function, which is further detailed in N. Friedman, et. al.
As stated, the object-score which is preferably calculated by one of the fitness-functions detailed above, has a local nature. According to a preferred embodiment of the present invention, a global network-score is calculated by the following steps. In a first step, given the objects of core 20 as possible regulators, the fitness function is calculated for each object of core 10, thereby obtaining at least one fitness-function-value. A second step includes averaging the fitness-function-values over the objects of core 10. A skilled artisan would appreciate that such a procedure provides a tool for evaluating an overall score for core 20. Irrespectively of the scoring procedure as detailed hereinabove, the present invention computationally infers biological pathways by finding an expanded core that fits the experimental data best. As stated, core 10 of Figure lb represents existing biological knowledge therefore it allows an improved specificity of expanded core 20, compared to the quality of some de-novo reconstruction of a network.
Once the scoring step is complete, the object- and network-scores may be used to generate a report describing the network. As stated, the report may be in any form suitable for sufficiently describing a network, e.g., any combination of a list, a table and/or a graph.
The report may include a list of all candidate networks, sorted in accordance to their individual network-scores. In addition, in a specific network one or more object-scores may be presented, preferably in a decreasing order, so as to report on the relative or absolute contribution of the corresponding object to the overall network-score. Based on the object-score, the user may decide whether or not to accept the corresponding object to the network. Still in addition, the report may include the effect of a specific object on other objects. For example, in cases in which the dataset comprises data pertaining to the expression levels of a plurality of genes, the report may include, for a specific gene, the level of regulation of the gene on other genes, so that the user may study a particular regulation mechanism between two genes.
In an embodiment in which the report includes a table, the table preferably comprises a plurality of elements each representing a dependency relation between two objects. In an embodiment in which the report includes a graph, the graph preferably includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects. Thus, the graph provides a visual description of both the objects of core 20 and the dependency relations between different objects in core 20. According to another aspect of the invention, there is provided an apparatus for determining a network of biological relationships from a dataset, the apparatus is generally referred to herein as apparatus 50.
Reference is now made to Figure 3, which illustrate apparatus 50. Hence, apparatus 50 comprises an input unit 52, a core expander 54 and an output unit 56. According to a preferred embodiment of the present invention input unit 52 serves for inputting a core biological network which is similar to core 10, as detailed hereinabove. Core expander 54 serves for expanding the core by adding to the core at least one additional object of the dataset. Thus,
similarly to the above description, core expander 54 provides an expanded core, which may be, for example, core 20.
Output unit 56 serves for issuing a report describing the network of biological relationships. The report, generated by output unit 56, may have any form, preferably one of the forms described above.
Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.
EXAMPLES
Reference is now made to the following examples, which together with the above descriptions, illustrate the invention in a non limiting fashion.
Database and Computer Resources The method has been tested on yeast transcription datasets using the ergosterol pathways as a core. The study was focused on yeast, which has the largest publicly available gene expression datasets. The variable set consisted of 6200 yeast open reading frames (ORFs). Transcription profiles were taken from two large scale yeast cDNA arrays experiments: (1) 260 selected knockout experiments performed -by Hughes et. al. [T. R. Hughes et. al., "Functional discovery via a compendium of expression profiles", Cell, 102:109-26, (2000)]; and (2) 100 experiments testing yeast behavior in stressful conditions performed by Gasch et. al. [A. P. Gasch el. al., "Genomic expression programs in the response of yeast cells to environmental changes", Mol. Biol. Cell, 11 :4241-57 (2000)].
In the following examples, a single node expansion process was used. In the process, for each object, the sum of fitness gains to all core objects from adding that object to the core, is calculated. Thus, the goal was to identify genes
that might regulate or indirectly affect the pathway in those experiments which are left unexplained by the core model.
The computations were conducted on a Pentium III 500 MHz processor, using a Linux operating system. A new software platform GENESYS™ (GEnetic Network Expansion
SYStem) has been developed to implement the method described hereinabove. The platform includes engines for representing networks and computing fitness, a flexible expansion algorithm, viewers for visualization of biological data sets, an application to enable interactive usage of the viewers and engine and an internal database scheme for the storage of datasets and pathways.
The GENESYS environment is built as a modular prototype composed of back-end and front-end implemented in C++ and Perl/Tk. The front-end is a
GUI application featuring a wide selection of viewers and providing means to invoke computational processes using the back-end. The back-end provides a set of libraries and computational engines using them.
The basic features of GENESYS front-end are described herein with reference to the accompanying computer screen snapshots.
Figure 4 shows a dependency structure editor. The editor enables creation, viewing and saving of network cores, manipulation of nodes and arcs in the graph and it also provides connectivity to the logic viewer described below.
Figure 5 shows a variable list window which is a list widget providing access to the all the objects with all the information attached to it.
Figure 6 shows a logic viewer which presents a binary decision diagram (BDD) given a focus variable and its current incoming neighbors. A BDD [R. E. Bryant, "Graph based algorithms for Boolean function manipulation", IEEE Transactions on Computers, C-35(8):677-91 (1986)] represents the effect of inputs values on output value of the focus variable and provides insight into the structure of the discrete logic induced by a given dependency structure. To create a BDD, a layered tree is constructed where each level represents the
addition of a new input variable. The leafs of the tree represent groups of experiments in which the input added up to the leaf level takes the same value. A parent child relation is added whenever the input assignment of level / is a subset of the assignment of the bigger input set at level /+1. In the final tree level, the output value is introduced to reveal the relation between input assignment and output value. Using the BDD, users can explore the logic underlying a putative dependency structure and deduce functional roles each input may have (activator/repressor or more complicated functions).
Figure 7 shows an interaction window which provides putative interactions from the GENESYS database. The window lets the user navigate in the interaction graph by showing each time the neighborhood of a focus variable which may be selected interactively. In the interaction window different edge colors are used for different types of interaction.
Figure 8 shows a one-expansion screen result window which lists the expansion candidates of a given core after the completion of a screen. The user can sort the list according to various parameters and evaluate the affinity of each of the expansion object to different parts of the core.
Ergosterol Metabolism Ergosterol is an essential lipid in yeast which is to the equivalent of cholesterol in mammals. Ergosterol's primary role is in the cell membranes but it is also involved in aerobic metabolism, sterol uptake and sterol transport. Ergosterol metabolism is understood rather well. Ergosterol metabolism is composed of two pathways in series. The first, the mevalonate pathway, transforms acetyl-CoA to farnesyl and provides essential components for few important metabolic pathways (e.g., heme and quinines). The latter pathway transforms farnesyl to ergosterol. Much of the regulation of ergosterol is believed to be transcriptionally mediated, but the actual details are known only in part [G. F. Bammert and J. M. Fostel, "Genome-wide expression patterns in Saccharomyces cerevisiae: comparison of drug treatments and genetic alterations affecting biosynthesis of ergosterol", Antimicrobial Agents and
Chemotherapy, 44:1255-1265 (2000); G. Daum, N. D. Lees, M. Bard and R. Dickson, "Biochemistry, cell biology and molecular biology of lipids of Saccharomyces cerevisiae", Yeast, 14:1471-1510 (1998); T. G. Turi and J. C. Loper, "Multiple regulatory elements and control expression of the gene encoding the Saccharomyces cerevisiae Cytochrome P450, lanosterol 14a- demethylase (ERG11)", JBC, 267:2046-56 (1992)].
Figure 9 shows the basic known ergosterol metabolic pathway from farnesyl to ergosterol, including a series of 11 enzymes and three transcription factors. It is important to stress here the difference between metabolic pathways and regulatory networks: The fact that two enzymes follow each other in a biochemical process does not mean their transcription regulation is directly connected. The ergosterol dependency structure core was modeled as a set of variables, with dependencies marked only between known transcription factors and their targets. In other words, no dependency was prescribed between enzymes.
The reactions of pathway enzymes in the entire data set were analyzed. A number of experiments showed a global reaction of the pathway: in those experiments most of the pathway enzymes underwent significant change. This is presumably the result of some self regulatory mechanism (and indeed ergosterol itself is reported to function as transcription regulator for its pathway enzymes). However, many other experiments (about 40) showed a change in one or more of the pathway genes, which is not explained by the above mechanism. Those experiments may be explained by a more elaborate model.
EXAMPLE 1
Transcription Factors Screening
Out of the ~ 6200 yeast ORFs, 130 putative transcription factors (TFs) were identified, using SGD annotations [C. A. Ball et. al, "Saccharomyces genome database provides tools to survey gene expression and functional analysis data", Nucleic Acids Research, 29:80-1 (2001)], and typical structural
motifs (e.g., zinc fingers). A single node expansion algorithm was then applied, limiting the candidates for node expansions to these putative TFs.
Results are summarized in Table 1 below, showing a fitness gain of each of the putative TFs, ranked against a "naked" core consisting of the eleven ERG enzymes with no dependencies among them. HAPl was ranked second out of 130, in agreement with the known role of HAPl in ERG11 regulation. TUP1 is a general repressor and was thus ranked lower, ROXl was less expressed in the data and was ranked much lower.
Table 1
Prior art analysis [T. Turi and J. Loper, "Multiple regulatory elements and control expression of the gene encoding the Saccharomyces cerevisiae Cytochrome P450, lanosterol 14a-demethylase (ERGl l)", JBC, 267:2046-56 (1992)] of the promoter region of ERG 11 is summarized in Figure 10.
The ERGl l regulation has been tested by applying a single node expansion to a core consisting of the eleven ERG enzymes as well as HAPl and ROXl as regulators of ERGl l. The algorithm measured the improvement in fitness contributed by each of the 130 TFs, and an uncharacterized gene was ranked first. That gene improves the fitness of ERG1 (and others). Remarkably, it also has a good homology to HAPl (33 % identity, 50 % similarity along 100 amino acids and even better in a shorter range).
Moreover, analyzing ERGl l logic as a function of HAPl, ROXl, TUP1 and novel TF shows that the effect of the new putative TF on eRGl l is
inductive (as expected from a UAS2 binding gene). Thus, the hypothesis that the novel TF is indeed an ERGl l regulator that might bind to UAS2 is supported by three different methods: (1) sequence homology; (2) promoter analysis indicating a second inducer should exist; and (3) the screening procedure of the present example, using 360 different expression profiles in distinct cell states.
EXAMPLE 2 Screening All Genes In the present example, all ~ 6200 yeast ORFs were screened against the ergosterol core. This type of analysis may discover more general patterns of regulation that cannot be directly tagged as "A is a factor of B". Still, as shown below, some interesting biology may be learned from it. The results of such a screen are given in Table 2 below. The two top ranking genes, POS5, YBR043C, are both of unknown function. POS5 has homology to iron metabolism enzymes. Both present significant fitness gain for ERG4 regulation. ERG4 is the last of the ergosterol pathway enzymes, is not essential and little is known on its regulation.
Reference is now made to Figure 11 showing a Venn-like diagram of the ERG4 dependent genes. The diagram represents all experiments in which ERG4, POS5 and YBR043C were induced. The size of each set is indicated by a number. As can be seen from Figure 11 induction POS5 and YBR043C strongly correlates with of ERG4 induction (11/12 experiments in both cases). In addition, ERG4 is showing a second, separate regulation pattern (5 experiments) which is unrelated to POS5 and YBR043C. It would be appreciated that using standard clustering or similarity, the behavior of ERG4 in experiments with no POS5, YBR043C involvement would have masked the pattern identified in the present example.
The fourth gene in the screen list is INOl which is involved in inositol biogenesis. Inositol has a regulatory function in the pospholipid pathway
(adjacent to ergosterol). Note that the dependency is localized differently (improving different variables) in that case. The relation of GAS 1 to ergosterol might be rooted in its function in the cell wall. The dependency between our core and MKK2 is very reasonable considering its function in the signaling pathway to the cell wall protein PCKl. The 11th gene in the list is ERG10, which is the first gene in the mevalonate pathway leading to our core.
The dependencies revealed by the general 1 -expansion screening can serve as the basis for deeper biological exploration. The process pinpoints statistically significant patterns which are hard to identify otherwise. In contrast with the TF 1 -expansion screening, the results are less direct and do not identify specific dependencies.
Table 2
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single
embodiment, may also be provided separately or in any suitable subcombination.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.