WO2002103608A2 - Method of expanding a biological network - Google Patents

Method of expanding a biological network Download PDF

Info

Publication number
WO2002103608A2
WO2002103608A2 PCT/IL2002/000460 IL0200460W WO02103608A2 WO 2002103608 A2 WO2002103608 A2 WO 2002103608A2 IL 0200460 W IL0200460 W IL 0200460W WO 02103608 A2 WO02103608 A2 WO 02103608A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
function
objects
core
fitness
Prior art date
Application number
PCT/IL2002/000460
Other languages
French (fr)
Other versions
WO2002103608A3 (en
Inventor
Ron Shamir
Amos Tanay
Original Assignee
Ramot University Authority For Applied Research & Industrial Development Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ramot University Authority For Applied Research & Industrial Development Ltd. filed Critical Ramot University Authority For Applied Research & Industrial Development Ltd.
Publication of WO2002103608A2 publication Critical patent/WO2002103608A2/en
Publication of WO2002103608A3 publication Critical patent/WO2002103608A3/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/10Boolean models
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20Probabilistic models

Definitions

  • the present invention relates to biological networks and, more particularly, to an efficient and effective method of expanding biological networks.
  • RNA e.g., through control of initiation of transcription, provision of RNA precursors, RNA processing, and nucleolysis of RNA molecules
  • protein e.g., through control of translation, post translational modifications and elimination of protein molecules
  • protein-activity e.g., through control of cellular location, interactions with other proteins, further post translational modifications and proteolysis.
  • fundamental biological processes such as cell cycle regulation, cell differentiation and regulated cell death (also known as apoptosis), are characterized by variations in the expression levels of groups of genes.
  • Gene expression is also associated with pathogenesis. For example, the lack of sufficient expression of functional tumor suppressor genes could lead to tumorgenesis. Similarly, over expression of oncogenes and metastatic genes could lead to tumorgenesis and metastasis. Thus, changes in the level of expression of particular genes serve as signposts for the presence and progression of various diseases.
  • genes or other biological variables, such as proteins
  • the regulatory relationships among genes form a biological network, composed of biological objects and relations among them.
  • the biological network is an extremely complex entity, having numerous levels of redundancies on one hand and composed of multi-function factors on the other hand.
  • Micro-array technology is one of several developing approaches to comparatively analyze genome-wide patterns of mRNA and protein expression.
  • the ultimate goal is to develop arrays which contain every gene in a genome against which mRNA expression levels can be quantitatively assessed.
  • Nucleic acid micro-arrays contain a wide sampling, or array, of cDNAs or oligonucleotides representing genes of a single species, such as human, on a slide.
  • Such arrays can be used to determine the levels of mRNA species derived from a cell or tissue of interest tagged with a fluorescent label, and allowed to bind to the array of cDNAs or oligonucleotides corresponding to the different genes.
  • Antibody micro-arrays may include antibodies recognizing prelabeled proteins in a sample in a similar manner. Similarly, antigen micro-arrays can be used to determine the level of different antibodies in a sample.
  • clustering analysis clusters genes according to the shape similarity of their pattern of expression, with clusters related to specific biological functions. This approach has been applied, e.g., to identify genes involved in a metabolic shift from the yeast genome [DeRisi et al, Science 278:680-686 (1997)].
  • clustering is a useful way to identify common data patterns, it is a rather crude method, as it is based on pairwise comparisons, and as such it is only a first step of analyzing the data.
  • researchers Based on understanding of the biological regulatory mechanisms and on theoretical examination of the evolutionary implications of the system as a whole, researchers have constructed different mathematical models to describe the behavior of biological systems [A.Arkin, J.Ross et.
  • a method of determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects the method is executed by a data processor and comprising: (a) inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) issuing a report describing the network of biological relationships.
  • the at least one additional object has at least one dependency relation to at least one existing object of the core biological network.
  • the at least one additional object comprises at least one pair having a dependency relation therebetween.
  • the dataset represents a plurality of experiments, each experiment of the plurality of experiments being characterized by a plurality of states, one state for each object.
  • At least a portion of the plurality of states has a probabilistic property.
  • the probabilistic property is defined by a plurality of distributions each of the distributions being defined within a probability space, the probability space being represented by a plurality of probabilities, each of the plurality of probabilities corresponding to one state.
  • each probability of the plurality of probabilities equals a fraction of a class of experiments characterized by a predetermined rule.
  • the predetermined rule is such that a particular state of a particular object has a predetermined value.
  • the predetermined rule is such that each state of a portion of the plurality of states which respectively corresponds to a portion of the plurality of objects, has a value which is selected from a predetermined set of values.
  • the method further comprising repeating step (b) at least once, thereby obtaining a plurality of expanded core biological networks.
  • the method further comprising redefining the expanded core biological network to be the core biological network, prior to step (b).
  • the optimally expanding is by defining at least one fitness function.
  • the fitness-function is selected so as to minimize over-fitting of the network of biological relationships to the dataset.
  • the at least one additional object of step (b) is selected so as to maximize the fitness-function.
  • the optimally expanding is by scoring the expanded core biological network, thereby obtaining a network-score.
  • the scoring is by: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network; thereby obtaining the network-score.
  • the method further comprising scoring each object of the at least one additional object, thereby obtaining at least one object-score, and sorting the additional objects in accordance with the object-scores.
  • the method further comprising scoring each of the plurality of expanded core biological networks, thereby obtaining a plurality of network-scores, and sorting the plurality of expanded core biological networks in accordance with the network-scores.
  • the scoring each of the plurality of expanded core biological networks is by: for each of the plurality of expanded core biological networks: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network, thereby obtaining a network-score; thereby obtaining the plurality of network-scores.
  • the method further comprising, for each object, A, having a dependency relation to at least one object B: (i) defining, for each experiment of the plurality of experiments, an ordered set of states, the ordered set of states comprises one state of object, A, and one state for each of the at least one object, B, the states corresponding to the experiment, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each experiment; and (ii) for each pair of experiments determining whether respective ordered sets of states are identical, and if so than defining each experiment of the pair of experiments to be a consistent experiment; thereby obtaining, for each object having a dependency relation to at least one object, a set of consistent experiments.
  • the method further comprising defining, for each object, an object consistency- function, the object consistency-function equals a maximal number of consistent experiments corresponding to the object, thereby providing a plurality of object consistency-functions .
  • the fitness-function equals the object consistency-function.
  • the method further comprising: summing all the object consistency-functions of all the objects in the core biological network, thereby providing a core consistency-function; and summing all the object consistency-functions of all the objects in the expanded core biological network, thereby providing an expanded core consistency-function.
  • the fitness-function equals a regulation-specificity
  • the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function.
  • the method further comprising calculating the regulation-specificity for each of the at least one additional object, thereby providing at least one regulation- specificity.
  • the dataset comprises at least one perturbed state, each of the least one perturbed state corresponds to a constraint within the dataset.
  • an apparatus for determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects comprising: (a) an input unit for inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between i pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) a core expander for optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) an output unit for issuing a report describing the network of biological relationships.
  • the core expander is operable to expand the core biological network a plurality of times, thereby to obtain a plurality of expanded core biological networks.
  • the core expander is operable to redefine the expanded core biological network to be the core biological network.
  • the core expander comprises a fitness-function unit, operating at least one fitness function.
  • the fitness-function unit operable to select the fitness-function so as to minimize over-fitting of the network of biological relationships to the dataset.
  • the core expander is operable to select the at least one additional object so that the fitness-function is maximized.
  • the apparatus further comprising a scorer, for scoring each object, thereby to obtain a plurality of object-scores and for summing at least a portion of the plurality of object-scores corresponding to objects of the expanded core biological network thereby to obtain a network-score.
  • the scorer comprising: (i) electronic-calculating functionality for calculating a fitness-function for each object of the core biological network, so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby to obtain at least one fitness-function-value; and (ii) electronic-calculating functionality for averaging the at least one fitness- function-value over the core biological network; thereby obtaining the network- score.
  • the apparatus further comprising a consistency function unit for obtaining, for each object, an object consistency-function
  • the consistency function unit comprising: (i) electronic-calculating functionality for defining, for each object, A, having a dependency relation to at least one object B, and for each experiment of the plurality of experiment, an ordered set of states, the ordered set of states comprises states of the object, A, and of each of the at least one other object, B, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each object, A, and for each experiment; and (ii) electronic-calculating functionality for determining, for each pair of experiments, whether respective sets of states are identical, and if so than to define the pair of experiments to be a pair of consistent experiments; and (iii) electronic-calculating functionality for determining for each object, a maximal number of consistent experiments corresponding to the object, and for setting the object consistency-function to be the maximal number of consistent experiments.
  • the fitness-function unit comprising: (i) electronic-calculating functionality for summing all the object consistency-functions of all the objects in the core biological network, thereby to provide a core consistency-function; and (ii) electronic-calculating functionality for summing all the object consistency- functions of all the objects in the expanded core biological network, thereby to provide an expanded core consistency-function.
  • the fitness-function unit operable to set fitness-function to be a regulation- specificity
  • the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function.
  • the fitness-function unit operable to calculate the regulation-specificity for each of the at least one additional object, thereby to provide at least one regulation- specificity.
  • the objects are selected from the group consisting of expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters and epidemiological parameters.
  • At least one of the dependency relations represents an interaction law between to objects.
  • at least one of the dependency relations represents a control relation selected from the group consisting of control relation at a level of mRNA, protein and protein-activity.
  • the core biological network represents a transcriptional network.
  • the core biological network represents protein network having relations among hetrodimers, kinases and ubiqutines. According to still further features in the described preferred embodiments the core biological network represents genes encoding proteins participating in at least one biochemical pathway.
  • the core biological network represents biological interactions among any combination of objects selected from the group consisting of genes, proteins, chemicals and external signals.
  • the fitness-function equals a mutual information function. According to still further features in the described preferred embodiments the fitness-function equals a Bayesian local score function.
  • the Bayesian local score function is a Dirichlet equivalent score function.
  • the report is selected from the group consisting of a graph, a table and a list.
  • the table comprises a plurality of elements each representing a dependency relation between two objects.
  • the graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
  • the report includes the network-score.
  • the report includes the at least one object-score.
  • the report includes at least one of the plurality of network-scores.
  • the report comprises at least one pair of objects having a dependency relation therebetween. According to still further features in the described preferred embodiments the report includes at least one of the plurality of object consistency-functions.
  • the report includes at least one of the at least one regulation-specificity.
  • FIG. la represents a core biological network, according to the present invention.
  • FIG. lb represents an expanded core biological network, according to the present invention.
  • FIG. 2a represents a portion of an expanded core biological network, according to the present invention.
  • FIG. 2b represents a first example of a portion of a dataset, according to the present invention
  • FIG. 2c represents a second example of a portion of a dataset, according to the present invention
  • FIG. 3 shows an apparatus for determining a network of biological relationships from a dataset, according to the present invention
  • FIG. 4 shows a dependency structure editor, according to the present invention
  • FIG. 5 shows a variable list window, according to the present invention
  • FIG. 6 shows a logic viewer, according to the present invention
  • FIG. 7 shows an interaction window, according to the present invention
  • FIG. 8 shows a one-expansion screen result window, according to the present invention
  • FIG. 9 represents a basic ergosterol metabolic pathway from farnesyl to ergosterol
  • FIG. 10 shows an analysis of the promoter region of ERG 11 according to prior art.
  • FIG. 11 shows a Venn-like diagram of the ERG4 dependent genes
  • the present invention is of a method and apparatus for determining a network of biological relationships which can be used to construct a network of interactions between various biological objects.
  • the present invention can be used to construct a network of interactions, in the sub-cellular, cellular, organism or population level, among expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters, or among epidemiological parameters and the like.
  • the principles and operation of the present invention may be better understood with reference to the drawings and accompanying descriptions.
  • the present invention is primarily directed at a network describing a dataset of biological objects, e.g., expressed genes or proteins, each of which is represented by a plurality of states.
  • the states may correspond, for example, to different expression levels of a gene or a protein, or to different condition of a protein: present, absent, phosphorylated etc.
  • a method of determining a network of biological relationships from a dataset is executed by a data processor and is effected by implementing the following method steps, in which, in a first step, a core biological network is inputted.
  • the core biological network consists of a portion of the biological objects and dependency relations between the objects.
  • network refers to a network of biological relationships and the term core refers to a core network.
  • the core is expanded by adding thereto at least one additional object of the dataset and at least one additional dependency relation to at least one existing object of the core.
  • the second step may be executed more than once, so as to obtain at least one candidate expanded core.
  • a third step of the method of this aspect of the present invention comprises issuing a report describing the network.
  • the report may be in any form suitable for sufficiently describing a network and may include a list, a table, a graph or the like.
  • the report includes both the objects of the network (e.g., the nodes of the graph, if such is issued) and the dependency relations (e.g., the edges of the graph).
  • Cores can be constructed manually by an expert with a focused interest on a specific pathway or directly from a dataset containing collections of known pathways.
  • the present invention is aimed to add objects and dependencies to the core so that the dataset is predicted better by the extended core.
  • Figure la illustrates a core 10 having a plurality of objects 12, and a plurality of dependency relations 14 between objects 12.
  • Objects 12 are shown in Figure la as circles and dependency relations 14 as arrows interconnecting two of the circles.
  • Expanded core 20 includes additional objects 16 and dependency relations which have been added to core 10, thus providing expanded core 20.
  • an expansion of a given core leads to a major reduction in model space dimensionality and allows the apparatus to both spend more computational power on a target subsystem and to be less prone to over-fitting.
  • the core may also serve for integrating diverse sources of information into a well defined computational process.
  • the dataset from which the network is determined may represent a plurality of biological measurements and/or other information.
  • the dataset may represent a plurality of genes, proteins or protein activities, which have been detected in a plurality of experiments (e.g., by detecting specific mRNA molecules or by detecting the level and/or catalytic activity of specific proteins).
  • the dataset may represent data pertaining to levels or presence of antibodies, metabolites, reactants and products of enzymatic reactions, physiological parameters and/or biochemical parameters.
  • the dataset may represent a plurality of biological components generated and/or disappeared, such as symptoms occurring in various stages of one ore more pathologies.
  • the dataset is not limited to infra-cellular, cellular or intra-organism objects.
  • the dataset may also include data representing an evolution process among species, which is reflected by transitions, appearances and/or disappearances of biological objects.
  • the dataset may represent data relating of an epidemiological process, again, reflected by transitions, appearances and/or disappearances of biological objects.
  • the dependency relations between objects of the dataset may represent control at the RNA, protein or protein-activity levels.
  • control can take place at the initiation of transcription, provision of RNA precursors, RNA processing (including RNA splicing and RNA editing), nucleolysis of RNA molecules and the like.
  • control can take place at translation, post translational modifications, elimination of protein molecules, e.g., via ubiquitination and the like.
  • control can take place via cellular localization, interactions with other proteins, phosphorylation/dephosphorylation and proteolysis.
  • the dataset may comprise expression levels of genes.
  • the states of the dataset are the expression levels of the genes.
  • the dataset may acquire the form of a gene expression matrix, where the rows of the matrix correspond to genes and the columns correspond to different conditions or experiments.
  • Expression levels of different genes may be correlated due to mutual biological regulatory mechanisms between the corresponding genes. For example, suppose that the dataset contains information on gene G and on gene G', and by comparing their expression levels on different experiments it was found that there is a correlation between the expression levels of G and G'. Then, it may be possible that G regulates the expression of G' and vice versa. Hence, one defines a control-function based on the correlation between expression levels of the genes.
  • a control-function of a variable, V has a domain which includes the states of all its independent variables, and a range which includes the sate of V.
  • V has a domain which includes the states of all its independent variables, and a range which includes the sate of V.
  • G regulates G'
  • the expression levels of G belong to the domain while the expression levels of G' belong to the range.
  • a specific gene say, G'
  • the domain includes the expression levels of all the regulating genes.
  • the domain and the range may be defined in different ways.
  • the domain and the range include a discrete set of values, referred to herein as input-values and output-values, respectively.
  • the domain and the range include a set of distributions over the input and/or output values, representing probabilities rather than discrete values.
  • the distributions in the domain and in the range are referred to as input-distributions and output-distributions, respectively.
  • a system which is subjected to a specific experiment is generally characterized by an initial state (the state of the system before the experiment) and a final state (the state of the system after the experiment).
  • an experiment is defined by a set of input-values and a set of output-values (or by a set of input-distributions and a set of output-distributions), which respectively constitute an input-vector and an output-vector.
  • the input and output vectors are referred to herein as INP and OUT, respectively.
  • some of the objects may be artificially constrained.
  • some of the genes may be "knocked-out" or "over-expressed" prior to the experiment.
  • the mathematical description of an experiment may include an additional set, which represents the perturbed genes.
  • This set is referred to herein as PERT.
  • the set PERT is used for excluding from consideration states of those genes which have been perturbed in a specific experiment and for which no conclusion can be drawn.
  • each experiment is described by a doublet ⁇ INP, OUT ⁇ or a triplet ⁇ INP, OUT, PERT ⁇ .
  • time-series data providing expression levels at a series of n time points, yield ⁇ -1 experiments, where the vectors at time points i and i+ ⁇ form INP and OUT of the z ' th experiment, respectively.
  • the method further includes a scoring step.
  • the scoring step is for the purpose of quantifying the quality of the procedure.
  • the scoring step may be executed in more than one level, e.g., in an object level or in network level.
  • each of the additional objects 16 which have been added to core 10 may be scored so as to obtain at least one object-score
  • each of the candidate expanded cores (such as expanded core 20) may be scored so as to obtain at least one network-score.
  • the object-scores may be used as criteria for adding or not-adding objects 16 to core 10.
  • a specific object may be added to core 10 only if its object- score is above a predetermined threshold.
  • object-scores may be used to quantify the influence of a specific object, once added to core 10, on other objects already present in the expanded core (for example objects 12, see Figures la-b). This may be useful, for example, to identify a specific relation between two objects in expanded core 20.
  • object-scores may be used to quantify the influence of a specific object, once added to core 10, on other objects already present in the expanded core (for example objects 12, see Figures la-b). This may be useful, for example, to identify a specific relation between two objects in expanded core 20.
  • the network-scores are global, and as such reflect the overall goodness of the expanded core with respect to the original dataset.
  • Network-scores may be used to classify all the candidate expanded cores (such as expanded core 20) in accordance with their network-scores, say, in decreasing order from the best expanded core to the worse expanded core.
  • the user may choose the optimal expanded core for a specific pathway.
  • the object-score is calculated by a fitness function.
  • the fitness-fiinction may be any mathematical function which scores "good” expansions high and "bad” expansions low, as is further exemplified hereinbelow.
  • the fitness-function equals a consistency- function, Consist, which is calculated by counting a maximal number of entries in INP which are consistently represented in OUT.
  • the consistency-function, Consist may be better understood by considering a simple example with reference to Figures 2a-c.
  • Figure 2a shows a portion of an expanded core (e.g., expanded core 20), having three objects 32, 34 and 36, where object 36 directly depends on objects 32 and 34 via dependency relations 38 and 40.
  • the objects are represented as circles and the dependency relations are represented as arrows.
  • the portion of the core in Figure 2a represent a portion of gene network in which the genes 32 and 34 regulate together gene 36.
  • Figure 2a exemplifies a candidate core which is to be scored by the consistency function, based on the dataset.
  • FIG. 2b shows a first example of a portion of a dataset having 5 experimental results el, ..., e5, of objects 32, 34 and 36 which are represented as integers 0, 1 or 2.
  • the first two states are 0 and 1, respectively.
  • the state of object 36 in experiment e3 differs from its state in experiment 5.
  • Figure 2c shows a second example of a portion of a similar dataset having 5 experimental results of the three objects as above with an exception that the value of object 36 in experiment e5, is 1 (as opposed to 0 in the first example).
  • the contribution of this pair of experiments to value of Consist is a unity.
  • the overall value of Consist is the sum of its values on each pair of input-output values.
  • Consist measures the total level of consistency of the output object with its (possibly) regulating objects, as measured in the dataset. In one possible embodiment, it equals the maximum number of consistent experiments, each being defined such that the output-value is determined uniquely by the input- value. When Consist equals exactly the number of experiments it is said that the consistency is perfect. As stated, an experiment may include some constraints which are represented by PERT. According to a preferred embodiment of the present invention, in computing the value of Consist of an object, experiments in which that object is perturbed are excluded
  • Probabilistic dataset may be provided by defining the domain and the range of each control-function using input- and output-distributions.
  • each of the input- distributions and each of the output-distributions are defined within a probability space which is related to a direct product of a first probability space Y and a second probability space X.
  • X is used as the domain and Y is used as the range of the states, where each of the spaces X and Y is represented by a plurality of probabilities, which are preferably calculated as detailed herein.
  • a probability in Y equals a fraction of experiments in which INP equals the corresponding set of input-values.
  • a probability in X equals a fraction of experiments in which OUT equals the corresponding output-value.
  • the probability space, (YxX) n is defined by obtaining n random samples from Y and n random samples X, where each sample is interpreted as a single experiment with a value from Y as the input-value and a value from X as the output-value.
  • states from the probability spaces Y and X are used as entries in the vectors INP and OUT, respectively, and the consistency-function is calculated by counting a maximal number of entries in INP which are consistently represented in OUT.
  • over-fitting may occur in cases in which random control patterns yield a score which is as high as the real control function score.
  • the present invention successfully addresses this problem by providing a probability function which gives information regarding the specificity of a speculated regulation pattern.
  • the fitness-function itself has a probabilistic property.
  • the fitness-function is called regulation-specificity and denoted herein rSpec.
  • rSpec equals a probability that the value of the consistency-function of core 20 is larger than- or equal to- the value of the consistency-function of core 10.
  • rSpec may be calculated by the following approximation which is linear in the number of experiments, hence more efficient in terms of computation time.
  • rSpec equals to a probability for obtaining, for each experiment, a consistency which is larger than or equal to k in a dataset with the n inputs from the INP and outputs which are randomly sampled from X.
  • rSpec is the probability of getting «' identical values out of n h z-1,...,/, samples from X where ⁇ «' > k.
  • the fitness-function may also be calculated based on a permutation sensitivity analysis of the observed consistency pattern.
  • the probability spaces X and Y which were defined for rSpec, may also be used in the present embodiment.
  • the measured Y values are fixed while the measured X values are randomly permutated.
  • a probability of obtaining the consistency is calculated.
  • the fitness-function may be calculated by fixing the domain and applying random permutation only on the range of the states.
  • the fitness-function may be a mutual information function M(X,Y).
  • the mutual information function is known in the art. To this end see, e.g., a book by T.M. Cover and J.M. Thomas, entitled “Elements of Information Theory", published by John Wiley & Sons (1991), the contents of which are hereby incorporated by reference.
  • the mutual information function is defined for discrete distributions, it may be generalized to a probability space similarly to the above procedure.
  • the mutual information of a regulating set S, a variable v and a set E of experiments equals M(X E S , X E ⁇ V ⁇ )-
  • the fitness- function may be a Bayesian local score function.
  • Bayesian local score functions are known in the art, and disclosed, e.g., in N. Friedman, et. al., "Using Bayesian networks to analyze expression data", Proceedings of the Fifth Annual International Conference on Computational Molecular Biology 127-135 (2000), the contents of which are hereby incorporated by reference.
  • the fitness function may be a Dirichlet equivalent score function, which is further detailed in N. Friedman, et. al.
  • the object-score which is preferably calculated by one of the fitness-functions detailed above, has a local nature.
  • a global network-score is calculated by the following steps. In a first step, given the objects of core 20 as possible regulators, the fitness function is calculated for each object of core 10, thereby obtaining at least one fitness-function-value.
  • a second step includes averaging the fitness-function-values over the objects of core 10.
  • the present invention computationally infers biological pathways by finding an expanded core that fits the experimental data best.
  • core 10 of Figure lb represents existing biological knowledge therefore it allows an improved specificity of expanded core 20, compared to the quality of some de-novo reconstruction of a network.
  • the object- and network-scores may be used to generate a report describing the network.
  • the report may be in any form suitable for sufficiently describing a network, e.g., any combination of a list, a table and/or a graph.
  • the report may include a list of all candidate networks, sorted in accordance to their individual network-scores.
  • one or more object-scores may be presented, preferably in a decreasing order, so as to report on the relative or absolute contribution of the corresponding object to the overall network-score.
  • the user may decide whether or not to accept the corresponding object to the network.
  • the report may include the effect of a specific object on other objects.
  • the report may include, for a specific gene, the level of regulation of the gene on other genes, so that the user may study a particular regulation mechanism between two genes.
  • the table preferably comprises a plurality of elements each representing a dependency relation between two objects.
  • the graph preferably includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects.
  • the graph provides a visual description of both the objects of core 20 and the dependency relations between different objects in core 20.
  • apparatus 50 comprises an input unit 52, a core expander 54 and an output unit 56.
  • input unit 52 serves for inputting a core biological network which is similar to core 10, as detailed hereinabove.
  • Core expander 54 serves for expanding the core by adding to the core at least one additional object of the dataset.
  • core expander 54 provides an expanded core, which may be, for example, core 20.
  • Output unit 56 serves for issuing a report describing the network of biological relationships.
  • the report, generated by output unit 56 may have any form, preferably one of the forms described above.
  • yeast transcription datasets using the ergosterol pathways as a core The method has been tested on yeast transcription datasets using the ergosterol pathways as a core.
  • the study was focused on yeast, which has the largest publicly available gene expression datasets.
  • the variable set consisted of 6200 yeast open reading frames (ORFs).
  • Transcription profiles were taken from two large scale yeast cDNA arrays experiments: (1) 260 selected knockout experiments performed -by Hughes et. al. [T. R. Hughes et. al., "Functional discovery via a compendium of expression profiles", Cell, 102:109-26, (2000)]; and (2) 100 experiments testing yeast behavior in stressful conditions performed by Gasch et. al. [A. P. Gasch el. al., "Genomic expression programs in the response of yeast cells to environmental changes", Mol. Biol. Cell, 11 :4241-57 (2000)].
  • the platform includes engines for representing networks and computing fitness, a flexible expansion algorithm, viewers for visualization of biological data sets, an application to enable interactive usage of the viewers and engine and an internal database scheme for the storage of datasets and pathways.
  • the GENESYS environment is built as a modular prototype composed of back-end and front-end implemented in C++ and Perl/Tk.
  • the front-end is a
  • GUI application featuring a wide selection of viewers and providing means to invoke computational processes using the back-end.
  • the back-end provides a set of libraries and computational engines using them.
  • GENESYS front-end The basic features of GENESYS front-end are described herein with reference to the accompanying computer screen snapshots.
  • Figure 4 shows a dependency structure editor.
  • the editor enables creation, viewing and saving of network cores, manipulation of nodes and arcs in the graph and it also provides connectivity to the logic viewer described below.
  • Figure 5 shows a variable list window which is a list widget providing access to the all the objects with all the information attached to it.
  • FIG. 6 shows a logic viewer which presents a binary decision diagram (BDD) given a focus variable and its current incoming neighbors.
  • a BDD represents the effect of inputs values on output value of the focus variable and provides insight into the structure of the discrete logic induced by a given dependency structure.
  • To create a BDD a layered tree is constructed where each level represents the addition of a new input variable. The leafs of the tree represent groups of experiments in which the input added up to the leaf level takes the same value. A parent child relation is added whenever the input assignment of level / is a subset of the assignment of the bigger input set at level /+1. In the final tree level, the output value is introduced to reveal the relation between input assignment and output value.
  • users can explore the logic underlying a putative dependency structure and deduce functional roles each input may have (activator/repressor or more complicated functions).
  • Figure 7 shows an interaction window which provides putative interactions from the GENESYS database.
  • the window lets the user navigate in the interaction graph by showing each time the neighborhood of a focus variable which may be selected interactively.
  • different edge colors are used for different types of interaction.
  • Figure 8 shows a one-expansion screen result window which lists the expansion candidates of a given core after the completion of a screen. The user can sort the list according to various parameters and evaluate the affinity of each of the expansion object to different parts of the core.
  • Ergosterol Metabolism Ergosterol is an essential lipid in yeast which is to the equivalent of cholesterol in mammals. Ergosterol's primary role is in the cell membranes but it is also involved in aerobic metabolism, sterol uptake and sterol transport. Ergosterol metabolism is understood rather well. Ergosterol metabolism is composed of two pathways in series. The first, the mevalonate pathway, transforms acetyl-CoA to farnesyl and provides essential components for few important metabolic pathways (e.g., heme and quinines). The latter pathway transforms farnesyl to ergosterol. Much of the regulation of ergosterol is believed to be transcriptionally mediated, but the actual details are known only in part [G. F. Bammert and J. M.
  • Figure 9 shows the basic known ergosterol metabolic pathway from farnesyl to ergosterol, including a series of 11 enzymes and three transcription factors. It is important to stress here the difference between metabolic pathways and regulatory networks: The fact that two enzymes follow each other in a biochemical process does not mean their transcription regulation is directly connected.
  • the ergosterol dependency structure core was modeled as a set of variables, with dependencies marked only between known transcription factors and their targets. In other words, no dependency was prescribed between enzymes.
  • Results are summarized in Table 1 below, showing a fitness gain of each of the putative TFs, ranked against a "naked" core consisting of the eleven ERG enzymes with no dependencies among them.
  • HAPl was ranked second out of 130, in agreement with the known role of HAPl in ERG11 regulation.
  • TUP1 is a general repressor and was thus ranked lower, ROXl was less expressed in the data and was ranked much lower.
  • the ERGl l regulation has been tested by applying a single node expansion to a core consisting of the eleven ERG enzymes as well as HAPl and ROXl as regulators of ERGl l.
  • the algorithm measured the improvement in fitness contributed by each of the 130 TFs, and an uncharacterized gene was ranked first. That gene improves the fitness of ERG1 (and others). Remarkably, it also has a good homology to HAPl (33 % identity, 50 % similarity along 100 amino acids and even better in a shorter range).
  • EXAMPLE 2 Screening All Genes In the present example, all ⁇ 6200 yeast ORFs were screened against the ergosterol core. This type of analysis may discover more general patterns of regulation that cannot be directly tagged as "A is a factor of B". Still, as shown below, some interesting biology may be learned from it. The results of such a screen are given in Table 2 below.
  • the two top ranking genes, POS5, YBR043C, are both of unknown function.
  • POS5 has homology to iron metabolism enzymes. Both present significant fitness gain for ERG4 regulation.
  • ERG4 is the last of the ergosterol pathway enzymes, is not essential and little is known on its regulation.
  • FIG. 11 showing a Venn-like diagram of the ERG4 dependent genes.
  • the diagram represents all experiments in which ERG4, POS5 and YBR043C were induced.
  • the size of each set is indicated by a number.
  • induction POS5 and YBR043C strongly correlates with of ERG4 induction (11/12 experiments in both cases).
  • ERG4 is showing a second, separate regulation pattern (5 experiments) which is unrelated to POS5 and YBR043C. It would be appreciated that using standard clustering or similarity, the behavior of ERG4 in experiments with no POS5, YBR043C involvement would have masked the pattern identified in the present example.
  • the fourth gene in the screen list is INOl which is involved in inositol biogenesis.
  • Inositol has a regulatory function in the pospholipid pathway (adjacent to ergosterol). Note that the dependency is localized differently (improving different variables) in that case.
  • the relation of GAS 1 to ergosterol might be rooted in its function in the cell wall.
  • the dependency between our core and MKK2 is very reasonable considering its function in the signaling pathway to the cell wall protein PCKl.
  • the 11th gene in the list is ERG10, which is the first gene in the mevalonate pathway leading to our core.

Abstract

A method of determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the method is executed by a data processor. The method comprising the following method steps in which a first step comprises inputting a core biological network. The core biological network comprises a portion of the plurality of objects of the dataset, and N dependency relations between pairs of objects of the portion of the dataset, wherein N is a nonnegative number. A second step comprises optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network. A third step comprises issuing a report describing the network of biological relationships.

Description

METHOD OF EXPANDING A BIOLOGICAL NETWORK
FIELD AND BACKGROUND OF THE INVENTION
The present invention relates to biological networks and, more particularly, to an efficient and effective method of expanding biological networks.
Many biological functions are accomplished by altering the expression of various genes at the level of RNA (e.g., through control of initiation of transcription, provision of RNA precursors, RNA processing, and nucleolysis of RNA molecules), protein (e.g., through control of translation, post translational modifications and elimination of protein molecules) or protein-activity (e.g., through control of cellular location, interactions with other proteins, further post translational modifications and proteolysis). For example, fundamental biological processes such as cell cycle regulation, cell differentiation and regulated cell death (also known as apoptosis), are characterized by variations in the expression levels of groups of genes.
Gene expression is also associated with pathogenesis. For example, the lack of sufficient expression of functional tumor suppressor genes could lead to tumorgenesis. Similarly, over expression of oncogenes and metastatic genes could lead to tumorgenesis and metastasis. Thus, changes in the level of expression of particular genes serve as signposts for the presence and progression of various diseases.
The study of gene expression has been generally concentrated on the regulatory regions of the genes of interest and on the relationships among different genes. A limited number of transcriptional factors/DNA binding proteins have so far been identified and a limited number of regulatory pathways have been discovered.
However, the expression of a particular gene is frequently regulated by the expression of a number of other genes. The expression of those regulatory genes may also be under the control of additional genes. The regulatory relationships among genes (or other biological variables, such as proteins) form a biological network, composed of biological objects and relations among them. The biological network is an extremely complex entity, having numerous levels of redundancies on one hand and composed of multi-function factors on the other hand.
It is a common knowledge that the information for the construction and maintenance of all molecular and cellular signaling processes is stored in the genome, whereby the DNA sequence codes for the structure and molecular dynamics of proteins, which in turn determine biochemical recognition. Hence, the information stored in the DNA is responsible to a large extent to the dynamics of a genetic network.
Micro-array technology is one of several developing approaches to comparatively analyze genome-wide patterns of mRNA and protein expression. For mRNA expression studies, the ultimate goal is to develop arrays which contain every gene in a genome against which mRNA expression levels can be quantitatively assessed. Nucleic acid micro-arrays contain a wide sampling, or array, of cDNAs or oligonucleotides representing genes of a single species, such as human, on a slide. Such arrays can be used to determine the levels of mRNA species derived from a cell or tissue of interest tagged with a fluorescent label, and allowed to bind to the array of cDNAs or oligonucleotides corresponding to the different genes. The arrays are put into a scanning microscope that can measure the brightness of each fluorescent dot thus generated, which brightness reveals the amount or level of a respective mRNA. Antibody micro-arrays may include antibodies recognizing prelabeled proteins in a sample in a similar manner. Similarly, antigen micro-arrays can be used to determine the level of different antibodies in a sample.
Advances in genomics and proteomics, specifically in the development of the micro-array high throughput technologies, require new bioinformatics tools for manipulation, analysis and processing of the data. Of particular interest to those in the bioinformatics area are systems for identifying the biological functions of genes based on their temporal (or perturbed) pattern of expression.
One system, known as clustering analysis, clusters genes according to the shape similarity of their pattern of expression, with clusters related to specific biological functions. This approach has been applied, e.g., to identify genes involved in a metabolic shift from the yeast genome [DeRisi et al, Science 278:680-686 (1997)]. Although clustering is a useful way to identify common data patterns, it is a rather crude method, as it is based on pairwise comparisons, and as such it is only a first step of analyzing the data. Based on understanding of the biological regulatory mechanisms and on theoretical examination of the evolutionary implications of the system as a whole, researchers have constructed different mathematical models to describe the behavior of biological systems [A.Arkin, J.Ross et. al, Genetics, 149:1275- 1279 (1998); T.Chen et. al, Proceedings of the Third Annual International Conference on Computational Molecular Biology, 94-103 (1999); T.Chen et. al, Proceedings of the 1999 Pacific Symposium in Biocomputing, 29-40 (1999); D.Dhaeseleer et. al, Proceedings of the 1999 Pacific Symposium in Biocomputing, 41-52 (1999)].
Also of prior art interest is the approach of reverse engineering which, based on a mathematical model, attempts to mimic a genetic network as a simple Boolean network [S.Liang et. al, Proceedings of the 1998 Pacific Symposium in Biocomputing, 18-20 (1998)]. The reverse engineering approach can potentially systematically decipher the complex circuitry of the genetic network from the temporal gene expression pattern. Yet of another prior art interest is a probabilistic approach to handle biological datasets [N. Friedman, et. al, "Using Bayesian networks to analyze expression data", published in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, pages 127-135 (2000), the contents of which are hereby incorporated by reference]. In this approach a Bayesian genetic network learning framework is used to improve the validity of results obtained by network analysis.
However, the discipline of genetic network analysis has not become yet a practical aid to the biologist. The main reason being the complexity of a massively parallel, redundant and "evolution made" system. Compared to the numerous degrees of freedom in the space of possible molecular networks, even datasets with hundreds of well planned distinct cellular profiles cannot directly identify the perfect model out of many other possible alternatives. Another major source of difficulties is the noise inherent in high throughput experiments and the need to either give away much of the data or tolerate significant levels of "false positives". It has been found by theoretical studies that, without additional assumptions, the mathematical problem of inferring all but tiny genetic networks from experiments is impractical, since the number of experiments that would have to be performed in the worst case is out of reach. This difficulty may be phrased "experimental complexity".
Even when the models assumed are simple Boolean networks, although strong assumptions on the data reduce experimental complexity (e.g., random distribution in the attractor space), these assumptions do not hold for data gathered today. One still cannot expect enough data to support current reconstruction approaches in the foreseeable future.
Deeper inference of relations at a higher level of complexity is called for, and is done mainly manually nowadays. The research on genetic networks is presently trying to shape new methodologies that will enable inference of more complex relations from the data presently available. There is thus a widely recognized need for, and it would be highly advantageous to have a framework for analysis of genetic networks and hypothesis generation, devoid of the above limitations. SUMMARY OF THE INVENTION
According to one aspect of the present invention there is provided a method of determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the method is executed by a data processor and comprising: (a) inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) issuing a report describing the network of biological relationships.
According to further features in preferred embodiments of the invention described below, the at least one additional object has at least one dependency relation to at least one existing object of the core biological network.
According to still further features in the described preferred embodiments the at least one additional object comprises at least one pair having a dependency relation therebetween.
According to still further features in the described preferred embodiments the dataset represents a plurality of experiments, each experiment of the plurality of experiments being characterized by a plurality of states, one state for each object.
According to still further features in the described preferred embodiments at least a portion of the plurality of states has a probabilistic property. According to still further features in the described preferred embodiments the probabilistic property is defined by a plurality of distributions each of the distributions being defined within a probability space, the probability space being represented by a plurality of probabilities, each of the plurality of probabilities corresponding to one state. According to still further features in the described preferred embodiments each probability of the plurality of probabilities equals a fraction of a class of experiments characterized by a predetermined rule.
According to still further features in the described preferred embodiments the predetermined rule is such that a particular state of a particular object has a predetermined value.
According to still further features in the described preferred embodiments the predetermined rule is such that each state of a portion of the plurality of states which respectively corresponds to a portion of the plurality of objects, has a value which is selected from a predetermined set of values.
According to still further features in the described preferred embodiments the method further comprising repeating step (b) at least once, thereby obtaining a plurality of expanded core biological networks.
According to still further features in the described preferred embodiments the method further comprising redefining the expanded core biological network to be the core biological network, prior to step (b).
According to still further features in the described preferred embodiments the optimally expanding is by defining at least one fitness function.
According to still further features in the described preferred embodiments the fitness-function is selected so as to minimize over-fitting of the network of biological relationships to the dataset.
According to still further features in the described preferred embodiments the at least one additional object of step (b) is selected so as to maximize the fitness-function. According to still further features in the described preferred embodiments the optimally expanding is by scoring the expanded core biological network, thereby obtaining a network-score.
According to still further features in the described preferred embodiments the scoring is by: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network; thereby obtaining the network-score.
According to still further features in the described preferred embodiments the method further comprising scoring each object of the at least one additional object, thereby obtaining at least one object-score, and sorting the additional objects in accordance with the object-scores.
According to still further features in the described preferred embodiments the method further comprising scoring each of the plurality of expanded core biological networks, thereby obtaining a plurality of network-scores, and sorting the plurality of expanded core biological networks in accordance with the network-scores.
According to still further features in the described preferred embodiments the scoring each of the plurality of expanded core biological networks is by: for each of the plurality of expanded core biological networks: (i) for each object of the core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and (ii) averaging the at least one fitness-function-value over the core biological network, thereby obtaining a network-score; thereby obtaining the plurality of network-scores.
According to still further features in the described preferred embodiments the method further comprising, for each object, A, having a dependency relation to at least one object B: (i) defining, for each experiment of the plurality of experiments, an ordered set of states, the ordered set of states comprises one state of object, A, and one state for each of the at least one object, B, the states corresponding to the experiment, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each experiment; and (ii) for each pair of experiments determining whether respective ordered sets of states are identical, and if so than defining each experiment of the pair of experiments to be a consistent experiment; thereby obtaining, for each object having a dependency relation to at least one object, a set of consistent experiments.
According to still further features in the described preferred embodiments the method further comprising defining, for each object, an object consistency- function, the object consistency-function equals a maximal number of consistent experiments corresponding to the object, thereby providing a plurality of object consistency-functions .
According to still further features in the described preferred embodiments the fitness-function equals the object consistency-function. According to still further features in the described preferred embodiments the method further comprising: summing all the object consistency-functions of all the objects in the core biological network, thereby providing a core consistency-function; and summing all the object consistency-functions of all the objects in the expanded core biological network, thereby providing an expanded core consistency-function.
According to still further features in the described preferred embodiments the fitness-function equals a regulation-specificity, the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function. According to still further features in the described preferred embodiments the method further comprising calculating the regulation-specificity for each of the at least one additional object, thereby providing at least one regulation- specificity.
According to still further features in the described preferred embodiments the dataset comprises at least one perturbed state, each of the least one perturbed state corresponds to a constraint within the dataset.
According to still further features in the described preferred embodiments the method further comprising excluding each of the at least one perturbed state from the object consistency function. According to another aspect of the present invention there is provided an apparatus for determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the apparatus comprising: (a) an input unit for inputting a core biological network which comprises a portion of the plurality of objects of the dataset, and N dependency relations between i pairs of objects of the portion of the dataset, wherein N is a nonnegative number; (b) a core expander for optimally expanding the core biological network by adding to the core biological network at least one additional object of the plurality of objects of the dataset, so as to obtain an expanded core biological network; and (c) an output unit for issuing a report describing the network of biological relationships.
According to further features in preferred embodiments of the invention described below, the core expander is operable to expand the core biological network a plurality of times, thereby to obtain a plurality of expanded core biological networks.
According to still further features in the described preferred embodiments the core expander is operable to redefine the expanded core biological network to be the core biological network.
According to still further features in the described preferred embodiments the core expander comprises a fitness-function unit, operating at least one fitness function.
According to still further features in the described preferred embodiments the fitness-function unit operable to select the fitness-function so as to minimize over-fitting of the network of biological relationships to the dataset. According to still further features in the described preferred embodiments the core expander is operable to select the at least one additional object so that the fitness-function is maximized.
According to still further features in the described preferred embodiments the apparatus further comprising a scorer, for scoring each object, thereby to obtain a plurality of object-scores and for summing at least a portion of the plurality of object-scores corresponding to objects of the expanded core biological network thereby to obtain a network-score.
According to still further features in the described preferred embodiments the scorer comprising: (i) electronic-calculating functionality for calculating a fitness-function for each object of the core biological network, so as to obtain a fitness-function-value characterizing a fit of the at least one additional object to the dataset, thereby to obtain at least one fitness-function-value; and (ii) electronic-calculating functionality for averaging the at least one fitness- function-value over the core biological network; thereby obtaining the network- score.
According to still further features in the described preferred embodiments the apparatus further comprising a consistency function unit for obtaining, for each object, an object consistency-function, wherein the consistency function unit comprising: (i) electronic-calculating functionality for defining, for each object, A, having a dependency relation to at least one object B, and for each experiment of the plurality of experiment, an ordered set of states, the ordered set of states comprises states of the object, A, and of each of the at least one other object, B, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each object, A, and for each experiment; and (ii) electronic-calculating functionality for determining, for each pair of experiments, whether respective sets of states are identical, and if so than to define the pair of experiments to be a pair of consistent experiments; and (iii) electronic-calculating functionality for determining for each object, a maximal number of consistent experiments corresponding to the object, and for setting the object consistency-function to be the maximal number of consistent experiments.
According to still further features in the described preferred embodiments the fitness-function unit comprising: (i) electronic-calculating functionality for summing all the object consistency-functions of all the objects in the core biological network, thereby to provide a core consistency-function; and (ii) electronic-calculating functionality for summing all the object consistency- functions of all the objects in the expanded core biological network, thereby to provide an expanded core consistency-function.
According to still further features in the described preferred embodiments the fitness-function unit operable to set fitness-function to be a regulation- specificity, the regulation-specificity equals a probability of obtaining the expanded core consistency-function which is larger than- or equal to- the core consistency-function.
According to still further features in the described preferred embodiments the fitness-function unit operable to calculate the regulation-specificity for each of the at least one additional object, thereby to provide at least one regulation- specificity.
According to still further features in the described preferred embodiments the objects are selected from the group consisting of expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters and epidemiological parameters.
According to still further features in the described preferred embodiments at least one of the dependency relations represents an interaction law between to objects. According to still further features in the described preferred embodiments at least one of the dependency relations represents a control relation selected from the group consisting of control relation at a level of mRNA, protein and protein-activity.
According to still further features in the described preferred embodiments the core biological network represents a transcriptional network.
According to still further features in the described preferred embodiments the core biological network represents protein network having relations among hetrodimers, kinases and ubiqutines. According to still further features in the described preferred embodiments the core biological network represents genes encoding proteins participating in at least one biochemical pathway.
According to still further features in the described preferred embodiments the core biological network represents biological interactions among any combination of objects selected from the group consisting of genes, proteins, chemicals and external signals.
According to still further features in the described preferred embodiments the fitness-function equals a mutual information function. According to still further features in the described preferred embodiments the fitness-function equals a Bayesian local score function.
According to still further features in the described preferred embodiments the Bayesian local score function is a Dirichlet equivalent score function.
According to still further features in the described preferred embodiments the report is selected from the group consisting of a graph, a table and a list.
According to still further features in the described preferred embodiments the table comprises a plurality of elements each representing a dependency relation between two objects.
According to still further features in the described preferred embodiments the graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
According to still further features in the described preferred embodiments the report includes the network-score.
According to still further features in the described preferred embodiments the report includes the at least one object-score.
According to still further features in the described preferred embodiments the report includes at least one of the plurality of network-scores.
According to still further features in the described preferred embodiments the report comprises at least one pair of objects having a dependency relation therebetween. According to still further features in the described preferred embodiments the report includes at least one of the plurality of object consistency-functions.
According to still further features in the described preferred embodiments the report includes at least one of the at least one regulation-specificity. The present invention successfully addresses the shortcomings of the presently known configurations by providing a method and apparatus for determining a network of biological relationships far exceeding prior art.
BRIEF DESCRIPTION OF THE DRAWINGS The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
In the drawings:
FIG. la represents a core biological network, according to the present invention;
FIG. lb represents an expanded core biological network, according to the present invention;
FIG. 2a represents a portion of an expanded core biological network, according to the present invention;
FIG. 2b represents a first example of a portion of a dataset, according to the present invention; FIG. 2c represents a second example of a portion of a dataset, according to the present invention;
FIG. 3 shows an apparatus for determining a network of biological relationships from a dataset, according to the present invention; FIG. 4 shows a dependency structure editor, according to the present invention;
FIG. 5 shows a variable list window, according to the present invention;
FIG. 6 shows a logic viewer, according to the present invention;
FIG. 7 shows an interaction window, according to the present invention; FIG. 8 shows a one-expansion screen result window, according to the present invention;
FIG. 9 represents a basic ergosterol metabolic pathway from farnesyl to ergosterol;
FIG. 10 shows an analysis of the promoter region of ERG 11 according to prior art; and -
FIG. 11 shows a Venn-like diagram of the ERG4 dependent genes;
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is of a method and apparatus for determining a network of biological relationships which can be used to construct a network of interactions between various biological objects. Specifically, the present invention can be used to construct a network of interactions, in the sub-cellular, cellular, organism or population level, among expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters, or among epidemiological parameters and the like. The principles and operation of the present invention may be better understood with reference to the drawings and accompanying descriptions.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting. The present invention is primarily directed at a network describing a dataset of biological objects, e.g., expressed genes or proteins, each of which is represented by a plurality of states. The states may correspond, for example, to different expression levels of a gene or a protein, or to different condition of a protein: present, absent, phosphorylated etc. According to one aspect of the present invention there is provided a method of determining a network of biological relationships from a dataset. The method according to this aspect of the present invention is executed by a data processor and is effected by implementing the following method steps, in which, in a first step, a core biological network is inputted. As is further detailed hereinunder, the core biological network consists of a portion of the biological objects and dependency relations between the objects.
As used herein, the term network refers to a network of biological relationships and the term core refers to a core network.
In a second step of the method according to this aspect of the present invention, the core is expanded by adding thereto at least one additional object of the dataset and at least one additional dependency relation to at least one existing object of the core.
According to a preferred embodiment of the present invention the second step may be executed more than once, so as to obtain at least one candidate expanded core.
A third step of the method of this aspect of the present invention comprises issuing a report describing the network. As is further described hereinunder, the report may be in any form suitable for sufficiently describing a network and may include a list, a table, a graph or the like. Preferably, the report includes both the objects of the network (e.g., the nodes of the graph, if such is issued) and the dependency relations (e.g., the edges of the graph).
Cores can be constructed manually by an expert with a focused interest on a specific pathway or directly from a dataset containing collections of known pathways. The present invention is aimed to add objects and dependencies to the core so that the dataset is predicted better by the extended core.
Referring now to the drawings, Figure la illustrates a core 10 having a plurality of objects 12, and a plurality of dependency relations 14 between objects 12. Objects 12 are shown in Figure la as circles and dependency relations 14 as arrows interconnecting two of the circles.
Reference is now made to Figure lb which illustrates an expanded core 20. Expanded core 20 includes additional objects 16 and dependency relations which have been added to core 10, thus providing expanded core 20.
Before providing a further detailed description of the method and apparatus for determining a network of biological relationships, as delineated hereinabove and in accordance with the present invention, attention will be given to the advantages and potential applications offered thereby.
Thus, in sharp distinction to the prior art, an expansion of a given core leads to a major reduction in model space dimensionality and allows the apparatus to both spend more computational power on a target subsystem and to be less prone to over-fitting. Additionally, the core may also serve for integrating diverse sources of information into a well defined computational process.
The dataset from which the network is determined may represent a plurality of biological measurements and/or other information. Thus, for example, the dataset may represent a plurality of genes, proteins or protein activities, which have been detected in a plurality of experiments (e.g., by detecting specific mRNA molecules or by detecting the level and/or catalytic activity of specific proteins). Alternatively, the dataset may represent data pertaining to levels or presence of antibodies, metabolites, reactants and products of enzymatic reactions, physiological parameters and/or biochemical parameters. Still alternatively, the dataset may represent a plurality of biological components generated and/or disappeared, such as symptoms occurring in various stages of one ore more pathologies.
It would be appreciated that the dataset is not limited to infra-cellular, cellular or intra-organism objects. Hence, the dataset may also include data representing an evolution process among species, which is reflected by transitions, appearances and/or disappearances of biological objects. Still alternatively, the dataset may represent data relating of an epidemiological process, again, reflected by transitions, appearances and/or disappearances of biological objects.
The dependency relations between objects of the dataset may represent control at the RNA, protein or protein-activity levels. At the level of RNA, control can take place at the initiation of transcription, provision of RNA precursors, RNA processing (including RNA splicing and RNA editing), nucleolysis of RNA molecules and the like. At the protein level, control can take place at translation, post translational modifications, elimination of protein molecules, e.g., via ubiquitination and the like. At the protein-activity level, control can take place via cellular localization, interactions with other proteins, phosphorylation/dephosphorylation and proteolysis.
As a specific, non limiting example, the dataset may comprise expression levels of genes. In this case the states of the dataset are the expression levels of the genes. Hence, the dataset may acquire the form of a gene expression matrix, where the rows of the matrix correspond to genes and the columns correspond to different conditions or experiments. Expression levels of different genes may be correlated due to mutual biological regulatory mechanisms between the corresponding genes. For example, suppose that the dataset contains information on gene G and on gene G', and by comparing their expression levels on different experiments it was found that there is a correlation between the expression levels of G and G'. Then, it may be possible that G regulates the expression of G' and vice versa. Hence, one defines a control-function based on the correlation between expression levels of the genes. Generally, a control-function of a variable, V, has a domain which includes the states of all its independent variables, and a range which includes the sate of V. In the above example, if G regulates G' then the expression levels of G belong to the domain while the expression levels of G' belong to the range. There are cases, however, where a specific gene (say, G') is regulated by more than one gene. Thus, the domain includes the expression levels of all the regulating genes.
It should be understood that the phrase control-function is chosen for illustrative purposes and these functions may be applied to any dataset as detailed above. As further detailed hereinafter, the domain and the range may be defined in different ways. In one embodiment, the domain and the range include a discrete set of values, referred to herein as input-values and output-values, respectively. In another embodiment the domain and the range include a set of distributions over the input and/or output values, representing probabilities rather than discrete values. In this embodiment, the distributions in the domain and in the range are referred to as input-distributions and output-distributions, respectively.
A system which is subjected to a specific experiment is generally characterized by an initial state (the state of the system before the experiment) and a final state (the state of the system after the experiment). Mathematically, an experiment is defined by a set of input-values and a set of output-values (or by a set of input-distributions and a set of output-distributions), which respectively constitute an input-vector and an output-vector. The input and output vectors are referred to herein as INP and OUT, respectively. In a specific experiment some of the objects may be artificially constrained. For example in cases in which the objects are genes, some of the genes may be "knocked-out" or "over-expressed" prior to the experiment. Hence, according to a preferred embodiment of the present invention, the mathematical description of an experiment may include an additional set, which represents the perturbed genes. This set is referred to herein as PERT. The set PERT is used for excluding from consideration states of those genes which have been perturbed in a specific experiment and for which no conclusion can be drawn. Hence, each experiment is described by a doublet {INP, OUT} or a triplet {INP, OUT, PERT}. Specifically, time-series data, providing expression levels at a series of n time points, yield π-1 experiments, where the vectors at time points i and i+\ form INP and OUT of the z'th experiment, respectively.
An experiment in which INP = OUT, is called a steady state experiment. Datasets are often either time series of samples along some synchronized biological process, or a single sample from a cell culture under some condition. Steady state experiments might contain an averaging of an underlying temporal process and so modeling them correctly entails a less detailed representation of the biological system. Mathematically, for steady state data, one must exclude models with variables regulating themselves in order to avoid the trivial self- regulation solution.
According to a preferred embodiment of the present invention, the method further includes a scoring step. The scoring step is for the purpose of quantifying the quality of the procedure. The scoring step may be executed in more than one level, e.g., in an object level or in network level. Hence, in Figure lb, each of the additional objects 16 which have been added to core 10 may be scored so as to obtain at least one object-score, and each of the candidate expanded cores (such as expanded core 20) may be scored so as to obtain at least one network-score. According to a presently preferred embodiment of the invention, the object-scores may be used as criteria for adding or not-adding objects 16 to core 10. For example, a specific object may be added to core 10 only if its object- score is above a predetermined threshold. In addition, object-scores may be used to quantify the influence of a specific object, once added to core 10, on other objects already present in the expanded core (for example objects 12, see Figures la-b). This may be useful, for example, to identify a specific relation between two objects in expanded core 20. As further exemplified in the example section below, it has been found by the inventors of the present invention, that in some applications a specific object has a high object-score solely due to a strong relation it has to a single object in the network.
Whereas the object-score has a local nature, the network-scores are global, and as such reflect the overall goodness of the expanded core with respect to the original dataset. Network-scores may be used to classify all the candidate expanded cores (such as expanded core 20) in accordance with their network-scores, say, in decreasing order from the best expanded core to the worse expanded core. Based on the network-scores and on the relations between objects within each expanded core (expressed via objects-scores), the user may choose the optimal expanded core for a specific pathway. According to a preferred embodiment of the present invention, the object-score is calculated by a fitness function. The fitness-fiinction may be any mathematical function which scores "good" expansions high and "bad" expansions low, as is further exemplified hereinbelow.
Hence, in one embodiment, the fitness-function equals a consistency- function, Consist, which is calculated by counting a maximal number of entries in INP which are consistently represented in OUT. The consistency-function, Consist, may be better understood by considering a simple example with reference to Figures 2a-c. Figure 2a shows a portion of an expanded core (e.g., expanded core 20), having three objects 32, 34 and 36, where object 36 directly depends on objects 32 and 34 via dependency relations 38 and 40. As above, the objects are represented as circles and the dependency relations are represented as arrows. For example, in a case in which objects 32, 34 and 36 are genes, the portion of the core in Figure 2a represent a portion of gene network in which the genes 32 and 34 regulate together gene 36. Hence, Figure 2a exemplifies a candidate core which is to be scored by the consistency function, based on the dataset.
In each experiment in the dataset, the states of objects 32, 34 and 36 define an ordered set of states. Figure 2b shows a first example of a portion of a dataset having 5 experimental results el, ..., e5, of objects 32, 34 and 36 which are represented as integers 0, 1 or 2. The corresponding order sets of the five experiments are el={0,0,0}, e2={l,2,l}, e3={0,l,l}, e4={2,2,2} and e5={ 0,1,0}. Referring to experiments e5 and e3, in both experiments the first two states (the expression levels of objects 32 and 34) are 0 and 1, respectively. However, the state of object 36 in experiment e3 differs from its state in experiment 5.
In other words, a control-function having the values of objects 32 and 34 in its domain and the value of object 36 in its range is inconsistent on experiments e3 and e5.
According to a preferred embodiment of the present invention for such cases the contribution of this pair of experiments to the value of Consist is zero. Figure 2c shows a second example of a portion of a similar dataset having 5 experimental results of the three objects as above with an exception that the value of object 36 in experiment e5, is 1 (as opposed to 0 in the first example). Unlike the first example, in the dataset of Figure 2c, there is perfect consistency between the ordered sets of experiments e3 and e5 (e3=e5={0,l,l }). For such cases, the contribution of this pair of experiments to value of Consist is a unity. As can be understood all other pairs of input values appear only once and therefore all other ordered sets are trivially consistent. According to a preferred embodiment of the present invention the overall value of Consist is the sum of its values on each pair of input-output values.
Hence, Consist measures the total level of consistency of the output object with its (possibly) regulating objects, as measured in the dataset. In one possible embodiment, it equals the maximum number of consistent experiments, each being defined such that the output-value is determined uniquely by the input- value. When Consist equals exactly the number of experiments it is said that the consistency is perfect. As stated, an experiment may include some constraints which are represented by PERT. According to a preferred embodiment of the present invention, in computing the value of Consist of an object, experiments in which that object is perturbed are excluded
The consistency-function described above provides a sufficient tool for the scoring step. It would be of utmost advantage however, to have a fitness- function which would handle probabilistic input data. While reducing the present invention to practice it has been uncovered that the utility of the invention may be significantly enhanced by providing a fitness-function handling probabilistic datasets. Probabilistic dataset may be provided by defining the domain and the range of each control-function using input- and output-distributions. According to a preferred embodiment of the present invention, each of the input- distributions and each of the output-distributions are defined within a probability space which is related to a direct product of a first probability space Y and a second probability space X. Specifically, X is used as the domain and Y is used as the range of the states, where each of the spaces X and Y is represented by a plurality of probabilities, which are preferably calculated as detailed herein.
For each set of input-values, a probability in Y equals a fraction of experiments in which INP equals the corresponding set of input-values. For each output-value, a probability in X equals a fraction of experiments in which OUT equals the corresponding output-value. The probability space, (YxX)n, is defined by obtaining n random samples from Y and n random samples X, where each sample is interpreted as a single experiment with a value from Y as the input-value and a value from X as the output-value.
Once the probability space (YxX)n is defined, states from the probability spaces Y and X are used as entries in the vectors INP and OUT, respectively, and the consistency-function is calculated by counting a maximal number of entries in INP which are consistently represented in OUT. When studying complicated networks using limited datasets, a major source of concern is over-fitting. Typically, over-fitting may occur in cases in which random control patterns yield a score which is as high as the real control function score. The present invention successfully addresses this problem by providing a probability function which gives information regarding the specificity of a speculated regulation pattern.
Thus, in another embodiment the fitness-function itself has a probabilistic property. In this embodiment, the fitness-function is called regulation-specificity and denoted herein rSpec.
According to a preferred embodiment of the present invention rSpec equals a probability that the value of the consistency-function of core 20 is larger than- or equal to- the value of the consistency-function of core 10. Hence, if a regulated object and a possible set of its regulators have consistency k based on a set of experiments (which consistency has been calculated using discrete or probabilistic domain and range), then the regulation specificity, rSpec, is the probability of obtaining a consistency of k or higher in a probability space, (YxX)n.
The size of the probability space defined above is exponential in the number of experiments. According to a preferred embodiment of the present invention, in cases where the difference n-k is small (i.e., almost perfect consistency), rSpec may be calculated by the following approximation which is linear in the number of experiments, hence more efficient in terms of computation time. In the present embodiment, rSpec equals to a probability for obtaining, for each experiment, a consistency which is larger than or equal to k in a dataset with the n inputs from the INP and outputs which are randomly sampled from X. Specifically, for / input configurations with multiplicities n\,...,n then rSpec is the probability of getting «' identical values out of nh z-1,...,/, samples from X where ∑«' > k.
In an additional embodiment, the fitness-function may also be calculated based on a permutation sensitivity analysis of the observed consistency pattern. The probability spaces X and Y which were defined for rSpec, may also be used in the present embodiment. Hence, according to a preferred embodiment of the present invention, the measured Y values are fixed while the measured X values are randomly permutated. For each random permutation of X, a probability of obtaining the consistency is calculated. A skilled artisan would appreciate that this probability is high for low consistency and low for high consistency. In other words, the fitness-function may be calculated by fixing the domain and applying random permutation only on the range of the states.
It would be appreciated by one ordinarily skilled in the art that other functions may be used as fitness-functions of the present invention. Hence, in still another embodiment, the fitness-function may be a mutual information function M(X,Y). The mutual information function is known in the art. To this end see, e.g., a book by T.M. Cover and J.M. Thomas, entitled "Elements of Information Theory", published by John Wiley & Sons (1991), the contents of which are hereby incorporated by reference.
Although the mutual information function is defined for discrete distributions, it may be generalized to a probability space similarly to the above procedure. Hence, given a set E of n experiments, the input (or output) distribution of a variable set S = v1,...,v^ is calculated as a fraction of experiments in E for which the inputs (or outputs) of S were exactly c\,...,cd. Denoting this distribution by XE S, the mutual information of a regulating set S, a variable v and a set E of experiments equals M(XE S, XE {V})-
In yet another embodiment, the fitness- function may be a Bayesian local score function. Bayesian local score functions are known in the art, and disclosed, e.g., in N. Friedman, et. al., "Using Bayesian networks to analyze expression data", Proceedings of the Fifth Annual International Conference on Computational Molecular Biology 127-135 (2000), the contents of which are hereby incorporated by reference. With respect to Bayesian score functions, the fitness function may be a Dirichlet equivalent score function, which is further detailed in N. Friedman, et. al.
As stated, the object-score which is preferably calculated by one of the fitness-functions detailed above, has a local nature. According to a preferred embodiment of the present invention, a global network-score is calculated by the following steps. In a first step, given the objects of core 20 as possible regulators, the fitness function is calculated for each object of core 10, thereby obtaining at least one fitness-function-value. A second step includes averaging the fitness-function-values over the objects of core 10. A skilled artisan would appreciate that such a procedure provides a tool for evaluating an overall score for core 20. Irrespectively of the scoring procedure as detailed hereinabove, the present invention computationally infers biological pathways by finding an expanded core that fits the experimental data best. As stated, core 10 of Figure lb represents existing biological knowledge therefore it allows an improved specificity of expanded core 20, compared to the quality of some de-novo reconstruction of a network.
Once the scoring step is complete, the object- and network-scores may be used to generate a report describing the network. As stated, the report may be in any form suitable for sufficiently describing a network, e.g., any combination of a list, a table and/or a graph. The report may include a list of all candidate networks, sorted in accordance to their individual network-scores. In addition, in a specific network one or more object-scores may be presented, preferably in a decreasing order, so as to report on the relative or absolute contribution of the corresponding object to the overall network-score. Based on the object-score, the user may decide whether or not to accept the corresponding object to the network. Still in addition, the report may include the effect of a specific object on other objects. For example, in cases in which the dataset comprises data pertaining to the expression levels of a plurality of genes, the report may include, for a specific gene, the level of regulation of the gene on other genes, so that the user may study a particular regulation mechanism between two genes.
In an embodiment in which the report includes a table, the table preferably comprises a plurality of elements each representing a dependency relation between two objects. In an embodiment in which the report includes a graph, the graph preferably includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects. Thus, the graph provides a visual description of both the objects of core 20 and the dependency relations between different objects in core 20. According to another aspect of the invention, there is provided an apparatus for determining a network of biological relationships from a dataset, the apparatus is generally referred to herein as apparatus 50.
Reference is now made to Figure 3, which illustrate apparatus 50. Hence, apparatus 50 comprises an input unit 52, a core expander 54 and an output unit 56. According to a preferred embodiment of the present invention input unit 52 serves for inputting a core biological network which is similar to core 10, as detailed hereinabove. Core expander 54 serves for expanding the core by adding to the core at least one additional object of the dataset. Thus, similarly to the above description, core expander 54 provides an expanded core, which may be, for example, core 20.
Output unit 56 serves for issuing a report describing the network of biological relationships. The report, generated by output unit 56, may have any form, preferably one of the forms described above.
Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.
EXAMPLES
Reference is now made to the following examples, which together with the above descriptions, illustrate the invention in a non limiting fashion.
Database and Computer Resources The method has been tested on yeast transcription datasets using the ergosterol pathways as a core. The study was focused on yeast, which has the largest publicly available gene expression datasets. The variable set consisted of 6200 yeast open reading frames (ORFs). Transcription profiles were taken from two large scale yeast cDNA arrays experiments: (1) 260 selected knockout experiments performed -by Hughes et. al. [T. R. Hughes et. al., "Functional discovery via a compendium of expression profiles", Cell, 102:109-26, (2000)]; and (2) 100 experiments testing yeast behavior in stressful conditions performed by Gasch et. al. [A. P. Gasch el. al., "Genomic expression programs in the response of yeast cells to environmental changes", Mol. Biol. Cell, 11 :4241-57 (2000)].
In the following examples, a single node expansion process was used. In the process, for each object, the sum of fitness gains to all core objects from adding that object to the core, is calculated. Thus, the goal was to identify genes that might regulate or indirectly affect the pathway in those experiments which are left unexplained by the core model.
The computations were conducted on a Pentium III 500 MHz processor, using a Linux operating system. A new software platform GENESYS™ (GEnetic Network Expansion
SYStem) has been developed to implement the method described hereinabove. The platform includes engines for representing networks and computing fitness, a flexible expansion algorithm, viewers for visualization of biological data sets, an application to enable interactive usage of the viewers and engine and an internal database scheme for the storage of datasets and pathways.
The GENESYS environment is built as a modular prototype composed of back-end and front-end implemented in C++ and Perl/Tk. The front-end is a
GUI application featuring a wide selection of viewers and providing means to invoke computational processes using the back-end. The back-end provides a set of libraries and computational engines using them.
The basic features of GENESYS front-end are described herein with reference to the accompanying computer screen snapshots.
Figure 4 shows a dependency structure editor. The editor enables creation, viewing and saving of network cores, manipulation of nodes and arcs in the graph and it also provides connectivity to the logic viewer described below.
Figure 5 shows a variable list window which is a list widget providing access to the all the objects with all the information attached to it.
Figure 6 shows a logic viewer which presents a binary decision diagram (BDD) given a focus variable and its current incoming neighbors. A BDD [R. E. Bryant, "Graph based algorithms for Boolean function manipulation", IEEE Transactions on Computers, C-35(8):677-91 (1986)] represents the effect of inputs values on output value of the focus variable and provides insight into the structure of the discrete logic induced by a given dependency structure. To create a BDD, a layered tree is constructed where each level represents the addition of a new input variable. The leafs of the tree represent groups of experiments in which the input added up to the leaf level takes the same value. A parent child relation is added whenever the input assignment of level / is a subset of the assignment of the bigger input set at level /+1. In the final tree level, the output value is introduced to reveal the relation between input assignment and output value. Using the BDD, users can explore the logic underlying a putative dependency structure and deduce functional roles each input may have (activator/repressor or more complicated functions).
Figure 7 shows an interaction window which provides putative interactions from the GENESYS database. The window lets the user navigate in the interaction graph by showing each time the neighborhood of a focus variable which may be selected interactively. In the interaction window different edge colors are used for different types of interaction.
Figure 8 shows a one-expansion screen result window which lists the expansion candidates of a given core after the completion of a screen. The user can sort the list according to various parameters and evaluate the affinity of each of the expansion object to different parts of the core.
Ergosterol Metabolism Ergosterol is an essential lipid in yeast which is to the equivalent of cholesterol in mammals. Ergosterol's primary role is in the cell membranes but it is also involved in aerobic metabolism, sterol uptake and sterol transport. Ergosterol metabolism is understood rather well. Ergosterol metabolism is composed of two pathways in series. The first, the mevalonate pathway, transforms acetyl-CoA to farnesyl and provides essential components for few important metabolic pathways (e.g., heme and quinines). The latter pathway transforms farnesyl to ergosterol. Much of the regulation of ergosterol is believed to be transcriptionally mediated, but the actual details are known only in part [G. F. Bammert and J. M. Fostel, "Genome-wide expression patterns in Saccharomyces cerevisiae: comparison of drug treatments and genetic alterations affecting biosynthesis of ergosterol", Antimicrobial Agents and Chemotherapy, 44:1255-1265 (2000); G. Daum, N. D. Lees, M. Bard and R. Dickson, "Biochemistry, cell biology and molecular biology of lipids of Saccharomyces cerevisiae", Yeast, 14:1471-1510 (1998); T. G. Turi and J. C. Loper, "Multiple regulatory elements and control expression of the gene encoding the Saccharomyces cerevisiae Cytochrome P450, lanosterol 14a- demethylase (ERG11)", JBC, 267:2046-56 (1992)].
Figure 9 shows the basic known ergosterol metabolic pathway from farnesyl to ergosterol, including a series of 11 enzymes and three transcription factors. It is important to stress here the difference between metabolic pathways and regulatory networks: The fact that two enzymes follow each other in a biochemical process does not mean their transcription regulation is directly connected. The ergosterol dependency structure core was modeled as a set of variables, with dependencies marked only between known transcription factors and their targets. In other words, no dependency was prescribed between enzymes.
The reactions of pathway enzymes in the entire data set were analyzed. A number of experiments showed a global reaction of the pathway: in those experiments most of the pathway enzymes underwent significant change. This is presumably the result of some self regulatory mechanism (and indeed ergosterol itself is reported to function as transcription regulator for its pathway enzymes). However, many other experiments (about 40) showed a change in one or more of the pathway genes, which is not explained by the above mechanism. Those experiments may be explained by a more elaborate model.
EXAMPLE 1
Transcription Factors Screening
Out of the ~ 6200 yeast ORFs, 130 putative transcription factors (TFs) were identified, using SGD annotations [C. A. Ball et. al, "Saccharomyces genome database provides tools to survey gene expression and functional analysis data", Nucleic Acids Research, 29:80-1 (2001)], and typical structural motifs (e.g., zinc fingers). A single node expansion algorithm was then applied, limiting the candidates for node expansions to these putative TFs.
Results are summarized in Table 1 below, showing a fitness gain of each of the putative TFs, ranked against a "naked" core consisting of the eleven ERG enzymes with no dependencies among them. HAPl was ranked second out of 130, in agreement with the known role of HAPl in ERG11 regulation. TUP1 is a general repressor and was thus ranked lower, ROXl was less expressed in the data and was ranked much lower.
Table 1
Figure imgf000032_0001
Prior art analysis [T. Turi and J. Loper, "Multiple regulatory elements and control expression of the gene encoding the Saccharomyces cerevisiae Cytochrome P450, lanosterol 14a-demethylase (ERGl l)", JBC, 267:2046-56 (1992)] of the promoter region of ERG 11 is summarized in Figure 10.
The ERGl l regulation has been tested by applying a single node expansion to a core consisting of the eleven ERG enzymes as well as HAPl and ROXl as regulators of ERGl l. The algorithm measured the improvement in fitness contributed by each of the 130 TFs, and an uncharacterized gene was ranked first. That gene improves the fitness of ERG1 (and others). Remarkably, it also has a good homology to HAPl (33 % identity, 50 % similarity along 100 amino acids and even better in a shorter range).
Moreover, analyzing ERGl l logic as a function of HAPl, ROXl, TUP1 and novel TF shows that the effect of the new putative TF on eRGl l is inductive (as expected from a UAS2 binding gene). Thus, the hypothesis that the novel TF is indeed an ERGl l regulator that might bind to UAS2 is supported by three different methods: (1) sequence homology; (2) promoter analysis indicating a second inducer should exist; and (3) the screening procedure of the present example, using 360 different expression profiles in distinct cell states.
EXAMPLE 2 Screening All Genes In the present example, all ~ 6200 yeast ORFs were screened against the ergosterol core. This type of analysis may discover more general patterns of regulation that cannot be directly tagged as "A is a factor of B". Still, as shown below, some interesting biology may be learned from it. The results of such a screen are given in Table 2 below. The two top ranking genes, POS5, YBR043C, are both of unknown function. POS5 has homology to iron metabolism enzymes. Both present significant fitness gain for ERG4 regulation. ERG4 is the last of the ergosterol pathway enzymes, is not essential and little is known on its regulation.
Reference is now made to Figure 11 showing a Venn-like diagram of the ERG4 dependent genes. The diagram represents all experiments in which ERG4, POS5 and YBR043C were induced. The size of each set is indicated by a number. As can be seen from Figure 11 induction POS5 and YBR043C strongly correlates with of ERG4 induction (11/12 experiments in both cases). In addition, ERG4 is showing a second, separate regulation pattern (5 experiments) which is unrelated to POS5 and YBR043C. It would be appreciated that using standard clustering or similarity, the behavior of ERG4 in experiments with no POS5, YBR043C involvement would have masked the pattern identified in the present example.
The fourth gene in the screen list is INOl which is involved in inositol biogenesis. Inositol has a regulatory function in the pospholipid pathway (adjacent to ergosterol). Note that the dependency is localized differently (improving different variables) in that case. The relation of GAS 1 to ergosterol might be rooted in its function in the cell wall. The dependency between our core and MKK2 is very reasonable considering its function in the signaling pathway to the cell wall protein PCKl. The 11th gene in the list is ERG10, which is the first gene in the mevalonate pathway leading to our core.
The dependencies revealed by the general 1 -expansion screening can serve as the basis for deeper biological exploration. The process pinpoints statistically significant patterns which are hard to identify otherwise. In contrast with the TF 1 -expansion screening, the results are less direct and do not identify specific dependencies.
Table 2
Figure imgf000034_0001
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.

Claims

WHAT IS CLAIMED IS:
1. A method of determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the method is executed by a data processor and comprising:
(a) inputting a core biological network which comprises a portion of said plurality of objects of the dataset, and N dependency relations between pairs of objects of said portion of the dataset, wherein N is a nonnegative number;
(b) optimally expanding said core biological network by adding to said core biological network at least one additional object of said plurality of objects of said dataset, so as to obtain an expanded core biological network; and
(c) issuing a report describing the network of biological relationships.
2. The method of claim 1, wherein said at least one additional object has at least one dependency relation to at least one existing object of said core biological network.
3. The method of claim 1, wherein said at least one additional object comprises at least one pair of objects, said at least one pair of objects having a dependency relation therebetween.
4. The method of claim 1, wherein the objects are selected from the group consisting of expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters and epidemiological parameters.
5. The method of claim 1, wherein at least one of said dependency relations represents an interaction law between to objects.
6. The method of claim 1, wherein at least one of said dependency relations represents a control relation selected from the group consisting of control relation at a level of mRNA, protein and protein-activity.
7. The method of claim 1, wherein said core biological network represents a transcriptional network.
8. The method of claim 1, wherein said core biological network represents protein network having relations among hetrodimers, kinases and ubiqutines.
9. The method of claim 1, wherein said core biological network represents genes encoding proteins participating in at least one biochemical pathway.
10. The method of claim 1, wherein said core biological network represents biological interactions among any combination of objects selected from the group consisting of genes, proteins, chemicals and external signals.
11. The method of claim 1, wherein the dataset represents a plurality of experiments, each experiment of said plurality of experiments being characterized by a plurality of states, one state for each object.
12. The method of claim 11, wherein at least a portion of said plurality of states has a probabilistic property.
13. The method of claim 12, wherein said probabilistic property is defined by a plurality of distributions each of said distributions being defined within a probability space, said probability space being represented by a plurality of probabilities, each of said plurality of probabilities corresponding to one state.
14. The method of claim 13, wherein each probability of said plurality of probabilities equals a fraction of a class of experiments characterized by a predetermined rule.
15. The method of claim 14, wherein said predetermined rule is such that a particular state of a particular object has a predetermined value.
16. The method of claim 14, wherein said predetermined rule is such that each state of a portion of said plurality of states which respectively corresponds to a portion of said plurality of objects, has a value which is selected from a predetermined set of values.
17. The method of claim 11, further comprising repeating said step (b) at least once, thereby obtaining a plurality of expanded core biological networks.
18. The method of claim 17, further comprising redefining said expanded core biological network to be said core biological network, prior to said step (b).
19. The method of claim 11, wherein said optimally expanding is by defining at least one fitness function.
20. The method of claim 19, wherein said fitness-function is selected so as to minimize over-fitting of the network of biological relationships to the dataset.
21. The method of claim 19, wherein said at least one additional object of step (b) is selected so as to maximize said fitness-function.
22. The method of claim 1, wherein said optimally expanding is by scoring said expanded core biological network, thereby obtaining a network- score.
23. The method of claim 19, said optimally expanding is by scoring said expanded core biological network, thereby obtaining a network-score.
24. The method of claim 23, wherein said scoring is by:
(i) for each object of said core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of said at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and
(ii) averaging said at least one fitness-function-value over said core biological network; thereby obtaining said network-score.
25. The method of claim 1, further comprising scoring each object of said at least one additional object, thereby obtaining at least one object-score, and sorting said additional objects in accordance with said object-scores.
26. The method of claim 19, further comprising scoring each object of said at least one additional object, thereby obtaining at least one object-score, and sorting said at least one additional object in accordance with said object- scores, said scoring is by said fitness-function.
27. The method of claim 17, further comprising scoring each of said plurality of expanded core biological networks, thereby obtaining a plurality of network-scores, and sorting said plurality of expanded core biological networks in accordance with said network-scores.
28. The method of claim 19, further comprising repeating said step (b) at least once, thereby obtaining a plurality of expanded core biological networks.
29. The method of claim 28, further comprising scoring each of said plurality of expanded core biological networks, thereby obtaining a plurality of network-scores, and sorting said plurality of expanded core biological networks in accordance with said network-scores.
30. The method of claim 29, wherein said scoring each of said plurality of expanded core biological networks is by: for each of said plurality of expanded core biological networks:
(i) for each object of said core biological network, calculating a fitness-function so as to obtain a fitness-function-value characterizing a fit of said at least one additional object to the dataset, thereby obtaining at least one fitness-function-value; and
(ii) averaging said at least one fitness-function-value over said core biological network, thereby obtaining a network-score; thereby obtaining said plurality of network-scores.
31. The method of claim 19, further comprising for each object, A, having a dependency relation to at least one object B:
(i) defining, for each experiment of said plurality of experiments, an ordered set of states, said ordered set of states comprises one state of said object, A, and one state for each of said at least one object, B, said states corresponding to said experiment, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each experiment; and (ii) for each pair of experiments determining whether respective ordered sets of states are identical, and if so than defining each experiment of said pair of experiments to be a consistent experiment; thereby obtaining, for each object having a dependency relation to at least one object, a set of consistent experiments.
32. The method of claim 31, further comprising defining, for each object, an object consistency-function, said object consistency-function equals a maximal number of consistent experiments corresponding to said object, thereby providing a plurality of object consistency-functions.
33. The method of claim 32, wherein said fitness-function equals said object consistency-function.
34. The method of claim 32, further comprising: summing all said object consistency-functions of all objects in said core biological network, thereby providing a core consistency-function; and summing all said object consistency-functions of all objects in said expanded core biological network, thereby providing an expanded core consistency-function.
35. The method of claim 34, wherein said fitness-function equals a regulation-specificity, said regulation-specificity equals a probability of obtaining said expanded core consistency-function which is larger than- or equal to- said core consistency-function.
36. The method of claim 35, further comprising calculating said regulation-specificity for each of said at least one additional object, thereby providing at least one regulation-specificity.
37. The method of claim 32, wherein the dataset comprises at least one perturbed state, each of said least one perturbed state corresponds to a constraint within the dataset.
38. The method of claim 37, further comprising excluding each of said at least one perturbed state from said object consistency function.
39. The method of claim 19, wherein said fitness-function equals a mutual information function.
40. The method of claim 19, wherein said fitness-function equals a Bayesian local score function.
41. The method of claim 40, wherein said Bayesian local score function is a Dirichlet equivalent score function.
42. The method of claim 1, wherein said report is selected from the group consisting of a graph, a table and a list.
43. The method of claim 42, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
44. The method of claim 42, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
45. The method of claim 22, wherein said report is selected from the group consisting of a graph, a table and a list.
46. The method of claim 45, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
47. The method of claim 45, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
48. The method of claim 45, wherein said report includes said network-score.
49. The method of claim 25, wherein said report is selected from the group consisting of a graph, a table and a list.
50. The method of claim 49, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
51. The method of claim 49, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
52. The method of claim 49, wherein said report includes said at least one object-score.
53. The method of claim 29, wherein said report is selected from the group consisting of a graph, a table and a list.
54. The method of claim 53, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
55. The method of claim 53, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
56. The method of claim 53, wherein said report includes at least one of said plurality of network-scores.
57. The method of claim 1, wherein said report comprises at least one pair of objects having a dependency relation therebetween.
58. The method of claim 32, wherein said report includes at least one of said plurality of object consistency-functions.
59. The method of claim 36, wherein said report includes at least one of said at least one regulation-specificity.
60. An apparatus for determining a network of biological relationships from a dataset, the dataset having a plurality of biological objects, the apparatus comprising:
(a) an input unit for inputting a core biological network which comprises a portion of said plurality of objects of the dataset, and N dependency relations between pairs of objects of said portion of the dataset, wherein N is a nonnegative number;
(b) a core expander for optimally expanding said core biological network by adding to said core biological network at least one additional object of said plurality of objects of said dataset, so as to obtain an expanded core biological network; and
(c) an output unit for issuing a report describing the network of biological relationships.
61. The apparatus of claim 60, wherein said at least one additional object has at least one dependency relation to at least one existing object of said core biological network.
62. The method of claim 60, wherein said at least one additional object comprises at least one pair of objects, said at least one pair of objects having a dependency relation therebetween.
63. The apparatus of claim 60, wherein the objects are selected from the group consisting of expressed genes, proteins, antibodies, metabolites, physiological parameters, biochemical parameters, evolution parameters and epidemiological parameters.
64. The apparatus of claim 60, wherein at least one of said dependency relations represents an interaction law between to objects.
65. The apparatus of claim 60, wherein at least one of said dependency relations represents a control relation selected from the group consisting of control relation at a level of mRNA, protein and protein-activity.
66. The apparatus of claim 60, wherein said core biological network represents a transcriptional network.
67. The apparatus of claim 60, wherein said core biological network represents protein network having relations among hetrodimers, kinases and ubiqutines.
68. The apparatus of claim 60, wherein said core biological network represents genes encoding proteins participating in at least one biochemical pathway.
69. The apparatus of claim 60, wherein said core biological network represents biological interactions among any combination of objects selected from the group consisting of genes, proteins, chemicals and external signals.
70. The apparatus of claim 60, wherein the dataset represents a plurality of experiments, each experiment of said plurality of experiments being characterized by a plurality of states.
71. The apparatus of claim 70, wherein at least a portion of said plurality of states have a probabilistic property.
72. The apparatus of claim 71, wherein said probabilistic property is defined by a plurality of distributions each of said distributions being defined within a probability space, said probability space being represented by a plurality of probabilities, each of said plurality of probabilities corresponding to one.
73. The apparatus of claim 72, wherein each probability of said plurality of probabilities equals a fraction of a class of experiments characterized by a predetermined rule.
74. The apparatus of claim 73, wherein said predetermined rule is such that a particular state of a particular object has a predetermined value.
75. The apparatus of claim 73, wherein said predetermined rule is such that each state of a portion of said plurality of states which respectively corresponds to a portion of said plurality of objects, has a value which is selected from a predetermined set of values.
76. The apparatus of claim 70, wherein said core expander is operable to expand said core biological network a plurality of times, thereby to obtain a plurality of expanded core biological networks.
77. The apparatus of claim 76, wherein said core expander is operable to redefine said expanded core biological network to be said core biological network.
78. The apparatus of claim 70, wherein said core expander comprises a fitness-function unit, operating at least one fitness function.
79. The apparatus of claim 78, wherein said fitness-function unit operable to select said fitness-function so as to minimize over-fitting of the network of biological relationships to the dataset.
80. The apparatus of claim 78, wherein said core expander is operable to select said at least one additional object so that said fitness-fiinction is maximized.
81. The apparatus of claim 60, further comprising a scorer, for scoring each object, thereby to obtain a plurality of object-scores and for summing at least a portion of said plurality of object-scores corresponding to objects of said expanded core biological network thereby to obtain a network-score.
82. The apparatus of claim 78, further comprising a scorer, for scoring said expanded core biological network thereby to obtain a network-score.
83. The apparatus of claim 82, wherein said scorer comprising:
(i) electronic-calculating functionality for calculating a fitness- function for each object of said core biological network, so as to obtain a fitness-function-value characterizing a fit of said at least one additional object to the dataset, thereby to obtain at least one fitness-function-value; and
(ii) electronic-calculating functionality for averaging said at least one fitness-function-value over said core biological network; thereby obtaining said network-score.
84. The apparatus of claim 78, further comprising a consistency function unit for obtaining, for each object, an object consistency-function, wherein said consistency function unit comprising:
(i) electronic-calculating functionality for defining, for each object, A, having a dependency relation to at least one other object B, and for each experiment of said plurality of experiment, an ordered set of states, said ordered set of states comprises states of said object, A, and of each of said at least one object, B, thereby obtaining a plurality of ordered sets of states, one ordered set of states for each object, A, and for each experiment; and
(ii) electronic-calculating functionality for determining, for each pair of experiments, whether respective sets of states are identical, and if so than to define said pair of experiments to be a pair of consistent experiments; and
(iii) electronic-calculating functionality for determining for each object, a maximal number of consistent experiments corresponding to said object, and for setting said object consistency-function to be said maximal number of consistent experiments.
85. The apparatus of claim 84, wherein said fitness-function equals said object consistency-function.
86. The apparatus of claim 84, wherein said fitness-function unit comprising: (i) electronic-calculating functionality for summing all said object consistency-functions of all objects in said core biological network, thereby to provide a core consistency-function; and
(ii) electronic-calculating functionality for summing all said object consistency-functions of all objects in said expanded core biological network, thereby to provide an expanded core consistency-function.
87. The apparatus of claim 86, wherein said fitness-function unit operable to set fitness-function to be a regulation-specificity, said regulation- specificity equals a probability of obtaining said expanded core consistency- function which is larger than- or equal to- said core consistency- function.
88. The apparatus of claim 86, wherein said fitness-function unit operable to calculate said regulation-specificity for each of said at least one additional object, thereby to provide at least one regulation-specificity.
89. The apparatus of claim 84, wherein the dataset comprises at least one perturbed state, each of said least one perturbed state corresponds to a constraint within the dataset.
90. The apparatus of claim 89, further comprising an excluder for excluding each of said at least one perturbed state from said object consistency function.
91. The apparatus of claim 78, wherein said fitness-function equals a mutual information function.
92. The apparatus of claim 78, wherein said fitness-function equals a Bayesian local score function.
93. The apparatus of claim 92, wherein said Bayesian local score function is a Dirichlet equivalent score function.
94. The apparatus of claim 60, wherein said report is selected from the group consisting of a graph, a table and a list.
95. The apparatus of claim 94, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
96. The apparatus of claim 94, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
97. The apparatus of claim 81, wherein said report is selected from the group consisting of a graph, a table and a list.
98. The apparatus of claim 97, wherein said table comprises a plurality of elements each representing a dependency relation between two objects.
99. The apparatus of claim 97, wherein said graph includes a plurality of nodes, each representing an object and a plurality of edges, each representing a dependency relation between two objects
100. The apparatus of claim 97, wherein said report includes said network-score.
101. The apparatus of claim 97, wherein said report includes said at least one object-score.
102. The apparatus of claim 60, wherein said report comprises at least one pair of objects having a dependency relation therebetween.
103. The apparatus of claim 84, wherein said report includes at least one of said plurality of object consistency-functions.
104. The apparatus of claim 88, wherein said report includes at least one of said at least one regulation-specificity.
PCT/IL2002/000460 2001-06-14 2002-06-13 Method of expanding a biological network WO2002103608A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29776501P 2001-06-14 2001-06-14
US60/297,765 2001-06-14

Publications (2)

Publication Number Publication Date
WO2002103608A2 true WO2002103608A2 (en) 2002-12-27
WO2002103608A3 WO2002103608A3 (en) 2004-12-16

Family

ID=23147649

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2002/000460 WO2002103608A2 (en) 2001-06-14 2002-06-13 Method of expanding a biological network

Country Status (1)

Country Link
WO (1) WO2002103608A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1494160A2 (en) * 2003-07-04 2005-01-05 MediCel oy Information management system for biochemical information
WO2006114479A1 (en) * 2005-04-28 2006-11-02 Valtion Teknillinen Tutkimuskeskus Visualization technique for biological information
EP2757496A1 (en) * 2013-01-17 2014-07-23 Fondazione Edmund Mach Systems and methods for determining suitable entities for expanding established causal molecular biological networks and for determining significant causal relationships between entities of established causal molecular biological networks and candidate entities
US9719147B1 (en) 2003-08-22 2017-08-01 Dennis Sunga Fernandez Integrated biosensor and simulation systems for diagnosis and therapy
US9740817B1 (en) 2002-10-18 2017-08-22 Dennis Sunga Fernandez Apparatus for biological sensing and alerting of pharmaco-genomic mutation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999028744A1 (en) * 1997-11-27 1999-06-10 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Identification and characterization of interacting molecules
US6132969A (en) * 1998-06-19 2000-10-17 Rosetta Inpharmatics, Inc. Methods for testing biological network models

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999028744A1 (en) * 1997-11-27 1999-06-10 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Identification and characterization of interacting molecules
US6132969A (en) * 1998-06-19 2000-10-17 Rosetta Inpharmatics, Inc. Methods for testing biological network models

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHEN T ET AL: "Identifying gene regulatory networks from experimental data" PARALLEL COMPUTING, ELSEVIER PUBLISHERS, AMSTERDAM, NL, vol. 27, no. 1-2, January 2001 (2001-01), pages 141-162, XP004231333 ISSN: 0167-8191 *
D'HAESELEER P ET AL: "GENETIC NETWORK INFERENCE: FROM CO-EXRESSION CLUSTERING TO REVERSE ENGINEERING" BIOINFORMATICS, OXFORD UNIVERSITY PRESS, OXFORD,, GB, vol. 16, no. 8, August 2000 (2000-08), pages 707-726, XP001088339 ISSN: 1367-4803 *
EDWARDS R ET AL: "Symbolic dynamics and computation in model gene networks" CHAOS AIP USA, vol. 11, no. 1, March 2001 (2001-03), pages 160-169, XP008037801 ISSN: 1054-1500 *
KOHN M C: "Graph theoretical analysis of control of metabolic and genetic systems" PROCEEDINGS OF THE 1986 SUMMER COMPUTER SIMULATION CONFERENCE SCS SAN DIEGO, CA, USA, 1986, pages 381-385, XP008037772 *
KYODA K M ET AL: "A GENE NETWORK INFERENCE METHOD FROM CONTINUOUS-VALUE GENE EXPRESSION DATA OF WILD-TYPE AND MUTANTS" PROCEEDINGS GENOME INFORMATICS WORKSHOP, UNIVERSITY ACADEMY PRESS, TOKYO, JP, vol. 11, 18 December 2000 (2000-12-18), pages 196-204, XP002909075 ISSN: 0919-9454 *
LIANG S ET AL: "Reveal, a general reverse engineering algorithm for inference of genetic network architectures" PROCEEDINGS OF THE PACIFIC SYMPOSIUM ON BIOCOMPUTING '98, HAWAII, vol. 3, 1998, pages 18-29, XP002960925 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9740817B1 (en) 2002-10-18 2017-08-22 Dennis Sunga Fernandez Apparatus for biological sensing and alerting of pharmaco-genomic mutation
EP1494160A2 (en) * 2003-07-04 2005-01-05 MediCel oy Information management system for biochemical information
EP1494160A3 (en) * 2003-07-04 2007-10-24 MediCel oy Information management system for biochemical information
US9719147B1 (en) 2003-08-22 2017-08-01 Dennis Sunga Fernandez Integrated biosensor and simulation systems for diagnosis and therapy
US10878936B2 (en) 2003-08-22 2020-12-29 Dennis Sunga Fernandez Integrated biosensor and simulation system for diagnosis and therapy
WO2006114479A1 (en) * 2005-04-28 2006-11-02 Valtion Teknillinen Tutkimuskeskus Visualization technique for biological information
US8572064B2 (en) 2005-04-28 2013-10-29 Valtion Teknillinen Tutkimuskeskus Visualization technique for biological information
EP2757496A1 (en) * 2013-01-17 2014-07-23 Fondazione Edmund Mach Systems and methods for determining suitable entities for expanding established causal molecular biological networks and for determining significant causal relationships between entities of established causal molecular biological networks and candidate entities

Also Published As

Publication number Publication date
WO2002103608A3 (en) 2004-12-16

Similar Documents

Publication Publication Date Title
Hvidsten et al. Learning rule-based models of biological process from gene expression time profiles using gene ontology
Thalamuthu et al. Evaluation and comparison of gene clustering methods in microarray analysis
Tseng Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data
Woolf et al. A fuzzy logic approach to analyzing gene expression data
US20110246409A1 (en) Data set dimensionality reduction processes and machines
US8572018B2 (en) Method, system and software arrangement for reconstructing formal descriptive models of processes from functional/modal data using suitable ontology
Troyanskaya Putting microarrays in a context: integrated analysis of diverse biological data
Chen et al. DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning
Kim et al. Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization
Tanay et al. Computational expansion of genetic networks
Cheng et al. DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data
EP1540505A2 (en) Systems and methods for inferring biological networks
Haque et al. A common neighbor based technique to detect protein complexes in PPI networks
Eicher et al. Challenges in proteogenomics: a comparison of analysis methods with the case study of the DREAM proteogenomics sub-challenge
Omranian et al. Computational identification of protein complexes from network interactions: Present state, challenges, and the way forward
WO2002103608A2 (en) Method of expanding a biological network
Balamurugan et al. Biclustering microarray gene expression data using modified Nelder-Mead method
Das et al. Evolutionary biclustering with correlation for gene interaction networks
Hassan et al. Dimensionality reduction methods for extracting functional networks from large‐scale CRISPR screens
CN116665764B (en) Method for predicting deletion reaction in metabolic network
Deng Algorithms for reconstruction of gene regulatory networks from high-throughput gene expression data
Bhattacharya Mining high dimensional transcriptomic data to unravel the causes and consequences of genomic instability in cancers
Treviño Alvarado Identifying the molecular components that matter: a statistical modelling approach to linking functional genomics data to cell physiology
Paul Gene Ontology Semi-supervised Clustering for Prediction of Genes Functions
Nambiar et al. FUN-PROSE: A deep learning approach to predict condition-specific gene expression in fungi

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ CZ DE DE DK DK DM DZ EC EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP