WO2002039214A2 - Relevance networks for visualizing clusters in gene expression data - Google Patents

Publication number
WO2002039214A2
Authority
WO
WIPO (PCT)
Prior art keywords
variables
pair
data
score
genes
Prior art date
Application number
PCT/US2001/043604
Other languages
French (fr)
Other versions
WO2002039214A3 (en)
WO2002039214A9 (en)
Inventor
Atul J. Butte
Original Assignee
Children's Medical Center Corporation
Application filed by Children's Medical Center Corporation
Priority to AU2002219817A
Publication of WO2002039214A2
Publication of WO2002039214A3
Publication of WO2002039214A9

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention relates generally to data mining or processing. More specifically, the invention relates to a system and method for mining data from multiple data sets to identify potentially meaningful relationships among variables in the data sets.
  • the present invention relates to a system and method for producing a network of related variables.
  • An objective of the invention is to uncover correlation or association between variables that are significant in different contexts, or conditions of interest.
  • the invention features a method that evaluates a relation between multiple variables under two conditions of interest. According to the method, data for a plurality of variables that represent or correspond to a first condition of interest are obtained. An association between a pair of variables is established. From the data, a first strength of the association between that pair of variables is calculated. The above steps are repeated to calculate a second strength of association between the same pair of variables based on their data that represent or correspond to a second condition of interest.
  • a score is generated based on at least the first and second strengths to help evaluate the association of the pair of variables under both conditions of interest. Another strength of association between the same pair of variables, based on their data corresponding to another condition of interest, can be calculated. There is no limit on the number of conditions of interest for which strengths can be calculated. A score can be generated based on all the calculated strengths.
  • Optionally, a network of those variables may be produced based on the generated score. Alternatively, multiple scores may be generated for multiple pairs of variables, e.g., each possible pair of variables, and a network may be produced based on these multiple scores.
  • the variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.).
  • the network of variables can be graphically displayed.
  • the strength of the association between a pair of variables under a certain condition is calculated from a correlation coefficient for the pair of variables.
  • the strength is based on the slope of the line produced through a linear regression model, i.e., the line that best fits the data for the pair of variables.
  • the strength is based on both the correlation coefficient and the slope.
  • a criterion or threshold may also be established and any number of the scores can be evaluated against the criterion.
  • a network of variables may be produced to include only variables that, in connection with other variables, have a score that meets the criterion.
  • the criterion or threshold value is established by randomly permuting the data for each pair of variables for each condition of interest. A score for each pair of variables is recalculated from the permuted data. The steps of permuting and calculating may be repeated a predetermined number of times. The criterion or threshold value may be set equal to or greater than the highest value for the recalculated score based on the permuted data.
  • the invention in another aspect, relates to a system for evaluating a relation between a plurality of variables under two conditions of interest.
  • the system includes memory storing data for the plurality of variables under at least a first and a second condition of interest.
  • An associator establishes an association between a pair of variables under at least the first and second conditions respectively.
  • a calculator in communication with the memory and the associator, calculates at least a first and a second strength of the association between the pair of variables under the first and second conditions respectively.
  • a score generator generates a score for the pair of variables based on at least the first and second strengths.
  • the calculator may further calculate another strength of association between the same pair of variables based on their data corresponding to another condition of interest. There is no limit on the number of conditions of interest for which the calculator can calculate strengths.
  • the score generator may generate a score based on all the calculated strengths.
  • the system may optionally include a network generator that generates a network of variables based on at least the generated score or scores.
  • the system can further include a display in communication with the score generator that displays at least the score.
  • the display may also be in communication with the network generator and display the network.
  • the system may also include a criterion setter that establishes a criterion against which the score is evaluated.
  • the criterion setter may establish the criterion through recalculating the score based on randomly permuted data.
  • the network generator may choose to generate a network of variables based on each score that meets the criterion.
  • the method of the invention can be used to evaluate a relation between a plurality of genes under two conditions of interest.
  • Data on the expression levels of a plurality of genes that represent or correspond to a first condition of interest are obtained.
  • An association between the expression levels of a pair of genes is established.
  • a first strength of the association between the expression levels of that pair of genes is calculated.
  • the above steps are repeated to calculate at least a second strength of association between the expression levels of the same pair of genes based on their data that represent or correspond to a second condition of interest.
  • a score for the pair of genes is generated based on both the first and the second strengths to help evaluate the association between the expression levels of the pair of genes under both conditions of interest.
  • Another strength of association between the expression levels of the same pair of genes, based on their data corresponding to another condition of interest, can be calculated. There is no limit on the number of conditions of interest for which strengths can be calculated.
  • a score can be generated based on all the calculated strengths.
  • a network of those genes may be produced based on the generated score.
  • multiple scores may be generated for multiple pairs of genes, e.g., each possible pair of genes. And a network may be produced based on these multiple scores.
  • the network of genes can be graphically displayed.
  • FIG. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention;
  • FIG. 2A is an embodiment of a table including data from a data source for a plurality of variables under a first condition of interest;
  • FIG. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2A with a line fit to the data;
  • FIG. 2C is an embodiment of a table including data from a data source for a plurality of variables under a second condition of interest;
  • FIG. 2D is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2C with a line fit to the data;
  • FIGS. 3A, 3B, and 3C are embodiments of a diagram where the lines from FIGS. 2B and 2D are presented;
  • FIG. 4A is an embodiment of a graphical representation of a network of a pair of variables based on their associations;
  • FIG. 4B is an embodiment of a graphical representation of a network of a plurality of variables based on their associations;
  • FIG. 5 is an embodiment of a matrix including examples of score values for each of the associations shown in FIG. 4B;
  • FIG. 6 is a flow chart of an embodiment of an exemplary process that produces a score or a network using the associations between variables under at least two conditions of interest according to the principles of the invention;
  • FIG. 7 is a flow chart of an embodiment of an exemplary process of evaluating a score against a criterion;
  • FIG. 8 is an embodiment of a table illustrating permutation of the data originally shown in the table of FIG. 2A;
  • FIG. 9 illustrates an exemplary process for permuting the data of a variable listed in FIG. 2C;
  • FIG. 10 is an embodiment of a graph illustrating both results of scores generated in accordance with the invention using actual genomic data and results from an exemplary process used to determine a threshold value for evaluating the scores;
  • FIG. 11 is an embodiment of a relevance network including genes identified in FIG.
  • FIG. 12 is an embodiment of a scatter plot of actual genomic data for a pair of genes under two disease states with lines fit to the two sets of data.
  • FIG. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30.
  • the computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention.
  • the processor in one embodiment is a 266 MHz Pentium® II processor, manufactured by Intel Corporation of Santa Clara, California.
  • One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris®, manufactured by Sun Microsystems, Inc. of Palo Alto, California.
  • the data source 30 in one embodiment is a database system, e.g., ORACLE® 8 or data stored in files on a data storage device, such as a hard disk.
  • the processor of the computer system 20 executes data mining software. Such software is written in any programming language, such as C, C++, etc.
  • the data in the data source 30 represent measurements of variables for multiple sample cases. These sample cases represent variables under at least two different conditions of interest. The conditions of interest may be conditions parallel in time or sequential in time. For example, in a medical context, the sample cases in one embodiment may be different individuals and the variables may be physical characteristics, such as weight, height, gender, race, etc, measured at the same time.
  • sample cases may be divided into age groups (i.e. multiple parallel conditions of interest) of 11-20, 21-30, 31-40 and so on.
  • the same variables may be measured at various time points (i.e. sequential conditions of interest), such as before and after the occurrence of an environmental change.
  • the measured variables are continuous variables.
  • the sample cases can be measurements of a single human subject conducted over a period of time, e.g., daily over a month.
  • the variables may be measurements of the same set of laboratory tests, such as hemoglobin, hematocrit, cholesterol and thyroxine measurements. And the data may be divided into those before the subject starts a drug therapy and afterwards (i.e. under two sequential conditions of interest).
  • the sample cases are RNA expression measurements of various subjects and the measured variables are different genes. The cases may be divided into two different disease conditions, e.g., two forms of leukemia.
  • sample cases are corporations for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc.
  • the data may be divided into different conditions of interest, such as by industry or by market capitalization.
  • FIG. 2A shows an exemplary tabular representation 40 of data in the data source that corresponds to a first condition of interest.
  • the measured variables A, B, C, D, and E are represented on the x-axis in columns.
  • the sample cases S1, S2, S3, and S4 on the y-axis, which all correspond to the first condition, are represented in rows.
  • FIG. 2A shows an exemplary data set of twenty entries wherein the table 40 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 40 lack a data value, each denoted by a dashed line.
  • Each pair of data values 42 appearing in the same sample in the table 40 represents a data point 44 in a scatter plot.
  • FIG. 2B shows an embodiment of an exemplary scatter plot 43 of the data points 44 produced by data 42 for variables D and E.
  • the data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14).
  • Scatter plots can be produced for each pair of variables, such as for variables B and E, in like manner.
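The pairing of data values into data points can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table layout, the function name `data_points`, and the numeric values are assumptions, and `None` stands in for an entry that lacks a data value.

```python
# Hypothetical table for one condition of interest: rows are sample cases,
# columns are variables. None marks an entry that lacks a data value.
table = {
    "S1": {"D": 1.0, "E": 2.1},
    "S2": {"D": 2.0, "E": 3.9},
    "S3": {"D": 3.0, "E": 6.2},
    "S4": {"D": None, "E": 8.0},
}


def data_points(table, var_x, var_y):
    """Return (x, y) pairs for the samples in which both variables have a value."""
    points = []
    for sample in table.values():
        x, y = sample.get(var_x), sample.get(var_y)
        if x is not None and y is not None:
            points.append((x, y))
    return points


points = data_points(table, "D", "E")
```

Only samples with a value for both variables contribute a point, so sample S4 is skipped in this example.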
  • a variety of methodologies can be used to calculate the strength of the association between a pair of variables. The following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships.
  • One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points. This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
  • Another methodology is to consider qualitative characteristics of the data 42. For instance, for empirical data, confidence in the data measurement itself may be taken into consideration in deciding whether an association should be valid to start with.
  • Another exemplary methodology uses a linear regression model to process the data points 44. Still referring to FIG. 2B, a line 46 that best fits all the data points 44 is plotted.
  • the slope of the line 46, sometimes called the regression coefficient (sometimes denoted as β), is calculated.
  • the value of β indicates how much change in the variable on the y-axis can be expected per unit of change in the variable on the x-axis.
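As a rough sketch of how the slope of the best-fit line might be obtained, the following uses ordinary least squares. The choice of ordinary least squares and the function name `regression_slope` are assumptions made for illustration; the text only requires a line that best fits the data points.

```python
def regression_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) data points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("x values are constant; the slope is undefined")
    return sxy / sxx


# Illustrative data points lying exactly on the line y = 2x.
slope = regression_slope([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```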
  • the linear regression model can also be used to compute a correlation coefficient.
  • the technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
  • One technique computes a correlation coefficient with a value between -1 and 1.
  • a correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope
  • a correlation coefficient of -1 indicates a perfect linear inverse correlation
  • a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables.
  • the correlation coefficient is Pearson's correlation coefficient.
  • the Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals.
  • the correlation coefficient is a Spearman Rank correlation coefficient.
  • the Spearman Rank correlation coefficient can be a more appropriate coefficient than the Pearson correlation coefficient when actual numerical values cannot be assigned to variables, but a rank order is assigned to each sample case of each variable.
  • the square of the correlation coefficient, r² (typically referred to as the coefficient of determination), can be used.
  • the value of r² ranges between 0 and 1. Because r² is the square of the correlation coefficient, it is never negative regardless of the sign of the coefficient, and it tends to accentuate the differences between strongly and weakly correlated pairs. For example, a correlation coefficient, r, of 0.5 has an r² of 0.25, whereas an r of greater than 0.7 has an r² of greater than 0.49 (roughly 0.5).
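The correlation measures named above can be sketched in plain Python as follows. The function names are illustrative, ties in the Spearman ranks are given their average rank (a common convention that the text does not specify), and a variable with constant values would cause a division by zero in this simplified version.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 9.0, 16.0]   # monotonic in xs, but not linear
r = pearson(xs, ys)
r_squared = r ** 2           # coefficient of determination
rho = spearman(xs, ys)
```

In this example the rank-based coefficient reaches 1 because the relationship is perfectly monotonic, while the Pearson coefficient stays slightly below 1 because the relationship is not linear.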
  • Another technique for computing a correlation coefficient uses a nonlinear regression model.
  • Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
  • Another technique computes the entropy (H) of the variables and the mutual information (MI) between each pair of variables.
  • the entropy of a variable is a measure of the information content in that variable.
  • Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
  • Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges. The proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$. As n approaches infinity, the histogram increasingly models the probability density function for the variable.
  • Entropy can be calculated using the following equation:
  • $H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
  • $MI(A, B) = H(A) - H(A \mid B)$
  • or, equivalently, $MI(A, B) = H(A) + H(B) - H(A, B)$.
  • a mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately.
  • a higher mutual information between two variables indicates that one variable is predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
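A minimal sketch of the histogram-based entropy and mutual information calculation follows. It assumes equal-width sub-ranges and base-2 logarithms, and it uses the identity MI(A, B) = H(A) + H(B) - H(A, B); the function names and the default number of sub-ranges are illustrative choices rather than details taken from the text.

```python
import math
from collections import Counter


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width sub-ranges of its range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last sub-range, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy in bits of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys, n_bins=2):
    """MI(A, B) = H(A) + H(B) - H(A, B), estimated from histograms."""
    bx, by = discretize(xs, n_bins), discretize(ys, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


# One variable fully determines the other: MI equals the entropy of either one.
mi_dependent = mutual_information([0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0])
# Every combination of sub-ranges is equally likely: MI is zero.
mi_independent = mutual_information([0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0])
```

With real data the number of sub-ranges strongly affects the estimate, so it would normally be chosen with the sample size in mind.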
  • the computer system 20 can use the above-described equations to compute a mutual information relationship between pairs of genes.
  • FIG. 2C shows an exemplary tabular representation 50 of data in the data source that corresponds to a second condition of interest.
  • the same measured variables A, B, C, D, and E are represented on the x-axis in columns.
  • the sample cases S5, S6, S7, and S8 on the y-axis, which all correspond to the second condition, are represented in rows.
  • table 50 contains data values 52 (VAL16 - VAL32) for the variables, and each pair of the data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot.
  • FIG. 2D shows an embodiment of an exemplary scatter plot 53 of the data points 54 produced by data 52 for variables D and E.
  • the data points are (VAL26, VAL29), (VAL27, VAL30), and (VAL28, VAL31). Scatter plots can be produced for each pair of variables in like manner.
  • a line 56 can also be plotted that best fits all the data points 54, and a slope of the line 56 (β') can be calculated.
  • a correlation coefficient (r') can also be computed, e.g., through a linear or non-linear regression model. It is noted that the invention contemplates more than two conditions of interest, and the method of generating a strength for a given pair of variables can be repeated any number of times to calculate more strengths.
  • a score can be generated in response to both strengths.
  • the score may be formulated to evaluate how different or similar the association between the two variables is under the conditions of interest.
  • FIGS. 3A-3C show three possibilities when the lines 46 and 56 are compared with each other.
  • the line 46 is the line that best fits all the data points 44 based on data 42 of a pair of variables (D and E) under the first condition of interest.
  • the line 56 is the line that best fits all the data points 54 based on data 52 of a pair of variables under the second condition of interest.
  • In the first scenario, depicted in FIG. 3A, the slopes of the two lines 46 and 56 are of the same sign, i.e., both are positive or both are negative. In the particular example shown in FIG. 3A, both slopes are positive. This indicates that the association between the pair of variables under the two conditions may be similar.
  • In the second scenario, depicted in FIG. 3B, the slopes of the two lines 46 and 56 are of opposite signs, i.e., one is positive and the other is negative.
  • the scenario represented by FIG. 3B also includes the case where one of the slopes is zero, meaning no correlation under one of the conditions, while the other slope is nonzero.
  • FIG. 3B indicates that the correlation between the pair of variables under the two conditions may be dissimilar.
  • a third scenario is depicted in FIG. 3C, where both slopes of the lines 46 and 56 are zero. This means that the two variables are statistically not correlated under either condition and are statistically independent of each other.
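The three scenarios can be expressed as a small classification routine. This is a sketch only: the function name, the returned labels, and the tolerance `tol` used to decide when a slope counts as zero are all assumptions introduced for illustration.

```python
def compare_slopes(slope_1, slope_2, tol=1e-9):
    """Classify a pair of regression slopes into the scenarios of FIGS. 3A-3C."""
    zero_1 = abs(slope_1) <= tol
    zero_2 = abs(slope_2) <= tol
    if zero_1 and zero_2:
        return "uncorrelated"   # FIG. 3C: both slopes are zero
    if zero_1 or zero_2 or (slope_1 > 0) != (slope_2 > 0):
        return "dissimilar"     # FIG. 3B: opposite signs, or exactly one zero slope
    return "similar"            # FIG. 3A: both slopes share the same sign


labels = [
    compare_slopes(1.2, 0.4),
    compare_slopes(1.2, -0.4),
    compare_slopes(0.0, 0.7),
    compare_slopes(0.0, 0.0),
]
```

With noisy measurements an exact zero slope is unlikely, so a practical version would probably replace the fixed tolerance with a statistical test on the slope.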
  • the score (S) may be formulated to accentuate similar or dissimilar correlation between a pair of variables under multiple conditions or the lack of correlation thereof.
  • the score (S) may be generated in response to regression coefficients (β), correlation coefficients (r), another indicator of strength, or a combination of any of the above.
  • the score (S) is calculated through the following equation:
  • the score (S) is to be calculated through the following equation:
  • the score (S) may be generated based on any combination of two or more of the calculated strengths (e.g. all the calculated strengths).
  • Methods of computing the score are not limited to these exemplary embodiments. Other ways, including other equations and formulas, for generating a score are contemplated by this invention.
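To make the idea of combining strengths concrete, here are two purely hypothetical scoring functions. Neither is the equation referred to in the text; they are stand-ins chosen only to show how one formula can accentuate similar associations and another can accentuate dissimilar ones. The names `product_score` and `spread_score` are invented for this sketch.

```python
def product_score(strengths):
    """Hypothetical score: product of the strengths.

    The result is large and positive when the strengths are all strong and
    share a sign, so it accentuates similar associations across conditions.
    """
    result = 1.0
    for strength in strengths:
        result *= strength
    return result


def spread_score(strengths):
    """Hypothetical score: gap between the largest and smallest strength.

    The result is large when the association differs across conditions.
    """
    return max(strengths) - min(strengths)


# Illustrative correlation coefficients for one pair of variables.
similar = product_score([0.8, 0.5])        # same sign under both conditions
dissimilar = spread_score([0.8, -0.6])     # opposite signs under the two conditions
```

Both functions accept any number of strengths, so they extend directly to more than two conditions of interest.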
  • a way of displaying the score is to build a network, e.g., a relevance network based on the scores generated. Because a relevance network depicts a relationship between two variables, a minimum of one score is needed.
  • FIG. 4A shows an example of a relevance network 110 where a link 100 designated by a score (e.g. 0.4) for variables D and E links the pair. Each of the variables D and E is shown as a node 180 in the network. When multiple scores are generated for multiple pairs of variables, a relevance network can be built based on all the scores generated.
  • FIG. 4B shows an example of a relevance network 110 that graphically links each of the variables A-E with a link 100.
  • Each variable A, B, C, D, and E is shown as a node 180 in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E.
  • Each link 100 has an assigned value (e.g. the score S) characterizing the relationship between pairs of variables.
  • FIG. 5 shows another way of displaying or organizing multiple scores generated in accordance with the invention.
  • An exemplary matrix 104, called a measure triangle, tabulates score values 108 assigned to each of the links 100 of FIG. 4B.
  • the measure triangle 104 places the variables A, B, C, D, and E on both the x- and y-axes.
  • Each value 108 in the measure triangle 104 represents the score determined for the association between the respective pair of variables.
  • the values 108 shown are exemplary and selected only for illustrating the principles of the invention.
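A measure triangle can be sketched as a mapping from each unordered pair of variables to its score. The function name `measure_triangle` and the placeholder scoring function below are assumptions for illustration; in practice the score would come from the strengths calculated under each condition of interest.

```python
from itertools import combinations


def measure_triangle(variables, score_fn):
    """Map each unordered pair of variables to its score, one entry per link."""
    return {(a, b): score_fn(a, b) for a, b in combinations(variables, 2)}


def placeholder_score(a, b):
    """Hypothetical stand-in score, selected only for illustration."""
    return abs(ord(a) - ord(b)) / 10


variables = ["A", "B", "C", "D", "E"]
triangle = measure_triangle(variables, placeholder_score)
```

Five variables yield ten pairs, which matches the ten links of a fully connected network of five nodes.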
  • FIG. 6 shows an exemplary process 15 for finding relationships among the variables D and E under two conditions of interest according to the principles of the invention.
  • the process 15 obtains (step 60) a set of data that correspond to the first condition of interest from the data source 30.
  • the data in the data set includes values for various variables for sample cases.
  • the computer system 20 organizes (step 62) the obtained data in the data set.
  • One exemplary data organization is the tabular representation 40 shown in FIG. 2A.
  • the computer system 20 associates (step 64) two variables in the data set, in this case, variables D and E.
  • the computer system 20 calculates (step 68) the strength of the association between variables D and E.
  • strength is an indication of how closely the variables are related.
  • a strong association indicates that the variables are closely related; a weak association indicates a low or no relationship between the variables.
  • Variables can be related to each other in various ways.
  • variables can be related through physiology; for example, the serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide.
  • Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils.
  • Some variables can be directly or indirectly related to each other through other variables.
  • An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
  • Other variables can have a relationship with each other relating to a pathologic condition.
  • An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state.
  • Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule.
  • the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
  • the exemplary process 15 continues with regard to the second condition of interest.
  • the process 15 obtains (step 70) a set of data that correspond to the second condition of interest from the data source 30.
  • the data in the data set includes values for various variables for sample cases.
  • the computer system 20 organizes (step 72) the obtained data in the data set.
  • One exemplary data organization is the tabular representation 50 shown in FIG. 2C.
  • the computer system 20 associates (step 74) two variables in the data set, in this case, variables D and E.
  • the computer system 20 calculates (step 78) the strength of the association between variables D and E. As in step 68, strength is an indication of how closely the variables are related, in this case, under the second condition of interest.
  • After both strengths of association between the variables D and E under the first and second conditions have been calculated, the computer system 20 generates a score according to a predetermined formula or equation (step 80). The score may be formulated to evaluate how different or similar the association between the two variables is under the conditions of interest. The computer system 20 may then display the resulting score in step 90 at a client system.
  • the computer system 20 optionally constructs (step 84) a graphic network of variables (e.g. a relevance network) using the score generated in step 80.
  • the computer system 20 optionally generates more scores for other pairs of the variables in the data source 30 (step 82). These scores may be generated using the same process outlined for the first score.
  • the computer system 20 may further build a graphic network such as a relevance network (step 84) based on the multiple scores generated in step 82.
  • all variables in the data source 30 are paired with each other to generate multiple scores, and in the network of variables based on all of the scores generated, each variable is linked to every other variable (e.g., see FIG. 4B).
  • the result may be displayed in step 90 at the client system.
  • the computer system 20 evaluates or screens (step 92) the score for a pair of variables against a predetermined criterion.
  • the criterion can be a threshold value.
  • the computer system 20 first generates a criterion (step 94) if one does not already exist, then compares the score with the criterion (step 96). It then removes (step 98) the score that fails to satisfy the predetermined criterion.
  • the predetermined criterion can require the score of the association between each pair of variables to be above the threshold value.
  • the predetermined criterion can require the score of the association for a variable pair to be below the threshold value in order for that score to remain.
  • the computer system 20 also removes (step 102) each variable that has no score after step 98; that is, all scores based on the associations of that variable with another variable fail to satisfy the criterion.
  • the remaining variables form one or more graphic networks such as relevance networks.
  • each network is displayed, e.g., at the client system.
  • the removal of associations and variables can divide the network of variables into smaller, separate networks.
  • Each such smaller network is a relevance network because that smaller network represents a group of correlated variables under the conditions of interest.
  • Each variable in that smaller network has an association with at least one other variable in that network whose score satisfies the criterion.
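The screening steps described above (removing scores that fail the criterion, dropping variables left without any score, and splitting the remainder into separate networks) can be sketched as follows. The sketch assumes the criterion is "score above the threshold value"; the function name and the example scores are hypothetical.

```python
def relevance_networks(scores, threshold):
    """Split variables into relevance networks.

    scores maps (variable_a, variable_b) to a score. Links whose score is not
    above the threshold are removed, variables left without any link are
    dropped, and the remaining variables are grouped into connected networks.
    """
    neighbors = {}
    for (a, b), score in scores.items():
        if score > threshold:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    networks, visited = [], set()
    for start in neighbors:
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:                      # depth-first walk over surviving links
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        visited |= component
        networks.append(component)
    return networks


# Hypothetical scores for a few pairs among the variables A-E.
example_scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.2,
    ("C", "D"): 0.3, ("D", "E"): 0.8,
}
networks = relevance_networks(example_scores, threshold=0.5)
```

Here variable C has no score above the threshold, so it is removed, and the remaining variables fall into two separate networks.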
  • the criterion may cause the removal of none, one, or multiple scores without the removal of any variables.
  • the relevance network includes all of the variables in the data set.
  • scores may optionally be screened against a criterion.
  • the screen operates as a filter that removes weakly correlated or non-correlated associations and variables from the network of variables to produce one or more relevance networks.
  • the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
  • the criterion is a threshold value against which the strength of each association is measured.
  • the threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
  • Another exemplary technique for setting the threshold value utilizes data permutation.
  • the data for each variable under each condition of interest is first randomly permuted, then the scores for each pair of variables with permuted data are recalculated, and the threshold value is set based on the new score or scores.
  • One example is to set the threshold value equal to or several times greater than the highest value of the score obtained through random permutation of the data.
  • Various techniques for establishing the threshold value may be combined. For example, after the highest value of the score is obtained through data permutation, trial and error may be used until the number of variables left is in a manageable range.
  • the manner of permuting the data of each variable is independent of the manner used for each other variable.
  • a permutation entails redistributing the original data values among the samples under the same condition of interest.
  • the permutation of the data creates new data points between variables for each condition of interest.
  • FIG. 8 shows an exemplary permutation of the data in table 40, which is originally shown in FIG. 2A.
  • the permutation shown in FIG. 8 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in FIG. 2A, namely (VAL2, VAL6) and (VAL3, VAL8).
  • data are permuted within a given variable across the samples.
  • an original data value for variable A for example, will only be shuffled among the samples to still represent a value for variable A, and not for any other variables.
  • the average value of a given variable under a given condition will remain the same in this embodiment.
  • FIG. 9 illustrates a preferred embodiment of data permutation. Each variable's data under one condition of interest, as exemplified here by variable A's data under the second condition of interest as originally listed in the table 50 of FIG. 2C, is reproduced here.
  • Each data entry of variable A is randomly assigned a corresponding value (e.g. a number between 1 and 100).
  • VAL16 is randomly assigned, in this example, the corresponding value of "22."
  • VAL17 is randomly assigned, in this example, the corresponding value of "22."
  • VAL18 is sorted according to the order of the assigned corresponding value. One may choose to sort the corresponding values in ascending order (as in FIG. 9), descending order, or any other order, and reassign each corresponding value, along with the data entry associated with it, to the sample cases in sequence (e.g., from S5 to S8 as shown in FIG. 9). Because each original data entry travels with its assigned corresponding value, it is permuted as a result.
  • VAL18, which has been randomly assigned a corresponding value of "15," is shuffled to Sample 5 because "15" ranks first in the ascending order of all corresponding values and Sample 5 is the first case in sequence.
  • the system may display graphical representations of the network (e.g. relevance networks).
  • the relevance networks 110 shown in FIGS. 4A and 4B are exemplary.
  • Application of the invention works with large numbers of variables.
  • the computer system 20 can execute graph layout software.
  • An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
  • execution of the data mining software causes the computer system 20 to access data in the data source 30.
  • the data mining software first accesses data that correspond to the first condition of interest and associates two variables with each other. There may be multiple samples under the first condition that contain data for that pair of variables.
  • the software calculates, based on all the data of the pair of variables under the first condition, a strength of the association according to a predetermined formula or equation.
  • the software then accesses data that correspond to the second condition of interest and associates the same pair of variables under the second condition of interest. Similarly, there may be multiple samples under the second condition that contain data for the pair of variables.
  • a strength of the new association, based on all the data for the pair of variables under the second condition, is calculated by the software according to a predetermined formula or equation, which may be the same formula or equation used to calculate the first strength.
  • the strength of the association can be embodied in a correlation coefficient, the slope of the line that fits all the data points for the pair of variables, or a combination of both.
  • the software then generates a score for that pair of variables based on the two strengths calculated from data corresponding to two conditions of interest. If there are more than two conditions of interest, the software repeats the steps of accessing the data corresponding to a new condition of interest (e.g., a third condition), associating the same pair of variables, and calculating a new strength of the association for the pair under the new condition.
  • a score designed to incorporate the strength for the association between the pair of variables under the new condition may then be generated.
  • the score is derived from the strengths of the association between a pair of variables corresponding to at least two conditions of interest.
  • the score may be designed to evaluate how different or similar the association between the two variables is under conditions of interest.
  • a score may be designed to measure the correlation between the body length and the body mass under different gravitational conditions. The score may be designed to examine if the change in gravity causes the two variables, body length and body mass, to relate to each other differently. As a way of illustration, if the data show that a mouse with a larger body mass tends to be longer on the earth than other mice on the earth, but a mouse with a larger mass is not any longer than other mice in space, then the score would be high. A high value for the score in this case indicates that gravity may be a factor in any correlation between the mouse's body mass and body length.
  • a score may be designed to accentuate difference in the correlation between the expressions of each pair under two disease states. A high score in this case indicates that the two genes might be functionally linked in developing one or both of the diseases.
  • the data mining software may output each score for display (e.g., at the computer system).
  • the displayed output makes it readily apparent what kind of correlation between the two variables exists under the at least two conditions and therefore indicates a potentially worthy target for research.
  • the software may continue to generate scores for any number of the rest of the variables until each variable has been paired with another one.
  • An advantage of the invention is the lack of bias or assumption in choosing the association. Because each possible pair of variables may be examined, a method in accordance with the invention can be used to discover unknown links between seemingly unrelated variables affected by or causing certain conditions.
  • the software may then establish a criterion or threshold value and screen at least one of the generated scores against the criterion or threshold.
  • the software may randomly permute the data to establish a threshold.
  • the data mining software may group together variables into one or more separate networks. It may include in the network each pair of variables whose score meets the criterion or threshold. However, even for the purpose of constructing a graphic network of variables, the establishment of a criterion or threshold is not necessary. For example, in certain preferred embodiments, each score already indicates how similar or dissimilar the associations between two variables are under at least two different conditions. If one or more of such scores are used to indicate how various nodes (i.e., variables) are linked in a graphic fashion as exemplified in FIGS. 4A and 4B, a criterion for the score is optional because the score is self-explanatory in indicating what kind of correlation between the two variables exists under the at least two conditions. If, however, a threshold is established, it may be used to further screen out correlations that are considered less significant.
  • the software may further output each network for display (e.g., at the computer system).
  • relevance networks may be produced to exhibit potential correlation between biological traits, e.g., genetic compositions, when normal cells are compared to diseased cells, or one type of diseased cell is compared to another.
  • Relevance networks may also be produced to expose potential correlation between biological traits before and after an event, such as the start of a drug therapy.
  • ALL acute lymphocytic leukemia
  • AML acute myelogenous leukemia
  • a disease state is an alteration in the normal cellular genetic regulatory network.
  • a disease state may represent a change in these gene-gene or gene-protein interactions without significant changes in the absolute expression levels of any genes.
  • the tumor suppressor p53 may upregulate p21/waf1 to control progression into the cell cycle. Though mutant p53 may cause biologically significant changes in the downstream control of effector genes, the absolute expression values of those genes may not vary significantly from normal.
  • Models of genetic regulatory networks can be ascertained from gene-expression data sets, with varying degrees of concordance with biologically-proven control networks.
  • a central hypothesis is that finding differences between the models for two diseases, rather than the differences in individual gene expression levels, will elucidate the critical biological differences between the diseases.
  • a score was formulated to reflect the above hypotheses and the data for each gene pair generated a score for that pair. Then, a threshold value was established through random permutation of the original data. A relevance network was then generated using scores that met the threshold. A total of 24 genes were included in the network. [098] Specifically, publicly available expression levels of 7,129 expressed sequence tags measured in 47 patients with ALL and 25 patients with AML were obtained from the Whitehead Institute (http://www.genome.wi.mit.edu/MPR). A computer system considered every possible pair of genes as a potential biological relationship.
  • the computer system constructed two scatter plots: one for the ALL patients (first condition of interest) and one for the AML patients (second condition of interest).
  • In the ALL scatter plot for the association between gene A and gene B, each of the 47 ALL patients is represented as a data point 44 in a two-dimensional space similar to the scatter plot 43 in FIG. 2B, where the x-coordinate specifies the expression measurement for gene A and the y-coordinate specifies the expression measurement for gene B.
  • $S(\text{geneA}, \text{geneB}) = \operatorname{abs}(M_{AML} - M_{ALL}) \times \operatorname{abs}(R_{AML}) \times \operatorname{abs}(R_{ALL})$
  • abs is the absolute value function
  • $M_{ALL}$ is the slope using the ALL patient-points
  • $M_{AML}$ is the slope using the AML patient-points
  • $R_{ALL}$ is the square of the correlation coefficient ($r^2$) for the regression using the ALL patient-points (ranging from 0 to 1.0), and $R_{AML}$ is ($r^2$) using the AML patient-points.
  • the score S was based on (1) the correlation coefficients in the regression models for both diseases, and (2) the difference in slopes between the two linear regression models under the two diseases. A gene-gene association would have a high S, with a maximum of 2.0, when there was a large difference in the slopes between the ALL and AML models and high correlation coefficients in both regression models.
  • FIG. 10 shows, in bold line, the distribution 140 of S based on the actual data.
  • the x-axis of the diagram in FIG. 10 shows the value for score (based on actual or permuted data).
  • the y-axis indicates the count of the score.
  • a threshold 130 for S was set as three times the highest S', or about 0.3. All gene-gene associations with a score below threshold 130 were eliminated. There were 15 gene-gene associations that had an S with a value higher than 0.3, connecting 24 genes. These 24 genes are listed in Tables I-IV.
  • This technique and its application to the leukemia data set are significant for several reasons.
  • this technique allows comparison of the regulatory networks of multiple forms of leukemia. It is noted that, while two forms of leukemia are examined in this example, the method can easily be expanded to include more than two forms of disease.
  • Table III lists 8 of the 24 genes that could possibly have a role in oncogenesis. None of these 24 genes were found in the Whitehead analysis, where genes were listed based on whether their absolute expression measurements were different between ALL and AML. [0106] For example, two genes listed in Table I have been known to be involved in chromosomal translocations. The gene EAP is fused with AML1 in the t(3;21) translocation seen in acute and chronic myeloid leukemias. L-myc and rlf can be found to be fused in small-cell lung cancer, and the fusion product has been found to be deregulated compared to the normal L-myc gene. Gene rearrangements involving L-myc have also been seen in multiple myeloma.
  • a dotted line fits the hollow dots and has a slope of -0.014 in radians.
  • the $r^2$ value and the slope indicate that the two genes were not positively (rather, negatively) correlated in patients with AML. As a result, their score was above the threshold value.
  • these two genes may be functionally linked in ALL and not AML, because (1) the two genes share a 5'-region susceptible to trinucleotide repeats, and thus may also share a common regulatory region, (2) the 5'-regions may be affected by cancerous cellular machinery affecting the trinucleotide repeats, causing loss of regulation in AML, or (3) in normal individuals, these regions may affect expression of the downstream products similarly, and may serve as a marker linked with the development of leukemia.
  • the method of invention is particularly advantageous in investigating complex diseases such as leukemia because, with potentially multiple etiologies and disruptions in several regulatory networks, these diseases are most likely caused by abnormalities in gene regulation, which may not be manifested through individual genes with different expression levels.
  • samples from two diseases can now be analyzed by examining all gene-gene interactions and not just single gene expression levels, yielding quantitatively strong associations that behave significantly differently between the diseases. This is ideal for identifying key controlling differentiators, which are crucial both for clinical diagnosis as well as improved pathophysiologic understanding.
  • the present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture.
  • the article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape.
  • the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, or PROLOG, or in any byte code language such as JAVA.
  • the software programs may be stored on or in one or more articles of manufacture as object code.
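As a minimal sketch of the score S defined above for the leukemia example, the following Python fragment multiplies the absolute difference of the two slopes by the two r² values and then screens the resulting scores against a threshold. The function names, the gene labels, and all numeric values are hypothetical and serve only as illustration; none of them is taken from the data set.

```python
def score(m_all, r_all, m_aml, r_aml):
    """S = abs(M_AML - M_ALL) * abs(R_AML) * abs(R_ALL).

    m_* are the regression slopes and r_* the r^2 values for each condition.
    """
    return abs(m_aml - m_all) * abs(r_aml) * abs(r_all)


def screen(scores, threshold):
    """Keep only the associations whose score is not below the threshold."""
    return {pair: s for pair, s in scores.items() if s >= threshold}


# Hypothetical (slope, r^2) values for two gene pairs under ALL and AML.
fits = {
    ("gene1", "gene2"): {"ALL": (0.60, 0.81), "AML": (-0.10, 0.64)},
    ("gene1", "gene3"): {"ALL": (0.20, 0.30), "AML": (0.25, 0.20)},
}

scores = {
    pair: score(f["ALL"][0], f["ALL"][1], f["AML"][0], f["AML"][1])
    for pair, f in fits.items()
}

# Associations scoring below the (hypothetical) threshold are eliminated.
kept = screen(scores, threshold=0.3)
print(kept)
```

In this illustration only the first pair survives the screen, because its slopes differ markedly between the two conditions while both fits remain strong.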

Abstract

Described are a system and method for mining data in a database to discover significant relationships among variables in the data that correspond to at least two conditions of interest. An association is established between a pair of variables based on data on a first condition. The strength of the association is calculated. A second strength of association between the same variables based on data on a second condition is also calculated. Based on both strengths, a score is generated. The score may be evaluated against a predetermined criterion. All associations with a score that satisfies the criterion are included in one or more networks that can be graphically represented.

Description

A SYSTEM AND METHOD FOR DATA ASSOCIATION AND EVALUATION
Related Application
[001] This application claims the benefit of U.S. Provisional Application, Serial No. 60/247,110, filed November 10, 2000, incorporated herein by reference.
Field of the Invention
[002] The invention relates generally to data mining or processing. More specifically, the invention relates to a system and method for mining data from multiple data sets to identify potentially meaningful relationships among variables in the data sets.
Background of the Invention
[003] With data accumulating in databases in ever increasing amounts, the task of extracting useful information from the data, called data mining, has grown into an important industry. Data mining techniques aim to identify significant relationships among variables in the data. Functional genomics, for example, which aims to discover genes responsible for phenotypes, has seen a tremendous increase in data as technologies for systematic genome sequencing and microarray analysis are propagated. Bioinformatics, answering the need for processing biological data, employs various techniques to mine biological databases containing sequences, expression data and biological measurements to identify clusters of genes having related functionality.
[004] Current approaches in functional genomics that try to identify genes responsible for two different phenotypes have focused on calculating the "fold" differences in gene expression between the two phenotypes, e.g., two diseases. However, generating a list of individual genes that have different expression levels in two diseases does not necessarily help determine how regulatory differences contribute to disease etiology. Genes with the highest difference in expression levels may represent dissimilarity at the most downstream levels, compared with the subtle or no changes in expression levels in the central controlling genes.
[005] Therefore, there remains a need for a data mining technique that examines interactions between pairs of variables in different contexts in order to uncover the underlying relationships between the variables as the context changes. In the application in functional genomics, there is a need for a technique that examines gene-gene interaction in a network approach by looking at the differences in the models for diseases instead of differences in individual gene expression levels.
Summary of the Invention
[006] The present invention relates to a system and method for producing a network of related variables. An objective of the invention is to uncover correlation or association between variables that are significant in different contexts, or conditions of interest.
[007] In one aspect, the invention features a method that evaluates a relation between multiple variables under two conditions of interest. According to the method, data for a plurality of variables that represent or correspond to a first condition of interest are obtained. An association between a pair of variables is established. From the data, a first strength of the association between that pair of variables is calculated. The above steps are repeated to calculate a second strength of association between the same pair of variables based on their data that represent or correspond to a second condition of interest. A score is generated based on at least the first and second strengths to help evaluate the association of the pair of variables under both conditions of interest. Another strength of association between the same pair of variables based on their data corresponding to another condition of interest can be calculated. There is no limit on how many strengths in connection with conditions of interest that can be calculated. A score can be generated based on all the calculated strengths.
[008] Optionally, a network of those variables may be produced based on the generated score. Or, multiple scores may be generated for multiple pairs of variables, e.g., each possible pair of variables. And a network may be produced based on these multiple scores. The variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.). The network of variables can be graphically displayed.
[009] In one embodiment, the strength of the association between a pair of variables under a certain condition is calculated from a correlation coefficient for the pair of variables. In another embodiment, the strength is based on the slope of the line produced through a linear regression model, i.e., the line that best fits the data for the pair of variables. In a preferred embodiment, the strength is based on both the correlation coefficient and the slope.
[0010] A criterion or threshold may also be established and any number of the scores can be evaluated against the criterion. A network of variables may be produced to include only variables that, in connection with other variables, have a score that meets the criterion.
[0011] In one embodiment, the criterion or threshold value is established by randomly permuting the data for each pair of variables for each condition of interest. A score for each pair of variables is recalculated from the permuted data. The steps of permuting and calculating may be repeated a predetermined number of times. The criterion or threshold value may be set equal to or greater than the highest value for the recalculated score based on the permuted data.
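One way to sketch the permutation procedure of this embodiment in Python is shown below. Each variable's values are shuffled across the samples of one condition, independently of every other variable, the score for each pair is recalculated, and the highest permuted score over a predetermined number of rounds is scaled into a threshold. The helper names (`fit`, `pair_score`, `permuted`, `permutation_threshold`), the particular score (absolute slope difference times the two r² values), the number of rounds, the multiplier, and the sample data are all assumptions made for illustration.

```python
import itertools
import random


def fit(xs, ys):
    """Least-squares slope and r^2 of y on x; (0.0, 0.0) if a variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def pair_score(cond1, cond2, a, b):
    """Assumed score: slope difference weighted by both r^2 values."""
    m1, r1 = fit(cond1[a], cond1[b])
    m2, r2 = fit(cond2[a], cond2[b])
    return abs(m2 - m1) * r1 * r2


def permuted(cond, rng):
    """Shuffle each variable's values across samples, independently per variable."""
    return {var: rng.sample(vals, len(vals)) for var, vals in cond.items()}


def permutation_threshold(cond1, cond2, rounds=100, multiplier=1.0, seed=0):
    """Return multiplier times the highest score seen on permuted data."""
    rng = random.Random(seed)
    highest = 0.0
    for _ in range(rounds):
        p1, p2 = permuted(cond1, rng), permuted(cond2, rng)
        for a, b in itertools.combinations(sorted(cond1), 2):
            highest = max(highest, pair_score(p1, p2, a, b))
    return multiplier * highest


# Hypothetical data: variable -> values across the samples of one condition.
cond1 = {"A": [1, 2, 3, 4, 5], "B": [2, 4, 5, 4, 5], "C": [5, 3, 4, 1, 2]}
cond2 = {"A": [2, 1, 4, 3, 6], "B": [9, 7, 6, 4, 1], "C": [1, 2, 2, 3, 5]}

threshold = permutation_threshold(cond1, cond2, rounds=50, multiplier=1.0)
print(threshold)
```

A multiplier of 1.0 sets the threshold equal to the highest permuted score; a larger multiplier sets it greater than that value.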
[0012] In another aspect, the invention relates to a system for evaluating a relation between a plurality of variables under two conditions of interest. The system includes memory storing data for the plurality of variables under at least a first and a second condition of interest. An associator establishes an association between a pair of variables under at least the first and second conditions respectively. A calculator, in communication with the memory and the associator, calculates at least a first and a second strength of the association between the pair of variables under the first and second conditions respectively. And a score generator generates a score for the pair of variables based on at least the first and second strengths. The calculator may further calculate another strength of association between the same pair of variables based on their data corresponding to another condition of interest. There is no limit on how many strengths in connection with a condition of interest that the calculator is capable of calculating. The score generator may generate a score based on all the calculated strengths.
[0013] The system may optionally include a network generator that generates a network of variables based on at least the generated score or scores. The system can further include a display in communication with the score generator that displays at least the score. The display may also be in communication with the network generator and display the network.
[0014] The system may also include a criterion setter that establishes a criterion against which the score is evaluated. The criterion setter may establish the criterion through recalculating the score based on randomly permuted data. The network generator may choose to generate a network of variables based on each score that meets the criterion.
[0015] In another aspect, the method of the invention can be used to evaluate a relation between a plurality of genes under two conditions of interest. Data on the expression levels of a plurality of genes that represent or correspond to a first condition of interest are obtained. An association between the expression levels of a pair of genes is established. From the data, a first strength of the association between the expression levels of that pair of genes is calculated. The above steps are repeated to calculate at least a second strength of association between the expression levels of the same pair of genes based on their data that represent or correspond to a second condition of interest. A score for the pair of genes is generated based on both the first and the second strengths to help evaluate the association between the expression levels of the pair of variables under both conditions of interest. Another strength of association between the expression levels of the same pair of genes based on their data corresponding to another condition of interest can be calculated. There is no limit on how many strengths in connection with conditions of interest that can be calculated. A score can be generated based on all the calculated strengths.
Optionally, a network of those genes may be produced based on the generated score. Or, multiple scores may be generated for multiple pairs of genes, e.g., each possible pair of genes. And a network may be produced based on these multiple scores. The network of genes can be graphically displayed.
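The optional network step can be illustrated with a short Python sketch. It keeps each pair whose score meets a criterion, links the two variables of every kept pair, and reports each connected group of variables as a separate network. The function name, the variable labels, the scores, and the threshold are hypothetical; treating each connected group as one network is an assumption consistent with the description of separate networks.

```python
from collections import defaultdict


def relevance_networks(scores, threshold):
    """Group variables into networks using only pairs whose score meets the threshold.

    scores maps a (variable, variable) pair to its score. Variables with no
    qualifying pair do not appear in any network.
    """
    adjacency = defaultdict(set)
    for (a, b), s in scores.items():
        if s >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk collecting every variable reachable from start.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return networks


# Hypothetical scores for pairs among variables A through E.
scores = {
    ("A", "B"): 0.50,
    ("B", "C"): 0.10,
    ("C", "D"): 0.40,
    ("A", "E"): 0.05,
}
networks = relevance_networks(scores, threshold=0.3)
print(networks)
```

Here the pair B-C falls below the threshold, so the variables divide into two separate networks, and variable E, which has no qualifying score, is removed.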
Brief Description of the Drawings
[0014] The invention is pointed out with particularity in the appended claims. The advantages of the invention described above, as well as further advantages of the invention, may be better understood by reference to the following description taken in conjunction with the accompanying drawings, in which:
[0015] FIG. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention;
[0016] FIG. 2A is an embodiment of a table including data from a data source for a plurality of variables under a first condition of interest;
[0017] FIG. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2A with a line fit to the data;
[0018] FIG. 2C is an embodiment of a table including data from a data source for a plurality of variables under a second condition of interest;
[0019] FIG. 2D is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2C with a line fit to the data;
[0020] FIGS. 3A, 3B, and 3C are embodiments of a diagram where the lines from FIGS. 2B and 2D are presented;
[0021] FIG. 4A is an embodiment of a graphical representation of a network of a pair of variables based on their associations;
[0022] FIG. 4B is an embodiment of a graphical representation of a network of a plurality of variables based on their associations;
[0023] FIG. 5 is an embodiment of a matrix including examples of score values for each of the associations shown in FIG. 4B;
[0024] FIG. 6 is a flow chart of an embodiment of an exemplary process that produces a score or a network using the associations between variables under at least two conditions of interest according to the principles of the invention;
[0025] FIG. 7 is a flow chart of an embodiment of an exemplary process of evaluating a score against a criterion;
[0026] FIG. 8 is an embodiment of a table illustrating permutation of the data originally shown in the table of FIG. 2A;
[0027] FIG. 9 illustrates an exemplary process for permuting the data of a variable listed in FIG. 2C;
[0028] FIG. 10 is an embodiment of a graph illustrating both results of scores generated in accordance with the invention using actual genomic data and results from an exemplary process used to determine a threshold value for evaluating the scores;
[0029] FIG. 11 is an embodiment of a relevance network including genes identified in FIG. 10;
[0030] FIG. 12 is an embodiment of a scatter plot of actual genomic data for a pair of genes under two disease states with lines fit to the two sets of data.
Detailed Description
[0031] The invention provides a method and apparatus for mining data from databases. FIG. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30. A variety of system architectures can be used to practice the invention. The computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention. The processor in one embodiment is a 266 MHz Pentium® II processor, manufactured by Intel Corporation of Santa Clara, California. One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris®, manufactured by Sun Microsystems, Inc. of Palo Alto, California.
[0032] The data source 30 in one embodiment is a database system, e.g., ORACLE® 8 or data stored in files on a data storage device, such as a hard disk. To extract data from the data source, the processor of the computer system 20 executes data mining software. Such software can be written in any programming language, such as C, C++, etc.
[0033] The data in the data source 30 represent measurements of variables for multiple sample cases. These sample cases represent variables under at least two different conditions of interest. The conditions of interest may be conditions parallel in time or sequential in time. For example, in a medical context, the sample cases in one embodiment may be different individuals and the variables may be physical characteristics, such as weight, height, gender, race, etc., measured at the same time. The sample cases may be divided into age groups (i.e., multiple parallel conditions of interest) of 11-20, 21-30, 31-40 and so on. In a second embodiment, the same variables may be measured at various time points (i.e., sequential conditions of interest), such as before and after the occurrence of an environmental change.
[0034] In another embodiment, the measured variables are continuous variables. For example, the sample cases can be measurements of a single human subject conducted over a period of time, such as over a month with daily frequencies. The variables may be measurements of the same set of laboratory tests, such as hemoglobin, hematocrit, cholesterol and thyroxine measurements. And the data may be divided into those before the subject starts a drug therapy and afterwards (i.e., under two sequential conditions of interest).
[0035] As another example, the sample cases are RNA expression measurements of various subjects and the measured variables are different genes. The cases may be divided into two different disease conditions, e.g., two forms of leukemia.
[0036] As still another embodiment, the sample cases are corporations for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc. The data may be divided into different conditions of interest, such as according to differences in industry or market capitalization.
[0037] In general, the principles of the invention can be practiced to examine any type of data in search of relationships among various measured variables under multiple conditions of interest. The invention can mine data from any databases such as those containing customer sales transactions, commercial passenger travel information (e.g., airline), financial data, and data collected by laboratories, research facilities, commercial institutions, finance institutions, etc. An advantage is that the invention can exploit data in existing electronic databases.
[0038] FIG. 2A shows an exemplary tabular representation 40 of data in the data source that corresponds to a first condition of interest. The measured variables A, B, C, D, and E are represented on the x-axis in columns. The sample cases S1, S2, S3, and S4 on the y-axis, which all correspond to the first condition, are represented in rows. This column and row arrangement is exemplary; the sample cases and variables can appear on either the x- or y-axis and remain within the scope of the invention. In addition, the principles of the invention extend to more sample cases and variables other than those shown in FIG. 2A. The table 40 can be completely, densely, or sparsely populated with data values 42. FIG. 2A shows an exemplary data set of twenty entries wherein the table 40 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 40 lack a data value, each denoted by a dashed line.
[0039] Each pair of data values 42 appearing in the same sample in the table 40 represents a data point 44 in a scatter plot. For example, FIG. 2B shows an embodiment of an exemplary scatter plot 43 of the data points 44 produced by data 42 for variables D and E. The data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14). Scatter plots can be produced for each pair of variables, such as for variables B and E, in like manner.
[0040] A variety of methodologies can be used to calculate the strength of the association between a pair of variables. The following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships.
[0041] One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points. This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
[0042] Another methodology is to consider qualitative characteristics of the data 42. For instance, for empirical data, confidence in the data measurement itself may be taken into consideration in deciding whether an association should be valid to start with.
[0043] Another exemplary methodology uses a linear regression model to process the data points 44. Still referring to FIG. 2B, a line 46 that best fits all the data points 44 is plotted. The slope of the line 46, sometimes called the regression coefficient (sometimes denoted as β), is calculated. The value of β indicates how much change in the variable on the y-axis can be expected per unit of change in the variable on the x-axis.
[0044] The linear regression model can also be used to compute a correlation coefficient (typically denoted as r) between a pair of variables. The technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
[0045] One technique computes a correlation coefficient with a value between -1 and 1. A correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope, a correlation coefficient of -1 indicates a perfect linear inverse correlation (i.e., a relationship with a negative slope), and a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables.
[0046] In one embodiment, the correlation coefficient is Pearson's correlation coefficient.
The Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals. In another embodiment, the correlation coefficient is a Spearman Rank correlation coefficient. The Spearman Rank correlation coefficient can be a more appropriate coefficient than the Pearson correlation coefficient when actual numerical values cannot be assigned to variables, but a rank order is assigned to each sample case of each variable.
[0047] For a coefficient that is more indicative of a predictable linear relationship between two variables than r, the square of the correlation coefficient, r² (typically referred to as the coefficient of determination), can be used. The value of r² ranges between 0 and 1. Because r² is the square of the correlation coefficient, it is never negative regardless of the sign of r, and it tends to enhance the differences between weakly and highly correlated pairs. For example, a correlation coefficient r of 0.5 has an r² of 0.25, whereas an r of about 0.71 or greater has an r² of 0.5 or greater.
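By way of illustration only, the regression slope, Pearson's r, r², and a Spearman rank coefficient discussed above could be computed as in the following minimal sketch. It assumes NumPy is available; the function names `linear_association` and `spearman_r` are hypothetical and do not appear in the specification, and the rank computation assumes that no two values are tied.

```python
import numpy as np


def linear_association(x, y):
    """Return (slope, r, r_squared) for paired measurements x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares line y = slope * x + intercept (the line of best fit).
    slope, _intercept = np.polyfit(x, y, 1)
    # Pearson correlation coefficient, between -1 and 1.
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r), float(r * r)


def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r applied to ranks.

    Assumption: no tied values, so ranks are a simple ordering.
    """
    def rank(values):
        return np.argsort(np.argsort(values)).astype(float)

    return float(np.corrcoef(rank(x), rank(y))[0, 1])


# Hypothetical data points for one pair of variables.
slope, r, r_squared = linear_association([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
```

The rank-based coefficient responds to any monotonic relationship, whereas Pearson's r responds only to linear ones, which is why the two can differ on the same data.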
[0048] Another technique for computing a correlation coefficient uses a nonlinear regression model. Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
[0049] Another exemplary methodology for determining the strength of the association between a pair of variables computes entropy (H) of the variables and the mutual information between each pair of variables. The entropy of a variable is a measure of the information content in that variable. Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
[0050] Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges. The proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$. As n approaches infinity, the histogram increasingly models the probability density function for the variable.
[0051] Entropy can be calculated using the following equation:
$$H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where $\log_2$ is the base-2 logarithm. Higher entropy indicates that the data for that variable are more randomly distributed, and thus carry more information.
[0052] Mutual information can be calculated by subtracting the entropy of a first variable (A) given an occurrence of a second variable (B) from the entropy of the first variable (A), as represented by the following equation:
$$MI(A, B) = H(A) - H(A \mid B)$$
Expressed another way, mutual information can be calculated by subtracting the joint entropy of the two variables from the sum of their individual entropies:
$$MI(A, B) = H(A) + H(B) - H(A, B)$$
A mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately. A higher mutual information between two variables indicates that one variable is more predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
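The histogram-based entropy and the mutual information relation MI(A, B) = H(A) + H(B) - H(A, B) described above might be sketched as follows. This is an illustrative sketch only: it assumes NumPy, equal-width sub-ranges, and a caller-chosen number of sub-ranges `n_bins`; the function names are hypothetical.

```python
import numpy as np


def entropy(values, n_bins=10):
    """H(A) in bits, from a histogram with n_bins equal-width sub-ranges."""
    counts, _edges = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()  # empty sub-ranges contribute nothing
    return float(-np.sum(p * np.log2(p)))


def joint_entropy(a, b, n_bins=10):
    """H(A, B) in bits, from a two-dimensional histogram."""
    counts, _a_edges, _b_edges = np.histogram2d(a, b, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def mutual_information(a, b, n_bins=10):
    """MI(A, B) = H(A) + H(B) - H(A, B)."""
    return entropy(a, n_bins) + entropy(b, n_bins) - joint_entropy(a, b, n_bins)


# Hypothetical measurements of two variables over the same sample cases.
a = [0.0, 0.0, 1.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0]
mi = mutual_information(a, b, n_bins=2)  # b tells nothing about a here
```

With few samples the estimate depends strongly on the number of sub-ranges, so `n_bins` would in practice be chosen with the sample count in mind.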
[0053] In a biological context, for example, the computer system 20 can use the above- described equations to compute a mutual information relationship between pairs of genes. The higher the mutual information is between two genes, the greater the strength of the association between those genes (i.e., the more likely those genes have a biological relationship).
[0054] FIG. 2C shows an exemplary tabular representation 50 of data in the data source that corresponds to a second condition of interest. The same measured variables A, B, C, D, and E are represented on the x-axis in columns. The sample cases S5, S6, S7, and S8 on the y-axis, which all correspond to the second condition, are represented in rows. Similar to the table 40, table 50 contains data values 52 (VAL16 - VAL32) for the variables, and each pair of the data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot. FIG. 2D shows an embodiment of an exemplary scatter plot 53 of the data points 54 produced by data 52 for variables D and E. The data points are (VAL26, VAL29), (VAL27, VAL30), and (VAL28, VAL31). Scatter plots can be produced for each pair of variables in like manner.
[0055] The same methods that can be used to calculate a first strength of association between a pair of variables under the first condition of interest can be used here to calculate a second strength of association corresponding to the second condition of interest. For example, referring to FIG. 2D, a line 56 can also be plotted that best fits all the data points 54, and a slope of the line 56 (β') can be calculated. A correlation coefficient (r') can also be computed, e.g., through a linear or non-linear regression model. It is noted that the invention contemplates more than two conditions of interest, and the method of generating a strength for a given pair of variables can be repeated for each additional condition to calculate more strengths.
[0056] After at least two strengths of association of a pair of variables under at least two conditions of interest have been calculated, a score (S) can be generated in response to both strengths. The score may be formulated to evaluate how different or similar the association between the two variables is under the conditions of interest. For example, FIGS. 3A-3C show three possibilities when the lines 46 and 56 are compared with each other. The line 46 is the line that best fits all the data points 44 based on data 42 of a pair of variables (D and E) under the first condition of interest. The line 56 is the line that best fits all the data points 54 based on data 52 of a pair of variables under the second condition of interest. In FIG. 3A, the slopes of the two lines 46 and 56 are of the same sign, i.e., both are positive or both are negative. In the particular example shown in FIG. 3A, both slopes are positive. This indicates that the association between the pair of variables under the two conditions may be similar. In FIG. 3B, the slopes of the two lines 46 and 56 are of opposite signs, i.e., one is positive and the other is negative. The scenario represented by FIG. 3B also includes the case where one of the slopes is zero, meaning no correlation under one of the conditions, while the other slope is nonzero. FIG. 3B indicates that the correlation between the pair of variables under the two conditions may be dissimilar. A third scenario is depicted in FIG. 3C, where both slopes of the lines 46 and 56 are zero. This means that the two variables show no linear correlation under either condition, suggesting that they are statistically independent of each other.
[0057] The score (S) may be formulated to accentuate similar or dissimilar correlation between a pair of variables under multiple conditions, or the lack of correlation. The score (S) may be generated in response to regression coefficients (β), correlation coefficients (r), another indicator of strength, or a combination of any of the above. In one embodiment, the score (S) is calculated through the following equation:
$$S = |\beta_1 - \beta_2| \times |r_1| \times |r_2|$$
where the subscripts "1" and "2" denote derivation from data corresponding to a first and a second condition, respectively. There are three terms in this equation: $|\beta_1 - \beta_2|$, $|r_1|$, and $|r_2|$. A high value of the score (S) suggests that under both conditions of interest, one variable is highly responsive to the other and the two relate to each other in dissimilar ways under the two conditions.
[0058] In a second embodiment, the score (S) is calculated through the following equation:
$$S = |\beta_1 + \beta_2| \times |r_1| \times |r_2|$$
Similarly, the subscripts "1" and "2" denote derivation from data corresponding to a first and a second condition, respectively. A high value of the score (S) in this embodiment suggests that under both conditions of interest, one variable is highly responsive to the other and the two relate to each other in similar ways under the two conditions.
[0059] When more than two strengths of association have been calculated for a pair of variables in connection with more than two conditions of interest, the score (S) may be generated based on any combination of two or more of the calculated strengths (e.g., all the calculated strengths).
[0060] Methods of computing the score are not limited to these exemplary embodiments. Other ways, including other equations and formulas, for generating a score are contemplated by this invention.
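As a hedged illustration of the two exemplary scoring equations, the following sketch fits a least-squares line to the data points of one variable pair under each of two conditions and then combines the slopes and correlation coefficients. It assumes NumPy; the names `strength`, `difference_score`, and `similarity_score` are hypothetical labels introduced here for the first and second embodiments.

```python
import numpy as np


def strength(x, y):
    """Return (slope, r) of the best-fit line through the data points."""
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r)


def difference_score(cond1, cond2):
    """First embodiment: S = |b1 - b2| * |r1| * |r2|."""
    (b1, r1), (b2, r2) = strength(*cond1), strength(*cond2)
    return abs(b1 - b2) * abs(r1) * abs(r2)


def similarity_score(cond1, cond2):
    """Second embodiment: S = |b1 + b2| * |r1| * |r2|."""
    (b1, r1), (b2, r2) = strength(*cond1), strength(*cond2)
    return abs(b1 + b2) * abs(r1) * abs(r2)


# Hypothetical data for variables D and E: (D values, E values) per condition.
first_condition = ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])   # slope +1
second_condition = ([0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0])  # slope -1
s_diff = difference_score(first_condition, second_condition)
s_sim = similarity_score(first_condition, second_condition)
```

In this toy case the slopes are of opposite sign and both correlations are perfect, so the first score is large and the second is zero. A variable with no variation under a condition leaves r undefined, so such pairs would need to be excluded before scoring.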
[0061] A way of displaying the score is to build a network, e.g., a relevance network, based on the scores generated. Because a relevance network depicts a relationship between two variables, a minimum of one score is needed. FIG. 4A shows an example of a relevance network 110 where a link 100 designated by a score (e.g., 0.4) for variables D and E links the pair. Each of the variables D and E is shown as a node 180 in the network. When multiple scores are generated for multiple pairs of variables, a relevance network can be built based on all the scores generated. FIG. 4B shows an example of a relevance network 110 that graphically links each of the variables A-E with a link 100. Each variable A, B, C, D, and E is shown as a node 180 in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E. Each link 100 has an assigned value (e.g., the score S) characterizing the relationship between the pair of variables it connects.
[0062] FIG. 5 shows another way of displaying or organizing multiple scores generated in accordance with the invention. An exemplary matrix 104, called a measure triangle, tabulates score values 108 assigned to each of the links 100 of FIG. 4B. The measure triangle 104 places the variables A, B, C, D, and E on both the x- and y-axes. Each value 108 in the measure triangle 104 represents the score determined for the association between the respective pair of variables. The values 108 shown are exemplary and selected only for illustrating the principles of the invention.
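One possible in-memory form of the measure triangle is a mapping from each unordered pair of variables to its score, as sketched below. The scoring function here is a hypothetical placeholder that returns a constant; in practice it would be replaced by a score computed from the data.

```python
from itertools import combinations


def measure_triangle(variables, score_fn):
    """Map each unordered pair of variables to its score (one entry per link)."""
    return {pair: score_fn(*pair) for pair in combinations(variables, 2)}


def placeholder_score(var_a, var_b):
    """Hypothetical stand-in for a data-derived score."""
    return 0.0


triangle = measure_triangle("ABCDE", placeholder_score)
links_of_a = [pair for pair in triangle if "A" in pair]
```

For the five variables A-E this yields ten entries, one per link of the fully connected network, with four of them involving variable A.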
[0063] FIG. 6 shows an exemplary process 15 for finding relationships among the variables D and E under two conditions of interest according to the principles of the invention. The process 15 obtains (step 60) a set of data that correspond to the first condition of interest from the data source 30. The data in the data set includes values for various variables for sample cases. The computer system 20 organizes (step 62) the obtained data in the data set. One exemplary data organization is the tabular representation 40 shown in FIG. 2A. The computer system 20 associates (step 64) two variables in the data set, in this case, variables D and E.
[0064] From the data set, the computer system 20 calculates (step 68) the strength of the association between variables D and E. Here, strength is an indication of how closely the variables are related. A strong association indicates that the variables are closely related; a weak association indicates little or no relationship between the variables.
[0065] Variables can be related to each other in various ways. For example, variables can be related through physiology: the serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide. Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils. Some variables can be directly or indirectly related to each other through other variables. An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
[0066] Other variables can have a relationship with each other relating to a pathologic condition. An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state. Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule. Here, the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
[0067] The exemplary process 15 continues with regard to the second condition of interest. The process 15 obtains (step 70) a set of data that correspond to the second condition of interest from the data source 30. The data in the data set includes values for various variables for sample cases. The computer system 20 organizes (step 72) the obtained data in the data set. One exemplary data organization is the tabular representation 50 shown in FIG. 2C. The computer system 20 associates (step 74) two variables in the data set, in this case, variables D and E.
[0068] From the data set, the computer system 20 calculates (step 78) the strength of the association between variables D and E. Similarly to the step 68, strength is an indication of how closely the variables are related, in this case, under the second condition of interest. [0069] After both strengths of association between the variables D and E under the first and second condition have been calculated, the computer system 20 generates a score according to a predetermined formula or equation (step 80). The score may be formulated to evaluate how different or similar the association between the two variables are under the conditions of interest. The computer system 20 may then display the resulting score in step 90 at a client system.
[0070] Still referring to FIG. 6, in one embodiment, the computer system 20 optionally constructs (step 84) a graphic network of variables (e.g. a relevance network) using the score generated in step 80. In another embodiment, the computer system 20 optionally generates more scores for other pairs of the variables in the data source 30 (step 82). These scores may be generated using the same process outlined for the first score. The computer system 20 may further build a graphic network such as a relevance network (step 84) based on the multiple scores generated in step 82. In a preferred embodiment, all variables in the data source 30 are paired with each other to generate multiple scores, and in the network of variables based on all of the scores generated, each variable is linked to every other variable (e.g., see FIG. 4B). After either step 82 or 84, the result may be displayed in step 90 at the client system.
[0071] In another embodiment, after step 80 or step 82, the computer system 20 evaluates or screens (step 92) the score for a pair of variables against a predetermined criterion. In one embodiment, the criterion can be a threshold value. Referring to FIG. 7, the computer system 20 first generates a criterion (step 94) if one does not already exist, then compares the score with the criterion (step 96). It then removes (step 98) the score that fails to satisfy the predetermined criterion. For example, the predetermined criterion can require the score of the association between each pair of variables to be above the threshold value. In another embodiment, the predetermined criterion can require the score of the association for a variable pair to be below the threshold value in order for that score to remain.
[0072] The computer system 20 also removes (step 102) each variable that has no score after step 98; that is, all scores based on the associations of that variable with another variable fail to satisfy the criterion. The remaining variables form one or more graphic networks such as relevance networks. In step 90, each network is displayed, e.g., at the client system.
[0073] In this embodiment, the removal of associations and variables can divide the network of variables into smaller, separate networks. Each such smaller network is a relevance network because that smaller network represents a group of correlated variables under the conditions of interest. Each variable in that smaller network has an association with at least one other variable in that network whose score satisfies the criterion.
[0074] In some instances, the criterion may cause the removal of none, one, or multiple scores without the removal of any variables. In such a case, the relevance network includes all of the variables in the data set.
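The screening and grouping steps described above (remove scores that fail the criterion, drop variables left with no score, and let the remaining links fall into separate networks) might look like the following sketch. It assumes the criterion is "score above a threshold" and that scores are held in a dictionary keyed by variable pair; the function name `relevance_networks` is hypothetical.

```python
def relevance_networks(scores, threshold):
    """Group variables into networks linked by scores above the threshold.

    scores: dict mapping (var_a, var_b) -> score.
    Returns a list of sets, one set of variables per separate network.
    """
    # Keep only links whose score satisfies the criterion; variables with
    # no surviving link never enter the adjacency map and are thus removed.
    adjacency = {}
    for (a, b), score in scores.items():
        if score > threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    # Collect connected components with a depth-first traversal.
    networks, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return networks


# Hypothetical scores for a few pairs among variables A-E.
example_scores = {("A", "B"): 0.4, ("B", "C"): 0.1, ("C", "D"): 0.5, ("A", "E"): 0.05}
networks = relevance_networks(example_scores, threshold=0.3)
```

In this example two links survive, giving two separate networks ({A, B} and {C, D}), and variable E is dropped because its only score fails the criterion.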
[0075] Other embodiments of processes for constructing a relevance network from associations that satisfy the criterion can be used to practice the principles of the invention.
[0076] As described above, scores may optionally be screened against a criterion. The screen operates as a filter that removes weakly correlated or non-correlated associations and variables from the network of variables to produce one or more relevance networks.
Consequently, the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
[0077] In one embodiment, the criterion is a threshold value against which the strength of each association is measured. The threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
[0078] Another exemplary technique for setting the threshold value utilizes data permutation.
In a preferred embodiment, the data for each variable under each condition of interest is first randomly permuted, then the scores for each pair of variables with permuted data are recalculated, and the threshold value is set based on the new score or scores. One example is to set the threshold value equal to or several times greater than the highest value of the score obtained through random permutation of the data. Various techniques for establishing the threshold value may be combined. For example, after the highest value of the score is obtained through data permutation, trial and error may be used until the number of variables left is in a manageable range.
[0079] The manner of permuting the data of each variable is independent of the manner used for each other variable. A permutation entails redistributing the original data values among the samples under the same condition of interest. The permutation of the data creates new data points between variables for each condition of interest. FIG. 8 shows an exemplary permutation of the data in table 40, which is originally shown in FIG. 2A. For example, the permutation shown in FIG. 8 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in FIG. 2A, namely (VAL2, VAL6) and (VAL3, VAL8). In a preferred embodiment, for each condition of interest, data are permuted within a given variable across the samples. In other words, as shown in FIG. 8, an original data value for variable A, for example, will only be shuffled among the samples to still represent a value for variable A, and not for any other variable. The average value of a given variable under a given condition will remain the same in this embodiment.
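A minimal sketch of permutation-based thresholding follows, under several assumptions: each condition's data are a complete NumPy array with samples in rows and variables in columns (no missing entries), the score is the first exemplary equation S = |β1 - β2| × |r1| × |r2|, and the threshold is a chosen multiple of the highest score seen on permuted data. The function names, the default of 30 rounds, and the `factor` parameter are illustrative choices rather than requirements.

```python
from itertools import combinations

import numpy as np


def score(x1, y1, x2, y2):
    """|b1 - b2| * |r1| * |r2| for one variable pair under two conditions."""
    b1, b2 = np.polyfit(x1, y1, 1)[0], np.polyfit(x2, y2, 1)[0]
    r1, r2 = np.corrcoef(x1, y1)[0, 1], np.corrcoef(x2, y2)[0, 1]
    return float(abs(b1 - b2) * abs(r1) * abs(r2))


def permute_within_variables(data, rng):
    """Shuffle each column (variable) independently across rows (samples)."""
    return np.column_stack([rng.permutation(column) for column in data.T])


def permutation_threshold(cond1, cond2, n_rounds=30, factor=1.0, seed=0):
    """Return factor times the highest score obtained on permuted data."""
    rng = np.random.default_rng(seed)
    highest = 0.0
    for _ in range(n_rounds):
        p1 = permute_within_variables(cond1, rng)
        p2 = permute_within_variables(cond2, rng)
        for i, j in combinations(range(cond1.shape[1]), 2):
            highest = max(highest, score(p1[:, i], p1[:, j], p2[:, i], p2[:, j]))
    return factor * highest


# Hypothetical data: 10 and 12 samples of 4 variables under two conditions.
demo_rng = np.random.default_rng(1)
condition_1 = demo_rng.normal(size=(10, 4))
condition_2 = demo_rng.normal(size=(12, 4))
threshold = permutation_threshold(condition_1, condition_2, n_rounds=5, factor=3.0)
```

Because values are only shuffled within a column, each variable keeps the same set of values, and hence the same average, under each condition.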
[0080] From the permuted data points, scores for pairs of variables are recalculated. The technique used to calculate the score for variable pairs with permuted data points is the same as that used for the original data points. If the steps of permuting the data and calculating scores are repeated a predetermined number of times (e.g., 30), a distribution of scores generated on the permuted data is obtained. The threshold value may then be set based on the distribution of scores obtained from the repeated permutations of the data.
[0081] FIG. 9 illustrates a preferred embodiment of data permutation. Each variable's data under one condition of interest, as exemplified here by variable A's data under the second condition of interest as originally listed in the table 50 of FIG. 2C, is reproduced here. Four samples, S5, S6, S7, and S8, show data entries for variable A as VAL16, VAL17, VAL18, and VAL19, respectively. Each data entry of variable A is randomly assigned a corresponding value (e.g., a number between 1 and 100). VAL16 is randomly assigned, in this example, the corresponding value of "22." Likewise, "40" is assigned to VAL17, "15" to VAL18, and "18" to VAL19. Then the data entries for variable A are sorted according to the order of the assigned corresponding values. One may choose to sort the corresponding values in an ascending order (as in FIG. 9), a descending order, or any other order, and reassign each corresponding value along with the data entry associated with it to the sample cases in sequence (e.g., from S5 to S8 as shown in FIG. 9). Because each original data entry goes with its assigned corresponding value, it is permuted as a result. In this example, VAL18, which has been randomly assigned a corresponding value of "15," is shuffled to Sample 5 because "15" ranks first in the ascending order of all corresponding values and Sample 5 is the first case in sequence.
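The sort-by-random-key shuffle of FIG. 9 can be sketched as follows, using the corresponding values given in the text (22, 40, 15, 18) so that the outcome can be checked by hand. The function names are hypothetical, and drawing distinct keys with `random.Random.sample` is an assumption made here to avoid ties.

```python
import random


def random_keys(n, rng):
    """Draw n distinct keys between 1 and 100 (assumption: no ties wanted)."""
    return rng.sample(range(1, 101), n)


def permute_by_keys(samples, values, keys):
    """Sort data entries by their assigned keys and reassign them in sequence."""
    ordered = [value for _key, value in sorted(zip(keys, values), key=lambda kv: kv[0])]
    return dict(zip(samples, ordered))


samples = ["S5", "S6", "S7", "S8"]
values = ["VAL16", "VAL17", "VAL18", "VAL19"]
keys = [22, 40, 15, 18]  # corresponding values from the FIG. 9 example
result = permute_by_keys(samples, values, keys)
# result == {'S5': 'VAL18', 'S6': 'VAL19', 'S7': 'VAL16', 'S8': 'VAL17'}
```

Sorting by independently drawn keys produces a random permutation of the entries while keeping every entry within the same variable.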
[0082] After the computer system 20 generates a graphic network of variables in accordance with the invention, the system may display graphical representations of the network (e.g., relevance networks). The relevance networks 110 shown in FIGS. 4A and 4B are exemplary. The invention can be applied to large numbers of variables. To graphically represent relevance networks having large numbers of variables, the computer system 20 can execute graph layout software. An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
[0083] In brief overview, execution of the data mining software causes the computer system 20 to access data in the data source 30. The data mining software first accesses data that correspond to the first condition of interest and associates two variables with each other. There may be multiple samples under the first condition that contain data for that pair of variables. The software calculates, based on all the data of the pair of variables under the first condition, a strength of the association according to a predetermined formula or equation. The software then accesses data that correspond to the second condition of interest and associates the same pair of variables under the second condition of interest. Similarly, there may be multiple samples under the second condition that contain data for the pair of variables. A strength of the new association, based on all the data for the pair of variables under the second condition, is calculated by the software according to a predetermined formula or equation, maybe the formula or equation used to calculate the first strength. As a way of illustration, the strength of the association can be embodied in a correlation coefficient, the slope of the line that fits all the data points for the pair of variables, or a combination of both. [0084] The software then generates a score for that pair of variables based on the two strengths calculated from data corresponding to two conditions of interest. If there are more than two conditions of interest, the software repeats the step of accessing the data corresponding to a new condition of interest (e.g. a third condition), associating the same pair of variables, calculating a new strength of the association for the pair for the new condition. A score designed to incorporate the strength for the association between the pair of variables under the new condition may then be generated.
[0085] The score is derived from the strengths of the association between a pair of variables corresponding to at least two conditions of interest. The score may be designed to evaluate how different or similar the association between the two variables is under conditions of interest.
[0086] For example, suppose the variables are body length and body mass, and measurements of two mice from the same litter are made under two different conditions: one on the earth, with gravity of g = 9.8 m/s², and the other in space, with zero gravity. A score may be designed to measure the correlation between the body length and the body mass under different gravitational conditions. The score may be designed to examine whether the change in gravity causes the two variables, body length and body mass, to relate to each other differently. By way of illustration, if the data show that a mouse with a larger body mass tends to be longer on the earth than other mice on the earth, but a mouse with a larger mass is not any longer than other mice in space, then the score would be high. A high value for the score in this case indicates that gravity may be a factor in any correlation between the mouse's body mass and body length.
[0087] As a second example, if the variables are gene expression levels, a score may be designed to accentuate difference in the correlation between the expressions of each pair under two disease states. A high score in this case indicates that the two genes might be functionally linked in developing one or both of the diseases.
[0088] The data mining software may output each score for display (e.g., at the computer system). The displayed output makes it readily apparent what kind of correlation between the two variables exists under the at least two conditions and therefore indicates potentially worthy targets for research.
[0089] The software may continue to generate scores for any number of the rest of the variables until each variable has been paired with another one. An advantage of the invention is the lack of bias or assumption in choosing the association. Because each possible pair of variables may be examined, a method in accordance with the invention can be used to discover unknown links between seemingly unrelated variables affected by or causing certain conditions.
[0090] The software may then establish a criterion or threshold value and screen at least one of the generated scores against the criterion or threshold. The software may randomly permute the data to establish a threshold.
[0091] Further, the data mining software may group together variables into one or more separate networks. It may include in the network each pair of variables whose score meets the criterion or threshold. However, even for the purpose of constructing a graphic network of variables, the establishment of a criterion or threshold is not necessary. For example, in certain preferred embodiments, each score already indicates how similar or dissimilar the associations between two variables are under at least two different conditions. If one or more of such scores are used to indicate how various nodes (i.e., variables) are linked in a graphic fashion as exemplified in FIGS. 4A and 4B, a criterion for the score is optional because the score is self-explanatory in indicating what kind of correlation between the two variables exists under the at least two conditions. If, however, a threshold is established, it may be used to further screen out correlations that are considered of less significance.
[0092] The software may further output each network for display (e.g., at the computer system).
[0093] The present invention is useful in a variety of applications. For example, relevance networks may be produced to exhibit potential correlation between biological traits, e.g., genetic compositions, when normal cells are compared to diseased cells, or one type of diseased cell is compared to another. Relevance networks may also be produced to expose potential correlation between biological traits before and after an event, such as the start of a drug therapy.
Example
[0094] An example is now provided on how the invention is used to construct genetic regulatory networks in acute lymphocytic leukemia (ALL) and acute myelogenous leukemia (AML).
[0095] One view of a disease state is an alteration in the normal cellular genetic regulatory network. A disease state may represent a change in these gene-gene or gene-protein interactions without significant changes in the absolute expression levels of any genes. For example, in normal cells, the tumor suppressor p53 may upregulate p21/WAF1 to control progression into the cell cycle. Though mutant p53 may cause biologically significant changes in the downstream control of effector genes, the absolute expression values of those genes may not vary significantly from normal.
[0096] Models of genetic regulatory networks can be ascertained from gene-expression data sets, with varying degrees of concordance with biologically-proven control networks. A central hypothesis is that finding differences between the models for two diseases, rather than the differences in individual gene expression levels, will elucidate the critical biological differences between the diseases.
[097] In accordance with the invention, every possible pair-wise gene-gene interaction under two disease states is examined by generating a linear regression model, with a correlation coefficient and slope, for each pair of genes. These parameters were chosen because it was hypothesized that coexpression implies coregulation: if a gene is regulated with another gene in a linear manner, the correlation coefficient of the regression model will be high. For gene-gene associations with high correlation coefficients in both diseases, a significant difference in the slopes of the linear models may be related to pathophysiological differences in the disease states. For example, a critical transcription factor may suppress expression of another gene in the normal state, but this control may be lost in a cancerous state. Therefore, a score was formulated to reflect the above hypotheses, and the data for each gene pair generated a score for that pair. Then, a threshold value was established through random permutation of the original data. A relevance network was then generated using scores that met the threshold. A total of 24 genes were included in the network.
[098] Specifically, publicly available expression levels of 7,129 expressed sequence tags measured in 47 patients with ALL and 25 patients with AML were obtained from the Whitehead Institute (http://www.genome.wi.mit.edu/MPR). A computer system considered every possible pair of genes as a potential biological relationship. For each of the resulting 25,407,756 gene-gene associations, the computer system constructed two scatter plots: one for the ALL patients (first condition of interest) and one for the AML patients (second condition of interest). For example, in the ALL scatter plot for the association between gene A and gene B, each of the 47 ALL patients is represented as a data point 44 in a two-dimensional space similar to the scatter plot 43 in FIG. 2B, where the x-coordinate specifies the expression measurement for gene A, and the y-coordinate is the expression measurement for gene B. Only the expression measurements with the highest confidence were used (i.e., those labeled "P", or present, by the Affymetrix GENECHIP software), so that a patient's data was plotted on a scatter plot only if the measurements of both genes were labeled "P". A linear regression model was constructed for each scatter plot, with a calculated slope and correlation coefficient.
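The per-pair fitting step of paragraph [098] can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the study: the function names `count_pairs` and `fit_pair`, the data layout (parallel lists of measurements and call labels, one entry per patient), and the use of an ordinary least-squares fit are all assumptions made for the example.

```python
from statistics import mean


def count_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


def fit_pair(x, y, x_calls, y_calls):
    """Least-squares fit of y on x for one gene pair in one disease group.

    Only patients whose measurements for BOTH genes carry the "P"
    (present) call are used. Returns (slope, r_squared, n_points), or
    None when the fit is undefined (fewer than two usable points, or no
    variance in x or y).
    """
    pts = [(a, b) for a, b, ca, cb in zip(x, y, x_calls, y_calls)
           if ca == "P" and cb == "P"]
    if len(pts) < 2:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in pts)
    if sxx == 0 or syy == 0:
        return None
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared, len(pts)
```

Calling `count_pairs(7129)` reproduces the 25,407,756 associations quoted above. In this sketch `fit_pair` would be called twice per gene pair, once with the ALL patients and once with the AML patients.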
[099] Since only "P" labeled gene measurements were used, it was possible that a scatter plot contained very few points, and this could bias the correlation coefficient to 1.0. To offset this bias, a scatter plot was constructed only if the gene-gene association, either corresponding to ALL or AML, contained at least nine data points.
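The small-sample bias described in paragraph [099] is easiest to see in the extreme case of two points, which always lie exactly on a line. The sketch below is illustrative only: the helper names and the example measurements are hypothetical, and only the minimum of nine points comes from the text.

```python
MIN_POINTS = 9  # minimum number of patient-points stated in the text


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def has_enough_points(n_points, minimum=MIN_POINTS):
    """True if a scatter plot has enough points to be fitted at all."""
    return n_points >= minimum


# Two arbitrary (made-up) measurements: the fit is perfect even though
# nothing is known about how the two genes relate.
two_point_r2 = r_squared([10.0, 250.0], [900.0, 3.0])
```

Here `two_point_r2` evaluates to 1.0 up to floating-point rounding, which is why scatter plots with too few "P"-labeled points are skipped rather than scored.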
[0100] After linear regression models were constructed for all 25 million possible gene-gene associations for each disease, a score S was calculated for each gene pair association using the following formula:
$$S(\text{geneA}, \text{geneB}) = \operatorname{abs}(M_{AML} - M_{ALL}) \times \operatorname{abs}(R_{AML}) \times \operatorname{abs}(R_{ALL})$$
where abs is the absolute value function, M is the slope of the regression line of the two gene expression measurements plotted using patient-points (expressed in radians divided by 2π, thus ranging between -1.0 and 1.0), $M_{ALL}$ is the slope using the ALL patient-points, $M_{AML}$ is the slope using the AML patient-points, $R_{ALL}$ is the square of the correlation coefficient (r²) for the regression using the ALL patient-points (ranging from 0 to 1.0), and $R_{AML}$ is r² using the AML patient-points.
[0101] The score S was based on (1) the correlation coefficients in the regression models for both diseases, and (2) the difference in slopes between the two linear regression models under the two diseases. A gene-gene association would have a high S, with a maximum of 2.0, when there was a large difference in the slopes between the ALL and AML models and high correlation coefficients in the regression models for both diseases. FIG. 10 shows, in bold line, the distribution 140 of S based on the actual data. The x-axis of the diagram in FIG. 10 shows the value of the score (based on actual or permuted data). The y-axis indicates the number of gene-gene associations with that score. As FIG. 10 indicates, the highest value of the score S in this case is around 0.4.
[0102] Expression measurements were permuted 10 times by randomly shuffling the expression measurements for each gene across patients having the same disease and then recalculating a score S' for each gene pair using the same formula as for S. The resultant average distribution 150 of scores S' based on permuted data is shown in FIG. 10. For example, the average distribution line 150 of S' indicates that about five associations have an S' value of 0.05. The average distribution line 150 also shows no S' over 0.08. Confidence intervals 160 were also calculated at specific intervals on the average S' distribution 150 by plotting points two standard deviations above and below the average distribution of S'.
[0103] Based on the average distribution of S', taking into account the confidence intervals, a threshold 130 for S was set at three times the highest S', or about 0.3. All gene-gene associations with a score below threshold 130 were eliminated. There were 15 gene-gene associations that had an S with a value higher than 0.3, connecting 24 genes. These 24 genes are listed in Tables I-IV.
Table I. Identified Genes with Known Biological Association with Leukemia
Table III. Identified Genes with Possible Roles in Oncogenesis
Table IV. Identified Genes with No Known Biological Association with Leukemia or Cancer
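The scoring and thresholding steps of paragraphs [0100] to [0103] can be sketched as follows. This is a hedged illustration rather than the code used in the study: the function names, the dictionary layout for expression data, and the `factor` parameter are assumptions. The score function takes slopes that have already been normalized as described in paragraph [0100], so the sketch makes no assumption about how that normalization is carried out.

```python
import random


def score(m_all, m_aml, r2_all, r2_aml):
    """S = abs(M_AML - M_ALL) * abs(R_AML) * abs(R_ALL).

    m_all, m_aml: normalized slopes of the ALL and AML regression lines.
    r2_all, r2_aml: r-squared of the ALL and AML regressions.
    """
    return abs(m_aml - m_all) * abs(r2_aml) * abs(r2_all)


def permute_within_group(expr, rng):
    """Shuffle each gene's measurements across the patients of ONE
    disease group. expr maps gene name -> list of measurements, one per
    patient; the input is left unmodified."""
    shuffled = {}
    for gene, values in expr.items():
        values = list(values)  # copy so the caller's data is untouched
        rng.shuffle(values)
        shuffled[gene] = values
    return shuffled


def threshold_from_permuted(permuted_scores, factor=3.0):
    """Threshold set as a multiple of the highest permuted score S'."""
    return factor * max(permuted_scores)
```

In this sketch, each permutation round would shuffle the ALL and AML groups separately with `permute_within_group`, refit every pair, and collect the resulting S' values before `threshold_from_permuted` is applied. With the highest S' of 0.08 reported above, a factor of three gives 0.24, which the text reports as about 0.3.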
[0104] Using the 15 qualifying gene-gene associations as a metric, a relevance network of the 24 genes was produced, as shown in FIG. 11. Each node 180 of the network is represented by the accession number of the gene, which is listed in Tables I-IV next to the actual name of the gene. The thickness of link 190 between nodes 180 is proportional to the value of S between the variables represented by the nodes 180. A higher S is indicated with a thicker link 190. As shown in FIG. 11, the strongest association (i.e., the highest S) was between m21186 (cytochrome B-245) and U33838 (NF-kappa-B), with S = 0.46. This association showed a negative correlation across patients with AML and a positive correlation in ALL patients.
[0105] This technique and its application to the leukemia data set are significant for several reasons. First, this technique allows comparison of the regulatory networks of multiple forms of leukemia. It is noted that while two forms of leukemia are examined in this example, the method can be easily expanded to include more than two forms of disease. Second, previously well-studied oncogenes such as NF-kappa-B, L-myc, and casein kinase II-α' were found using this approach, as well as genes with no currently known roles. Table I lists 12 of the 24 genes found above the threshold. These 12 genes have already been shown to have a biological association with leukemia and oncogenesis. Table II lists one of the 24 genes that was known to have a role in oncogenesis. Table III lists 8 of the 24 genes that could possibly have a role in oncogenesis. None of these 24 genes were found in the Whitehead analysis, where genes were listed based on whether their absolute expression measurements were different between ALL and AML.
[0106] For example, two genes listed in Table I have been known to be involved in chromosomal translocations. The gene EAP is fused with AML1 in the t(3;21) translocation seen in acute and chronic myeloid leukemias. L-myc and rlf can be found fused in small-cell lung cancer, and the fusion product has been found to be deregulated compared to the normal L-myc gene. Gene rearrangements involving L-myc have also been seen in multiple myeloma.
[0107] Further, connections between genes known to have biological associations with leukemia or oncogenesis and genes previously unknown to have such roles were discovered through the method of the invention. For example, an association between M55268 (casein kinase II-α', listed in Table I) and U02632 (calcium-activated potassium channel, listed in Table IV) was revealed, as shown in FIG. 11.
[0108] Now referring to FIG. 12, a comparison between the data points for these two genes under both diseases is diagrammed. Each data point in the scatter plot has an x-axis coordinate for the expression measurement of casein kinase II-α' and a y-axis coordinate for the expression measurement of the gene responsible for the potassium channel. The solid dots (•) represent measurements from patients with ALL, and their coefficient of determination (r²) value is 0.72. A solid line fits the solid dots and has a slope of 0.199 in radians. The r² value and the slope indicate that the two genes were highly positively correlated across patients with ALL. The hollow dots (o) represent measurements from patients with AML, and their r² value is 0.06. A dotted line fits the hollow dots and has a slope of -0.014 in radians. The r² value and the slope indicate that the two genes were not positively (rather, negatively) correlated in patients with AML. As a result, their score was above the threshold value.
[0109] It had been shown that transgenic mice overexpressing casein kinase II-α' develop lymphoma; with coexpressed c-myc, they develop neonatal leukemia. There was no known association between this calcium-dependent potassium channel and leukemia. However, recent work showed that in humans, both the casein kinase II-α' subunit and the calcium-activated potassium channel share polymorphic 5'-(GCC)n-3' trinucleotide repeats in their 5'-upstream regions. The result described here was the first known instance of finding genes sharing a common mechanism of polymorphism through differential functional genomics. These two genes may be functionally linked in ALL and not AML, because (1) the two genes share a 5'-region susceptible to trinucleotide repeats, and thus may also share a common regulatory region, (2) the 5'-regions may be affected by cancerous cellular machinery affecting the trinucleotide repeats, causing loss of regulation in AML, or (3) in normal individuals, these regions may affect expression of the downstream products similarly, and may serve as a marker linked with the development of leukemia.
[0110] The method of the invention is particularly advantageous in investigating complex diseases such as leukemia because, with potentially multiple etiologies and disruptions in several regulatory networks, these diseases are most likely caused by abnormalities in gene regulation, which may not be manifested through individual genes with different expression levels. Using this scoring system and using each group as a comparison for the other in accordance with the invention, samples from two diseases can now be analyzed by examining all gene-gene interactions and not just single gene expression levels, yielding quantitatively strong associations that behave significantly differently between the diseases. This is ideal for identifying key controlling differentiators, which are crucial both for clinical diagnosis and for improved pathophysiologic understanding. The majority of the genes found in this analysis were already known to play a role in leukemogenesis, which testifies to the validity of this methodology. Finally, this methodology is tunable, in that the threshold score can be moved higher or lower depending on results. The technique can be expanded to include alternative metrics besides the least squares error minimization of linear regression models used in this example.
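The tunable threshold mentioned above, together with the network construction of paragraph [0104], can be sketched as follows. This is an assumed, minimal representation (a dictionary of scored pairs filtered into nodes and weighted links), not the system described in the application. The function names and the `scale` parameter are hypothetical. In the usage example that follows, only the m21186 and U33838 pair and its score of 0.46 come from the text; every other pair and score is invented for illustration.

```python
def build_network(scores, threshold):
    """Keep gene pairs whose score meets the threshold.

    scores maps (gene_a, gene_b) -> S. Returns (nodes, edges): a sorted
    list of the genes that appear in a kept pair, and a dict of the
    kept pairs with their scores.
    """
    edges = {pair: s for pair, s in scores.items() if s >= threshold}
    nodes = sorted({gene for pair in edges for gene in pair})
    return nodes, edges


def link_widths(edges, scale=10.0):
    """Drawing width for each link, proportional to its score S."""
    return {pair: scale * s for pair, s in edges.items()}
```

For example, with scores `{("m21186", "U33838"): 0.46, ("geneA", "geneB"): 0.35, ("geneB", "geneC"): 0.10}`, a threshold of 0.3 keeps two links joining four genes, while raising the threshold to 0.4 keeps only the m21186 and U33838 link.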
[0111] The present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture. The article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape. In general, the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, or PROLOG, or in any byte code language such as JAVA. The software programs may be stored on or in one or more articles of manufacture as object code.
[0112] Having described certain embodiments of the invention, it will now become apparent to one of skill in the art that other embodiments incorporating the concepts of the invention may be used. Therefore, the invention should not be limited to certain embodiments, but rather should be limited only by the spirit and scope of the following claims.

Claims

What is claimed is:
1. A method for evaluating a relation between a plurality of variables under at least two conditions of interest, comprising the steps of: (a) obtaining data for a plurality of variables under a first condition of interest; (b) establishing an association between the data for a pair of variables of the plurality of variables; (c) calculating from the data a first strength of the association between the pair of variables; (d) repeating steps (a) to (c) to calculate a second strength of association between the pair of variables under a second condition of interest; and (e) generating a score for the pair of variables based on at least the first and second strengths.
2. The method of claim 1 wherein step (c) comprises calculating from the data a correlation coefficient for the pair of variables.
3. The method of claim 2 wherein the score is generated in response to the correlation coefficients for the pair of variables under the first and second conditions.
4. The method of claim 1 wherein step (c) comprises fitting a line to the data for the pair of variables and determining a slope of the line.
5. The method of claim 4 wherein the score is generated in response to the slopes of the lines for the pair of variables under the first and second conditions.
6. The method of claim 3 wherein step (c) further comprises fitting a line to the data for the pair of variables and determining a slope of the line, and wherein the score is generated in response to both the correlation coefficients and the slopes of the lines for the pair of variables under the first and second conditions.
7. The method of claim 1, further comprising producing a network of variables based on at least the score.
8. The method of claim 1, further comprising the step of: (f) repeating steps (a) to (e) for each other pair of variables in the plurality of variables to generate a plurality of scores.
9. The method of claim 8, further comprising producing a network of variables based on the plurality of scores.
10. The method of claim 8, further comprising the step of: (g) establishing a criterion and evaluating at least one of the plurality of scores against the criterion.
11. The method of claim 10, further comprising producing a network that includes each pair of variables if their score meets the criterion.
12. The method of claim 10 wherein step (g) comprises: (g1) randomly permuting the data for the plurality of variables under the first condition of interest; (g2) calculating from the permuted data a first strength of the association between each pair of variables; (g3) repeating steps (g1) and (g2) to generate from the permuted data a second strength of the association between each pair of variables under the second condition of interest; (g4) generating a score for each pair of variables based on the first and second strengths from the permuted data of each pair of variables; and (g5) establishing the criterion based on the score from the permuted data.
13. The method of claim 12 further comprising repeating steps (g1) to (g4) a predetermined number of times.
14. The method of claim 12 wherein step (g1) comprises randomly assigning a corresponding value for the data of each variable and permuting the data of each variable through sorting the corresponding value assigned for the data.
15. The method of claim 1 further comprising the step of: (d') repeating steps (a) to (c) to calculate another strength of association between the pair of variables under another condition of interest.
16. The method of claim 15 further comprising the step of: (e') generating a score for the pair of variables based on all the calculated strengths.
17. A system for evaluating a relation between a plurality of variables under at least two conditions of interest, comprising: memory storing data for a plurality of variables under at least a first and a second condition of interest; an associator establishing an association between the data for a pair of variables of the plurality of variables at least under the first and second conditions respectively; a calculator, in communication with the memory and the associator, calculating from the data at least a first and a second strength of the association between the pair of variables under the first and second conditions respectively; and a score generator generating a score for the pair of variables based on at least the first and second strengths.
18. The system of claim 17 wherein the calculator calculates a correlation coefficient from the data for the pair of variables.
19. The system of claim 18 wherein the score generator generates the score in response to the correlation coefficient for the pair of variables under the first and second conditions.
20. The system of claim 17 wherein the calculator fits a line to the data for the pair of variables and determines a slope of the line.
21. The system of claim 20 wherein the score generator generates the score in response to the slope for the pair of variables under the first and second conditions.
22. The system of claim 17 further comprising a network generator producing a network of variables based on at least the score.
23. The system of claim 22 further comprising a display in communication with the network generator, displaying the network.
24. The system of claim 17 wherein the score generator generates a plurality of scores for the plurality of variables.
25. The system of claim 24 wherein the score generator generates a score for each pair of variables.
26. The system of claim 24 further comprising a network generator producing a network of variables based on the plurality of scores.
27. The system of claim 17 further comprising a criterion setter that establishes a criterion against which the score is evaluated.
28. The system of claim 27 wherein the criterion setter establishes the criterion through recalculating the score based on randomly permuted data.
29. The system of claim 27 further comprising a network generator producing a network of variables based on each score that meets the criterion.
30. The system of claim 17 further comprising a display in communication with the score generator displaying at least the score.
31. The system of claim 17, wherein the calculator calculates from the data another strength of the association between the pair of variables under another condition of interest.
32. The system of claim 31, wherein the score generator generates a score for the pair of variables based on all the calculated strengths.
33. A method for evaluating a relation between a plurality of genes under at least two conditions of interest, comprising the steps of: (a) obtaining data on expression levels of a plurality of genes under a first condition of interest; (b) establishing an association between the expression levels of a pair of genes of the plurality of genes; (c) calculating from the data a first strength of the association between the expression levels of the pair of genes; (d) repeating steps (a) to (c) to generate a second strength of association between the expression levels of the pair of genes under a second condition of interest; and (e) generating a score for the pair of genes based on at least the first and second strengths.
34. The method of claim 33 wherein at least one of the first and second conditions of interest is a disease condition.
35. The method of claim 33 wherein one of the first and second conditions of interest is a normal condition.
36. The method of claim 33 wherein the expression levels of the plurality of genes are based on a quantitative measurement of RNA transcripts of the plurality of genes.
37. The method of claim 33 wherein step (c) comprises calculating from the data a correlation coefficient for the expression levels of the pair of genes.
38. The method of claim 37 wherein the score is generated in response to the correlation coefficients for the pair of genes under the first and second conditions.
39. The method of claim 33 wherein step (c) comprises fitting a line to the data for the pair of genes and determining a slope of the line.
40. The method of claim 39 wherein the score is generated in response to the slopes of the lines for the pair of genes under the first and second conditions.
41. The method of claim 38 wherein step (c) further comprises fitting a line to the data for the pair of genes and determining a slope of the line, and wherein the score is generated in response to both the correlation coefficients and the slopes of the lines for the pair of genes under the first and second conditions.
42. The method of claim 33, further comprising producing a network based on at least the score.
43. The method of claim 33, further comprising the step of: (f) repeating steps (a) to (e) for each other pair of genes in the plurality of genes to generate a plurality of scores.
44. The method of claim 43, further comprising producing a network based on the plurality of scores.
45. The method of claim 43, further comprising the step of: (g) establishing a criterion and evaluating at least one of the plurality of scores against the criterion.
46. The method of claim 45, further comprising producing a network that includes each pair of genes if their score meets the criterion.
47. The method of claim 45 wherein step (g) comprises: (g1) randomly permuting the data on the expression levels of the plurality of genes under the first condition of interest; (g2) calculating from the permuted data a first strength of the association between the expression levels of each pair of genes; (g3) repeating steps (g1) and (g2) to generate from the permuted data a second strength of the association between the expression levels of each pair of genes under the second condition of interest; (g4) generating a score for each pair of genes based on the first and second strengths from the permuted data of each pair of genes; and (g5) establishing the criterion based on the score from the permuted data.
48. The method of claim 47 further comprising repeating steps (g1) to (g4) a predetermined number of times.
49. The method of claim 47 wherein step (g1) comprises randomly assigning a corresponding value for the data for each gene and permuting the data for each gene through sorting the corresponding value assigned for the data.
50. The method of claim 33 wherein the data comprise expression level data on the same plurality of genes from multiple subjects.
51. The method of claim 33 further comprising the step of: (d') repeating steps (a) to (c) to calculate another strength of association between the expression levels of the pair of genes under another condition of interest.
52. The method of claim 51 further comprising the step of: (e') generating a score for the pair of genes based on all the calculated strengths.
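Claims 14 and 49 recite permuting data by assigning a random value to each datum and sorting on that value. A minimal sketch of that idea follows. It is illustrative only and not part of the claims: the function name and the `rng` parameter are assumptions.

```python
import random


def permute_by_random_keys(values, rng=random):
    """Permute values by pairing each with a random key and sorting on
    the key. rng may be the random module or a random.Random instance."""
    keyed = [(rng.random(), v) for v in values]
    keyed.sort(key=lambda kv: kv[0])  # order is decided by the keys only
    return [v for _, v in keyed]
```

Passing a seeded `random.Random` instance as `rng` makes a permutation run repeatable, which may be useful when the permutation step is repeated a predetermined number of times.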
PCT/US2001/043604 2000-11-10 2001-11-12 Relevance networks for visualizing clusters in gene expression data WO2002039214A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002219817A AU2002219817A1 (en) 2000-11-10 2001-11-12 Relevance networks for visualizing clusters in gene expression data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US24711000P 2000-11-10 2000-11-10
US60/247,110 2000-11-10

Publications (3)

Publication Number Publication Date
WO2002039214A2 true WO2002039214A2 (en) 2002-05-16
WO2002039214A3 WO2002039214A3 (en) 2002-11-21
WO2002039214A9 WO2002039214A9 (en) 2003-02-06

Family

ID=22933592

Family Applications (3)

Application Number Title Priority Date Filing Date
PCT/US2001/043604 WO2002039214A2 (en) 2000-11-10 2001-11-12 Relevance networks for visualizing clusters in gene expression data
PCT/US2001/047087 WO2002059820A1 (en) 2000-11-10 2001-11-13 Method and apparatus for determining fold difference significance
PCT/US2001/047163 WO2002046474A2 (en) 2000-11-10 2001-11-13 Method and system for identifying time-series relationships of gene expression level using signal processing metrics

Country Status (2)

Country Link
AU (2) AU2002219817A1 (en)
WO (3) WO2002039214A2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3016362A1 (en) 2014-11-03 2016-05-04 OpenTV Europe SAS Method and system to share advertisement content from a main device to a secondary device
US11892990B2 (en) 2021-01-04 2024-02-06 International Business Machines Corporation Removal of transaction noise

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151031A (en) * 2023-10-26 2023-12-01 国网经济技术研究院有限公司 Design evaluation method and system for parallel busbar of high-power electronic device
CN117151031B (en) * 2023-10-26 2024-01-30 国网经济技术研究院有限公司 Design evaluation method and system for parallel busbar of high-power electronic device

Also Published As

Publication number Publication date
WO2002039214A3 (en) 2002-11-21
WO2002046474A3 (en) 2002-11-14
WO2002046474A2 (en) 2002-06-13
WO2002059820A9 (en) 2003-11-20
AU2002236577A1 (en) 2002-06-18
WO2002039214A9 (en) 2003-02-06
AU2002219817A1 (en) 2002-05-21
WO2002059820A1 (en) 2002-08-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/13-13/13, DRAWINGS, REPLACED BY NEW PAGES 1/13-13/13; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP