WO2002039214A2 - Relevance networks for visualizing clusters in gene expression data - Google Patents
- Publication number
- WO2002039214A2 (PCT/US2001/043604)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variables
- pair
- data
- score
- genes
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Definitions
- the invention relates generally to data mining or processing. More specifically, the invention relates to a system and method for mining data from multiple data sets to identify potentially meaningful relationships among variables in the data sets.
- the present invention relates to a system and method for producing a network of related variables.
- An objective of the invention is to uncover correlation or association between variables that are significant in different contexts, or conditions of interest.
- the invention features a method that evaluates a relation between multiple variables under two conditions of interest. According to the method, data for a plurality of variables that represent or correspond to a first condition of interest are obtained. An association between a pair of variables is established. From the data, a first strength of the association between that pair of variables is calculated. The above steps are repeated to calculate a second strength of association between the same pair of variables based on their data that represent or correspond to a second condition of interest.
- a score is generated based on at least the first and second strengths to help evaluate the association of the pair of variables under both conditions of interest. Another strength of association between the same pair of variables, based on their data corresponding to another condition of interest, can be calculated. There is no limit on the number of conditions of interest for which strengths can be calculated. A score can be generated based on all the calculated strengths. Optionally, a network of those variables may be produced based on the generated score. Alternatively, multiple scores may be generated for multiple pairs of variables, e.g., each possible pair of variables, and a network may be produced based on these multiple scores.
- the variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.).
- the network of variables can be graphically displayed.
- the strength of the association between a pair of variables under a certain condition is calculated from a correlation coefficient for the pair of variables.
- the strength is based on the slope of the line produced through a linear regression model, i.e., the line that best fits the data for the pair of variables.
- the strength is based on both the correlation coefficient and the slope.
- a criterion or threshold may also be established and any number of the scores can be evaluated against the criterion.
- a network of variables may be produced to include only variables that, in connection with other variables, have a score that meets the criterion.
- the criterion or threshold value is established by randomly permuting the data for each pair of variables for each condition of interest. A score for each pair of variables is recalculated from the permuted data. The steps of permuting and calculating may be repeated a predetermined number of times. The criterion or threshold value may be set equal to or greater than the highest value for the recalculated score based on the permuted data.
- In another aspect, the invention relates to a system for evaluating a relation between a plurality of variables under two conditions of interest.
- the system includes memory storing data for the plurality of variables under at least a first and a second condition of interest.
- An associator establishes an association between a pair of variables under at least the first and second conditions respectively.
- a calculator in communication with the memory and the associator, calculates at least a first and a second strength of the association between the pair of variables under the first and second conditions respectively.
- a score generator generates a score for the pair of variables based on at least the first and second strengths.
- the calculator may further calculate another strength of association between the same pair of variables based on their data corresponding to another condition of interest. There is no limit on the number of conditions of interest for which the calculator can calculate strengths.
- the score generator may generate a score based on all the calculated strengths.
- the system may optionally include a network generator that generates a network of variables based on at least the generated score or scores.
- the system can further include a display in communication with the score generator that displays at least the score.
- the display may also be in communication with the network generator and display the network.
- the system may also include a criterion setter that establishes a criterion against which the score is evaluated.
- the criterion setter may establish the criterion through recalculating the score based on randomly permuted data.
- the network generator may choose to generate a network of variables based on each score that meets the criterion.
- the method of the invention can be used to evaluate a relation between a plurality of genes under two conditions of interest.
- Data on the expression levels of a plurality of genes that represent or correspond to a first condition of interest are obtained.
- An association between the expression levels of a pair of genes is established.
- a first strength of the association between the expression levels of that pair of genes is calculated.
- the above steps are repeated to calculate at least a second strength of association between the expression levels of the same pair of genes based on their data that represent or correspond to a second condition of interest.
- a score for the pair of genes is generated based on both the first and the second strengths to help evaluate the association between the expression levels of the pair of genes under both conditions of interest.
- Another strength of association between the expression levels of the same pair of genes, based on their data corresponding to another condition of interest, can be calculated. There is no limit on the number of conditions of interest for which strengths can be calculated.
- a score can be generated based on all the calculated strengths.
- a network of those genes may be produced based on the generated score.
- multiple scores may be generated for multiple pairs of genes, e.g., each possible pair of genes. And a network may be produced based on these multiple scores.
- the network of genes can be graphically displayed.
- FIG. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention
- FIG. 2A is an embodiment of a table including data from a data source for a plurality of variables under a first condition of interest
- FIG. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2A with a line fit to the data;
- FIG. 2C is an embodiment of a table including data from a data source for a plurality of variables under a second condition of interest;
- FIG. 2D is an embodiment of a scatter plot of the data for a pair of variables from the table of FIG. 2C with a line fit to the data;
- FIGS. 3A, 3B, and 3C are embodiments of a diagram where the lines from FIGS. 2B and 2D are presented;
- FIG. 4A is an embodiment of a graphical representation of a network of a pair of variables based on their associations;
- FIG. 4B is an embodiment of a graphical representation of a network of a plurality of variables based on their associations
- FIG. 5 is an embodiment of a matrix including examples of score values for each of the associations shown in FIG. 4B;
- FIG. 6 is a flow chart of an embodiment of an exemplary process that produces a score or a network using the associations between variables under at least two conditions of interest according to the principles of the invention
- FIG. 7 is a flow chart of an embodiment of an exemplary process of evaluating a score against a criterion
- FIG. 8 is an embodiment of a table illustrating permutation of the data originally shown in the table of FIG. 2A;
- FIG. 9 illustrates an exemplary process for permuting the data of a variable listed in FIG. 2C;
- FIG. 10 is an embodiment of a graph illustrating both results of scores generated in accordance with the invention using actual genomic data and results from an exemplary process used to determine a threshold value for evaluating the scores;
- FIG. 11 is an embodiment of a relevance network including genes identified in FIG.
- FIG. 12 is an embodiment of a scatter plot of actual genomic data for a pair of genes under two disease states with lines fit to the two sets of data.
- FIG. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30.
- the computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention.
- the processor in one embodiment is a 266 MHz Pentium® II processor, manufactured by Intel Corporation of Santa Clara, California.
- One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris®, manufactured by Sun Microsystems, Inc. of Palo Alto, California.
- the data source 30 in one embodiment is a database system, e.g., ORACLE® 8 or data stored in files on a data storage device, such as a hard disk.
- the processor of the computer system 20 executes data mining software. Such software is written in any programming language, such as C, C++, etc.
- the data in the data source 30 represent measurements of variables for multiple sample cases. These sample cases represent variables under at least two different conditions of interest. The conditions of interest may be conditions parallel in time or sequential in time. For example, in a medical context, the sample cases in one embodiment may be different individuals and the variables may be physical characteristics, such as weight, height, gender, race, etc., measured at the same time.
- sample cases may be divided into age groups (i.e. multiple parallel conditions of interest) of 11-20, 21-30, 31-40 and so on.
- the same variables may be measured at various time points (i.e. sequential conditions of interest), such as before and after the occurrence of an environmental change.
- the measured variables are continuous variables.
- the sample cases can be measurements of a single human subject conducted over a period of time, such as over a month with daily frequencies.
- the variables may be measurements of the same set of laboratory tests, such as hemoglobin, hematocrit, cholesterol and thyroxine measurements. And the data may be divided into those before the subject starts a drug therapy and afterwards (i.e. under two sequential conditions of interest).
- the sample cases are RNA expression measurements of various subjects and the measured variables are different genes. The cases may be divided into two different disease conditions, e.g., two forms of leukemia.
- sample cases are corporations for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc.
- the data may be divided into different conditions of interest such as according to difference in the industry, or market capitalization.
- FIG. 2A shows an exemplary tabular representation 40 of data in the data source that corresponds to a first condition of interest.
- the measured variables A, B, C, D, and E are represented on the x-axis in columns.
- the sample cases S1, S2, S3, and S4 on the y-axis, which all correspond to the first condition, are represented in rows.
- FIG. 2A shows an exemplary data set of twenty entries wherein the table 40 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 40 lack a data value, each denoted by a dashed line.
- Each pair of data values 42 appearing in the same sample in the table 40 represents a data point 44 in a scatter plot.
- FIG. 2B shows an embodiment of an exemplary scatter plot 43 of the data points 44 produced by data 42 for variables D and E.
- the data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14).
- Scatter plots can be produced for each pair of variables, such as for variables B and E, in like manner.
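As a minimal sketch of how data points could be assembled from a table with missing entries, the following assumes a hypothetical in-memory layout (a dictionary of sample cases, with `None` for an entry that lacks a data value). The function name `data_points` and all numbers are made up for illustration; only sample cases that have a value for both variables contribute a point.

```python
from typing import Dict, List, Optional, Tuple

# Rows are sample cases, columns are variables; None marks a missing entry.
# This layout is an assumption made for the sketch, not taken from the text.
Table = Dict[str, Dict[str, Optional[float]]]


def data_points(table: Table, var_x: str, var_y: str) -> List[Tuple[float, float]]:
    """Collect one (x, y) point per sample case that has a value for both variables."""
    points = []
    for sample in table.values():
        x, y = sample.get(var_x), sample.get(var_y)
        if x is not None and y is not None:
            points.append((x, y))
    return points


# Made-up numbers standing in for the data values of table 40.
table_40: Table = {
    "S1": {"D": 1.0, "E": 2.1},
    "S2": {"D": 2.0, "E": None},
    "S3": {"D": 3.0, "E": 6.2},
    "S4": {"D": None, "E": 8.0},
}

print(data_points(table_40, "D", "E"))
```

Dropping incomplete sample cases per pair, rather than per table, lets each pair of variables use as many data points as it has available.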
- a variety of methodologies can be used to calculate the strength of the association between a pair of variables. The following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships.
- One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points. This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
- Another methodology is to consider qualitative characteristics of the data 42. For instance, for empirical data, confidence in the data measurement itself may be taken into consideration in deciding whether an association should be valid to start with.
- Another exemplary methodology uses a linear regression model to process the data points 44. Still referring to FIG. 2B, a line 46 that best fits all the data points 44 is plotted.
- the slope of the line 46, sometimes called the regression coefficient (and sometimes denoted as β), is calculated.
- the value of β indicates how much change in the variable on the y-axis can be expected for a unit of change in the variable on the x-axis.
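The slope of a best-fit line can be sketched with an ordinary least-squares fit. The text does not say which fitting procedure is used, so least squares is an assumption here, and `fit_line` is a hypothetical helper name.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up points lying exactly on y = 2x.
beta, intercept = fit_line([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(beta, intercept)
```

With these points, a unit change on the x-axis corresponds to a change of two units on the y-axis, which is what the slope reports.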
- the linear regression model can also be used to compute a correlation coefficient
- the technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
- One technique computes a correlation coefficient with a value between -1 and 1.
- a correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope
- a correlation coefficient of -1 indicates a perfect linear inverse correlation
- a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables.
- the correlation coefficient is Pearson's correlation coefficient.
- the Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals.
- the correlation coefficient is a Spearman Rank correlation coefficient.
- the Spearman Rank correlation coefficient can be a more appropriate coefficient than the Pearson correlation coefficient when actual numerical values cannot be assigned to variables, but a rank order is assigned to each sample case of each variable.
- the square of the correlation coefficient, r² (typically referred to as the coefficient of determination), can be used.
- the value of r² ranges between 0 and 1. Because r² is the square of the correlation coefficient, the value is never negative regardless of the sign of the coefficient, and it tends to enhance the differences between correlation coefficient values that are highly correlated. That is, a correlation coefficient, r, of 0.5 has an r² of 0.25, whereas an r greater than about 0.71 has an r² greater than 0.5.
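The correlation measures discussed above can be sketched in plain Python. The helper names (`pearson`, `ranks`, `spearman`) are hypothetical, and giving tied values the average of their ranks is one common convention assumed here rather than something the text specifies.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: y grows monotonically but not linearly with x.
xs, ys = [1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]
r = pearson(xs, ys)
print(r, r ** 2, spearman(xs, ys))
```

In this example the rank-based coefficient reports a perfect monotonic relationship while the Pearson coefficient falls slightly short of 1, because the relationship is not a straight line.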
- Another technique for computing a correlation coefficient uses a nonlinear regression model.
- Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
- Another methodology computes the entropy (H) of the variables and the mutual information (MI) between each pair of variables.
- the entropy of a variable is a measure of the information content in that variable.
- Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
- Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges. The proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$. As n approaches infinity, the histogram increasingly models the probability density function for the variable.
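- Entropy can be calculated using the following equation:
- $H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
- The mutual information between a pair of variables A and B can be expressed in either of two equivalent forms:
- $MI(A, B) = H(A) - H(A \mid B)$
- $MI(A, B) = H(A) + H(B) - H(A, B)$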
- a mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately.
- a higher mutual information between two variables indicates that one variable is predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
- the computer system 20 can use the above-described equations to compute a mutual information relationship between pairs of genes.
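A histogram-based estimate of entropy and mutual information might look like the sketch below. It assumes equal-width sub-ranges, base-2 logarithms, and the identity MI(A,B) = H(A) + H(B) - H(A,B); the function names and the default of four bins are illustrative choices, not values from the text.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def discretize(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value to the index of its equal-width sub-range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    # The maximum value is clamped into the last sub-range.
    return [min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1) for v in values]


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy in bits of the observed label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(xs: Sequence[float], ys: Sequence[float], n_bins: int = 4) -> float:
    """MI(A,B) = H(A) + H(B) - H(A,B), estimated from histograms."""
    bx, by = discretize(xs, n_bins), discretize(ys, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


# Made-up data: a variable paired with itself, then two unrelated variables.
print(mutual_information([0, 1, 2, 3], [0, 1, 2, 3]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1], n_bins=2))
```

The first call pairs a variable with an exact copy of itself, so the mutual information equals the entropy of the variable; the second pairs two variables whose joint histogram carries no extra information, giving zero.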
- FIG. 2C shows an exemplary tabular representation 50 of data in the data source that corresponds to a second condition of interest.
- the same measured variables A, B, C, D, and E are represented on the x-axis in columns.
- the sample cases S5, S6, S7, and S8 on the y-axis, which all correspond to the second condition, are represented in rows.
- table 50 contains data values 52 (VAL16 - VAL32) for the variables, and each pair of the data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot.
- FIG. 2D shows an embodiment of an exemplary scatter plot 53 of the data points 54 produced by data 52 for variables D and E.
- the data points are (VAL26, VAL29), (VAL27, VAL30), and (VAL28, VAL31). Scatter plots can be produced for each pair of variables in like manner.
- a line 56 can also be plotted that best fits all the data points 54, and a slope of the line 56 (β') can be calculated.
- a correlation coefficient (r') can also be computed, e.g., through a linear or non-linear regression model. It is noted that the invention contemplates more than two conditions of interest, and the method of generating a strength for a given pair of variables can be repeated as many times as needed to calculate further strengths.
- a score can be generated in response to both strengths.
- the score may be formulated to evaluate how different or similar the association between the two variables is under the conditions of interest.
- FIGS. 3A-3C show three possibilities when the lines 46 and 56 are compared with each other.
- the line 46 is the line that best fits all the data points 44 based on data 42 of a pair of variables (D and E) under the first condition of interest.
- the line 56 is the line that best fits all the data points 54 based on data 52 of a pair of variables under the second condition of interest.
- the slopes of the two lines 46 and 56 are of the same sign, i.e., both are of the positive sign or of the negative sign. In the particular example shown in FIG. 3A, both slopes are of the positive sign. This indicates that the association between the pair of variables under the two conditions may be similar.
- the slopes of the two lines 46 and 56 are of opposite signs, e.g., one is of the positive sign and the other is of the negative sign.
- the scenario represented by FIG. 3B also includes the case where one of the slopes is zero, meaning no correlation under one of the conditions, while the other slope is nonzero.
- FIG. 3B indicates that the correlation between the pair of variables under the two conditions may be dissimilar.
- a third scenario is depicted in FIG. 3C where both slopes of the lines 46 and 56 are zero. This means that the two variables are statistically not correlated under either condition and are statistically independent of each other.
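The three scenarios of FIGS. 3A-3C can be sketched as a small classifier over the two slopes. The function name, the returned labels, and the tolerance `tol` used to decide when a slope counts as zero are assumptions added for illustration.

```python
def compare_slopes(beta: float, beta_prime: float, tol: float = 1e-9) -> str:
    """Classify a pair of slopes into the scenarios of FIGS. 3A, 3B, and 3C."""
    first_is_zero = abs(beta) <= tol
    second_is_zero = abs(beta_prime) <= tol
    if first_is_zero and second_is_zero:
        return "uncorrelated"   # FIG. 3C: both slopes are zero
    if first_is_zero or second_is_zero or (beta > 0) != (beta_prime > 0):
        return "dissimilar"     # FIG. 3B: opposite signs, or exactly one zero slope
    return "similar"            # FIG. 3A: same sign


for pair in [(0.8, 1.2), (0.8, -0.5), (0.0, 0.7), (0.0, 0.0)]:
    print(pair, compare_slopes(*pair))
```

With measured data a fitted slope is rarely exactly zero, so a practical version would likely need a larger tolerance or a significance test; that choice is outside what the text describes.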
- the score (S) may be formulated to accentuate similar or dissimilar correlation between a pair of variables under multiple conditions or the lack of correlation thereof.
- the score (S) may be generated in response to regression coefficients (β), correlation coefficients (r), another indicator of strength, or a combination of any of the above.
- the score (S) is calculated through the following equation:
- the score (S) is to be calculated through the following equation:
- the score (S) may be generated based on any combination of two or more of the calculated strengths (e.g. all the calculated strengths).
- Methods of computing the score are not limited to these exemplary embodiments. Other ways, including other equations and formulas, for generating a score are contemplated by this invention.
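Since other formulas are expressly contemplated, the sketch below shows one hypothetical way to turn two or more strengths into a single score: the spread between the largest and smallest correlation coefficient. This is an illustrative stand-in chosen for its simplicity, not an equation taken from the text, and `spread_score` is a made-up name.

```python
from typing import Sequence


def spread_score(strengths: Sequence[float]) -> float:
    """Hypothetical score: largest minus smallest correlation coefficient.

    For exactly two strengths r and r' this equals |r - r'|.
    """
    if len(strengths) < 2:
        raise ValueError("need strengths for at least two conditions of interest")
    if any(not -1.0 <= r <= 1.0 for r in strengths):
        raise ValueError("correlation coefficients must lie in [-1, 1]")
    return max(strengths) - min(strengths)


print(spread_score([0.9, -0.9]))       # association flips between the conditions
print(spread_score([0.5, 0.5, 0.4]))   # association is nearly the same throughout
```

A score of this kind accentuates dissimilar associations: it is near 0 when the pair behaves alike under every condition and approaches 2 when the correlation reverses.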
- a way of displaying the score is to build a network, e.g., a relevance network based on the scores generated. Because a relevance network depicts a relationship between two variables, a minimum of one score is needed.
- FIG. 4A shows an example of a relevance network 110 where a link 100 designated by a score (e.g. 0.4) for variables D and E links the pair. Both variables D and E are shown as a node 180 in the network. When multiple scores are generated for multiple pairs of variables, a relevance network can be built based on all the scores generated.
- FIG. 4B shows an example of a relevance network 110 that graphically links each of the variables A-E with a link 100.
- Each variable A, B, C, D, and E is shown as a node 180 in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E.
- Each link 100 has an assigned value (e.g. the score S) characterizing the relationship between pairs of variables.
- FIG. 5 shows another way of displaying or organizing multiple scores generated in accordance with the invention.
- An exemplary matrix 104, called a measure triangle, tabulates score values 108 assigned to each of the links 100 of FIG. 4B.
- the measure triangle 104 places the variables A, B, C, D, and E on both the x- and y-axes.
- Each value 108 in the measure triangle 104 represents the score determined for the association between the respective pair of variables.
- the values 108 shown are exemplary and selected only for illustrating the principles of the invention.
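A measure triangle can be sketched as a dictionary holding one score per unordered pair of variables. The names `measure_triangle`, `lookup`, and `toy_score` are hypothetical, and `toy_score` is a meaningless placeholder that only produces numbers to tabulate.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, str]


def measure_triangle(variables: Sequence[str],
                     score_of: Callable[[str, str], float]) -> Dict[Pair, float]:
    """Store one score for each unordered pair of variables."""
    return {(a, b): score_of(a, b) for a, b in combinations(variables, 2)}


def lookup(triangle: Dict[Pair, float], a: str, b: str) -> float:
    """Fetch the score of a pair regardless of the order of its variables."""
    return triangle[(a, b)] if (a, b) in triangle else triangle[(b, a)]


def toy_score(a: str, b: str) -> float:
    # Placeholder rule with no statistical meaning; a real score would be
    # derived from the strengths of association under each condition.
    return round(abs(ord(a) - ord(b)) / 10, 1)


variables = ["A", "B", "C", "D", "E"]
triangle = measure_triangle(variables, toy_score)

# Print the lower triangle, one row per variable.
for i, row in enumerate(variables[1:], start=1):
    print(row, [lookup(triangle, row, col) for col in variables[:i]])
```

Five variables yield ten pairs, matching the ten links of a fully connected network of five nodes.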
- FIG. 6 shows an exemplary process 15 for finding relationships among the variables D and E under two conditions of interest according to the principles of the invention.
- the process 15 obtains (step 60) a set of data that correspond to the first condition of interest from the data source 30.
- the data in the data set includes values for various variables for sample cases.
- the computer system 20 organizes (step 62) the obtained data in the data set.
- One exemplary data organization is the tabular representation 40 shown in FIG. 2A.
- the computer system 20 associates (step 64) two variables in the data set, in this case, variables D and E.
- the computer system 20 calculates (step 68) the strength of the association between variables D and E.
- strength is an indication of how closely the variables are related.
- a strong association indicates that the variables are closely related; a weak association indicates a low or no relationship between the variables.
- Variables can be related to each other in various ways.
- variables can be related through physiology; for example, the serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide.
- Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils.
- Some variables can be directly or indirectly related to each other through other variables.
- An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
- Other variables can have a relationship with each other relating to a pathologic condition.
- An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state.
- Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule.
- the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
- the exemplary process 15 continues with regard to the second condition of interest.
- the process 15 obtains (step 70) a set of data that correspond to the second condition of interest from the data source 30.
- the data in the data set includes values for various variables for sample cases.
- the computer system 20 organizes (step 72) the obtained data in the data set.
- One exemplary data organization is the tabular representation 50 shown in FIG. 2C.
- the computer system 20 associates (step 74) two variables in the data set, in this case, variables D and E.
- the computer system 20 calculates (step 78) the strength of the association between variables D and E. Similarly to the step 68, strength is an indication of how closely the variables are related, in this case, under the second condition of interest. After both strengths of association between the variables D and E under the first and second condition have been calculated, the computer system 20 generates a score according to a predetermined formula or equation (step 80). The score may be formulated to evaluate how different or similar the association between the two variables is under the conditions of interest. The computer system 20 may then display the resulting score in step 90 at a client system.
- the computer system 20 optionally constructs (step 84) a graphic network of variables (e.g. a relevance network) using the score generated in step 80.
- the computer system 20 optionally generates more scores for other pairs of the variables in the data source 30 (step 82). These scores may be generated using the same process outlined for the first score.
- the computer system 20 may further build a graphic network such as a relevance network (step 84) based on the multiple scores generated in step 82.
- all variables in the data source 30 are paired with each other to generate multiple scores, and in the network of variables based on all of the scores generated, each variable is linked to every other variable (e.g., see FIG. 4B).
- the result may be displayed in step 90 at the client system.
- the computer system 20 evaluates or screens (step 92) the score for a pair of variables against a predetermined criterion.
- the criterion can be a threshold value.
- the computer system 20 first generates a criterion (step 94) if one does not already exist, then compares the score with the criterion (step 96). It then removes (step 98) the score that fails to satisfy the predetermined criterion.
- the predetermined criterion can require the score of the association between each pair of variables to be above the threshold value.
- the predetermined criterion can require the score of the association for a variable pair to be below the threshold value in order for that score to remain.
- the computer system 20 also removes (step 102) each variable that has no score after step 98; that is, all scores based on the associations of that variable with another variable fail to satisfy the criterion.
- the remaining variables form one or more graphic networks such as relevance networks.
- each network is displayed, e.g., at the client system.
- the removal of associations and variables can divide the network of variables into smaller, separate networks.
- Each such smaller network is a relevance network because that smaller network represents a group of correlated variables under the conditions of interest.
- Each variable in that smaller network has an association with at least one other variable in that network whose score satisfies the criterion.
- the criterion may cause the removal of none, one, or multiple scores without the removal of any variables.
- the relevance network includes all of the variables in the data set.
- scores may optionally be screened against a criterion.
- the screen operates as a filter that removes weakly correlated or non-correlated associations and variables from the network of variables to produce one or more relevance networks.
- the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
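The screening steps can be sketched as a filter followed by a search for connected groups of variables. The sketch assumes the criterion "score above the threshold"; the function name, the scores, and the threshold are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def relevance_networks(scores: Dict[Pair, float],
                       threshold: float) -> Tuple[Dict[Pair, float], List[Set[str]]]:
    """Drop scores at or below the threshold, then group the remaining variables."""
    kept = {pair: s for pair, s in scores.items() if s > threshold}

    # Variables with no surviving score never enter the adjacency map,
    # which removes them from every network.
    adjacency = defaultdict(set)
    for a, b in kept:
        adjacency[a].add(b)
        adjacency[b].add(a)

    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over the surviving links
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return kept, networks


# Made-up scores for some pairs among variables A-E.
scores = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.2,
          ("C", "D"): 0.1, ("D", "E"): 0.8, ("A", "E"): 0.05}
kept, networks = relevance_networks(scores, threshold=0.5)
print(kept)
print(networks)
```

In this example variable C loses all of its scores and disappears, and the remaining variables split into two separate networks.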
- the criterion is a threshold value against which the strength of each association is measured.
- the threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
- Another exemplary technique for setting the threshold value utilizes data permutation.
- the data for each variable under each condition of interest is first randomly permuted, then the scores for each pair of variables with permuted data are recalculated, and the threshold value is set based on the new score or scores.
- One example is to set the threshold value equal to or several times greater than the highest value of the score obtained through random permutation of the data.
- Various techniques for establishing the threshold value may be combined. For example, after the highest value of the score is obtained through data permutation, trial and error may be used until the number of variables left is in a manageable range.
- the manner of permuting the data of each variable is independent of the manner used for each other variable.
- a permutation entails redistributing the original data values among the samples under the same condition of interest.
- the permutation of the data creates new data points between variables for each condition of interest.
- FIG. 8 shows an exemplary permutation of the data in table 40, which is originally shown in FIG. 2A.
- the permutation shown in FIG. 8 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in FIG. 2A, namely (VAL2, VAL6) and (VAL3, VAL8).
- data are permuted within a given variable across the samples.
- an original data value for variable A for example, will only be shuffled among the samples to still represent a value for variable A, and not for any other variables.
- the average value of a given variable under a given condition will remain the same in this embodiment.
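The permutation-based threshold can be sketched end to end as follows. Several pieces are assumptions made so the example is runnable: the strength is a Pearson correlation (treated as 0 for a constant variable), the score is the hypothetical |r - r'|, the default of 100 rounds is arbitrary, and the data are made up.

```python
import math
import random
from itertools import combinations
from typing import Dict, List

Condition = Dict[str, List[float]]  # variable name -> values, one per sample case


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation, clamped to [-1, 1]; 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def pair_score(first: Condition, second: Condition, a: str, b: str) -> float:
    """Hypothetical score: |r - r'| between the two conditions."""
    return abs(pearson(first[a], first[b]) - pearson(second[a], second[b]))


def shuffled(condition: Condition, rng: random.Random) -> Condition:
    """Shuffle each variable's values among the samples, independently per variable."""
    return {var: rng.sample(values, len(values)) for var, values in condition.items()}


def permutation_threshold(first: Condition, second: Condition,
                          rounds: int = 100, seed: int = 0) -> float:
    """Highest score seen over all pairs across repeated random permutations."""
    rng = random.Random(seed)
    highest = 0.0
    for _ in range(rounds):
        p1, p2 = shuffled(first, rng), shuffled(second, rng)
        for a, b in combinations(sorted(first), 2):
            highest = max(highest, pair_score(p1, p2, a, b))
    return highest


# Made-up data for three variables under two conditions of interest.
cond_1 = {"A": [5.0, 3.0, 8.0, 1.0], "D": [1.0, 2.0, 3.0, 4.0], "E": [2.0, 4.0, 6.0, 8.0]}
cond_2 = {"A": [2.0, 7.0, 4.0, 9.0], "D": [1.0, 2.0, 3.0, 4.0], "E": [8.0, 6.0, 4.0, 2.0]}
print(permutation_threshold(cond_1, cond_2, rounds=20, seed=1))
```

Because values are only shuffled within a variable and within a condition, each variable keeps the same set of values, and therefore the same average, after permutation.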
- FIG. 9 illustrates a preferred embodiment of data permutation. Each variable's data under one condition of interest, as exemplified here by variable A's data under the second condition of interest as originally listed in the table 50 of FIG. 2C, is reproduced here.
- Each data entry of variable A is randomly assigned a corresponding value (e.g. a number between 1 and 100).
- VAL 16 is randomly assigned, in this example, the corresponding value of "22."
- VAL 17 is randomly assigned, in this example, the corresponding value of "22."
- The data entries are then sorted according to the order of their assigned corresponding values. One may choose to sort the corresponding values in ascending order (as in FIG. 9), descending order, or any other order, and then reassign each corresponding value, along with the data entry associated with it, to the sample cases in sequence (e.g., from S5 to S8 as shown in FIG. 9). Because each original data entry goes with its assigned corresponding value, it is permuted as a result.
- VAL 18, which has been randomly assigned a corresponding value of "15," is shuffled to Sample 5 because "15" ranks first in the ascending order of all corresponding values and Sample 5 is the first case in sequence.
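The key-and-sort procedure of FIG. 9 might be sketched as below. The function name, the optional `keys` argument (added so the example is reproducible), and the entry labels in the usage note are hypothetical; the 1 to 100 key range follows the example given in the text.

```python
import random


def permute_by_random_keys(entries, samples, keys=None, rng=None):
    """Assign each data entry a random key, sort the entries by key in
    ascending order, and hand them out to the samples in sequence."""
    rng = rng or random.Random()
    if keys is None:
        keys = [rng.randint(1, 100) for _ in entries]
    # sorted() is stable, so entries with equal keys keep their original order.
    ranked = sorted(zip(keys, entries), key=lambda pair: pair[0])
    return {sample: entry for sample, (_, entry) in zip(samples, ranked)}
```

For instance, with hypothetical entries `["w", "x", "y", "z"]`, keys `[22, 40, 15, 73]`, and samples `["S5", "S6", "S7", "S8"]`, the entry keyed 15 lands in S5, mirroring how VAL 18 moves to Sample 5 in the description.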
- the system may display graphical representations of the network (e.g. relevance networks).
- the relevance networks 110 shown in FIGS. 4A and 4B are exemplary.
- Application of the invention works with large numbers of variables.
- the computer system 20 can execute graph layout software.
- An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
- execution of the data mining software causes the computer system 20 to access data in the data source 30.
- the data mining software first accesses data that correspond to the first condition of interest and associates two variables with each other. There may be multiple samples under the first condition that contain data for that pair of variables.
- the software calculates, based on all the data of the pair of variables under the first condition, a strength of the association according to a predetermined formula or equation.
- the software then accesses data that correspond to the second condition of interest and associates the same pair of variables under the second condition of interest. Similarly, there may be multiple samples under the second condition that contain data for the pair of variables.
- a strength of the new association, based on all the data for the pair of variables under the second condition, is calculated by the software according to a predetermined formula or equation, which may be the same formula or equation used to calculate the first strength.
- the strength of the association can be embodied in a correlation coefficient, the slope of the line that fits all the data points for the pair of variables, or a combination of both.
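As a rough illustration of the two strength measures named above, the following sketch computes the least-squares slope and the squared correlation coefficient for one pair of variables under one condition. The function name is an assumption, and ordinary least squares is used here as one plausible way to fit the line; the text does not prescribe a fitting method at this point.

```python
def fit_strength(xs, ys):
    """Return (slope, r_squared) of the least-squares line through the
    points (xs[i], ys[i]). Raises ZeroDivisionError if either variable
    is constant across the samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared
```

Running the function once per condition of interest yields the first and second strengths for the same pair of variables.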
- the software then generates a score for that pair of variables based on the two strengths calculated from data corresponding to two conditions of interest. If there are more than two conditions of interest, the software repeats the steps of accessing the data corresponding to a new condition of interest (e.g. a third condition), associating the same pair of variables, and calculating a new strength of the association for the pair under the new condition.
- a score designed to incorporate the strength for the association between the pair of variables under the new condition may then be generated.
- the score is derived from the strengths of the association between a pair of variables corresponding to at least two conditions of interest.
- the score may be designed to evaluate how different or similar the association between the two variables is under conditions of interest.
- a score may be designed to measure the correlation between the body length and the body mass under different gravitational conditions. The score may be designed to examine if the change in gravity causes the two variables, body length and body mass, to relate to each other differently. By way of illustration, if the data show that a mouse with a larger body mass tends to be longer on the earth than other mice on the earth, but a mouse with a larger mass is not any longer than other mice in space, then the score would be high. A high value for the score in this case indicates that gravity may be a factor in any correlation between the mouse's body mass and body length.
- a score may be designed to accentuate differences in the correlation between the expression levels of each pair of genes under two disease states. A high score in this case indicates that the two genes might be functionally linked in developing one or both of the diseases.
- the data mining software may output each score for display (e.g., at the computer system).
- the displayed output makes it readily apparent what kind of correlation exists between the two variables under the at least two conditions and therefore indicates a potentially worthy target for research.
- the software may continue to generate scores for any number of the rest of the variables until each variable has been paired with another one.
- An advantage of the invention is the lack of bias or assumption in choosing the association. Because each possible pair of variables may be examined, a method in accordance with the invention can be used to discover unknown links between seemingly unrelated variables affected by or causing certain conditions.
- the software may then establish a criterion or threshold value and screen at least one of the generated scores against the criterion or threshold.
- the software may randomly permute the data to establish a threshold.
- the data mining software may group together variables into one or more separate networks. It may include in the network each pair of variables whose score meets the criterion or threshold. However, even for the purpose of constructing a graphic network of variables, the establishment of a criterion or threshold is not necessary. For example, in certain preferred embodiments, each score already indicates how similar or dissimilar the associations between two variables are under at least two different conditions. If one or more of such scores are used to indicate how various nodes (i.e., variables) are linked in a graphic fashion as exemplified in FIGS. 4A and 4B, a criterion for the score is optional because the score is self-explanatory in indicating what kind of correlation exists between the two variables under the at least two conditions. If, however, a threshold is established, it may be used to further screen out correlations that are considered less significant.
- the software may further output each network for display (e.g., at the computer system).
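The grouping of variables into separate networks can be sketched as a connected-components search over the pairs that pass the threshold. This is an illustrative sketch only: the function name and the dictionary layout are assumptions, and treating "meets the threshold" as greater than or equal is likewise an assumption.

```python
from collections import defaultdict


def build_networks(scores, threshold):
    """scores maps (variable_a, variable_b) -> score. Pairs scoring at or
    above the threshold become links; each connected group of linked
    variables is returned as one network (a set of variable names)."""
    adjacency = defaultdict(set)
    for (a, b), score in scores.items():
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over linked variables
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return networks
```

Variables whose every score falls below the threshold appear in no network, which matches the idea that the criterion determines which variables and associations are displayed.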
- relevance networks may be produced to exhibit potential correlation between biological traits, e.g., genetic compositions, when normal cells are compared to diseased cells, or one type of diseased cell is compared to another.
- Relevance networks may also be produced to expose potential correlation between biological traits before and after an event, such as the start of a drug therapy.
- ALL: acute lymphocytic leukemia
- AML: acute myelogenous leukemia
- a disease state is an alteration in the normal cellular genetic regulatory network.
- a disease state may represent a change in these gene-gene or gene-protein interactions without significant changes in the absolute expression levels of any genes.
- the tumor suppressor p53 may upregulate p21/waf1 to control progression into the cell cycle. Though mutant p53 may cause biologically significant changes in the downstream control of effector genes, the absolute expression values of those genes may not vary significantly from normal.
- Models of genetic regulatory networks can be ascertained from gene-expression data sets, with varying degrees of concordance with biologically-proven control networks.
- a central hypothesis is that finding differences between the models for two diseases, rather than the differences in individual gene expression levels, will elucidate the critical biological differences between the diseases.
- a score was formulated to reflect the above hypotheses and the data for each gene pair generated a score for that pair. Then, a threshold value was established through random permutation of the original data. A relevance network was then generated using scores that met the threshold. A total of 24 genes were included in the network.
- Specifically, publicly available expression levels of 7,129 expressed sequence tags measured in 47 patients with ALL and 25 patients with AML were obtained from the Whitehead Institute (http://www.genome.wi.mit.edu/MPR). A computer system considered every possible pair of genes as a potential biological relationship.
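To give a sense of scale for "every possible pair," the sketch below enumerates unordered pairs and counts them. The helper names are assumptions for illustration; the count is simply n(n-1)/2.

```python
from itertools import combinations
from math import comb


def all_pairs(variables):
    """Yield every unordered pair of distinct variables exactly once."""
    return combinations(variables, 2)


def pair_count(n):
    """Number of unordered pairs among n variables: n * (n - 1) / 2."""
    return comb(n, 2)
```

For the 7,129 expressed sequence tags mentioned above, `pair_count(7129)` gives 25,407,756 candidate gene-gene associations, each of which is scored under both conditions.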
- the computer system constructed two scatter plots: one for the ALL patients (first condition of interest) and one for the AML patients (second condition of interest).
- In the ALL scatter plot for the association between gene A and gene B, each of the 47 ALL patients is represented as a data point 44 in a two-dimensional space similar to the scatter plot 43 in FIG. 2B, where the x-coordinate specifies the expression measurement for gene A, and the y-coordinate is the expression measurement for gene B.
- $S(\text{geneA}, \text{geneB}) = \operatorname{abs}(M_{AML} - M_{ALL}) \times \operatorname{abs}(R_{AML}) \times \operatorname{abs}(R_{ALL})$
- abs is the absolute value function
- $M_{ALL}$ is the slope using the ALL patient-points
- $M_{AML}$ is the slope using the AML patient-points
- $R_{ALL}$ is the square of the correlation coefficient ($r^2$) for the regression using the ALL patient-points (ranging from 0 to 1.0), and $R_{AML}$ is ($r^2$) using the AML patient-points.
- the score S was based on (1) the correlation coefficients in the regression models for both diseases, and (2) the differences in slopes between the two linear regression models under the two diseases. A gene-gene association would have a high S, with a maximum of 2.0, when there was a large difference in the slopes between the ALL and AML models, and high correlation coefficients in both regression models.
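The score S, as described above, multiplies the absolute slope difference by the two squared correlation coefficients. A minimal sketch follows; the function and parameter names are assumptions. The description elsewhere quotes a slope "in radians," so the optional `slope_to_angle` helper shows one possible conversion, which is itself an assumption about how the slopes were expressed.

```python
from math import atan


def slope_to_angle(slope):
    """Assumed conversion: express a regression slope as an angle in radians."""
    return atan(slope)


def score_s(m_all, r2_all, m_aml, r2_aml):
    """S = abs(M_AML - M_ALL) * abs(R_AML) * abs(R_ALL), where M is the
    slope and R is the squared correlation coefficient per disease."""
    return abs(m_aml - m_all) * abs(r2_aml) * abs(r2_all)
```

A pair with similar slopes in both diseases, or with a weak fit in either disease, scores near zero; only pairs that fit well in both diseases yet differ in slope score highly.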
- FIG. 10 shows, in bold line, the distribution 140 of S based on the actual data.
- the x-axis of the diagram in FIG. 10 shows the value for score (based on actual or permuted data).
- the y-axis indicates the count of the score.
- a threshold 130 for S was set as three times the highest S', or about 0.3. All gene-gene associations with a score below threshold 130 were eliminated. There were 15 gene-gene associations that had an S with a value higher than 0.3, connecting 24 genes. These 24 genes are listed in Tables I-IV.
- This technique and its application to the leukemia data set are significant for several reasons.
- this technique allows comparison of the regulatory networks of multiple forms of leukemia. It is noted that while two forms of leukemia are examined in this example, the method can be easily expanded to include more than two forms of disease.
- Table III lists 8 of the 24 genes that could possibly have a role in oncogenesis. None of these 24 genes were found in the Whitehead analysis, where genes were listed based on whether their absolute expression measurements were different between ALL and AML.
- For example, two genes listed in Table I have been known to be involved in chromosomal translocations. The gene EAP is fused with AML1 in the t(3;21) translocation seen in acute and chronic myeloid leukemias. L-myc and rlf can be found to be fused in small-cell lung cancer, and the fusion product has been found to be deregulated compared to the normal L-myc gene. Gene rearrangements involving L-myc have also been seen in multiple myeloma.
- a dotted line fits the hollow dots and has a slope of -0.014 in radians.
- the ($r^2$) value and the slope indicate that the two genes were not positively (rather, negatively) correlated in patients with AML. As a result, their score was above the threshold value.
- these two genes may be functionally linked in ALL and not AML, because (1) the two genes share a 5'-region susceptible to trinucleotide repeats, and thus may also share a common regulatory region, (2) the 5'-regions may be affected by cancerous cellular machinery affecting the trinucleotide repeats, causing loss of regulation in AML, or (3) in normal individuals, these regions may affect expression of the downstream products similarly, and may serve as a marker linked with the development of leukemia.
- the method of the invention is particularly advantageous in investigating complex diseases such as leukemia because, with potentially multiple etiologies and disruptions in several regulatory networks, these diseases are most likely caused by abnormalities in gene regulation, which may not be manifested through individual genes with different expression levels.
- samples from two diseases can now be analyzed by examining all gene-gene interactions and not just single gene expression levels, yielding quantitatively strong associations that behave significantly differently between the diseases. This is ideal for identifying key controlling differentiators, which are crucial both for clinical diagnosis and for improved pathophysiologic understanding.
- the present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture.
- the article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape.
- the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, or PROLOG, or in any byte code language such as JAVA.
- the software programs may be stored on or in one or more articles of manufacture as object code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2002219817A AU2002219817A1 (en) | 2000-11-10 | 2001-11-12 | Relevance networks for visualizing clusters in gene expression data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US24711000P | 2000-11-10 | 2000-11-10 | |
US60/247,110 | 2000-11-10 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2002039214A2 true WO2002039214A2 (en) | 2002-05-16 |
WO2002039214A3 WO2002039214A3 (en) | 2002-11-21 |
WO2002039214A9 WO2002039214A9 (en) | 2003-02-06 |
Family
ID=22933592
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/043604 WO2002039214A2 (en) | 2000-11-10 | 2001-11-12 | Relevance networks for visualizing clusters in gene expression data |
PCT/US2001/047087 WO2002059820A1 (en) | 2000-11-10 | 2001-11-13 | Method and apparatus for determining fold difference significance |
PCT/US2001/047163 WO2002046474A2 (en) | 2000-11-10 | 2001-11-13 | Method and system for identifying time-series relationships of gene expression level using signal processing metrics |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/047087 WO2002059820A1 (en) | 2000-11-10 | 2001-11-13 | Method and apparatus for determining fold difference significance |
PCT/US2001/047163 WO2002046474A2 (en) | 2000-11-10 | 2001-11-13 | Method and system for identifying time-series relationships of gene expression level using signal processing metrics |
Country Status (2)
Country | Link |
---|---|
AU (2) | AU2002219817A1 (en) |
WO (3) | WO2002039214A2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117151031A (en) * | 2023-10-26 | 2023-12-01 | 国网经济技术研究院有限公司 | Design evaluation method and system for parallel busbar of high-power electronic device |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3016362A1 (en) | 2014-11-03 | 2016-05-04 | OpenTV Europe SAS | Method and system to share advertisement content from a main device to a secondary device |
US11892990B2 (en) | 2021-01-04 | 2024-02-06 | International Business Machines Corporation | Removal of transaction noise |
-
2001
- 2001-11-12 WO PCT/US2001/043604 patent/WO2002039214A2/en not_active Application Discontinuation
- 2001-11-12 AU AU2002219817A patent/AU2002219817A1/en not_active Abandoned
- 2001-11-13 WO PCT/US2001/047087 patent/WO2002059820A1/en not_active Application Discontinuation
- 2001-11-13 AU AU2002236577A patent/AU2002236577A1/en not_active Abandoned
- 2001-11-13 WO PCT/US2001/047163 patent/WO2002046474A2/en not_active Application Discontinuation
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117151031A (en) * | 2023-10-26 | 2023-12-01 | 国网经济技术研究院有限公司 | Design evaluation method and system for parallel busbar of high-power electronic device |
CN117151031B (en) * | 2023-10-26 | 2024-01-30 | 国网经济技术研究院有限公司 | Design evaluation method and system for parallel busbar of high-power electronic device |
Also Published As
Publication number | Publication date |
---|---|
WO2002039214A3 (en) | 2002-11-21 |
WO2002046474A3 (en) | 2002-11-14 |
WO2002046474A2 (en) | 2002-06-13 |
WO2002059820A9 (en) | 2003-11-20 |
AU2002236577A1 (en) | 2002-06-18 |
WO2002039214A9 (en) | 2003-02-06 |
AU2002219817A1 (en) | 2002-05-21 |
WO2002059820A1 (en) | 2002-08-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/13-13/13, DRAWINGS, REPLACED BY NEW PAGES 1/13-13/13; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |
|
WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |