WO2001016805A2 - A system and method for mining data from a database using relevance networks - Google Patents

A system and method for mining data from a database using relevance networks

Info

Publication number
WO2001016805A2
WO2001016805A2 PCT/US2000/024257 US0024257W WO0116805A2 WO 2001016805 A2 WO2001016805 A2 WO 2001016805A2 US 0024257 W US0024257 W US 0024257W WO 0116805 A2 WO0116805 A2 WO 0116805A2
Authority
WO
WIPO (PCT)
Prior art keywords
variables
association
strength
data
network
Prior art date
Application number
PCT/US2000/024257
Other languages
French (fr)
Other versions
WO2001016805A3 (en)
Inventor
Atul Janardhan Butte
Isaac S. Kohane
Original Assignee
Children's Medical Center Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Children's Medical Center Corporation
Priority to CA002383549A (CA2383549A1)
Priority to EP00959855A (EP1266305A2)
Priority to JP2001520685A (JP2003527662A)
Priority to AU71105/00A (AU7110500A)
Publication of WO2001016805A2
Publication of WO2001016805A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/288: Entity relationship models
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00: Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03: Data mining
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • the invention relates generally to data processing. More specifically, the invention relates to a system and method for mining data from a data set to identify potentially meaningful relationships among variables in the data set.
  • the simple criteria matching technique measures RNA expression levels before and after an intervention. For each gene, fold-differences are calculated. The genes are then sorted according to the calculated fold-differences. Genes showing a fold-change greater than a given threshold are "clustered" with the intervention.
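The fold-difference procedure just described can be sketched as follows. This is a minimal illustration only: the function name, the default threshold of 2.0, the symmetric treatment of increases and decreases, and the expression values are all assumptions, not details taken from the source.

```python
def fold_change_cluster(before, after, threshold=2.0):
    """Return (gene, fold) pairs whose fold-change exceeds `threshold`, largest first."""
    folds = {}
    for gene, base in before.items():
        if gene in after and base > 0 and after[gene] > 0:
            ratio = after[gene] / base
            # Assumption: a 4x increase and a 4x decrease count as the same fold-change.
            folds[gene] = max(ratio, 1.0 / ratio)
    ranked = sorted(folds.items(), key=lambda kv: kv[1], reverse=True)
    return [(gene, fold) for gene, fold in ranked if fold > threshold]


# Hypothetical expression levels before and after an intervention.
before = {"geneA": 10.0, "geneB": 50.0, "geneC": 8.0}
after = {"geneA": 45.0, "geneB": 55.0, "geneC": 2.0}
print(fold_change_cluster(before, after))
```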
  • the self-organizing map technique represents genes as multi-dimensional points in a multi-dimensional space. Coordinates for these points represent expression levels of each gene at various moments in time. A grid of centroids is imposed in the multi-dimensional space, and the centroids are allowed to drift. Each centroid drifts towards a collection of points. When the drifting completes, the centroids identify clusters of genes that exhibit similar time-course behavior. In this way, related genes have a smaller Euclidean distance in the multi-dimensional space. However, large numbers of dimensions can cause the technique to become computationally intensive. Moreover, the resulting gene/time-course clusters provide little information about specific gene-to-gene relationships among the genes in the clusters.
  • Techniques that perform comprehensive pair-wise comparisons generally compare each gene against each other gene using a metric.
  • One particular technique creates a vector for each gene. The vector is made up of expression levels taken at various times. Each gene is compared against each other gene by recording the correlation coefficient between the corresponding vectors. The technique then constructs a phylogenetic-type tree with branch lengths between genes being proportional to the correlation coefficients. However, phylogenetic-type trees, in general, do not show more than the most correlated relationships of each gene, omitting the lesser correlated, yet potentially significant relationships.
  • Another technique combines the Euclidean distance and pair-wise comparison techniques by constructing phylogenetic-type trees with branch length proportional to the Euclidean distance between genes. The coordinates again represent expression levels at various time points.
  • the present invention relates to a system and method for producing a network of related variables.
  • An objective of the invention is to group variables occurring in data extracted from a data source in a manner that makes readily apparent any potentially significant relationships among those variables and consequently motivate hypotheses for targeted research.
  • Another objective is to concurrently examine relationships among large numbers of variables.
  • the invention features a method that obtains data for a plurality of variables.
  • An association between each pair of variables is established.
  • a strength of the association between each pair of variables is calculated and evaluated according to a predetermined criterion.
  • a network of variables is produced.
  • the network of variables includes each association having a strength that satisfies the criterion.
  • the variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.).
  • the network of variables can be graphically displayed.
  • the network of variables is produced by including each established association irrespective of the strength of that association, and subsequently removing each association from the network of variables that fails to satisfy the criterion.
  • One embodiment includes each variable in the network of variables, and subsequently removes that variable from the network of variables if all associations with that variable fail to satisfy the criterion. Removing a variable from the network of variables can produce a plurality of separate networks of variables.
  • the method establishes the criterion as a threshold value for the strength of the association.
  • each association having a strength above the threshold value satisfies the criterion.
  • the strength of the association between the two variables can be calculated using mutual information between the variables.
  • Other embodiments use a linear regression model (e.g., computing a Pearson correlation coefficient) or a non-linear regression model.
  • the threshold value is determined by randomly permuting the data for each pair of variables. A strength of the association between each pair of variables is calculated from the permuted data. The steps of permuting and calculating are repeated a predetermined number of times. The strongest association is determined from the strengths of associations determined using permuted data. The threshold value is set equal to the strongest association.
  • the invention relates to a system for producing a network of related variables. The system includes memory storing data for the plurality of variables. An associator establishes an association between each pair of variables in the network of variables. A calculator calculates the strength of the association between each pair of variables.
  • An evaluator evaluates the strength of the association between each pair of variables according to a predetermined criterion.
  • a network generator produces a network of variables that includes each association that satisfies the criterion.
  • in another aspect, the invention relates to a system for determining a strength of association between any two of a plurality of variables.
  • the system includes memory, storing data for two or more variables, and a processor in communication with the memory.
  • the processor executes software that (1) establishes an association between each pair of variables to produce a network of variables, (2) calculates from the data a strength of the association between each pair of variables, (3) evaluates the strength of each association according to a predetermined criterion, (4) produces a network of variables that includes each association, (5) removes each association from the network of variables that fails to satisfy the criterion, and (6) graphically displays the network of variables.
  • Fig. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention
  • Fig. 2A is an embodiment of a table including data from a data source for a plurality of variables
  • Fig. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of Fig. 2A;
  • Fig. 2C is an embodiment of a scatter plot of the data for another pair of variables from the table of Fig. 2A;
  • Fig. 3 is a flow chart of an embodiment of an exemplary process that produces relevance networks using the associations between variables in a data set according to the principles of the invention
  • Fig. 4 is a block diagram of an embodiment of a graphical representation of the associations between each pair of variables in the data set;
  • Fig. 5 is an embodiment of a variable matrix including examples of strength values for each of associations shown in Fig. 4;
  • Fig. 6 is an embodiment of a table illustrating an exemplary permutation of the data in the table shown in Fig. 2A;
  • Fig. 7 is an embodiment of a graph illustrating results from an exemplary process used to determine a threshold value for evaluating the strengths of the associations between each pair of variables;
  • Figs. 8A, 8B, 8C are embodiments of relevance networks produced by applying different criteria to the variables and links shown in Fig. 4 with the exemplary associated strength values of Fig. 5;
  • Fig. 9 is an embodiment of a relevance network produced from actual genomic data.
  • Fig. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30.
  • the computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention.
  • the processor in one embodiment is a 266 MHz Pentium II™ processor, manufactured by Intel Corporation of Santa Clara, California.
  • One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris, manufactured by Sun
  • the data source 30 in one embodiment is a database system, e.g., ORACLE 8™, or data stored in files on a data storage device, such as a hard disk.
  • the processor of the computer system 20 executes data mining software.
  • data mining software is written in any programming language, such as C, C++, etc.
  • the data in the data source 30 represent measurements of multiple variables for various sample cases.
  • the sample cases in one embodiment are individuals and the measured variables are physical characteristics, such as weight, height, age, gender, race, etc.
  • the sample cases in one embodiment pertain to a single patient evaluated at different time intervals.
  • the patient is subject to particular laboratory tests, such as hemoglobin, hematocrit, and thyroxine measurements taken over a period of time.
  • the measured variables are continuous variables.
  • the sample cases are RNA expression measurements and the measured variables are genes.
  • the sample cases are corporate institutions for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc., acquired over time. In general, the principles of the invention can be practiced to examine any type of data in search of relationships among various measured variables.
  • the invention can mine data from databases containing customer sales transactions, commercial passenger travel information (e.g., airline), financial data, and data collected by laboratories, research facilities, commercial institutions, finance institutions, etc.
  • An advantage is that the invention can exploit existing electronic databases.
  • execution of the data mining software causes the computer system 20 to access data in the data source 30.
  • the data mining software associates each variable in the accessed data with every other variable and determines the significance of the association between each pair of variables. Significance can be defined according to a predetermined criterion.
  • the data mining software groups together variables into one or more separate relevance networks.
  • Each relevance network represents a group of related variables; that is, each variable in a relevance network has a significant association (as defined by the criterion) with at least one other variable in that relevance network, and does not have a significant association (as defined by the criterion) with variables in other relevance networks.
  • the data mining software outputs each relevance network for display (e.g., at the computer system). The displayed output makes it readily apparent that a relationship potentially worthy of targeted research was detected among variables in the data.
  • Fig. 2A shows an exemplary tabular representation 50 of data in the data source.
  • the sample cases S1, S2, S3, and S4 on the y-axis are represented as rows. This column and row arrangement is exemplary; the sample cases and variables can appear on either the x- or y-axis and remain within the scope of the invention. In addition, the principles of the invention extend to more sample cases and to variables other than those shown in Fig. 2A.
  • the table 50 can be completely, densely, or sparsely populated with data values 52.
  • Fig. 2A shows an exemplary data set of twenty entries wherein the table 50 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 50 lack a data value, each denoted by a dashed line.
  • the data values 52 are used to determine the degree of a relationship between each pair of variables.
  • Each pair of data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot.
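Because the table can be sparsely populated, a data point for a pair of variables exists only in rows where both variables have a value. The sketch below illustrates that pairing; the function name, the use of `None` for a missing entry, and the numbers are assumptions chosen for illustration.

```python
def data_points(table, var_x, var_y):
    """Collect (x, y) pairs from the rows in which both variables have a value."""
    points = []
    for row in table:
        x, y = row.get(var_x), row.get(var_y)
        if x is not None and y is not None:
            points.append((x, y))
    return points


# Hypothetical table: one dict per sample case (row); None marks a missing entry.
table = [
    {"A": 1.0, "B": None, "E": 4.0},
    {"A": 2.0, "B": 0.5, "E": None},
    {"A": 3.0, "B": None, "E": 9.0},
]
print(data_points(table, "A", "E"))
```

The number of points returned for a pair is also the quantity a minimum-data-points criterion would inspect.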
  • Fig. 2B shows an embodiment of an exemplary scatter plot of the data points 54 produced by the data for variables D and E.
  • the data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14).
  • Fig. 2C illustrates another embodiment of an exemplary scatter plot of the data points for variables A and E, where the data points are
  • Fig. 3 shows an exemplary process for finding relationships among the variables A, B, C, D, and E according to the principles of the invention.
  • the process obtains (step 60) a set of data from the data source 30.
  • the data in the data set includes values for various variables for the sample cases.
  • the computer system 20 organizes (step 64) the obtained data in the data set.
  • One exemplary data organization is the tabular representation 50 shown in Fig. 2A.
  • the computer system 20 associates (step 68) each variable with every other variable in the data set. Accordingly, an association exists between each pair of variables in the data set.
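Associating each variable with every other variable amounts to enumerating all unordered pairs, which gives n(n-1)/2 associations for n variables. A minimal sketch, assuming the five variable names used in the figures:

```python
from itertools import combinations

variables = ["A", "B", "C", "D", "E"]

# Each unordered pair appears once: (A, B) is listed, (B, A) is not.
associations = list(combinations(variables, 2))
print(len(associations), associations)
```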
  • the computer system 20 calculates (step 72) the strength of each association.
  • strength is an indication of how closely the variables are related. A strong association indicates that the variables are closely related; a weak association indicates a low or no relationship between the variables.
  • Variables can be related to each other in various ways.
  • variables can be related through physiology; for example, the serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide.
  • Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils.
  • Some variables can be directly or indirectly related to each other through other variables.
  • An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
  • variables can have a relationship with each other relating to a pathologic condition.
  • An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state.
  • Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule.
  • the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
  • the computer system 20 constructs (step 74) a graphical network of variables using every association established in step 68. In this network of variables, each variable is linked to every other variable (e.g., see Fig. 4).
  • the computer system 20 evaluates (step 76) the strength of the association between each pair of variables according to a predetermined criterion.
  • the criterion can be a threshold value.
  • the computer system 20 removes (step 80) the association between each pair of variables if the strength of that association fails to satisfy the predetermined criterion.
  • the predetermined criterion can require the strength of the association between each pair of variables to be above the threshold value, or otherwise that pair of variables becomes disassociated.
  • the predetermined criterion can require the strength of the association for each variable pair to be below the threshold value in order for that association to remain.
  • the computer system 20 also removes (step 84) each variable that has no associations with other variables remaining after step 80; that is, all associations of that variable fail to satisfy the criterion.
  • the remaining associations and variables form one or more relevance networks.
  • each relevance network is displayed at the client system 26.
  • the removal of associations and variables can divide the network of variables into smaller, separate networks. Each such smaller network is a relevance network because that smaller network represents a group of related variables. Each variable in that smaller network has an association with at least one other variable in that network that satisfies the criterion.
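The removal steps and the resulting split into separate networks can be sketched as a threshold filter followed by a connected-components search. This is an illustration under stated assumptions: the function name and the strength values are hypothetical, and the sketch keeps associations whose strength is greater than or equal to the threshold (a strictly-greater comparison would be an equally valid reading of the criterion).

```python
def relevance_networks(strengths, threshold):
    """Return the groups of variables still linked after weak associations are removed.

    `strengths` maps a pair (u, v) to the strength of that association.
    """
    # Keep only associations that satisfy the criterion (steps 76 and 80).
    neighbors = {}
    for (u, v), strength in strengths.items():
        if strength >= threshold:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    # A variable with no surviving association never enters `neighbors` (step 84).
    networks, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the surviving links
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbors[node] - group)
        seen |= group
        networks.append(sorted(group))
    return networks


# Hypothetical strengths for five variables; pairs not listed are treated as absent.
strengths = {
    ("A", "B"): 0.9, ("B", "C"): 0.65, ("C", "E"): 0.75,
    ("D", "E"): 0.45, ("A", "D"): 0.1, ("A", "E"): 0.2,
}
for t in (0.4, 0.6, 0.7):
    print(t, relevance_networks(strengths, t))
```

Raising the threshold first drops a variable and then splits the remaining variables into two separate networks, which is the behavior described above.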
  • the criterion may cause the removal of none, one, or multiple associations without the removal of any variables.
  • the relevance network includes all of the variables in the data set.
  • the computer produces (step 74') the graphical network of variables after the strength of each association is evaluated against the criterion in step 76.
  • the computer system 20 constructs the network of variables using only those associations that satisfy the criterion. Variables appear in this network of variables if there is at least one association with that variable which satisfies the criterion.
  • this network of variables is constructed as a relevance network because the network of variables includes only those variables and associations that satisfy the criterion throughout construction of the variable network. No associations or variables need to be removed from this variable network, such as described in connection with steps 80 and 84, to produce a relevance network.
  • Fig. 4 shows an exemplary embodiment of a network of variables 110 graphically representing the associations initially established between each pair of variables.
  • the associations are represented as links 100 between pairs of variables.
  • Each variable A, B, C, D, and E is shown as a node in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E.
  • Each link 100 has an assigned value representing the strength of the association between the pairs of variables.
  • Fig. 5 shows an exemplary matrix 104 containing examples of strength values 108 assigned to each of the links 100 of Fig. 4.
  • the matrix 104 places the variables A, B, C, D, and E on both the x- and y-axes.
  • Each value 108 in the matrix 104 represents the strength determined for the association between the respective pair of variables.
  • the matrix 104 is symmetric, and those entries in the matrix 104 denoted by X are either duplicative of another entry in the matrix 104 (e.g., entries (A, B) and (B, A)), or tautological (e.g., entry (A, A)).
  • Such entries (those denoted by X) need not be calculated or stored.
  • the values 108 shown are exemplary and selected only for illustrating the principles of the invention.
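One way to avoid calculating or storing the duplicative and tautological entries is to key each strength by an unordered pair. The sketch below is an assumption about data layout, not the source's implementation; `strength_matrix`, `lookup`, and the placeholder strength function are hypothetical names.

```python
from itertools import combinations


def strength_matrix(variables, strength):
    """Store one strength per unordered pair; (A, A) and mirrored entries are skipped."""
    return {frozenset(pair): strength(*pair) for pair in combinations(variables, 2)}


def lookup(matrix, u, v):
    """Read a strength regardless of the order in which the pair is given."""
    return matrix[frozenset((u, v))]


# Placeholder strength function, used only to populate the structure.
matrix = strength_matrix("ABCDE", lambda u, v: abs(ord(u) - ord(v)) / 10)
print(len(matrix), lookup(matrix, "B", "A"))
```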
  • a variety of methodologies can be used to calculate strength of the association between each pair of variables.
  • the following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships.
  • One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points.
  • This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
  • Another exemplary methodology computes a correlation coefficient (typically denoted as r) between each pair of variables. The technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
  • a correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope
  • a correlation coefficient of -1 indicates a perfect linear inverse correlation (i.e., a relationship with a negative slope)
  • a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables.
  • the correlation coefficient is Pearson's correlation coefficient. The Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals.
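A Pearson correlation coefficient for the data points of one pair of variables can be computed as sketched below. The function name and the sample points are assumptions; the formula itself is the standard one (covariance divided by the product of the standard deviations).

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) data points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data points: rising, falling, and linearly unrelated.
print(pearson_r([(1, 2), (2, 4), (3, 6)]))
print(pearson_r([(1, 6), (2, 4), (3, 2)]))
print(pearson_r([(1, 1), (2, 3), (3, 1)]))
```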
  • the correlation coefficient is a Spearman Rank correlation coefficient. The Spearman Rank correlation coefficient can be a more appropriate coefficient than the
  • the square of the correlation coefficient, $r^2$ (typically referred to as the coefficient of determination), can be used.
  • the value of $r^2$ ranges between 0 and 1. Because $r^2$ is the square of the correlation coefficient, it is never negative regardless of the sign of the coefficient, and it tends to enhance the differences between correlation coefficient values that are highly correlated. That is, a correlation coefficient, r, of 0.5 has an $r^2$ of 0.25, whereas an r of greater than about 0.71 has an $r^2$ of greater than 0.5.
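The way squaring stretches differences among strong correlations can be checked numerically. In this small sketch the helper name and the chosen values of r are illustrative assumptions.

```python
def r_squared(r):
    """Coefficient of determination for a correlation coefficient r."""
    return r * r


# The same gap in r (0.2) yields a small gap in r^2 for weak correlations
# and a larger gap for strong ones.
low_gap = r_squared(0.3) - r_squared(0.1)
high_gap = r_squared(0.9) - r_squared(0.7)
print(low_gap, high_gap)
```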
  • Another technique for computing a correlation coefficient uses a nonlinear regression model.
  • Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
  • Another exemplary methodology computes the entropy (H) of the variables and the mutual information between each pair of variables.
  • the entropy of a variable is a measure of the information content in that variable.
  • Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
  • Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges.
  • the proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$.
  • the histogram increasingly models the probability density function for the variable.
  • Entropy can be calculated using the following equation:
  • $H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$, where $\log_2$ is the base-2 logarithm. Higher entropy indicates that the data for that variable are more randomly distributed, and thus carry more information.
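The histogram-based entropy calculation can be sketched as follows. The function name, the default of ten sub-ranges, and the choice of equal-width sub-ranges are assumptions; the source does not fix the value of n or how the range is subdivided.

```python
import math


def entropy(values, n_bins=10):
    """Entropy in bits of a variable, estimated from an equal-width histogram."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0  # every measurement falls in a single sub-range
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is folded into the last sub-range.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    total = len(values)
    # Empty sub-ranges contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


# Hypothetical measurements spread evenly over four sub-ranges.
print(entropy([0, 1, 2, 3], n_bins=4))
```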
  • Mutual information can be calculated by subtracting the entropy of a first variable (A) given an occurrence of a second variable (B) from the entropy of the first variable (A) as represented by the following equation:
  • $MI(A, B) = H(A) - H(A \mid B)$
  • $MI(A, B) = H(A) + H(B) - H(A, B)$.
  • a mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately.
  • a higher mutual information between two variables indicates that one variable is predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
  • the computer system 20 can use the above-described equations to compute a mutual information relationship between pairs of genes.
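A minimal sketch of the mutual information calculation, using the form MI(A, B) = H(A) + H(B) - H(A, B) with each variable binned into a histogram. The helper names, the default of ten equal-width sub-ranges, and the sample data are assumptions for illustration.

```python
import math
from collections import Counter


def _bin_indices(values, n_bins):
    """Map each measurement to the index of its equal-width sub-range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def _entropy(labels):
    """Entropy in bits of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def mutual_information(xs, ys, n_bins=10):
    """MI(A, B) = H(A) + H(B) - H(A, B), in bits, from paired measurements."""
    bx = _bin_indices(xs, n_bins)
    by = _bin_indices(ys, n_bins)
    joint = list(zip(bx, by))  # joint histogram cell for each sample case
    return _entropy(bx) + _entropy(by) - _entropy(joint)


# Hypothetical paired measurements: identical variables, then unrelated ones.
print(mutual_information([0, 1, 2, 3], [0, 1, 2, 3], n_bins=4))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1], n_bins=2))
```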
  • the strength of each association is compared with a criterion.
  • the comparison operates as a filter that removes weakly related or unrelated associations and variables from the network of variables to produce one or more relevance networks. Consequently, the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
  • the criterion is a minimum number of data points upon which the strength of each association between variables must be based. Any association based on fewer than that minimum number of data points fails to satisfy the criterion and is removed from the network of variables. Such an association is deemed weak because of the paucity of data supporting the association. For example, referring to Fig. 2A, if the minimum number of data points is two, then the associations between variables B and A and between variables B and D fail to satisfy the criterion because both associations are based on one data point only, (VAL5, VAL3) and (VAL4,
  • the criterion is a threshold value against which the strength of each association is measured.
  • the threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
  • Fig. 6 shows an exemplary permutation of the data in table 50 shown in Fig. 2A.
  • the permutation of the data creates new data points between variables.
  • the permutation shown in Fig. 6 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in Fig. 2A, namely (VAL2, VAL6) and (VAL3,
  • strengths of associations between pairs of variables are calculated.
  • the technique used to calculate the strength of associations for permuted data points is the same as that used for the original data points. Accordingly, if mutual information is used to indicate the strength of associations for the original data points, then mutual information is also used for the permuted data points.
  • the steps of permuting the data and calculating strengths are repeated a predetermined number of times (e.g., 30). The threshold value is then set to the strongest association obtained from the repeated permutations of the data.
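The permutation procedure can be sketched as below. This is a simplified illustration that assumes fully populated columns of equal length; the function name, the `seed` parameter, and the toy strength function are assumptions, and any strength measure (for example, mutual information) could be passed in.

```python
import random


def permutation_threshold(columns, strength, n_permutations=30, seed=0):
    """Return the strongest association found across randomly permuted data.

    `columns` maps a variable name to its list of measurements.
    `strength` is any function of two equal-length lists that returns a number.
    """
    rng = random.Random(seed)
    names = sorted(columns)
    strongest = float("-inf")
    for _ in range(n_permutations):
        # Shuffle each variable independently to break the real pairings.
        shuffled = {name: rng.sample(columns[name], len(columns[name])) for name in names}
        for i, u in enumerate(names):
            for v in names[i + 1:]:
                strongest = max(strongest, strength(shuffled[u], shuffled[v]))
    return strongest


# Hypothetical data and a toy strength function: the fraction of positions that agree.
columns = {"A": [0, 1, 0, 1, 1, 0], "B": [1, 1, 0, 0, 1, 0], "C": [0, 0, 1, 1, 0, 1]}
agreement = lambda xs, ys: sum(x == y for x, y in zip(xs, ys)) / len(xs)
print(permutation_threshold(columns, agreement))
```

Associations in the original data whose strength exceeds the returned value would then be treated as significant.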
  • Fig. 7 is an exemplary graph illustrating the results of this process for determining a threshold value as applied to actual data taken from 2,467 genes in Saccharomyces cerevisiae.
  • the results are described in the U.S. provisional patent application, filed September 13, 1999, and given serial number 60/153,593, attorney docket number CMC-008PR1, and incorporated by reference herein.
  • mutual information was calculated between measurements of RNA expression between pairs of the 2,467 genes.
  • the distribution of the mutual information appears as filled circles.
  • Mutual information was also calculated using permuted RNA expression measurements.
  • the average distribution of 30 repeated permutations appears as open circles. The permutations did not produce any associations having a mutual information value greater than 1.3.
  • the threshold value used to filter associations can be set to 1.3.
  • any associations produced from the original data points having a mutual information above 1.3 could be considered significant.
  • Figs. 8A, 8B, and 8C show the resulting relevance networks produced by the process described in Fig. 3.
  • a different criterion is applied to the links 100 representing the associations between variables A, B, C, D, and E shown in Fig.
  • the relevance networks of Figs. 8A, 8B and 8C are the results of applying minimum thresholds of 0.4, 0.6, and 0.7, respectively. Links 100 having a strength value below the threshold are removed, and links 100 having a strength value greater than or equal to the threshold remain. In these examples, the criterion does not require a minimum number of data points.
  • Fig. 8A displays a relevance network 120 that includes all of the variables A, B, C, D, E, but fewer associations than those shown in the original network 110 shown in Fig. 4. In particular, all but one of the associations between D and the other variables have been removed. The only remaining association with variable D is with variable E. In Fig. 8B, the remaining association between variables D and E also fails to satisfy the threshold value of 0.6. Consequently, the resulting relevance network 122 does not include the variable D because the variable D has no associations with any of the other variables that meet the criterion.
  • Fig. 8C illustrates how the threshold value of 0.7 has divided the original network of variables 110 into two smaller, separate relevance networks 125 and 125'.
  • the relevance network
  • Figs. 8A, 8B, and 8C are exemplary.
  • Application of the invention works with large numbers of variables.
  • the computer system 20 can execute graph layout software.
  • An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
  • Fig. 9 is an embodiment of a relevance network 130 produced from actual genome data as described in the U.S. provisional application, serial number 60/153,593.
  • This particular relevance network 130 clustered 143 genes out of a data set of 79 RNA expression measurements of 2,467 genes.
  • The graph layout software isolates two branches of genes 132 and 132' attached to the network 130 by a single association. In Fig. 9, the branches are exploded to show some detail regarding the names of the associated genes. Such branches of biologically relevant gene clusters identify opportunities for further study.
  • The present invention is useful in a variety of applications. For example, relevance networks produced for normal cells can be compared to those relevance networks produced for various cancer cells to help identify distinctions and similarities. Similarly, the invention enables comparisons between the relevance networks of various cancers. Another example uses the relevance networks to monitor changes of certain variables throughout the treatment of a patient.
  • The present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture.
  • The article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape.
  • The computer-readable programs may be implemented in any programming language, LISP, PERL, C, C++, PROLOG, or any byte code language such as JAVA.
  • The software programs may be stored on or in one or more articles of manufacture as object code.

Abstract

Described are a system and method for mining data in databases to discover significant relationships among variables in the data. An association is established between each pair of variables. From the data, the strength of each association is calculated. Correlation coefficients can determine the strength of the associations. In another embodiment, the strength of each association is computed according to mutual information. These calculated strengths are evaluated according to a predetermined criterion. All associations that satisfy the criterion are included in one or more relevance networks. Each relevance network is displayed to provide a pictorial view of the relevant relationships among variables in the data.

Description

A SYSTEM AND METHOD FOR MINING DATA FROM A DATABASE USING RELEVANCE NETWORKS
Related Application
This application claims the benefit of U.S. Provisional Application, Serial No. 60/152,500, filed September 2, 1999, and U.S. Provisional Application, Serial No. 60/153,593, filed September 13, 1999, both incorporated by reference herein.
Field of the Invention
The invention relates generally to data processing. More specifically, the invention relates to a system and method for mining data from a data set to identify potentially meaningful relationships among variables in the data set.
Background of the Invention
With data accumulating in databases in ever increasing amounts, the task of extracting useful information from the data, called data mining, has grown into an important industry. Data mining techniques aim to identify significant relationships among variables in the data. In the field of genomics, for example, human genome sequencing and microarray technology have produced vast quantities of data that may hold the secret to identifying the functions of newly discovered genes. One discipline in particular, called bioinformatics, employs various techniques to mine genomic databases containing sequence, organism, and expression data to identify clusters of genes having related functionality. As discussed below, current techniques using RNA expression data for identifying gene clusters generally fall into three types: those techniques that use simple criteria matching, those that use Euclidean distance, and those that perform comprehensive pair-wise comparisons.
The simple criteria matching technique measures RNA expression levels before and after an intervention. For each gene, fold-differences are calculated. The genes are then sorted according to the calculated fold-differences. Genes showing a fold-change greater than a given threshold are "clustered" with the intervention.
Techniques that use Euclidean distance include self-organizing maps. The self-organizing map technique represents genes as multi-dimensional points in a multi-dimensional space. Coordinates for these points represent expression levels of each gene at various moments in time. A grid of centroids is imposed in the multi-dimensional space, and the centroids are allowed to drift. Each centroid drifts towards a collection of points. When the drifting completes, the centroids identify clusters of genes that exhibit similar time-course behavior. In this way, related genes have a smaller Euclidean distance in the multidimensional space. However, large numbers of dimensions can cause the technique to become computationally intensive. Moreover, the resulting gene/time-course clusters provide little information about specific gene-to-gene relationships among the genes in the clusters.
Techniques that perform comprehensive pair-wise comparisons generally compare each gene against each other gene using a metric. One particular technique creates a vector for each gene. The vector is made up of expression levels taken at various times. Each gene is compared against each other gene by recording the correlation coefficient between the corresponding vectors. The technique then constructs a phylogenetic-type tree with branch lengths between genes being proportional to the correlation coefficients. However, phylogenetic-type trees, in general, do not show more than the most correlated relationships of each gene, omitting the lesser correlated, yet potentially significant relationships. Another technique combines the Euclidean distance and pair-wise comparison techniques by constructing phylogenetic-type trees with branch length proportional to the Euclidean distance between genes. The coordinates again represent expression levels at various time points. Although this hybrid technique provides an alternative to clustering genes, the above-described limitations of both the Euclidean distance and phylogenetic techniques remain present.
Thus, a need remains for a data mining technique that can uncover the multi-faceted relationships of the various variables in a data set without encountering the problems and limitations of the aforementioned techniques.
Summary of the Invention
The present invention relates to a system and method for producing a network of related variables. An objective of the invention is to group variables occurring in data extracted from a data source in a manner that makes readily apparent any potentially significant relationships among those variables and consequently motivate hypotheses for targeted research. Another objective is to concurrently examine relationships among large numbers of variables.
In one aspect, the invention features a method that obtains data for a plurality of variables. An association between each pair of variables is established. From the data, a strength of the association between each pair of variables is calculated and evaluated according to a predetermined criterion. A network of variables is produced. The network of variables includes each association having a strength that satisfies the criterion. The variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.). The network of variables can be graphically displayed.
In one embodiment, the network of variables is produced by including each established association irrespective of the strength of that association, and subsequently removing each association from the network of variables that fails to satisfy the criterion. One embodiment includes each variable in the network of variables, and subsequently removes that variable from the network of variables if all associations with that variable fail to satisfy the criterion. Removing a variable from the network of variables can produce a plurality of separate networks of variables. In another embodiment, the method establishes the criterion as a threshold value for the strength of the association. In one embodiment, each association having a strength above the threshold value satisfies the criterion. The strength of the association between the two variables can be calculated using mutual information between the variables. Other embodiments use a linear regression model (e.g., computing a Pearson correlation coefficient) or a non-linear regression model.
In one embodiment, the threshold value is determined by randomly permuting the data for each pair of variables. A strength of the association between each pair of variables is calculated from the permuted data. The steps of permuting and calculating are repeated a predetermined number of times. The strongest association is determined from the strengths of associations determined using permuted data. The threshold value is set equal to the strongest association.
In another aspect, the invention relates to a system for producing a network of related variables. The system includes memory storing data for the plurality of variables. An associator establishes an association between each pair of variables in the network of variables. A calculator calculates the strength of the association between each pair of variables. An evaluator evaluates the strength of the association between each pair of variables according to a predetermined criterion. A network generator produces a network of variables that includes each association that satisfies the criterion.
In another aspect, the invention relates to a system for determining a strength of association between any two of a plurality of variables. The system includes memory, storing data for two or more variables, and a processor in communication with the memory. The processor executes software that (1) establishes an association between each pair of variables to produce a network of variables, (2) calculates from the data a strength of the association between each pair of variables, (3) evaluates the strength of each association according to a predetermined criterion, (4) produces a network of variables that includes each association, (5) removes each association from the network of variables that fails to satisfy the criterion, and (6) graphically displays the network of variables.
Brief Description of the Drawings
The invention is pointed out with particularity in the appended claims. The advantages of the invention described above, as well as further advantages of the invention, may be better understood by reference to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention;
Fig. 2A is an embodiment of a table including data from a data source for a plurality of variables;
Fig. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of Fig. 2A;
Fig. 2C is an embodiment of a scatter plot of the data for another pair of variables from the table of Fig. 2A;
Fig. 3 is a flow chart of an embodiment of an exemplary process that produces relevance networks using the associations between variables in a data set according to the principles of the invention;
Fig. 4 is a block diagram of an embodiment of a graphical representation of the associations between each pair of variables in the data set;
Fig. 5 is an embodiment of a variable matrix including examples of strength values for each of the associations shown in Fig. 4;
Fig. 6 is an embodiment of a table illustrating an exemplary permutation of the data in the table shown in Fig. 2A;
Fig. 7 is an embodiment of a graph illustrating results from an exemplary process used to determine a threshold value for evaluating the strengths of the associations between each pair of variables;
Figs. 8A, 8B, 8C are embodiments of relevance networks produced by applying different criteria to the variables and links shown in Fig. 4 with the exemplary associated strength values of Fig. 5; and
Fig. 9 is an embodiment of a relevance network produced from actual genomic data.
Detailed Description
The invention provides a method and apparatus for mining data from databases. Fig. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30. A variety of system architectures can be used to practice the invention. The computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention. The processor in one embodiment is a 266 MHz Pentium II™ processor, manufactured by Intel Corporation of Santa Clara, California. One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris, manufactured by Sun Microsystems, Inc. of Palo Alto, California.
The data source 30 in one embodiment is a database system, e.g., ORACLE 8™, or data stored in files on a data storage device, such as a hard disk. To extract data from the data source, the processor of the computer system 20 executes data mining software. Such software is written in any programming language, such as C, C++, etc.
The data in the data source 30 represent measurements of multiple variables for various sample cases. For example, in a medical context, the sample cases in one embodiment are individuals and the measured variables are physical characteristics, such as weight, height, age, gender, race, etc. Similarly, the sample cases in one embodiment pertain to a single patient evaluated at different time intervals. In this embodiment, the patient is subject to particular laboratory tests, such as hemoglobin, hematocrit, and thyroxine measurements taken over a period of time. Here, the measured variables are continuous variables. As another example, the sample cases are RNA expression measurements and the measured variables are genes. In still another embodiment, the sample cases are corporate institutions for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc., acquired over time.
In general, the principles of the invention can be practiced to examine any type of data in search of relationships among various measured variables. The invention can mine data from databases containing customer sales transactions, commercial passenger travel information (e.g., airline), financial data, and data collected by laboratories, research facilities, commercial institutions, finance institutions, etc. An advantage is that the invention can exploit existing electronic databases.
In brief overview, execution of the data mining software causes the computer system 20 to access data in the data source 30. The data mining software associates each variable in the accessed data with every other variable and determines the significance of the association between each pair of variables. Significance can be defined according to a predetermined criterion. From the determination, the data mining software groups together variables into one or more separate relevance networks. Each relevance network represents a group of related variables; that is, each variable in a relevance network has a significant association (as defined by the criterion) with at least one other variable in that relevance network, and does not have a significant association (as defined by the criterion) with variables in other relevance networks. The data mining software outputs each relevance network for display (e.g., at the computer system). The displayed output makes it readily apparent that a relationship potentially worthy of targeted research was detected among variables in the data.
Fig. 2A shows an exemplary tabular representation 50 of data in the data source. The measured variables A, B, C, D, and E are represented on the x-axis as columns. The sample cases S1, S2, S3, and S4 on the y-axis are represented as rows. This column and row arrangement is exemplary; the sample cases and variables can appear on either the x- or y-axis and remain within the scope of the invention. In addition, the principles of the invention extend to more sample cases and variables other than those shown in Fig. 2A. The table 50 can be completely, densely, or sparsely populated with data values 52. Fig. 2A shows an exemplary data set of twenty entries wherein the table 50 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 50 lack a data value, each denoted by a dashed line.
The data values 52 are used to determine the degree of a relationship between each pair of variables. Each pair of data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot. For example, Fig. 2B shows an embodiment of an exemplary scatter plot of the data points 54 produced by the data for variables D and E. The data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14). Fig. 2C illustrates another embodiment of an exemplary scatter plot of the data points for variables A and E, where the data points are (VAL1, VAL12), (VAL2, VAL13), and (VAL3, VAL15). Scatter plots can be produced for each pair of variables in like manner.
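The pairing of data values into data points can be sketched as follows. This is a minimal illustration only: the function name `data_points` is hypothetical, and the row positions of the values for variables A, D, and E are an assumption chosen to agree with the data points listed for Figs. 2B and 2C.

```python
# Assumed layout of three columns of the table of Fig. 2A. Rows are the
# sample cases S1..S4; None marks an entry that lacks a data value.
table = {
    "A": ["VAL1", "VAL2", None, "VAL3"],
    "D": ["VAL9", "VAL10", "VAL11", None],
    "E": ["VAL12", "VAL13", "VAL14", "VAL15"],
}


def data_points(table, x, y):
    """Return the (x, y) data points for rows where both variables have a value."""
    return [
        (a, b)
        for a, b in zip(table[x], table[y])
        if a is not None and b is not None
    ]


if __name__ == "__main__":
    print(data_points(table, "D", "E"))
    print(data_points(table, "A", "E"))
```

Rows in which either variable lacks a value contribute no data point, which is why a sparsely populated table can yield few data points for some pairs.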
Fig. 3 shows an exemplary process for finding relationships among the variables A, B, C, D, and E according to the principles of the invention. The process obtains (step 60) a set of data from the data source 30. The data in the data set includes values for various variables for the sample cases. The computer system 20 organizes (step 64) the obtained data in the data set. One exemplary data organization is the tabular representation 50 shown in Fig. 2A. The computer system 20 associates (step 68) each variable with every other variable in the data set. Accordingly, an association exists between each pair of variables in the data set. From the data set, the computer system 20 calculates (step 72) the strength of each association. Here, strength is an indication of how closely the variables are related. A strong association indicates that the variables are closely related; a weak association indicates a low or no relationship between the variables.
Variables can be related to each other in various ways. For example, variables can be related through physiology, as serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide. Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils. Some variables can be directly or indirectly related to each other through other variables. An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
Other variables can have a relationship with each other relating to a pathologic condition. An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state. Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule. Here, the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
In one embodiment, the computer system 20 constructs (step 74) a graphical network of variables using every association established in step 68. In this network of variables, each variable is linked to every other variable (e.g., see Fig. 4). The computer system 20 evaluates (step 76) the strength of the association between each pair of variables according to a predetermined criterion. In one embodiment, the criterion can be a threshold value. The computer system 20 removes (step 80) the association between each pair of variables if the strength of that association fails to satisfy the predetermined criterion. For example, the predetermined criterion can require the strength of the association between each pair of variables to be above the threshold value, or otherwise that pair of variables becomes disassociated. In another embodiment, the predetermined criterion can require the strength of the association for each variable pair to be below the threshold value in order for that association to remain.
The computer system 20 also removes (step 84) each variable that has no associations with other variables remaining after step 80; that is, all associations of that variable fail to satisfy the criterion. The remaining associations and variables form one or more relevance networks. In step 88, each relevance network is displayed at the client system 26. The removal of associations and variables can divide the network of variables into smaller, separate networks. Each such smaller network is a relevance network because that smaller network represents a group of related variables. Each variable in that smaller network has an association with at least one other variable in that network that satisfies the criterion.
In some instances, the criterion may cause the removal of none, one, or multiple associations without the removal of any variables. In such a case, the relevance network includes all of the variables in the data set.
In another embodiment, shown in Fig. 3 with dashed lines, the computer produces (step 74') the graphical network of variables after the strength of each association is evaluated against the criterion in step 76. In this embodiment, the computer system 20 constructs the network of variables using only those associations that satisfy the criterion. Variables appear in this network of variables if there is at least one association with that variable which satisfies the criterion. Thus, this network of variables is constructed as a relevance network because the network of variables includes only those variables and associations that satisfy the criterion throughout construction of the variable network. No associations or variables need to be removed from this variable network, such as described in connection with steps 80 and 84, to produce a relevance network.
Other embodiments of processes for constructing a relevance network from associations that satisfy the criterion can be used to practice the principles of the invention.
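One way to sketch the filtering and grouping described above is to keep the links that satisfy a threshold criterion and then collect the connected groups of variables. The sketch below is an illustration under stated assumptions, not the patented implementation: the function name `relevance_networks` is hypothetical, the strength values are invented for the example (they are not the values of Fig. 5), and a link is taken to satisfy the criterion when its strength is greater than or equal to the threshold.

```python
def relevance_networks(strengths, threshold):
    """Group variables into relevance networks.

    strengths maps a pair (u, v) to the strength of that association.
    Links with strength >= threshold are kept; variables left without any
    link are dropped; each connected group of variables is one network.
    """
    # Steps 76/80: keep only the associations that satisfy the criterion.
    adjacency = {}
    for (u, v), strength in strengths.items():
        if strength >= threshold:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)

    # Step 84 is implicit: a variable with no surviving link never enters
    # `adjacency`. A depth-first search then collects each separate network.
    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        networks.append(sorted(component))
    return networks


# Hypothetical strength values for the ten links among A, B, C, D, and E.
strengths = {
    ("A", "B"): 0.50, ("A", "C"): 0.45, ("A", "D"): 0.20, ("A", "E"): 0.80,
    ("B", "C"): 0.75, ("B", "D"): 0.10, ("B", "E"): 0.65, ("C", "D"): 0.30,
    ("C", "E"): 0.40, ("D", "E"): 0.50,
}

if __name__ == "__main__":
    for threshold in (0.4, 0.6, 0.7):
        print(threshold, relevance_networks(strengths, threshold))
```

With these invented values, raising the threshold first drops variable D and then splits the remaining variables into two separate networks, which mirrors the qualitative behavior the process describes.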
Fig. 4 shows an exemplary embodiment of a network of variables 110 graphically representing the associations initially established between each pair of variables. The associations are represented as links 100 between pairs of variables. Each variable A, B, C, D, and E is shown as a node in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E. Each link 100 has an assigned value representing the strength of the association between the pairs of variables. Fig. 5 shows an exemplary matrix 104 containing examples of strength values 108 assigned to each of the links 100 of Fig. 4.
The matrix 104 places the variables A, B, C, D, and E on both the x- and y-axes. Each value 108 in the matrix 104 represents the strength determined for the association between the respective pair of variables. As such, the matrix 104 is symmetric, and those entries in the matrix 104 denoted by X are either duplicative of another entry in the matrix 104 (e.g., entries (A, B) and (B, A)), or tautological (e.g., entry (A, A)). Such entries need not be calculated or stored. The values 108 shown are exemplary and selected only for illustrating the principles of the invention.
A variety of methodologies can be used to calculate the strength of the association between each pair of variables. The following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships. One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points. This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
Another exemplary methodology computes a correlation coefficient (typically denoted as r) between each pair of variables. The technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
One technique uses a linear regression model to compute a correlation coefficient with a value between -1 and 1. A correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope, a correlation coefficient of -1 indicates a perfect linear inverse correlation (i.e., a relationship with a negative slope), and a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables. In one embodiment, the correlation coefficient is Pearson's correlation coefficient. The Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals. In another embodiment, the correlation coefficient is a Spearman Rank correlation coefficient. The Spearman Rank correlation coefficient can be a more appropriate coefficient than the Pearson correlation coefficient when actual numerical values cannot be assigned to variables, but a rank order is assigned to each sample case of each variable.
For a coefficient that is more indicative of a predictable linear relationship between two variables than r, the square of the correlation coefficient, $r^2$ (typically referred to as the coefficient of determination), can be used. The value of $r^2$ ranges between 0 and 1. Because the value of $r^2$ is the square of the correlation coefficient, the value is never negative regardless of the sign of the coefficient, and it tends to enhance the differences between correlation coefficient values that are highly correlated. That is, a correlation coefficient, r, of 0.5 has an $r^2$ of 0.25, whereas an r greater than about 0.71 has an $r^2$ greater than 0.5.
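As a minimal sketch of the Pearson correlation coefficient and the coefficient of determination described above, the following uses only the standard library. The function name `pearson_r` and the sample values are illustrative assumptions; the inputs are assumed to be equal-length lists with no missing values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    r = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
    # r squared discards the sign and shrinks weak correlations more than
    # strong ones.
    print(r, r ** 2)
```

Squaring maps both r = 0.8 and r = -0.8 to the same strength, so a threshold on the squared value treats positive and negative linear relationships alike.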
Another technique for computing a correlation coefficient uses a nonlinear regression model. Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
Another exemplary methodology for determining the strength of the association between a pair of variables computes entropy (H) of the variables and the mutual information between each pair of variables. The entropy of a variable is a measure of the information content in that variable. Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges. The proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$. As n approaches infinity, the histogram increasingly models the probability density function for the variable.
Entropy can be calculated using the following equation:
$$H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where $\log_2$ is the base-2 logarithm. Higher entropy indicates that the data for that variable are more randomly distributed, and thus have higher information content.
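As a brief worked example of the entropy equation, consider n = 4 sub-ranges. The two distributions below are illustrative assumptions, not measurements, and the usual convention that a sub-range with zero frequency contributes nothing to the sum is assumed.

```latex
\begin{align*}
\text{evenly spread: } p(x_i) = \tfrac{1}{4} \text{ for all } i
  &\;\Rightarrow\; H(A) = -\sum_{i=1}^{4} \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits} \\
\text{concentrated: } p(x_1) = 1,\; p(x_2) = p(x_3) = p(x_4) = 0
  &\;\Rightarrow\; H(A) = -1 \cdot \log_2 1 = 0 \text{ bits}
\end{align*}
```

The evenly spread variable attains the maximum entropy for four sub-ranges, $\log_2 4$, while the concentrated variable carries no information.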
Mutual information can be calculated by subtracting the entropy of a first variable (A) given an occurrence of a second variable (B) from the entropy of the first variable (A) as represented by the following equation:
$$MI(A, B) = H(A) - H(A|B).$$
Expressed another way, mutual information can be calculated by subtracting the joint entropy of the two variables from the sum of the individual entropies of the two variables:
$$MI(A, B) = H(A) + H(B) - H(A, B).$$
A mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately. A higher mutual information between two variables indicates that one variable is predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
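The histogram-based calculation can be sketched as follows, using the form MI(A, B) = H(A) + H(B) - H(A, B). This is a minimal sketch under stated assumptions: the function names are hypothetical, the sub-ranges are assumed to be of equal width, and the number of sub-ranges (`n_bins`) is left to the caller because no value is prescribed here.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a sequence of discrete labels (sub-range indices)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def to_bins(values, n_bins):
    """Map each value to the index of its equal-width sub-range."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)  # a constant variable occupies one sub-range
    width = (high - low) / n_bins
    # The maximum value is clamped into the last sub-range.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys, n_bins=10):
    """Mutual information in bits between two equal-length numeric sequences."""
    bx, by = to_bins(xs, n_bins), to_bins(ys, n_bins)
    joint = list(zip(bx, by))  # one joint label per data point
    return entropy(bx) + entropy(by) - entropy(joint)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    print(mutual_information(xs, xs, n_bins=4))         # fully predictable
    print(mutual_information(xs, [5.0] * 4, n_bins=4))  # carries no information
```

Because the estimate depends on the number of sub-ranges and on the number of data points, strengths computed this way are best compared only across pairs that were binned in the same manner.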
In a biological context, for example, the computer system 20 can use the above-described equations to compute a mutual information relationship between pairs of genes. The higher the mutual information is between two genes, the greater the strength of the association between those genes (i.e., the more likely those genes have a biological relationship).
As described above, the strength of each association is compared with a criterion. The comparison operates as a filter that removes weakly related or unrelated associations and variables from the network of variables to produce one or more relevance networks. Consequently, the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
In one embodiment, the criterion is a minimum number of data points upon which the strength of each association between variables must be based. Any association based on fewer than that minimum number of data points fails to satisfy the criterion and is removed from the network of variables. Such an association is deemed weak because of the paucity of data supporting the association. For example, referring to Fig. 2A, if the minimum number of data points is two, then the associations between variables B and A and between variables B and D fail to satisfy the criterion because both associations are based on one data point only, (VAL5, VAL3) and (VAL4, VAL11), respectively. If instead the minimum number were set to three data points, then all associations with B would fail to satisfy the criterion, and the process described in Fig. 3 would consequently remove variable B from the network of variables.
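The minimum-data-point criterion can be illustrated with a short sketch. The presence pattern below is an assumption: Fig. 2A is not reproduced here, so the positions of the five missing entries were inferred from the data points quoted in the text, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed presence of a data value for each variable in sample cases S1..S4
# (True = value present). Fifteen entries are present and five are missing.
present = {
    "A": [True, True, False, True],
    "B": [False, False, True, True],
    "C": [False, True, True, True],
    "D": [True, True, True, False],
    "E": [True, True, True, True],
}


def shared_counts(present):
    """Number of data points (rows with both values) for each pair of variables."""
    return {
        (u, v): sum(a and b for a, b in zip(present[u], present[v]))
        for u, v in combinations(sorted(present), 2)
    }


def failing_pairs(present, minimum):
    """Pairs whose association rests on fewer than `minimum` data points."""
    return sorted(p for p, n in shared_counts(present).items() if n < minimum)


if __name__ == "__main__":
    print(failing_pairs(present, 2))
    print(failing_pairs(present, 3))
```

Under this assumed layout, a minimum of two removes only the B-A and B-D associations, while a minimum of three removes every association involving B, so B would drop out of the network.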
In another embodiment, the criterion is a threshold value against which the strength of each association is measured. The threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
Another exemplary technique for setting the threshold value randomly permutes the data for each variable. The manner of permuting the data of each variable is independent of the manner used for each other variable. Fig. 6 shows an exemplary permutation of the data in table 50 shown in Fig. 2A. The permutation of the data creates new data points between variables. For example, the permutation shown in Fig. 6 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in Fig. 2A, namely (VAL2, VAL6) and (VAL3, VAL8).
From the permuted data points, strengths of associations between pairs of variables are calculated. The technique used to calculate the strength of associations for permuted data points is the same as that used for the original data points. Accordingly, if mutual information is used to indicate the strength of associations for the original data points, then mutual information is also used for the permuted data points. The steps of permuting the data and calculating strengths are repeated a predetermined number of times (e.g., 30). The threshold value is then set to the strongest association obtained from the repeated permutations of the data.
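The permutation procedure above can be sketched as follows. This is an illustration under assumptions rather than the method as implemented: the names `permutation_threshold` and `abs_pearson` are hypothetical, the data are assumed to contain no missing values, and the absolute Pearson correlation stands in for whichever strength measure is used on the original data.

```python
import math
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation, used here as an example strength measure."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return abs(covariance / spread)


def permutation_threshold(data, strength, repeats=30, seed=0):
    """Strongest association found across repeated random permutations.

    data maps each variable name to its list of values; every variable is
    shuffled independently of the others in each repetition.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = sorted(data)
    strongest = float("-inf")
    for _ in range(repeats):
        permuted = {name: rng.sample(data[name], len(data[name])) for name in names}
        for i, u in enumerate(names):
            for v in names[i + 1:]:
                strongest = max(strongest, strength(permuted[u], permuted[v]))
    return strongest


if __name__ == "__main__":
    data = {
        "A": [1, 2, 3, 4, 5, 6],
        "B": [2, 1, 4, 3, 6, 5],
        "C": [6, 5, 4, 3, 2, 1],
    }
    print(permutation_threshold(data, abs_pearson))
```

Associations in the original data whose strength exceeds the returned value are stronger than anything the shuffled data produced, which is the rationale for using it as the threshold.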
Fig. 7 is a exemplary graph illustrating the results of this process for determining a threshold value as applied to actual data taken from 2,467 genes in Saccharamomyces cerevisiae . The results are described in the U.S. provisional patent application, filed September 13, 1999, and given serial number 60/153,593, attorney docket number CMC-008PR1, and incorporated by reference herein. Here, mutual information was calculated between measurements of RNA expression between pairs of the 2,467 genes. The distribution of the mutual information appears as filled circles. Mutual information was also calculated using permuted RNA expression measurements . The average distribution of 30 repeated permutations appears as open circles. The permutations did not produce any associations having a mutual information value greater than 1.3.
Accordingly, the threshold value used to filter associations can be set to 1.3. In this example, any associations produced from the original data points having a mutual information above 1.3 could be considered significant.
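For readers unfamiliar with mutual information, one common way to estimate it between two series of measurements is to bin each series and compare the joint bin frequencies with the product of the marginal frequencies. The sketch below makes assumptions that this passage does not specify: equal-width binning, the bin count `bins`, and a base-2 logarithm are all illustrative choices, and the function name `mutual_information` is hypothetical. Values it returns are therefore not necessarily comparable to the 1.3 threshold of Fig. 7.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=10):
    """Estimate mutual information (in bits) between two equal-length series."""

    def discretize(values):
        # Assign each value to one of `bins` equal-width intervals.
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0] * len(values)
        width = (hi - lo) / bins
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (i, j), count in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi
```

Two series that always fall into matching bins yield a high value, while two series whose bins are statistically independent yield a value of zero.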
Figs. 8A, 8B, and 8C show the resulting relevance networks produced by the process described in Fig. 3. In each figure, a different criterion is applied to the links 100 representing the associations between variables A, B, C, D, and E shown in Fig. 4, having the exemplary associated strength values shown in Fig. 5. The relevance networks of Figs. 8A, 8B, and 8C are the results of applying minimum thresholds of .4, .6, and .7, respectively. Links 100 having a strength value below the threshold are removed, and links 100 having a strength value greater than or equal to the threshold remain. In these examples, the criterion does not require a minimum number of data points.
Fig. 8A displays a relevance network 120 that includes all of the variables A, B, C, D, and E, but fewer associations than those shown in the original network 110 shown in Fig. 4. In particular, all but one of the associations between D and the other variables have been removed. The only remaining association with variable D is with variable E. In Fig. 8B, the remaining association between variables D and E also fails to satisfy the threshold value of .6. Consequently, the resulting relevance network 122 does not include the variable D because the variable D has no associations with any of the other variables that meet the criterion.
Fig. 8C illustrates how the threshold value of .7 has divided the original network of variables 110 into two smaller, separate relevance networks 125 and 125'. The relevance network 125 includes one link between variables A and E, and the other relevance network 125' includes one link between variables B and C.
The graphical representations of relevance networks shown in Figs. 8A, 8B, and 8C are exemplary. Application of the invention works with large numbers of variables. To graphically represent the relevance networks having large numbers of variables, the computer system 20 can execute graph layout software. An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
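The thresholding behavior described for Figs. 8A, 8B, and 8C can be sketched as follows. The strength values below are hypothetical and are NOT the values of Fig. 5; they were chosen only so that the three thresholds reproduce the qualitative outcomes described above. The function names `filter_network` and `relevance_networks` are likewise assumptions made for illustration.

```python
def filter_network(strengths, threshold):
    """Keep links whose strength is greater than or equal to `threshold`."""
    return {pair: s for pair, s in strengths.items() if s >= threshold}


def relevance_networks(links):
    """Group the variables that still have links into separate networks.

    A variable with no remaining link never appears in `links`, so it is
    dropped from the result automatically.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    networks, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects one connected network.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        networks.append(sorted(component))
    return networks


# Hypothetical strengths for variables A-E (not taken from Fig. 5).
strengths = {
    ("A", "B"): 0.65, ("A", "C"): 0.60, ("A", "D"): 0.20, ("A", "E"): 0.90,
    ("B", "C"): 0.80, ("B", "D"): 0.30, ("B", "E"): 0.60,
    ("C", "D"): 0.10, ("C", "E"): 0.45, ("D", "E"): 0.50,
}
for threshold in (0.4, 0.6, 0.7):
    print(threshold, relevance_networks(filter_network(strengths, threshold)))
```

With these illustrative values, a threshold of 0.4 keeps all five variables in one network, 0.6 drops variable D, and 0.7 leaves two separate networks, one linking A and E and one linking B and C.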
Fig. 9 is an embodiment of a relevance network 130 produced from actual genome data as described in the U.S. provisional application, serial number 60/153,593. This particular relevance network 130 clustered 143 genes out of a data set of 79 RNA expression measurements of 2,467 genes. The graph layout software isolates two branches of genes 132 and 132' attached to the network 130 by a single association. In Fig. 9, the branches are exploded to show some detail regarding the names of the associated genes. Such branches of biologically relevant gene clusters identify opportunities for further study.
The present invention is useful in a variety of applications. For example, relevance networks produced for normal cells can be compared to those relevance networks produced for various cancer cells to help identify distinctions and similarities. Similarly, the invention enables comparisons between the relevance networks of various cancers. Another example uses the relevance networks to monitor changes of certain variables throughout the treatment of a patient.
The present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture. The article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape. In general, the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, or PROLOG, or in any byte code language such as JAVA. The software programs may be stored on or in one or more articles of manufacture as object code.
Having described certain embodiments of the invention, it will now become apparent to one of skill in the art that other embodiments incorporating the concepts of the invention may be used. Therefore, the invention should not be limited to certain embodiments, but rather should be limited only by the spirit and scope of the following claims.

Claims

What is claimed is:
1. A method for producing a network of related variables, comprising the steps of: (a) obtaining data for a plurality of variables; (b) establishing an association between each pair of variables of the plurality of variables; (c) calculating from the data a strength of the association between each pair of variables; (d) evaluating the strength of the association between each pair of variables according to a predetermined criterion; and (e) producing a network of variables that includes each association if the strength of that association satisfies the criterion.
2. The method of claim 1 wherein producing the network of variables includes the steps of: including each established association in the network of variables irrespective of the strength of that association; and removing each established association from the network of variables that fails to satisfy the criterion.
3. The method of claim 1 further comprising the step of: including each of the plurality of variables in the network of variables; and removing each variable from the network of variables if all associations with that variable fail to satisfy the criterion.
4. The method of claim 1 wherein the removing the variable from the network of variables produces a plurality of separate networks of variables.
5. The method of claim 1 further comprising the step of establishing the criterion as a threshold value for the strength of the association.
6. The method of claim 1 wherein each association having a strength above the threshold value satisfies the criterion.
7. The method of claim 1 further comprising the steps of: randomly permuting the data for the plurality of variables; calculating from the permuted data a strength of the association between each pair of variables; repeating the steps of permuting and calculating a predetermined number of times; determining a strongest association from the strengths of associations determined using permuted data; and setting the threshold value equal to the strongest association.
8. The method of claim 1 further comprising the step of graphically displaying the network of variables.
9. The method of claim 1 wherein the step of calculating the strength of the association between each pair of variables uses a linear regression model.
10. The method of claim 1 wherein the step of calculating the strength of the association between each pair of variables includes computing a Pearson correlation coefficient.
11. The method of claim 1 wherein the step of calculating the strength of the association between each pair of variables uses a non-linear regression model.
12. The method of claim 1 further comprising the steps of: determining the strength of the association between each pair of variables using mutual information.
13. The method of claim 1 wherein the variables are genes.
14. The method of claim 1 wherein the variables represent financial metrics.
15. A system for producing a network of related variables, comprising: memory storing data for the plurality of variables; an associator establishing an association between each pair of variables; a calculator, in communication with the memory and the associator, calculating from the data a strength of the association between each pair of variables; an evaluator evaluating the strength of the association between each pair of variables according to a predetermined criterion; and a network generator producing a network of variables that includes each association that satisfies the criterion.
16. The system of claim 1 further comprising a remover that removes each variable from the network of variables if all associations of that variable fail to satisfy the criterion.
17. The system of claim 1 wherein the evaluator further comprises a criterion setter that establishes the predetermined criterion as a threshold value for the strength of the association.
18. The system of claim 1 further comprising a comparator that compares the strength of each association with the predetermined criterion.
19. The system of claim 1 wherein each association having a strength above the threshold value satisfies the criterion.
20. The system of claim 1 further comprising: a data permutation device randomly permuting the data for each of the plurality of variables; and wherein the calculator calculates from the permuted data a strength of the association between each pair of variables, and the criterion setter sets the threshold value to a strongest association from the strengths of associations determined using the permuted data.
21. The system of claim 1 further comprising an output device displaying the network of variables.
22. The system of claim 1 wherein the network of variables includes a plurality of separate networks of variables.
23. The system of claim 1 wherein the calculator applies a linear regression model to the data of each pair of variables to determine the strength of the association between that pair of variables.
24. The system of claim 1 wherein the calculator applies a non-linear regression model to the data of each pair of variables to determine the strength of the association between that pair of variables.
25. The system of claim 1 wherein the calculator computes a mutual information value between each pair of variables to determine the strength of the association between that pair of variables.
26. A system for determining a strength of association between any two of a plurality of variables, comprising: memory storing data for two or more variables; a processor in communication with the memory, the processor executing software that (1) establishes an association between each pair of variables, (2) calculates from the data a strength of the association between each pair of variables, (3) evaluates the strength of each association according to a predetermined criterion; (4) produces a network of variables that includes each association; (5) removes each association from the network of variables that fails to satisfy the criterion; and (6) graphically displays the network of variables.

Priority Applications (4)

Application Number Priority Date Filing Date Title
CA002383549A CA2383549A1 (en) 1999-09-02 2000-09-01 A system and method for mining data from a database using relevance networks
EP00959855A EP1266305A2 (en) 1999-09-02 2000-09-01 A system and method for mining data from a database using relevance networks
JP2001520685A JP2003527662A (en) 1999-09-02 2000-09-01 System and apparatus for retrieving data from a database using an associated network
AU71105/00A AU7110500A (en) 1999-09-02 2000-09-01 A system and method for mining data from a database using relevance networks

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US15250099P 1999-09-02 1999-09-02
US60/152,500 1999-09-02
US15359399P 1999-09-13 1999-09-13
US60/153,593 1999-09-13
US43045099A 1999-10-29 1999-10-29
US09/430,450 1999-10-29

Publications (2)

Publication Number Publication Date
WO2001016805A2 true WO2001016805A2 (en) 2001-03-08
WO2001016805A3 WO2001016805A3 (en) 2002-09-26

Family

ID=27387262

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/024257 WO2001016805A2 (en) 1999-09-02 2000-09-01 A system and method for mining data from a database using relevance networks

Country Status (5)

Country Link
EP (1) EP1266305A2 (en)
JP (1) JP2003527662A (en)
AU (1) AU7110500A (en)
CA (1) CA2383549A1 (en)
WO (1) WO2001016805A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1455283A2 (en) * 2003-03-03 2004-09-08 Fujitsu Limited Information relevance display method program storage medium and apparatus
US7647287B1 (en) 2008-11-21 2010-01-12 International Business Machines Corporation Suggesting a relationship for a node pair based upon shared connections versus total connections
WO2019165388A1 (en) * 2018-02-25 2019-08-29 Graphen, Inc. System for discovering hidden correlation relationships for risk analysis using graph-based machine learning

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025596B (en) 2016-02-01 2021-07-16 腾讯科技(深圳)有限公司 Risk assessment method and system

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ALON U ET AL: "BROAD PATTERNS OF GENE EXPRESSION REVEALED BY CLUSTERING ANALYSIS OF TUMOR AND NORMAL COLON TISSUES PROBED BY OLIGONUCLEOTIDE ARRAYS" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 96, June 1999 (1999-06), pages 6745-6750, XP002901769 ISSN: 0027-8424 *
BASETT D E ET AL: "GENE EXPRESSION INFORMATICS - IT'S ALL IN YOUR MINE" NATURE GENETICS, NEW YORK, NY, US, vol. 21, no. SUPPL, January 1999 (1999-01), pages 51-55, XP000865988 ISSN: 1061-4036 *
CHEN T ET AL: "Identifying Gene Regulatory Networks from Experimental Data" RECOMB 99, ACM PRESS, April 1999 (1999-04), XP002189969 USA *
CHEN Y ET AL: "CLUSTERING ANALYSIS FOR GENE EXPRESSION DATA" PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 3602, January 1999 (1999-01), pages 422-428, XP001001103 *
CLAVERIE, J.-M.: "Computational methods for the identification of differential and coordinated gene expression" HUMAN MOLECULAR GENETICS, OXFORD UNIVERSITY PRESS, vol. 8, no. 10, 1 September 1999 (1999-09-01), pages 1821-1832, XP002202819 *
D'HAESELEER, P. ET AL: "Gene Expression Data Analysis and Modeling" PACIFIC SYMPOSIUM ON BIOCOMPUTING 1999 (PSB99), TUTORIAL, 4 - 9 January 1999, pages 1-34, XP002203022 Hawaii, USA *
EISEN M B ET AL: "Cluster analysis and display of genome-wide expression patterns" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 95, December 1998 (1998-12), pages 14863-14868, XP002140966 ISSN: 0027-8424 *
MICHAELS G S ET AL: "CLUSTER ANALYSIS AND DATA VISUALIZATION OF LARGE-SCALE GENE EXPRESSION DATA" PROCEEDINGS OF THE PACIFIC SYMPOSIUM ON BIOCOMPUTING, XX, XX, 1997, pages 42-53, XP000974575 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1455283A2 (en) * 2003-03-03 2004-09-08 Fujitsu Limited Information relevance display method program storage medium and apparatus
EP1455283A3 (en) * 2003-03-03 2006-04-12 Fujitsu Limited Information relevance display method program storage medium and apparatus
US7203698B2 (en) 2003-03-03 2007-04-10 Fujitsu Limited Information relevance display method, program, storage medium and apparatus
US7647287B1 (en) 2008-11-21 2010-01-12 International Business Machines Corporation Suggesting a relationship for a node pair based upon shared connections versus total connections
WO2019165388A1 (en) * 2018-02-25 2019-08-29 Graphen, Inc. System for discovering hidden correlation relationships for risk analysis using graph-based machine learning

Also Published As

Publication number Publication date
CA2383549A1 (en) 2001-03-08
WO2001016805A3 (en) 2002-09-26
JP2003527662A (en) 2003-09-16
EP1266305A2 (en) 2002-12-18
AU7110500A (en) 2001-03-26


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2383549

Country of ref document: CA

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2001 520685

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 71105/00

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 2000959855

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWP Wipo information: published in national office

Ref document number: 2000959855

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2000959855

Country of ref document: EP