WO2001016805A2 - A system and method for mining data from a database using relevance networks - Google Patents
- Publication number
- WO2001016805A2 (PCT/US2000/024257)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variables
- association
- strength
- data
- network
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Definitions
- the invention relates generally to data processing. More specifically, the invention relates to a system and method for mining data from a data set to identify potentially meaningful relationships among variables in the data set.
- the simple criteria matching technique measures RNA expression levels before and after an intervention. For each gene, fold-differences are calculated. The genes are then sorted according to the calculated fold-differences. Genes showing a fold-change greater than a given threshold are "clustered" with the intervention.
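- The fold-difference criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fold_change_cluster`, the dictionary layout, and the choice to treat increases and decreases symmetrically are hypothetical and are not taken from the source.

```python
def fold_change_cluster(before, after, threshold):
    """Return (gene, fold) pairs whose fold-difference exceeds `threshold`.

    `before` and `after` map a gene name to an expression level measured
    before and after the intervention (hypothetical layout). Results are
    sorted with the largest fold-difference first.
    """
    folds = {}
    for gene, base in before.items():
        if gene not in after or base <= 0:
            continue  # no usable baseline for this gene
        ratio = after[gene] / base
        # Assumption: a 4x decrease counts the same as a 4x increase.
        folds[gene] = max(ratio, 1.0 / ratio) if ratio > 0 else float("inf")
    ranked = sorted(folds.items(), key=lambda item: item[1], reverse=True)
    return [(gene, fold) for gene, fold in ranked if fold > threshold]
```

Genes returned by such a function would be the ones "clustered" with the intervention; the threshold itself remains a free parameter.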
- the self-organizing map technique represents genes as multi-dimensional points in a multi-dimensional space. Coordinates for these points represent expression levels of each gene at various moments in time. A grid of centroids is imposed in the multi-dimensional space, and the centroids are allowed to drift. Each centroid drifts towards a collection of points. When the drifting completes, the centroids identify clusters of genes that exhibit similar time-course behavior. In this way, related genes have a smaller Euclidean distance in the multi-dimensional space. However, large numbers of dimensions can cause the technique to become computationally intensive. Moreover, the resulting gene/time-course clusters provide little information about specific gene-to-gene relationships among the genes in the clusters.
- Techniques that perform comprehensive pair-wise comparisons generally compare each gene against each other gene using a metric.
- One particular technique creates a vector for each gene. The vector is made up of expression levels taken at various times. Each gene is compared against each other gene by recording the correlation coefficient between the corresponding vectors. The technique then constructs a phylogenetic-type tree with branch lengths between genes being proportional to the correlation coefficients. However, phylogenetic-type trees, in general, do not show more than the most correlated relationships of each gene, omitting the lesser correlated, yet potentially significant relationships.
- Another technique combines the Euclidean distance and pair-wise comparison techniques by constructing phylogenetic-type trees with branch length proportional to the Euclidean distance between genes. The coordinates again represent expression levels at various time points.
- the present invention relates to a system and method for producing a network of related variables.
- An objective of the invention is to group variables occurring in data extracted from a data source in a manner that makes readily apparent any potentially significant relationships among those variables and consequently motivate hypotheses for targeted research.
- Another objective is to concurrently examine relationships among large numbers of variables.
- the invention features a method that obtains data for a plurality of variables.
- An association between each pair of variables is established.
- a strength of the association between each pair of variables is calculated and evaluated according to a predetermined criterion.
- a network of variables is produced.
- the network of variables includes each association having a strength that satisfies the criterion.
- the variables can represent any type of data (e.g., genomic data, financial information, customer transaction, airline travel information, etc.).
- the network of variables can be graphically displayed.
- the network of variables is produced by including each established association irrespective of the strength of that association, and subsequently removing each association from the network of variables that fails to satisfy the criterion.
- One embodiment includes each variable in the network of variables, and subsequently removes that variable from the network of variables if all associations with that variable fail to satisfy the criterion. Removing a variable from the network of variables can produce a plurality of separate networks of variables.
- the method establishes the criterion as a threshold value for the strength of the association.
- each association having a strength above the threshold value satisfies the criterion.
- the strength of the association between the two variables can be calculated using mutual information between the variables.
- Other embodiments use a linear regression model (e.g., computing a Pearson correlation coefficient) or a non-linear regression model.
- the threshold value is determined by randomly permuting the data for each pair of variables. A strength of the association between each pair of variables is calculated from the permuted data. The steps of permuting and calculating are repeated a predetermined number of times. The strongest association is determined from the strengths of associations determined using permuted data. The threshold value is set equal to the strongest association.
- the invention relates to a system for producing a network of related variables. The system includes memory storing data for the plurality of variables. An associator establishes an association between each pair of variables in the network of variables. A calculator calculates the strength of the association between each pair of variables.
- An evaluator evaluates the strength of the association between each pair of variables according to a predetermined criterion.
- a network generator produces a network of variables that includes each association that satisfies the criterion.
- In another aspect, the invention relates to a system for determining a strength of association between any two of a plurality of variables.
- the system includes memory, storing data for two or more variables, and a processor in communication with the memory.
- the processor executes software that (1) establishes an association between each pair of variables to produce a network of variables, (2) calculates from the data a strength of the association between each pair of variables, (3) evaluates the strength of each association according to a predetermined criterion, (4) produces a network of variables that includes each association, (5) removes each association from the network of variables that fails to satisfy the criterion, and (6) graphically displays the network of variables.
- Fig. 1 is a block diagram of an embodiment of an exemplary system for mining data in databases according to the principles of the invention
- Fig. 2A is an embodiment of a table including data from a data source for a plurality of variables
- Fig. 2B is an embodiment of a scatter plot of the data for a pair of variables from the table of Fig. 2A;
- Fig. 2C is an embodiment of a scatter plot of the data for another pair of variables from the table of Fig. 2A;
- Fig. 3 is a flow chart of an embodiment of an exemplary process that produces relevance networks using the associations between variables in a data set according to the principles of the invention;
- Fig. 4 is a block diagram of an embodiment of a graphical representation of the associations between each pair of variables in the data set;
- Fig. 5 is an embodiment of a variable matrix including examples of strength values for each of associations shown in Fig. 4;
- Fig. 6 is an embodiment of a table illustrating an exemplary permutation of the data in the table shown in Fig. 2A;
- Fig. 7 is an embodiment of a graph illustrating results from an exemplary process used to determine a threshold value for evaluating the strengths of the associations between each pair of variables;
- Figs. 8A, 8B, and 8C are embodiments of relevance networks produced by applying different criteria to the variables and links shown in Fig. 4 with the exemplary associated strength values of Fig. 5;
- Fig. 9 is an embodiment of a relevance network produced from actual genomic data.
- Fig. 1 shows an exemplary embodiment of system architecture 10 including a computer system 20 in communication with a data source 30.
- the computer system 20 includes a processor and memory (not shown) programmed to perform data mining that discovers relationships among variables in the data according to the principles of the invention.
- the processor in one embodiment is a 266 MHz Pentium II™ processor, manufactured by Intel Corporation of Santa Clara, California.
- One embodiment of the computer system 20 is a Sun Ultra HPC 5000 server running Solaris, manufactured by Sun
- the data source 30 in one embodiment is a database system, e.g., ORACLE 8™, or data stored in files on a data storage device, such as a hard disk.
- the processor of the computer system 20 executes data mining software.
- data mining software is written in any programming language, such as C, C++, etc.
- the data in the data source 30 represent measurements of multiple variables for various sample cases.
- the sample cases in one embodiment are individuals and the measured variables are physical characteristics, such as weight, height, age, gender, race, etc.
- the sample cases in one embodiment pertain to a single patient evaluated at different time intervals.
- the patient is subject to particular laboratory tests, such as hemoglobin, hematocrit, and thyroxine measurements taken over a period of time.
- the measured variables are continuous variables.
- the sample cases are RNA expression measurements and the measured variables are genes.
- the sample cases are corporate institutions for which the measured variables are financial data, such as stock prices, price-to-earnings ratios, etc., acquired over time. In general, the principles of the invention can be practiced to examine any type of data in search of relationships among various measured variables.
- the invention can mine data from databases containing customer sales transactions, commercial passenger travel information (e.g., airline), financial data, and data collected by laboratories, research facilities, commercial institutions, finance institutions, etc.
- An advantage is that the invention can exploit existing electronic databases.
- execution of the data mining software causes the computer system 20 to access data in the data source 30.
- the data mining software associates each variable in the accessed data with every other variable and determines the significance of the association between each pair of variables. Significance can be defined according to a predetermined criterion.
- the data mining software groups together variables into one or more separate relevance networks.
- Each relevance network represents a group of related variables; that is, each variable in a relevance network has a significant association (as defined by the criterion) with at least one other variable in that relevance network, and does not have a significant association (as defined by the criterion) with variables in other relevance networks.
- the data mining software outputs each relevance network for display (e.g., at the computer system). The displayed output makes it readily apparent that a relationship potentially worthy of targeted research was detected among variables in the data.
- Fig. 2A shows an exemplary tabular representation 50 of data in the data source.
- S1, S2, S3, and S4 on the y-axis are represented as rows. This column and row arrangement is exemplary; the sample cases and variables can appear on either the x- or y-axis and remain within the scope of the invention. In addition, the principles of the invention extend to more sample cases and variables than those shown in Fig. 2A.
- the table 50 can be completely, densely, or sparsely populated with data values 52.
- Fig. 2A shows an exemplary data set of twenty entries wherein the table 50 includes fifteen numerical data values (VAL1 - VAL15). Five entries of the table 50 lack a data value, each denoted by a dashed line.
- the data values 52 are used to determine the degree of a relationship between each pair of variables.
- Each pair of data values 52 appearing in the same row in the table 50 represents a data point 54 in a scatter plot.
- Fig. 2B shows an embodiment of an exemplary scatter plot of the data points 54 produced by the data for variables D and E.
- the data points are (VAL9, VAL12), (VAL10, VAL13), and (VAL11, VAL14).
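- The way data points are formed from values that share a row can be sketched as follows. This is an illustration only; the function name `data_points`, the list-of-dictionaries table layout, and the use of `None` for a missing entry are assumptions, and the numbers in the usage note are invented rather than taken from Fig. 2A.

```python
def data_points(table, var_x, var_y):
    """Pair the values of two variables that appear in the same row.

    `table` is a list of rows, one per sample case. Each row maps a
    variable name to a value, or to None where the table has no entry
    (hypothetical layout). Rows missing either value yield no point.
    """
    points = []
    for row in table:
        x, y = row.get(var_x), row.get(var_y)
        if x is not None and y is not None:
            points.append((x, y))
    return points
```

For example, with rows `{"D": 1.0, "E": 2.0}`, `{"D": None, "E": 3.0}`, and `{"D": 4.0, "E": 5.0}`, the call `data_points(table, "D", "E")` returns two points, because the second row lacks a value for D. The length of the returned list is the number of data points supporting that pair of variables.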
- Fig. 2C illustrates another embodiment of an exemplary scatter plot of the data points for variables A and E, where the data points are
- Fig. 3 shows an exemplary process for finding relationships among the variables A, B, C, D, and E according to the principles of the invention.
- the process obtains (step 60) a set of data from the data source 30.
- the data in the data set includes values for various variables for the sample cases.
- the computer system 20 organizes (step 64) the obtained data in the data set.
- One exemplary data organization is the tabular representation 50 shown in Fig. 2A.
- the computer system 20 associates (step 68) each variable with every other variable in the data set. Accordingly, an association exists between each pair of variables in the data set.
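- Step 68 amounts to enumerating every unordered pair of variables. A minimal sketch, assuming the variables are identified by name (the function name `all_associations` is hypothetical):

```python
from itertools import combinations


def all_associations(variables):
    """Return every unordered pair of distinct variables (step 68)."""
    return list(combinations(variables, 2))
```

For the five variables A through E this produces ten associations, and in general n variables produce n(n-1)/2, which is why the number of strengths to compute grows quadratically with the number of variables.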
- the computer system 20 calculates (step 72) the strength of each association.
- strength is an indication of how closely the variables are related. A strong association indicates that the variables are closely related; a weak association indicates a low or no relationship between the variables.
- Variables can be related to each other in various ways.
- variables can be related through physiology; for example, the serum concentration of bicarbonate is related to the alveolar partial pressure of carbon dioxide.
- Variables can be related through mathematical formulae, such as neutrophil count and percentage of neutrophils.
- Some variables can be directly or indirectly related to each other through other variables.
- An example of an indirect relationship is how thyrotropin-releasing hormone controls thyroxine level through thyroid stimulating hormone.
- variables can have a relationship with each other relating to a pathologic condition.
- An example of such a relationship is a relationship between the erythrocyte sedimentation rate, which is an indicator of inflammation, and alpha-1 antitrypsin, an acute phase protein indicative of an inflammatory disease state.
- Other variables can be related through synonymy. For example, both somatomedin C and insulin-like growth factor-1 refer to the same molecule.
- the principles of the invention can recognize when distinct variables represent the same thing, although referred to by different names.
- the computer system 20 constructs (step 74) a graphical network of variables using every association established in step 68. In this network of variables, each variable is linked to every other variable (e.g., see Fig. 4).
- the computer system 20 evaluates (step 76) the strength of the association between each pair of variables according to a predetermined criterion.
- the criterion can be a threshold value.
- the computer system 20 removes (step 80) the association between each pair of variables if the strength of that association fails to satisfy the predetermined criterion.
- the predetermined criterion can require the strength of the association between each pair of variables to be above the threshold value, or otherwise that pair of variables becomes disassociated.
- the predetermined criterion can require the strength of the association for each variable pair to be below the threshold value in order for that association to remain.
- the computer system 20 also removes (step 84) each variable that has no associations with other variables remaining after step 80; that is, all associations of that variable fail to satisfy the criterion.
- the remaining associations and variables form one or more relevance networks.
- each relevance network is displayed at the client system 26.
- the removal of associations and variables can divide the network of variables into smaller, separate networks. Each such smaller network is a relevance network because that smaller network represents a group of related variables. Each variable in that smaller network has an association with at least one other variable in that network that satisfies the criterion.
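- Steps 80 and 84, together with the splitting into separate networks, can be sketched as a threshold filter followed by a connected-components search. This is an illustration under assumptions: the function name `relevance_networks` and the dictionary layout are hypothetical, and the sketch keeps an association whose strength is greater than or equal to the threshold, which is only one of the criteria the text allows.

```python
def relevance_networks(strengths, threshold):
    """Filter associations by strength and group the surviving variables.

    `strengths` maps a pair of variable names, e.g. ("A", "B"), to the
    strength of that association (hypothetical layout). Returns a list
    of relevance networks, each a sorted list of variable names.
    """
    # Step 80: drop every association that fails the criterion.
    kept = [pair for pair, s in strengths.items() if s >= threshold]

    # Step 84: a variable with no surviving association never enters
    # the adjacency map, so it is removed implicitly.
    adjacency = {}
    for a, b in kept:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Each connected component of what remains is one relevance network.
    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(sorted(component))
    return networks
```

With invented strengths of 0.9 for (A, B), 0.8 for (A, C), 0.75 for (B, C), 0.5 for (D, E), and 0.1 for (A, D), a threshold of 0.4 yields two networks, A-B-C and D-E, while a threshold of 0.6 leaves only A-B-C because D and E lose their last association.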
- the criterion may cause the removal of none, one, or multiple associations without the removal of any variables.
- the relevance network includes all of the variables in the data set.
- the computer produces (step 74') the graphical network of variables after the strength of each association is evaluated against the criterion in step 76.
- the computer system 20 constructs the network of variables using only those associations that satisfy the criterion. Variables appear in this network of variables if there is at least one association with that variable which satisfies the criterion.
- this network of variables is constructed as a relevance network because the network of variables includes only those variables and associations that satisfy the criterion throughout construction of the variable network. No associations or variables need to be removed from this variable network, such as described in connection with steps 80 and 84, to produce a relevance network.
- Fig. 4 shows an exemplary embodiment of a network of variables 110 graphically representing the associations initially established between each pair of variables.
- the associations are represented as links 100 between pairs of variables.
- Each variable A, B, C, D, and E is shown as a node in the network of variables and shares a link 100 with every other variable. For example, variable A shares a link 100 with variable B, another link 100 with variable C, another link 100 with variable D, and yet another link 100 with variable E.
- Each link 100 has an assigned value representing the strength of the association between the pairs of variables.
- Fig. 5 shows an exemplary matrix 104 containing examples of strength values 108 assigned to each of the links 100 of Fig. 4.
- the matrix 104 places the variables A, B, C, D, and E on both the x- and y-axes.
- Each value 108 in the matrix 104 represents the strength determined for the association between the respective pair of variables.
- the matrix 104 is symmetric, and those entries in the matrix 104 denoted by X are either duplicative of another entry in the matrix 104 (e.g., entries (A, B) and (B, A)), or tautological (e.g., entry (A, A)).
- Such entries need not be calculated or stored.
- the values 108 shown are exemplary and selected only for illustrating the principles of the invention.
- a variety of methodologies can be used to calculate strength of the association between each pair of variables.
- the following described methodologies are exemplary, as the principles of the invention can be practiced using any methodology capable of assessing the quality of relationships between pairs of variables. Such methodologies can make quantitative or qualitative assessments of those relationships.
- One methodology is to consider the number of data points that are used to establish an association between a pair of variables. Associations between variables based on a high number of data points are stronger than those associations based on fewer data points.
- This methodology for establishing the strength of an association can be used alone or in combination with other methodologies, such as those described below.
- Another exemplary methodology computes a correlation coefficient (typically denoted as r) between each pair of variables. The technique for computing a correlation coefficient can depend upon the kinds of variables in the data set.
- a correlation coefficient of 1 indicates a perfect linear relationship between variables with a positive slope
- a correlation coefficient of -1 indicates a perfect linear inverse correlation (i.e., a relationship with a negative slope)
- a correlation coefficient of 0 indicates no linear relationship. Use of this correlation coefficient detects positive and negative relationships between two variables.
- the correlation coefficient is Pearson's correlation coefficient. The Pearson correlation coefficient can measure the linear association between variables for which the data have been measured over intervals.
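- A Pearson correlation coefficient over a list of data points can be computed as sketched below. The function name `pearson_r` and the list-of-tuples input are assumptions made for illustration; the formula itself is the standard sample correlation.

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) data points."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two data points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)
```

Points lying exactly on a line of positive slope give a value of 1, and points on a line of negative slope give -1, matching the interpretation stated above.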
- the correlation coefficient is a Spearman Rank correlation coefficient. The Spearman Rank correlation coefficient can be a more appropriate coefficient than the
- the square of the correlation coefficient, $r^2$ (typically referred to as the coefficient of determination), can be used.
- the value of $r^2$ ranges between 0 and 1. Because $r^2$ is the square of the correlation coefficient, the value is never negative regardless of the sign of the coefficient, and it tends to enhance the differences between correlation coefficient values that are highly correlated. That is, a correlation coefficient, r, of 0.5 has an $r^2$ of 0.25, whereas an r of greater than approximately 0.71 has an $r^2$ of greater than 0.5.
- Another technique for computing a correlation coefficient uses a nonlinear regression model.
- Other statistical methods of computing correlation coefficients between variables are known in the art and can be used to determine the strength of the associations between pairs of variables.
- Another exemplary methodology computes entropy (H) of the variables and the mutual information between each pair of variables.
- the entropy of a variable is a measure of the information content in that variable.
- Mutual information is a measure of the additional information known about one variable when given another variable, and is useful for variables (e.g., color) that do not have a numerical relationship with other variables.
- Entropy for a variable is computed using a histogram model for discrete probabilities. A range of values for the variable is calculated. That range is then subdivided into n sub-ranges.
- the proportion of measurements in sub-range $x_i$ (or frequency) is denoted as $p(x_i)$.
- the histogram increasingly models the probability density function for the variable.
- Entropy can be calculated using the following equation:
- $H(A) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$, where $\log_2$ is the base-2 logarithm. Higher entropy indicates that the data for that variable are more randomly distributed, and thus carry more information.
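- The histogram-based entropy calculation can be sketched as follows. The function name `histogram_entropy`, the equal-width sub-ranges, and the convention of skipping empty sub-ranges (whose contribution is taken as zero) are assumptions made for illustration.

```python
import math


def histogram_entropy(values, n_bins):
    """Entropy in bits of `values`, using n_bins equal-width sub-ranges."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant variable carries no information
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the upper edge; keep it in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    # Empty sub-ranges are skipped: p * log2(p) is taken as 0 when p is 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

Four measurements spread evenly over four sub-ranges give an entropy of 2 bits, the maximum for four sub-ranges, whereas measurements that all share one value give 0.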
- Mutual information can be calculated by subtracting the entropy of a first variable (A) given an occurrence of a second variable (B) from the entropy of the first variable (A) as represented by the following equation:
- $MI(A, B) = H(A) - H(A \mid B)$
- $MI(A, B) = H(A) + H(B) - H(A, B)$.
- a mutual information of zero means that the joint distribution of values for a pair of variables holds no more information than the variables considered separately.
- a higher mutual information between two variables indicates that one variable is predictable from the other variable. Consequently, mutual information can be used as a metric between two variables related to their degree of independence.
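- Mutual information in the form H(A) + H(B) - H(A, B) can be sketched for discrete values as shown below. The function names `entropy` and `mutual_information` are hypothetical, and the sketch assumes the values are already discrete labels; continuous measurements would first need to be assigned to sub-ranges as in the histogram model.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of hashable, discrete labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def mutual_information(pairs):
    """MI(A, B) = H(A) + H(B) - H(A, B) for a list of (a, b) tuples."""
    a_values = [a for a, _ in pairs]
    b_values = [b for _, b in pairs]
    # The joint entropy treats each (a, b) tuple as a single label.
    return entropy(a_values) + entropy(b_values) - entropy(pairs)
```

When one variable fully determines the other, as in the pairs `("red", 0), ("red", 0), ("green", 1), ("green", 1)`, the result equals the entropy of either variable (1 bit here). When every combination occurs equally often, the result is 0, consistent with the interpretation of zero mutual information given above.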
- the computer system 20 can use the above-described equations to compute a mutual information relationship between pairs of genes.
- the strength of each association is compared with a criterion.
- the comparison operates as a filter that removes weakly related or unrelated associations and variables from the network of variables to produce one or more relevance networks. Consequently, the setting of the criterion is determinative as to which variables and associations appear in a relevance network.
- the criterion is a minimum number of data points upon which the strength of each association between variables must be based. Any association based on fewer than that minimum number of data points fails to satisfy the criterion and is removed from the network of variables. Such an association is deemed weak because of the paucity of data supporting the association. For example, referring to Fig. 2A, if the minimum number of data points is two, then the associations between variables B and A and between variables B and D fail to satisfy the criterion because both associations are based on one data point only, (VAL5, VAL3) and (VAL4,
- the criterion is a threshold value against which the strength of each association is measured.
- the threshold value can be set using any technique for the purposes of practicing the invention, such as, for example, trial and error.
- Fig. 6 shows an exemplary permutation of the data in table 50 shown in Fig. 2A.
- the permutation of the data creates new data points between variables.
- the permutation shown in Fig. 6 produces two new data points between variables A and C, namely (VAL2, VAL8) and (VAL1, VAL6), which differ from the original data points shown in Fig. 2A, namely (VAL2, VAL6) and (VAL3,
- strengths of associations between pairs of variables are calculated.
- the technique used to calculate the strength of associations for permuted data points is the same as that used for the original data points. Accordingly, if mutual information is used to indicate the strength of associations for the original data points, then mutual information is also used for the permuted data points.
- the steps of permuting the data and calculating strengths are repeated a predetermined number of times (e.g., 30). The threshold value is then set to the strongest association obtained from the repeated permutations of the data.
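The permutation procedure for setting the threshold can be sketched as follows. This is a minimal sketch under stated assumptions: each variable's values are shuffled independently, the data are complete (no missing values), and squared Pearson correlation stands in for the strength measure. Any other measure, such as mutual information, could be passed in instead. The names `permutation_threshold` and `squared_correlation` are hypothetical.

```python
import itertools
import random


def squared_correlation(xs, ys):
    """Squared Pearson correlation, used here only as a stand-in strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def permutation_threshold(data, strength, n_permutations=30, seed=0):
    """Return the strongest association seen across permuted copies of data.

    data maps each variable name to a list of measurements of equal length.
    """
    rng = random.Random(seed)
    strongest = float("-inf")
    for _ in range(n_permutations):
        # Shuffle each variable independently to break real associations.
        permuted = {name: rng.sample(values, len(values))
                    for name, values in data.items()}
        for a, b in itertools.combinations(permuted, 2):
            strongest = max(strongest, strength(permuted[a], permuted[b]))
    return strongest


# Illustrative measurements only.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
    "C": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0],
}
threshold = permutation_threshold(data, squared_correlation)
```

Associations in the original data whose strength exceeds `threshold` are then stronger than anything the shuffled data produced in the repeated permutations.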
- Fig. 7 is an exemplary graph illustrating the results of this process for determining a threshold value as applied to actual data taken from 2,467 genes in Saccharomyces cerevisiae.
- the results are described in the U.S. provisional patent application, filed September 13, 1999, and given serial number 60/153,593, attorney docket number CMC-008PR1, and incorporated by reference herein.
- mutual information was calculated between measurements of RNA expression between pairs of the 2,467 genes.
- the distribution of the mutual information appears as filled circles.
- Mutual information was also calculated using permuted RNA expression measurements.
- the average distribution of 30 repeated permutations appears as open circles. The permutations did not produce any associations having a mutual information value greater than 1.3.
- the threshold value used to filter associations can be set to 1.3.
- any associations produced from the original data points having a mutual information above 1.3 could be considered significant.
- Figs. 8A, 8B, and 8C show the resulting relevance networks produced by the process described in Fig. 3.
- a different criterion is applied to the links 100 representing the associations between variables A, B, C, D, and E shown in Fig.
- the relevance networks of Figs. 8A, 8B and 8C are the results of applying minimum thresholds of .4, .6, and .7 respectively. Links 100 having a strength value below the threshold are removed, and links 100 greater than or equal to the threshold remain. In these examples, the criterion does not require a minimum number of data points.
- Fig. 8A displays a relevance network 120 that includes all of the variables A, B, C, D, E, but fewer associations than those shown in the original network 110 shown in Fig. 4. In particular, all but one of the associations between D and the other variables have been removed. The only remaining association with variable D is with variable E. In Fig. 8B, the remaining association between variables D and E also fails to satisfy the threshold value of .6. Consequently, the resulting relevance network 122 does not include the variable D because the variable D has no associations with any of the other variables that meet the criterion.
- Fig. 8C illustrates how the threshold value of .7 has divided the original network of variables 110 into two smaller, separate relevance networks 125 and 125'.
- the relevance network
- Figs. 8A, 8B, and 8C are exemplary.
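The effect of raising the threshold in Figs. 8A, 8B, and 8C can be sketched by removing links below the threshold and grouping the surviving variables into connected components, each of which is one relevance network. The link strengths below are invented so that the output reproduces the qualitative behaviour described (variable D drops out at .6, and the network splits in two at .7). They are not the values from the figures, and which variables end up in each of the two networks is likewise an assumption.

```python
from collections import defaultdict


def relevance_networks(links, threshold):
    """Return the connected components that survive thresholding.

    links maps a pair of variable names to the strength of their association.
    Links with strength >= threshold remain; variables left with no links
    do not appear in any returned network.
    """
    adjacency = defaultdict(set)
    for (a, b), strength in links.items():
        if strength >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    networks, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return networks


# Hypothetical strengths for the links among variables A to E.
links = {
    ("A", "B"): 0.9, ("A", "C"): 0.65, ("B", "C"): 0.45,
    ("C", "E"): 0.8, ("D", "E"): 0.5, ("A", "D"): 0.2, ("B", "D"): 0.3,
}
at_04 = relevance_networks(links, 0.4)
at_06 = relevance_networks(links, 0.6)
at_07 = relevance_networks(links, 0.7)
```

With these made-up strengths, a threshold of .4 yields a single network containing all five variables, .6 yields a single network without D, and .7 yields two separate networks.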
- The invention can be applied to large numbers of variables.
- the computer system 20 can execute graph layout software.
- An example of such software is the Graph Editor Toolkit, developed by Tom Sawyer Software of Berkeley, California.
- Fig. 9 is an embodiment of a relevance network 130 produced from actual genome data as described in the U.S. provisional application, serial number 60/153,593.
- This particular relevance network 130 clustered 143 genes out of a data set of 79 RNA expression measurements of 2,467 genes.
- the graph layout software isolates two branches of genes 132 and 132' attached to the network 130 by a single association. In Fig. 9, the branches are exploded to show some detail regarding the names of the associated genes. Such branches of biologically relevant gene clusters identify opportunities for further study.
- the present invention is useful in a variety of applications. For example, relevance networks produced for normal cells can be compared to those relevance networks produced for various cancer cells to help identify distinctions and similarities. Similarly, the invention enables comparisons between the relevance networks of various cancers. Another example uses the relevance networks to monitor changes of certain variables throughout the treatment of a patient.
- the present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture.
- the article of manufacture may be a floppy disk, a hard disk, a CD-ROM, a flash memory card, a PROM, a RAM, a ROM, or a magnetic tape.
- the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, PROLOG, or any byte code language such as JAVA.
- the software programs may be stored on or in one or more articles of manufacture as object code.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002383549A CA2383549A1 (en) | 1999-09-02 | 2000-09-01 | A system and method for mining data from a database using relevance networks |
EP00959855A EP1266305A2 (en) | 1999-09-02 | 2000-09-01 | A system and method for mining data from a database using relevance networks |
JP2001520685A JP2003527662A (en) | 1999-09-02 | 2000-09-01 | System and apparatus for retrieving data from a database using an associated network |
AU71105/00A AU7110500A (en) | 1999-09-02 | 2000-09-01 | A system and method for mining data from a database using relevance networks |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15250099P | 1999-09-02 | 1999-09-02 | |
US60/152,500 | 1999-09-02 | ||
US15359399P | 1999-09-13 | 1999-09-13 | |
US60/153,593 | 1999-09-13 | ||
US43045099A | 1999-10-29 | 1999-10-29 | |
US09/430,450 | 1999-10-29 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001016805A2 true WO2001016805A2 (en) | 2001-03-08 |
WO2001016805A3 WO2001016805A3 (en) | 2002-09-26 |
Family
ID=27387262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/024257 WO2001016805A2 (en) | 1999-09-02 | 2000-09-01 | A system and method for mining data from a database using relevance networks |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1266305A2 (en) |
JP (1) | JP2003527662A (en) |
AU (1) | AU7110500A (en) |
CA (1) | CA2383549A1 (en) |
WO (1) | WO2001016805A2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1455283A2 (en) * | 2003-03-03 | 2004-09-08 | Fujitsu Limited | Information relevance display method program storage medium and apparatus |
US7647287B1 (en) | 2008-11-21 | 2010-01-12 | International Business Machines Corporation | Suggesting a relationship for a node pair based upon shared connections versus total connections |
WO2019165388A1 (en) * | 2018-02-25 | 2019-08-29 | Graphen, Inc. | System for discovering hidden correlation relationships for risk analysis using graph-based machine learning |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107025596B (en) | 2016-02-01 | 2021-07-16 | 腾讯科技(深圳)有限公司 | Risk assessment method and system |
-
2000
- 2000-09-01 AU AU71105/00A patent/AU7110500A/en not_active Abandoned
- 2000-09-01 WO PCT/US2000/024257 patent/WO2001016805A2/en not_active Application Discontinuation
- 2000-09-01 CA CA002383549A patent/CA2383549A1/en not_active Abandoned
- 2000-09-01 JP JP2001520685A patent/JP2003527662A/en active Pending
- 2000-09-01 EP EP00959855A patent/EP1266305A2/en not_active Withdrawn
Non-Patent Citations (8)
Title |
---|
ALON U ET AL: "BROAD PATTERNS OF GENE EXPRESSION REVEALED BY CLUSTERING ANALYSIS OF TUMOR AND NORMAL COLON TISSUES PROBED BY OLIGONUCLEOTIDE ARRAYS" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 96, June 1999 (1999-06), pages 6745-6750, XP002901769 ISSN: 0027-8424 * |
BASETT D E ET AL: "GENE EXPRESSION INFORMATICS - IT'S ALL IN YOUR MINE" NATURE GENETICS, NEW YORK, NY, US, vol. 21, no. SUPPL, January 1999 (1999-01), pages 51-55, XP000865988 ISSN: 1061-4036 * |
CHEN T ET AL: "Identifying Gene Regulatory Networks from Experimental Data" RECOMB 99, ACM PRESS, April 1999 (1999-04), XP002189969 USA * |
CHEN Y ET AL: "CLUSTERING ANALYSIS FOR GENE EXPRESSION DATA" PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 3602, January 1999 (1999-01), pages 422-428, XP001001103 * |
CLAVERIE, J.-M.: "Computational methods for the identification of differential and coordinated gene expression" HUMAN MOLECULAR GENETICS, OXFORD UNIVERSITY PRESS, vol. 8, no. 10, 1 September 1999 (1999-09-01), pages 1821-1832, XP002202819 * |
D'HAESELEER, P. ET AL: "Gene Expression Data Analysis and Modeling" PACIFIC SYMPOSIUM ON BIOCOMPUTING 1999 (PSB99), TUTORIAL, 4 - 9 January 1999, pages 1-34, XP002203022 Hawaii, USA * |
EISEN M B ET AL: "Cluster analysis and display of genome-wide expression patterns" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 95, December 1998 (1998-12), pages 14863-14868, XP002140966 ISSN: 0027-8424 * |
MICHAELS G S ET AL: "CLUSTER ANALYSIS AND DATA VISUALIZATION OF LARGE-SCALE GENE EXPRESSION DATA" PROCEEDINGS OF THE PACIFIC SYMPOSIUM ON BIOCOMPUTING, XX, XX, 1997, pages 42-53, XP000974575 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1455283A2 (en) * | 2003-03-03 | 2004-09-08 | Fujitsu Limited | Information relevance display method program storage medium and apparatus |
EP1455283A3 (en) * | 2003-03-03 | 2006-04-12 | Fujitsu Limited | Information relevance display method program storage medium and apparatus |
US7203698B2 (en) | 2003-03-03 | 2007-04-10 | Fujitsu Limited | Information relevance display method, program, storage medium and apparatus |
US7647287B1 (en) | 2008-11-21 | 2010-01-12 | International Business Machines Corporation | Suggesting a relationship for a node pair based upon shared connections versus total connections |
WO2019165388A1 (en) * | 2018-02-25 | 2019-08-29 | Graphen, Inc. | System for discovering hidden correlation relationships for risk analysis using graph-based machine learning |
Also Published As
Publication number | Publication date |
---|---|
CA2383549A1 (en) | 2001-03-08 |
WO2001016805A3 (en) | 2002-09-26 |
JP2003527662A (en) | 2003-09-16 |
EP1266305A2 (en) | 2002-12-18 |
AU7110500A (en) | 2001-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Aldino et al. | Implementation of K-means algorithm for clustering corn planting feasibility area in south lampung regency | |
Beaumont | Detecting population expansion and decline using microsatellites | |
Neill et al. | Detecting significant multidimensional spatial clusters | |
CN109411033B (en) | Drug efficacy screening method based on complex network | |
CN109935337B (en) | Medical record searching method and system based on similarity measurement | |
Songdechakraiwut et al. | Topological learning and its application to multimodal brain network integration | |
US20040234995A1 (en) | System and method for storage and analysis of gene expression data | |
CN112131399A (en) | Old medicine new use analysis method and system based on knowledge graph | |
CN110910991A (en) | Medical automatic image processing system | |
CN108921211A (en) | A method of based on density peaks cluster calculation fractal dimension | |
CN117370565A (en) | Information retrieval method and system | |
EP1266305A2 (en) | A system and method for mining data from a database using relevance networks | |
Truong et al. | Learning a complex metabolomic dataset using random forests and support vector machines | |
US20030187592A1 (en) | Association rule mining and visualization for disease related gene | |
CN109033746B (en) | Protein compound identification method based on node vector | |
CN115336977B (en) | Accurate ICU alarm grading evaluation method | |
CN116705310A (en) | Data set construction method, device, equipment and medium for perioperative risk assessment | |
CN111202511A (en) | Recommendation and distribution method and device for electrocardiogram data labeling | |
Rosenberg | Enumeration of lonely pairs of gene trees and species trees by means of antipodal cherries | |
CN116049644A (en) | Feature screening and clustering and binning method and device, electronic equipment and storage medium | |
CN113392086B (en) | Medical database construction method, device and equipment based on Internet of things | |
Abdelfattah | Variables Selection Procedure for the DEA Overall Efficiency Assessment Based Plithogenic Sets and Mathematical Programming | |
Muttaqien et al. | Recommendation of Student Admission Priorities Using K-Means Clustering | |
CN112101021A (en) | Method, device and equipment for realizing standard word mapping | |
WO2001073602A2 (en) | Clustering and examining large data sets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
WWE | Wipo information: entry into national phase | Ref document number: 2383549; Country of ref document: CA |
ENP | Entry into the national phase | Ref country code: JP; Ref document number: 2001 520685; Kind code of ref document: A; Format of ref document f/p: F |
WWE | Wipo information: entry into national phase | Ref document number: 71105/00; Country of ref document: AU |
WWE | Wipo information: entry into national phase | Ref document number: 2000959855; Country of ref document: EP |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
AK | Designated states | Kind code of ref document: A3; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A3; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
WWP | Wipo information: published in national office | Ref document number: 2000959855; Country of ref document: EP |
WWW | Wipo information: withdrawn in national office | Ref document number: 2000959855; Country of ref document: EP |