WO2001073602A2 - Clustering and examining large data sets - Google Patents

Clustering and examining large data sets

Info

Publication number
WO2001073602A2
Authority
WO
WIPO (PCT)
Prior art keywords
values
series
similarity
computer
calculating
Prior art date
Application number
PCT/IB2001/000625
Other languages
French (fr)
Other versions
WO2001073602A3 (en)
Inventor
Gregoire Thomas
Nikos Berdenis
Christophe Van Huffel
Original Assignee
Starlab Nv/Sa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Starlab Nv/Sa
Priority to AU44478/01A (AU4447801A)
Publication of WO2001073602A2
Publication of WO2001073602A3

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30: Unsupervised data analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00: Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03: Data mining
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

In one aspect, the invention features methods for analyzing large amounts of gene expression data. The methods of the invention can be used, for example, to identify relationships between various genes.

Description

CLUSTERING AND EXAMINING LARGE DATA SETS
CROSS-REFERENCE TO RELATED APPLICATIONS
[1] This application claims priority under 35 USC §119 to United States provisional application 60/192,982, "CLUSTERING AND EXAMINING LARGE DATA SETS," filed March 28, 2000.
BACKGROUND
[2] This invention relates to data processing, and more particularly to clustering and examining a large amount of data.
[3] Examining data may uncover relationships among some or all of the data points included in the data. However, it can be difficult to detect relationships when the data includes a large number of data points. Clustering the data can help detect relationships by identifying the proximity of data points. Given a large amount of data, the data points may range over a wide data space, but certain areas of the data space may be more densely populated than other areas. Clustering helps identify the density of data points in the data space, usually producing a graphical representation of the set of data in tree (dendrogram) form.
[4] One method of clustering is hierarchical clustering. Hierarchical clustering may be divisive, beginning with all of the data points and repeatedly breaking data points off into smaller and smaller clusters, or it may be agglomerative, starting with single data points and merging them with other data points to form larger and larger clusters. In each case, hierarchical clustering can produce a tree showing the proximal similarity between all of the data points.
SUMMARY
[5] In one aspect, the invention features methods for analyzing large amounts of gene expression data. The methods of the invention can be used, for example, to identify relationships between various genes.
[6] In recent years, a very large number of genes have been partially or completely sequenced. Using this information base, scientists are beginning to understand the role of individual genes and the relationships between genes in normal and disease processes. High throughput gene expression analysis, using, for example, GeneChip® Probe Arrays (Affymetrix, Inc., Santa Clara, CA), can be used to measure the expression of hundreds or thousands of genes simultaneously. As a result, scientists can now attempt to build a complete picture of changes in gene expression over time. For example, it is now possible to measure the expression of thousands of genes in a colon cancer cell over time after exposure to a therapeutic agent. The data generated by such an experiment could be used simply to identify up-regulated and down-regulated genes. However, it would also be valuable to identify relationships between genes based on their response, over time, to the therapeutic agent. The methods of the invention provide analytical tools that can be used to identify relationships between genes in such experiments.
[7] An array of vectors representing the expression of genes at multiple time points for a plurality of genes and for different experiments can be processed in such a way as to allow a clustering of the genes based on their relationships and on their relationship intensities. A graphical display of the clusters highlights the gene networks that can be identified and/or inferred from the array of values.
[8] In one method of the invention, each vector in the array includes a series of values corresponding to the expression of one gene over time under a given set of conditions. For each vector, a series of features is derived. The nature of this transformation depends on the experiment represented in that particular vector. Generally, this transformation includes normalizing the data. Then a similarity function calculates a similarity index, the distance between each pair of genes, using the feature series of each gene. The similarity function indicates a quantitative measure of the similarity between gene-vector pairs. Each similarity index is stored in a similarity matrix, and the matrix is hierarchically clustered. The product of the clustering is a tree (dendrogram) for each experiment in which the terminal nodes are individual genes and the distance between two genes is related to their similarity in expression. The tree is a complete list of the genes, structured on the basis of their expression profiles. Each similarity matrix is also processed using a variance function to provide a quantitative measure of any repetitive behavior between gene-vector pairs throughout the experiment. The variance function also indicates the consistency (reproducibility) of the gene-to-gene relationships throughout the experiment. The genes can be hierarchically clustered using these measures of relationship consistency to produce a tree.
[9] The graphical display of the trees provides a tool for searching for and comparing the genes and experiments, particularly on a multiple-window graphical display. Upon selection of a node in a tree, the genes that are children of this node are selected in the other displayed trees. This selection allows the comparison of several clusterings and can be used to validate a clustering and search for families of related genes. The tool also allows for displaying a subset of genes as nodes of a network. The links between each pair of genes are weighted using the distance (similarity) or the variance.
DESCRIPTION OF DRAWINGS
[10] FIG. 1 is a block diagram of a computer system.
[11] FIG. 2 is a diagram showing a method of extracting relationship networks.
[12] FIGS. 3 and 4 are diagrams of cluster trees.
DESCRIPTION
[13] Referring to FIG. 1, a computer system 10 includes a driver 12 configured to process data, cluster the processed data in one or more clusters, display the cluster(s), and allow graphical comparison of the clusters. The driver 12 may instead be included in a processor 14 or in a graphics device 16. The data that the driver 12 processes can include an array of values representing, e.g., gene expression levels at multiple time points for a range of genes under a number of treatments and experiments. The array of values can include hundreds or more data points, so the driver 12 is configured to rapidly process and cluster a large amount of data. The clusters, typically structured as trees (dendrograms), may be displayed on a display unit 26 alone or in groups. This graphical display of the data allows a user (not shown) of the system 10 to view the relationships and relationship intensities within the clusters. For example, the displayed data may be manipulated using a mouse 28 by selecting a node of a tree included in one cluster and thereby selecting the genes that are the children of this node in other displayed trees. In this way, gene-to-gene relationships that are persistent through the treatments and experiments, i.e., gene networks, can be identified. By extracting fundamental gene networks, new solutions/information can be derived for gene identification, functional annotation of genes, and diagnostic or drug therapy.
[14] The data that the driver 12 processes can be provided by a device included in the system 10, such as a memory 18, or the data can enter the system through one or more input/output (I/O) units 20, such as a keyboard 22 or a data unit 24. The data unit 24 could include another computer system, a data generator, a data storage device, or any other mechanism capable of providing data to the system 10. If the data enters the system 10 via one or more of the I/O units 20, the data may be temporarily stored in the memory 18 or in local memory (not shown) before being used by the driver 12. The display unit 26, such as a computer monitor or a television, is configured to display the data clustered by the driver 12.
[15] Referring to FIG. 2, a process 30 includes processing data sets 32a-n using a similarity index function 34, a relationship index function 36, and a graphical function 38. Each data set 32a-n corresponds to one experiment and includes an array of ordered sets of m values (vectors $X_U$), where $X_U = (x_{U1}, x_{U2}, \ldots, x_{Um})$, each of which represents the expression levels for one gene (U) at a series of m time points. Although the process 30 is described with reference to processing gene-related data, the process 30 can accommodate any type of numerical data. The similarity index function 34 provides a quantitative measure of the similarity between gene-vector pairs using, for each data set 32a-n, a similarity function ($f_1$) 40a-n, a similarity matrix (r) 42a-n, and a hierarchical clustering function (Y) 44a-n. The relationship index function 36 provides a quantitative measure of any repetitive behavior between gene-vector pairs throughout a set of experiments using a variance function ($f_2$) 46, a variance matrix (σ) 48, and a hierarchical clustering function (Y) 50. The graphical function 38 displays the results of the hierarchical clustering functions 44a-n and 50 and allows the comparison of multiple hierarchical clustering functions 44a-n and 50.
[16] More specifically, each similarity function 40a-n can calculate gene similarity indices for its corresponding data set 32a-n and store the gene similarity indices in the corresponding matrix r 42a-n. The similarity indices are real positive numbers indicating a quantitative measure of the similarity between the genes being compared, which here are compared two at a time. A similarity index of zero indicates the highest degree of similarity. In calculating the similarity indices, the similarity index function 34 may first transform (normalize or feature extract) the values included in the data sets 32a-n in one of at least five ways (where i runs from one to m in each case). The transforming method used may depend on the type of data included in the data sets 32a-n. For example, in the analysis of yeast cell cycle data, the similarity index function 34 typically uses a method based on the spectrum analysis of the expression kinetics.
[17] In a first method:
[Equation image imgf000007_0001; not reproduced in this text]
The first method transforms each value $x_{Ui}$ to the amount that it deviates from the average of all values in the vector $X_U$.
[18] In a second method:
$x'_{Ui} = (x_{Ui} - \mathrm{mean}(X_U)) / \mathrm{stdev}(X_U)$
where the standard deviation (s) typically is
[Equation image imgf000008_0001; not reproduced in this text]
The second method transforms each value $x_{Ui}$ to an amount that approximates its deviation from the average of all values in the vector $X_U$, where an amount of one indicates no deviation.
[19] In a third method:
[Equation image imgf000008_0002; not reproduced in this text]
The third method transforms each value $x_{Ui}$ to an amount that equals its logarithm.
[20] In a fourth method:
$x'_{Ui} = \log(x_{Ui}) / \sum_j \log(x_{Uj})$
The fourth method transforms each value $x_{Ui}$ to an amount that approximates its deviation from the average of all log values in the vector $X_U$.
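The first four transformations can be sketched in Python as follows. This is only an illustration under stated assumptions: the equation for the first method is not reproduced in this text, so `center` follows its prose description (deviation from the vector mean); the sample standard deviation and the natural logarithm are also assumptions, and all function names and the example values are hypothetical.

```python
import math
from statistics import mean, stdev


def center(x):
    """Assumed form of the first method: deviation from the vector mean."""
    mu = mean(x)
    return [v - mu for v in x]


def standardize(x):
    """Second method: (x_i - mean) / standard deviation (sample stdev assumed)."""
    mu, s = mean(x), stdev(x)
    return [(v - mu) / s for v in x]


def log_transform(x):
    """Third method: logarithm of each value (natural log assumed, values > 0)."""
    return [math.log(v) for v in x]


def log_fraction(x):
    """Fourth method: log(x_i) divided by the sum of all log values."""
    logs = [math.log(v) for v in x]
    total = sum(logs)
    return [v / total for v in logs]


# Hypothetical expression levels for one gene at four time points.
profile = [2.0, 4.0, 8.0, 4.0]
centered = center(profile)
z_scores = standardize(profile)
fractions = log_fraction(profile)
```

Note that the fourth method is undefined when the log values sum to zero, so a real implementation would need to guard against that case.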
[21] In a fifth method, $x'_{Ui}$ equals the power spectrum derived from $X_U$ using a fast Fourier transform algorithm or a maximum entropy algorithm. The power spectrum indicates how strongly each frequency is represented in the vector $X_U$.
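As a rough sketch of the fifth method, the snippet below computes a power spectrum with a direct discrete Fourier transform. A fast Fourier transform would give the same values more quickly; the direct form is used here only to stay within the standard library. The normalization by m and the restriction to non-negative frequencies are assumptions, and `power_spectrum` is a hypothetical name.

```python
import cmath
import math


def power_spectrum(x):
    """Power at each non-negative frequency bin of a real-valued series.

    Uses a direct O(m^2) discrete Fourier transform in place of an FFT.
    """
    m = len(x)
    spectrum = []
    for k in range(m // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / m)
                    for t, v in enumerate(x))
        spectrum.append(abs(coeff) ** 2 / m)
    return spectrum


# Hypothetical profile: one full oscillation over eight time points.
wave = [math.cos(2 * math.pi * t / 8) for t in range(8)]
ps = power_spectrum(wave)
dominant_bin = max(range(len(ps)), key=ps.__getitem__)
```

For a periodic profile such as cell cycle data, the dominant bin identifies the main oscillation frequency, which is why two genes with the same period but different phase can still yield similar spectra.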
[22] One way of calculating the similarity indices and storing the similarity indices in the similarity matrices 42a-n includes:
$r_{UV} = f_1(U, V) = \sum_i (y_{Ui} - y_{Vi})^2$
where U and V represent the genes being compared and i runs from one to m, the size of the gene vectors. Another way to calculate the similarity indices and store the similarity indices in the similarity matrices 42a-n includes:
$r_{UV} = f_1(U, V) = \min_l \left( \sum_i (y_{Ui} - y_{V(i+l)})^2 \right)$
where U, V, and i are as described for the first method, and l ranges from one to m (inclusive). Both of these formulas calculate each entry in the matrix r 42a-n from the difference between a value in the gene U and a value in the gene V, squaring the difference to achieve a positive result. The first formula compares two values from the same time point, e.g., the third measurement of gene U is compared with the third measurement of gene V. The second compares two adjacent values, e.g., the third measurement of gene U is compared with the fourth measurement of gene V. The second formula also adds a min function, so that the smallest resulting value is returned.
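The two similarity formulas might be sketched as below. The text does not say what happens when the shifted index i + l runs past m, so this sketch assumes cyclic wrap-around; truncating the sum would be an equally plausible reading. The function names and example vectors are hypothetical.

```python
def similarity_same_time(y_u, y_v):
    """First formula: sum of squared differences at matching time points."""
    return sum((a - b) ** 2 for a, b in zip(y_u, y_v))


def similarity_shifted(y_u, y_v):
    """Second formula: smallest sum of squared differences over shifts l = 1..m.

    Assumes indices wrap around cyclically, so l = m is the unshifted case.
    """
    m = len(y_u)
    return min(
        sum((y_u[i] - y_v[(i + l) % m]) ** 2 for i in range(m))
        for l in range(1, m + 1)
    )


# Hypothetical transformed profiles: gene V peaks one time point after gene U.
gene_u = [0.0, 1.0, 0.0, 0.0]
gene_v = [0.0, 0.0, 1.0, 0.0]
same_time = similarity_same_time(gene_u, gene_v)
shifted = similarity_shifted(gene_u, gene_v)
```

In this example the same-time index is 2.0 while the shifted index is 0.0, which illustrates how the second formula treats a time-lagged copy of a profile as highly similar.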
[23] A hierarchical clustering function 44a-n can cluster each matrix r 42a-n, providing a tree (dendrogram) of which the terminal nodes are individual genes and the distance between two genes in the tree is related to their similarity in expression. The hierarchical clustering functions 44a-n can include single linkage hierarchical clustering. Thus, for each array (experiment), the tree is a complete list of the genes, structured on the basis of the expression profiles of the genes. The success of the clustering depends on the relevance of the indices. Both the list of gene-to-gene distances (matrix r 42a-n) and the tree are considered fingerprints of the experiment. Examples of trees are shown in FIGS. 3 and 4. The graphical function 38 can display these trees as described further below.
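A minimal sketch of agglomerative single linkage clustering over a distance matrix is shown below. It is a naive illustration, not the patent's implementation: the nested-tuple tree format `(left, right, merge_distance)`, the function name, and the example matrix are all assumptions, and the quadratic search per merge would be too slow for large gene sets.

```python
def single_linkage(names, r):
    """Cluster items by repeatedly merging the two closest clusters.

    names: list of gene labels.
    r: symmetric matrix (list of lists) of pairwise distances.
    Returns a nested tuple (left, right, merge_distance); leaves are labels.
    """
    # Each cluster is (subtree, list of member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(r[i][j] for i in clusters[a][1] for j in clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = ((clusters[a][0], clusters[b][0], d),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical similarity matrix for three genes.
genes = ["g1", "g2", "g3"]
distances = [
    [0, 1, 5],
    [1, 0, 4],
    [5, 4, 0],
]
tree = single_linkage(genes, distances)
```

Here g1 and g2 merge first at distance 1, and g3 joins at distance 4, the smaller of its distances to the two merged genes.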
[24] The relationship index function 36 can compare the fingerprints (here the matrices r 42a-n) by measuring the consistency (reproducibility) of the gene-to-gene relationships throughout the set of experiments. Essentially, the relationship index function 36 provides a quantitative measure of any repetitive behavior between gene-vector pairs throughout a set of experiments. The relationship index function 36 performs the variance function 46 to derive variation indices of positive value less than or equal to one from the experimental data in the matrices r 42a-n. The variance function 46 is based on the variance in the similarity indices throughout a set of k experiments.
[25] One variance function 46 includes:
$\sigma_{UV} = f_2(U, V) = h(\mathrm{stdev}_{UV})$,
where $\mathrm{stdev}_{UV}$ is the standard deviation of the k $r_{UV}$ values and h(x) is defined by:
[Equation image imgf000010_0001; not reproduced in this text]
where t1 and t2 are threshold parameters.
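The first variance function can be sketched as follows, with one important caveat: the definition of h(x) is not reproduced in this text, so `h_ramp` below is a purely hypothetical stand-in (a linear ramp between t1 and t2, clamped to the range 0 to 1) and should not be read as the patent's h. The sample standard deviation and all names are likewise assumptions.

```python
import math
from statistics import stdev


def h_ramp(x, t1, t2):
    """HYPOTHETICAL stand-in for h(x); the patent's definition is not shown.

    Maps x linearly from [t1, t2] onto [0, 1] and clamps outside that range.
    """
    return min(1.0, max(0.0, (x - t1) / (t2 - t1)))


def variance_matrix(similarity_matrices, h):
    """Apply h to the standard deviation of r_UV across k experiments.

    similarity_matrices: list of k square matrices (one per experiment).
    Requires k >= 2 because the sample standard deviation is used.
    """
    n = len(similarity_matrices[0])
    return [
        [h(stdev([r[u][v] for r in similarity_matrices])) for v in range(n)]
        for u in range(n)
    ]


# Two hypothetical experiments on the same pair of genes.
experiments = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 3.0], [3.0, 0.0]],
]
sigma = variance_matrix(experiments, lambda x: h_ramp(x, t1=0.0, t2=2.0))
```

Passing h as a parameter keeps the sketch usable once the actual definition of h and suitable thresholds are known.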
[26] Another variance function 46 includes:
$\sigma_{UV} = f_2(U, V) = h(\min_W g(\mathrm{stdev}_{UW}, \mathrm{stdev}_{VW}))$,
where $W \in E$ (E is the ensemble of genes) and $g(\mathrm{stdev}_{UW}, \mathrm{stdev}_{VW})$ is an appropriate function that has its minimum when both $\mathrm{stdev}_{UW}$ and $\mathrm{stdev}_{VW}$ are small. Note that to optimize memory usage, each standard deviation $\mathrm{stdev}_{UV}$ is stored as an array $\{\sum r_{UV}, \sum r_{UV}^2, m\}$, where $\mathrm{stdev}_{UV}$ is defined as:
$\mathrm{stdev}_{UV} =$ [Equation image imgf000011_0001; not reproduced in this text]
See United States Patent No. 5,832,182 for more information on this memory-saving standard deviation technique.
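The idea of keeping only a running sum, a running sum of squares, and a count can be illustrated with the sketch below. The exact formula is not reproduced in this text, so the sample (n - 1) form is an assumption, the class name is hypothetical, and the code makes no attempt to reproduce the technique of the cited patent.

```python
import math


class RunningStdev:
    """Accumulates (sum, sum of squares, count) instead of storing every value."""

    def __init__(self):
        self.total = 0.0
        self.total_sq = 0.0
        self.count = 0

    def add(self, r):
        """Fold one similarity index into the running totals."""
        self.total += r
        self.total_sq += r * r
        self.count += 1

    def stdev(self):
        """Sample standard deviation from the stored totals (assumed form)."""
        if self.count < 2:
            return 0.0
        variance = (self.total_sq - self.total ** 2 / self.count) / (self.count - 1)
        # Guard against tiny negative values caused by floating-point rounding.
        return math.sqrt(max(variance, 0.0))


# Hypothetical r_UV values for one gene pair across eight experiments.
acc = RunningStdev()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    acc.add(value)
result = acc.stdev()
```

Storing three numbers per gene pair keeps memory use independent of the number of experiments, at the cost of some numerical accuracy when the values are large relative to their spread.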
[27] The variation indices are stored in the variation matrix 48, clustered in the hierarchical clustering function 50, and displayed in the graphical function 38. The hierarchical clustering function 50 can cluster the variance matrix 48 to provide a tree (dendogram) as described above with reference to the hierarchical clustering functions 44a-n. The trees produced here allows for the clustering of genes that have a highly consistent relationship.
[28] The graphical function 38 displays the trees produced by the hierarchical clustering functions 44a-n and 50. This display of the relationship and relationship intensities between the clusters highlights the gene networks that can be inferred from the data. The user can search for genes simultaneously in different trees, such as by clicking on a node as explained above, and compare the trees. Thus, the user can validate a clustering and search for families of related genes using the graphical display. Also, the graphical function 38 can display a subset of genes as nodes in a network. The links between each pair of genes U and V are weighted using the distance ruv (hierarchical clustering functions 44a-n) or the variance σuv (hierarchical clustering function 50).
[29] A wide range of data structures can be represented using these trees. For example, knowledge-based classification of genes is usually done with a hierarchical structure, i.e., root > "cellular role" > "cell cycle control" > "G1/S-specific cyclin" > "YMR199W/CLN1." Therefore, using the graphical function 38, knowledge-based information can be displayed and compared with the experimental clustered data.
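A knowledge-based classification of this kind can be held in the same tree form as the clustered data. The sketch below uses a nested dictionary as one possible representation. The helper names `add_path` and `find_path` are hypothetical, and the only classification path used is the one quoted above.

```python
def add_path(tree, path):
    """Insert one root-to-leaf classification path into a nested-dict tree."""
    node = tree
    for label in path:
        # Create the child node if this label has not been seen yet.
        node = node.setdefault(label, {})
    return tree

def find_path(tree, target, prefix=()):
    """Return the tuple of labels leading to target, or None if absent."""
    for label, child in tree.items():
        path = prefix + (label,)
        if label == target:
            return path
        found = find_path(child, target, path)
        if found is not None:
            return found
    return None

knowledge = {}
add_path(knowledge, ["root", "cellular role", "cell cycle control",
                     "G1/S-specific cyclin", "YMR199W/CLN1"])
cln1_path = find_path(knowledge, "YMR199W/CLN1")
```

Looking up a gene returns its full classification path, which could then be set beside the position of the same gene in an experimentally derived tree.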
[30] A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.
The invention is not limited to analysis of gene data and can be applied to other types of data.

Claims

1. A method comprising: calculating a distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene; and forming a cluster tree from the calculated distance values.
2. The method of claim 1 further comprising displaying the cluster tree on a display device.
3. The method of claim 1 further comprising normalizing the series of values and using the normalized series of values in calculating the distance between every pair of values in the series of values.
4. The method of claim 1 further comprising repeating the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.
5. The method of claim 4 further comprising calculating a variance value for each pair of the calculated distance values and forming a cluster tree from the variance values.
6. The method of claim 5 further comprising displaying the cluster tree on a display device.
7. The method of claim 1 in which the calculating includes a power spectrum analysis.
8. An article comprising a computer-readable medium which stores computer-executable instructions, the instructions causing a computer to: calculate a distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene; and form a cluster tree from the calculated distance values.
9. The article of claim 8 further causing a computer to display the cluster tree on a display device.
10. The article of claim 8 further causing a computer to normalize the series of values and to use the normalized series of values in calculating the distance between every pair of values in the series of values.
11. The article of claim 8 further causing a computer to repeat the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.
12. The article of claim 11 further causing a computer to calculate a variance value for each pair of the calculated distance values and to form a cluster tree from the variance values.
13. The article of claim 12 further causing a computer to display the cluster tree on a display device.
14. The article of claim 1 in which the calculating includes a power spectrum analysis.
15. A system comprising: a device configured to receive data including expression levels for genes; and a driver configured to calculate the distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene and to form a cluster tree from the calculated distance values.
16. The system of claim 15 in which the driver is also configured to allow graphical comparison of the clusters on a display device.
17. The system of claim 15 further comprising a display device configured to display the cluster tree.
18. The system of claim 15 in which the driver is also configured to normalize the series of values and to use the normalized series of values in calculating the distance between every pair of values in the series of values.
19. The system of claim 15 in which the driver is also configured to repeat the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.
20. The system of claim 19 in which the driver is also configured to calculate a variance value for each pair of the calculated distance values and form a cluster tree from the variance values.
21. The system of claim 20 in which the driver is also configured to display the cluster tree on a display device.
22. The system of claim 15 in which the calculating includes a power spectrum analysis.
23. A method comprising: normalizing every value in an array of ordered sets of vectors; computing for each of the normalized values a similarity index that indicates a similarity between each of the vectors in the array; storing each of the similarity indices in a similarity matrix; hierarchically clustering each similarity matrix; performing a variance function on each of the similarity matrices to compute variance indices indicating any repetitive behavior in the similarity indices; storing the variance indices in a variance matrix; and hierarchically clustering the variance matrix.
24. The method of claim 23 further comprising displaying the clusters on a display device.
25. The method of claim 24 further comprising selecting one node on a cluster on the display device and highlighting nodes related to that one node on other clusters on the display device.
26. The method of claim 23 in which each vector includes expression levels for a gene.
27. A computer-operated method of analysing data comprising inputting data relating to genes and forming a plurality of vectors based on said gene data, calculating a measure of the similarity between pairs of vectors to derive a plurality of similarity measures, and using said similarity measures to form a hierarchical diagram.
28. A computer program for causing a computer to perform the method as claimed in any one of claims 1 to 7, 23-26 or 27.
29. A computer-readable storage medium storing a computer program as claimed in claim 28.
30. A system comprising means for inputting gene data, means for processing said data according to a method as claimed in any one of claims 1 to 7, 23-26 or 27, and means for displaying said hierarchical diagram.
PCT/IB2001/000625 2000-03-28 2001-03-28 Clustering and examining large data sets WO2001073602A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU44478/01A AU4447801A (en) 2000-03-28 2001-03-28 Clustering and examining large data sets

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US19298200P 2000-03-28 2000-03-28
US60/192,982 2000-03-28

Publications (2)

Publication Number Publication Date
WO2001073602A2 (en) 2001-10-04
WO2001073602A3 (en) 2003-03-13

Family

ID=22711823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2001/000625 WO2001073602A2 (en) 2000-03-28 2001-03-28 Clustering and examining large data sets

Country Status (2)

Country Link
AU (1) AU4447801A (en)
WO (1) WO2001073602A2 (en)

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"DESCRIPTION OF PIROUETTE ALGORITHMS" CHEMOMETRICS TECHNICAL NOTE, XX, XX, 1993, pages 1-4, XP002927703 *
ALON U ET AL: "BROAD PATTERNS OF GENE EXPRESSION REVEALED BY CLUSTERING ANALYSIS OF TUMOR AND NORMAL COLON TISSUES PROBED BY OLIGONUCLEOTIDE ARRAYS" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 96, 1999, pages 6745-6750, XP000900484 ISSN: 0027-8424 *
EISEN M B ET AL: "Cluster analysis and display of genome-wide expression patterns" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 95, December 1998 (1998-12), pages 14863-14868, XP002140966 ISSN: 0027-8424 *
MICHAELS G S ET AL: "CLUSTER ANALYSIS AND DATA VISUALIZATION OF LARGE-SCALE GENE EXPRESSION DATA" PROCEEDINGS OF THE PACIFIC SYMPOSIUM ON BIOCOMPUTING, XX, XX, 1997, pages 42-53, XP000974575 *
MICHAUD P: "Clustering techniques" FUTURE GENERATIONS COMPUTER SYSTEMS, ELSEVIER SCIENCE PUBLISHERS. AMSTERDAM, NL, vol. 13, no. 2-3, 1 November 1997 (1997-11-01), pages 135-147, XP004099490 ISSN: 0167-739X *
RALF-HERWIG ET AL: "Large-Scale Clustering of cDNA-Fingerprinting Data" GENOME RESEARCH, COLD SPRING HARBOR LABORATORY PRESS, US, vol. 9, November 1999 (1999-11), pages 1093-1105, XP002176537 ISSN: 1088-9051 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1331078C (en) * 2003-09-30 2007-08-08 联想(北京)有限公司 Method and system for obtaining clustering distance
US8396872B2 (en) 2010-05-14 2013-03-12 National Research Council Of Canada Order-preserving clustering data analysis system and method
WO2014120380A1 (en) * 2013-02-04 2014-08-07 Olsen David Allen System and method for grouping segments of data sequences into clusters
WO2016193075A1 (en) * 2015-06-02 2016-12-08 Koninklijke Philips N.V. Methods, systems and apparatus for subpopulation detection from biological data
CN106991193A (en) * 2017-04-26 2017-07-28 努比亚技术有限公司 Obtain the method and terminal, computer-readable recording medium of article similarity
CN106991193B (en) * 2017-04-26 2020-03-13 努比亚技术有限公司 Method and terminal for acquiring similarity of articles and computer readable storage medium
CN110046297A (en) * 2019-03-28 2019-07-23 广州视源电子科技股份有限公司 Operation and maintenance violation identification method and device and storage medium

Also Published As

Publication number Publication date
WO2001073602A3 (en) 2003-03-13
AU4447801A (en) 2001-10-08

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP