WO2001073602A2 - Clustering and examining large data sets

Info

Publication number: WO2001073602A2
Application number: PCT/IB2001/000625
Authority: WO
Inventors: Gregoire Thomas; Nikos Berdenis; Christophe Van Huffel
Original assignee: Starlab Nv/Sa
Priority date: 2000-03-28
Filing date: 2001-03-28
Publication date: 2001-10-04
Also published as: WO2001073602A3; AU4447801A

Abstract

In one aspect, the invention features methods for analyzing large amounts of gene expression data. The methods of the invention can be used, for example, to identify relationships between various genes.

Description

CLUSTERING AND EXAMINING LARGE DATA SETS

CROSS-REFERENCE TO RELATED APPLICATIONS

[1] This application claims priority under 35 USC §119 to United States provisional application 60/192,982, "CLUSTERING AND EXAMINING LARGE DATA SETS," filed March 28, 2000.

BACKGROUND

[2] This invention relates to data processing, and more particularly to clustering and examining a large amount of data.

[3] Examining data may uncover relationships among some or all of the data points included in the data. However, it can be difficult to detect relationships when the data includes a large number of data points. Clustering the data can help detect relationships by identifying the proximity of data points. Given a large amount of data, the data points may range over a wide data space, but certain areas of the data space may be more densely populated than other areas. Clustering helps identify the density of data points in the data space, usually producing a graphical representation of the set of data in tree (dendrogram) form.

[4] One method of clustering includes hierarchical clustering. Hierarchical clustering may be divisive, beginning with all of the data points and repeatedly breaking data points off into smaller and smaller clusters, or it may be agglomerative, starting with single data points and merging them with other data points to form larger and larger clusters. In each case, hierarchical clustering can produce a tree showing the proximal similarity between all of the data points.

SUMMARY

[5] In one aspect, the invention features methods for analyzing large amounts of gene expression data. The methods of the invention can be used, for example, to identify relationships between various genes.

[6] In recent years, a very large number of genes have been partially or completely sequenced. Using this information base, scientists are beginning to understand the role of individual genes and the relationships between genes in normal and disease processes. High throughput gene expression analysis, using, for example, GeneChip® Probe Arrays (Affymetrix, Inc., Santa Clara, CA), can be used to measure the expression of hundreds or thousands of genes simultaneously. As a result, scientists can now attempt to build a complete picture of changes in gene expression over time. For example, it is now possible to measure the expression of thousands of genes in a colon cancer cell over time after exposure to a therapeutic agent. The data generated by such an experiment could be used simply to identify up-regulated and down-regulated genes. However, it would also be valuable to identify relationships between genes based on their response over time to the therapeutic agent. The methods of the invention provide analytical tools that can be used to identify relationships between genes in such experiments.

[7] An array of vectors representing the expression of genes at multiple time points for a plurality of genes and for different experiments can be processed in such a way as to allow a clustering of the genes based on their relationships and on their relationship intensities. A graphical display of the clusters highlights the gene networks that can be identified and/or inferred from the array of values.

[8] In one method of the invention, each vector in the array includes a series of values corresponding to the expression of one gene over time under a given set of conditions. For each vector, a series of features is derived. The nature of this transformation depends on the experiment represented in that particular vector. Generally, this transformation includes normalizing the data. Then, for each vector, a similarity function calculates a similarity index: the distance between each pair of genes using the feature series of each gene. The similarity function indicates a quantitative measure of the similarity between gene-vector pairs. Each similarity index is stored in a similarity matrix and hierarchically clustered. The product of the clustering is a tree (dendrogram) for each experiment having individual genes as terminal nodes; the distance between two genes in the tree is related to their similarity in expression. The tree is a complete list of the genes, structured on the basis of their expression profiles. Each similarity matrix also is processed using a variance function to provide a quantitative measure of any repetitive behavior between gene-vector pairs throughout the experiment. The variance function also indicates the consistency (reproducibility) of the gene-to-gene relationships throughout the experiment. The genes can be hierarchically clustered using these measures of relationship consistency to produce a tree.

[9] The graphical display of the trees provides a tool for searching for and comparing the genes and experiments, particularly on a multiple-window graphical display. Upon selection of a node in a tree, the genes that are children of this node are selected in the other displayed trees. This selection allows the comparison of several clusterings and can be used to validate a clustering and search for families of related genes. The tool also allows for displaying a subset of genes as nodes of a network. The links between each pair of genes are weighted using the distance (similarity) or the variance.

DESCRIPTION OF DRAWINGS

[10] FIG. 1 is a block diagram of a computer system.

[11] FIG. 2 is a diagram showing a method of extracting relationship networks.

[12] FIGS. 3 and 4 are diagrams of cluster trees.

DESCRIPTION

[13] Referring to FIG. 1, a computer system 10 includes a driver 12 configured to process data, cluster the processed data in one or more clusters, display the cluster(s), and allow graphical comparison of the clusters. The driver 12 may instead be included in a processor 14 or in a graphics device 16. The data that the driver 12 processes can include an array of values representing, e.g., gene expression levels at multiple time points for a range of genes under a number of treatments and experiments. The array of values can include hundreds or more data points, so the driver 12 is configured to rapidly process and cluster a large amount of data. The clusters, typically structured as trees (dendrograms), may be displayed on a display unit 26 alone or in groups. This graphical display of the data allows a user (not shown) of the system 10 to view the relationships and relationship intensities within the clusters. For example, the displayed data may be manipulated using a mouse 28 by selecting a node of a tree included in one cluster and thereby selecting the genes that are the children of this node in other displayed trees. In this way, gene-to-gene relationships that are persistent through the treatments and experiments, i.e., gene networks, can be identified. By extracting fundamental gene networks, new solutions/information can be derived for gene identification, functional annotation of genes, and diagnostic or drug therapy.

[14] The data that the driver 12 processes can be provided by a device included in the system 10, such as a memory 18, or the data can enter the system through one or more input/output (I/O) units 20, such as a keyboard 22 or a data unit 24. The data unit 24 could include another computer system, a data generator, a data storage device, or any other mechanism capable of providing data to the system 10. If the data enters the system 10 via one or more of the I/O units 20, the data may be temporarily stored in the memory 18 or in local memory (not shown) before being used by the driver 12. The display unit 26, such as a computer monitor or a television, is configured to display the data clustered by the driver 12.

[15] Referring to FIG. 2, a process 30 includes processing data sets 32a-n using a similarity index function 34, a relationship index function 36, and a graphical function 38. Each data set 32a-n corresponds to one experiment and includes an array of ordered sets of m values (vectors $X_u$), where $X_u = (x_{u1}, x_{u2}, \ldots, x_{um})$, each of which represents the expression levels for one gene (U) at a series of m time points. Although the process 30 is described with reference to processing gene-related data, the process 30 can accommodate any type of numerical data. The similarity index function 34 provides a quantitative measure of the similarity between gene-vector pairs using, for each data set 32a-n, a similarity function (f1) 40a-n, a similarity matrix (r) 42a-n, and a hierarchical clustering function (Y) 44a-n. The relationship index function 36 provides a quantitative measure of any repetitive behavior between gene-vector pairs throughout a set of experiments using a variance function (f2) 46, a variance matrix (σ) 48, and a hierarchical clustering function (Y) 50. The graphical function 38 displays the results of the hierarchical clustering functions 44a-n and 50 and allows the comparison of the results of multiple hierarchical clustering functions 44a-n and 50.

[16] More specifically, each similarity function 40a-n can calculate gene similarity indices for its corresponding data set 32a-n and store the gene similarity indices in the corresponding matrix r 42a-n. The similarity indices are real positive numbers indicating a quantitative measure of the similarity between the genes being compared, here a pair of genes. A similarity index of zero indicates the highest degree of similarity. In calculating the similarity indices, the similarity index function 34 may first transform (normalize or feature extract) the values included in the data sets 32a-n in one of at least five ways (where i runs from one to m in each case). The transforming method used may depend on the type of data included in the data sets 32a-n. For example, in the analysis of yeast cell cycle data, the similarity index function 34 typically uses a method based on the spectrum analysis of the expression kinetics.

[17] In a first method:

The first method transforms each value $x_{ui}$ to the amount that it deviates from the average of all values in the vector $X_u$.

[18] In a second method:

$x'_{ui} = (x_{ui} - \mathrm{mean}(X_u)) / \mathrm{stdev}(X_u)$

where the standard deviation algorithm (s) typically is

The second method transforms each value $x_{ui}$ to an amount that approximates its deviation from the average of all values in the vector $X_u$, where an amount of one indicates no deviation.

[19] In a third method:

$x'_{ui} = \log(x_{ui})$

The third method transforms each value $x_{ui}$ to an amount that equals its logarithm.

[20] In a fourth method:

$x'_{ui} = \log(x_{ui}) / \sum_j \log(x_{uj})$

The fourth method transforms each value $x_{ui}$ to an amount that approximates its deviation from the average of all log values in the vector $X_u$.

[21] In a fifth method, $x'_{ui}$ equals the power spectrum derived from $X_u$ using a fast Fourier transform algorithm or a maximum entropy algorithm. The power spectrum indicates how strongly each frequency is represented in the vector $X_u$.
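The five transformations can be sketched in Python as follows. This is an illustrative sketch only, under several assumptions: the first-method equation is not reproduced in this text, so `center` simply subtracts the mean as the description suggests; `zscore` assumes the sample standard deviation; the logarithms are assumed to be natural logarithms of positive values; and `power_spectrum` uses a naive discrete Fourier transform in place of a fast Fourier transform (the maximum entropy alternative is not shown). All function names are hypothetical.

```python
import cmath
import math
from statistics import mean, stdev


def center(x):
    """First method (assumed form): deviation of each value from the vector mean."""
    mu = mean(x)
    return [v - mu for v in x]


def zscore(x):
    """Second method: (x - mean) / standard deviation (sample form assumed)."""
    mu, s = mean(x), stdev(x)
    return [(v - mu) / s for v in x]


def log_transform(x):
    """Third method: logarithm of each value (natural log assumed)."""
    return [math.log(v) for v in x]


def log_fraction(x):
    """Fourth method: log(x_i) divided by the sum of all log values."""
    logs = [math.log(v) for v in x]
    total = sum(logs)
    return [v / total for v in logs]


def power_spectrum(x):
    """Fifth method: squared magnitude of each DFT coefficient (naive O(m^2) DFT)."""
    m = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / m) for t in range(m))) ** 2
        for f in range(m)
    ]
```

A constant vector, for example, puts all of its power in the zero-frequency term of `power_spectrum`, whereas a periodic expression profile would show a peak at the corresponding non-zero frequency.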

[22] One way of calculating the similarity indices and storing the similarity indices in the similarity matrices 42a-n includes:

$r_{uv} = f_1(U, V) = \sum_i (y_{ui} - y_{vi})^2$

where U and V represent the genes being compared and i runs from one to m, the size of the gene vectors. Another way to calculate the similarity indices and store the similarity indices in the similarity matrices 42a-n includes:

$r_{uv} = f_1(U, V) = \min_l \sum_i (y_{ui} - y_{v,i+l})^2$

where U, V, and i are as described for the first method, and l ranges from one to m (inclusive). Both of these formulas calculate each entry in the matrix r 42a-n from the difference between a value in the gene U and a value in the gene V, squaring the difference to achieve a positive result. The first formula compares two values from the same time point, e.g., the third measurement of gene U is compared with the third measurement of gene V. The second compares two adjacent values, e.g., the third measurement of gene U is compared with the fourth measurement of gene V. The second formula also adds a min function that returns the smallest values in U and V.
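The two similarity calculations can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the vectors are assumed to be already transformed and of equal length m, and the shifted index in the second formula is assumed to wrap around (circular shift), since the text does not say how indices beyond m are handled.

```python
def squared_distance(yu, yv):
    """First formula: sum of squared differences at matching time points."""
    return sum((a - b) ** 2 for a, b in zip(yu, yv))


def min_shifted_distance(yu, yv):
    """Second formula: smallest sum of squared differences over shifts l = 1..m.

    Assumption: the shifted index wraps around, so l = m is the unshifted comparison.
    """
    m = len(yu)
    return min(
        sum((yu[i] - yv[(i + l) % m]) ** 2 for i in range(m))
        for l in range(1, m + 1)
    )


def similarity_matrix(vectors, dist=squared_distance):
    """Build r as a dict mapping each ordered gene pair (u, v) to its similarity index."""
    genes = sorted(vectors)
    return {(u, v): dist(vectors[u], vectors[v]) for u in genes for v in genes}
```

For instance, `similarity_matrix({"a": [0, 0], "b": [3, 4]})` gives an index of 25 for the pair `("a", "b")` and zero on the diagonal, consistent with zero indicating the highest degree of similarity.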

[23] A hierarchical clustering function 44a-n can cluster each matrix r 42a-n, providing a tree (dendrogram) of which the terminal nodes are individual genes and the distance between two genes in the tree is related to their similarity in expression. The hierarchical clustering functions 44a-n can include single linkage hierarchical clustering. Thus, for each array (experiment), the tree is a complete list of the genes, structured on the basis of the expression profiles of the genes. The success of the clustering depends on the relevance of the indices. Both the list of gene-to-gene distances (matrix r 42a-n) and the tree are considered fingerprints of the experiment. Examples of trees are shown in FIGS. 3 and 4. The graphical function 38 can display these trees as described further below.
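Single linkage hierarchical clustering of a similarity matrix can be sketched as below. This is a naive illustrative version, assuming the matrix is given as a dictionary keyed by ordered gene pairs and that a nested tuple is an acceptable stand-in for the tree; the function name and data layout are hypothetical, and a production implementation would use a more efficient algorithm.

```python
from itertools import combinations


def single_linkage(genes, r):
    """Agglomerative single-linkage clustering.

    genes: list of gene names.
    r: dict mapping every ordered pair (u, v) to its distance.
    Returns (tree, heights): a nested-tuple tree whose leaves are gene names,
    and the distance at which each merge happened, in merge order.
    """
    # Each active cluster is a pair (subtree, list of leaf genes).
    clusters = [(g, [g]) for g in genes]
    heights = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the two closest members.
            d = min(r[(u, v)] for u in clusters[a][1] for v in clusters[b][1])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        heights.append(d)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters[0][0], heights
```

The merge heights give the vertical positions of the internal nodes when the tree is drawn, so genes joined at a low height are the most similar in expression.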

[24] The relationship index function 36 can compare the fingerprints (here the matrices r 42a-n) by measuring the consistency (reproducibility) of the gene-to-gene relationships throughout the set of experiments. Essentially, the relationship index function 36 provides a quantitative measure of any repetitive behavior between gene-vector pairs throughout a set of experiments. The relationship index function 36 performs the variance function 46 to derive variation indices of positive value less than or equal to one from the experimental data in the matrices r 42a-n. The variance function 46 is based on the variance in the similarity indices throughout a set of k experiments.

[25] One variance function 46 includes:

$\sigma_{uv} = f_2(U, V) = h(\mathrm{stdev}_{uv})$,

where $\mathrm{stdev}_{uv}$ is the standard deviation of the k $r_{uv}$ values and h(x) is defined by:

where t1 and t2 are threshold parameters.

[26] Another variance function 46 includes:

$\sigma_{uv} = f_2(U, V) = h(\min_{W \in E} g(\mathrm{stdev}_{uw}, \mathrm{stdev}_{vw}))$,

where W ∈ E (E is the ensemble of genes) and $g(\mathrm{stdev}_{uw}, \mathrm{stdev}_{vw})$ is an appropriate function that has its minimum when both $\mathrm{stdev}_{uw}$ and $\mathrm{stdev}_{vw}$ are small. Note that to optimize memory usage, each standard deviation $\mathrm{stdev}_{uv}$ is stored as an array $(\sum r_{uv}, \sum r_{uv}^2, m)$, where $\mathrm{stdev}_{uv}$ is defined as:

stdevuv

See United States Patent No. 5,832,182 for more information on this memory-saving standard deviation technique.
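The idea of keeping only running sums for each gene pair, rather than all k similarity values, can be sketched as follows. This is an illustrative sketch and not necessarily the technique of the cited patent: the class and function names are hypothetical, the population form of the standard deviation is assumed because the defining equation is not reproduced in this text, and the threshold function h is left out for the same reason.

```python
import math


class RunningStdev:
    """Accumulates (sum of r, sum of r^2, count) for one gene pair."""

    def __init__(self):
        self.total = 0.0
        self.total_sq = 0.0
        self.n = 0

    def add(self, r):
        self.total += r
        self.total_sq += r * r
        self.n += 1

    def stdev(self):
        # Population form assumed: sqrt(E[r^2] - E[r]^2).
        # The max() guards against tiny negative values caused by rounding.
        mean = self.total / self.n
        return math.sqrt(max(self.total_sq / self.n - mean * mean, 0.0))


def pairwise_stdev(matrices):
    """Standard deviation of r_uv across experiments, one accumulator per pair.

    matrices: iterable of dicts, one per experiment, mapping (u, v) -> r_uv.
    """
    acc = {}
    for r in matrices:
        for pair, value in r.items():
            acc.setdefault(pair, RunningStdev()).add(value)
    return {pair: a.stdev() for pair, a in acc.items()}
```

Because each experiment only updates three numbers per pair, the matrices can be processed one at a time and discarded, which is the memory saving the paragraph above refers to.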

[27] The variation indices are stored in the variance matrix 48, clustered in the hierarchical clustering function 50, and displayed in the graphical function 38. The hierarchical clustering function 50 can cluster the variance matrix 48 to provide a tree (dendrogram) as described above with reference to the hierarchical clustering functions 44a-n. The tree produced here allows for the clustering of genes that have a highly consistent relationship.

[28] The graphical function 38 displays the trees produced by the hierarchical clustering functions 44a-n and 50. This display of the relationship and relationship intensities between the clusters highlights the gene networks that can be inferred from the data. The user can search for genes simultaneously in different trees, such as by clicking on a node as explained above, and compare the trees. Thus, the user can validate a clustering and search for families of related genes using the graphical display. Also, the graphical function 38 can display a subset of genes as nodes in a network. The links between each pair of genes U and V are weighted using the distance $r_{uv}$ (hierarchical clustering functions 44a-n) or the variance $\sigma_{uv}$ (hierarchical clustering function 50).
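The selection and network behavior described above can be sketched without any graphics code. The sketch below assumes trees are nested tuples whose leaves are gene-name strings and that link weights are held in a dictionary keyed by gene pair; these representations and all function names are hypothetical.

```python
def leaves(node):
    """Return the genes under a node of a nested-tuple tree (a leaf is a string)."""
    if isinstance(node, str):
        return [node]
    return [g for child in node for g in leaves(child)]


def select_in_other_trees(node, other_trees):
    """For each other tree, list its leaves that are children of the selected node."""
    selected = set(leaves(node))
    return [[g for g in leaves(t) if g in selected] for t in other_trees]


def network_edges(genes, weights):
    """Weighted links for a subset of genes.

    weights maps (u, v) to either the distance r_uv or the variance sigma_uv.
    """
    return [
        (u, v, weights[(u, v)])
        for i, u in enumerate(genes)
        for v in genes[i + 1:]
    ]
```

If the genes selected under one node also appear close together in the other trees, that supports the clustering; if they are scattered, the grouping may be specific to one experiment.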

[29] A wide range of data structures can be represented using these trees. For example, knowledge-based classification of genes is usually done with a hierarchical structure, e.g., root > "cellular role" > "cell cycle control" > "G1/S-specific cyclin" > "YMR199W/CLN1." Therefore, using the graphical function 38, knowledge-based information can be displayed and compared with the experimental clustered data.

[30] A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

The invention is not limited to analysis of gene data and can be applied to other types of data.

Claims

CLAIMS:

1. A method comprising: calculating a distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene; and forming a cluster tree from the calculated distance values.

2. The method of claim 1 further comprising displaying the cluster tree on a display device.

3. The method of claim 1 further comprising normalizing the series of values and using the normalized series of values in calculating the distance between every pair of values in the series of values.

4. The method of claim 1 further comprising repeating the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.

5. The method of claim 4 further comprising calculating a variance value for each pair of the calculated distance values and forming a cluster tree from the variance values.

6. The method of claim 5 further comprising displaying the cluster tree on a display device.

7. The method of claim 1 in which the calculating includes a power spectrum analysis.

8. An article comprising a computer-readable medium which stores computer-executable instructions, the instructions causing a computer to: calculate a distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene; and form a cluster tree from the calculated distance values.

9. The article of claim 8 further causing a computer to display the cluster tree on a display device.

10. The article of claim 8 further causing a computer to normalize the series of values and to use the normalized series of values in calculating the distance between every pair of values in the series of values.

11. The article of claim 8 further causing a computer to repeat the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.

12. The article of claim 11 further causing a computer to calculate a variance value for each pair of the calculated distance values and to form a cluster tree from the variance values.

13. The article of claim 12 further causing a computer to display the cluster tree on a display device.

14. The article of claim 8 in which the calculating includes a power spectrum analysis.

15. A system comprising: a device configured to receive data including expression levels for genes; and a driver configured to calculate the distance between every pair of values in a series of values, each value including an array of ordered sets of vectors, each vector including the expression level for a gene and to form a cluster tree from the calculated distance values.

16. The system of claim 15 in which the driver is also configured to allow graphical comparison of the clusters on a display device.

17. The system of claim 15 further comprising a display device configured to display the cluster tree.

18. The system of claim 15 in which the driver is also configured to normalize the series of values and to use the normalized series of values in calculating the distance between every pair of values in the series of values.

19. The system of claim 15 in which the driver is also configured to repeat the calculating and forming for one or more other series of values, each vector in the other series of values including at least some of the genes included in the series of values.

20. The system of claim 19 in which the driver is also configured to calculate a variance value for each pair of the calculated distance values and form a cluster tree from the variance values.

21. The system of claim 20 in which the driver is also configured to display the cluster tree on a display device.

22. The system of claim 15 in which the calculating includes a power spectrum analysis.

23. A method comprising: normalizing every value in an array of ordered sets of vectors; computing for each of the normalized values a similarity index that indicates a similarity between each of the vectors in the array; storing each of the similarity indices in a similarity matrix; hierarchically clustering each similarity matrix; performing a variance function on each of the similarity matrices to compute variance indices indicating any repetitive behavior in the similarity indices; storing the variance indices in a variance matrix; and hierarchically clustering the variance matrix.

24. The method of claim 23 further comprising displaying the clusters on a display device.

25. The method of claim 24 further comprising selecting one node on a cluster on the display device and highlighting nodes related to that one node on other clusters on the display device.

26. The method of claim 23 in which each vector includes expression levels for a gene.

27. A computer-operated method of analysing data comprising inputting data relating to genes and forming a plurality of vectors based on said gene data, calculating a measure of the similarity between pairs of vectors to derive a plurality of similarity measures, and using said similarity measures to form a hierarchical diagram.

28. A computer program for causing a computer to perform the method as claimed in any one of claims 1 to 7, 23-26 or 27.

29. A computer-readable storage medium storing a computer program as claimed in claim 28.

30. A system comprising means for inputting gene data, means for processing said data according to a method as claimed in any one of claims 1 to 7, 23-26 or 27, and means for displaying said hierarchical diagram.