WO2010086782A2 - Methods for the subclassification of breast tumours

Info

Publication number: WO2010086782A2
Application number: PCT/IB2010/050316
Authority: WO
Inventors: Sitharthan Kamalakaran; Angel Janevski; James B. Hicks
Original assignee: Koninklijke Philips Electronics N.V.; Cold Spring Harbor Laboratory
Priority date: 2009-01-30
Filing date: 2010-01-25
Publication date: 2010-08-05
Also published as: WO2010086782A8; EP2955235A3; EP2391735A2; RU2011135955A; KR20110113642A; WO2010086782A3; EP2955235A2; JP2012517215A; CN102549165A; US20120004118A1; BRPI1005306A2

Abstract

Provided is a method for the analysis of breast cancer disorders, comprising determining the genomic methylation status of one or more CpG dinucleotides. Furthermore, a computer program product stored on a computer-readable medium comprising software code adapted to perform the steps of the method when executed on a data-processing apparatus is provided. A device comprising means for supporting a clinician is also provided.

Description

METHODS FOR THE SUBCLASSIFICATION OF BREAST TUMOURS

FIELD OF THE INVENTION

This invention pertains in general to the field of biology and bioinformatics. More particularly the invention relates to the field of categorization of cancer tumours and even more particularly to identifying methylated sites, which may aid in categorization of cancer tumours.

BACKGROUND OF THE INVENTION

Worldwide, breast cancer is the fifth most common cause of cancer death, after lung cancer, stomach cancer, liver cancer, and colon cancer. Among women, breast cancer is the most common cancer and the most common cause of cancer death. Breast cancer is diagnosed by the pathological examination of surgically removed breast tissue. Following diagnosis, it is important to analyze the tumour type in order to aid clinicians when choosing the right therapy. Within the art, such analysis is performed according to two categories.

The first category involves the use of immuno-histopathological variables, such as tumour size, ER/PR status, lymph node negativity, etc. to define a clinical prognostic index such as the Nottingham Prognostic Index (NPI). The problem with such an index is that it has been shown to be very conservative, thus typically causing patients to receive aggressive therapy even when they are at low risk of disease recurrence.

The second category involves measuring the expression levels of a large number of genes, typically around 500, and calculating the probability of a subtype based on the relative expression levels of the genes. This method is very costly in terms of tissue handling requirements. It is also hard to perform in a clinical setting, due to the demand for laboratory equipment.

DNA methylation, a type of chemical modification of DNA that can be inherited and subsequently removed without changing the original DNA sequence, is the most well studied epigenetic mechanism of gene regulation. There are areas in DNA, called CpG islands, where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases.

CpG islands are generally heavily methylated in normal cells. However, during tumorigenesis, hypomethylation occurs at these islands, which may result in the expression of certain repeats. These hypomethylation events also correlate to the severity of some cancers. Under certain circumstances, which may occur in pathologies such as cancer, imprinting, development, tissue specificity, or X chromosome inactivation, gene associated islands may be heavily methylated. Specifically, in cancer, methylation of islands proximal to tumour suppressors is a frequent event, often occurring when the second allele is lost by deletion (Loss of Heterozygosity, LOH). Some tumour suppressors commonly seen with methylated islands are p16, Rassf1a, and BRCA1.

There are reported epigenetic markers for colorectal and prostate cancer. For example, Epigenomics AG (Berlin, Germany) has Septin 9 as a marker for colorectal cancer screening in blood plasma. A method for using methylation sites to predict differential therapy responses in cancer and recommending an appropriate therapy has been disclosed in US20050021240A1. However, the results predicted by this method are limited, since they cannot be directly applied in clinical practice. Therefore, it would be advantageous to have a method for the analysis of breast cancer disorders which is time efficient, reliable and cost-effective.

SUMMARY OF THE INVENTION

Accordingly, the present invention preferably seeks to mitigate, alleviate or eliminate one or more of the above-identified deficiencies in the art and disadvantages singly or in any combination and solves at least the above mentioned problems by providing a method for the analysis of breast cancer disorders according to the appended patent claims.

According to an aspect a method for analysis of breast cancer disorders is disclosed. The method comprises determining the genomic methylation status of one or more CpG dinucleotides in a sequence selected from the group of sequences consisting of SEQ ID NO. 1 to SEQ ID NO. 600. The method provides for improved abilities to characterize cancer tumours using methylation patterns. The regions of interest of the sequences SEQ ID NO. 1 to 600 are designated in table 1 (as "start" and "end" on respective "chromosome").

This aspect presents improvements over the state of the art in that it enables a highly specific classification of breast cell proliferative disorders. In an aspect a computer program product is disclosed. The computer program product is stored on a computer-readable medium comprising software code adapted to perform the steps of the method according to an aspect when executed on a data-processing apparatus.

In an aspect a device is disclosed. The device comprises means adapted to carry out methods according to some embodiments. An advantage of such a device is that it supports a clinician.

Herein, the sequences claimed also encompass the sequences, which are reverse complement to the sequences designated.
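As an illustration of the reverse-complement relationship mentioned above, the following minimal sketch computes the reverse complement of a DNA string. The function name and the toy sequence are hypothetical and are not taken from the sequence listing.

```python
# Hypothetical helper; the name and the toy input are illustrative only.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACGTTG"))  # CAACGT
```

Note that the dinucleotide "CG" is its own reverse complement, so a CpG site on one strand corresponds to a CpG site on the opposite strand.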

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects, features and advantages of which the invention is capable will be apparent and elucidated from the following description of embodiments of the present invention, reference being made to the accompanying drawings, in which:

Fig. 1 is a schematic illustration of a method according to some embodiments;

Fig. 2 is a schematic illustration of a dataset 20 of five measurements 1 to 5;

Fig. 3 is a schematic illustration of a first subset 30 of five measurements 1 to 5;

Fig. 4 is a schematic illustration of a second subset 40 of five measurements 1 to 5;

Fig. 5 is an illustration of clusters 51, 52, 53, where Fig. 5A is a first cluster 51, Fig. 5B is a second cluster 52 and Fig. 5C is a third cluster 53;

Fig. 6 is a schematic illustration of a computer program product according to an embodiment; and

Fig. 7 is a schematic illustration of a device according to an embodiment.

DETAILED DESCRIPTION OF EMBODIMENTS

Several embodiments of the present invention will be described in more detail below with reference to the accompanying drawings in order for those skilled in the art to be able to carry out the invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. The embodiments do not limit the invention, but the invention is only limited by the appended patent claims. Furthermore, the terminology used in the detailed description of the particular embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention.

An idea according to some embodiments is a method using a small selection of DNA sequences to analyze breast cancer disorders. The analysis is done by determining the genomic methylation status of one or more CpG dinucleotides, in either a sequence disclosed herein or its reverse complement. It was surprisingly found that some DNA sequences, SEQ ID NO: 1 to SEQ ID NO: 600, act as epigenetic markers that may be used to analyze breast cancer by subtyping tumours. In the prior art, it is possible to subtype breast cancer based on gene expression. Five different subtypes have been reported: luminal A, luminal B, basal, ERBB2 overexpressing, and normal-like. The inventors have identified the same subtypes using DNA methylation.

The DNA sequences SEQ ID NO: 1 to SEQ ID NO: 600 were identified by analysing 150 000 individual genomic loci for methylation across a set of 83 breast tumours. The availability of clinical information regarding tumour specimens allowed for an investigation of DNA methylation in the context of breast cancer subtypes, histology and tumour aggressiveness. The five major breast cancer molecular subtypes (luminal A and B, basal, ERBB2 overexpressing, and normal-like) were identified. First, an investigation was performed into whether unsupervised clustering of the tumour set using methylation recapitulates the major luminal and basal classes that were identified by expression analysis. A filtering criterion was used to identify the features to be used in clustering: the top 500 loci that varied most across the 83 tumour samples were selected. Then, the top 100 loci that distinguished tumours from normal tissues were added. These 600 features, displayed in table 1, were used to cluster the 83 tumours for which the expression subtype data was available. Hierarchical clustering with Pearson correlation and complete linkage of the samples based on these six hundred loci gave a dendrogram that is surprisingly similar to the one produced by expression analysis.
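The filtering step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not say how "varied most" or "distinguished tumours from normal tissues" were measured, so population variance and the absolute difference of group means are used here as stand-ins, and the second group of loci is assumed to be drawn from the loci not already selected. All function names and the toy values are hypothetical.

```python
from statistics import mean, pvariance


def most_variable(tumour, k):
    """Return the k locus ids with the largest variance across tumour samples."""
    return sorted(tumour, key=lambda locus: pvariance(tumour[locus]), reverse=True)[:k]


def most_discriminating(tumour, normal, candidates, k):
    """Return the k candidate loci with the largest tumour/normal mean gap.

    Assumption: the absolute difference of means stands in for the
    unspecified statistic used in the text.
    """
    def gap(locus):
        return abs(mean(tumour[locus]) - mean(normal[locus]))

    return sorted(candidates, key=gap, reverse=True)[:k]


def select_features(tumour, normal, n_variable, n_discriminating):
    """Combine both criteria into one feature list (500 + 100 in the text)."""
    chosen = most_variable(tumour, n_variable)
    remaining = [locus for locus in tumour if locus not in chosen]
    return chosen + most_discriminating(tumour, normal, remaining, n_discriminating)


# Toy methylation values (invented): locus id -> one value per sample.
tumour = {"a": [0.1, 0.9, 0.5], "b": [0.5, 0.5, 0.5],
          "c": [0.2, 0.3, 0.2], "d": [0.8, 0.8, 0.9]}
normal = {"a": [0.5, 0.5], "b": [0.5, 0.5], "c": [0.9, 0.8], "d": [0.8, 0.85]}

print(select_features(tumour, normal, n_variable=1, n_discriminating=1))
```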

In an embodiment a method 10 is provided, according to Fig. 1. Said method 10 comprises selecting 100 a feature subset comprising at least one post from the methylation classification list according to SEQ ID NO. 1 to SEQ ID NO. 600.

Selecting 100 a feature subset may be performed based on hierarchical clustering with Pearson correlation and complete linkage to characterize the fitness of each feature subset, given a dataset with a methylation characterization of each sample ($S_i$, $i = 1..M$) in the form of a vector $m_i$ of N values, where $m_{i,j}$ provides the methylation status for the i-th sample and the j-th probe. Typically, some statistical analysis of the measured signal will produce a set of probes (features) to be input to the hierarchical clustering method above.
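Hierarchical clustering with Pearson correlation and complete linkage, as named above, can be sketched in plain Python. This is a naive illustration rather than the implementation used in the study: it assumes the distance between two samples is one minus their Pearson correlation, and that the dendrogram is cut once a requested number of clusters remains. The function names and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError for a constant profile (zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage(profiles, n_clusters):
    """Merge samples bottom-up until n_clusters clusters remain.

    profiles: dict sample id -> list of methylation values (one per probe).
    Returns a list of clusters, each a sorted list of sample ids.
    """
    ids = list(profiles)
    dist = {(a, b): pearson_distance(profiles[a], profiles[b])
            for a in ids for b in ids if a != b}
    clusters = [[s] for s in ids]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy profiles (invented): five samples measured on four probes.
profiles = {
    1: [0.1, 0.2, 0.9, 0.8],
    2: [0.2, 0.3, 0.8, 0.9],
    3: [0.9, 0.8, 0.1, 0.2],
    4: [0.8, 0.9, 0.2, 0.1],
    5: [0.1, 0.1, 0.7, 0.9],
}
print(complete_linkage(profiles, 2))
```

For real data sizes a library routine would normally be preferred over this quadratic search; the sketch only shows the linkage rule.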

The feature subset selection 100 uses a Genetic Algorithm (GA), which repetitively evaluates feature subsets based on a fitness function that characterizes some property of the feature subset. In an embodiment, hierarchical clustering with Pearson correlation and complete linkage is used as the fitness function to assess the quality of a feature subset.
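A genetic algorithm over feature subsets can be sketched as below. The text does not specify the GA operators, so truncation selection, one-point crossover and bit-flip mutation are assumptions chosen for brevity, as are all parameter names and defaults. The fitness function is passed in as a callable; in the setting described above it would score the clustering produced by a subset, whereas the toy fitness here merely counts matches against an arbitrary target mask.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search for a good feature subset, encoded as a tuple of 0/1 flags.

    fitness: callable taking such a tuple and returning a score
    (higher is better). Requires n_features >= 2.
    """
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection (assumed)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover (assumed)
            child = a[:cut] + b[cut:]
            # Bit-flip mutation (assumed): toggle each flag with a small probability.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness (hypothetical): number of flags agreeing with an arbitrary target.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0)


def toy_fitness(mask):
    return sum(1 for bit, want in zip(mask, TARGET) if bit == want)


best = genetic_feature_selection(len(TARGET), toy_fitness)
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best score in the population never decreases from one generation to the next.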

The following example is used to illustrate the principle.

Fig. 2 shows a dataset 20 of measurements, in this case 5 samples, displayed as 1 to 5, which are characterized with 8 features, displayed as letters A to H. Figs. 3 and 4 show two feature subsets, generated from the measurement dataset by selecting rows (features) from the dataset. Fig. 3 shows a first feature subset 30 with the 5 samples, displayed as 1 to 5, but only four of the features. Fig. 4 shows a second feature subset 40 with the 5 samples, displayed as 1 to 5, but only six of the features.

Next, clustering may be performed. Fig. 5 shows clusters, or dendrograms, based on the datasets from Figs. 2 to 4, when subjected to hierarchical clustering with Pearson correlation and complete linkage. Fig. 5A shows a first cluster 51 based on the total dataset 20. Fig. 5B shows a second cluster 52 based on the first feature subset 30 and Fig. 5C shows a third cluster 53 based on the second feature subset 40.

After having clustered the datasets, a ranking of all clustering results is performed. In one embodiment, a cluster analysis method is used for the ranking. For example, it is possible to characterize and rank individual clusters based on their validity, for example in terms of cluster cohesion or separation. This may be done in one of multiple ways well known to a person skilled in the art. Thus, it is possible to rank two or more feature subsets based on the quality of the clusters they generate when used to cluster the samples.

In another embodiment, some property of the samples (e.g. cancer subtype based on pathology) is used for ranking. From this property, the same or related subtypes are grouped together. For example, if the five samples from Figs. 2 to 4 have the subtype labels {1=X, 2=X, 3=Y, 4=Y, 5=X} associated with them, this would produce the following label groupings for the three clusters shown in Fig. 5: A: {XXY, YX}; B: {XY, YXX}; C: {XXX, YY}. In this case, the second feature subset 40, represented by Fig. 5C, is clearly better than the first feature subset 30 or the clustering based on the entire dataset 20, since it correctly clusters the subtypes together.

In an embodiment, two clustering outputs $D_1$ and $D_2$ are compared based on the clusters. First, N clusters ($C_1, C_2, \ldots, C_N$) are obtained based on the dendrogram produced by the clustering. Then, a property is computed based on the clusters, such as the popular silhouette width, $\mathrm{SIL}(C_i)$. Now a single-number characterization of a clustering is obtained by the formula:

$$\mathrm{AVGSIL}(D) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{SIL}(C_i)$$
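The AVGSIL characterization can be sketched as follows. As an assumption, the silhouette width of a cluster is taken to be the mean silhouette value of its member samples, with singleton clusters scored as 0. The names are hypothetical, and the toy example uses absolute differences of one-dimensional points as distances purely for readability; in the setting of this document the distance would be derived from Pearson correlation.

```python
def silhouette_of_cluster(cluster, clusters, dist):
    """Mean silhouette value of the samples in one cluster.

    Requires at least two clusters in total.
    """
    values = []
    for s in cluster:
        if len(cluster) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(s, t) for t in cluster if t != s) / (len(cluster) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(s, t) for t in other) / len(other)
                for other in clusters if other is not cluster)
        values.append((b - a) / max(a, b))
    return sum(values) / len(values)


def avgsil(clusters, dist):
    """Average of the per-cluster silhouette widths over all N clusters."""
    return sum(silhouette_of_cluster(c, clusters, dist) for c in clusters) / len(clusters)


# Toy data (invented): four samples on a line.
points = {1: 0.0, 2: 0.1, 3: 1.0, 4: 1.1}


def toy_dist(a, b):
    return abs(points[a] - points[b])


compact = [[1, 2], [3, 4]]   # groups neighbouring points together
mixed = [[1, 3], [2, 4]]     # splits neighbouring points apart
print(avgsil(compact, toy_dist), avgsil(mixed, toy_dist))
```

The clustering with the larger value would be the preferred one under this criterion.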

By comparing $\mathrm{AVGSIL}(D_1)$ and $\mathrm{AVGSIL}(D_2)$, it may be determined which clustering is preferable.

In another embodiment, a data structure G is built in the form of a matrix with dimensions N x L, where L is the number of distinct labels available for the samples. With labels {X, Y}, L = 2, or for labels {normal, aggressive cancer, non-aggressive cancer}, L = 3. Then for each cluster i (i = 1..N), L values are obtained in the following manner for each element $g_{i,j}$ from G:

$$g_{i,j} = \mathrm{count}(\text{samples in cluster } i \text{ that have label } j)$$

Now, it is possible to compute the uniformity of each cluster $C_i$:

$$\mathrm{UNIFORMITY}(C_i) = \frac{\max_{j} g_{i,j}}{\sum_{j=1}^{L} g_{i,j}}$$

Finally, the clustering is characterized with:

$$\mathrm{AVGUNIFORMITY}(D) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{UNIFORMITY}(C_i)$$

as a single-number characterization of a clustering. By comparing $\mathrm{AVGUNIFORMITY}(D_1)$ and $\mathrm{AVGUNIFORMITY}(D_2)$, it may be determined which clustering is preferable.
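The uniformity measure can be sketched as below, using the X/Y labels of the five-sample example given earlier. The label groupings for Fig. 5 are taken from the text, but the text does not state which sample sits in which cluster, so the sample memberships used here are assumptions chosen only to reproduce those groupings. Function names are hypothetical.

```python
from collections import Counter


def label_counts(clusters, labels):
    """Matrix G as a list of rows: row i maps label j to its count in cluster i."""
    return [Counter(labels[s] for s in cluster) for cluster in clusters]


def uniformity(row):
    """Share of the most frequent label within one cluster."""
    return max(row.values()) / sum(row.values())


def avg_uniformity(clusters, labels):
    """Average of the per-cluster uniformity over all N clusters."""
    g = label_counts(clusters, labels)
    return sum(uniformity(row) for row in g) / len(g)


labels = {1: "X", 2: "X", 3: "Y", 4: "Y", 5: "X"}

# Assumed memberships, consistent with the groupings
# A: {XXY, YX}, B: {XY, YXX}, C: {XXX, YY}.
clustering_a = [[1, 2, 3], [4, 5]]
clustering_b = [[1, 3], [4, 2, 5]]
clustering_c = [[1, 2, 5], [3, 4]]

for name, clustering in (("A", clustering_a), ("B", clustering_b), ("C", clustering_c)):
    print(name, avg_uniformity(clustering, labels))
```

Under these assumptions clustering C scores 1.0 while A and B each score 7/12, which matches the observation that the second feature subset groups the subtypes correctly.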

Iterative repetition of this selection process gradually refines the quality of the clustering of the feature subsets discovered by the GA. After a number of repetitions, all evaluated feature subsets can be further filtered based on their performance during the GA execution. In one embodiment, feature subsets are sorted by their average clustering performance in stratification of the clinical samples. In another embodiment, feature subsets are, in addition to the average performance, filtered based on their persistent re-evaluation. In other words, feature subsets that are repeatedly selected for further evaluation are preferred to feature subsets that are dropped from consideration after only a few iterations. The final output of a GA feature subset selection is obtained by running multiple instances with different initial conditions and merging the filtered feature subsets from each of these instances. Feature subsets from one such evaluation are listed in Table 3A.

Furthermore, a cumulative characterization of a collection of GA runs can be obtained and used to generate feature subsets that aggregate the feature subsets in a single set of subsets. In one embodiment, the appearance of each feature in feature subsets is counted and a total histogram is obtained, giving the degree of utilization of each of the 600 features. Based on this information, and for example in one embodiment on the frequencies of the pairwise occurrences of the 600 features, feature subsets are built that summarize the GA run in a single set of subsets, a so-called trend pattern. Table 3B provides such feature subsets of lengths 45 and 60.
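The counting step behind such a summary can be sketched as follows: a histogram of how often each feature appears across subsets, plus counts of pairwise co-occurrence. How these counts are then turned into a trend pattern is not specified in the text, so the sketch stops at the counts. The function name and the toy subsets (letters in the style of Fig. 2) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def summarize_subsets(subsets):
    """Count feature usage and pairwise co-occurrence across feature subsets."""
    feature_counts = Counter()
    pair_counts = Counter()
    for subset in subsets:
        unique = sorted(set(subset))  # ignore duplicates within one subset
        feature_counts.update(unique)
        # Each unordered pair is stored once, as a sorted tuple.
        pair_counts.update(combinations(unique, 2))
    return feature_counts, pair_counts


# Toy subsets (invented), e.g. the filtered output of several GA runs.
subsets = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "D"]]
feature_counts, pair_counts = summarize_subsets(subsets)

print(feature_counts.most_common())
print(pair_counts.most_common(3))
```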

Examples of feature subsets are provided in Tables 2, 3A and 3B. Thus, in an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 2.

Table 2. Feature subsets. Each subset comprises a selection of sequences indicated by numbers corresponding to the FragIDs in table 1.

10 152494, 55649, 158649, 33381, 129193, 38485, 86866, 1601, 153363, 158646, 72675, 128850, 13583, 4109, 38815, 63267, 19926, 103295, 79123, 4823, 80726, 115442, 25715, 71104, 92237, 152496, 134481, 1359, 65610, 55215, 11111, 114219, 118132, 149792, 757, 27685, 71089, 120745, 3535, 36661, 52666, 148458, 56504, 87210, 110848, 39760, 152716, 94345, 47510, 87185, 156306, 71105, 89865, 54424, 95724, 153087, 42953, 71090, 57442, 76797, 70538, 156440, 113989, 13394, 46277, 14656, 20225, 9029, 89183

11 152494, 110545, 12301, 14289, 61152, 1650, 129193, 99554, 153362, 72675, 120416, 149794, 13583, 19926, 32667, 103295, 150393, 92237, 45338, 95107, 96587, 149788, 66071, 14254, 757, 37395, 99668, 14231, 118129, 152681, 155418, 36661, 146589, 148458, 1249, 55611, 110848, 71074, 88982, 32624, 47510, 31913, 26073, 71121, 71105, 145717, 72461, 15478, 118488, 153027, 154875, 133709, 144856, 60445, 73062, 5525, 152213, 92849, 80168, 63043, 90137, 56922

12 152494, 110545, 114218, 129193, 86495, 86866, 99554, 45501, 38815, 19926, 158648, 103295, 60291, 10427, 149824, 115442, 151564, 152496, 98223, 147018, 65670, 77777, 55218, 118491, 118132, 33338, 142017, 54824, 55941, 36661, 145238, 87210, 138677, 39760, 45409, 123890, 99150, 71121, 25164, 1324, 71105, 82920, 1534, 123955, 133709, 24273, 60445, 94051, 71090, 80169, 108016, 70538, 78440, 39539, 131234, 134630, 50444, 87698, 143864, 90137, 64684, 45650

13 152494, 110545, 55649, 1650, 102005, 158649, 129193, 86495, 86866, 128414, 128850, 146035, 1173, 19926, 153364, 4823, 149824, 14609, 72674, 56402, 118551, 45338, 65670, 114220, 61161, 118491, 130315, 18856, 118129, 148458, 87210, 110848, 134826, 145015, 93471, 48491, 80728, 125612, 46110, 110793, 99150, 71121, 96210, 10393, 2123, 15066, 152094, 27268, 28887, 1339, 133709, 111802, 76797, 42441, 145731, 26333, 147896, 63043, 87698, 11354, 73907, 27495

14 114205, 129193, 86866, 99554, 152321, 52027, 80645, 72674, 76619, 151564, 71104, 113247, 47435, 95107, 126936, 136763, 147018, 84490, 65670, 55275, 105101, 20895, 757, 99668, 50853, 27685, 148458, 56504, 110848, 145015, 144226, 89408, 99113, 158958, 125612, 144360, 7116, 26073, 99150, 96210, 71105, 124831, 152094, 71216, 1339, 14451, 88395, 142439, 71090, 92849, 103793, 57442, 119665, 88411, 46277, 10916, 134630, 11354, 90137, 27495

15 110545, 102005, 129193, 158646, 153362, 73586, 27115, 114138, 127886, 56402, 5104, 115442, 150632, 151564, 71104, 152496, 53338, 114207, 134481, 116804, 65670, 55275, 118132, 130315, 96227, 71581, 118129, 79207, 155418, 123180, 114108, 52666, 1249, 84518, 64725, 87210, 136153, 135257, 145015, 156308, 48491, 152480, 45409, 88982, 26073, 71121, 152094, 40505, 149461, 54424, 28887, 14451, 123955, 56289, 83839, 1391, 108016, 39539, 119665, 88411, 9278, 102061, 27677, 115870, 14656, 56922

16 152494, 110545, 86939, 55649, 102005, 25023, 128737, 129193, 14197, 99554, 152321, 153362, 72675, 13583, 39470, 61003, 103295, 79123, 80726, 118551, 114139, 147620, 96587, 55218, 38714, 8273, 757, 54400, 1823, 15771, 46721, 157076, 71120, 3535, 52666, 11474, 148458, 87210, 57206, 152480, 55475, 89408, 99113, 148624, 7116, 8778, 110793, 47510, 26073, 76120, 25164, 71105, 124831, 127669, 9928, 27268, 154875, 144856, 60445, 88395, 94051, 36595, 71090, 111358, 76797, 50444, 27677, 23738, 76467, 71700

17 110545, 114140, 102005, 129193, 99554, 152321, 128850, 5455, 124390, 149824, 80726, 126928, 56402, 151564, 17697, 47435, 152496, 38417, 147018, 116804, 84490, 65670, 4389, 118491, 757, 99668, 15771, 46721, 118129, 79207, 105085, 127220, 36661, 22036, 148458, 64725, 52146, 87210, 136153, 145015, 31913, 26073, 71105, 15066, 145717, 20134, 130161, 14451, 50717, 17091, 60445, 87160, 33136, 54796, 57442, 76797, 59067, 61099, 20706, 28326, 72750, 76801, 82859, 105873, 27677, 113614, 9029

18 152494, 110545, 55649, 153365, 129193, 21537, 86866, 99554, 72675, 120581, 52027, 19926, 103295, 114138, 1340, 151564, 128857, 132985, 118551, 95107, 152748, 98223, 14203, 65670, 149788, 55218, 118491, 118132, 142017, 118129, 11782, 27685, 99472, 36661, 87210, 38910, 55611, 135107, 135257, 149787, 48491, 80728, 7116, 110793, 99150, 71105, 9928, 40858, 58680, 1534, 133709, 60445, 94051, 5525, 71090, 70538, 80112, 2643, 9937, 98985, 64684

In an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 3A.

Table 3A. Feature subsets. Each subset comprises a selection of sequences indicated by numbers corresponding to the FragIDs in table 1.

In an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 3B.

Table 3B. Feature subsets. Each subset comprises a selection of sequences indicated by numbers corresponding to the FragIDs in table 1.

In an embodiment the method 10 comprises determining 110 the methylation status of one or more CpG dinucleotides in a sequence selected from the group of sequences corresponding to the marker panel, resulting in a methylation classification list. There are numerous methods for determining 110 the methylation status of a DNA molecule of a subject, corresponding to the feature subset. The DNA may be obtained by any method for purifying DNA known to a person skilled in the art. In an embodiment the methylation status is determined 110 by means of one or more of the methods selected from the group of bisulfite sequencing, pyrosequencing, methylation-sensitive single-strand conformation analysis (MS-SSCA), high resolution melting analysis (HRM), methylation-sensitive single nucleotide primer extension (MS-SnuPE), base-specific cleavage/MALDI-TOF, methylation-specific PCR (MSP), microarray-based methods, and MspI cleavage.

In an embodiment, the method 10 also comprises statistically analyzing 120 the methylation classification list, thus obtaining a category of the breast cancer of the subject. This may be done by jointly clustering the subject methylation data and the samples from the clinical study. The resulting clustering is then split into N groups (e.g. by cutting the clustering dendrogram into N sub-trees). The sub-tree containing the subject is evaluated for the categories of breast cancer present in the study samples, and the subject sample is assigned the category of the majority of the samples in the sub-tree. In an embodiment, the method 10 further comprises classifying 130 the subject as belonging to one of the five major subtypes of breast cancers.
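The majority assignment described above can be sketched as follows, assuming the joint clustering has already been cut into groups. The function name, sample identifiers and labels are hypothetical; ties are resolved in favour of the category encountered first, which is an arbitrary choice not addressed in the text.

```python
from collections import Counter


def assign_category(clusters, subject_id, reference_labels):
    """Return the majority category of the study samples sharing the subject's sub-tree.

    clusters: list of lists of sample ids (the N sub-trees).
    reference_labels: dict study sample id -> known category.
    Returns None if the subject's sub-tree holds no labelled study sample.
    """
    for cluster in clusters:
        if subject_id in cluster:
            votes = Counter(reference_labels[s] for s in cluster
                            if s in reference_labels)
            if not votes:
                return None
            return votes.most_common(1)[0][0]
    raise ValueError("subject not found in any cluster")


# Toy example (invented ids and labels).
clusters = [["t1", "t2", "subject"], ["t3", "t4"]]
reference_labels = {"t1": "luminal A", "t2": "luminal A",
                    "t3": "basal", "t4": "basal"}

print(assign_category(clusters, "subject", reference_labels))
```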

In an embodiment according to Fig. 6, a computer program product 60 is provided. The computer program product 60 is stored on a computer-readable medium, which comprises first 61, second 62, third 63 and fourth 64 code segments arranged, when run by an apparatus having computer-processing properties, for performing all of the method steps defined in some embodiments.

In an embodiment according to Fig. 7, a device 70 for supporting a clinician is provided. Said device comprises means for selecting 700 a feature subset comprising at least one post from the methylation classification list according to SEQ ID NO. 1 to SEQ ID NO. 600. Furthermore, the device 70 comprises means for determining 710 the methylation status of one or more CpG dinucleotides in DNA of a subject, corresponding to the feature subset. Furthermore, the device 70 comprises means for statistically analyzing 720 the methylation classification list, thus obtaining a category of the breast cancer of the subject. Furthermore, the device 70 comprises means for classifying 730 the subject as belonging to one of the five major subtypes of breast cancers. Said means 700, 710, 720, 730 may be operatively connected to each other.

The invention may be implemented in any suitable form including hardware, software, firmware or any combination of these. However, preferably, the invention is implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.

Although the present invention has been described above with reference to specific embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the invention is limited only by the accompanying claims, and other embodiments than the specific ones above are equally possible within the scope of these appended claims.

In the claims, the term "comprises/comprising" does not exclude the presence of other elements or steps. Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by e.g. a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. The terms "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

LIST OF REFERENCE SIGNS:

10 A method

100 A selecting step

110 A determining step

120 An analyzing step

130 A classifying step

20 A dataset

30 A first feature subset

40 A second feature subset

51 A first cluster

52 A second cluster

53 A third cluster

60 A computer program product

61 A first code segment

62 A second code segment

63 A third code segment

64 A fourth code segment

70 A device

700 Selecting means

710 Determining means

720 Analyzing means

730 Classifying means

1 to 5 Sample numbers

Claims

CLAIMS:

1. Method (10) for the analysis of breast cancer disorders, comprising determining the genomic methylation status of one or more CpG dinucleotides in a sequence selected from the group of sequences consisting of SEQ ID NO. 1 to SEQ ID NO. 600.

2. Method according to claim 1, wherein the analysis is categorization of breast cancer in a subject and wherein the following steps are performed:

a. selecting (100) a feature subset comprising at least one post from the methylation classification list according to SEQ ID NO. 1 to SEQ ID NO. 600;

b. determining (110) the methylation status of one or more CpG dinucleotides in DNA of a subject, corresponding to the feature subset; and

c. statistically analyzing (120) the methylation classification list, thus obtaining a category of the breast cancer of the subject.

3. Method according to claims 1 or 2, wherein additionally the following step is performed:

d. classifying (130) the subject as belonging to one of the five major subtypes of breast cancers.

4. Method according to claims 1 to 3, wherein the methylation status is determined (110) for a subgroup of sequences, wherein the specific subgroup is selected from Table 2, 3A or 3B.

5. Method according to claims 1 to 3, wherein the methylation status is determined (110) for a subgroup of sequences determined by selecting (100) a feature subset.

6. Method according to claim 5, wherein the feature subset selection (100) is a genetic algorithm with hierarchical clustering.

7. Method according to claims 1 to 3, wherein the methylation status is determined (110) for a subgroup of sequences determined by a summarization of output of feature subset selection (100).

8. Method according to claim 7, wherein the summarization of output of feature subset selection (100) is the count of appearance of each feature in feature subsets and pairwise occurrences of sequences selected from the group of sequences consisting of SEQ ID NO. 1 to SEQ ID NO. 600.

9. Method according to claim 8, wherein the count of appearance of each feature in feature subsets and pairwise occurrences of sequences are of size 45.

10. Method according to claim 8, wherein the count of appearance of each feature in feature subsets and pairwise occurrences of sequences are of size 60.

11. Method according to claims 1 to 4, wherein the methylation status is determined (110) by means of one or more of the methods selected from the group of:

a. bisulfite sequencing;

b. pyrosequencing;

c. methylation-sensitive single-strand conformation analysis (MS-SSCA);

d. high resolution melting analysis (HRM);

e. methylation-sensitive single nucleotide primer extension (MS-SnuPE);

f. base-specific cleavage/MALDI-TOF;

g. methylation-specific PCR (MSP);

h. microarray-based methods; and

i. MspI cleavage.

12. A computer program product (60) stored on a computer-readable medium comprising software code adapted to perform the steps of the method according to claim 2, 3, 4 or 7 when executed on a data-processing apparatus.

13. A device (70) for supporting a clinician, said device comprising means for:

a. selecting (700) a feature subset comprising at least one post from the methylation classification list according to SEQ ID NO. 1 to SEQ ID NO. 600;

b. determining (710) the methylation status of one or more CpG dinucleotides in DNA of a subject, corresponding to the feature subset;

c. statistically analyzing (720) the methylation classification list, thus obtaining a category of the breast cancer of the subject; and

d. classifying (730) the subject as belonging to one of the five major subtypes of breast cancers;

said means being operatively connected to each other.