WO2007073220A1

WO2007073220A1 - Prognosis prediction for colorectal cancer

Info

Publication number: WO2007073220A1
Application number: PCT/NZ2006/000343
Authority: WO
Inventors: Hjalmar Nekarda; Jan Friederichs; Bernhard Holzmann; Robert Rosenberg; Anthony Edmund Reeve; Michael Alan Black; John Lindsay Mccall; Yu-Hsin Lin; Robert Craig Pollock
Original assignee: Pacific Edge Biotechnology Limited
Priority date: 2005-12-23
Filing date: 2006-12-22
Publication date: 2007-06-28
Also published as: NZ544432A; EP2371972A1; EP2371972B1; EP1977237A1; US20200140955A1; JP2014193172A; EP2392678B1; ES2536233T3; KR20080102360A; CA3015335A1; JP6058780B2; JP6218141B2; WO2007073220A9; KR20150005726A; CA2640352C; KR101562644B1; NZ597363A; NZ586616A; CN101389957B; US20160068916A1

Abstract

This invention relates to prognostic signatures, and compositions and methods for determining the prognosis of cancer in a patient, particularly for colorectal cancer. Specifically, this invention relates to the use of genetic markers for the prediction of the prognosis of cancer, such as colorectal cancer, based on signatures of genetic markers. In various aspects, the invention relates to a method of predicting the likelihood of long-term survival of a cancer patient, a method of determining a treatment regime for a cancer patient, a method of preparing a treatment modality for a cancer patient, among other methods as well as kits and devices for carrying out these methods.

Description

PROGNOSIS PREDICTION FOR COLORECTAL CANCER

RELATED APPLICATION

This application claims the benefit of New Zealand Provisional Patent Application No. 544432 filed December 23, 2005, which is incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

This invention relates to methods and compositions for determining the prognosis of cancer, particularly colorectal cancer, in a patient. Specifically, this invention relates to the use of genetic markers for determining the prognosis of cancer, such as colorectal cancer, based on prognostic signatures.

BACKGROUND OF THE INVENTION

Colorectal cancer (CRC) is one of the most common cancers in the developed world, and its incidence is continuing to increase. Although the progression of colorectal cancer from benign polyp to adenoma to carcinoma is well studied (1), the molecular events influencing the transition and establishment of metastasis are less well understood. The prognosis and treatment of CRC currently depend on the clinico-pathological stage of disease at the time of diagnosis, and on primary surgical treatment. Unfortunately, disease stage alone does not allow accurate prediction of outcome for individual patients. If patient outcomes could be predicted more accurately, treatments could be tailored to avoid under-treating patients destined to relapse, or over-treating patients who would be helped by surgery alone.

Many attempts have been made to identify markers that predict clinical outcome in CRC. Until recently, most studies focused on single proteins or gene mutations, with limited success in terms of prognostic information (2). Microarray technology enables the identification of sets of genes, called classifiers or signatures, that correlate with cancer outcome. This approach has been applied to a variety of cancers, including CRC (3-5), but methodological problems and a lack of independent validation have cast doubt over the findings (6,7). Furthermore, doubts about the ability of classifiers/signatures to predict outcome have arisen due to poor concordance between the signatures identified by different researchers using different array platforms and methodologies (8).

There is a need for further tools to predict the prognosis of colorectal cancer. This invention provides further methods, compositions, kits, and devices based on prognostic cancer markers, specifically colorectal cancer prognostic markers, to aid in the prognosis and treatment of cancer.

SUMMARY OF THE INVENTION

In certain embodiments there is provided a set of marker genes identified to be differentially expressed in recurrent and non-recurrent colorectal tumours. This set of genes can be used to generate prognostic signatures, comprising two or more markers, capable of predicting the progression of colorectal tumour in a patient.

The individual markers can be differentially expressed depending on whether the tumour is recurrent or not. The accuracy of prediction can be enhanced by combining the markers together into a prognostic signature, providing for much more effective individual tests than single-gene assays. Also provided for is the application of techniques, such as statistics, machine learning, artificial intelligence, and data mining, to the prognostic signatures to generate prediction models. In another embodiment, expression levels of the markers of a particular prognostic signature in the tumour of a patient can then be applied to the prediction model to determine the prognosis.

In certain embodiments, the expression level of the markers can be established using microarray methods, quantitative polymerase chain reaction (qPCR), or immunoassays.

BRIEF DESCRIPTION OF THE FIGURES

This invention is described with reference to specific embodiments thereof and with reference to the figures, in which:

Figure 1 depicts a flow chart showing the methodology for producing the prognostic signatures from 149 New Zealand (NZ) and 55 German (DE) colorectal cancer (CRC) samples. New Zealand RNA samples were hybridized to oligonucleotide spotted arrays, with a 22-gene signature produced via leave one out cross validation (LOOCV), and then independently validated by LOOCV using the 55 sample DE data set. German RNA samples were hybridized to Affymetrix arrays, with a 19-gene signature produced via LOOCV, and then independently validated by LOOCV using the NZ data set.

Figure 2 depicts a Kaplan-Meier analysis of disease-free survival time with patients predicted as high versus low risk of tumour recurrence: a, using NZ 22-gene signature on 149 tumours from NZ patients; b, using DE 19-gene signature on 55 tumours from DE patients; c, NZ prognostic signature validated on 55 tumours from DE patients; d, DE prognostic signature validated on 149 tumours from NZ patients. P-values were calculated using the log-rank test.

Figure 3 depicts a Kaplan-Meier analysis of disease-free survival time with patients predicted as high versus low risk of tumour recurrence: a, using the 22-gene NZ signature on NZ patients with Stage II and Stage III disease; b, using the 19-gene DE signature on NZ patients with Stage II and Stage III disease.

Figure 4 shows the predictive value of signatures of varying lengths for prognosis of colorectal cancer. These signatures were derived from 10 replicate runs of 11-fold cross validation. Each replicate 11-fold validation run is indicated by the various dashed lines; the mean across replicates by the bold line. In each fold of the cross-validation, genes were removed if the fold-change across classes was < 1.1 (for the remaining samples not removed in that particular fold). The genes were then ranked using a modified t-statistic, obtaining a different set of genes for each fold, and classifiers using the top n genes (where n=2 to 200) were constructed for each fold. The genes therefore may differ for each fold of each replicate 11-fold cross validation. Figure 4 (A): Sensitivity (proportion of recurrent tumours correctly classified), with respect to number of genes/signature. Figure 4 (B): Specificity (proportion of non-recurrent tumours correctly classified), with respect to number of genes/signature. Figure 4 (C): Classification rate (proportion of tumours correctly classified), with respect to number of genes/signature. The nomenclature applied by the statistician is as follows: I refers to Stage I or Stage II colorectal cancer (with no progression), and IV refers to eventual progression to Stage IV metastases.

Figure 5 shows the decreased predictive value of signatures for the prognosis of colorectal cancer, in a repeat of the experiment of Figure 4, except with the two genes, FAS and ME2, removed from the data set. Figure 5 (A): Sensitivity (proportion of recurrent tumours correctly classified), with respect to number of genes/signature. Figure 5 (B): Specificity (proportion of non-recurrent tumours correctly classified), with respect to number of genes/signature. Figure 5 (C): Classification rate (proportion of tumours correctly classified), with respect to number of genes/signature.

Figure 6 shows a pairs chart of "top counts" (number of times each gene appeared in the "top-n" gene lists, i.e., top 10, top 20, top 100, and top 325 as described in Example 17) using three different normalization methods, produced using the R statistical computing package (10, 39), in accordance with Example 17, below. The "pairs" chart is described by Becker et al. in their treatise on the S Language (upon which R is based; see reference 39). To compare methods, use the row and column as defined on the diagonal to obtain the scatter plot between those two methods, analogous to reading distances off a distance chart on a map.

Figure 7 shows the pairs chart (39) of top counts (number of times each gene appeared in the "top-n" gene lists, i.e., top 10, top 20, top 100, and top 325 as described in Example 17) using three different filtering statistics: (a) two-sample Wilcoxon test (41), (b) t-test (modified using an ad-hoc correction factor in the denominator to abrogate the effect of low-variance genes falsely appearing as significant) and (c) empirical Bayes as provided by the "limma" (10, 40, 42) package of Bioconductor (12, 40).
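By way of illustration only, the per-fold gene selection described for Figure 4 (fold-change filter, ranking by a modified t-statistic, retention of the top n genes) might be sketched as follows. The exact form of the modified t-statistic is not reproduced here; the sketch assumes log2-scale expression values and an additive constant `s0` in the denominator, in the spirit of the ad-hoc correction factor mentioned for Figure 7. The function names, the value of `s0`, and the deterministic assignment of samples to folds are hypothetical choices made for the example.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def rank_genes(expr, labels, min_fold=1.1, s0=0.1):
    """Filter genes by fold change, then rank them by |modified t|.

    expr   : dict mapping gene name -> list of log2 expression values
    labels : list with 1 for recurrent and 0 for non-recurrent samples
    s0     : assumed constant added to the denominator so that genes with
             very low variance do not appear falsely significant
    """
    scored = []
    for gene, values in expr.items():
        rec = [v for v, y in zip(values, labels) if y == 1]
        non = [v for v, y in zip(values, labels) if y == 0]
        diff = mean(rec) - mean(non)
        # On a log2 scale the fold change between classes is 2 ** |diff|.
        if 2 ** abs(diff) < min_fold:
            continue
        se = math.sqrt(sample_variance(rec) / len(rec)
                       + sample_variance(non) / len(non))
        scored.append((abs(diff / (se + s0)), gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored]


def top_n_per_fold(expr, labels, n, k):
    """Return one top-n gene list per fold, selected on training samples only."""
    indices = list(range(len(labels)))
    folds = [indices[i::k] for i in range(k)]  # deterministic split (assumption)
    gene_lists = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_expr = {g: [v[i] for i in train] for g, v in expr.items()}
        train_labels = [labels[i] for i in train]
        gene_lists.append(rank_genes(train_expr, train_labels)[:n])
    return gene_lists


# Toy data: three hypothetical genes measured in 6 recurrent and 6 non-recurrent samples.
toy_labels = [1] * 6 + [0] * 6
toy_expr = {
    "G_up": [8.0, 8.2, 7.9, 8.1, 8.3, 7.8, 6.0, 6.1, 5.9, 6.2, 5.8, 6.0],
    "G_flat": [7.0, 7.02, 6.98, 7.01, 6.99, 7.0, 7.0, 7.01, 6.99, 7.02, 6.98, 7.0],
    "G_down": [5.0, 5.3, 4.8, 5.1, 5.2, 4.9, 6.4, 6.6, 6.5, 6.3, 6.7, 6.5],
}
toy_lists = top_n_per_fold(toy_expr, toy_labels, n=2, k=3)
```

Because selection is repeated inside each fold, the gene list can differ from fold to fold, which is why the figure description notes that the genes may differ for each fold of each replicate.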

DETAILED DESCRIPTION

Definitions

Before describing embodiments of the invention in detail, it will be useful to provide some definitions of terms used herein.

The term "marker" refers to a molecule that is associated quantitatively or qualitatively with the presence of a biological phenomenon. Examples of "markers" include a polynucleotide, such as a gene or gene fragment, RNA or RNA fragment; or a gene product, including a polypeptide such as a peptide, oligopeptide, protein, or protein fragment; or any related metabolites, by-products, or any other identifying molecules, such as antibodies or antibody fragments, whether related directly or indirectly to a mechanism underlying the phenomenon. The markers of the invention include the nucleotide sequences (e.g., GenBank sequences) as disclosed herein, in particular, the full-length sequences, any coding sequences, any fragments, or any complements thereof, and any measurable marker thereof as defined above.

The terms "CCPM" or "colorectal cancer prognostic marker" or "CCPM family member" refer to a marker with altered expression that is associated with a particular prognosis, e.g., a higher or lower likelihood of recurrence of cancer, as described herein, but can exclude molecules that are known in the prior art to be associated with prognosis of colorectal cancer. It is to be understood that the term CCPM does not require that the marker be specific only for colorectal tumours. Rather, expression of CCPM can be altered in other types of tumours, including malignant tumours.

The terms "prognostic signature," "signature," and the like refer to a set of two or more markers, for example CCPMs, that when analysed together as a set allow for the determination of or prediction of an event, for example the prognostic outcome of colorectal cancer. The use of a signature comprising two or more markers reduces the effect of individual variation and allows for a more robust prediction. Non-limiting examples of CCPMs are set forth in Tables 1, 2, 5, and 9, while non-limiting examples of prognostic signatures are set forth in Tables 3, 4, 8A, 8B, and 9, herein. In the context of the present invention, reference to "at least one," "at least two," "at least five," etc., of the markers listed in any particular set (e.g., any signature) means any one or any and all combinations of the markers listed.

The term "prediction method" is defined to cover the broader genus of methods from the fields of statistics, machine learning, artificial intelligence, and data mining, which can be used to specify a prediction model. These are discussed further in the Detailed Description section. The term "prediction model" refers to the specific mathematical model obtained by applying a prediction method to a collection of data. In the examples detailed herein, such data sets consist of measurements of gene activity in tissue samples taken from recurrent and non-recurrent colorectal cancer patients, for which the class (recurrent or non-recurrent) of each sample is known. Such models can be used to (1) classify a sample of unknown recurrence status as being one of recurrent or non-recurrent, or (2) make a probabilistic prediction (i.e., produce either a proportion or percentage to be interpreted as a probability) which represents the likelihood that the unknown sample is recurrent, based on the measurement of mRNA expression levels or expression products, of a specified collection of genes, in the unknown sample. The exact details of how these gene-specific measurements are combined to produce classifications and probabilistic predictions are dependent on the specific mechanisms of the prediction method used to construct the model.
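As a non-limiting illustration of the distinction drawn above, the following sketch applies one possible prediction method (logistic regression fitted by gradient descent, chosen here purely as an assumption for the example) to a small labelled data set to obtain a prediction model, and then uses that model both to classify a sample and to produce a probabilistic prediction. The two-gene toy data, the function names, and the learning-rate and iteration settings are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Prediction method: returns a prediction model (weights, intercept)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    """Probabilistic prediction: estimated likelihood that the sample is recurrent."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def classify(model, x, cutoff=0.5):
    """Classification: recurrent or non-recurrent."""
    return "recurrent" if predict_proba(model, x) >= cutoff else "non-recurrent"


# Toy training data: centred expression of two hypothetical markers per sample,
# with known class (1 = recurrent, 0 = non-recurrent).
train_X = [[-1.0, 0.8], [-0.8, 1.1], [-1.2, 0.9],
           [1.0, -0.9], [0.9, -1.1], [1.1, -0.7]]
train_y = [1, 1, 1, 0, 0, 0]
model = fit_logistic(train_X, train_y)

unknown_sample = [-0.9, 1.0]
print(classify(model, unknown_sample), predict_proba(model, unknown_sample))
```

The same fitted model yields either output; only the final step (thresholding the probability or reporting it directly) differs.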

"Sensitivity", "specificity" (or "selectivity"), and "classification rate", when applied to the describing the effectiveness of prediction models mean the following: "Sensitivity" means the proportion of truly positive samples that are also predicted (by the model) to be positive, hi a test for CRC recurrence, that would be the proportion of recurrent tumours predicted by the model to be recurrent. "Specificity" or "selectivity" means the proportion of truly negative samples that are also predicted (by the model) to be negative. In a test for CRC recurrence, this equates to the proportion of non-recurrent samples that are predicted to by non-recurrent by the model. "Classification Rate" is the proportion of all samples that are correctly classified by the prediction model (be that as positive or negative).

As used herein "antibodies" and like terms refer to immunoglobulin molecules and immunologically active portions of immunoglobulin (Ig) molecules, i.e., molecules that contain an antigen binding site that specifically binds (immunoreacts with) an antigen. These include, but are not limited to, polyclonal, monoclonal, chimeric, single chain, Fc, Fab, Fab', and Fab₂ fragments, and a Fab expression library. Antibody molecules relate to any of the classes IgG, IgM, IgA, IgE, and IgD, which differ from one another by the nature of the heavy chain present in the molecule. These include subclasses as well, such as IgG1, IgG2, and others. The light chain may be a kappa chain or a lambda chain. Reference herein to antibodies includes a reference to all classes, subclasses, and types. Also included are chimeric antibodies, for example, monoclonal antibodies or fragments thereof that are specific to more than one source, e.g., a mouse or human sequence. Further included are camelid antibodies, shark antibodies or nanobodies.

The terms "cancer" and "cancerous" refer to or describe the physiological condition in mammals that is typically characterized by abnormal or unregulated cell growth. Cancer and cancer pathology can be associated, for example, with metastasis, interference with the normal functioning of neighbouring cells, release of cytokines or other secretory products at abnormal levels, suppression or aggravation of inflammatory or immunological response, neoplasia, premalignancy, malignancy, invasion of surrounding or distant tissues or organs, such as lymph nodes, etc. Specifically included are colorectal cancers, such as bowel (e.g., large bowel), anal, and rectal cancers.

The term "colorectal cancer" includes cancer of the colon, rectum, and/or anus, and especially, adenocarcinomas, and may also include carcinomas (e.g., squamous cloacogenic carcinomas), melanomas, lymphomas, and sarcomas. Epidermoid (nonkeratinizing squamous cell or basaloid) carcinomas are also included. The cancer may be associated with particular types of polyps or other lesions, for example, tubular adenomas, tubulovillous adenomas (e.g., villoglandular polyps), villous (e.g., papillary) adenomas (with or without adenocarcinoma), hyperplastic polyps, hamartomas, juvenile polyps, polypoid carcinomas, pseudopolyps, lipomas, or leiomyomas. The cancer may be associated with familial polyposis and related conditions such as Gardner's syndrome or Peutz-Jeghers syndrome. The cancer may be associated, for example, with chronic fistulas, irradiated anal skin, leukoplakia, lymphogranuloma venereum, Bowen's disease (intraepithelial carcinoma), condyloma acuminatum, or human papillomavirus. In other aspects, the cancer may be associated with basal cell carcinoma, extramammary Paget's disease, cloacogenic carcinoma, or malignant melanoma.

The terms "differentially expressed," "differential expression," and like phrases, refer to a gene marker whose expression is activated to a higher or lower level in a subject (e.g., test sample) having a condition, specifically cancer, such as colorectal cancer, relative to its expression in a control subject (e.g., reference sample). The terms also include markers whose expression is activated to a higher or lower level at different stages of the same condition; in recurrent or non-recurrent disease; or in cells with higher or lower levels of proliferation. A differentially expressed marker may be either activated or inhibited at the polynucleotide level or polypeptide level, or may be subject to alternative splicing to result in a different polypeptide product. Such differences may be evidenced by a change in mRNA levels, surface expression, secretion or other partitioning of a polypeptide, for example.

Differential expression may include a comparison of expression between two or more markers (e.g., genes or their gene products); or a comparison of the ratios of the expression between two or more markers (e.g., genes or their gene products); or a comparison of two differently processed products (e.g., transcripts or polypeptides) of the same marker, which differ between normal subjects and diseased subjects; or between various stages of the same disease; or between recurring and non-recurring disease; or between cells with higher and lower levels of proliferation; or between normal tissue and diseased tissue, specifically cancer, or colorectal cancer. Differential expression includes both quantitative, as well as qualitative, differences in the temporal or cellular expression pattern in a gene or its expression products among, for example, normal and diseased cells, or among cells which have undergone different disease events or disease stages, or cells with different levels of proliferation.

The term "expression" includes production of polynucleotides and polypeptides, in particular, the production of RNA (e.g., mRNA) from a gene or portion of a gene, and includes the production of a polypeptide encoded by an RNA or gene or portion of a gene, and the appearance of a detectable material associated with expression. For example, the formation of a complex, for example, from a polypeptide-polypeptide interaction, polypeptide-nucleotide interaction, or the like, is included within the scope of the term "expression". Another example is the binding of a binding ligand, such as a hybridization probe or antibody, to a gene or other polynucleotide or oligonucleotide, a polypeptide or a protein fragment, and the visualization of the binding ligand. Thus, the intensity of a spot on a microarray, on a hybridization blot such as a Northern blot, or on an immunoblot such as a Western blot, or on a bead array, or by PCR analysis, is included within the term "expression" of the underlying biological molecule.

The terms "expression threshold," and "defined expression threshold" are used interchangeably and refer to the level of a marker in question outside which the polynucleotide or polypeptide serves as a predictive marker for patient survival without cancer recurrence. The threshold depends on the prediction model established and is derived experimentally from clinical studies such as those described in the Examples below. Depending on the prediction model used, the expression threshold may be set to achieve maximum sensitivity, or for maximum specificity, or for minimum error (maximum classification rate). For example, a higher threshold may be set to achieve minimum errors, but this may result in a lower sensitivity. Therefore, for any given predictive model, clinical studies will be used to set an expression threshold that generally achieves the highest sensitivity while having a minimal error rate. The determination of the expression threshold for any situation is well within the knowledge of those skilled in the art.
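The trade-off described above can be illustrated with a minimal sketch that tabulates sensitivity, specificity, and classification rate at each candidate threshold and selects the threshold that maximises a chosen criterion. It is assumed, for the purposes of the example only, that a single score per sample is available (a marker level or a model output) and that higher scores indicate recurrence; the function names and toy values are hypothetical.

```python
def threshold_table(scores, truth):
    """Evaluate every observed score as a candidate threshold.

    A sample is predicted recurrent when its score is >= the threshold.
    truth uses 1 for recurrent and 0 for non-recurrent samples.
    """
    rows = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, t in zip(scores, truth) if s >= thr and t == 1)
        fn = sum(1 for s, t in zip(scores, truth) if s < thr and t == 1)
        tn = sum(1 for s, t in zip(scores, truth) if s < thr and t == 0)
        fp = sum(1 for s, t in zip(scores, truth) if s >= thr and t == 0)
        rows.append({
            "threshold": thr,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "classification_rate": (tp + tn) / len(truth),
        })
    return rows


def choose_threshold(scores, truth, objective="classification_rate"):
    """Pick the row maximising the objective; ties go to the lowest threshold,
    which favours sensitivity among equally good candidates."""
    return max(threshold_table(scores, truth), key=lambda row: row[objective])


# Toy scores for 3 non-recurrent (0) and 3 recurrent (1) samples.
toy_scores = [0.2, 0.4, 0.5, 0.6, 0.7, 0.9]
toy_truth = [0, 0, 1, 0, 1, 1]

best_rate = choose_threshold(toy_scores, toy_truth, "classification_rate")
best_spec = choose_threshold(toy_scores, toy_truth, "specificity")
best_sens = choose_threshold(toy_scores, toy_truth, "sensitivity")
```

In this toy data the threshold chosen for maximum specificity is higher than the one chosen for maximum classification rate and gives up some sensitivity, which mirrors the trade-off noted in the definition.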

The term "long-term survival" is used herein to refer to survival for at least 5 years, more preferably for at least 8 years, most preferably for at least 10 years following surgery or other treatment.

The term "microarray" refers to an ordered or unordered arrangement of capture agents, preferably polynucleotides (e.g., probes) or polypeptides on a substrate. See, e.g., Microarray Analysis, M. Schena, John Wiley & Sons, 2002; Microarray Biochip Technology, M. Schena, ed., Eaton Publishing, 2000; Guide to Analysis of DNA Microarray Data, S. Knudsen, John Wiley & Sons, 2004; and Protein Microarray Technology, D. Kambhampati, ed., John Wiley & Sons, 2004.

The term "oligonucleotide" refers to a polynucleotide, typically a probe or primer, including, without limitation, single-stranded deoxyribonucleotides, single- or double-stranded ribonucleotides, RNA:DNA hybrids, and double-stranded DNAs. Oligonucleotides, such as single-stranded DNA probe oligonucleotides, are often synthesized by chemical methods, for example using automated oligonucleotide synthesizers that are commercially available, or by a variety of other methods, including in vitro expression systems, recombinant techniques, and expression in cells and organisms.

The term "polynucleotide," when used in the singular or plural, generally refers to any polyribonucleotide or polydeoxyribonucleotide, which may be unmodified RNA or DNA or modified RNA or DNA. This includes, without limitation, single- and double-stranded DNA, DNA including single- and double-stranded regions, single- and double-stranded RNA, and RNA including single- and double-stranded regions, hybrid molecules comprising DNA and RNA that may be single-stranded or, more typically, double-stranded or include single- and double-stranded regions. Also included are triple-stranded regions comprising RNA or DNA or both RNA and DNA. Specifically included are mRNAs, cDNAs, and genomic DNAs, and any fragments thereof. The term includes DNAs and RNAs that contain one or more modified bases, such as tritiated bases, or unusual bases, such as inosine. The polynucleotides of the invention can encompass coding or non-coding sequences, or sense or antisense sequences. It will be understood that each reference to a "polynucleotide" or like term, herein, will include the full-length sequences as well as any fragments, derivatives, or variants thereof.

"Polypeptide," as used herein, refers to an oligopeptide, peptide, or protein sequence, or fragment thereof, and to naturally occurring, recombinant, synthetic, or semisynthetic molecules. Where "polypeptide" is recited herein to refer to an amino acid sequence of a naturally occurring protein molecule, "polypeptide" and like terms, are not meant to limit the amino acid sequence to the complete, native amino acid sequence for the full-length molecule. It will be understood that each reference to a "polypeptide" or like term, herein, will include the full-length sequence, as well as any fragments, derivatives, or variants thereof.

The term "prognosis" refers to a prediction of medical outcome, for example, a poor or good outcome (e.g., likelihood of long-term survival); a negative prognosis, or poor outcome, includes a prediction of relapse, disease progression (e.g., tumour growth or metastasis, or drug resistance), or mortality; a positive prognosis, or good outcome, includes a prediction of disease remission (e.g., disease-free status), amelioration (e.g., tumour regression), or stabilization.

The term "proliferation" refers to the processes leading to increased cell size or cell number, and can include one or more of: tumour or cell growth, angiogenesis, innervation, and metastasis.

The term "qPCR" or "QPCR" refers to quantitative polymerase chain reaction as described, for example, in PCR Technique: Quantitative PCR, J. W. Larrick, ed., Eaton Publishing, 1997, and A-Z of Quantitative PCR, S. Bustin, ed., IUL Press, 2004.

The term "tumour" refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

"Stringency" of hybridization reactions is readily determinable by one of ordinary skill in the art, and generally is an empirical calculation dependent upon probe length, washing temperature, and salt concentration, hi general, longer probes require higher temperatures for proper annealing, while shorter probes need lower temperatures. Hybridization generally depends on the ability of denatured DNA to reanneal when complementary strands are present in an environment below their melting temperature. The higher the degree of desired homology between the probe and hybridisable sequence, the higher the relative temperature which can be used. As a result, it follows that higher relative temperatures would tend to make the reaction conditions more stringent, while lower temperatures less so. Additional details and explanation of stringency of hybridization reactions, are found e.g., in Ausubel et al., Current Protocols in Molecular Biology, Wiley Interscience Publishers, (1995).

"Stringent conditions" or "high stringency conditions", as defined herein, typically: (1) employ low ionic strength and high temperature for washing, for example 0.015 M sodium chloride/0.0015 M sodium citrate/0.1% sodium dodecyl sulfate at 50°C; (2) employ a denaturing agent during hybridization, such as formamide, for example, 50% (v/v) formamide with 0.1% bovine serum albumin/0.1% Ficoll/0.1% polyvinylpyrrolidone/50 mM sodium phosphate buffer at pH 6.5 with 750 mM sodium chloride, 75 mM sodium citrate at 42°C; or (3) employ 50% formamide, 5X SSC (0.75 M NaCl, 0.075 M sodium citrate), 50 mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5X, Denhardt's solution, sonicated salmon sperm DNA (50 μg/ml), 0.1% SDS, and 10% dextran sulfate at 42°C, with washes at 42°C in 0.2X SSC (sodium chloride/sodium citrate) and 50% fortnamide at 55⁰C, followed by a high-stringency wash comprising 0.1X SSC containing EDTA at 55⁰C.

"Moderately stringent conditions" may be identified as described by Sambrook et al., Molecular Cloning: A Laboratory Manual, New York: Cold Spring Harbor Press, 1989, and include the use of washing solution and hybridization conditions (e. g., temperature, ionic strength, and % SDS) less stringent that those described above. An example of moderately stringent conditions is overnight incubation at 37°C in a solution comprising: 20% formamide, 5X SSC (150 mM NaCl₅ 15 mM trisodium citrate), 50 mM sodium phosphate (pH 7.6), 5X Denhardt's solution, 10% dextran sulfate, and 20 mg/ml denatured sheared salmon sperm DNA, followed by washing the filters in IX SSC at about 37-50⁰C. The skilled artisan will recognize how to adjust the temperature, ionic strength, etc. as necessary to accommodate factors such as probe length and the like.

The practice of the present invention will employ, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, and biochemistry, which are within the skill of the art. Such techniques are explained fully in the literature, such as, Molecular Cloning: A Laboratory Manual, 2nd edition, Sambrook et al., 1989; Oligonucleotide Synthesis, M.J. Gait, ed., 1984; Animal Cell Culture, R.I. Freshney, ed., 1987; Methods in Enzymology, Academic Press, Inc.; Handbook of Experimental Immunology, 4th edition, D.M. Weir & C.C. Blackwell, eds., Blackwell Science Inc., 1987; Gene Transfer Vectors for Mammalian Cells, J.M. Miller & M.P. Calos, eds., 1987; Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., 1987; and PCR: The Polymerase Chain Reaction, Mullis et al., eds., 1994.

Description of Embodiments of the Invention

In colorectal cancer, discordant results have been reported for prognostic markers. The present invention discloses the use of microarrays to reach a firmer conclusion, and to determine the prognostic role of specific prognostic signatures in colorectal cancer. The microarray-based studies shown herein indicate that particular prognostic signatures in colorectal cancer are associated with a prognosis. The invention can therefore be used to identify patients at high risk of recurrence of cancer, or patients with a high likelihood of recovery.

The present invention provides for markers for the determination of disease prognosis, for example, the likelihood of recurrence of tumours, including colorectal tumours.

Using the methods of the invention, it has been found that numerous markers are associated with the prognosis of colorectal cancer, and can be used to predict disease outcome. Microarray analysis of samples taken from patients with various stages of colorectal tumours has led to the surprising discovery that specific patterns of marker expression are associated with prognosis of the cancer. The present invention therefore provides for a set of genes, outlined in Table 1 and Table 2, that are differentially expressed in recurrent and non-recurrent colorectal cancers. The genes outlined in Table 1 and Table 2 provide for a set of colorectal cancer prognostic markers (CCPMs).

A decrease in certain colorectal cancer prognostic markers (CCPMs), for example, markers associated with immune responses, is indicative of a particular prognosis. This can include increased likelihood of cancer recurrence after standard treatment, especially for colorectal cancer. Conversely, an increase in other CCPMs is indicative of a particular prognosis. This can include disease progression or the increased likelihood of cancer recurrence, especially for colorectal cancer. A decrease or increase in expression can be determined, for example, by comparison of a test sample, e.g., a patient's tumour sample, to a reference sample, e.g., a sample associated with a known prognosis. In particular, one or more samples from patient(s) with non-recurrent cancer could be used as a reference sample.

For example, to obtain a prognosis, expression levels in a patient's sample (e.g., tumour sample) can be compared to samples from patients with a known outcome. If the patient's sample shows increased or decreased expression of one or more CCPMs that is comparable to that of samples with good outcome (no recurrence), then a positive prognosis, i.e., that recurrence is unlikely, is implicated. If the patient's sample shows expression of one or more CCPMs that is comparable to that of samples with poor outcome (recurrence), then a negative prognosis, i.e., that recurrence of the tumour is likely, is implicated. As further examples, the expression levels of a prognostic signature comprising two or more CCPMs from a patient's sample (e.g., tumour sample) can be compared to samples of recurrent/non-recurrent cancer. If the patient's sample shows increased or decreased expression of CCPMs by comparison to samples of non-recurrent cancer, and/or comparable expression to samples of recurrent cancer, then a negative prognosis is implicated. If the patient's sample shows expression of CCPMs that is comparable to samples of non-recurrent cancer, and/or lower or higher expression than samples of recurrent cancer, then a positive prognosis is implicated.
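One simple way to make such a comparison concrete is sketched below, by way of example only: the patient's expression profile for the markers of a signature is compared with the mean profile of reference samples of non-recurrent cancer and with the mean profile of reference samples of recurrent cancer, and the prognosis follows whichever reference group the patient's profile is closer to. The use of mean profiles and Euclidean distance is an assumption made for this illustration, and the function names and expression values are hypothetical.

```python
import math


def mean_profile(samples):
    """Average expression per marker across a list of reference samples."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prognosis(patient, non_recurrent_refs, recurrent_refs):
    """Return 'positive' if the patient's profile is closer to the non-recurrent
    references, otherwise 'negative' (ties are reported as 'negative')."""
    d_good = distance(patient, mean_profile(non_recurrent_refs))
    d_poor = distance(patient, mean_profile(recurrent_refs))
    return "positive" if d_good < d_poor else "negative"


# Toy reference data: expression of a three-marker signature per sample.
non_recurrent_refs = [[7.9, 5.1, 6.0], [8.2, 4.8, 6.3], [8.0, 5.0, 5.9]]
recurrent_refs = [[6.1, 6.9, 6.1], [5.8, 7.2, 6.0], [6.0, 7.0, 6.2]]

patient_a = [7.8, 5.2, 6.1]  # resembles the non-recurrent references
patient_b = [6.2, 6.8, 6.0]  # resembles the recurrent references
result_a = prognosis(patient_a, non_recurrent_refs, recurrent_refs)
result_b = prognosis(patient_b, non_recurrent_refs, recurrent_refs)
```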

As one approach, a prediction method can be applied to a panel of markers, for example the panel of CCPMs outlined in Table 1 and Table 2, in order to generate a predictive model. This involves the generation of a prognostic signature, comprising two or more CCPMs.

The disclosed CCPMs in Table 1 and Table 2 therefore provide a useful set of markers to generate prediction signatures for determining the prognosis of cancer, and establishing a treatment regime, or treatment modality, specific for that tumour. In particular, a positive prognosis can be used by a patient to decide to pursue standard or less invasive treatment options. A negative prognosis can be used by a patient to decide to terminate treatment or to pursue highly aggressive or experimental treatments. In addition, a patient can choose treatments based on their impact on the expression of prognostic markers (e.g., CCPMs).

Levels of CCPMs can be detected in tumour tissue, tissue proximal to the tumour, lymph node samples, blood samples, serum samples, urine samples, or faecal samples, using any suitable technique, and can include, but is not limited to, oligonucleotide probes, quantitative PCR, or antibodies raised against the markers. It will be appreciated that by analyzing the presence and amounts of expression of a plurality of CCPMs in the form of prediction signatures, and constructing a prognostic signature (e.g., as set forth in Tables 3, 4, 8A, 8B, and 9), the sensitivity and accuracy of prognosis will be increased. Therefore, multiple markers according to the present invention can be used to determine the prognosis of a cancer. The invention includes the use of archived paraffin-embedded biopsy material for assay of the markers in the set, and therefore is compatible with the most widely available type of biopsy material. It is also compatible with several different methods of tumour tissue harvest, for example, via core biopsy or fine needle aspiration. In certain aspects, RNA is isolated from a fixed, wax-embedded cancer tissue specimen of the patient. Isolation may be performed by any technique known in the art, for example from core biopsy tissue or fine needle aspirate cells.

In one aspect, the invention relates to a method of predicting a prognosis, e.g., the likelihood of long-term survival of a cancer patient without the recurrence of cancer, comprising determining the expression level of one or more prognostic markers or their expression products in a sample obtained from the patient, normalized against the expression level of other RNA transcripts or their products in the sample, or of a reference set of RNA transcripts or their expression products. In specific aspects, the prognostic marker is one or more markers listed in Tables 1, 2, or 5, or is included as one or more of the prognostic signatures derived from the markers listed in Tables 1, 2, and 5, or the prognostic signatures listed in Tables 3, 4, 8A, 8B, or 9.

In further aspects, the expression levels of the prognostic markers or their expression products are determined, e.g., for the markers listed in Tables 1, 2, or 5, or a prognostic signature derived from the markers listed in Tables 1, 2, and 5, e.g., for the prognostic signatures listed in Tables 3, 4, 8A, 8B, or 9. In another aspect, the method comprises the determination of the expression levels of a full set of prognosis markers or their expression products, e.g., for the markers listed in Tables 1, 2, or 5, or a prognostic signature derived from the markers listed in Tables 1, 2, and 5, e.g., for the prognostic signatures listed in Tables 3, 4, 8A, 8B, or 9.

In an additional aspect, the invention relates to an array (e.g., microarray) comprising polynucleotides hybridizing to two or more markers, e.g., for the markers listed in Tables 1, 2, and 5, or a prognostic signature derived from the markers listed in Tables 1, 2, and 5, e.g., the prognostic signatures listed in Tables 3, 4, 8A, 8B, and 9. In particular aspects, the array comprises polynucleotides hybridizing to a prognostic signature derived from the markers listed in Tables 1, 2, and 5, e.g., the prognostic signatures listed in Tables 3, 4, 8A, 8B, or 9. In another specific aspect, the array comprises polynucleotides hybridizing to the full set of markers, e.g., for the markers listed in Tables 1, 2, or 5, or, e.g., for the prognostic signatures listed in Tables 3, 4, 8A, 8B, or 9.

For these arrays, the polynucleotides can be cDNAs, or oligonucleotides, and the solid surface on which they are displayed can be glass, for example. The polynucleotides can hybridize to one or more of the markers as disclosed herein, for example, to the full-length sequences, any coding sequences, any fragments, or any complements thereof. In particular aspects, an increase or decrease in expression levels of one or more CCPM indicates a decreased likelihood of long-term survival, e.g., due to cancer recurrence, while a lack of an increase or decrease in expression levels of one or more CCPM indicates an increased likelihood of long-term survival without cancer recurrence.

Table 1: Colorectal Cancer Predictive Markers (corresponding to Affymetrix GeneChip probes that show statistically significant differential expression, P<0.05, as ascertained by BRB Array Tools)

Table 2: Markers with expression correlating to that of the 22 genes from NZ signature.

General approaches to prognostic marker detection

The following approaches are non-limiting methods that can be used to detect the proliferation markers, including CCPM family members: microarray approaches using oligonucleotide probes selective for a CCPM; real-time qPCR on tumour samples using CCPM specific primers and probes; real-time qPCR on lymph node, blood, serum, faecal, or urine samples using CCPM specific primers and probes; enzyme-linked immunological assays (ELISA); immunohistochemistry using anti-marker antibodies; and analysis of array or qPCR data using computers.


Other useful methods include northern blotting and in situ hybridization (Parker and Barnes, Methods in Molecular Biology 106: 247-283 (1999)); RNase protection assays (Hod, BioTechniques 13: 852-854 (1992)); reverse transcription polymerase chain reaction (RT-PCR; Weis et al., Trends in Genetics 8: 263-264 (1992)); serial analysis of gene expression (SAGE; Velculescu et al., Science 270: 484-487 (1995); and Velculescu et al., Cell 88: 243-51 (1997)), MassARRAY technology (Sequenom, San Diego, CA), and gene expression analysis by massively parallel signature sequencing (MPSS; Brenner et al., Nature Biotechnology 18: 630-634 (2000)). Alternatively, antibodies may be employed that can recognize specific complexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-polypeptide duplexes.

Primary data can be collected and fold change analysis can be performed, for example, by comparison of marker expression levels in tumour tissue and non-tumour tissue; by comparison of marker expression levels to levels determined in recurring tumours and non-recurring tumours; by comparison of marker expression levels to levels determined in tumours with or without metastasis; by comparison of marker expression levels to levels determined in differently staged tumours; or by comparison of marker expression levels to levels determined in cells with different levels of proliferation. A negative or positive prognosis is determined based on this analysis. Further analysis of tumour marker expression includes matching those markers exhibiting increased or decreased expression with expression profiles of known colorectal tumours to provide a prognosis.
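As an illustration of fold change analysis, the sketch below computes a log2 fold change between two groups of expression measurements for a single marker. The function name and the example values are hypothetical, and the inputs are assumed to be positive, linear-scale (not log-transformed) expression levels.

```python
import math
from statistics import mean

def log2_fold_change(group_a, group_b):
    """log2 of the ratio of mean expression in group_a to mean expression in group_b.
    Positive values mean higher expression in group_a; negative values mean lower."""
    return math.log2(mean(group_a) / mean(group_b))

# Hypothetical marker levels in recurring vs. non-recurring tumours.
recurring = [8.0, 8.0]
non_recurring = [2.0, 2.0]
print(log2_fold_change(recurring, non_recurring))  # 4-fold higher, i.e. log2 = 2.0
```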

A threshold for concluding that expression is increased will be dependent on the particular marker and also the particular predictive model that is to be applied. The threshold is generally set to achieve the highest sensitivity and selectivity with the lowest error rate, although variations may be desirable for a particular clinical situation. The desired threshold is determined by analysing a population of sufficient size taking into account the statistical variability of any predictive model and is calculated from the size of the sample used to produce the predictive model. The same applies for the determination of a threshold for concluding that expression is decreased. It can be appreciated that other thresholds, or methods for establishing a threshold, for concluding that increased or decreased expression has occurred can be selected without departing from the scope of this invention.

It is also possible that a prediction model may produce as its output a numerical value, for example a score, likelihood value or probability. In these instances, it is possible to apply thresholds to the results produced by prediction models, and in these cases similar principles apply as those used to set thresholds for expression values.
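One way to set such a threshold can be sketched as follows. The sketch assumes that the criterion is Youden's J statistic (sensitivity + specificity - 1), which is only one of several reasonable choices and is not stated in this specification; the function name and data are hypothetical.

```python
def best_threshold(scores, recurred):
    """Scan candidate thresholds and return (threshold, J, sensitivity, specificity)
    for the threshold maximising Youden's J. A sample is predicted to recur
    when its score is greater than or equal to the threshold."""
    positives = sum(recurred)
    negatives = len(recurred) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, r in zip(scores, recurred) if r and s >= t)
        tn = sum(1 for s, r in zip(scores, recurred) if not r and s < t)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (t, j, sensitivity, specificity)
    return best

# Hypothetical model scores and known outcomes (True = cancer recurred).
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
recurred = [False, False, False, True, True, True]
print(best_threshold(scores, recurred))
```

A different clinical situation might weight sensitivity and specificity unequally, in which case the criterion inside the loop would change while the scan itself stays the same.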

Once the expression level, or output of a prediction model, of a predictive signature in a tumour sample has been obtained, the likelihood of the cancer recurring can then be determined.

From the markers identified, prognostic signatures comprising one or more CCPMs can be used to determine the prognosis of a cancer, by comparing the expression level of the one or more markers to the disclosed prognostic signature. By comparing the expression of one or more of the CCPMs in a tumour sample with the disclosed prognostic signature, the likelihood of the cancer recurring can be determined. The comparison of expression levels of the prognostic signature to establish a prognosis can be done by applying a predictive model as described previously.

Determining the likelihood of the cancer recurring is of great value to the medical practitioner. A high likelihood of recurrence means that a longer or higher dose treatment should be given, and the patient should be more closely monitored for signs of recurrence of the cancer. An accurate prognosis is also of benefit to the patient. It allows the patient, along with their partners, family, and friends, to make decisions about treatment, as well as decisions about their future and lifestyle changes. Therefore, the invention also provides for a method of establishing a treatment regime for a particular cancer based on the prognosis established by matching the expression of the markers in a tumour sample with the differential expression signature.

It will be appreciated that the marker selection, or construction of a prognostic signature, does not have to be restricted to the CCPMs disclosed in Tables 1, 2, or 5, herein, or the prognostic signatures disclosed in Tables 3, 4, 8A, 8B, and 9, but could involve the use of one or more CCPMs from the disclosed signatures, or a new signature may be established using CCPMs selected from the disclosed marker lists. The requirement of any signature is that it predicts the likelihood of recurrence with enough accuracy to assist a medical practitioner to establish a treatment regime.

Reverse Transcription PCR (RT-PCR)

Of the techniques listed above, the most sensitive and most flexible quantitative method is RT-PCR, which can be used to compare RNA levels in different sample populations, in normal and tumour tissues, with or without drug treatment, to characterize patterns of expression, to discriminate between closely related RNAs, and to analyze RNA structure.

For RT-PCR, the first step is the isolation of RNA from a target sample. The starting material is typically total RNA isolated from human tumours or tumour cell lines, and corresponding normal tissues or cell lines, respectively. RNA can be isolated from a variety of samples, such as tumour samples from breast, lung, colon (e.g., large bowel or small bowel), colorectal, gastric, esophageal, anal, rectal, prostate, brain, liver, kidney, pancreas, spleen, thymus, testis, ovary, uterus, etc., tissues, from primary tumours, or tumour cell lines, and from pooled samples from healthy donors. If the source of RNA is a tumour, RNA can be extracted, for example, from frozen or archived paraffin-embedded and fixed (e.g., formalin-fixed) tissue samples.

The first step in gene expression profiling by RT-PCR is the reverse transcription of the RNA template into cDNA, followed by its exponential amplification in a PCR reaction. The two most commonly used reverse transcriptases are avian myeloblastosis virus reverse transcriptase (AMV-RT) and Moloney murine leukaemia virus reverse transcriptase (MMLV-RT). The reverse transcription step is typically primed using specific primers, random hexamers, or oligo-dT primers, depending on the circumstances and the goal of expression profiling. For example, extracted RNA can be reverse-transcribed using a GeneAmp RNA PCR kit (Perkin Elmer, CA, USA), following the manufacturer's instructions. The derived cDNA can then be used as a template in the subsequent PCR reaction.

Although the PCR step can use a variety of thermostable DNA-dependent DNA polymerases, it typically employs the Taq DNA polymerase, which has a 5'-3' nuclease activity but lacks a 3'-5' proofreading endonuclease activity. Thus, TaqMan® PCR typically utilizes the 5' nuclease activity of Taq or Tth polymerase to hydrolyze a hybridization probe bound to its target amplicon, but any enzyme with equivalent 5' nuclease activity can be used.

Two oligonucleotide primers are used to generate an amplicon typical of a PCR reaction. A third oligonucleotide, or probe, is designed to detect nucleotide sequence located between the two PCR primers. The probe is non-extendible by Taq DNA polymerase enzyme, and is labeled with a reporter fluorescent dye and a quencher fluorescent dye. Any laser-induced emission from the reporter dye is quenched by the quenching dye when the two dyes are located close together as they are on the probe. During the amplification reaction, the Taq DNA polymerase enzyme cleaves the probe in a template-dependent manner. The resultant probe fragments disassociate in solution, and signal from the released reporter dye is free from the quenching effect of the second fluorophore. One molecule of reporter dye is liberated for each new molecule synthesized, and detection of the unquenched reporter dye provides the basis for quantitative interpretation of the data.

TaqMan RT-PCR can be performed using commercially available equipment, such as, for example, ABI PRISM 7700 Sequence Detection System (Perkin-Elmer-Applied Biosystems, Foster City, CA, USA), or Lightcycler (Roche Molecular Biochemicals, Mannheim, Germany). In a preferred embodiment, the 5' nuclease procedure is run on a real-time quantitative PCR device such as the ABI PRISM 7700™ Sequence Detection System. The system consists of a thermocycler, laser, charge-coupled device (CCD), camera, and computer. The system amplifies samples in a 96-well format on a thermocycler. During amplification, laser-induced fluorescent signal is collected in real-time through fibre optics cables for all 96 wells, and detected at the CCD. The system includes software for running the instrument and for analyzing the data.

5' nuclease assay data are initially expressed as Ct, or the threshold cycle. As discussed above, fluorescence values are recorded during every cycle and represent the amount of product amplified to that point in the amplification reaction. The point when the fluorescent signal is first recorded as statistically significant is the threshold cycle.

To minimize errors and the effect of sample-to-sample variation, RT-PCR is usually performed using an internal standard. The ideal internal standard is expressed at a constant level among different tissues, and is unaffected by the experimental treatment. RNAs most frequently used to normalize patterns of gene expression are mRNAs for the housekeeping genes glyceraldehyde-3-phosphate-dehydrogenase (GAPDH) and β-actin.
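Normalization of Ct values against an internal standard can be sketched as follows. The sketch assumes the widely used comparative Ct (2^-ΔΔCt) calculation and roughly 100% amplification efficiency (a doubling of product per cycle); neither assumption is stated in this specification, and the function names and Ct values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Ct of the marker minus Ct of the internal standard (e.g., a housekeeping gene)."""
    return ct_target - ct_reference

def relative_expression(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Fold difference in marker expression in the test sample relative to the
    control sample, each first normalized to its own internal standard."""
    ddct = delta_ct(ct_target_test, ct_ref_test) - delta_ct(ct_target_control, ct_ref_control)
    return 2.0 ** (-ddct)

# Hypothetical Ct values: the marker crosses threshold 2 cycles earlier in the
# test sample while the internal standard is unchanged, giving a 4-fold increase.
print(relative_expression(24.0, 18.0, 26.0, 18.0))
```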

Real-time quantitative PCR (qPCR)

A more recent variation of the RT-PCR technique is the real time quantitative PCR, which measures PCR product accumulation through a dual-labeled fluorigenic probe (i.e., TaqMan probe). Real time PCR is compatible both with quantitative competitive PCR and with quantitative comparative PCR. The former uses an internal competitor for each target sequence for normalization, while the latter uses a normalization gene contained within the sample, or a housekeeping gene for RT-PCR. Further details are provided, e.g., by Heid et al., Genome Research 6: 986-994 (1996).

Expression levels can be determined using fixed, paraffin-embedded tissues as the RNA source. According to one aspect of the present invention, PCR primers and probes are designed based upon intron sequences present in the gene to be amplified.

In this embodiment, the first step in the primer/probe design is the delineation of intron sequences within the genes. This can be done by publicly available software, such as the DNA BLAT software developed by Kent, W. J., Genome Res. 12 (4): 656-64 (2002), or by the BLAST software including its variations. Subsequent steps follow well established methods of PCR primer and probe design.

In order to avoid non-specific signals, it is useful to mask repetitive sequences within the introns when designing the primers and probes. This can be easily accomplished by using the Repeat Masker program available on-line through the Baylor College of Medicine, which screens DNA sequences against a library of repetitive elements and returns a query sequence in which the repetitive elements are masked. The masked sequences can then be used to design primer and probe sequences using any commercially or otherwise publicly available primer/probe design packages, such as Primer Express (Applied Biosystems); MGB assay-by-design (Applied Biosystems); Primer3 (Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers in: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386).

The most important factors considered in PCR primer design include primer length, melting temperature (T_m), G/C content, specificity, complementary primer sequences, and 3' end sequence. In general, optimal PCR primers are 17-30 bases in length, and contain about 20-80%, such as, for example, about 50-60% G+C bases. Melting temperatures between 50 and 80 °C, e.g., about 50 to 70 °C, are typically preferred. For further guidelines for PCR primer and probe design see, e.g., Dieffenbach, C. W. et al., General Concepts for PCR Primer Design in: PCR Primer, A Laboratory Manual, Cold Spring Harbor Laboratory Press, New York, 1995, pp. 133-155; Innis and Gelfand, Optimization of PCRs in: PCR Protocols, A Guide to Methods and Applications, CRC Press, London, 1994, pp. 5-11; and Plasterer, T. N. Primerselect: Primer and probe design. Methods Mol. Biol. 70: 520-527 (1997), the entire disclosures of which are hereby expressly incorporated by reference.
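The length, G+C content, and melting temperature ranges given above can be checked with a short sketch. The melting temperature here uses the Wallace rule, T_m = 2(A+T) + 4(G+C), which is a rough approximation suited to short oligonucleotides and is an assumption of this sketch rather than a method named in this specification. The function names and the primer sequence are hypothetical.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Approximate melting temperature in degrees C using the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def check_primer(primer):
    """Compare a primer against the general ranges: 17-30 bases,
    about 20-80% G+C, and a melting temperature of 50-80 degrees C."""
    return {
        "length_ok": 17 <= len(primer) <= 30,
        "gc_ok": 20.0 <= gc_percent(primer) <= 80.0,
        "tm_ok": 50 <= wallace_tm(primer) <= 80,
    }

# Hypothetical 20-mer, for illustration only.
print(check_primer("ATGCATGCATGCATGCATGC"))
```

Specificity, primer-primer complementarity, and 3' end sequence are not covered by this sketch and would normally be assessed with a dedicated design package.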

Microarray analysis

Differential expression can also be identified, or confirmed, using the microarray technique. Thus, the expression profile of CCPMs can be measured in either fresh or paraffin-embedded tumour tissue, using microarray technology. In this method, polynucleotide sequences of interest (including cDNAs and oligonucleotides) are plated, or arrayed, on a microchip substrate. The arrayed sequences (i.e., capture probes) are then hybridized with specific polynucleotides from cells or tissues of interest (i.e., targets). Just as in the RT-PCR method, the source of RNA typically is total RNA isolated from human tumours or tumour cell lines, and corresponding normal tissues or cell lines. Thus RNA can be isolated from a variety of primary tumours or tumour cell lines. If the source of RNA is a primary tumour, RNA can be extracted, for example, from frozen or archived formalin-fixed paraffin-embedded (FFPE) tissue samples and fixed (e.g., formalin-fixed) tissue samples, which are routinely prepared and preserved in everyday clinical practice.

In a specific embodiment of the microarray technique, PCR amplified inserts of cDNA clones are applied to a substrate. The substrate can include up to 1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 75 nucleotide sequences. In other aspects, the substrate can include at least 10,000 nucleotide sequences. The microarrayed sequences, immobilized on the microchip, are suitable for hybridization under stringent conditions. As other embodiments, the targets for the microarrays can be at least 50, 100, 200, 400, 500, 1000, or 2000 bases in length; or 50-100, 100-200, 100-500, 100-1000, 100-2000, or 500-5000 bases in length. As further embodiments, the capture probes for the microarrays can be at least 10, 15, 20, 25, 50, 75, 80, or 100 bases in length; or 10-15, 10-20, 10-25, 10-50, 10-75, 10-80, or 20-80 bases in length.

Fluorescently labeled cDNA probes may be generated through incorporation of fluorescent nucleotides by reverse transcription of RNA extracted from tissues of interest. Labeled cDNA probes applied to the chip hybridize with specificity to each spot of DNA on the array. After stringent washing to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed element allows for assessment of corresponding mRNA abundance. With dual colour fluorescence, separately labeled cDNA probes generated from two sources of RNA are hybridized pairwise to the array. The relative abundance of the transcripts from the two sources corresponding to each specified gene is thus determined simultaneously. An exemplary protocol for this is described in detail in Example 4.

The miniaturized scale of the hybridization affords a convenient and rapid evaluation of the expression pattern for large numbers of genes. Such methods have been shown to have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels (Schena et al., Proc. Natl. Acad. Sci. USA 93 (2): 106-149 (1996)). Microarray analysis can be performed by commercially available equipment, following manufacturer's protocols, such as by using the Affymetrix GeneChip technology, Illumina microarray technology or Incyte's microarray technology. The development of microarray methods for large-scale analysis of gene expression makes it possible to search systematically for molecular markers of cancer classification and outcome prediction in a variety of tumour types.

RNA isolation, purification, and amplification

General methods for mRNA extraction are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., Current Protocols of Molecular Biology, John Wiley and Sons (1997). Methods for RNA extraction from paraffin embedded tissues are disclosed, for example, in Rupp and Locker, Lab Invest. 56: A67 (1987), and De Sandres et al., BioTechniques 18: 42-44 (1995). In particular, RNA isolation can be performed using a purification kit, buffer set, and protease from commercial manufacturers, such as Qiagen, according to the manufacturer's instructions. For example, total RNA from cells in culture can be isolated using Qiagen RNeasy mini-columns. Other commercially available RNA isolation kits include MasterPure Complete DNA and RNA Purification Kit (EPICENTRE®, Madison, WI), and Paraffin Block RNA Isolation Kit (Ambion, Inc.). Total RNA from tissue samples can be isolated using RNA Stat-60 (Tel-Test). RNA prepared from tumour can be isolated, for example, by cesium chloride density gradient centrifugation.

The steps of a representative protocol for profiling gene expression using fixed, paraffin-embedded tissues as the RNA source, including mRNA isolation, purification, primer extension and amplification are given in various published journal articles (for example: T. E. Godfrey et al. J. Molec. Diagnostics 2: 84-91 (2000); K. Specht et al., Am. J. Pathol. 158: 419-29 (2001)). Briefly, a representative process starts with cutting about 10 μm thick sections of paraffin-embedded tumour tissue samples. The RNA is then extracted, and protein and DNA are removed. After analysis of the RNA concentration, RNA repair and/or amplification steps may be included, if necessary, and RNA is reverse transcribed using gene specific primers followed by RT-PCR. Finally, the data are analyzed to identify the best treatment option(s) available to the patient on the basis of the characteristic gene expression pattern identified in the tumour sample examined.

Immunohistochemistry and proteomics

Immunohistochemistry methods are also suitable for detecting the expression levels of the proliferation markers of the present invention. Thus, antibodies or antisera, preferably polyclonal antisera, and most preferably monoclonal antibodies specific for each marker, are used to detect expression. The antibodies can be detected by direct labeling of the antibodies themselves, for example, with radioactive labels, fluorescent labels, hapten labels such as biotin, or an enzyme such as horseradish peroxidase or alkaline phosphatase. Alternatively, unlabeled primary antibody is used in conjunction with a labeled secondary antibody, comprising antisera, polyclonal antisera or a monoclonal antibody specific for the primary antibody. Immunohistochemistry protocols and kits are well known in the art and are commercially available.

Proteomics can be used to analyze the polypeptides present in a sample (e.g., tissue, organism, or cell culture) at a certain point in time. In particular, proteomic techniques can be used to assess the global changes of polypeptide expression in a sample (also referred to as expression proteomics). Proteomic analysis typically includes: (1) separation of individual polypeptides in a sample by 2-D gel electrophoresis (2-D PAGE); (2) identification of the individual polypeptides recovered from the gel, e.g., by mass spectrometry or N-terminal sequencing, and (3) analysis of the data using bioinformatics. Proteomics methods are valuable supplements to other methods of gene expression profiling, and can be used, alone or in combination with other methods, to detect the products of the proliferation markers of the present invention.

Once the expression level of one or more prognostic markers in a tumour sample has been assessed, the likelihood of the cancer recurring can then be determined. The inventors have identified a number of markers that are differentially expressed in non-recurring colorectal cancers compared to recurring colorectal cancers in patient data sets. The markers are set out in Tables 1, 2, and 9, in the examples below.

Selection of Differentially Expressed Genes

An early approach to the selection of genes deemed significant involved simply looking at the "fold change" of a given gene between the two groups of interest. While this approach homes in on genes that seem to change the most spectacularly, consideration of basic statistics leads one to realize that if the variance (or noise level) is quite high (as is often seen in microarray experiments), then seemingly large fold-change can happen frequently by chance alone.

Microarray experiments, such as those described here, typically involve the simultaneous measurement of thousands of genes. If one is comparing the expression levels for a particular gene between two groups (for example recurrent and non-recurrent tumours), the typical tests for significance (such as the t-test) are not adequate. This is because, in an ensemble of thousands of experiments (in this context each gene constitutes an "experiment"), the probability of at least one experiment passing the usual criteria for significance by chance alone is essentially unity. In a test for significance, one typically calculates the probability that the "null hypothesis" is correct. In the case of comparing two groups, the null hypothesis is that there is no difference between the two groups. If a statistical test produces a probability for the null hypothesis below some threshold (usually 0.05 or 0.01), it is stated that we can reject the null hypothesis, and accept the hypothesis that the two groups are significantly different. Clearly, in such a test, a rejection of the null hypothesis by chance alone could be expected 1 in 20 times (or 1 in 100). The use of t-tests, or other similar statistical tests for significance, fails in the context of microarrays, producing far too many false positives (or type I errors).
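The scale of the problem can be made concrete with a short calculation. The sketch below assumes that the per-gene tests are independent (real expression data are correlated, so the figures are only indicative); the function names and the gene count are hypothetical.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Probability that at least one of n_tests true-null tests is declared
    significant at level alpha, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests

# With a single test the chance of a false positive is alpha itself;
# with thousands of genes it is essentially unity.
print(prob_at_least_one_false_positive(0.05, 1))
print(prob_at_least_one_false_positive(0.05, 1000))
print(expected_false_positives(0.05, 20000))  # about 1000 genes by chance alone
```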

In this type of situation, where one is testing multiple hypotheses at the same time, one applies typical multiple comparison procedures, such as the Bonferroni Method (43). However such tests are too conservative for most microarray experiments, resulting in too many false negative (type II) errors.
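A minimal sketch of the Bonferroni method follows: each p-value is compared against the significance level divided by the number of tests. The function name and the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if its p-value is at or below
    alpha divided by the total number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

# Hypothetical p-values for four genes; the per-test cutoff is 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.001, 0.02, 0.04, 0.5]))
```

With thousands of genes the cutoff becomes extremely small, which is why the method tends to miss genuinely differentially expressed genes.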

A more recent approach is to do away with attempting to apply a probability for a given test being significant, and establish a means for selecting a subset of experiments, such that the expected proportion of Type I errors (or false discovery rate; 47) is controlled for. It is this approach that has been used in this investigation, through various implementations, namely the methods provided with BRB Array Tools (48), and the limma (11,42) package of Bioconductor (that uses the R statistical environment; 10,39).
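Control of the false discovery rate can be sketched with the Benjamini-Hochberg step-up procedure. This is offered as a commonly used example of such a procedure; it is an assumption of this sketch and is not a statement of how BRB Array Tools or limma were configured in this investigation. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking the tests selected by the
    Benjamini-Hochberg step-up procedure at the given false discovery rate."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= fdr * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            max_rank = rank
    # Select every test whose rank is at or below that largest rank.
    selected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            selected[idx] = True
    return selected

# Hypothetical p-values for four genes.
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))
```

For these four p-values the procedure selects the first two genes, whereas a Bonferroni cutoff of 0.05 / 4 = 0.0125 would keep only the first, illustrating the less conservative behaviour of false discovery rate control.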

General methodology for Data Mining: Generation of Prognostic Signatures

Data Mining is the term used to describe the extraction of "knowledge", in other words the "know-how", or predictive ability from (usually) large volumes of data (the dataset). This is the approach used in this study to generate prognostic signatures. In the case of this study the "know-how" is the ability to accurately predict prognosis from a given set of gene expression measurements, or "signature" (as described generally in this section and in more detail in the examples section).

The specific details of the methods used in this study are described in Examples 17-20. However, application of any of the data mining methods (both those described in the Examples, and those described here) can follow this general protocol.

Data mining (49), and the related topic machine learning (40), is a complex, repetitive mathematical task that involves the use of one or more appropriate computer software packages (see below). The use of software is advantageous on the one hand, in that one does not need to be completely familiar with the intricacies of the theory behind each technique in order to successfully use data mining techniques, provided that one adheres to the correct methodology. The disadvantage is that the application of data mining can often be viewed as a "black box": one inserts the data and receives the answer. How this is achieved is often masked from the end-user (this is the case for many of the techniques described, and can often influence the statistical method chosen for data mining). For example, neural networks and support vector machines have a particularly complex implementation that makes it very difficult for the end user to extract out the "rules" used to produce the decision. On the other hand, k-nearest neighbours and linear discriminant analysis have a very transparent process for decision making that is not hidden from the user.

There are two types of approach used in data mining: supervised and unsupervised approaches. In the supervised approach, the information that is being linked to the data is known, such as categorical data (e.g. recurrent vs. non-recurrent tumours). What is required is the ability to link the observed response (e.g. recurrence vs. non-recurrence) to the input variables. In the unsupervised approach, the classes within the dataset are not known in advance, and data mining methodology is employed to attempt to find the classes or structure within the dataset.

In the present example the supervised approach was used and is discussed in detail here, although it will be appreciated that any of the other techniques could be used.

The overall protocol involves the following steps:

• Data representation. This involves transformation of the data into a form that is most likely to work successfully with the chosen data mining technique. Where the data is numerical, such as in this study where the data being investigated represents relative levels of gene expression, this is fairly simple. If the data covers a large dynamic range (i.e. many orders of magnitude), the log of the data is often taken. If the data covers many measurements of separate samples on separate days by separate investigators, particular care has to be taken to ensure systematic error is minimised. The minimisation of systematic error (i.e. errors resulting from protocol differences, machine differences, operator differences and other quantifiable factors) is the process referred to here as "normalisation".

• Feature Selection. Typically the dataset contains many more data elements than would be practical to measure on a day-to-day basis, and additionally many elements that do not provide the information needed to produce a prediction model. The actual ability of a prediction model to describe a dataset is derived from some subset of the full dimensionality of the dataset. These dimensions are the most important components (or features) of the dataset. Note that in the context of microarray data, the dimensions of the dataset are the individual genes. Feature selection, in the context described here, involves finding those genes which are most "differentially expressed". In a more general sense, it involves those variables which pass some statistical test for significance, i.e. whether the level of a particular variable is consistently higher or lower in one or other of the groups being investigated. Sometimes the features are those variables (or dimensions) which exhibit the greatest variance. The application of feature selection is completely independent of the method used to create a prediction model, and involves a great deal of experimentation to achieve the desired results. Within this invention, the selection of significant genes, and those which correlated with the earlier successful model (the NZ classifier), entailed feature selection. In addition, methods of data reduction (such as principal component analysis) can be applied to the dataset.

• Training. Once the classes (e.g. recurrence/non-recurrence) and the features of the dataset have been established, and the data is represented in a form that is acceptable as input for data mining, the reduced dataset (as described by the features) is applied to the prediction model of choice. The input for this model is usually in the form of a multi-dimensional numerical input (known as a vector), with associated output information (a class label or a response). In the training process, selected data is input into the prediction model, either sequentially (in techniques such as neural networks) or as a whole (in techniques that apply some form of regression, such as linear models, linear discriminant analysis, support vector machines). In some instances (e.g. k-nearest neighbours) the dataset (or subset of the dataset obtained after feature selection) is itself the model. As discussed, effective models can be established with minimal understanding of the detailed mathematics, through the use of various software packages where the parameters of the model have been pre-determined by expert analysts as most likely to lead to successful results.

• Validation. This is a key component of the data-mining protocol, and its incorrect application frequently leads to errors. Portions of the dataset are to be set aside, apart from feature selection and training, to test the success of the prediction model. Furthermore, if the results of validation are used to affect feature selection and training of the model, then one obtains a further validation set to test the model before it is applied to real-life situations. If this process is not strictly adhered to, the model is likely to fail in real-world situations. The methods of validation are described in more detail below.

• Application. Once the model has been constructed and validated, it must be packaged in some way so that it is accessible to end users. This often involves implementation of some form of spreadsheet application into which the model has been embedded, scripting of a statistical software package, or refactoring of the model into a hard-coded application by information technology staff.
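The first two steps of this protocol (data representation and feature selection) can be illustrated with a minimal sketch. It is not the procedure used in the Examples: the expression values are hypothetical toy data, the function names are introduced here for illustration only, and a standard two-sample t statistic is assumed as the ranking criterion.

```python
import math
from statistics import mean, variance

def log2_transform(matrix):
    """Take log2 of every expression value (rows = genes, columns = samples)."""
    return [[math.log2(v) for v in row] for row in matrix]

def t_statistic(group_a, group_b):
    """Standard two-sample t statistic using a pooled variance estimate."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def select_features(matrix, labels, top_n):
    """Rank genes by |t| between the two classes and return the top_n row indices."""
    scores = []
    for i, row in enumerate(matrix):
        a = [v for v, y in zip(row, labels) if y == 1]
        b = [v for v, y in zip(row, labels) if y == 0]
        scores.append((abs(t_statistic(a, b)), i))
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:top_n])

# Hypothetical toy data: 3 genes x 6 samples; label 1 = recurrent, 0 = non-recurrent.
raw = [
    [100, 120, 110, 800, 900, 850],  # gene 0: strongly different between classes
    [50, 55, 52, 51, 54, 53],        # gene 1: no real difference
    [200, 210, 190, 400, 420, 380],  # gene 2: moderately different
]
labels = [0, 0, 0, 1, 1, 1]

logged = log2_transform(raw)
selected = select_features(logged, labels, top_n=2)
print(selected)
```

In this toy case the two genes that differ between the classes are retained and the uninformative gene is dropped; the reduced rows would then be passed to the training step.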

Examples of software packages that are frequently used are:

- Spreadsheet plugins, obtained from multiple vendors.

- The R statistical environment.

- The commercial packages MatLab, S-plus, SAS, SPSS, STATA.

- Free open-source software such as Octave (a MatLab clone)

- many and varied C++ libraries, which can be used to implement prediction models in a commercial, closed-source setting.

Examples of Data Mining Methods.

The methods can be applied by first performing the steps of the data mining process (above), and then applying the appropriate known software packages. Further description of the process of data mining is given in many extremely well-written texts (49).

• Linear models (49, 50): The data is treated as the input of a linear regression model, of which the class labels or response variables are the output. Class labels, or other categorical data, must be transformed into numerical values (usually integer). In generalised linear models, the class labels or response variables are not themselves linearly related to the input data, but are transformed through the use of a "link function". Logistic regression is the most common form of generalised linear model.

• Linear Discriminant analysis (49, 51, 52). Provided the data is linearly separable (i.e. the groups or classes of data can be separated by a hyperplane, which is an n-dimensional extension of a threshold), this technique can be applied. A combination of variables is used to separate the classes, such that the between-group variance is maximised, and the within-group variance is minimised. The byproduct of this is the formation of a classification rule. Application of this rule to samples of unknown class allows predictions or classification of class membership to be made for that sample. There are variations of linear discriminant analysis such as nearest shrunken centroids which are commonly used for microarray analysis.

• Support vector machines (53): A collection of variables is used in conjunction with a collection of weights to determine a model that maximizes the separation between classes in terms of those weighted variables. Application of this model to a sample then produces a classification or prediction of class membership for that sample.

• Neural networks (52): The data is treated as input into a network of nodes, which superficially resemble biological neurons, which apply the input from all the nodes to which they are connected, and transform the input into an output. Commonly, neural networks use the "multiply and sum" algorithm, to transform the inputs from multiple connected input nodes into a single output. A node may not necessarily produce an output unless the inputs to that node exceed a certain threshold. Each node has as its input the output from several other nodes, with the final output node usually being linked to a categorical variable. The number of nodes, and the topology of the nodes can be varied in almost infinite ways, providing for the ability to classify extremely noisy data that may not be possible to categorize in other ways. The most common implementation of neural networks is the multi-layer perceptron.

• Classification and regression trees (54): In these, variables are used to define a hierarchy of rules that can be followed in a stepwise manner to determine the class of a sample. The typical process creates a set of rules which lead to a specific class output, or a specific statement of the inability to discriminate. An example classification tree is an implementation of an algorithm such as: if gene A > x and gene Y > x and gene Z = z then class A, else if gene A = q then class B.

• Nearest neighbour methods (51, 52). Predictions or classifications are made by comparing a sample (of unknown class) to those around it (of known class), with closeness defined by a distance function. It is possible to define many different distance functions. Commonly used distance functions are the Euclidean distance (an extension of the Pythagorean distance, as in triangulation, to n-dimensions) and various forms of correlation (including the Pearson correlation coefficient). There are also transformation functions that convert data points that would not normally be interconnected by a meaningful distance metric into Euclidean space, so that Euclidean distance can then be applied (e.g. Mahalanobis distance). Although the distance metric can be quite complex, the basic premise of k-nearest neighbours is quite simple, essentially being a restatement of "find the k data vectors that are most similar to the unknown input, find out which class they correspond to, and vote as to which class the unknown input is".

• Other methods:

- Bayesian networks. A directed acyclic graph is used to represent a collection of variables in conjunction with their joint probability distribution, which is then used to determine the probability of class membership for a sample.

- Independent components analysis, in which independent signals (e.g., class membership) are isolated (into components) from a collection of variables. These components can then be used to produce a classification or prediction of class membership for a sample.

- Ensemble learning methods, in which a collection of prediction methods are combined to produce a joint classification or prediction of class membership for a sample.
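Of the methods listed above, k-nearest neighbours is the simplest to write out in full. The following is a minimal sketch assuming Euclidean distance and a simple majority vote; the two-gene expression vectors, class names and function names are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Find the k training vectors closest to the query and return the majority class."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression vectors for six tumours of known class.
train = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
         [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]]
classes = ["non-recurrent"] * 3 + ["recurrent"] * 3

print(knn_predict(train, classes, [3.0, 3.0]))  # sample of unknown class
```

Here the training set itself is the model, as noted in the Training step; changing the distance function (for example to one based on correlation) changes only the `euclidean` helper.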

There are many variations of these methodologies that can be explored (49), and many new methodologies are constantly being defined and developed. It will be appreciated that any one of these methodologies can be applied in order to obtain an acceptable result. Particular care must be taken to avoid overfitting, by ensuring that all results are tested via a comprehensive validation scheme.

Validation

Application of any of the prediction methods described involves both training and cross-validation (43, 55) before the method can be applied to new datasets (such as data from a clinical trial). Training involves taking a subset of the dataset of interest (in this case gene expression measurements from colorectal tumours), such that it is stratified across the classes that are being tested for (in this case recurrent and nonrecurrent tumours). This training set is used to generate a prediction model (defined above), which is tested on the remainder of the data (the testing set).

It is possible to alter the parameters of the prediction model so as to obtain better performance in the testing set; however, this can lead to the situation known as overfitting, where the prediction model works on the training dataset but not on any external dataset. In order to circumvent this, the process of validation is followed. There are two major types of validation typically applied. The first (hold-out validation) involves partitioning the dataset into three groups: testing, training, and validation. The validation set has no input into the training process whatsoever, so that any adjustment of parameters or other refinements must take place during application to the testing set (but not the validation set). The second major type is cross-validation, which can be applied in several different ways, described below.
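As an illustration of hold-out validation, the sketch below partitions sample indices into training, testing and validation groups while keeping the class proportions roughly equal in each group. The 60/20/20 split, the random seed and the function name are assumptions made for this example only.

```python
import random

def stratified_three_way_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Partition sample indices into training, testing and validation groups,
    splitting each class separately so that all groups stay stratified."""
    rng = random.Random(seed)
    parts = {"training": [], "testing": [], "validation": []}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_test = round(fractions[1] * len(idx))
        parts["training"] += idx[:n_train]
        parts["testing"] += idx[n_train:n_train + n_test]
        parts["validation"] += idx[n_train + n_test:]  # never used for tuning
    return parts

# Hypothetical cohort: 10 recurrent (1) and 10 non-recurrent (0) tumours.
labels = [1] * 10 + [0] * 10
split = stratified_three_way_split(labels)
print({name: len(members) for name, members in split.items()})
```

Parameter adjustment would use only the training and testing indices; the validation indices are touched once, at the end.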

There are two main sub-types of cross-validation: K-fold cross-validation, and leave-one-out cross-validation.

K-fold cross-validation: The dataset is divided into K subsamples, each subsample containing approximately the same proportions of the class groups as the original. In each round of validation, one of the K subsamples is set aside, and training is accomplished using the remainder of the dataset. The effectiveness of the training for that round is gauged by how correct the classification of the left-out group is. This procedure is repeated K times, and the overall effectiveness ascertained by comparison of the predicted class with the known class.

Leave-one-out cross-validation: A commonly used variation of K-fold cross-validation, in which K = n, where n is the number of samples.
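Both sub-types can be expressed with one routine, since leave-one-out is K-fold with K equal to the number of samples. The sketch below is illustrative only: the round-robin fold assignment, the nearest-centroid classifier used as the prediction model, and the one-gene toy data are all assumptions introduced here.

```python
def k_fold_indices(labels, k):
    """Deal the samples of each class out to the k folds in turn, so that each
    fold has roughly the same class proportions as the whole dataset."""
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(set(labels)):
        for i, y in enumerate(labels):
            if y == cls:
                folds[position % k].append(i)
                position += 1
    return folds

def nearest_centroid(train_x, train_y, query):
    """Predict the class whose mean training vector is closest to the query."""
    best_label, best_dist = None, float("inf")
    for cls in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == cls]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = cls, dist
    return best_label

def cross_validate(vectors, labels, k, train_and_predict):
    """Fraction of left-out samples classified correctly over the k rounds."""
    correct = 0
    for fold in k_fold_indices(labels, k):
        held_out = set(fold)
        train_x = [v for i, v in enumerate(vectors) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        for i in fold:
            if train_and_predict(train_x, train_y, vectors[i]) == labels[i]:
                correct += 1
    return correct / len(labels)

# Hypothetical one-gene data: four non-recurrent (0) and four recurrent (1) samples.
vectors = [[1.0], [1.2], [0.8], [1.1], [3.0], [3.2], [2.9], [3.1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

four_fold_rate = cross_validate(vectors, labels, 4, nearest_centroid)
loocv_rate = cross_validate(vectors, labels, len(labels), nearest_centroid)
print(four_fold_rate, loocv_rate)
```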

Combinations of CCPMs, such as those described above in Tables 1 and 2, can be used to construct predictive models for prognosis.

Prognostic Signatures

Prognostic signatures, comprising one or more of these markers, can be used to determine the outcome of a patient, through application of one or more predictive models derived from the signature. In particular, a clinician or researcher can determine the differential expression (e.g., increased or decreased expression) of the one or more markers in the signature, apply a predictive model, and thereby predict the negative prognosis, e.g., likelihood of disease relapse, of a patient, or alternatively the likelihood of a positive prognosis (continued remission).

A set of prognostic signatures has been developed. In the first instance, there are two signatures developed by cross-comparison of predictive ability between two datasets: the set of microarray experiments encompassing the German colorectal cancer samples, and the set of microarray experiments encompassing the New Zealand samples (discussed in Example 6). In the second instance there has been an exhaustive statistical search for effective signatures based solely on the German dataset (discussed in Example 17).

As described in Example 6 below, a prognostic signature comprising 19 genes has been established from a set of colorectal samples from Germany (Table 4). Another prognostic signature, of 22 genes, has also been established from samples of colorectal tumours from patients in New Zealand (Table 3). By obtaining a patient sample (e.g., tumour sample), and matching the expression levels of one or more markers in the sample to the differential expression profile, the likelihood of the cancer recurring can be determined.

Table 3: New Zealand prognostic signature

Table 4: German prognostic signature

In certain aspects, this invention provides methods for determining the prognosis of a cancer, comprising: (a) providing a sample of the cancer; (b) detecting the expression level of a CCPM family member in said sample; and (c) determining the prognosis of the cancer. In one aspect, the cancer is colorectal cancer.

In other aspects, the invention includes a step of detecting the expression level of a CCPM mRNA. In other aspects, the invention includes a step of detecting the expression level of a CCPM polypeptide. In yet a further aspect, the invention includes a step of detecting the level of a CCPM peptide. In yet another aspect, the invention includes detecting the expression level of more than one CCPM family member in said sample. In a further aspect the CCPM is a gene associated with an immune response. In a further aspect the CCPM is selected from the markers set forth in Tables 3, 4, 8A, 8B, or 9. In a still further aspect, the CCPM is included in a signature selected from the signatures set forth in Tables 3, 4, 8A, 8B, or 9.

In a further aspect the invention comprises detecting the expression level of: WDR44, RBMS1, SACM1L, SOAT1, PBK, G3BP2, ZBTB20, ZNF410, COMMD2, PSMC1, COX10, GTF3C5, HMMR, UBE2L3, GNAS, PPP2R2A, RNASE2, SCOC, PSMD9, EIF3S7, ATP2B4, and ABCC9. In a further aspect the invention comprises detecting the expression level of: CXCL10, FAS, CXCL9, TLK1, CXCL11, PBK, PSAT1, MAD2L1, CA2, GZMB, SLC4A4, DLG7, TNFRSF11A, KITLG, INDO, GBP1, CXCL13, CLCA4, and PCP4.

In still further aspects, the invention includes a method of determining a treatment regime for a cancer comprising: (a) providing a sample of the cancer; (b) detecting the expression level of a CCPM family member in said sample; (c) determining the prognosis of the cancer based on the expression level of a CCPM family member; and (d) determining the treatment regime according to the prognosis.

In still further aspects, the invention includes a device for detecting a CCPM, comprising: a substrate having a CCPM capture reagent thereon; and a detector associated with said substrate, said detector capable of detecting a CCPM associated with said capture reagent. Additional aspects include kits for detecting cancer, comprising: a substrate; a CCPM capture reagent; and instructions for use. Yet further aspects of the invention include a method for detecting a CCPM using qPCR, comprising: a forward primer specific for said CCPM; a reverse primer specific for said CCPM; PCR reagents; a reaction vial; and instructions for use.

Additional aspects of this invention comprise a kit for detecting the presence of a CCPM polypeptide or peptide, comprising: a substrate having a capture agent for said CCPM polypeptide or peptide; an antibody specific for said CCPM polypeptide or peptide; a reagent capable of labeling bound antibody for said CCPM polypeptide or peptide; and instructions for use.

In yet further aspects, this invention includes a method for determining the prognosis of colorectal cancer, comprising the steps of: providing a tumour sample from a patient suspected of having colorectal cancer; measuring the presence of a CCPM polypeptide using an ELISA method. In specific aspects of this invention the CCPM of the invention is selected from the markers set forth in Tables 1, 2, 5, or 9. In still further aspects, the CCPM is included in a prognostic signature selected from the signatures set forth in Tables 3, 4, 8A, 8B, or 10.

EXAMPLES

The examples described herein are for purposes of illustrating embodiments of the invention. Other embodiments, methods, and types of analyses are within the scope of persons of ordinary skill in the molecular diagnostic arts and need not be described in detail herein. Other embodiments within the scope of the art are considered to be part of this invention.

Example 1: Patients and methods

Two cohorts of patients were included in this study, one set from New Zealand (NZ) and the second from Germany (DE). The NZ patients were part of a prospective cohort study that included all disease stages, whereas the DE samples were selected from a tumour bank. Clinical information is shown in Table 6, while Figure 1 summarises the experimental design.

Example 2: Tumour samples

Primary colorectal tumor samples from 149 NZ patients were obtained from patients undergoing surgery at Dunedin Hospital and Auckland Hospital between 1995-2000. Tumor samples were snap frozen in liquid nitrogen. All surgical specimens were reviewed by a single pathologist (H-S Y) and were estimated to contain an average of 85% tumor cells. Among the 149 CRC patients, 12 had metastatic disease at presentation, 35 developed recurrent disease, and 102 were disease-free after a minimum of 5-year follow up.

Primary colorectal tumor samples from DE patients were obtained from patients undergoing surgery at the Surgical Department of the Technical University of Munich between 1995-2001. A group of 55 colorectal carcinoma samples was selected from banked tumours which had been obtained fresh from surgery, snap frozen in liquid nitrogen. The samples were obtained from 11 patients with stage I cancer and 44 patients with stage II cancer. Twenty nine patients were recurrence-free and 26 patients had experienced disease recurrence after a minimum of 5-year follow up.

Tumor content ranged between 70 and 100% with an average of 87%.

Table 6: Clinical characteristics of New Zealand and German colorectal tumours

1. Persisting disease

Example 3: RNA Extraction and target labeling

NZ tumours: Tumours were homogenized and RNA was extracted using Tri-Reagent (Progenz, Auckland, New Zealand). The RNA was then further purified using RNeasy mini column (Qiagen, Victoria, Australia). Ten micrograms of RNA was labelled with Cy5 dUTP using the indirect amino-allyl cDNA labelling protocol. A reference RNA from 12 different cell lines was labelled with Cy3 dUTP. The fluorescently labelled cDNA were purified using a QiaQuick PCR purification kit (Qiagen, Victoria, Australia) according to the manufacturer's protocol.

DE tumours: Tumours were homogenized and RNA was isolated using RNeasy Mini Kit (Qiagen, Hilden, Germany). cRNA preparation was performed as described previously (9), purified on RNeasy Columns (Qiagen, Hilden, Germany), and eluted in 55 μl of water. Fifteen micrograms of cRNA was fragmented for 35 minutes at 95°C and double stranded cDNA was synthesized with an oligo-dT-T7 primer (Eurogentec, Köln, Germany) and transcribed using the Promega RiboMax T7-kit (Promega, Madison, WI) and Biotin-NTP labelling mix (Loxo, Dossenheim, Germany).

Example 4: Microarray experiments

NZ tumours: Hybridisation of the labelled target cDNA was performed using MWG Human 30K Array oligonucleotides printed on epoxy coated slides. Slides were blocked with 1% BSA and the hybridisation was done in pre-hybridisation buffer at 42°C for at least 12 hours followed by a high stringency wash. Slides were scanned with a GenePix Microarray Scanner and data was analyzed using GenePix Pro 4.1 Microarray Acquisition and Analysis Software (Axon, CA).

DE tumours: cRNA was mixed with B2-control oligonucleotide (Affymetrix, Santa Clara, CA), eukaryotic hybridization controls (Affymetrix, Santa Clara, CA), herring sperm (Promega, Madison, WI), buffer and BSA to a final volume of 300 μl and hybridized to one microarray chip (Affymetrix, Santa Clara, CA) for 16 hours at 45°C. Washing steps and incubation with streptavidin (Roche, Mannheim, Germany), biotinylated goat-anti streptavidin antibody (Serva, Heidelberg, Germany), goat-IgG (Sigma, Taufkirchen, Germany), and streptavidin-phycoerythrin (Molecular Probes, Leiden, Netherlands) were performed in an Affymetrix Fluidics Station according to the manufacturer's protocol. The arrays were then scanned with a HP-argon-ion laser confocal microscope and the digitized image data were processed using the Affymetrix® Microarray Suite 5.0 Software.

Example 5: Data pre-processing

NZ data: Data pre-processing and normalization was performed in the R computing environment (10). A log₂ transformation was applied to the foreground intensities from each channel of each array. Data from each spot was used on a per array basis to perform print-tip loess normalization via the limma package (11) from the Bioconductor suite of analysis tools (12). Scale normalization (13) was then used to standardize the distribution of log intensity ratios across arrays. Post-normalization cluster analysis revealed the presence of a gene-specific print-run effect in the data. Analysis of variance (ANOVA) normalization was used to estimate and remove print run effects from the data for each gene. Replicate array data was available for 46 of the 149 samples. Cluster analysis of the entire data set indicated that the duplicate arrays clustered well with each other, suggesting internal consistency of the array platform. Genes with low intensity, large differences between replicates (mean log₂ difference between duplicates higher than 0.5), and unknown proteins were removed from the data set. After the initial normalization procedure, a subset of 10,318 genes was chosen for further analysis.
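The log transformation and scale normalization steps can be sketched as follows. This is a simplified illustration and not a reimplementation of the limma routines: it assumes that scale normalization means rescaling each array's log-ratios to a common median absolute deviation (MAD), it omits the print-tip loess and ANOVA steps entirely, and the intensities and function names are hypothetical.

```python
import math
from statistics import median

def log_ratios(red, green):
    """Per-spot log2 ratio of the two channel intensities on one array."""
    return [math.log2(r) - math.log2(g) for r, g in zip(red, green)]

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)

def scale_normalise(arrays):
    """Rescale each array's log-ratios so that every array has the same MAD
    (here the geometric mean of the original MADs, which must be non-zero)."""
    mads = [mad(a) for a in arrays]
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * target / m for v in a] for a, m in zip(arrays, mads)]

# Hypothetical log-ratios for two arrays with different spreads.
array_1 = [-1.0, 0.0, 1.0, 2.0, -2.0]
array_2 = [-2.0, 0.0, 2.0, 4.0, -4.0]
normalised = scale_normalise([array_1, array_2])
print(mad(normalised[0]), mad(normalised[1]))
```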

DE data: All Affymetrix U133A GeneChips passed quality control to eliminate scans with abnormal characteristics, that is, abnormally low or high dynamic range, high perfect match saturation, high pixel noise, grid misalignment problems, and low mean signal to noise ratio. Background correction and normalization were performed in the R computing environment (10, 40). Background corrected and normalized expression measures from probe level data (CEL files) were obtained using the robust multi-array average function (14) implemented in the Bioconductor package affy.

Example 6: Prognostic signatures and cross validation

Data analysis was performed using the BRB Array-Tools package (hypertext transfer protocol://linus.nci.nih.gov/BRB-ArrayTools.html). Gene selection was performed using a random variance model t-test. In the DE data, 318 genes were found to be differentially expressed when using a significance threshold of 0.001. As most of the differentially expressed genes exhibited relatively small changes in expression, a condition requiring the mean log₂ fold change between the two classes to be higher than 1.1 was added to the gene selection process for the DE data. Gene-based prognostic signatures were produced using leave one out cross validation (LOOCV) in each of the NZ and DE data sets. To avoid the problem of over-fitting, both the gene selection and signature construction were performed during each LOOCV iteration. After LOOCV, the prediction rate was estimated by the fraction of samples correctly predicted. In order to find a gene set that could make the best prediction for unknown samples, different t-test thresholds using a random variance model were investigated in conjunction with six classification methods: compound covariate classifier (CCP), diagonal linear discriminant analysis (DLD), 3-nearest neighbours (3-NN), 1-nearest neighbours (1-NN), nearest centroid (NC), and support vector machines (SVM).
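The requirement that gene selection be repeated inside every LOOCV iteration can be made concrete with a short sketch. It does not reproduce BRB Array-Tools: genes are ranked here by the absolute difference in class means rather than by a random variance model t-test, a nearest centroid rule stands in for the six classifiers, and the data and function names are hypothetical.

```python
from statistics import mean

def select_genes(samples, labels, n_genes):
    """Rank genes by absolute difference in class means; return the top gene indices."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 1]
        b = [s[g] for s, y in zip(samples, labels) if y == 0]
        scores.append((abs(mean(a) - mean(b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]

def nearest_centroid(samples, labels, genes, query):
    """Predict the class whose centroid (over the selected genes) is closest to the query."""
    best = None
    for cls in (0, 1):
        members = [s for s, y in zip(samples, labels) if y == cls]
        dist = sum((mean([m[g] for m in members]) - query[g]) ** 2 for g in genes)
        if best is None or dist < best[0]:
            best = (dist, cls)
    return best[1]

def loocv_prediction_rate(samples, labels, n_genes):
    """Leave each sample out in turn; select genes and build the rule without it."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_x, train_y, n_genes)  # sees training data only
        if nearest_centroid(train_x, train_y, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical data: rows are samples, columns are three genes; only gene 0 is informative.
samples = [[1.0, 5.0, 2.0], [1.2, 5.1, 2.1], [0.9, 4.9, 1.9],
           [3.0, 5.0, 2.0], [3.1, 5.2, 2.2], [2.9, 4.8, 1.8]]
labels = [0, 0, 0, 1, 1, 1]
rate = loocv_prediction_rate(samples, labels, n_genes=1)
print(rate)
```

If `select_genes` were instead called once on all samples before the loop, the left-out sample would have influenced the gene list, and the resulting prediction rate would be optimistically biased.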

To establish the validity of the NZ and DE prognosis signatures, reciprocal validation was performed, with the NZ signature validated using the DE data set, and vice versa. To test the NZ genes, probes relating to the 22 genes from the NZ signature were identified in the DE data, and LOOCV was used to assess the performance of a signature for the DE samples, based only on these probes. Similarly, probes relating to the 19 genes in the DE signature were identified in the NZ data and LOOCV was used to assess the performance of a signature for the NZ samples. In both cases a significance threshold of 0.999 was used to ensure that all genes were used in each LOOCV iteration. Differences between the platforms (in particular, log-ratio data versus log-intensity data) meant that direct application of a prediction rule across data sets was not feasible. The consequence of this is that only the gene sets, and not the prediction rules used, can be generalized to new samples. The significance of the LOOCV prediction results was calculated by permuting the class labels of the samples and finding the proportion of times that the permuted data resulted in a higher LOOCV prediction rate than that obtained for the unpermuted data. All permutation analysis involved 2000 permutations, with small P-values indicating that prediction results were unlikely to be due to chance.
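The permutation procedure described above can be sketched as follows. The one-gene nearest-class-mean classifier, the toy expression values, the random seed and the use of 200 rather than 2000 permutations are assumptions made to keep the example small; the comparison is strict ("higher than"), following the wording above.

```python
import random

def loocv_rate(values, labels):
    """LOOCV prediction rate of a one-gene nearest-class-mean classifier."""
    correct = 0
    for i, (x, y) in enumerate(zip(values, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(values, labels)) if j != i]
        means = {}
        for cls in set(l for _, l in rest):
            members = [v for v, l in rest if l == cls]
            means[cls] = sum(members) / len(members)
        predicted = min(means, key=lambda c: abs(means[c] - x))
        correct += predicted == y
    return correct / len(values)

def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Proportion of label permutations giving a higher LOOCV rate than the real labels."""
    rng = random.Random(seed)
    observed = loocv_rate(values, labels)
    shuffled = list(labels)
    higher = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loocv_rate(values, shuffled) > observed:
            higher += 1
    return observed, higher / n_permutations

# Hypothetical expression values for one gene in eight tumours.
values = [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = permutation_p_value(values, labels, n_permutations=200)
print(observed, p_value)
```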

Example 7: Survival analysis

Kaplan-Meier survival analysis for censored data was performed using the survival package within the R computing environment. Survival was defined to be "disease free survival" post surgery. For each analysis, survival curves were constructed, and the log-rank test (15) was used to assess the presence of significant differences between the curves for the two groups in question. Censoring was taken into account for both the NZ and DE data sets. For the disease-free survival data, right censoring prior to five years could only occur for non-recurrent patients as a result of either death, or the last clinical follow-up occurring at less than five years. Odds ratios and confidence intervals were produced using the epitools package for R.
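For readers unfamiliar with the Kaplan-Meier estimator, a minimal sketch of the product-limit calculation with right censoring is given here. It is not the R survival package; the follow-up times and the function name are hypothetical, and the log-rank test is not included.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time at which a recurrence occurred.
    events[i] is True for a recurrence and False for a right-censored observation."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        recurrences = sum(1 for ti, e in zip(times, events) if ti == t and e)
        censored = sum(1 for ti, e in zip(times, events) if ti == t and not e)
        if recurrences:
            survival *= 1 - recurrences / at_risk
            curve.append((t, survival))
        at_risk -= recurrences + censored  # both leave the risk set after time t
    return curve

# Hypothetical disease-free follow-up (years) for six patients.
times = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, False]
curve = kaplan_meier(times, events)
print(curve)
```

Patients censored at a given time are counted as at risk at that time and removed afterwards, which is the usual convention.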

Example 8: Identification of markers co-expressed with chemokine ligands

Genes in the DE data which had a Pearson correlation coefficient greater than 0.75 with at least one of the four chemokines appearing in the predictor in the non-relapse group were selected for ontology analysis. Ontology was performed using DAVID (hypertext transfer protocol://apps1.niaid.nih.gov/david/).
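The correlation filter can be sketched as below. The expression values are invented for illustration, `gene_A` and `gene_B` are hypothetical placeholders, and only two of the chemokines are used as reference genes to keep the example short.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed(expression, reference_genes, threshold=0.75):
    """Genes whose correlation with at least one reference gene exceeds the threshold."""
    hits = []
    for gene, profile in expression.items():
        if gene in reference_genes:
            continue
        if any(pearson(profile, expression[ref]) > threshold for ref in reference_genes):
            hits.append(gene)
    return sorted(hits)

# Invented expression profiles across five non-relapse samples.
expression = {
    "CXCL9": [1.0, 2.0, 3.0, 4.0, 5.0],
    "CXCL10": [2.0, 1.0, 4.0, 3.0, 5.0],
    "gene_A": [2.0, 4.0, 6.0, 8.0, 10.5],  # tracks CXCL9 closely
    "gene_B": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to both
}
selected = coexpressed(expression, ["CXCL9", "CXCL10"])
print(selected)
```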

Example 9: Results and analysis

To identify robust prognostic signatures to predict disease relapse for CRC, two independent sets of samples from NZ and DE were used to generate array expression data sets from separate series of primary tumours with clinical follow-up of five or more years. After normalization, each data set was analyzed using the same statistical methods to generate a prognostic signature, which was then validated on the alternate series of patients. As such, the DE prognostic signature was validated on the NZ data set and the NZ prognostic signature was validated on the DE data set.

Example 10: Exhaustive Identification of differentially expressed markers

DE Data Set: The BRB Array Tools class comparison procedure was used to detect probes exhibiting statistically significant differences in average intensity between relapse and non-relapse samples. The RVM (random variance model) was again used to produce p-values for each probe in the data set. In this second round, a total of 325 probes were found to be significantly differentially expressed between the two sample classes using an arbitrary significance threshold of 0.05. Note this selection of genes did not apply any fold-change threshold, and used a significance cut off of 0.05, rather than the threshold of 0.001 that was used in Example 6. The purpose of this less stringent threshold (p=0.05 instead of p=0.001) was to put forward a larger number of genes for construction of the second round of signatures (see Example 17). These probes represent 270 unique genes (Table 1 and Table 2). Explicitly, the test for significance (random variance model) comprises the following: generating a test statistic for each gene which was identical to that of a standard two sample t-test (45) except that the estimate of the pooled variance was obtained by representing the variance structure across all genes as an F-distribution, and then using the parameters, a and b, of this distribution (obtained via maximization of the empirical likelihood function) to form the following estimate of the pooled variance:

$$\tilde{s}^2 = \frac{(n-2)\,s_{pooled}^2 + 2b^{-1}}{(n-2) + 2a}$$

where $\tilde{s}^2$ is the new estimate of the pooled variance, $s_{pooled}^2$ is the standard estimate of pooled variance (45), n is the number of samples, and a and b are the parameters of the F-distribution (46). Based on the t-statistic formed, a t-distribution with n-2+2a degrees of freedom was used to obtain a p-value for each gene. To adjust for multiple hypothesis testing, the False Discovery Rate controlling procedure of Benjamini and Hochberg (7) was used to produce adjusted p-values for each gene. A gene was considered to have undergone significant differential expression if its adjusted p-value was less than 0.05.
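Two pieces of this calculation, the random variance model estimate of the pooled variance and the Benjamini and Hochberg adjustment, are simple enough to sketch directly. The values of a, b and the p-values below are hypothetical; estimation of a and b and the t-distribution p-value itself are not shown.

```python
def rvm_pooled_variance(s2_pooled, n, a, b):
    """Random variance model estimate: ((n - 2) * s2_pooled + 2 / b) / ((n - 2) + 2 * a)."""
    return ((n - 2) * s2_pooled + 2 / b) / ((n - 2) + 2 * a)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original gene order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical values: 55 samples, F-distribution parameters a = 2 and b = 1.
shrunk = rvm_pooled_variance(s2_pooled=0.1, n=55, a=2, b=1)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [p < 0.05 for p in adjusted]
print(shrunk, adjusted, significant)
```

With these made-up inputs the small gene-specific variance of 0.1 is pulled upwards towards the variance suggested by the other genes, which is the intended stabilising effect when n is small.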

Example 11: Identification of correlated markers

In order to identify additional genes that can be used as prognostic predictors, correlation analysis was carried out using the R statistical computing software package. This analysis revealed 167 probes that had a Pearson correlation coefficient (40, 44, 45) of at least 0.8. Of these probes, 51 were already present in the set of 325 significantly differentially expressed probes, while the remaining 116 were reported as non-significant (using a 0.05 threshold for the FDR, or "false-discovery rate" (47) controlling procedure, the RVM, or random variance model). These 116 probes represent 111 distinct genes (Table 2).

Example 12: Construction of prognostic signatures

The NZ data set was generated using oligonucleotide printed microarrays. Six different signatures were constructed, with a support vector machine (SVM) using a gene selection threshold of 0.0008 yielding the highest LOOCV prediction rate, and producing a 22-gene signature (77% prediction rate, 53% sensitivity, 88% specificity; p=0.002, Tables 7, 8A, and 8B). For Tables 8A and 8B, the gene descriptions are shown in Tables 3 and 4, respectively.

Table 7: Construction of prognostic signatures

22 gene NZ signature tested on German data

| Data set | Prediction rate | Sensitivity | Specificity | P value* | Odds ratio |
|---|---|---|---|---|---|
| NZ data (training; SVM) | 0.77 (0.66, 0.86)§ | 0.53 (0.33, 0.73) | 0.88 (0.77, 0.95) | 0.002 | 8.4 (3.5, 21.4) |
| NZ data after removal of 4 genes not found in German data (training; SVM) | 0.72 | 0.38 | 0.87 | 0.011 | |
| German data (test; SVM) | 0.71 (0.51, 0.86) | 0.62 (0.32, 0.86) | 0.79 (0.52, 0.95) | 0.002 | 5.9 (1.6, 24.5) |

19 gene German signature tested on NZ data

| Data set | Prediction rate | Sensitivity | Specificity | P value* | Odds ratio |
|---|---|---|---|---|---|
| German data (training; 3-NN) | 0.84 (0.65, 0.95) | 0.85 | 0.83 | < 0.0001 | 24.1 (5.3, 144.7) |
| German data after removal of 5 genes not found in NZ data (training; 3-NN) | 0.67 | 0.65 | 0.66 | 0.046 | |
| NZ data (test; 3-NN) | 0.67 (0.55, 0.78) | 0.42 (0.22, 0.64) | 0.78 (0.65, 0.89) | 0.045 | 2.6 (1.2, 6.0) |

SVM: support vector machine signature; 3-NN: 3 nearest neighbour signature.

§ 95% confidence interval

* P values were calculated from 2,000 permutations of class labels
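The quantities reported in Table 7 can be illustrated with a short sketch. Labels are coded 1 for recurrent and 0 for non-recurrent. The function names are hypothetical, and the permutation scheme shown (shuffling the true labels against a fixed set of predictions) is a simplifying assumption, since the text does not say whether the classifier was rebuilt for each of the 2,000 permutations.

```python
import random

def classification_metrics(actual, predicted):
    """Prediction rate, sensitivity, specificity and odds ratio from 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {
        "prediction_rate": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Odds ratio of the 2x2 table; infinite if an off-diagonal cell is empty.
        "odds_ratio": (tp * tn) / (fp * fn) if fp and fn else float("inf"),
    }

def permutation_p_value(actual, predicted, n_permutations=2000, seed=0):
    """Share of label permutations whose prediction rate is at least the observed one."""
    rng = random.Random(seed)
    n = len(actual)
    observed = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    shuffled = list(actual)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        rate = sum(1 for a, p in zip(shuffled, predicted) if a == p) / n
        if rate >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```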

Table 8A: NZ prognostic signature

New Zealand 22-gene prognostic signature

Table 8B: DE prognostic signature

German 19-gene prognostic signature

The NZ signature had an odds ratio for disease recurrence in the NZ patients of 8.4 (95% CI 3.5-21.4).

The DE data set was generated using Affymetrix arrays, resulting in a 19-gene (22-probe), 3-nearest neighbour (3-NN) signature (selection threshold 0.002, log₂ fold change >1.1, 84% classification rate, 85% sensitivity, 83% specificity, p<0.0001, Tables 3, 4, 7). The DE signature had an odds ratio for recurrence in the DE patients of 24.1 (95% CI 5.3-144.7). Using Kaplan-Meier analysis, disease-free survival in NZ and DE patients was significantly different for those predicted to recur or not recur (NZ signature, p<0.0001, Fig. 2A; DE signature, p<0.0001, Fig. 2B).

Example 13: External validation of the NZ and DE prognostic signatures

To validate the NZ signature, the 22 genes were used to construct a SVM signature in the DE data set by LOOCV. A prediction rate of 71% was achieved, which was highly significant (p=0.002; Table 7). The odds ratio for recurrence in DE patients, using the NZ signature, was 5.9 (95% CI 1.6-24.5). We surmise that the reduction in prediction rate, from 77% in NZ patients to 71% in DE patients (Table 7), was due to four genes from the NZ signature not being present in the DE data. Disease-free survival for DE patients predicted to relapse, according to the NZ signature, was significantly lower than disease-free survival for patients predicted not to relapse (p=0.0049, Fig. 2C).
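The disease-free survival comparisons reported here rest on Kaplan-Meier estimates for each predicted group. A minimal sketch of the estimator is given below, assuming follow-up times and an event indicator (1 for recurrence, 0 for censored) per patient. The function name is hypothetical, and the significance test between the two curves is not shown.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each patient
    events : 1 if recurrence was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # recurrences at time t
        leaving = 0   # all patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

One curve would be computed for patients predicted to relapse and one for patients predicted not to relapse, and the two curves then compared.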

The DE signature was next validated by using the 19 genes to construct a 3-NN signature in the NZ data set by LOOCV. The prediction rate of 67% was again significant (p=0.046; Table 7), confirming the validity of the DE signature. The odds ratio for recurrence in NZ patients, using the DE signature, was 2.6 (95% CI 1.2-6.0). We consider that the reduction of the prediction rate was due to five genes from the DE signature not being present in the NZ data set. This was confirmed when removal of these five genes from the DE data set resulted in a reduction of the LOOCV prediction rate from 84% to 67% (Table 7). Disease-free survival for NZ patients predicted to relapse, according to the DE signature, was significantly lower than disease-free survival for patients predicted not to relapse (p=0.029; Fig. 2D).

Example 14: Comparison of NZ and DE prognostic signatures with current staging system

Significant differences in disease-free survival between patients predicted to relapse or not relapse were also observed within the same clinico-pathological stage (Figure 3). When patient predictions were stratified according to disease stage, the NZ signature was able to identify patients who were more likely to recur in both Stage II (p=0.0013, Fig. 3A), and Stage III subgroups (p=0.0295, Fig. 3A). This was mirrored to a lesser extent when the DE signature was applied to the NZ data set, where the difference was only observed for Stage III patients (p=0.0491, Fig. 3B). Again, the decreased predictive accuracy of the DE signature was likely due to the absence of five genes from the NZ data that decreased the LOOCV prediction rate.

Example 15: Genes in signatures are related to CRC disease progression

A number of genes in the NZ signature (Table 3) including G3BP2 (16), RBMS1 (17), HMMR (18), UBE2L3 (19), GNAS (20), RNASE2 (21) and ABCC9 (22) have all been reported to be involved in cancer progression, while RBMS1 (23), EIF3S7 (24) and GTF3C5 (25) are involved in transcription or translation. PBK is a protein kinase, which is involved in the process of mitosis (26), and the only gene common to the NZ and DE signatures. Eleven of 19 genes in the DE signature (Table 4) are involved in the immune response including 4 chemokine ligands (CXCL9, CXCL10, CXCL11, CXCL13; 27), PBK (28), INDO (29), GBP1 (30), GZMB (31), KITLG (32), and two receptors of the tumor necrosis factor family (TNFRSF11A, FAS; 33).

Eighty-six genes were found to be moderately correlated (Pearson correlation coefficient > 0.75) with at least one of the four chemokine ligands in the DE data. Ontology analysis found that 39 of these 65 genes were in the category of immune response (p< 10^" ). This result suggests a key role for the host immune response in determining CRC recurrence.

Example 16: Discussion of NZ and DE Prognostic Signatures

It has been shown that the two different prognostic signatures can be used to improve the current prognosis of colorectal cancer.

For the DE signature, it was surprising and unexpected that the stage I/II samples could be used to predict stage III outcome. It was also surprising that many genes associated with recurrent disease are related to the immune response. The immune response has an important role in the progression of different cancers and T-lymphocyte infiltration in CRC patients is an indicator of good prognosis (36-38). All eleven immune response genes (Table 5) were down-regulated in recurrent patients, which would be unexpected based on known biological mechanisms.

To further confirm these results, 4 chemokine genes were chosen for further analysis. Chemokine ligands not only reflect the activity of the immune system and mediate leukocyte recruitment but also are involved in chemotaxis, cell adhesion and motility, and angiogenesis (36). To investigate the role of the immune response genes, 86 genes co-expressed with the chemokine ligands were identified. Almost half of these genes had a Gene Ontology classification within the "immune response" category, suggesting that the primary function of these genes in the recurrence process is the modulation of the immune response. Furthermore, CD4+ and CD8+ T cell antigens (CD8A, CD3, PRP1, TRA@, TRB@) or functionally related antigens, for example, major histocompatibility molecules, interferon gamma induced proteins, and IL2RB, were found in the co-expressed gene list. The activation of tumor specific CD4+ T cells and CD8+ T cells has been shown to result in tumour rejection in a mouse colorectal cancer model (37). Collectively, these findings suggest that the lymphocytes form part of a tumor-specific host response involved in minimising the spread of cells from the primary tumour.

Example 17: Selection of additional prognostic signatures

The performance of the two prognostic signatures described above was excellent in terms of cross-validation between the two data sets. Further studies were carried out, using a purely statistical approach, to develop a range of signatures, in addition to the aforementioned, that would also predict prognosis for other data sets. One of the additional goals of these studies was to ensure that the method used to normalize the microarray data (robust multi-array average) was not exerting undue influence on the choice of genes.

Figure 4 shows the classification rates obtained from signatures of varying lengths. The classification rate is the proportion of correct relapse predictions (expressed as a percentage of total predictions), i.e., the proportion of samples correctly classified. The classification rates were determined using 11-fold cross-validation. For this cross-validation, a randomly selected stratified sample (i.e. same ratio of recurrent to non-recurrent tumours as the full data set) was removed as a validation set prior to gene selection and model construction (using the training set of the remaining 50 samples). Cross-validation was then repeated a further ten times so that all 55 samples appeared in one validation set each. This 11-fold cross-validation process was repeated as 10 replicates, and the results plotted in Figure 4 and Figure 5. The classification rates shown were corrected using bootstrap bias correction (43), to give the expected classification rates for the signatures to be applied to another data set. From this analysis, it was ascertained that shorter signatures produced the best classification rate. In addition, analysis of the genes that most frequently appeared in classifiers showed that the discriminatory power was mostly due to the effectiveness of two genes: FAS and ME2. This is illustrated most clearly by Figure 5, which shows the effectiveness of the signatures once the two genes FAS and ME2 were removed from the data set. For more detail see the legend to Figure 5.
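The stratified 11-fold scheme can be sketched as below. The function names are hypothetical, and the class sizes used in any call are up to the caller, since the text gives only the total of 55 samples. The point of the sketch is that each validation fold is set aside before gene selection, so gene selection and model construction see only the training indices.

```python
import random

def stratified_folds(labels, n_folds=11, seed=0):
    """Split sample indices into folds with near-equal class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal samples round-robin so every fold gets its share of each class.
        for index in members:
            folds[position % n_folds].append(index)
            position += 1
    return folds

def cross_validation_splits(labels, n_folds=11, seed=0):
    """Yield (training_indices, validation_indices) for each fold."""
    for validation in stratified_folds(labels, n_folds, seed):
        held_out = set(validation)
        training = [i for i in range(len(labels)) if i not in held_out]
        yield training, validation
```

Repeating the whole procedure with ten different seeds would correspond to the 10 replicates described above.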

The effect of normalization on feature selection was thoroughly investigated by generating gene lists from 1000 stratified sub-samples of the original set of tumours, each time removing 5 samples (i.e. 1/11 of the total number of samples) from the data set (this is effectively the same as performing 11-fold cross-validation). A tally was made of the number of times each gene appeared in the "top-n" gene lists (i.e., top 10, top 20, top 100, and top 325). This value was termed the "top count". Top counts were generated using three different normalization methods (40) (Figure 6), and three different filtering statistics (Figure 7). There was substantial correlation in the top count between normalization schemes and filtering statistics (41, 42) used. Thus, while normalization and feature selection methods were important, many genes appeared in the gene lists independently of the method used to pre-process the data. This indicates that the choice of normalization method had only a minimal effect on which genes were selected for use in signature construction. The top count, summed across all normalization methods and statistics, was found to be a robust measure of a gene's differential expression between recurrent and non-recurrent tumours.
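The "top count" tally can be illustrated as follows, assuming each sub-sample has already produced a gene list ranked from most to least significant. The function names are hypothetical.

```python
from collections import Counter

def top_counts(ranked_lists, n):
    """Count how often each gene appears among the first n entries
    of the ranked gene lists (one list per sub-sample)."""
    counts = Counter()
    for ranked in ranked_lists:
        counts.update(ranked[:n])
    return counts

def summed_top_counts(counts_per_method):
    """Sum top counts across normalization methods and filtering statistics."""
    total = Counter()
    for counts in counts_per_method:
        total.update(counts)
    return total
```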

Genes from the gene lists (see Table 1 and Table 2) were used to generate signatures by random sampling. The generation of samples was weighted, such that genes with higher "top count" were more likely to be selected. A range of signatures was generated, using between 2 and 55 Affymetrix probes. Signatures were selected if they exhibited >80% median classification rate, using three methods of classifiers: k-nearest neighbours, with k=1; k-nearest neighbours, with k=3; and support vector machines, with a linear kernel function, and using leave-one-out cross-validation.
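A minimal sketch of the weighted random sampling is given below. It assumes that every candidate gene has a positive top count and that the selection probability is directly proportional to the top count, which is one possible weighting since the text does not specify the exact scheme. The function name is hypothetical.

```python
import random

def sample_signature(top_count, size, rng):
    """Draw `size` distinct genes, favouring genes with a higher top count.

    top_count : dict mapping gene name to a positive top count
    rng       : a random.Random instance, so that draws are reproducible
    """
    pool = dict(top_count)
    signature = []
    for _ in range(size):
        genes = list(pool)
        weights = [pool[g] for g in genes]
        chosen = rng.choices(genes, weights=weights, k=1)[0]
        signature.append(chosen)
        del pool[chosen]  # sample without replacement
    return signature
```

Each sampled signature would then be scored by leave-one-out cross-validation and kept only if it met the >80% median classification rate criterion.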

On average, longer prognostic signatures were preferred over shorter signatures in terms of ability to predict prognosis for new data sets (Figure 4 and Figure 5). The genes FAS and ME2 were also important (discussed above). These two facts were used, along with the fact that short signatures that do not contain either FAS or ME2 perform less effectively, to select candidate signatures as shown in Table 9, below. Signatures were selected (from the pool of randomly generated signatures) if they exhibited >80% median classification rate (using three methods of classifiers: k-nearest neighbours, with k=1; k-nearest neighbours, with k=3; and support vector machines, with a linear kernel function), using leave-one-out cross-validation.

In addition, because, on average, longer signatures (>10 genes/signature) tended to perform better, we selected signatures with 20 or more genes/signature from a pool of signatures with 30 or more probes/signature. It is expected that these signatures (Table 10) will perform with a classification rate of around 70% when applied to other data sets, on the basis of the results shown in Figures 4 and 5. It was found that all of the signatures generated in this way contained ME2, and all but one contained FAS, which may be due to the importance of these genes in providing prediction of prognosis. It was noted that the high classification rate obtained using this approach on the in-house data set did not necessarily mean that these signatures would be expected to perform better than those set forth in Example 12 on other data sets. Rather, the purpose was to produce a range of signatures expected to apply to other data sets at least as well as the previous signatures. The markers comprising the prognostic signatures are set forth in Table 9.

Table 9: Additional Prognostic signatures (note SVM=support vector machine, 3NN=3 nearest neighbours, 1NN=1 nearest neighbour, Sens=sensitivity, Spec=specificity, for prediction of recurrence)


Example 20: Specific Application of Prediction Methods

In selection of the gene signatures described here, two different statistical methods were used to characterise the signatures: k-nearest neighbours, and support vector machines. These methods are provided as packages to the R statistical software system (ref), through the packages class (ref) and e1071 (ref). The signatures described in this document were tested as follows. In both cases, the data used to develop the prediction models for a given signature were the gene expression values (raw normalised intensities from the Affymetrix array data) for the probes corresponding to genes that comprise that signature, across both recurrent and non-recurrent samples:

• For k-nearest neighbours, we used leave-one-out cross-validation with k=1 and k=3 to obtain the sensitivity (proportion of positive, i.e. recurrent, samples correctly classified) and specificity (proportion of negative, i.e. non-recurrent, samples correctly classified) described in Table 9.

• The dataset was used to generate leave-one-out cross-validation sensitivity and specificity data using the following support vector machine parameters: the support vector machine models were generated using a linear kernel, and all other parameters used were the default values obtained from the svm function of the e1071 package.
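The k-nearest neighbours procedure above can be sketched in plain Python as follows. This is an illustrative stand-in for the R class package, not a reproduction of it: the function names are hypothetical, Euclidean distance is assumed, and tie-breaking may differ from the R implementation. Labels are coded 1 for recurrent and 0 for non-recurrent.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_knn(samples, labels, k):
    """Leave-one-out cross-validation: predict each sample from all the others."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(knn_predict(train_x, train_y, samples[i], k))
    return predictions

def sensitivity_specificity(labels, predictions):
    """Sensitivity and specificity, with 1 = recurrent and 0 = non-recurrent."""
    positives = [p for a, p in zip(labels, predictions) if a == 1]
    negatives = [p for a, p in zip(labels, predictions) if a == 0]
    sensitivity = sum(1 for p in positives if p == 1) / len(positives)
    specificity = sum(1 for p in negatives if p == 0) / len(negatives)
    return sensitivity, specificity
```

Each row of `samples` would hold the expression values of the signature's genes for one tumour. The support vector machine case follows the same leave-one-out loop with the classifier swapped.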

Note that the genes comprising the signatures were themselves obtained from the list of significantly differentially expressed probes, and from the list of genes which were found to correlate with genes from the NZ 22-gene signature. In some cases there was more than one significant (or correlated) probe per gene. In these cases, the prediction models used the median intensity data across all significant probes (i.e. those in the significant probe list, see Table 1) for that gene.
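Collapsing several probes to one value per gene, as described above, could look like the following sketch. The function name, the probe identifiers and the probe-to-gene mapping are hypothetical.

```python
from statistics import median

def gene_level_intensities(probe_values, probe_to_gene):
    """Median intensity per sample across all probes mapped to the same gene.

    probe_values  : dict of probe id -> list of intensities (one per sample)
    probe_to_gene : dict of probe id -> gene symbol
    """
    rows_by_gene = {}
    for probe, values in probe_values.items():
        rows_by_gene.setdefault(probe_to_gene[probe], []).append(values)
    # zip(*rows) walks the samples column by column.
    return {gene: [median(column) for column in zip(*rows)]
            for gene, rows in rows_by_gene.items()}
```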

References

1. Arnold CN, Goel A, Blum HE, Richard Boland C. Molecular pathogenesis of colorectal cancer. Cancer 2005;104:2035-47.

2. Anwar S, Frayling IM, Scott NA, Carlson GL. Systematic review of genetic influences on the prognosis of colorectal cancer. Br J Surg 2004;91:1275-91.

3. Wang Y, Jatkoe T, Zhang Y, et al. Gene expression profiles and molecular markers to predict recurrence of Dukes' B colon cancer. J Clin Oncol 2004;22:1564-71.

4. Eschrich S, Yang I, Bloom G, et al. Molecular staging for survival prediction of colorectal cancer patients. J Clin Oncol 2005;23:3526-35.

5. Barrier A, Lemoine A, Boelle PY, et al. Colon cancer prognosis prediction by gene expression profiling. Oncogene 2005;24:6155-64.

6. Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol 2005;23:7332-41.

7. Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005;365:488-92.

8. Marshall E. Getting the noise out of gene arrays. Science 2004;306:630-31.

9. Birkenkamp-Demtroder K, Christensen LL, Olesen SH, et al. Gene expression in colorectal cancer. Cancer Res 2002;62:4352-63.

10. Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996;5:299-314.

11. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004;3: Article 3.

12. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004;5:R80.

13. Smyth GK, Speed TP. Normalization of cDNA microarray data. In: Carter D, ed. METHODS: Selecting Candidate Genes from DNA Array Screens: Application to Neuroscience. Vol. 31; 2003:265-73.

14. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4:249-64.

15. Harrington DP, Fleming TR. A class of rank test procedures for censored survival data. Biometrika 1982;69:553-66.

16. Barnes CJ, Li F, Mandal M, Yang Z, Sahin AA, Kumar R. Heregulin induces expression, ATPase activity, and nuclear localization of G3BP, a Ras signaling component, in human breast tumors. Cancer Res 2002;62:1251-55.

17. Niki T, Izumi S, Saegusa Y, et al. MSSP promotes ras/myc cooperative cell transforming activity by binding to c-Myc. Genes Cells 2000;5:127-41.

18. Rein DT, Roehrig K, Schondorf T, et al. Expression of the hyaluronan receptor RHAMM in endometrial carcinomas suggests a role in tumor progression and metastasis. J Cancer Res Clin Oncol 2003;129:161-64.

19. Fernandez P, Carretero J, Medina PP, et al. Distinctive gene expression of human lung adenocarcinomas carrying LKB1 mutations. Oncogene 2004;23:5084-91.

20. Frey UH, Eisenhardt A, Lummen G, et al. The T393C polymorphism of the G alpha s gene (GNAS1) is a novel prognostic marker in bladder cancer. Cancer Epidemiol Biomarkers Prev 2005;14:871-77.

21. Niini T, Vettenranta K, Hollmen J, et al. Expression of myeloid-specific genes in childhood acute lymphoblastic leukemia - a cDNA array study. Leukemia 2002;16:2213-21.

22. Yasui K, Mihara S, Zhao C, et al. Alteration in copy numbers of genes as a mechanism for acquired drug resistance. Cancer Res 2004;64:1403-10.

23. Nomura J, Matsumoto K, Iguchi-Ariga SM, Ariga H. Positive regulation of Fas gene expression by MSSP and abrogation of Fas-mediated apoptosis induction in MSSP-deficient mice. Exp Cell Res 2005;305:324-32.

24. Mayeur GL, Fraser CS, Peiretti F, Block KL, Hershey JW. Characterization of eIF3k: a newly discovered subunit of mammalian translation initiation factor eIF3. Eur J Biochem 2003;270:4133-39.

25. Hsieh YJ, Wang Z, Kovelman R, Roeder RG. Cloning and characterization of two evolutionarily conserved subunits (TFIIIC102 and TFIIIC63) of human TFIIIC and their involvement in functional interactions with TFIIIB and RNA polymerase III. Mol Cell Biol 1999;19:4944-52.

26. Matsumoto S, Abe Y, Fujibuchi T, et al. Characterization of a MAPKK-like protein kinase TOPK. Biochem Biophys Res Commun 2004;325:997-1004.

27. Dong VM, McDermott DH, Abdi R. Chemokines and diseases. Eur J Dermatol 2003;13:224-30.

28. Abe Y, Matsumoto S, Kito K, Ueda N. Cloning and expression of a novel MAPKK-like protein kinase, lymphokine-activated killer T-cell-originated protein kinase, specifically expressed in the testis and activated lymphoid cells. J Biol Chem 2000;275:21525-31.

29. Logan GJ, Smyth CM, Earl JW, et al. HeLa cells cocultured with peripheral blood lymphocytes acquire an immuno-inhibitory phenotype through up-regulation of indoleamine 2,3-dioxygenase activity. Immunology 2002;105:478-87.

30. Lubeseder-Martellato C, Guenzi E, Jorg A, et al. Guanylate-binding protein-1 expression is selectively induced by inflammatory cytokines and is an activation marker of endothelial cells during inflammatory diseases. Am J Pathol 2002;161:1749-59.

31. Phillips SM, Banerjea A, Feakins R, Li SR, Bustin SA, Dorudi S. Tumor-infiltrating lymphocytes in colorectal cancer with microsatellite instability are activated and cytotoxic. Br J Surg 2004;91:469-75.

32. Oliveira SH, Taub DD, Nagel J, et al. Stem cell factor induces eosinophil activation and degranulation: mediator release and gene array analysis. Blood 2002;100:4291-97.

33. Xanthoulea S, Pasparakis M, Kousteni S, et al. Tumor necrosis factor (TNF) receptor shedding controls thresholds of innate immune activation that balance opposing TNF functions in infectious and inflammatory diseases. J Exp Med 2004;200:367-76.

34. Brennan DJ, O'Brien SL, Fagan A, et al. Application of DNA microarray technology in determining breast cancer prognosis and therapeutic response. Expert Opin Biol Ther 2005;5:1069-83.

35. Canna K, McArdle PA, McMillan DC, et al. The relationship between tumor T-lymphocyte infiltration, the systemic inflammatory response and survival in patients undergoing curative resection for colorectal cancer. Br J Cancer 2005;92:651-54.

36. Rossi D, Zlotnik A. The biology of chemokines and their receptors. Annu Rev Immunol 2000;18:217-42.

37. Miyazaki M, Nakatsura T, Yokomine K, et al. DNA vaccination of HSP105 leads to tumor rejection of colorectal cancer and melanoma in mice through activation of both CD4 T cells and CD8 T cells. Cancer Sci 2005;96:695-705.

38. Ein-Dor L, Kela I, Getz G, Givol D, Domany E. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005;21:171-78.

39. Becker RA, Chambers, JM and Wilks AR The New S Language. Wadsworth & Brooks/Cole 1988.

40. Gentleman R., Carey VJ, Huber W., Irizarry RA, Dudoit S. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer 2005.

41. Bauer DF. Constructing confidence sets using rank statistics. Journal of the American Statistical Association 1972;67:687-690.

42. Lönnstedt I. and Speed TP. Replicated microarray data. Statistica Sinica 2002;12:31-46.

43. Efron, B. and Tibshirani, R. An Introduction to the Bootstrap. Chapman & Hall. 2005

44. Harraway J. Introductory Statistical Methods and the Analysis of Variance. University of Otago Press 1993.

45. McCabe GP, Moore DS Introduction to the Practice of Statistics W.H. Freeman & Co. 2005

46. Casella G, Berger RL Statistical Inference Wadsworth 2001

47. McLachlan GJ, Do K, Ambroise C Analyzing Microarray Gene Expression Data (Wiley Series in Probability and Statistics) 2004

48. Wright GW, Simon RM A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003;19:2448-2455

49. Hastie T, Tibshirani R, Friedman J The Elements of Statistical Learning: Data Mining, Inference and Prediction Springer 2003

50. Neter J, Kutner MH, Wasserman W, Nachtsheim CJ Applied Linear Statistical Models McGraw-Hill/Irwin 1996

51. Venables, WN, Ripley, BD Modern Applied Statistics with S. 4th ed. Springer 2002.

52. Ripley, B. D. Pattern Recognition and Neural Networks Cambridge University Press 1996

53. Cristianini N, Shawe-Taylor J An Introduction to Support Vector Machines (and other kernel-based learning methods) Cambridge University Press 2000

54. Breiman L, Friedman J, Stone CJ, Olshen RA Classification and Regression Trees Chapman & Hall/CRC 1984

55. Good, PI Resampling Methods: A Practical Guide to Data Analysis Birkhauser 1999

Where in the description reference has been made to integers or components having known equivalents, such equivalents are herein incorporated as if individually set forth.

Although the invention has been described by way of example and with reference to possible embodiments thereof, it is to be appreciated that improvements and/or modifications may be made without departing from the scope thereof.

Claims

What is claimed is:

1. A prognostic signature for determining progression of CRC, comprising two or more genes selected from Tables 1 and 2.

2. The signature of claim 1, selected from any one of the signatures in any one of Tables 3, 4 or Table 9.

3. A device for determining prognosis of CRC, comprising: a substrate having one or more locations thereon, each location having two or more oligonucleotides thereon, each oligonucleotide selected from the group of genes from Tables 1 and 2.

4. The device of claim 3, wherein the two or more oligonucleotides are a prognostic signature selected from any one of Tables 3, 4 or Table 9.

5. A method for determining the prognosis of CRC in a patient, comprising the steps of:

(i) determining the expression level of a prognostic signature comprising two or more genes from Tables 1 and 2 in a CRC tumour sample from the patient,

(ii) applying a predictive model, established by applying a predictive method to expression levels of the predictive signature in recurrent and non-recurrent tumour samples,

(iii) establishing a prognosis.

6. The method of claim 5, wherein the signature is selected from any one of Tables 3, 4 or Table 9.

7. The method of claim 5, wherein said predictive method is selected from the group consisting of linear models, support vector machines, neural networks, classification and regression trees, ensemble learning methods, discriminant analysis, nearest neighbor method, Bayesian networks, and independent components analysis.

8. The method of any one of claims 5 to 7, wherein the step of determining the expression level of a prognostic signature is carried out by detecting the expression level of mRNA of each gene.

9. The method of any one of claims 5 to 7, wherein the step of determining the expression level of a prognostic signature is carried out by detecting the expression level of cDNA of each gene.

10. The method of claim 9, wherein the step of determining the expression level of a prognostic signature is carried out using a nucleotide complementary to at least a portion of said cDNA.

11. The method of claim 8, wherein the step of determining the expression level of a prognostic signature is carried out using qPCR method using a forward primer and a reverse primer.

12. The method of claim 8, wherein the step of determining the expression level of a prognostic signature is carried out using a device according to claim 3 or claim 4.

13. The method of any one of claims 5 to 7, wherein the step of determining the expression level of a prognostic signature is carried out by detecting the expression level of the protein of each marker.

14. The method of any one of claims 5 to 7, wherein the step of determining the expression level of a prognostic signature is carried out by detecting the expression level of the peptide of each marker.

15. The method of claim 12 or claim 13, wherein said step of detecting is carried out using an antibody directed against each marker.

16. The method of any one of claims 12 to 14, wherein said step of detecting is carried out using a sandwich-type immunoassay method.

17. The method of any one of claims 12 to 15, wherein said antibody is a monoclonal antibody.

18. The method of any one of claims 12 to 15, wherein said antibody is a polyclonal antiserum.